Abugida Normalizer and Parser for Unicode texts
This paper proposes two libraries that address common and uncommon issues in Unicode-based writing schemes for Indic languages. The first is a normalizer that corrects inconsistencies caused by the encoding scheme (https://pypi.org/project/bnunicodenormalizer/). The second is a grapheme parser for Abugida text (https://pypi.org/project/indicparser/). Both tools are more efficient and effective than previously used tools: we report a 400% increase in speed and significantly better performance on downstream tasks based on language models.
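To make the two tasks concrete, here is a minimal sketch of encoding normalization followed by grapheme splitting for Abugida text. It uses only Python's standard `unicodedata` module and does not reproduce the API or the rules of either library; the function name `split_graphemes` and the clustering heuristic (attach combining marks, and join a consonant that follows a virama) are assumptions made for illustration.

```python
import unicodedata

# Canonical combining class that Unicode assigns to virama (hasanta) signs.
VIRAMA_CCC = 9


def split_graphemes(text):
    """Hypothetical helper: NFC-normalize text, then group it into clusters.

    A character joins the previous cluster when it is a combining mark
    (vowel sign, nasalization mark, virama) or when it is a letter that
    directly follows a virama, which is how conjuncts are encoded.
    """
    # Normalization step: compose sequences such as U+09C7 U+09BE
    # into the single precomposed vowel sign U+09CB.
    text = unicodedata.normalize("NFC", text)

    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category.startswith("M")
        joins_conjunct = (
            bool(clusters)
            and category.startswith("L")
            and unicodedata.combining(clusters[-1][-1]) == VIRAMA_CCC
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Bengali conjunct KA + HASANTA + SSA stays together as one cluster.
print(split_graphemes("\u0995\u09CD\u09B7"))
```

This heuristic ignores cases that a dedicated parser would need to handle, such as zero-width joiners, invalid mark sequences, and language-specific exceptions.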