Artificial Intelligence #kashmiri #diacritic restoration
Koshur Diacritizer: A Byte-Level Model Restores Diacritics for Kashmiri Language NLP
Researchers have developed Koshur Diacritizer, a byte-level sequence-to-sequence model built on ByT5-small that restores missing diacritic marks in digital Kashmiri text. Trained on 23,700 sentence pairs, the model achieves a DERm of 0.2012 and a word error rate of 0.2159, and native-expert evaluation put its accuracy at 77.5%. The dataset, model, and source code have been publicly released to support research on low-resource languages.
Jun 16, 2026
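For readers unfamiliar with the word error rate quoted in the summary, the sketch below shows the conventional definition: the word-level edit distance between a reference sentence and the model output, divided by the number of reference words. The summary does not say how the authors computed their figure, so this is an assumption about the standard metric and not their evaluation script; the function names `edit_distance` and `word_error_rate` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here, lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace, which may differ from
    the tokenization used in the original evaluation.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Under this definition, a word whose diacritics are restored incorrectly counts as one substitution, so a rate of 0.2159 would correspond to roughly one word-level error for every five reference words.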