Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, suffers from frequent omission of diacritic marks in digital text, creating ambiguity and hindering downstream natural language processing (NLP) applications. To address this, researchers have introduced Koshur Diacritizer, a byte-level sequence-to-sequence model for diacritic restoration, detailed in a paper on arXiv.
The Problem of Missing Diacritics
In many languages written in Arabic-derived scripts, diacritic marks indicate vowel sounds or other phonetic distinctions. When omitted, readers and NLP systems must rely on context to disambiguate words. For low-resource languages like Kashmiri, this problem is compounded by limited digital resources. According to the paper by Malik and colleagues, the lack of diacritics creates “ambiguity” that challenges text-to-speech, machine translation, and other applications.
The Koshur Diacritizer Approach
The researchers built Koshur Diacritizer on ByT5-small, a byte-level variant of the T5 transformer architecture that operates directly on UTF-8 bytes rather than on subword tokens. This byte-level design suits languages with complex scripts because it needs no script-specific tokenizer or vocabulary. The framework incorporates:
- Script-aware normalization to standardize input variations.
- Alignment validation to ensure correct pairing of undiacritized and diacritized sentences.
- Skeleton-preserving inference that restores diacritics while keeping the original base-letter sequence intact.
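The alignment and skeleton-preservation ideas above can be illustrated with a minimal Python sketch. It assumes that diacritics are encoded as Unicode combining marks (category `Mn`), which may not cover every Kashmiri vowel sign, and the helper names `strip_diacritics`, `is_aligned`, and `enforce_skeleton` are hypothetical; the paper's actual normalization and inference rules may differ.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return the base-letter skeleton by dropping combining marks (category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def is_aligned(undiacritized: str, diacritized: str) -> bool:
    """Alignment check: both sides must share the same base-letter skeleton."""
    return strip_diacritics(diacritized) == strip_diacritics(undiacritized)


def enforce_skeleton(source: str, prediction: str) -> str:
    """Accept a prediction only if it leaves the base letters untouched.

    Falling back to the unmodified source is an illustrative policy,
    not necessarily what the paper does.
    """
    return prediction if is_aligned(source, prediction) else source


# Arabic-script example written with escapes: KAF, TEH, BEH, plus FATHA (U+064E).
plain = "\u0643\u062a\u0628"
marked = "\u0643\u064e\u062a\u064e\u0628\u064e"

# A byte-level model sees UTF-8 bytes; each of these letters takes two bytes.
plain_bytes = list(plain.encode("utf-8"))
```

Here `enforce_skeleton(plain, marked)` keeps the diacritized prediction, while a prediction that drops or changes a base letter is rejected in favor of the original input.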
Dataset and Training
To train the model, the researchers created a publicly available dataset of 23,700 aligned undiacritized-diacritized Kashmiri sentence pairs. The abstract does not specify the data source or collection methodology, but the release provides a foundation for future work. The model was trained on this dataset and evaluated on a held-out test set.
Performance Evaluation
Experimental results on the test set report a DERm (Diacritic Error Rate modified) of 0.2012 and a Word Error Rate (WER) of 0.2159; both are error rates, so lower is better. Beyond automated metrics, a native Kashmiri linguistic expert evaluated the model’s output, yielding a mean accuracy of 77.5%. The following table summarizes the key performance indicators:
| Metric | Value |
|---|---|
| DERm | 0.2012 |
| WER | 0.2159 |
| Expert accuracy | 77.5% |
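To give a rough sense of how such error rates are computed, the sketch below implements the common textbook forms of diacritic error rate and word error rate for diacritization. It is not the paper's DERm, whose exact modification is not described here. The function names are hypothetical, and the code assumes that diacritics are Unicode combining marks and that reference and prediction share the same base letters and word count.

```python
import unicodedata


def split_marks(text: str) -> list[tuple[str, str]]:
    """Group a string into (base_char, attached_marks) pairs."""
    units: list[tuple[str, str]] = []
    for ch in text:
        if unicodedata.category(ch) == "Mn" and units:
            base, marks = units[-1]
            units[-1] = (base, marks + ch)
        else:
            units.append((ch, ""))
    return units


def diacritic_error_rate(reference: str, prediction: str) -> float:
    """Fraction of non-space base letters whose attached marks differ."""
    ref = [u for u in split_marks(reference) if not u[0].isspace()]
    pred = [u for u in split_marks(prediction) if not u[0].isspace()]
    if [b for b, _ in ref] != [b for b, _ in pred]:
        raise ValueError("reference and prediction have different base letters")
    if not ref:
        return 0.0
    errors = sum(r[1] != p[1] for r, p in zip(ref, pred))
    return errors / len(ref)


def word_error_rate(reference: str, prediction: str) -> float:
    """Fraction of whitespace-separated words that are not an exact match."""
    ref_words, pred_words = reference.split(), prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("reference and prediction have different word counts")
    if not ref_words:
        return 0.0
    errors = sum(r != p for r, p in zip(ref_words, pred_words))
    return errors / len(ref_words)


# Toy example: two words, five base letters; the prediction misses one FATHA.
reference = "\u0643\u064e\u062a\u064e\u0628 \u0628\u064e\u0628"
prediction = "\u0643\u064e\u062a\u0628 \u0628\u064e\u0628"
der = diacritic_error_rate(reference, prediction)  # 1 of 5 letters wrong
wer = word_error_rate(reference, prediction)       # 1 of 2 words wrong
```

On this toy pair a single missing mark yields a letter-level rate of 0.2 but a word-level rate of 0.5, which shows why word-level rates tend to run higher than letter-level ones on the same output.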
Implications for Low-Resource Language NLP
Koshur Diacritizer provides a reproducible baseline for diacritic restoration in Kashmiri and demonstrates the viability of byte-level models for low-resource languages. The public release of the dataset, model, and source code (under a Creative Commons BY-NC-SA 4.0 license) enables other researchers to build on this work. While the immediate application is linguistic, diacritic restoration can improve downstream tasks such as text-to-speech, machine translation, and information retrieval. Extended to other languages, these capabilities may eventually benefit multinational enterprises handling multilingual trade documentation.