Koshur Diacritizer: A Byte-Level Model Restores Diacritics for Kashmiri Language NLP

Researchers have developed Koshur Diacritizer, a byte-level sequence-to-sequence model based on ByT5-small, to restore missing diacritic marks in Kashmiri digital text. Trained on a dataset of 23,700 sentence pairs, the model achieves a DERm of 0.2012 and a word error rate of 0.2159 on the test set, and a mean accuracy of 77.5% in an evaluation by a native Kashmiri linguistic expert. The dataset, model, and source code are publicly released to support low-resource language research.

iGEN Editorial

June 16, 2026


Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, suffers from frequent omission of diacritic marks in digital text, creating ambiguity and hindering downstream natural language processing (NLP) applications. To address this, researchers have introduced Koshur Diacritizer, a byte-level sequence-to-sequence model for diacritic restoration, detailed in a paper on arXiv.

The Problem of Missing Diacritics

In many languages written in Arabic-derived scripts, diacritic marks indicate vowel sounds or other phonetic distinctions. When omitted, readers and NLP systems must rely on context to disambiguate words. For low-resource languages like Kashmiri, this problem is compounded by limited digital resources. According to the paper by Malik and colleagues, the lack of diacritics creates “ambiguity” that challenges text-to-speech, machine translation, and other applications.

The Koshur Diacritizer Approach

The researchers built Koshur Diacritizer on ByT5-small, a byte-level variant of the T5 transformer architecture that operates directly on UTF-8 bytes rather than on subword tokens. This byte-level design suits languages with complex scripts because it avoids the need for a script-specific tokenizer or vocabulary. The framework incorporates:

Script-aware normalization to standardize input variations.
Alignment validation to ensure correct pairing of undiacritized and diacritized sentences.
Skeleton-preserving inference that restores diacritics while keeping the original base-letter sequence intact.
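
To make the byte-level view and the skeleton constraint concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and it assumes that diacritics are exactly the Unicode nonspacing marks (category Mn). That assumption is a simplification, since some Kashmiri vowel signs are encoded as precomposed letters, and the released code may define the skeleton differently.

```python
import unicodedata


def utf8_byte_values(text: str) -> list[int]:
    """Return the raw UTF-8 byte values of a string, i.e. the kind of
    sequence a byte-level model consumes instead of subword tokens."""
    return list(text.encode("utf-8"))


def skeleton(text: str) -> str:
    """Drop nonspacing combining marks and keep everything else.

    Assumption: "diacritic" means Unicode category Mn. Precomposed
    letters are left untouched by this simplified definition.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def preserves_skeleton(source: str, prediction: str) -> bool:
    """Check that a prediction only added or changed combining marks."""
    return skeleton(source) == skeleton(prediction)


if __name__ == "__main__":
    undiacritized = "\u0628\u062A"        # beh, teh
    diacritized = "\u0628\u064E\u062A"    # beh, fatha, teh
    print(utf8_byte_values(diacritized))  # each Arabic code point is 2 bytes
    print(preserves_skeleton(undiacritized, diacritized))
```

A check of this kind can run after decoding: if the model's output changes a base letter, the output can be rejected or repaired so that only diacritics differ from the input.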

Dataset and Training

To train the model, the researchers created a publicly available dataset of 23,700 aligned undiacritized-diacritized Kashmiri sentence pairs. Details on the data source or collection methodology were not specified in the abstract, but the release provides a foundation for future work. The model was trained on this dataset and evaluated on a held-out test set.

Performance Evaluation

Experimental results on the test set report a DERm (Diacritic Error Rate modified) of 0.2012 and a Word Error Rate (WER) of 0.2159. Beyond automated metrics, a native Kashmiri linguistic expert evaluated the model’s output, yielding a mean accuracy of 77.5%. The following table summarizes the key performance indicators:

Metric	Value
DERm	0.2012
WER	0.2159
Expert accuracy	77.5%
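
For readers unfamiliar with these metrics, the sketch below shows one common way diacritic and word error rates are computed in diacritization work: the share of base letters whose attached marks differ from the reference, and the share of words containing at least one such error. This is an illustration only. The article does not define how the paper's modified variant (DERm) is calculated, so the function names and the definition used here are assumptions, and the numbers it produces are not comparable to the reported scores.

```python
import unicodedata


def split_marks(word: str) -> list[tuple[str, str]]:
    """Group a word into (base letter, attached combining marks) pairs.

    Assumption: combining marks are Unicode category Mn characters.
    """
    units: list[tuple[str, str]] = []
    for ch in word:
        if unicodedata.category(ch) == "Mn" and units:
            base, marks = units[-1]
            units[-1] = (base, marks + ch)
        else:
            units.append((ch, ""))
    return units


def error_rates(reference: str, prediction: str) -> tuple[float, float]:
    """Return (diacritic error rate, word error rate) for one sentence pair.

    Both strings must share the same words and base letters; only the
    combining marks may differ.
    """
    ref_words, pred_words = reference.split(), prediction.split()
    if not ref_words or len(ref_words) != len(pred_words):
        raise ValueError("sentences must be non-empty and word-aligned")

    letters = letter_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_units, pred_units = split_marks(ref_word), split_marks(pred_word)
        if [b for b, _ in ref_units] != [b for b, _ in pred_units]:
            raise ValueError("base-letter skeletons differ")
        wrong = sum(rm != pm for (_, rm), (_, pm) in zip(ref_units, pred_units))
        letters += len(ref_units)
        letter_errors += wrong
        word_errors += 1 if wrong else 0
    return letter_errors / letters, word_errors / len(ref_words)


if __name__ == "__main__":
    ref = "\u0628\u064E\u062A \u062A\u064F"   # two words, reference marks
    hyp = "\u0628\u064E\u062A \u062A\u0650"   # second word has a wrong mark
    print(error_rates(ref, hyp))
```

Under this definition a single wrong mark counts once toward the diacritic error rate but makes its whole word count as an error, which is why word-level rates are typically higher than character-level ones.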

Implications for Low-Resource Language NLP

Koshur Diacritizer provides a reproducible baseline for diacritic restoration in Kashmiri and demonstrates the viability of byte-level models for low-resource languages. The public release of the dataset, model, and source code (under a Creative Commons BY-NC-SA 4.0 license) enables other researchers to build on this work. While the immediate application is linguistic, such diacritic restoration can improve downstream tasks like text-to-speech, machine translation, and information retrieval — capabilities that, when extended to other languages, may eventually benefit multinational enterprises handling multilingual trade documentation.

