
Koshur Diacritizer: A Byte-Level Model Restores Diacritics for Kashmiri Language NLP

Researchers have developed Koshur Diacritizer, a byte-level sequence-to-sequence model based on ByT5-small, to restore missing diacritic marks in Kashmiri digital text. Trained on 23,700 sentence pairs, the model achieves a DERm of 0.2012 and a word error rate of 0.2159, and a native Kashmiri linguistic expert rated its output at a mean accuracy of 77.5%. The dataset, model, and source code are publicly released to support low-resource language research.

iGEN Editorial
June 16, 2026

Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, suffers from frequent omission of diacritic marks in digital text, creating ambiguity and hindering downstream natural language processing (NLP) applications. To address this, researchers have introduced Koshur Diacritizer, a byte-level sequence-to-sequence model for diacritic restoration, detailed in a paper on arXiv.

The Problem of Missing Diacritics

In many languages written in Arabic-derived scripts, diacritic marks indicate vowel sounds or other phonetic distinctions. When omitted, readers and NLP systems must rely on context to disambiguate words. For low-resource languages like Kashmiri, this problem is compounded by limited digital resources. According to the paper by Malik and colleagues, the lack of diacritics creates “ambiguity” that challenges text-to-speech, machine translation, and other applications.

The Koshur Diacritizer Approach

The researchers built Koshur Diacritizer on ByT5-small, a byte-level variant of the T5 transformer architecture that operates directly on UTF-8 bytes rather than on subword tokens from a learned vocabulary. This byte-level design suits languages with complex scripts because it needs no script-specific tokenizer: every character, including every combining mark, is represented by the bytes that encode it. The framework incorporates:

  • Script-aware normalization to standardize input variations.
  • Alignment validation to ensure correct pairing of undiacritized and diacritized sentences.
  • Skeleton-preserving inference that restores diacritics while keeping the original base-letter sequence intact.
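The byte-level view and the skeleton-preserving constraint can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a "diacritic" is any Unicode nonspacing mark (category Mn) and that a prediction which alters the base letters is discarded in favor of the input. The function names and the sample string (keheh, fatha, lam, written as escapes) are hypothetical.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """Assumption: treat every Unicode nonspacing mark (Mn) as a diacritic."""
    return unicodedata.category(ch) == "Mn"


def skeleton(text: str) -> str:
    """Drop all marks, leaving only the base-letter sequence."""
    return "".join(ch for ch in text if not is_mark(ch))


def preserve_skeleton(source: str, predicted: str) -> str:
    """Accept a prediction only if its base letters match the input exactly.

    Hypothetical fallback policy: return the undiacritized input unchanged
    when the model has inserted, dropped, or swapped a base letter.
    """
    return predicted if skeleton(predicted) == skeleton(source) else source


# Illustrative string only: keheh + fatha + lam.
diacritized = "\u06a9\u064e\u0644"
undiacritized = skeleton(diacritized)

# A byte-level model sees the UTF-8 bytes, two per character here.
byte_view = list(diacritized.encode("utf-8"))

accepted = preserve_skeleton(undiacritized, diacritized)
rejected = preserve_skeleton(undiacritized, "\u06a9\u064e\u0645")  # lam became meem
```

Here `accepted` is the diacritized prediction, while `rejected` falls back to the input because the predicted base letters differ. Precomposed letters that carry a mark as part of a single code point are not split by this check. Whether they should count as diacritized forms is a normalization decision the sketch leaves open.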

Dataset and Training

To train the model, the researchers created a publicly available dataset of 23,700 aligned undiacritized-diacritized Kashmiri sentence pairs. The abstract does not specify the data source or collection methodology, but the release provides a foundation for future work. The model was trained on this dataset and evaluated on a held-out test set.

Performance Evaluation

Experimental results on the test set report a DERm (Diacritic Error Rate modified) of 0.2012 and a Word Error Rate (WER) of 0.2159. Beyond automated metrics, a native Kashmiri linguistic expert evaluated the model’s output, yielding a mean accuracy of 77.5%. The following table summarizes the key performance indicators:

Metric            Value
DERm              0.2012
WER               0.2159
Expert accuracy   77.5%
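The article does not give the formulas behind these scores, so the sketch below is only a common convention for diacritization metrics, not the paper's definition of DERm or WER. It assumes the reference and the hypothesis share the same words and base letters, counts a letter as wrong when its attached marks differ, and counts a word as wrong when any of its letters is wrong. The names `split_marks` and `der_and_wer` are hypothetical.

```python
import unicodedata


def split_marks(word: str) -> list[tuple[str, str]]:
    """Split a word into (base letter, attached marks) pairs."""
    units: list[tuple[str, str]] = []
    for ch in word:
        if unicodedata.category(ch) == "Mn" and units:
            base, marks = units[-1]
            units[-1] = (base, marks + ch)  # attach mark to preceding letter
        else:
            units.append((ch, ""))
    return units


def der_and_wer(reference: str, hypothesis: str) -> tuple[float, float]:
    """Return (diacritic error rate, word error rate) under the stated assumptions."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same non-zero word count")
    letters = letter_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_units, hyp_units = split_marks(ref_word), split_marks(hyp_word)
        if [b for b, _ in ref_units] != [b for b, _ in hyp_units]:
            raise ValueError("base letters differ; metrics assume a preserved skeleton")
        wrong = sum(r[1] != h[1] for r, h in zip(ref_units, hyp_units))
        letters += len(ref_units)
        letter_errors += wrong
        word_errors += 1 if wrong else 0
    return letter_errors / letters, word_errors / len(ref_words)


# Illustrative strings only: the hypothesis misses the kasra in the second word.
reference = "\u06a9\u064e\u0644 \u0628\u0650\u0646"
hypothesis = "\u06a9\u064e\u0644 \u0628\u0646"
der, wer = der_and_wer(reference, hypothesis)
```

With one wrong letter out of four and one wrong word out of two, this toy pair gives a diacritic error rate of 0.25 and a word error rate of 0.5. Marks are compared as exact strings, so two marks attached in a different order would count as an error unless the text is normalized first.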

Implications for Low-Resource Language NLP

Koshur Diacritizer provides a reproducible baseline for diacritic restoration in Kashmiri and demonstrates the viability of byte-level models for low-resource languages. The public release of the dataset, model, and source code (under a Creative Commons BY-NC-SA 4.0 license) enables other researchers to build on this work. Beyond the linguistic task itself, diacritic restoration can improve downstream applications such as text-to-speech, machine translation, and information retrieval. Extended to other languages, these capabilities may eventually benefit multinational enterprises handling multilingual trade documentation.

