LM-SPT Uses Semantic Distillation to Improve Speech Tokenization for Language Models

A new speech tokenization method called LM-SPT uses semantic speech-resynthesis distillation to better align discrete speech tokens with language models. The approach outperforms previous semantic-enhanced tokenizers on automatic speech recognition and text-to-speech tasks without sacrificing reconstruction fidelity.

iGEN Editorial
June 17, 2026

Speech language models (SLMs) increasingly rely on discrete speech tokens as an interface between speech and text. However, token sequences from current methods are often much longer than their textual counterparts, hindering integration with pretrained language models. A new paper on arXiv proposes LM-SPT, an LM-aligned speech tokenization method that uses semantic distillation to produce more compact and semantically aligned tokens.

The Challenge of Speech Tokenization

Existing speech tokenization approaches use self-supervised learning (SSL) teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer. This process suppresses acoustic redundancy and captures content-related latent structures. However, according to the paper by Daejin Jo, Jeeyoung Yun, Byungseok Roh, and Sungwoong Kim, listed under arXiv's Computation and Language category, these tokenizers "often operate at relatively high frame rates, producing token sequences significantly longer than their textual counterparts and hindering seamless integration with pretrained LMs."

Some recent methods attempt to reduce the token rate by applying uniform average pooling to SSL features, but the authors argue this "can over-smooth content-bearing regions and dilute the structural information, thereby potentially limiting the LM alignment."
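To make the over-smoothing concern concrete, here is a minimal sketch of uniform average pooling over a sequence of feature frames. It is an illustration only, not code from the paper: the function name average_pool, the stride parameter, and the toy one-dimensional features are all assumptions chosen for the example.

```python
from typing import List

Frame = List[float]


def average_pool(frames: List[Frame], stride: int) -> List[Frame]:
    """Average non-overlapping windows of `stride` frames.

    A shorter trailing window, if any, is averaged over its own length.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    pooled: List[Frame] = []
    for start in range(0, len(frames), stride):
        window = frames[start:start + stride]
        dim = len(window[0])
        pooled.append(
            [sum(frame[d] for frame in window) / len(window) for d in range(dim)]
        )
    return pooled


# Toy 1-D "SSL features" (hypothetical values): one sharp,
# content-bearing peak surrounded by near-silent frames.
frames = [[0.0], [0.0], [8.0], [0.0], [0.0], [0.0], [0.0], [0.0]]

pooled = average_pool(frames, stride=4)
print(pooled)  # [[2.0], [0.0]]
```

Pooling by a factor of four cuts eight frames to two, but the peak of 8.0 is flattened to 2.0 because it is averaged with its silent neighbors. That is the kind of dilution the authors describe.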

How LM-SPT Works

LM-SPT addresses these limitations through a novel approach called semantic speech-resynthesis distillation. Instead of directly matching teacher and student features via pooling, LM-SPT resynthesizes speech from semantic tokens only. It then minimizes the discrepancy between representations extracted from the original and resynthesized waveforms using a frozen, LM-aligned speech encoder.

This indirect supervision avoids rigid temporal alignment between teacher and student features. It also encourages the tokenizer to learn dedicated semantic units that stay aligned with LMs even at reduced frame rates. The method combines:

  • A frozen, LM-aligned speech encoder to provide supervision
  • Semantic resynthesis that generates speech from tokens alone
  • Distillation loss that measures representation discrepancy
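
The three components above can be sketched as a single loss computation. This is a hedged illustration under stated assumptions, not the authors' implementation: the names tokenize, resynthesize, and frozen_encoder are hypothetical callables, the toy stand-ins below replace real neural networks, and mean squared error is an assumed choice of discrepancy measure that the article does not specify.

```python
from typing import Callable, List, Sequence

Waveform = Sequence[float]
Features = List[List[float]]


def mse(a: Features, b: Features) -> float:
    """Mean squared error between two equally shaped feature sequences."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        raise ValueError("feature shapes must match")
    total = sum((x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb))
    count = sum(len(fa) for fa in a)
    return total / count


def resynthesis_distillation_loss(
    wave: Waveform,
    tokenize: Callable[[Waveform], List[int]],
    resynthesize: Callable[[List[int]], List[float]],
    frozen_encoder: Callable[[Waveform], Features],
) -> float:
    """Compare encoder features of the original and the resynthesized speech."""
    tokens = tokenize(wave)            # semantic tokens only
    resynth = resynthesize(tokens)     # speech regenerated from those tokens
    target = frozen_encoder(wave)      # supervision from the frozen encoder
    pred = frozen_encoder(resynth)     # same encoder applied to the resynthesis
    return mse(target, pred)


# Toy stand-ins (assumptions for illustration only).
def toy_tokenize(wave: Waveform) -> List[int]:
    # One integer token per pair of samples, i.e. a halved frame rate.
    return [round(sum(wave[i:i + 2]) / 2) for i in range(0, len(wave), 2)]


def toy_resynthesize(tokens: List[int]) -> List[float]:
    # Repeat each token twice to get back to the original length.
    return [float(t) for t in tokens for _ in range(2)]


def toy_encoder(wave: Waveform) -> Features:
    # One 1-D feature per pair of samples; it has no trainable state.
    return [[sum(wave[i:i + 2]) / 2] for i in range(0, len(wave), 2)]


loss = resynthesis_distillation_loss(
    [1.0, 1.2, 3.0, 2.6], toy_tokenize, toy_resynthesize, toy_encoder
)
print(loss)  # approximately 0.025
```

In a real system, the gradient of this loss would update the tokenizer and resynthesis modules while the encoder stays frozen. Because the comparison happens in the encoder's representation space, the tokens are not forced to line up frame by frame with pooled teacher features.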

Experimental Results

The paper reports that LM-SPT "consistently outperforms previous semantic-enhanced speech tokenizers when applied to SLMs for the tasks of automatic speech recognition and text-to-speech, even without compromising the speech reconstruction fidelity at the codec level." This means the method improves downstream task performance while maintaining the quality of reconstructed speech.

Implications for Speech Language Models

For enterprise technology leaders exploring voice interfaces or speech-based automation, LM-SPT represents a step toward more efficient integration of speech and text modalities. By producing token sequences that are shorter and better aligned with language models, the approach could reduce computational overhead in SLM pipelines and improve accuracy on tasks like transcription and voice synthesis. The paper is available on arXiv under the identifier 2506.16738.

