LM-SPT Uses Semantic Distillation to Improve Speech Tokenization for Language Models

A new speech tokenization method called LM-SPT uses semantic speech-resynthesis distillation to better align discrete speech tokens with language models. The approach outperforms previous semantic-enhanced tokenizers on automatic speech recognition and text-to-speech tasks without sacrificing reconstruction fidelity.

iGEN Editorial

June 17, 2026

Speech language models (SLMs) increasingly rely on discrete speech tokens as an interface between speech and text. However, token sequences from current methods are often much longer than their textual counterparts, hindering integration with pretrained language models. A new paper on arXiv proposes LM-SPT, an LM-aligned speech tokenization method that uses semantic distillation to produce more compact and semantically aligned tokens.

The Challenge of Speech Tokenization

Existing speech tokenization approaches use self-supervised learning (SSL) teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer. This process suppresses acoustic redundancy and captures content-related latent structures. However, according to the paper by Daejin Jo, Jeeyoung Yun, Byungseok Roh, and Sungwoong Kim (listed on arXiv under Computer Science > Computation and Language), these tokenizers "often operate at relatively high frame rates, producing token sequences significantly longer than their textual counterparts and hindering seamless integration with pretrained LMs."
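To make the length mismatch concrete, here is a rough back-of-the-envelope sketch. The helper name and the lower frame rates are illustrative assumptions, not figures from the paper; only the 50 Hz rate reflects the 20 ms frame stride commonly used by HuBERT-style encoders.

```python
def tokens_for_duration(duration_s: float, frame_rate_hz: float) -> int:
    """Number of speech tokens emitted for a clip at a given frame rate.

    Assumes one token per frame (a single quantizer stream); this is an
    illustrative simplification, not a description of any specific codec.
    """
    if duration_s < 0 or frame_rate_hz <= 0:
        raise ValueError("duration must be >= 0 and frame rate must be > 0")
    return round(duration_s * frame_rate_hz)


# A 10-second utterance at a few frame rates (25 and 12.5 Hz are hypothetical).
for rate in (50.0, 25.0, 12.5):
    print(f"{rate:>5} Hz -> {tokens_for_duration(10.0, rate)} tokens")
```

A transcript of ten seconds of speech typically runs to a few dozen text tokens, so even a modest frame rate leaves the speech sequence several times longer than its text counterpart.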

Some recent methods attempt to reduce the token rate by applying uniform average pooling to SSL features, but the authors argue this "can over-smooth content-bearing regions and dilute the structural information, thereby potentially limiting the LM alignment."
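The over-smoothing concern can be illustrated with a toy example. The sketch below uses made-up one-dimensional "features" and a hypothetical `average_pool` helper; it is not the pooling code of any particular tokenizer, only a minimal picture of what uniform averaging does to a short content-bearing region.

```python
from typing import List


def average_pool(frames: List[List[float]], window: int) -> List[List[float]]:
    """Uniformly average every `window` consecutive frames.

    The final chunk may be shorter than `window` if the lengths do not divide.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Toy features: a single content-bearing frame surrounded by near-silence.
frames = [[0.0], [0.0], [4.0], [0.0], [0.0], [0.0], [0.0], [0.0]]
pooled = average_pool(frames, window=4)

print(len(frames), "frames ->", len(pooled), "frames")
print(pooled)  # the peak of 4.0 is diluted to 1.0 in the first pooled frame
```

The frame count drops by the pooling factor, but the salient frame is blended with its neighbours, which is the kind of dilution the authors describe.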

How LM-SPT Works

LM-SPT addresses these limitations through a novel approach called semantic speech-resynthesis distillation. Instead of directly matching teacher and student features via pooling, LM-SPT resynthesizes speech from semantic tokens only. It then minimizes the discrepancy between representations extracted from the original and resynthesized waveforms using a frozen, LM-aligned speech encoder.

This indirect supervision avoids rigid temporal alignment between teacher and student features and encourages the tokenizer to learn semantic units that align better with LMs at reduced frame rates. The method combines three components:

A frozen, LM-aligned speech encoder to provide supervision
Semantic resynthesis that generates speech from tokens alone
Distillation loss that measures representation discrepancy
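The three components above can be sketched as a single forward pass. Everything in this example is a toy stand-in: the codebook, the hop size, the `tokenize`, `resynthesize`, and `frozen_encoder` functions, and the choice of mean squared error as the discrepancy measure are all assumptions made for illustration, since the article does not specify the paper's actual architecture or loss.

```python
from typing import Callable, List

Waveform = List[float]
Features = List[float]

CODEBOOK = [0.0, 0.5, 1.0]  # hypothetical codebook; real quantizers learn theirs
HOP = 4                     # samples per semantic token (toy value)


def tokenize(wave: Waveform) -> List[int]:
    """Toy semantic tokenizer: one nearest-codebook index per HOP samples."""
    tokens = []
    for start in range(0, len(wave), HOP):
        chunk = wave[start:start + HOP]
        mean = sum(chunk) / len(chunk)
        tokens.append(min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - mean)))
    return tokens


def resynthesize(tokens: List[int]) -> Waveform:
    """Toy decoder: rebuild a waveform from semantic tokens only."""
    return [CODEBOOK[t] for t in tokens for _ in range(HOP)]


def frozen_encoder(wave: Waveform) -> Features:
    """Toy stand-in for the frozen, LM-aligned speech encoder (never trained here)."""
    return [sum(wave[i:i + 2]) / len(wave[i:i + 2]) for i in range(0, len(wave), 2)]


def distillation_loss(wave: Waveform, encoder: Callable[[Waveform], Features]) -> float:
    """Discrepancy between encoder features of the original and resynthesized audio.

    Mean squared error is an assumed choice of discrepancy measure.
    """
    original_feats = encoder(wave)
    resynth_feats = encoder(resynthesize(tokenize(wave)))
    n = min(len(original_feats), len(resynth_feats))
    if n == 0:
        raise ValueError("waveform is too short to produce any features")
    return sum((a - b) ** 2 for a, b in zip(original_feats, resynth_feats)) / n


wave = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
print("tokens:", tokenize(wave))
print("loss:", distillation_loss(wave, frozen_encoder))
```

Note that the comparison happens in the encoder's feature space, on two waveforms, rather than between teacher features and tokens running at a different rate. That is why no frame-by-frame pooling of teacher features is needed in this setup. In actual training, the loss would be backpropagated into the tokenizer and decoder while the encoder stays fixed.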

Experimental Results

The paper reports that LM-SPT "consistently outperforms previous semantic-enhanced speech tokenizers when applied to SLMs for the tasks of automatic speech recognition and text-to-speech, even without compromising the speech reconstruction fidelity at the codec level." This means the method improves downstream task performance while maintaining the quality of reconstructed speech.

Implications for Speech Language Models

For enterprise technology leaders exploring voice interfaces or speech-based automation, LM-SPT represents a step toward more efficient integration of speech and text modalities. By producing token sequences that are shorter and better aligned with language models, the approach could reduce computational overhead in SLM pipelines and improve accuracy on tasks like transcription and voice synthesis. The paper is available on arXiv under the identifier 2506.16738.
