Speech language models (SLMs) increasingly rely on discrete speech tokens as an interface between speech and text. However, token sequences from current methods are often much longer than their textual counterparts, hindering integration with pretrained language models. A new paper on arXiv proposes LM-SPT, an LM-aligned speech tokenization method that uses semantic distillation to produce more compact and semantically aligned tokens.
The Challenge of Speech Tokenization
Existing speech tokenization approaches use self-supervised learning (SSL) teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer. This process suppresses acoustic redundancy and captures content-related latent structures. However, according to the paper by Daejin Jo, Jeeyoung Yun, Byungseok Roh, and Sungwoong Kim (listed on arXiv under Computer Science > Computation and Language), these tokenizers "often operate at relatively high frame rates, producing token sequences significantly longer than their textual counterparts and hindering seamless integration with pretrained LMs."
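To make the length mismatch concrete, the sketch below counts speech tokens for a single utterance at several frame rates. The 50 Hz rate matches HuBERT's usual 20 ms frame stride; the lower rates, the 8-second duration, and the 30-token transcript length are illustrative assumptions, not figures from the paper.

```python
import math

def speech_token_count(duration_s: float, frame_rate_hz: float) -> int:
    """Number of whole token frames produced for an utterance (one token per frame)."""
    return math.floor(duration_s * frame_rate_hz)

# Hypothetical utterance: 8 seconds of speech whose transcript is
# assumed to be about 30 text tokens long.
duration_s = 8.0
assumed_text_tokens = 30

# Illustrative frame rates in Hz; only 50 Hz is tied to a real model (HuBERT).
frame_rates_hz = (50.0, 25.0, 12.5, 6.25)

counts = {rate: speech_token_count(duration_s, rate) for rate in frame_rates_hz}
ratios = {rate: counts[rate] / assumed_text_tokens for rate in frame_rates_hz}

for rate in frame_rates_hz:
    print(f"{rate:>5} Hz: {counts[rate]} speech tokens, "
          f"{ratios[rate]:.1f}x the assumed text length")
```

Under these assumptions, the 50 Hz sequence is more than ten times longer than the transcript, which is the kind of gap a lower-rate tokenizer tries to close.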
Some recent methods attempt to reduce the token rate by applying uniform average pooling to SSL features, but the authors argue this "can over-smooth content-bearing regions and dilute the structural information, thereby potentially limiting the LM alignment."
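The over-smoothing concern can be illustrated with a minimal sketch of uniform average pooling over frame-level features. The function name `average_pool_frames` and the toy one-dimensional features are hypothetical; real SSL features have hundreds of dimensions, and this is not the code of any specific tokenizer.

```python
import numpy as np

def average_pool_frames(features: np.ndarray, factor: int) -> np.ndarray:
    """Uniformly average-pool a (frames, dim) matrix along time by `factor`."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    num_frames, dim = features.shape
    usable = (num_frames // factor) * factor  # drop frames that do not fill a window
    if usable == 0:
        raise ValueError("not enough frames for one pooling window")
    windows = features[:usable].reshape(usable // factor, factor, dim)
    return windows.mean(axis=1)

# Toy features: a short burst of "content" (1.0) surrounded by silence (0.0).
toy = np.array([[0.0], [0.0], [0.0], [1.0], [1.0], [1.0], [0.0], [0.0]])

# Pooling by 4 reduces 8 frames to 2, but the burst straddles both windows,
# so neither pooled frame represents it cleanly.
pooled = average_pool_frames(toy, factor=4)
print(pooled.ravel())  # [0.25 0.5]
```

Because the pooling windows are fixed in time, the burst is split and blended with silence in both output frames, which is a small-scale version of the dilution the authors describe.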
How LM-SPT Works
LM-SPT addresses these limitations through a novel approach called semantic speech-resynthesis distillation. Instead of directly matching teacher and student features via pooling, LM-SPT resynthesizes speech from semantic tokens only. It then minimizes the discrepancy between representations extracted from the original and resynthesized waveforms using a frozen, LM-aligned speech encoder.
This indirect supervision avoids rigid temporal alignment between teacher and student features and encourages the tokenizer to learn discrete units that remain semantically aligned with LMs at reduced frame rates. The method combines:
- A frozen, LM-aligned speech encoder to provide supervision
- Semantic resynthesis that generates speech from tokens alone
- Distillation loss that measures representation discrepancy
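The three components above can be sketched as a toy loss computation. Everything here is a stand-in chosen for illustration: the nearest-neighbor `tokenize`, the codebook-lookup `resynthesize`, the random-projection `frozen_encoder`, and the mean-squared-error discrepancy are assumptions, since the article does not specify the paper's architectures or loss form. The sketch only evaluates the loss; in actual training, gradients would update the tokenizer and decoder while the encoder stays fixed.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME = 4                                  # hypothetical samples per token frame
CODEBOOK = rng.normal(size=(16, FRAME))    # hypothetical 16-entry codebook
ENCODER_W = rng.normal(size=(FRAME, 8))    # frozen encoder weights, never updated

def tokenize(waveform: np.ndarray) -> np.ndarray:
    """Map each frame to the index of its nearest codebook entry."""
    frames = waveform.reshape(-1, FRAME)
    dists = ((frames[:, None, :] - CODEBOOK[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def resynthesize(tokens: np.ndarray) -> np.ndarray:
    """Rebuild a waveform from tokens alone (here: plain codebook lookup)."""
    return CODEBOOK[tokens].reshape(-1)

def frozen_encoder(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for the frozen speech encoder: a fixed nonlinear projection."""
    return np.tanh(waveform.reshape(-1, FRAME) @ ENCODER_W)

def distillation_loss(waveform: np.ndarray) -> float:
    """Discrepancy between encoder views of original and resynthesized audio."""
    resynth = resynthesize(tokenize(waveform))
    diff = frozen_encoder(waveform) - frozen_encoder(resynth)
    return float(np.mean(diff ** 2))

waveform = rng.normal(size=32)             # toy "speech": 8 frames of 4 samples
loss = distillation_loss(waveform)
print(f"distillation loss: {loss:.4f}")
```

Note that the loss compares encoder representations rather than token-level or frame-level teacher features, so the supervision signal does not depend on matching the teacher's frame rate.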
Experimental Results
The paper reports that LM-SPT "consistently outperforms previous semantic-enhanced speech tokenizers when applied to SLMs for the tasks of automatic speech recognition and text-to-speech, even without compromising the speech reconstruction fidelity at the codec level." This means the method improves downstream task performance while maintaining the quality of reconstructed speech.
Implications for Speech Language Models
For enterprise technology leaders exploring voice interfaces or speech-based automation, LM-SPT represents a step toward more efficient integration of speech and text modalities. By producing token sequences that are shorter and better aligned with language models, the approach could reduce computational overhead in SLM pipelines and improve accuracy on tasks like transcription and voice synthesis. The paper is available on arXiv under the identifier 2506.16738.