Non-verbal vocalizations (NVs) such as laughter, sighs, and coughs carry critical emotional and intentional cues in speech, yet existing AI systems have struggled to assess their perceptual quality. In a paper published on arXiv, researchers Jialong Mai, Jinxin, Xiaofen Xing, Wencui Liu, and Xiangmin Xu present NVMOS, which they describe as, to their knowledge, the first model that can reliably predict the perceptual quality of NV events in speech.
The Gap in Non-Verbal Vocalization Assessment
Existing speech quality assessment methods typically focus on overall naturalness, the authors explain. Meanwhile, non-verbal text-to-speech (TTS) evaluations mainly examine whether a target NV appears with the correct type and position. The perceptual quality of the NV events themselves has remained largely underexplored.
The NV-MOS Dataset
To fill this gap, the researchers constructed an NV-MOS dataset containing outputs from multiple NV-TTS systems as well as naturally occurring NV samples. Ratings were collected from three acoustic experts on a perceptual quality scale, providing a ground-truth reference for model training and evaluation.
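As a rough illustration of how ratings from three experts can be turned into a single reference score per clip, the sketch below averages each clip's ratings into a Mean Opinion Score. The clip IDs, the rating values, and the 1-to-5 scale are all assumptions made for the example; the article does not describe the dataset's actual scale or aggregation procedure.

```python
from statistics import mean

# Hypothetical ratings: clip id -> scores from three expert raters.
# The 1-5 scale is an assumption; the source does not state the range.
ratings = {
    "clip_001": [4, 5, 4],
    "clip_002": [2, 3, 2],
    "clip_003": [5, 5, 4],
}


def mean_opinion_scores(ratings_by_clip):
    """Average each clip's rater scores into a single MOS value."""
    return {clip: mean(scores) for clip, scores in ratings_by_clip.items()}


mos = mean_opinion_scores(ratings)
for clip, score in mos.items():
    print(f"{clip}: MOS = {score:.2f}")
```

With only three raters per clip, a single outlying rating moves the mean noticeably, which is one reason agreement between raters is usually reported alongside the MOS itself.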
NVMOS Model Architecture and Performance
The proposed NVMOS model incorporates a local NV-event focusing module. Experimental results show that NVMOS reaches expert-level or stronger agreement with human Mean Opinion Scores (MOS). The authors state that it is "to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech."
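Agreement between a quality predictor and human MOS is commonly summarized with correlation coefficients. The sketch below computes Pearson (linear) and Spearman (rank) correlation in plain Python. The choice of these two metrics and all score values are assumptions for illustration; the article does not say which agreement measures the paper reports.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-clip values, not taken from the paper.
human_mos = [4.33, 2.33, 4.67, 3.00, 1.67]
predicted = [4.10, 2.60, 4.50, 3.40, 2.00]

print(f"Pearson:  {pearson(human_mos, predicted):.3f}")
print(f"Spearman: {spearman(human_mos, predicted):.3f}")
```

Pearson correlation rewards predictions that track MOS linearly, while Spearman only checks that clips are ranked in the same order, so reporting both gives a fuller picture of agreement.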
Multimodal LLM Limitations
The study also evaluated audio-capable multimodal large language models such as Gemini. Clear inconsistencies were found between the scores generated by Gemini and the expert ratings. The authors conclude that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment.
| Method | Agreement with human MOS |
|---|---|
| Human experts (3 raters) | Ground truth reference |
| NVMOS | Expert-level or stronger agreement |
| Gemini (multimodal LLM) | Clear inconsistencies |
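One way to make a phrase like "expert-level agreement" concrete is to compare each system's correlation with MOS against a leave-one-rater-out baseline, in which each expert is correlated with the mean of the other experts. The sketch below does this for two hypothetical systems. The baseline definition, the system names, and every number are assumptions for illustration only; the article does not state how the paper defines expert-level agreement.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson linear correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical data: rows are clips, columns are three expert raters.
expert_scores = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 4],
    [1, 2, 2],
]

# Hypothetical system outputs for the same five clips.
system_scores = {
    "dedicated_model": [4.2, 2.4, 4.6, 3.3, 1.8],
    "general_llm": [3.0, 4.0, 3.5, 4.5, 3.0],
}


def leave_one_out_expert_agreement(scores):
    """Correlate each rater with the mean of the remaining raters."""
    n_raters = len(scores[0])
    results = []
    for r in range(n_raters):
        own = [row[r] for row in scores]
        others = [mean(v for i, v in enumerate(row) if i != r) for row in scores]
        results.append(pearson(own, others))
    return results


expert_baseline = mean(leave_one_out_expert_agreement(expert_scores))
mos = [mean(row) for row in expert_scores]

print(f"expert baseline (leave-one-out): r = {expert_baseline:.3f}")
for name, preds in system_scores.items():
    r = pearson(preds, mos)
    verdict = "at or above" if r >= expert_baseline else "below"
    print(f"{name}: r = {r:.3f} ({verdict} the expert baseline)")
```

The comparison is not perfectly like for like, because each expert is scored against two raters while each system is scored against all three. A stricter protocol would also hold out one rater when scoring the systems.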
Implications for Enterprise Voice AI
The paper does not cite specific commercial applications, but objective NV quality assessment is relevant to any industry deploying synthetic voices, such as customer service chatbots, virtual assistants, and accessibility tools. The study's findings suggest that off-the-shelf multimodal AI models are not yet sufficient for this specialized task, and that dedicated models like NVMOS may be needed to verify that synthetic speech sounds realistic and emotionally appropriate.