NVMOS: Novel AI Model Predicts Perceptual Quality of Non-Verbal Vocalizations in Speech

A new paper on arXiv introduces NVMOS, which its authors describe as the first model purpose-built to assess the perceptual quality of non-verbal vocalizations (NVs) such as laughter, sighs, and coughs in speech. The model was trained on a newly constructed NV-MOS dataset with expert ratings and reaches expert-level or stronger agreement with human Mean Opinion Scores. Scores from audio-capable multimodal LLMs such as Gemini showed clear inconsistencies with the expert ratings, highlighting the need for specialized NV quality assessment.

iGEN Editorial

June 16, 2026

Non-verbal vocalizations (NVs) such as laughter, sighs, and coughs carry critical emotional and intentional cues in speech, yet existing AI systems have struggled to assess their perceptual quality. According to a paper published on arXiv by researchers Mai, Jialong, Jinxin, Xing, Xiaofen, Liu, Wencui, Xu, and Xiangmin, the team has developed NVMOS, which is, to their knowledge, the first model that can reliably predict the perceptual quality of NV events in speech.

The Gap in Non-Verbal Vocalization Assessment

Existing speech quality assessment methods typically focus on overall naturalness, the authors explain. Meanwhile, non-verbal text-to-speech (TTS) evaluations mainly examine whether a target NV appears with the correct type and position. The perceptual quality of the NV events themselves has remained largely underexplored.

The NV-MOS Dataset

To fill this gap, the researchers constructed an NV-MOS dataset containing outputs from multiple NV-TTS systems as well as naturally occurring NV samples. Ratings were collected from three acoustic experts on a perceptual quality scale, providing a ground-truth reference for model training and evaluation.
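For readers unfamiliar with how a Mean Opinion Score is derived, the sketch below averages three raters' scores per clip. It is an illustration only: the clip names, the scores, and the 1-to-5 scale are assumptions, since the article does not state the scale or the aggregation the authors used.

```python
from statistics import mean

# Hypothetical ratings: clip id -> scores from three expert raters.
# The 1-5 scale and the clip names are assumptions for illustration.
ratings = {
    "laugh_tts_a": [4, 5, 4],
    "sigh_natural": [5, 5, 4],
    "cough_tts_b": [2, 3, 2],
}

def mean_opinion_scores(ratings):
    """Average each clip's per-rater scores into a single MOS value."""
    return {clip: mean(scores) for clip, scores in ratings.items()}

mos = mean_opinion_scores(ratings)
for clip, score in mos.items():
    print(f"{clip}: MOS = {score:.2f}")
```

With only three raters per clip, a single outlying score moves the average noticeably, which is one reason agreement between raters is usually reported alongside the MOS itself.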

NVMOS Model Architecture and Performance

The proposed NVMOS model incorporates a local NV-event focusing module. Experimental results show that NVMOS reaches expert-level or stronger agreement with human Mean Opinion Scores (MOS). The authors state that it is "to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech."
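One common way to make "expert-level agreement" concrete is to compare a model's correlation with the MOS against how well each expert correlates with the other experts. The sketch below does this with Pearson correlation on made-up numbers. The metric, the leave-one-rater-out baseline, and all scores are assumptions for illustration; the article does not say which agreement measure the paper reports.

```python
from math import sqrt

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def expert_baseline(rater_scores):
    """Mean correlation of each rater with the average of the other raters."""
    corrs = []
    for i, held_out in enumerate(rater_scores):
        others = [r for j, r in enumerate(rater_scores) if j != i]
        others_mean = [sum(col) / len(col) for col in zip(*others)]
        corrs.append(pearson(held_out, others_mean))
    return sum(corrs) / len(corrs)

# Made-up scores for five clips; rows are raters, columns are clips.
raters = [
    [4, 5, 2, 3, 1],
    [4, 4, 2, 3, 2],
    [5, 5, 3, 2, 1],
]
mos = [sum(col) / len(col) for col in zip(*raters)]
predicted = [4.2, 4.8, 2.1, 2.9, 1.3]  # hypothetical model outputs

model_corr = pearson(predicted, mos)
baseline = expert_baseline(raters)
print(f"model vs MOS: {model_corr:.3f}, expert baseline: {baseline:.3f}")
```

Under this reading, a model would count as expert-level when its correlation is at or above the baseline. The comparison is not perfectly symmetric, because the model is scored against the mean of all three raters while each rater is scored against the other two, so a careful evaluation would account for that difference.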

Multimodal LLM Limitations

The study also evaluated audio-capable multimodal large language models such as Gemini. Clear inconsistencies were found between the scores generated by Gemini and the expert ratings. The authors conclude that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment.

Method	Performance vs Human MOS
Human experts (3 raters)	Ground truth reference
NVMOS	Expert-level or stronger agreement
Gemini (multimodal LLM)	Clear inconsistencies

Implications for Enterprise Voice AI

While the paper does not cite specific commercial applications, the ability to objectively assess NV quality is relevant to industries deploying synthetic voices, such as customer service chatbots, virtual assistants, and accessibility tools. The study's Gemini results suggest that general-purpose multimodal models are not yet a reliable substitute for expert judgment on this specialized task, and that dedicated models like NVMOS may be needed to check whether synthetic speech sounds realistic and emotionally appropriate.
