
NVMOS: Novel AI Model Predicts Perceptual Quality of Non-Verbal Vocalizations in Speech

A new paper on arXiv introduces NVMOS, which its authors describe as the first model built specifically to assess the perceptual quality of non-verbal vocalizations (NVs) such as laughter, sighs, and coughs in speech. The model was trained on a newly constructed NV-MOS dataset with expert ratings and reaches expert-level agreement with human Mean Opinion Scores. Scores from multimodal LLMs such as Gemini showed clear inconsistencies with the expert ratings, highlighting the need for specialized NV quality assessment.

iGEN Editorial
June 16, 2026

Non-verbal vocalizations (NVs) such as laughter, sighs, and coughs carry critical emotional and intentional cues in speech, yet existing AI systems have struggled to assess their perceptual quality. A paper published on arXiv by Jialong Mai and co-authors (listed as Jinxin, Xiaofen Xing, Wencui Liu, and Xiangmin Xu) introduces NVMOS, which the team describes as, to their knowledge, the first model that can reliably predict the perceptual quality of NV events in speech.

The Gap in Non-Verbal Vocalization Assessment

Existing speech quality assessment methods typically focus on overall naturalness, the authors explain. Meanwhile, non-verbal text-to-speech (TTS) evaluations mainly examine whether a target NV appears with the correct type and position. The perceptual quality of the NV events themselves has remained largely underexplored.

The NV-MOS Dataset

To fill this gap, the researchers constructed an NV-MOS dataset containing outputs from multiple NV-TTS systems as well as naturally occurring NV samples. Ratings were collected from three acoustic experts on a perceptual quality scale, providing a ground-truth reference for model training and evaluation.
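As a rough illustration of how three expert ratings become a single reference score, the sketch below averages per-rater scores into a Mean Opinion Score for each clip. The clip names, the 1-to-5 scale, and the rating values are all hypothetical; the article does not state the scale or how the dataset aggregates ratings.

```python
from statistics import mean

# Hypothetical ratings: clip id -> scores from three expert raters.
# A 1-5 scale is assumed here; the article does not specify the scale.
ratings = {
    "clip_laugh_01": [4, 5, 4],
    "clip_sigh_02": [2, 3, 2],
    "clip_cough_03": [3, 3, 4],
}

def mean_opinion_scores(ratings):
    """Average each clip's per-rater scores into a single MOS."""
    return {clip: mean(scores) for clip, scores in ratings.items()}

mos = mean_opinion_scores(ratings)
for clip, score in mos.items():
    print(f"{clip}: {score:.2f}")
```

With only three raters per clip, a single disagreeing rater shifts the mean noticeably, which is one reason expert raters matter for a reference set of this kind.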

NVMOS Model Architecture and Performance

The proposed NVMOS model incorporates a local NV-event focusing module. Experimental results show that NVMOS reaches expert-level or stronger agreement with human Mean Opinion Scores (MOS). The authors state that it is "to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech."
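The article does not say which agreement metric the paper reports. One common way to quantify agreement between predicted and human scores is rank correlation, and "expert-level" can be read as matching how well one rater agrees with the others. The sketch below illustrates that reading with made-up numbers; it is not the NVMOS model or the authors' evaluation code, and every value and name in it is hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: per-clip scores from three raters and from a model.
rater_scores = [
    [4, 2, 3, 5, 1],  # rater A
    [5, 2, 3, 4, 1],  # rater B
    [4, 3, 3, 5, 2],  # rater C
]
model_scores = [4.2, 2.4, 3.1, 4.6, 1.5]

n_clips = len(model_scores)
mos = [sum(r[i] for r in rater_scores) / len(rater_scores) for i in range(n_clips)]
model_agreement = spearman(model_scores, mos)

# Leave-one-out human baseline: each rater against the mean of the other two.
human_agreement = []
for held_out in range(len(rater_scores)):
    others = [r for k, r in enumerate(rater_scores) if k != held_out]
    others_mean = [sum(r[i] for r in others) / len(others) for i in range(n_clips)]
    human_agreement.append(spearman(rater_scores[held_out], others_mean))

print(f"model vs MOS: {model_agreement:.3f}")
print("rater vs other raters:", [round(v, 3) for v in human_agreement])
```

Under this reading, a model reaches "expert-level or stronger" agreement when its correlation with the MOS is at least as high as the leave-one-out correlations of the individual raters.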

Multimodal LLM Limitations

The study also evaluated audio-capable multimodal large language models such as Gemini. Clear inconsistencies were found between the scores generated by Gemini and the expert ratings. The authors conclude that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment.

Method                      Performance vs. Human MOS
Human experts (3 raters)    Ground truth reference
NVMOS                       Expert-level or stronger agreement
Gemini (multimodal LLM)     Clear inconsistencies

Implications for Enterprise Voice AI

While the paper does not cite specific commercial applications, objective NV quality assessment is relevant to industries deploying synthetic voices, such as customer service chatbots, virtual assistants, and accessibility tools. The study suggests that off-the-shelf multimodal AI models are not yet reliable for this specialized task, and that dedicated models like NVMOS may be needed to ensure realistic, emotionally appropriate synthetic speech.

