New Research Advances Emotional Speech Synthesis with Latent Representations and FastSpeech 2

Researchers have published an empirical study on arXiv detailing a method for emotional speech synthesis by integrating speaker embedding and a prosody bottleneck into the FastSpeech 2 architecture. The approach addresses two sub-tasks: generating emotional speech for a single speaker and transferring speaking styles from another speaker while retaining target speaker identity. The work was submitted to the VLSP 2022 competition.

iGEN Editorial

June 16, 2026

A new empirical study published on arXiv explores methods for learning latent representations to control emotional expression in speech synthesis, a field that has seen rapid advances through deep learning. The research, submitted to the VLSP 2022 competition, proposes modifications to the FastSpeech 2 architecture by integrating speaker embedding and a prosody bottleneck to generate natural-sounding emotional speech.

The Challenge of Emotional Speech Synthesis

According to the paper, speech synthesis has improved dramatically over the last few years thanks to deep learning, and a growing number of deep learning-based text-to-speech (TTS) systems now produce voices with high intelligibility and naturalness. Controlling the expressiveness of generated speech, however, remains a significant challenge. Generating speech in different styles or manners has received increasing attention, and this study addresses the task of emotional speech synthesis (ESS) as defined by the VLSP 2022 competition.

Methodology: Integrating Speaker Embedding and Prosody Bottleneck

The researchers built on FastSpeech 2, a popular non-autoregressive TTS model, by adding two key components: a speaker embedding and a prosody bottleneck. The speaker embedding helps capture and preserve the target speaker's vocal characteristics, while the prosody bottleneck encodes prosodic variations such as pitch, duration, and energy that convey emotion. According to the paper, this integration yields promising results in generating emotional speech while maintaining the speaker's identity. The model learns latent representations that disentangle speaker identity from prosodic features, enabling fine-grained control over emotional expression.
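As a rough illustration of the conditioning idea described above, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the speaker vector and an up-projected prosody vector are simply added to every encoder state; the article does not say how the paper combines them. All names, sizes, and the random weights (speaker_table, down_proj, up_proj, HIDDEN, BOTTLENECK) are hypothetical.

```python
import random

random.seed(0)

HIDDEN = 8        # encoder hidden size (illustrative assumption)
BOTTLENECK = 2    # prosody bottleneck width, deliberately much smaller than HIDDEN
N_MELS = 4        # mel channels in the reference audio (illustrative assumption)
N_SPEAKERS = 3


def rand_matrix(rows, cols):
    """Random stand-in for weights that a real model would learn."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Hypothetical parameters, not taken from the paper.
speaker_table = rand_matrix(N_SPEAKERS, HIDDEN)   # one embedding per speaker
down_proj = rand_matrix(BOTTLENECK, N_MELS)       # mel frame -> bottleneck
up_proj = rand_matrix(HIDDEN, BOTTLENECK)         # bottleneck -> hidden size


def prosody_bottleneck(reference_mel):
    """Average the reference frames over time, then squeeze to BOTTLENECK dims."""
    n_frames = len(reference_mel)
    mean_frame = [sum(frame[c] for frame in reference_mel) / n_frames
                  for c in range(N_MELS)]
    return matvec(down_proj, mean_frame)


def condition(encoder_out, speaker_id, reference_mel):
    """Add the speaker and prosody vectors to every phoneme-level encoder state."""
    speaker_vec = speaker_table[speaker_id]
    prosody_vec = matvec(up_proj, prosody_bottleneck(reference_mel))
    return [[h + s + p for h, s, p in zip(state, speaker_vec, prosody_vec)]
            for state in encoder_out]


encoder_out = rand_matrix(5, HIDDEN)       # stand-in for 5 encoded phonemes
reference_mel = rand_matrix(20, N_MELS)    # stand-in for 20 reference frames
conditioned = condition(encoder_out, speaker_id=1, reference_mel=reference_mel)
print(len(conditioned), len(conditioned[0]))  # 5 8
```

The narrow bottleneck is the design point: with only a few dimensions available, the prosody path has little room to carry speaker identity, which is left to the separate speaker embedding.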

Sub-tasks and Experimental Setup

The study targets two specific sub-tasks from the VLSP 2022 emotional speech synthesis challenge:

Sub-task	Description
Sub-task 1	Generate emotional speech of a single speaker
Sub-task 2	Transfer speaking styles from another speaker to the target speaker with neutral non-expressive data, while retaining the target speaker's identity

The first sub-task involves producing emotional utterances (e.g., happy, sad, angry) from a given input text for a single speaker. The second sub-task is more complex: it requires transferring the emotional speaking style from a source speaker (who provides expressive data) to a target speaker, using only neutral non-expressive data from the target speaker, while preserving the target speaker's voice identity.
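The difference between the two sub-tasks comes down to where identity and style are sourced from. The sketch below expresses that as a small data structure. It is an illustrative assumption rather than anything from the paper or the VLSP 2022 tooling; SynthesisRequest, the helper functions, and the speaker IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisRequest:
    text: str
    identity_speaker: str   # whose voice the output should sound like
    style_speaker: str      # whose expressive recordings supply the prosody
    emotion: str


def subtask1_request(text, speaker, emotion):
    """Sub-task 1: one speaker supplies both identity and emotional style."""
    return SynthesisRequest(text, speaker, speaker, emotion)


def subtask2_request(text, target_speaker, source_speaker, emotion):
    """Sub-task 2: identity from the neutral-only target, style from the source."""
    if target_speaker == source_speaker:
        raise ValueError("style transfer needs two distinct speakers")
    return SynthesisRequest(text, target_speaker, source_speaker, emotion)


req1 = subtask1_request("Hello", "spk_A", "happy")
req2 = subtask2_request("Hello", "spk_B", "spk_A", "happy")
print(req1.identity_speaker == req1.style_speaker)  # True
print(req2.identity_speaker == req2.style_speaker)  # False
```

In the second request, spk_B stands in for a target speaker with only neutral recordings, so the emotional style has to come from spk_A.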

Implications for AI and Human-Computer Interaction

While the paper does not provide specific quantitative results, it states that the proposed systems show promise in generating emotional speech for both sub-tasks. This work contributes to the broader goal of making TTS systems more expressive and controllable, which has applications in virtual assistants, audiobooks, customer service, and assistive technologies. The use of latent representations to separate speaker identity from prosody is a step toward more personalized and emotionally aware speech interfaces.

The study was authored by Quang, Vinh Dang, and Huy Ngo, and is available on arXiv. The code and data associated with the article are linked from the paper's page, though specific details on training data and evaluation metrics are not included in the abstract. The research aligns with ongoing efforts in the speech synthesis community to move beyond neutral, flat speech and toward more natural, emotionally nuanced communication.
