A new empirical study published on arXiv explores methods for learning latent representations to control emotional expression in speech synthesis, a field that has seen rapid advances through deep learning. The research, submitted to the VLSP 2022 competition, proposes modifications to the FastSpeech 2 architecture by integrating speaker embedding and a prosody bottleneck to generate natural-sounding emotional speech.
The Challenge of Emotional Speech Synthesis
According to the paper, speech synthesis has improved dramatically over the last few years thanks to deep learning, and a growing number of deep learning-based text-to-speech (TTS) systems now produce voices with high intelligibility and naturalness. However, controlling the expressiveness of generated speech remains a significant challenge. Generating speech in different styles or manners has received increasing attention, and this study addresses the task of emotional speech synthesis (ESS) as defined by the VLSP 2022 competition.
Methodology: Integrating Speaker Embedding and Prosody Bottleneck
The researchers built on FastSpeech 2, a popular non-autoregressive TTS model, by adding two key components: speaker embedding and a prosody bottleneck. Speaker embedding helps capture and preserve the target speaker's vocal characteristics, while the prosody bottleneck encodes prosodic variations such as pitch, duration, and energy that convey emotion. According to the paper, this integration produces promising emotional speech while maintaining the speaker's identity. The model is intended to learn latent representations that separate speaker identity from prosodic features, so that emotional expression can be controlled independently of the voice.
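To make the idea concrete, the following is a minimal sketch of how a speaker vector and a low-dimensional prosody vector might be combined with text-encoder output before the rest of a FastSpeech 2-style model. It is not the authors' implementation: the layer sizes, the random weights standing in for trained parameters, and the names `prosody_bottleneck` and `condition` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these values come from the paper.
HIDDEN = 8        # width of the text-encoder output
N_SPEAKERS = 3    # rows in the speaker lookup table
BOTTLENECK = 2    # width of the prosody bottleneck
N_MELS = 5        # channels of the reference mel-spectrogram

# Random parameters stand in for trained weights (assumption).
speaker_table = rng.normal(size=(N_SPEAKERS, HIDDEN))
w_down = rng.normal(size=(N_MELS, BOTTLENECK))   # reference -> bottleneck
w_up = rng.normal(size=(BOTTLENECK, HIDDEN))     # bottleneck -> hidden

def prosody_bottleneck(reference_mel):
    """Compress a (frames, N_MELS) reference into one BOTTLENECK-sized vector."""
    pooled = reference_mel.mean(axis=0)          # average over time
    return np.tanh(pooled @ w_down)              # narrow width limits what passes through

def condition(encoder_out, speaker_id, reference_mel):
    """Add the speaker and prosody vectors to every phoneme position."""
    speaker_vec = speaker_table[speaker_id]                   # (HIDDEN,)
    prosody_vec = prosody_bottleneck(reference_mel) @ w_up    # (HIDDEN,)
    return encoder_out + speaker_vec + prosody_vec            # broadcast over phonemes

encoder_out = rng.normal(size=(4, HIDDEN))       # 4 phonemes
reference_mel = rng.normal(size=(10, N_MELS))    # 10 reference frames
conditioned = condition(encoder_out, speaker_id=1, reference_mel=reference_mel)
print(conditioned.shape)  # (4, 8)
```

The narrow bottleneck is the design point here: because the prosody vector has far fewer dimensions than the hidden state, it has little room to carry speaker identity, which is left to the separate speaker embedding.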
Sub-tasks and Experimental Setup
The study targets two specific sub-tasks from the VLSP 2022 emotional speech synthesis challenge:
| Sub-task | Description |
|---|---|
| Sub-task 1 | Generate emotional speech of a single speaker |
| Sub-task 2 | Transfer speaking styles from another speaker to a target speaker for whom only neutral, non-expressive data is available, while retaining the target speaker's identity |
The first sub-task involves producing emotional utterances (e.g., happy, sad, angry) from a given input text for a single speaker. The second sub-task is more complex: it requires transferring the emotional speaking style from a source speaker (who provides expressive data) to a target speaker, using only neutral non-expressive data from the target speaker, while preserving the target speaker's voice identity.
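One way to picture the second sub-task is as a pairing at inference time: the speaker identity comes from the target speaker, while the prosody vector comes from the source speaker's expressive recordings. The sketch below illustrates that pairing only. The per-emotion averaging, the `synthesis_inputs` helper, and the random vectors standing in for encoder output are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
BOTTLENECK = 2  # illustrative bottleneck width

# Hypothetical bottleneck vectors extracted from the source speaker's
# expressive recordings; random numbers stand in for real encoder output.
source_prosody = {
    "happy": rng.normal(loc=1.0, size=(5, BOTTLENECK)),
    "sad": rng.normal(loc=-1.0, size=(5, BOTTLENECK)),
}

# One prototype vector per emotion: the mean over that emotion's utterances.
prototypes = {emotion: vecs.mean(axis=0) for emotion, vecs in source_prosody.items()}

def synthesis_inputs(text, target_speaker_id, emotion):
    """Pair the target speaker's identity with a source-derived prosody vector."""
    return {
        "text": text,
        "speaker_id": target_speaker_id,   # identity comes from the target speaker
        "prosody": prototypes[emotion],    # style comes from the source speaker
    }

request = synthesis_inputs("Hello", target_speaker_id=7, emotion="happy")
print(request["speaker_id"], request["prosody"].shape)
```

This kind of pairing only works if the prosody vector carries little speaker information, which is why separating the two representations matters for this sub-task.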
Implications for AI and Human-Computer Interaction
While the paper does not provide specific quantitative results, it states that the proposed systems show promise in generating emotional speech for both sub-tasks. This work contributes to the broader goal of making TTS systems more expressive and controllable, which has applications in virtual assistants, audiobooks, customer service, and assistive technologies. The use of latent representations to separate speaker identity from prosody is a step toward more personalized and emotionally aware speech interfaces.
The study was authored by Quang, Vinh Dang, and Huy Ngo, and is available on arXiv. The code and data associated with the article are linked from the paper's page, though specific details on training data and evaluation metrics are not included in the abstract. The research aligns with ongoing efforts in the speech synthesis community to move beyond neutral, flat speech and toward more natural, emotionally nuanced communication.