NSV-TTS: Non-speech Vocalization Modeling and Transfer in Emotional Text-to-speech

Author name Netease Games AI Lab, Guangzhou, China

Abstract

This paper addresses the problem of non-speech vocalization (NSV) modeling and transfer in emotional text-to-speech (TTS). The goal is to transfer NSVs to a target speaker whose training data contains no NSV samples. We use unsupervised learning to extract unsupervised linguistic units (ULUs) for NSV labeling. In addition, we propose token mixing and random masking to mitigate the training-inference mismatch. We evaluate the proposed method on various NSV types and emotion classes. The experimental results show that using ULUs as the input representation does not affect emotional TTS performance. Furthermore, the proposed method performs decently on the NSV transfer task. Lastly, we conduct ablation studies to investigate the proposed method further.

Paper link: arXiv

Zero-shot NSV Transfer Demo

Cough

Anger	Disgust	Doubt	Fear	Sad	Surprise

Cry

Fear	Sad

Fright

Fear

Laughter

Happy

Struggle

Fear	Surprise