NSV-TTS: Non-speech Vocalization Modeling and Transfer in Emotional Text-to-speech
Author name
Netease Games AI Lab, Guangzhou, China
Abstract
This paper addresses non-speech vocalization (NSV) modeling and transfer in emotional text-to-speech (TTS). The goal is to transfer NSVs to a target speaker whose training data contains no NSV samples. We use unsupervised learning to extract unsupervised linguistic units (ULUs) for NSV labeling. In addition, we propose token mixing and random masking to mitigate the training-inference mismatch. We evaluate the proposed method on various NSV types and emotion classes. The experimental results show that using ULUs as the input representation does not affect emotional TTS performance. Furthermore, the proposed method achieves reasonable performance on the NSV transfer task. Lastly, we conduct ablation studies to investigate the proposed method further.