GitHub Page demo

Abstract:

Recently, sequence-to-sequence models have been successfully applied to text-to-speech (TTS) to synthesize speech for single-language text. Synthesizing speech for code-switched text usually requires multi-lingual speech from the target speaker, but such data is hard to collect. In this paper, we investigate transfer learning for code-switch TTS, where we utilize code-switch speech from Automatic Speech Recognition (ASR) to synthesize code-switch speech for our target speakers. As a result, our proposed methods enable the model to generate better code-switched speech for the target speakers in terms of naturalness and speaker consistency. We also found that transfer learning can achieve cross-lingual voice cloning with performance comparable to the state-of-the-art model.

Note:

Here, we present the code-switch speech synthesized by all the model variants in the paper. All of the texts here are unseen in training and development.

Systems investigated in the paper:

Tac: Tacotron trained with mono-lingual data;

Tac_Mix: Tacotron pre-trained in the proposed way, then fine-tuned with mono-lingual data;

SPE: Tacotron-based SPE TTS system trained with mono-lingual data;

SPE_Mix: Tacotron-based SPE TTS system pre-trained in the proposed way, then fine-tuned with the mono-lingual data.
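The four systems above differ along two axes: the architecture (Tacotron or the Tacotron-based SPE system) and whether pre-training in the proposed way precedes training on mono-lingual data. The sketch below only restates that list as a small lookup structure; the class name `SystemVariant`, its field names, and the helper `training_stages` are hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SystemVariant:
    """One system from the list above (hypothetical container, for illustration)."""
    name: str
    architecture: str
    pretrained: bool  # True if pre-trained in the proposed way before fine-tuning


# The four systems exactly as listed on this page.
VARIANTS = [
    SystemVariant("Tac", "Tacotron", pretrained=False),
    SystemVariant("Tac_Mix", "Tacotron", pretrained=True),
    SystemVariant("SPE", "Tacotron-based SPE", pretrained=False),
    SystemVariant("SPE_Mix", "Tacotron-based SPE", pretrained=True),
]


def training_stages(variant: SystemVariant) -> List[str]:
    """Return the training stages of a variant, in order."""
    if variant.pretrained:
        return [
            "pre-train in the proposed way",
            "fine-tune with mono-lingual data",
        ]
    return ["train with mono-lingual data"]


if __name__ == "__main__":
    for v in VARIANTS:
        print(f"{v.name} ({v.architecture}): " + " -> ".join(training_stages(v)))
```

Reading the list this way makes the comparisons in the samples easier to follow: Tac versus Tac_Mix and SPE versus SPE_Mix each isolate the effect of the pre-training stage on a fixed architecture.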

Tac Tac_Mix SPE SPE_Mix

Text: "播放no limit的所有歌"

GT: