Click here to read the paper!

Abstract:

Recently, sequence-to-sequence-based models have been successfully applied in text-to-speech (TTS) to synthesize speech for single-language text. Synthesizing speech for code-switched text usually requires multi-lingual speech from the target speaker, but such data is hard to collect. In this paper, we investigate transfer learning for code-switch TTS, where we utilize code-switch speech from Automatic Speech Recognition (ASR) to synthesize code-switch speech for our target speakers. As a result, our proposed methods enable the model to generate better code-switched speech for the target speakers in terms of naturalness and speaker consistency. We also found that transfer learning can achieve cross-lingual voice cloning with performance comparable to the state-of-the-art model.

Note:

Here, we present the code-switch speech synthesized by all the model variants in the paper. All of the texts here are unseen in training and development.

Systems investigated in the paper:

        Tac: Tacotron trained with mono-lingual data;

        Tac_Mix: Tacotron pre-trained in the proposed way, then fine-tuned with mono-lingual data;

        SPE: Tacotron-based SPE TTS system trained with mono-lingual data;

        SPE_Mix: Tacotron-based SPE TTS system pre-trained in the proposed way, then fine-tuned with the mono-lingual data.

        Tac        Tac_Mix        SPE        SPE_Mix

Text: "播放no limit的所有歌"

GT:

Chinese Speaker:

English Speaker:

Text: "Angle grinder又称研磨机或盘磨机"

GT:

Chinese Speaker:

English Speaker:

Text: "现在还不是谈论未来的事情。"

GT:

Chinese Speaker:

English Speaker:

Text: "会议还听取了有关人事事项的说明。"

GT:

Chinese Speaker:

English Speaker:

Text: "THE BEST ACCOMMODATION THE JAIL COULD OFFER WAS RESERVED FOR THE PRISONERS ON THE STATE SIDE."

GT:

Chinese Speaker:

English Speaker:

Text: "THIS WAS MORE PARTICULARLY THE PRACTICE IN LONDON."

GT:

Chinese Speaker:

English Speaker: