GitHub Pages demo

Abstract:

The International Phonetic Alphabet (IPA) has been widely used in cross-lingual text-to-speech (TTS) to achieve cross-lingual voice cloning (CL VC). However, IPA itself has been understudied in cross-lingual TTS. In this paper, we report empirical findings from building a cross-lingual TTS model that takes IPA as input. Experiments show that the way the IPA and suprasegmental sequences are processed has a negligible impact on CL VC performance. Furthermore, we find that an IPA-based TTS system trained on a dataset with only one speaker per language fails at CL VC, because the language-unique IPA and tone/stress symbols can leak speaker information. In addition, we experiment with different combinations of speakers in the training dataset to further investigate how the number of speakers affects CL VC performance.

Section 1: The impact of input processing modules

To investigate whether the way the IPA and tone/stress sequences are processed has an impact on CL VC performance, we consider two different processing modules:

1. SEA: use Separate Embedding sequences for IPA and tone/stress, then Add the two embedding sequences to form the final input embedding;

2. UEI: use Unified Embedding for IPA and tone/stress, then take each embedding as an Independent input in the final input embedding.
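The difference between the two modules can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the toy symbol inventories, the embedding size, the `<none>` placeholder for phones without tone/stress, the position of the tone token in the UEI sequence, and the helper names `make_table`, `sea_embed`, and `uei_embed` are all hypothetical.

```python
import random


def make_table(symbols, dim, seed):
    """Build a toy embedding table: symbol -> list of `dim` random floats."""
    rng = random.Random(seed)
    return {s: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for s in symbols}


def sea_embed(phones, tones, phone_table, tone_table):
    """SEA: look up IPA and tone/stress in separate tables, then add them."""
    if len(phones) != len(tones):
        raise ValueError("SEA needs one tone/stress symbol per IPA symbol")
    return [
        [p + t for p, t in zip(phone_table[ph], tone_table[tn])]
        for ph, tn in zip(phones, tones)
    ]


def uei_embed(tokens, unified_table):
    """UEI: one shared table; each IPA or tone/stress symbol is its own step."""
    return [unified_table[tok] for tok in tokens]


DIM = 4  # assumed toy embedding size
phones = ["n", "i"]
tones = ["<none>", "3"]  # hypothetical: "<none>" marks a phone with no tone

phone_table = make_table(["n", "i"], DIM, seed=0)
tone_table = make_table(["<none>", "3"], DIM, seed=1)
unified_table = make_table(["n", "i", "3"], DIM, seed=2)

sea_out = sea_embed(phones, tones, phone_table, tone_table)
# Assumption: in UEI the tone token follows the phones it belongs to.
uei_out = uei_embed(["n", "i", "3"], unified_table)

print(len(sea_out), len(uei_out))  # prints: 2 3
```

In this sketch SEA keeps the sequence as long as the IPA sequence, while UEI produces a longer sequence because tone/stress symbols occupy their own positions.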

The following samples refer to Section 4.1 in the paper. The models investigated include MSEA (the model with SEA) and MUEI (the model with UEI).

Text: "你想玩点什么."

Ground-Truth

MSEA MUEI

Chinese speaker

English speaker

Text: "PLANNING THE TEXAS TRIP."

Ground-Truth

MSEA MUEI

Chinese speaker

English speaker

Take-home conclusion: these two input processing modules perform comparably on intra-lingual and cross-lingual voice cloning.

**************************************************************************************************************************************************************************

Section 2: Why cross-lingual voice cloning fails

We learn from an informal listening test that many Chinese utterances synthesized using the English speaker's voice sound like the Chinese speaker, and that many English utterances synthesized using the Chinese speaker's voice sound like the English speaker. In other words, using IPA alone does not guarantee a perfect disentanglement between speaker identities and language symbols. We hypothesize that this result can be attributed to two facts: (1) some IPA symbols do not overlap across the two target languages; (2) the suprasegmentals, including tone and stress, are unique to only one of the target languages. To test our hypotheses, we devised two input perturbation methods.

1. IPA perturbation: Randomly replace all the IPA symbols in testing sentences in one language with non-overlapped IPA symbols from the other language. To remove the potential effect of tone/stress, we replace all tone/stress symbols with the special non-tone symbol.

2. Tone/stress perturbation: Replace all tone symbols in Chinese testing sentences with the primary stress symbol in English, or replace all stress symbols in English testing sentences with the mid-tone in Chinese. To remove the potential effect of the non-overlapped IPA symbols, we replace them with their closest IPA symbols.

We use these two input perturbation methods to modify the original testing sentences and create in total six test datasets: CH and EN (original Chinese and English test data), CH_IP and EN_IP (Chinese and English test data with IPA perturbation), and CH_TP and EN_SP (Chinese and English test data with tone/stress perturbation).
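The two perturbation methods can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each sentence is stored as two aligned lists (IPA symbols and per-symbol tone/stress marks), and the symbol inventories, the closest-symbol mapping, the `<none>` name for the special non-tone symbol, and the function names are all hypothetical.

```python
import random

NO_TONE = "<none>"  # hypothetical name for the special non-tone symbol


def ipa_perturb(phones, tones, other_only_phones, rng):
    """IPA perturbation: swap every IPA symbol for a random symbol that only
    the other language uses, and replace every tone/stress mark with NO_TONE."""
    new_phones = [rng.choice(other_only_phones) for _ in phones]
    new_tones = [NO_TONE for _ in tones]
    return new_phones, new_tones


def tone_stress_perturb(phones, tones, own_marks, replacement, closest):
    """Tone/stress perturbation: replace this language's tone/stress marks
    with `replacement` (a mark from the other language), and map
    non-overlapped IPA symbols to their closest counterparts."""
    new_tones = [replacement if t in own_marks else t for t in tones]
    new_phones = [closest.get(p, p) for p in phones]
    return new_phones, new_tones


# Toy inventories; illustrative only, not taken from the paper.
english_only = ["θ", "ð"]
chinese_tones = {"1", "2", "3", "4"}
closest_to_shared = {"ʂ": "ʃ"}

rng = random.Random(0)
ch_ip = ipa_perturb(["n", "i"], [NO_TONE, "3"], english_only, rng)
ch_tp = tone_stress_perturb(
    ["ʂ", "a"], [NO_TONE, "4"], chinese_tones, "ˈ", closest_to_shared
)
print(ch_ip)
print(ch_tp)
```

Applying `ipa_perturb` to Chinese test sentences would correspond to CH_IP, and `tone_stress_perturb` with the English primary stress mark as `replacement` would correspond to CH_TP; the English variants swap the roles of the two languages.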

Note: Since the proposed IPA or tone/stress perturbation may result in unintelligible or accented speech, please focus on the speaker similarity.

Original Chinese text: "你想玩点什么." Original English text: "PLANNING THE TEXAS TRIP."

CH CH_IP CH_TP EN EN_IP EN_SP

Chinese speaker:

English speaker:

Take-home conclusion: (1) the non-overlapped IPA symbols are likely to contain some speaker information; (2) the tone/stress symbols contain speaker information as well.

**************************************************************************************************************************************************************************

Section 3: The number of speakers

In this section, we continue the investigation by proposing the following hypotheses.

Hypothesis 1: The secondary or indirect reason our models fail at CL VC is that we only use two speakers as training data. In other words, as we increase the number of speakers, this failure can be avoided.

Hypothesis 2: Increasing the number of speakers in only one language would make CL VC succeed for speakers in this language, but fail for the speaker in the other language.

To test our hypotheses, we compared several model variants trained with different subsets of Dataset2:

C1E1: Model trained with one Chinese speaker and one English speaker;

C1E4: Model trained with one Chinese speaker and four English speakers;

C4E1: Model trained with four Chinese speakers and one English speaker;

C4E4: Model trained with four Chinese speakers and four English speakers.
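The naming scheme above encodes the number of Chinese and English training speakers, so the training subsets can be derived from the variant name. The sketch below is one possible way to do this; the speaker IDs and the helper names `parse_variant` and `select_speakers` are hypothetical and do not describe how Dataset2 is actually organized.

```python
import re


def parse_variant(name):
    """Turn a variant name such as 'C4E1' into (n_chinese, n_english)."""
    match = re.fullmatch(r"C(\d+)E(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def select_speakers(name, chinese_pool, english_pool):
    """Pick the first n speakers per language for the given variant."""
    n_ch, n_en = parse_variant(name)
    if n_ch > len(chinese_pool) or n_en > len(english_pool):
        raise ValueError("not enough speakers in the pool")
    return chinese_pool[:n_ch] + english_pool[:n_en]


# Hypothetical speaker IDs, for illustration only.
chinese_pool = ["ch_spk1", "ch_spk2", "ch_spk3", "ch_spk4"]
english_pool = ["en_spk1", "en_spk2", "en_spk3", "en_spk4"]

for variant in ["C1E1", "C1E4", "C4E1", "C4E4"]:
    print(variant, select_speakers(variant, chinese_pool, english_pool))
```

Taking the first n speakers keeps the smaller subsets nested inside the larger ones, so the speaker shared by all four variants is the same in each; this is a design choice of the sketch, not a detail stated in the source.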

C1E1 C1E4 C4E1 C4E4

Text: "你想玩点什么."

Chinese speaker

English speaker

Text: "PLANNING THE TEXAS TRIP."

Chinese speaker

English speaker

Text: "Happy Birthday, 我的宝贝。"

Chinese speaker

English speaker

Take-home conclusion: One simple but effective method to improve the CL VC performance of IPA-based CL TTS is to increase the number of speakers in all languages.