Abstract:
Recently, few-shot voice cloning has improved significantly. However, most few-shot voice cloning models are single-modal, and multi-modal few-shot voice cloning remains understudied. In this paper, we propose using multi-modal learning to improve few-shot voice cloning performance. Inspired by recent work on unsupervised speech representation, the proposed multi-modal system is built by extending Tacotron2 with an unsupervised speech representation module. We evaluate the proposed system in two few-shot voice cloning scenarios, namely few-shot text-to-speech (TTS) and voice conversion (VC). Experimental results demonstrate that the proposed multi-modal learning significantly improves few-shot voice cloning performance over the single-modal counterparts.
Few-shot TTS scenario
These samples correspond to Section 4.1 of the paper, in which we investigate the effectiveness of the proposed system in few-shot TTS. The models investigated are:
Tacotron: Tacotron2 pre-trained using
MD-Tacotron: The proposed system pre-trained using
Ground-Truth    Tacotron    MD-Tacotron
Speaker p225:
Text: "YOU ARE A GREAT PLAYER."
Text: "SCOTLAND WILL PLAY ITS PART."
Speaker p226:
Text: "THAT DECISION IS FOR THE BRITISH PARLIAMENT AND PEOPLE."
Text: "THEY ARE NEVER HERE."
Few-shot voice conversion
These samples correspond to Section 4.2 of the paper. We chose p227 and p228 as the source speakers, and p225 and p226 as the target speakers. The model variants in this scenario are:
VQ-VAE: The single-modal VQ-VAE model, pre-trained with speech data from 104 speakers (excluding the source and target speakers) and then fine-tuned using 2 minutes of speech data from the target speakers;
MD-Tacotron: The proposed multi-modal system pre-trained with speech data from 104 speakers (excluding the source and target speakers); then only the seq-to-seq module fine-tuned using 2-minute
Source    VQ-VAE    MD-Tacotron    Target
p227-p225 (M-F)
p227-p226 (M-M)
p228-p225 (F-F)
p228-p226 (F-M)
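
The four conversion pairs listed above are every combination of the two source speakers with the two target speakers. As a minimal sketch, they can be enumerated as follows; the speaker IDs and gender labels are taken from this page, while the dictionary and function names are hypothetical and not part of the paper's code.

```python
from itertools import product

# Speaker IDs and gender labels as listed on this page.
SOURCE_SPEAKERS = {"p227": "M", "p228": "F"}
TARGET_SPEAKERS = {"p225": "F", "p226": "M"}


def conversion_pairs(sources, targets):
    """Return one label per source-target pair, e.g. 'p227-p225 (M-F)'."""
    return [
        f"{src}-{tgt} ({sources[src]}-{targets[tgt]})"
        for src, tgt in product(sources, targets)
    ]


# Hypothetical helper output: one label per conversion direction.
PAIRS = conversion_pairs(SOURCE_SPEAKERS, TARGET_SPEAKERS)
for label in PAIRS:
    print(label)
```

The pairs cover both same-gender (M-M, F-F) and cross-gender (M-F, F-M) conversion.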