Click here to read the paper!

Abstract:

Recently, few-shot voice cloning has achieved significant improvements. However, most few-shot voice cloning models are single-modal, and multi-modal few-shot voice cloning remains understudied. In this paper, we propose to use multi-modal learning to improve few-shot voice cloning performance. Inspired by recent work on unsupervised speech representation, the proposed multi-modal system is built by extending Tacotron2 with an unsupervised speech representation module. We evaluate the proposed system in two few-shot voice cloning scenarios, namely few-shot text-to-speech (TTS) and voice conversion (VC). Experimental results demonstrate that the proposed multi-modal learning significantly improves few-shot voice cloning performance over the single-modal counterparts.
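For readers who want a concrete picture of the idea, the sketch below shows one way a Tacotron2-style seq-to-seq model can be extended with a speech-representation branch so that text tokens and unsupervised speech units share the same attention and decoder. This is a minimal illustration, not the authors' implementation; all module names and sizes (`MultiModalEncoder`, `MultiModalTacotron`, dim=256, 512 units) are assumptions.

```python
# Minimal sketch (not the authors' code): a Tacotron2-style seq-to-seq model whose
# encoder accepts either text tokens or discrete speech-representation units,
# so that TTS and VC inputs share the same attention/decoder.
import torch
import torch.nn as nn


class MultiModalEncoder(nn.Module):
    """Embeds either text tokens or unsupervised speech units into a shared space."""

    def __init__(self, n_text_symbols=80, n_speech_units=512, dim=256):
        super().__init__()
        self.text_embedding = nn.Embedding(n_text_symbols, dim)
        self.unit_embedding = nn.Embedding(n_speech_units, dim)
        self.encoder = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, tokens, modality="text"):
        # Choose the embedding table according to the input modality.
        emb = self.text_embedding(tokens) if modality == "text" else self.unit_embedding(tokens)
        out, _ = self.encoder(emb)
        return out  # (batch, time, dim)


class MultiModalTacotron(nn.Module):
    """Shared attention/decoder over the multi-modal encoder (mel prediction only)."""

    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.encoder = MultiModalEncoder(dim=dim)
        self.attention = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.mel_proj = nn.Linear(dim, n_mels)

    def forward(self, tokens, n_frames, modality="text"):
        memory = self.encoder(tokens, modality)
        # Autoregressive decoding is simplified to a fixed-length query here.
        query = torch.zeros(tokens.size(0), n_frames, memory.size(-1))
        context, _ = self.attention(query, memory, memory)
        dec_out, _ = self.decoder(context)
        return self.mel_proj(dec_out)  # predicted mel-spectrogram


# Usage: text-conditioned (TTS) and unit-conditioned (VC) inputs share one decoder.
model = MultiModalTacotron()
mels_from_text = model(torch.randint(0, 80, (1, 20)), n_frames=100, modality="text")
mels_from_units = model(torch.randint(0, 512, (1, 50)), n_frames=100, modality="speech")
```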

Few-shot TTS scenario

These samples refer to Section 4.1 in the paper, in which we investigate the effectiveness of our proposed system in few-shot TTS. The models investigated include:

Tacotron: Tacotron2 pre-trained using paired data from 106 speakers (excluding p225 and p226), then fine-tuned using 2 minutes of paired data from the target speakers (p225 and p226);

MD-Tacotron: The proposed multi-modal system pre-trained using unpaired speech and paired data from 106 speakers (excluding p225 and p226), then fine-tuned using 2 minutes of unpaired speech and paired data from the target speakers (p225 and p226); a sketch of this pre-train/fine-tune recipe is given below.
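The recipe shared by both models above is: pre-train on many speakers, then adapt on roughly two minutes of target-speaker data. The snippet below is a hedged sketch of such a fine-tuning loop; the stand-in model, placeholder tensors, learning rate, and epoch count are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the pre-train / fine-tune recipe described above. The stand-in
# model, placeholder data, and hyper-parameters are assumptions for illustration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a pre-trained acoustic model; in practice this would be the
# multi-speaker model trained on the 106 non-target speakers.
pretrained = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 80))

# ~2 minutes of target-speaker data: a small set of (encoder feature, mel frame) pairs.
features = torch.randn(128, 256)     # placeholder encoder outputs
mels = torch.randn(128, 80)          # placeholder target mel frames
loader = DataLoader(TensorDataset(features, mels), batch_size=16, shuffle=True)

optimizer = torch.optim.Adam(pretrained.parameters(), lr=1e-4)  # small LR for fine-tuning
for epoch in range(5):               # few epochs: the adaptation data is tiny
    for x, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.l1_loss(pretrained(x), y)
        loss.backward()
        optimizer.step()
```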

Samples (left to right): Ground-Truth | Tacotron | MD-Tacotron

Speaker p225:

Text: "YOU ARE A GREAT PLAYER."

Text: "SCOTLAND WILL PLAY ITS PART."

Speaker p226:

Text: "THAT DECISION IS FOR THE BRITISH PARLIAMENT AND PEOPLE."

Text: "THEY ARE NEVER HERE."

Few-shot VC scenario

These samples refer to Section 4.2 in the paper. We chose p227 and p228 as the source speakers, and p225 and p226 as the target speakers. The model variants in this scenario include:

VQ-VAE: The single-modal VQ-VAE model pre-trained with speech data from 104 speakers (excluding the source and target speakers), then fine-tuned using 2 minutes of speech data from the target speakers;

MD-Tacotron: The proposed multi-modal system pre-trained with speech data from 104 speakers (excluding the source and target speakers); then only the seq-to-seq module is fine-tuned using 2 minutes of paired data from the target speakers (see the conversion sketch below);
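The conversion path implied by these descriptions is: a frozen speech-representation module maps the source utterance to speaker-independent units, and the adapted seq-to-seq module renders those units in the target speaker's voice. The sketch below illustrates that flow with toy modules; `UnitEncoder`, `UnitToMel`, and all dimensions are hypothetical stand-ins, not the authors' components.

```python
# Hedged sketch of the voice-conversion inference path: a frozen representation
# module turns the source utterance into discrete units, and a fine-tuned
# decoder renders those units as mels in the target speaker's voice.
import torch
import torch.nn as nn


class UnitEncoder(nn.Module):
    """Maps source mel frames to discrete unit indices (VQ-style nearest codebook entry)."""

    def __init__(self, n_mels=80, dim=256, n_units=512):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        self.codebook = nn.Parameter(torch.randn(n_units, dim))

    def forward(self, mels):
        z = self.proj(mels)                    # (time, dim)
        dists = torch.cdist(z, self.codebook)  # distance to each codebook entry
        return dists.argmin(dim=-1)            # (time,) unit indices


class UnitToMel(nn.Module):
    """Decodes unit indices to target-speaker mels (speaker identity baked in by fine-tuning)."""

    def __init__(self, n_units=512, dim=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_units, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_mels)

    def forward(self, units):
        h, _ = self.rnn(self.embed(units).unsqueeze(0))
        return self.out(h).squeeze(0)          # (time, n_mels)


# Conversion: source content in, target-voice mels out; a vocoder would follow.
source_mels = torch.randn(200, 80)             # placeholder source utterance
units = UnitEncoder()(source_mels)             # speaker-independent content
converted = UnitToMel()(units)                 # target speaker's voice
```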

Samples (left to right): Source | VQ-VAE | MD-Tacotron | Target

p227-p225 (M-F)

p227-p226 (M-M)

p228-p225 (F-F)

p228-p226 (F-M)