Efficient Strategies of Few-Shot On-Device Voice Cloning

Anonymous authors

Abstract: Recent advances in neural text-to-speech (TTS) have made it possible to build multi-speaker systems capable of high-fidelity speech generation. However, it is often desirable to add a new voice to a TTS system from only a few recordings. In this work, we study several approaches to the design of on-device voice cloning. Starting from a multi-speaker TTS system, we improve its quality for a target speaker by fine-tuning the feature generation module on a small speech sample. We compare a feature generation module based on conventional Tacotron2 with step-wise monotonic attention against modules based on Non-attentive Tacotron and Glow-TTS. We show that Non-attentive Tacotron significantly outperforms the attention-based model, and demonstrate that a compact on-device TTS system of good quality can be obtained using only 1 minute of adaptation data and no more than 200 iterations of SGD, corresponding to less than 1.5 hours of on-device training time on a consumer mobile phone.

Different voice cloning (VC) techniques

Synthesized texts:

Example 1:	"Different telescope designs perform differently and have different strengths and weaknesses."
Example 2:	"We have the means to help ourselves."