Efficient Strategies of Few-Shot On-Device Voice Cloning

Anonymous authors

Abstract: Recent advances in neural text-to-speech have made it possible to build multi-speaker systems capable of high-fidelity speech generation. However, it is often desirable to add a new voice to a text-to-speech system from only a few recordings. In this work, we study several approaches to the design of on-device voice cloning. Starting from a multi-speaker TTS system, we improve its quality for a target speaker by fine-tuning the feature generation module on a small speech sample. We compare a feature generation module based on conventional Tacotron2 with step-wise monotonic attention against ones based on Non-attentive Tacotron and Glow-TTS. We show that Non-attentive Tacotron significantly outperforms the attention-based model, and we demonstrate that a compact on-device TTS system of good quality can be obtained using only 1 minute of adaptation data and no more than 200 iterations of SGD, corresponding to less than 1.5 hours of on-device training time on a consumer mobile phone.

Different voice cloning (VC) techniques

Synthesized texts:

Example 1: "Different telescope designs perform differently and have different strengths and weaknesses."
Example 2: "We have the means to help ourselves."

Adam:

Ground truth
Model Example 1 Example 2
TT2-LS
TT2-LS-GAN
TT2-0
TT2-1200
GlowTTS-0
GlowTTS-600
NAT-0
NAT-200

Amanda:

Ground truth
Model Example 1 Example 2
TT2-LS
TT2-LS-GAN
TT2-0
TT2-1200
GlowTTS-0
GlowTTS-600
NAT-0
NAT-200

Donald Trump:

Ground truth
Model Example 1 Example 2
TT2-LS
TT2-LS-GAN
TT2-0
TT2-1200
GlowTTS-0
GlowTTS-600
NAT-0
NAT-200

John:

Ground truth
Model Example 1 Example 2
TT2-LS
TT2-LS-GAN
TT2-0
TT2-1200
GlowTTS-0
GlowTTS-600
NAT-0
NAT-200

Kristin:

Ground truth
Model Example 1 Example 2
TT2-LS
TT2-LS-GAN
TT2-0
TT2-1200
GlowTTS-0
GlowTTS-600
NAT-0
NAT-200

Larry:

Ground truth
Model Example 1 Example 2
TT2-LS
TT2-LS-GAN
TT2-0
TT2-1200
GlowTTS-0
GlowTTS-600
NAT-0
NAT-200

Nancy:

Ground truth
Model Example 1 Example 2
TT2-LS
TT2-LS-GAN
TT2-0
TT2-1200
GlowTTS-0
GlowTTS-600
NAT-0
NAT-200

Scarlett:

Ground truth
Model Example 1 Example 2
TT2-LS
TT2-LS-GAN
TT2-0
TT2-1200
GlowTTS-0
GlowTTS-600
NAT-0
NAT-200

p280:

Ground truth
Model Example 1 Example 2
TT2-LS
TT2-LS-GAN
TT2-0
TT2-1200
GlowTTS-0
GlowTTS-600
NAT-0
NAT-200

p315:

Ground truth
Model Example 1 Example 2
TT2-LS
TT2-LS-GAN
TT2-0
TT2-1200
GlowTTS-0
GlowTTS-600
NAT-0
NAT-200