AI Case Study

Baidu researchers synthesise speech through neural voice cloning with limited data samples

Baidu researchers demonstrate voice cloning for speech synthesis using two methods: speaker adaptation and speaker encoding. The former yields better cloning quality, while the latter needs less cloning time and memory and still achieves good results.



Internet Services Consumer

Project Overview

"Neural network based speech synthesis has been shown to generate high quality speech for a large number of speakers. In this paper, we introduce a neural voice cloning system that takes a few audio samples as input. We study two approaches: speaker adaptation and speaker encoding. Speaker adaptation is based on fine-tuning a multi-speaker generative model with a few cloning samples. Speaker encoding is based on training a separate model to directly infer a new speaker embedding from cloning audios, which is used in a multi-speaker generative model. The generative model needs to learn the speaker characteristics from limited information provided by a few audio samples and generalize to unseen texts. The two performance metrics for the generated audio are (i) how natural it is and (ii) whether it sounds like it is pronounced by the same speaker. Besides evaluations by discriminative models, we also conduct subject tests on Amazon Mechanical Turk framework."

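The difference between the two approaches can be illustrated with a toy sketch. This is not the architecture from the paper: the "generative model" here simply adds a scalar speaker embedding to each text feature, `adapt_embedding` fine-tunes only that embedding by gradient descent, and `encode_speaker` stands in for a trained encoder by averaging audio frames (which only works because the toy text features are zero-mean). All names and values are hypothetical.

```python
# Toy illustration only: not the architecture from the paper.

TRUE_EMB = 0.7  # hypothetical "voice" of the unseen speaker


def generate(text_feats, speaker_emb):
    """Toy multi-speaker generative model: shift each text feature by the embedding."""
    return [t + speaker_emb for t in text_feats]


def adapt_embedding(cloning_pairs, steps=200, lr=0.1):
    """Speaker adaptation: fine-tune the embedding by gradient descent
    on a few (text, audio) cloning pairs."""
    emb = 0.0
    for _ in range(steps):
        grad, n = 0.0, 0
        for text_feats, audio in cloning_pairs:
            for pred, target in zip(generate(text_feats, emb), audio):
                grad += 2.0 * (pred - target)  # derivative of squared error w.r.t. emb
                n += 1
        emb -= lr * grad / n
    return emb


def encode_speaker(cloning_audios):
    """Speaker encoding: infer the embedding directly from audio in one pass.
    Averaging frames is an assumed stand-in for a trained encoder; it is
    valid here only because the toy text features are zero-mean."""
    frames = [f for audio in cloning_audios for f in audio]
    return sum(frames) / len(frames)


# A few cloning samples from the unseen speaker (zero-mean toy text features).
texts = [[-1.0, 0.0, 1.0], [-2.0, 2.0]]
cloning_pairs = [(t, generate(t, TRUE_EMB)) for t in texts]

adapted = adapt_embedding(cloning_pairs)
encoded = encode_speaker([audio for _, audio in cloning_pairs])

# Both embeddings can now be used to synthesise unseen text.
unseen_text = [0.5, -0.5]
print(generate(unseen_text, adapted), generate(unseen_text, encoded))
```

In this sketch, adaptation runs many gradient steps for every new speaker, whereas encoding is a single pass over the cloning audio. That gives some intuition for why the two approaches differ in cloning time.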
Reported Results

"In terms of naturalness of the speech and its similarity to original speaker, both approaches can achieve good performance, even with very few cloning audios. While speaker adaptation can achieve better naturalness and similarity, cloning time or required memory for the speaker encoding approach is significantly less, making it favorable for low-resource deployment."


"In neural speech synthesis, an encoder converts text to hidden representations, and a decoder estimates the time-frequency representation of speech in an autoregressive way. "

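The encoder/decoder split described above can be sketched as a toy loop. The functions below are hypothetical and far simpler than a real neural model: `encode` maps each character to one number, and `decode` produces each output frame from the previous frame plus one hidden value, with a round-robin lookup standing in for attention. The point is only to show what "autoregressive" means here: each frame depends on the frame generated before it.

```python
def encode(text):
    """Toy encoder: one hidden value per character (scaled code point)."""
    return [ord(ch) / 128.0 for ch in text]


def decode(hidden, n_frames, mix=0.5):
    """Toy autoregressive decoder: every frame is computed from the
    previously generated frame and a hidden value chosen round-robin
    (an assumed stand-in for an attention mechanism)."""
    frames = []
    prev = 0.0  # initial "previous frame"
    for step in range(n_frames):
        context = hidden[step % len(hidden)]
        frame = mix * prev + (1.0 - mix) * context
        frames.append(frame)
        prev = frame  # feed the output back in for the next step
    return frames


hidden = encode("hi")
frames = decode(hidden, n_frames=4)
print(frames)
```

Because each frame needs the previous one, the frames must be generated one after another rather than all at once.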

R And D

Core Research And Development


"Voice cloning is a highly desired capability for personalized speech interfaces."



"[M]ulti-speaker generative model and speaker encoder model are trained using LibriSpeech dataset, which contains audios for 2484 speakers sampled at 16 KHz, totalling 820 hours. LibriSpeech is a dataset for automatic speech recognition, and its audio quality is lower compared to speech synthesis datasets. Voice cloning is performed using VCTK dataset. VCTK consists of audios for 108 native speakers of English with various accents sampled at 48 KHz."