Diff Transferer

Any-speaker adaptive Text-to-Speech with diffusion

In this course project, we designed a zero-shot, any-speaker adaptive TTS (Text-to-Speech) model that synthesizes voices from unseen speech prompts. Given a text and a few seconds of audio from an unseen speaker, the model can synthesize speech in that speaker's style.

Specifically, we 1) design a style encoder and a phoneme encoder to decouple content and style from a speech utterance; 2) train a conditional diffusion-based style transferer with manifold constraints, which transfers the original speech generated by a standard TTS backbone to the target style; and 3) leverage a shallow diffusion mechanism to speed up training. With these designs, our model adapts quickly to various unseen speaker domains.
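As a rough illustration of design 1), the sketch below shows one way a style encoder and a phoneme encoder could separate style and content; the module structure, dimensions, and names are our illustrative assumptions, not the project's exact implementation.

```python
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Maps a reference mel-spectrogram to a single utterance-level style embedding."""
    def __init__(self, n_mels=80, d_style=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_style, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_style, d_style, kernel_size=3, padding=1),
        )

    def forward(self, mel):                        # mel: (batch, n_mels, frames)
        h = self.conv(mel)                         # (batch, d_style, frames)
        return h.mean(dim=-1)                      # temporal pooling -> (batch, d_style)

class PhonemeEncoder(nn.Module):
    """Maps a phoneme-id sequence to frame-level content embeddings."""
    def __init__(self, n_phonemes=100, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, phonemes):                   # phonemes: (batch, seq_len) int ids
        return self.encoder(self.embed(phonemes))  # (batch, seq_len, d_model)
```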

We also explore extending the zero-shot any-speaker adaptive TTS model to an any-singer adaptive SVS (Singing Voice Synthesis) model. Singing voice is more complicated: beyond the speaking style, it also involves rhythm, chord, and singing-technique styles.

We leverage a text-to-speech backbone to generate speech from the given text, then use a conditional diffusion model to transfer the original speech generated by the TTS backbone to the target style.
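The sketch below illustrates this two-stage pipeline under common DDPM assumptions; `tts_backbone`, `denoiser.p_sample`, and the noise schedule `alphas_cumprod` are placeholder names, not the project's actual API.

```python
import torch

def q_sample(x0, t, alphas_cumprod):
    """Standard DDPM forward process: diffuse x0 to step t."""
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * torch.randn_like(x0)

@torch.no_grad()
def transfer_style(text, ref_mel, tts_backbone, style_encoder, denoiser,
                   alphas_cumprod, t_shallow):
    """Stage 1: backbone TTS; stage 2: shallow conditional diffusion transfer."""
    src_mel = tts_backbone(text)                        # mel from the TTS backbone
    style = style_encoder(ref_mel)                      # target style embedding
    x_t = q_sample(src_mel, t_shallow, alphas_cumprod)  # noise only up to the shallow step
    for t in reversed(range(t_shallow + 1)):            # conditional reverse process
        x_t = denoiser.p_sample(x_t, t, cond=style)     # hypothetical reverse-step API
    return x_t                                          # style-transferred mel
```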

Figure 1: Diff-Transferer pipeline.
We use a style discriminator to extract style embeddings and compute their cosine distance to judge whether, at step t of the forward process, the style still matches the source input. Once a style-agnostic state is reached, we stop the forward process, which shallows the diffusion. Since the voices generated by FastSpeech 2 tend to share the same style, the shallow step t can be fixed as a hyperparameter. A phoneme discriminator is also used when the input text differs from the target one.
Figure 2: Forward process until the style-agnostic state, with the shallow diffusion mechanism.
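A hedged sketch of how such a shallow step could be chosen with the style discriminator, assuming a standard DDPM forward process; the threshold and function names are illustrative, not the project's exact procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def find_shallow_step(src_mel, style_discriminator, alphas_cumprod,
                      threshold=0.5, max_steps=1000):
    """Diffuse the source mel forward and return the first style-agnostic step."""
    src_style = style_discriminator(src_mel)            # style embedding of the source
    for t in range(max_steps):
        a_bar = alphas_cumprod[t]                       # forward-diffuse to step t
        x_t = a_bar.sqrt() * src_mel + (1 - a_bar).sqrt() * torch.randn_like(src_mel)
        cos = F.cosine_similarity(style_discriminator(x_t), src_style, dim=-1)
        if cos.mean() < threshold:                      # style no longer matches the source
            return t                                    # fix this step as the hyperparameter
    return max_steps
```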
Figure 3: Backward process with manifold constraint.
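The backward pass in Figure 3 is not detailed in the text above, so the sketch below only follows a generic manifold-constrained guidance recipe, assuming the constraint keeps the denoised estimate's content consistent with the source; every name and the step size `eta` are hypothetical.

```python
import torch

def constrained_reverse_step(x_t, t, style, denoiser, content_encoder,
                             src_content, alphas_cumprod, eta=1.0):
    """One reverse step followed by a gradient correction toward content consistency."""
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t, cond=style)                            # predicted noise
    a_bar = alphas_cumprod[t]
    x0_hat = (x_t - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()      # estimate of the clean mel
    loss = (content_encoder(x0_hat) - src_content).pow(2).mean()  # content-consistency term
    grad, = torch.autograd.grad(loss, x_t)                        # gradient w.r.t. x_t
    with torch.no_grad():
        x_prev = denoiser.p_sample(x_t.detach(), t, eps=eps.detach())  # ordinary reverse step
        return x_prev - eta * grad                                # constrained correction
```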