Multi-Speaker End-to-End Speech Synthesis
In this work, we extend ClariNet (Ping et al., 2019), a fully end-to-end speech synthesis model (i.e., text-to-wave), to generate high-fidelity speech from multiple speakers. To model the unique characteristics of different voices, low-dimensional trainable speaker embeddings are shared across all components of ClariNet and trained jointly with the rest of the model. We demonstrate that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.
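The conditioning mechanism described above can be sketched as follows. This is an illustrative, pure-Python mock-up, not the paper's implementation: the table size, embedding dimension, and the bias-style conditioning function are assumptions chosen to show how one trainable embedding per speaker can be looked up and injected into multiple model components.

```python
import random

EMB_DIM = 16  # assumed low-dimensional speaker embedding size


class SpeakerEmbeddingTable:
    """One trainable vector per speaker, shared across model components."""

    def __init__(self, num_speakers, dim=EMB_DIM, seed=0):
        rng = random.Random(seed)
        # small random initialization; in training these values are updated
        # by backpropagation together with the rest of the model
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_speakers)
        ]

    def lookup(self, speaker_id):
        return self.table[speaker_id]


def condition(features, speaker_emb, proj):
    """Add a projected speaker embedding to a component's hidden features.

    `proj` is a per-component projection matrix (rows of length EMB_DIM);
    each component of the model would hold its own projection while the
    embedding table itself is shared.
    """
    bias = [sum(w * e for w, e in zip(row, speaker_emb)) for row in proj]
    return [f + b for f, b in zip(features, bias)]


# Usage: fetch speaker 2's embedding and condition an 8-wide feature vector.
table = SpeakerEmbeddingTable(num_speakers=4)
emb = table.lookup(2)
proj = [[0.01] * EMB_DIM for _ in range(8)]  # illustrative projection
features = [0.0] * 8
conditioned = condition(features, emb, proj)
```

Because the same `table` instance is passed to every component, gradients from all parts of the network flow into a single embedding per speaker, which is the "shared and jointly trained" property the abstract describes.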
Methods
Bridge-net, ClariNet, Convolution, Dense Connections, Dilated Causal Convolution, Dropout, DV3 Attention Block, DV3 Convolution Block, GLU, L1 Regularization, Leaky ReLU, Mixture of Logistic Distributions, Normalizing Flows, ReLU, Residual Connection, Scaled Dot-Product Attention, Softmax, Softsign Activation, WaveNet, Weight Normalization