NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

Text to speech (TTS) has made rapid progress in both academia and industry in recent years. This progress raises natural questions: can a TTS system achieve human-level quality, how should that quality be defined and judged, and how can it be achieved? In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure, together with guidelines for judging it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on a benchmark dataset. Specifically, we leverage a variational autoencoder (VAE) for end-to-end text-to-waveform generation, with several key modules that enhance the capacity of the prior from text and reduce the complexity of the posterior from speech: phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves a CMOS (comparative mean opinion score) of -0.01 relative to human recordings at the sentence level, with a Wilcoxon signed-rank test giving p >> 0.05, demonstrating for the first time no statistically significant difference from human recordings on this dataset.
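
To make the evaluation guideline concrete: human-level quality is claimed when a Wilcoxon signed-rank test on paired listener ratings cannot distinguish the system from the recordings. The sketch below shows how such a check can be run; the ratings array is hypothetical placeholder data, not the paper's listening-test results.

```python
# Significance check for a CMOS listening test: a minimal sketch, assuming
# per-utterance comparative ratings on the usual -3..+3 scale (positive =
# synthesized speech preferred over the human recording). The values below
# are hypothetical, not data from the paper.
import numpy as np
from scipy.stats import wilcoxon

ratings = np.array([0, 1, -1, 0, 1, 0, -1, 2, 0, -1, 1, 0, 0, -2, 1])

# CMOS is simply the mean comparative rating.
print("CMOS:", ratings.mean())

# Wilcoxon signed-rank test on the paired ratings; zero (tie) ratings are
# dropped by the default zero_method. A large p-value (p >> 0.05) means the
# null hypothesis, that listeners rate the two conditions equally, cannot
# be rejected, which is the paper's criterion for human-level quality.
stat, p = wilcoxon(ratings, alternative="two-sided")
print(f"Wilcoxon statistic={stat}, p-value={p:.3f}")
```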

Datasets

LJSpeech

Results from the Paper


Ranked #1 on Text-To-Speech Synthesis on LJSpeech (using extra training data)

| Task | Dataset | Model | Metric Name | Metric Value | Global Rank | Uses Extra Training Data |
|------|---------|-------|-------------|--------------|-------------|--------------------------|
| Text-To-Speech Synthesis | LJSpeech | NaturalSpeech | Audio Quality (MOS) | 4.56 | #1 | Yes |
| Text-To-Speech Synthesis | LJSpeech | FastSpeech 2 + HiFiGAN | Audio Quality (MOS) | 4.34 | #4 | |
| Text-To-Speech Synthesis | LJSpeech | VITS | Audio Quality (MOS) | 4.43 | #2 | |

Methods


Variational autoencoder (VAE), phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, memory mechanism in VAE (see the sketch below).
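
As a rough illustration of the architecture named above, here is a minimal, self-contained sketch of a VAE with a speech-conditioned posterior, a text-conditioned prior, and a reconstruction decoder. It is not the paper's implementation: the class name, layer choices, and dimensions (TinyTTSVAE, single linear layers, text_dim=64, speech_dim=80, latent_dim=16) are all illustrative assumptions, and frame-level speech features stand in for raw waveforms.

```python
# Toy text-to-speech VAE: posterior q(z|speech), prior p(z|text), decoder.
import torch
import torch.nn as nn

class TinyTTSVAE(nn.Module):
    def __init__(self, text_dim=64, speech_dim=80, latent_dim=16):
        super().__init__()
        self.post = nn.Linear(speech_dim, 2 * latent_dim)   # q(z | speech)
        self.prior = nn.Linear(text_dim, 2 * latent_dim)    # p(z | text)
        self.dec = nn.Linear(latent_dim, speech_dim)        # reconstruction

    def forward(self, text, speech):
        mu_q, logvar_q = self.post(speech).chunk(2, dim=-1)
        mu_p, logvar_p = self.prior(text).chunk(2, dim=-1)
        # Reparameterization trick: sample z from the posterior during training.
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        recon = self.dec(z)
        # Closed-form KL(q || p) between two diagonal Gaussians.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1).mean()
        return nn.functional.mse_loss(recon, speech) + kl

# Example: loss = TinyTTSVAE()(torch.randn(4, 64), torch.randn(4, 80))
```

At inference time, z is sampled from the text-conditioned prior alone, which is why the abstract emphasizes enhancing the prior and reducing the complexity of the posterior so that the two distributions match.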