Development of neural speech synthesis (overview)
Pipeline: text → phonemes → [acoustic model] → spectrogram → [vocoder] → waveform; end-to-end systems map text to waveform directly.

- WaveNet: first neural vocoder
- Tacotron: neural-network-based acoustic model for speech synthesis
- TransformerTTS: applies the Transformer to TTS
- Tacotron 2: combines a neural acoustic model with a neural vocoder
- GlowTTS: aligns text and waveform via monotonic search
- FastSpeech: first non-autoregressive acoustic model
- Parallel WaveGAN: non-autoregressive model based on adversarial generation
- FastSpeech 2: introduces pitch, energy, and other predictors
- Diffwave: generative model based on diffusion
- EfficientTTS: aligns text and audio with a constrained attention mechanism
- LightSpeech: lightweight model obtained by neural architecture search
- FastSpeech 2s: end-to-end non-autoregressive speech synthesis
- WaveGlow: generative model based on normalizing flows
(Legend: neural acoustic models / neural vocoders / end-to-end speech synthesis)

01  Neural Speech Synthesis: The Development of Neural Vocoders

WaveNet
WaveNet is the first neural-network-based vocoder. It stacks hierarchical dilated causal convolutions, which greatly enlarges the receptive field so that the model can capture dependencies over very long sequences and therefore handle high-sample-rate audio.
[Figure: hierarchical dilated causal convolutions; "Previous" samples feed the "Current" prediction.]
van den Oord, Aaron, et al. "WaveNet: A Generative Model for Raw Audio." 9th ISCA Speech Synthesis Workshop.

Parallel WaveGAN
Parallel WaveGAN trains a non-autoregressive vocoder directly with a generative adversarial approach. The model is optimized with an adversarial loss together with multi-resolution STFT losses, through which it learns the spectral characteristics of real speech. Compared with vocoders that must synthesize step by step autoregressively, both training and synthesis speed improve significantly.
[Figure: training setup — a WaveNet generator driven by random noise and auxiliary features; multi-resolution STFT losses (1st … Mth) plus an adversarial loss update the generator, while a discriminator loss on natural vs. generated speech updates the discriminator.]
Yamamoto, Ryuichi, Eunwoo Song, and Jae-Min Kim. "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram." ICASSP 2020, IEEE, 2020.
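Both WaveNet and the Parallel WaveGAN generator build their long receptive field from stacks of dilated causal convolutions. A minimal NumPy sketch (toy weights, kernel size 2; not the papers' actual architecture) of how exponentially growing dilations enlarge the receptive field:

```python
import numpy as np

def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    k = len(weights)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so no future sample is used
    return np.array([sum(w * xp[t + pad - i * dilation] for i, w in enumerate(weights))
                     for t in range(len(x))])

# Stack with exponentially growing dilations (kernel size 2), as in a WaveNet block.
dilations = [1, 2, 4, 8]
receptive_field = 1 + sum((2 - 1) * d for d in dilations)  # 1 + 1 + 2 + 4 + 8 = 16

# An impulse at t=0 spreads to exactly `receptive_field` outputs after the stack.
y = np.zeros(16)
y[0] = 1.0
for d in dilations:
    y = causal_dilated_conv(y, [1.0, 1.0], d)
print(receptive_field, np.count_nonzero(y))  # 16 16
```

Doubling the dilation per layer makes the receptive field grow exponentially with depth while the parameter count grows only linearly, which is what lets the model cover thousands of samples of raw audio.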
WaveGlow
[Figure: WaveGlow architecture — the waveform x is squeezed into vectors, then passes through invertible 1x1 convolutions and affine coupling layers; each coupling layer splits its input into x_a and x_b, and a WN network conditioned on the upsampled mel-spectrogram produces the affine transform applied to x_b, mapping x to the latent z.]
Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. "WaveGlow: A flow-based generative network for speech synthesis." ICASSP 2019, IEEE, 2019.
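WaveGlow's affine coupling layer is what makes the flow invertible: one half x_a passes through unchanged and parameterizes an affine transform of the other half x_b, so the inverse pass can recompute the same scale and shift exactly. A minimal NumPy sketch, with a hypothetical toy function standing in for the WN network and the mel conditioning:

```python
import numpy as np

rng = np.random.default_rng(0)

def wn_like(x_a, cond):
    """Stand-in for WaveGlow's WN network (toy function, not the real layers).

    Inverting the coupling layer never requires inverting this network,
    so in the real model it can be an arbitrarily deep WaveNet-like stack."""
    log_s = np.tanh(0.5 * x_a + cond)   # per-element log-scale
    t = 0.3 * x_a - cond                # per-element shift
    return log_s, t

def coupling_forward(x, cond):
    x_a, x_b = np.split(x, 2)
    log_s, t = wn_like(x_a, cond)
    z_b = x_b * np.exp(log_s) + t       # affine transform applied to x_b only
    return np.concatenate([x_a, z_b])   # x_a passes through unchanged

def coupling_inverse(z, cond):
    z_a, z_b = np.split(z, 2)
    log_s, t = wn_like(z_a, cond)       # z_a == x_a, so the same log_s, t come back
    x_b = (z_b - t) * np.exp(-log_s)
    return np.concatenate([z_a, x_b])

x = rng.normal(size=8)                  # 8 waveform samples squeezed into one vector
cond = rng.normal(size=4)               # stand-in for upsampled mel-spectrogram frames
z = coupling_forward(x, cond)
print(np.allclose(coupling_inverse(z, cond), x))  # True
```

Because every layer is exactly invertible like this, the same network maps waveform to latent during training and latent to waveform during synthesis.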
WaveGlow is a non-autoregressive vocoder based on normalizing flows. The model consists entirely of invertible components, so it can directly learn a bidirectional mapping between the waveform and a simple random distribution. Synthesis is far faster than with autoregressive models, and training is more stable than adversarial learning.

Diffwave
Kong, Zhifeng, et al. "DiffWave: A Versatile Diffusion Model for Audio Synthesis." International Conference on Learning Representations, 2021.
Diffwave is a vocoder based on diffusion modeling. The method treats the mapping from waveform to random noise as a stochastic diffusion with a fixed number of steps, and uses a neural
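The fixed-step forward diffusion has a closed form: at step t the noisy waveform is x_t = sqrt(ᾱ_t)·x_0 + sqrt(1−ᾱ_t)·ε with ᾱ_t the cumulative product of (1−β_s). A toy NumPy sketch (the schedule values and signal are assumptions for illustration, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed-step variance schedule (assumed values for illustration).
T = 200
betas = np.linspace(1e-4, 0.05, T)
alpha_bar = np.cumprod(1.0 - betas)   # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

# A clean "waveform" stand-in: a sine wave over 256 samples.
x0 = np.sin(2 * np.pi * np.arange(256) / 32)

def q_sample(x0, t):
    """Sample x_t in one shot: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x_final = q_sample(x0, T - 1)
# After T steps the signal coefficient sqrt(a_bar_T) is tiny, so x_T is
# essentially unit-variance Gaussian noise; the vocoder is trained to
# reverse this process step by step, turning noise back into a waveform.
```

Because ᾱ_t shrinks monotonically toward zero, the final x_T carries almost no trace of x_0, which is why synthesis can start from pure noise.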