Microsoft  NVIDIA
FastSpeech: Algorithm and Optimization for State-of-the-art Text to Speech
Xu Tan & Dabi Ahn
Microsoft Research Asia & NVIDIA
#page#
Outline
  The algorithm of FastSpeech
    By Xu Tan, Microsoft Research Asia
  The optimization of FastSpeech
    By Dabi Ahn, NVIDIA
#page#
About text to speech system
  [Figure: TTS pipeline. Text ("FastSpeech") -> Frontend -> Phoneme -> Acoustic Model -> Mel-spectrogram -> Vocoder -> Speech]
#page#
About FastSpeech
  A fast, robust, controllable, high-quality and end-to-end text to speech (TTS) system
    FastSpeech: Fast, Robust and Controllable Text to Speech, NeurIPS 2019 [1]
    FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, ICLR 2021 submission [2]
  Widely supported by the community and deployed in Microsoft Azure TTS service to support all the languages
  [1] https://proceedings.neurips.cc/paper/2019/file/f63f65b503e22cb970527f23c9ad7db1-Paper.pdf
  [2] https:/
these issues?
  Slow inference speed
    Autoregressive generation
    Inference time depends on sequence length (for 5 s of speech, the mel length is about 500)
  Not robust
    Encoder-decoder attention is not accurate: repeating and skipping attention
  Lack of controllability
    No control information as input
    Autoregressive generation cannot explicitly control the duration
#page#
Our solution: FastSpeech
  Key designs
    Generate mel-spectrogram in parallel (for speedup)
    Remove the attention mechanism between text and speech (for robustness)
    Variance adaptor introduces duration, pitch, energy (for controllability)
  FastSpeech has the following advantages
    Fast: 270x inference speedup on mel-spectrogram generation, 38x speedup on voice generation!
    Robust: no bad case of word skipping and repeating
    Controllable: can control voice speed and prosody
    Voice quality: on par or better than SOTA model
#page#
Our solution: FastSpeech
  Feed-forward transformer: generate mel-spec