Text-to-Audio Generation and Editing with Latent Diffusion Models
Yuancheng W

Text-to-Audio Generation

What is text-to-audio generation?
Generating sounds that are semantically in line with text descriptions.

Some examples:
- A group of sheep are baaing. (animals)
- Water flowing down a river. (environment)
- Piano and violin play. (music)
- A cat meowing and a young female speaking. (human speech, animals)

Text-to-Audio Generation

Some methods:
- DiffSound [1]: a discrete diffusion model; uses a VQ-VAE to discretize the mel-spectrogram into tokens.
  Demo examples: 1. a horse galloping; 2. piano and violin plays; 3. drums and music playing with a man speaking.
- AudioGen [2]: autoregressive, decoder-only; discretizes the waveform directly.

[1] DiffSound: Discrete diffusion model for text-to-sound generation. arXiv preprint arXiv:2207.09983, 2022.
[2] AudioGen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022.

Text-to-Audio with Latent Diffusion Models

Latent-diffusion-based methods: Make-an-Audio [3] and AudioLDM [4].
Demo examples: 1. a horse galloping; 2. piano and violin plays; 3. drums and music playing with a man speaking.

[3] Make-an-Audio: Text-to-audio generation with prompt-enhanced diffusion models. arXiv preprint arXiv:2301.12661, 2023.
[4] AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023.

Text-to-Audio with Latent Diffusion Models

Some challenges:
- Data: the most widely used audio-caption dataset, AudioCaps [5], has only about 50K audio-caption pairs, while AudioSet [6] has 2M clips with class labels only.
- Generating variable-length and higher-quality audio.

[5] AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
[6] Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEE
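The VQ-VAE discretization step that DiffSound builds on can be sketched as a nearest-neighbour lookup against a learned codebook: each mel-spectrogram latent frame is replaced by its closest code, and the code indices become the discrete tokens the diffusion model works over. This is a minimal NumPy illustration under toy assumptions (the `vector_quantize` helper, codebook size, and latent dimension are invented for the example), not DiffSound's actual implementation:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent vector to its nearest codebook entry
    (the discretization a VQ-VAE applies to mel-spectrogram latents)."""
    # z: (num_frames, dim), codebook: (codebook_size, dim)
    # Squared Euclidean distance from every latent to every code.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = dists.argmin(axis=1)  # discrete tokens, shape (num_frames,)
    z_q = codebook[indices]         # quantized latents fed to the decoder
    return indices, z_q

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 8))  # toy codebook: 256 codes of dim 8
z = rng.normal(size=(100, 8))         # toy sequence of mel-latent frames
tokens, z_q = vector_quantize(z, codebook)
print(tokens.shape, z_q.shape)        # (100,) (100, 8)
```

In the real model the codebook is trained jointly with the encoder and decoder; here it is random, which only serves to show the token lookup itself.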
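AudioGen's decoder-only generation reduces to sampling one audio token at a time, each conditioned on the prefix of previously generated tokens. A toy sketch of that loop, with a uniform-logits stand-in (`toy_logits`) in place of the trained transformer and a made-up vocabulary size:

```python
import numpy as np

def sample_tokens(logits_fn, num_steps, vocab_size, seed=0):
    """Toy autoregressive loop: each discrete audio token is sampled
    from a distribution conditioned on all tokens generated so far."""
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(num_steps):
        logits = logits_fn(tokens)            # next-token scores given prefix
        probs = np.exp(logits - logits.max()) # softmax, numerically stable
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens

# Stand-in for a trained, text-conditioned transformer:
# uniform logits over a toy 16-entry token vocabulary.
toy_logits = lambda prefix: np.zeros(16)
generated = sample_tokens(toy_logits, num_steps=8, vocab_size=16)
print(generated)
```

The generated token sequence would then be decoded back to a waveform by the codec's decoder, a step omitted here.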
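Latent-diffusion methods such as AudioLDM and Make-an-Audio train a denoiser on audio latents corrupted by the closed-form forward process q(z_t | z_0), where z_t = sqrt(abar_t) z_0 + sqrt(1 - abar_t) eps. A minimal sketch of that noising step; the linear schedule values and latent shapes are illustrative choices, not the papers' settings:

```python
import numpy as np

def add_noise(z0, t, alphas_cumprod, rng):
    """Closed-form forward diffusion q(z_t | z_0):
    z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    abar = alphas_cumprod[t]
    eps = rng.normal(size=z0.shape)
    zt = np.sqrt(abar) * z0 + np.sqrt(1.0 - abar) * eps
    return zt, eps  # the denoiser is trained to predict eps from (zt, t)

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # toy linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas) # abar_t, monotonically decreasing

rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 16))            # toy audio latent from a VAE encoder
zt, eps = add_noise(z0, t=500, alphas_cumprod=alphas_cumprod, rng=rng)
print(zt.shape)                          # (4, 16)
```

At sampling time the process is run in reverse from pure noise, conditioned on the text embedding, and the final latent is decoded to a mel-spectrogram and then to a waveform.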