GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, Jie Tang
Zhipu.AI, Tsinghua University
https:/

Abstract

We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese
and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low-bitrate (175 bps), single-codebook speech tokenizer with a 12.5 Hz frame rate, derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from the text to the speech modality, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving performance superior to existing baselines in both conversational ability and speech quality. The open models can be accessed through https:/ and https://huggingface.co/THUDM/glm-4-voice-9b.

1 Introduction

The success of large language models (LLMs) has driven significant advancements in conversational AI, enabling the development of text-based chatbots and digital assistants. However, LLMs are primarily designed to process text input and generate text output, focusing on semantic and logical communication. In contrast, human communication
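As a quick sanity check of the tokenizer figures quoted in the abstract, the 175 bps bitrate is consistent with emitting one discrete token per frame at 12.5 Hz from a 2^14-entry codebook. Note the codebook size is inferred here from the stated bitrate and frame rate, not given in the text:

```python
import math

# Single-codebook tokenizer: one token per frame at 12.5 Hz.
frame_rate_hz = 12.5
# Codebook size inferred from 175 bps / 12.5 Hz = 14 bits per token;
# this is an assumption, not a figure stated in the abstract.
codebook_size = 2 ** 14  # 16384 entries

bits_per_token = math.log2(codebook_size)  # 14.0
bitrate_bps = frame_rate_hz * bits_per_token
print(bitrate_bps)  # 175.0
```

At this rate, one second of speech costs only 12.5 tokens, which is what makes interleaving speech tokens with text tokens during pre-training tractable at the trillion-token scale described above.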