The exploration of complex-scenario automatic speech recognition based on large-scale data
Complex scenario ASR in ZOOM

Haoyu (Charlie) Tang
April 24, 2022
Zoom AI/ML Engineering

Content
1. Introduction to automatic speech recognition
2. End-to-End automatic speech recognition
3. Model innovation
4. Training pipeline innovation
5. Large scale data model training acceleration in ZOOM
6. Summary and next step

Introduction to automatic speech recognition

What is automatic speech recognition
Automatic Speech Recognition (ASR): generate text from a given audio waveform X by finding the transcript Y that maximizes the posterior, $\arg\max_Y P(Y \mid X)$.
Figure 1: ASR [1]
Conventional method: acoustic model, language model, and pronunciation dictionary/model.
End-to-end method: main model, plus an optional language model.
[1] https:/

History of ASR
Figure 2: ASR history [2]
[2] https://sonix.ai/history-of-speech-recognition

Brief History of ASR
Figure 3: Recent decade ASR history [3]
[3] https://sonix.ai/history-of-speech-recognition

ASR: current problems
Current problems in ASR for live, meeting, and online chat scenarios:
1. Speech is spontaneous, but most open ASR data is read speech
2. Open-set recognition + large vocabulary
3. Noise, especially background music
4. Accent independence
5. Code-switching
6. Free switching between far-field and near-field as the speaker moves

End-to-End automatic speech recognition

A standard end-to-end (E2E) ASR architecture
Figure 4: A standard end-to-end (E2E) ASR architecture [4]
This figure shows two standard E2E modeling methods: CTC and encoder-attention-decoder. These can be combined into a CTC-ATT architecture.
[4] Watanabe et al., “Hybrid CTC/attention architecture for end-to-end speech recognition”.

ATT-CTC Training and Inference [5]
Loss combine:
$L_{MTL} = \lambda L_{CTC} + (1 - \lambda) L_{Attention}$    (1)
Figure 5: Loss combine
Joint decoding/rescoring:
$\hat{C} = \arg\max_{h} \{ \lambda \log p_{ctc}(h \mid X) + (1 - \lambda) \log p_{att}(h \mid X) \}$    (2)
[5] Watanabe et al., “Hybrid CTC/attention architecture for end-to-end speech recogn
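
The loss combination (1) and joint rescoring (2) above can be illustrated with a minimal Python sketch. The function names, the weight value 0.3, and the hypothesis scores are assumptions made up for illustration; they are not taken from the slides or from any particular toolkit.

```python
import math


def combine_losses(loss_ctc, loss_att, lam=0.3):
    """Multi-task loss of Eq. (1): lam * L_CTC + (1 - lam) * L_Attention.

    lam = 0.3 is an illustrative default, not a value from the slides.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * loss_ctc + (1.0 - lam) * loss_att


def joint_rescore(hypotheses, lam=0.3):
    """Pick the hypothesis maximizing the joint score of Eq. (2).

    hypotheses: list of (text, log_p_ctc, log_p_att) tuples, where the two
    log-probabilities are assumed to come from the CTC and attention branches.
    """
    def joint_score(hyp):
        _, log_p_ctc, log_p_att = hyp
        return lam * log_p_ctc + (1.0 - lam) * log_p_att

    return max(hypotheses, key=joint_score)[0]


# Toy n-best list with made-up probabilities (hypothetical values).
nbest = [
    ("hello world", math.log(0.6), math.log(0.2)),
    ("hollow world", math.log(0.3), math.log(0.5)),
]

attention_heavy = joint_rescore(nbest, lam=0.3)  # attention score dominates
ctc_heavy = joint_rescore(nbest, lam=0.9)        # CTC score dominates
```

In this toy list the two branches disagree, so the selected hypothesis changes with the weight: a small weight follows the attention decoder and a large weight follows CTC.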