Optimize LLM Workflows with Smart Infrastructure Enhanced by Volcano
Xin Li, Qihoo360
Xuzheng Chang, Huawei Cloud Technologies Co., LTD

Catalog
  1. Background
  2. Status
  3. Issues
  4. Solutions

Background
  LLM Keyword Trends
  Since 2023, LLMs have received growing attention.
  More and more LLM infrastructure runs on Kubernetes.
  Kubernetes support for LLM workloads keeps improving.
  [Figure: Google Search Results for LLM keywords; labels: OpenAI Blog Post, 2018, 2021]

Status
  [Figure: x3000, x6000/M, x1000; Training, Big data; Text, Video; CPU, Memory; NVIDIA, Ascend, Others]
  3000+ users from different departments, 6000+ tasks per month
  10+ clusters, 1000+ nodes

  Complexity of task types: training, inference, development.
    Resources: 1-200 instances per task; per instance, CPU: 1c-200c, GPU: 1-8, memory: 20G-2T
    Function: passwordless SSH, pod-to-pod communication
    Operation: all instances of a task are scheduled simultaneously
  Complexity of running time: tasks lasting hours, days, and months coexist.
  Complexity of computing resources: CPU, GPU, NPU, etc.
  Complexity of network environment: Ethernet, IB, RoCE
  [Figure labels: Development, Inference]

Issue
  Failure
  Efficiency
  Usability

Failure
  GPU lost
  ECC error
  GPU failure
  NIC failure
  Data center power outage
  Misoperation
  NAS failure
  Cluster failure
  NVLINK failure
  P2P failure
  Cooling failure
  ...
  The Llama 3 Herd of Models
  https:/

  strategy optimization
  Multiple task types
  Various hardware
  Massive data transfer

  Environment dependency
  Environment preservation
  Multiple IDE integrations
  Tensorboard, Grafana
  Observability optimization

  Multi-department resource allocation
  Exclusive resources / public resources
  Task preemption
  Task queuing
  Gang scheduling strategy
  Binpack scheduling strategy

  Megatron-LM, DeepSpeed, opensora
  Distributed training tasks
  LLM tasks
  Multimodal tasks
  Data processing
  Single-machine single-card, single-machine multi-card, and multi-machine multi-card tasks

  NVIDIA
  Ascend
  Pure CPU tasks
  RoCE/IB
  GPU slicing

Solutions
Vol