Alibaba Cloud: 2025 UPN512 Technical Architecture Whitepaper v1.0 (English edition, 32 pages)
UPN512 Technical Architecture Whitepaper v1.0
Alibaba Cloud Network Infra Team

Table of contents

1. Terminology
2. Trends in AI Infrastructure Networking
3. Evolution and Challenges of xPU Scale-up Network…
4. Alibaba Cloud UPN512 Architecture Overview
5. UPN512 System Design and Key Components
  5.1 System architecture
    5.1.1 AI Rack: tightly coupled copper interconnect
    5.1.2 UPN512: single-tier optical, decoupled system
      5.1.2.1 All-optical interconnect
      5.1.2.2 Single-tier, 1K-scale domain
      5.1.2.3 Decoupled design
  5.2 Optical Interconnect Overview
    5.2.1 Pluggable optics
    5.2.2 High-Density Bandwidth Optical Interconnect Solutions
    5.2.3 LPO vs. NPO: Use Case and Solution Selection
      Conclusion: LPO and NPO as Complementary Options for UPN
    5.2.4 LPO/NPO Cost Analysis
    5.2.5 Interconnect stability
  5.3 Communication semantics
  5.4 In-network computation

1. Terminology

Abbreviation   Definition
UPN            Ultra Performance Network
HPN            High Performance Network
MoE            Mixture of Experts
EP             Expert Parallelism
FRO            Fully Retimed Optics
LPO            Linear-drive Pluggable Optics
NPO            Near-packaged Optics
CPO            Co-packaged Optics
OE             Optical Engine
VCSEL          Vertical-Cavity Surface-Emitting Laser
EML            Electro-Absorption Modulated Laser
ELSFP          External Laser Small Form-Factor Pluggable
MTBF           Mean Time Between Failures
MTTR           Mean Time To Repair

2. Trends in AI Infrastructure Networking

In recent years, as artificial intelligence (AI) has surged, the compute and memory demands of large-scale model training and inference have grown exponentially. To boost computation throughput, shorten training time, and improve inference efficiency, AI clusters scale via high-performance networks, marching from tens of thousands to hundreds of thousands of accelerators (xPUs). To achieve efficient training and inference, the industry typically employs multiple parallelization strategies that drive thousands to ten