
Revisit RoCEv2 issues in large scale deployment and the future that UEC promise.pdf

Uploaded by: 明**** | ID: 1011553 | 2025-12-21 | 16 pages | 1.42 MB

Revisit RoCEv2 issues in large scale deployment and the future that UEC promise
AMD and Edgecore
NETWORKING

PoWen Tsai, Director Technical Sales, Edgecore Networks
Azeem Suleman, Sr. Director Technical Product Management, AMD

Agenda
- 01 Problem Statement
- 02 Product
- 03 Solution
- 04 Performance
- 05 Q&A

AI Scale-out Networking Challenges
Areas: Network Utilization, Reliability, Scalability, Operations, TCO
- Inefficient GPU-to-GPU communication
- Link, NIC and Switch failure
- PFC & Queue Pair stalls
- Elephant flows sharing
- Poor telemetry and lack of network state at CCL
- Require deep buffer switches, lack of multi-plane/rail networks

RoCEv2 Requires Improvements for modern GenAI & HPC deployments
- PFC: PFC requires at least BW*RTT+MTU buffering for fully lossless transmission; blocked victim flows; PFC storms
- Congestion Control: different DCQCN implementations
- Different traffic types co-exist: RoCEv2 core design natively does not support different transport protocols for different services.
- Security: flexibility for end-to-end confidentiality and service protection; large session state (keys)
- Link Level Reliability or Network Reliability: delays become more significant as scale increases; requires error handling at link layer

Edgecore AIS800 Tomahawk 5 AI Switch
- 51.2Tbps while 1W per 100Gbps
- Best-in-Class SerDes that enable LPO
- (OSFP, QSFP) (AFO, AFI) complete portfolio
- Adaptive Routing & Cognitive Routing for all traffic types
- Improved Network Utilization
- Lowest Tail Latency
- Programmable out-of-band telemetry (6 ARM cores) and programmable inband telemetry
- Minimized Packet Drops and Latency Jitter

AMD Pensando Pollara 400 AI NIC
- Fully Programmable
- Customizable Transports
- Offload and Acceleration
- PCIe Gen5, 400G
- Scale-Out Choice
- No Fabric Dependency

AMD Pensando Pollara 400 AI NIC
P4-based architecture-72
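The slide text states that PFC requires at least BW*RTT+MTU of buffering for fully lossless transmission. The arithmetic behind that bound can be sketched as follows. This is a minimal illustration, not a vendor sizing method: the function name `pfc_min_buffer_bytes` and the sample values (a 400 Gbps link, a 2 µs round trip, a 4096-byte MTU) are assumptions chosen for the example and do not come from the presentation.

```python
def pfc_min_buffer_bytes(link_gbps: int, rtt_us: int, mtu_bytes: int) -> int:
    """Lower bound on lossless buffering: BW * RTT + MTU, in bytes.

    Assumed inputs (hypothetical, for illustration only):
      link_gbps  link bandwidth in Gbit/s
      rtt_us     round-trip time in microseconds
      mtu_bytes  maximum transmission unit in bytes
    """
    # 1 Gbit/s sustained for 1 microsecond carries exactly 1000 bits,
    # so integer arithmetic avoids floating-point rounding.
    bits_in_flight = link_gbps * rtt_us * 1000
    # Round up to whole bytes (ceiling division).
    bytes_in_flight = -(-bits_in_flight // 8)
    # One extra MTU covers a frame already being sent when the pause arrives.
    return bytes_in_flight + mtu_bytes


if __name__ == "__main__":
    # 400 Gbit/s * 2 us = 800,000 bits = 100,000 bytes, plus a 4096-byte MTU.
    print(pfc_min_buffer_bytes(400, 2, 4096))  # prints 104096
```

The bound grows linearly with both bandwidth and round-trip time, which is consistent with the slide's point that faster links and larger fabrics push toward deep-buffer switches.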

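Based on "Revisit RoCEv2 issues in large scale deployment and the future that UEC promise", the key points of the document are summarized below:

1. **RoCEv2 problems in large-scale deployments**:
   - Inefficient GPU-to-GPU communication.
   - Link, NIC, and switch failures.
   - PFC and queue pair stalls.
   - Elephant flows sharing the network.
   - Poor telemetry and a lack of network state monitoring.
2. **AI scale-out networking challenges**:
   - RoCEv2 needs improvements to suit modern AI and HPC deployments.
   - Problems with PFC and congestion control.
   - Security and network reliability requirements.
3. **Solution**:
   - Edgecore AIS800 Tomahawk 5 AI switch and AMD Pensando Pollara 400 AI NIC.
   - High performance and programmability, with support for PCIe Gen5 and 400G.
   - A congestion control algorithm based on UEC NSCC.
4. **Performance improvement**:
   - A 1.25x performance gain, attributed to UEC software differentiation.
5. **Key figures**:
   - 51.2 Tbps throughput at 1 W per 100 Gbps.
6. **Outlook**:
   - A UEC validation reference guide that provides further information and resources.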