
A Peek Inside OCI Zettascale GPU Clusters [LRN1500].pdf

Uploaded by: Fl****zo | ID: 970947 | 2025-11-08 | 33 pages | 1.65 MB

A peek inside OCI Zettascale Clusters
Jag Brar, Distinguished Engineer
David Becker, Architect

Safe harbor statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle's products may change and remains at the sole discretion of Oracle Corporation.

Copyright 2025, Oracle and/or its affiliates | Confidential: Internal/Restricted/Highly Restricted

HPC Compute

AI Megafactory
100s of Megawatts of AI compute

OCI Cloud
Fully-featured Tier-1 Cloud
Categorized in the leaders category of Gartner Magic Quadrant
Over 50 regions in 26 countries
OCI backbone interconnecting regions around the globe
Support for Commercial, Sovereign, Government and Multi-Cloud workloads

OCI Cloud beginnings
First Region Launch in 2016
Launched RDMA Cluster Network in 2018
Targeted low latency use cases
Oracle Exadata
HPC instances
HPC workloads
Automotive fluid dynamic simulations (Fiat/Chrysler and RedBull Racing)
High Frequency Trading

RDMA
RDMA (Remote Direct Memory Access) lets one computer read from or write to the memory of another computer without involving the operating system or CPU, resulting in low latency and high throughput.

2017: HPC on OCI
2018: Exadata on OCI
2023: AI LLMs on OCI

Low Latency in the Cloud

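Based on the report, the main points are summarized below:

1. **OCI cloud platform history**: From the first region launch in 2016, to the RDMA cluster network in 2018, to AI large language models (LLMs) running on OCI in 2023, OCI has steadily expanded its features and performance.
2. **Low-latency networking**: By building certificate-based authentication and internal network virtualization, OCI supports multi-tenancy without sacrificing throughput or latency.
3. **Network performance**: The OCI network performs well for GPU workloads. It is based on RoCE (RDMA over Converged Ethernet) and delivers low latency and high throughput.
4. **AI workload requirements**: Cluster network scale is projected to reach the megawatt level by 2025, and AI workloads place very high demands on performance, throughput, availability, and scale.
5. **Network topology**: OCI uses multi-plane and Clos topologies to build large-scale, high-bandwidth, low-latency networks.
6. **Challenges and mitigations**: OCI faces challenges such as optical link failures, flow collisions, and congestion, and mitigates their impact with link debouncing, application retry logic, and host-side plugins.
7. **Outlook**: OCI will continue to optimize its RDMA network to support different types of workloads, and will explore innovations in interoperability with RDMA NIC vendors, packet spraying, and transport reliability.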
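The summary names link debouncing as one mitigation for optical link failures but does not say how OCI implements it. As a rough illustration only, the sketch below assumes the common hold-down approach: a raw "down" signal is reported upward only after the link has stayed down for a full window, so brief flaps are hidden from routing and from the application. The class name `LinkDebouncer`, the `hold_down_s` parameter, and the sample timings are all hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkDebouncer:
    """Report a link as down only after it stays down for hold_down_s seconds."""

    hold_down_s: float                  # hypothetical hold-down window, in seconds
    reported_up: bool = True            # debounced state exposed to upper layers
    down_since: Optional[float] = None  # time the raw signal first went down

    def observe(self, raw_up: bool, now: float) -> bool:
        """Feed one raw link sample taken at time `now`; return the debounced state."""
        if raw_up:
            # Link is up (or came back): discard any pending down event.
            self.down_since = None
            self.reported_up = True
        else:
            if self.down_since is None:
                self.down_since = now
            # Only report down once the outage has outlasted the window.
            if now - self.down_since >= self.hold_down_s:
                self.reported_up = False
        return self.reported_up


# A 0.1 s flap is suppressed; an outage lasting 0.6 s is reported.
debouncer = LinkDebouncer(hold_down_s=0.5)
samples = [(0.0, True), (0.1, False), (0.2, True), (1.0, False), (1.6, False)]
states = [debouncer.observe(raw_up, t) for t, raw_up in samples]
print(states)  # [True, True, True, True, False]
```

The trade-off in this kind of scheme is between stability and reaction time: a longer window hides more flaps but delays the response to a genuine failure by the same amount.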