George Zervas, Oriole Networks
Photonic networks for AI: Challenges and Opportunities
SPECIAL FOCUS: PHOTONICS

AI performance doubles every 9 months
Jensen: “AI Revenues are Power-Limited”

xPUs and Network I/O
Training xPUs: multiple Tb/s, evolving to 10 Tb/s and beyond
Inference xPUs: sub-Tb/s to one or a few Tb/s
Significant difference (2-7x)
Devices shown: NVIDIA GB300, Amazon AWS Trainium2 chip, AMD MI325X, Amazon AWS Inferentia2, GroqCard LPU, Google Ironwood

Training and Inference: Network requirements
Training
Training completion time per MW is KEY.
Workloads tend to be long-lasting with a fixed computational graph: A) new model training and B) tuning an existing model.
Distributed training uses both scale-up (10s of nodes) and scale-out (10k-100k) networks, with very different performance/scale abilities.
Inference
Tokens/s/kW is fundamental.
Workloads are very short-lived, and patterns are less deterministic than in training.
MoE and reasoning could trigger a variable set of xPUs, network connectivity and collectives per prompt and within the lifetime of an LLM prompt.
Collective communication networking is key
Both networks benefit from fully connected, deterministically synchronous, lossless networks with fast reconfiguration = “zero” collective communication tail latency.

Google's deployment of OCS in Data Centers
Can we carry on scaling the EPS networks?
Diminishing returns in power consumption for future electronic switches and optics.
OCS is used at the aggregation layer and supports reconfiguration times of 10s of ms.
Jupiter Evolving: Transforming Google's Datacenter Network via Optical Circuit Switches and Software-Defined Networking, SIGCOMM '22

Hypercube OCS Architecture
ML 4096 TPU cluster: 64 racks, 64 TPUs/rack
A rack of 64 TPUs with a 3D torus connectivity and