NVIDIA HIGH PERFORMANCE E2E ETHERNET SOLUTION
ACCELERATE RECOMMENDER SYSTEM
GTC China, Oct 2020
#page#
Recommendation Pipelines Example
[Pipeline diagram] Experimentation: data lake (TBs to PBs) -> feature engineering and data pre-processing -> train data (GBs to TBs) -> model(s) training -> improved accuracy? Production re-training repeats the preprocessing and training steps on a weekly cadence; production inference covers candidate generation and ranking in the recommender system.
#page#
Recommendation Pipelines Challenges

Data (ETL) - feature exploration:
- Huge data sets: TBs, PBs or more.
- Complex data preprocessing and feature engineering pipelines.
- Many iterations required; longer iteration cycles reduce the ability to reach higher accuracies quickly.

Training - throughput and data loading:
- Data loading can be 50% of total training time.
- Tabular data loading scales poorly with an item-by-item approach.

Training - huge embedding tables:
- Large embedding tables exceed the memory of a single GPU.
- Hard to achieve high scaling efficiency with both model and data parallelism.
- Sub-optimal feature lookup ops implementation.

Inference - performance, accuracy and latency:
- Difficult to achieve high throughput and low latency when ranking a huge number of items.
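The data-loading challenge above (loading can be 50% of training time, and item-by-item loading scales poorly) can be illustrated with a minimal NumPy sketch. This is illustrative only, not NVIDIA's actual loader: the point is that one vectorized gather per batch replaces thousands of per-row copies.

```python
import numpy as np

# Synthetic tabular dataset: 1M rows x 8 dense feature columns.
table = np.random.rand(1_000_000, 8).astype(np.float32)

def load_item_by_item(indices):
    """Per-row gather: one small copy per example (poor locality, high per-call overhead)."""
    return np.stack([table[i] for i in indices])

def load_batched(indices):
    """Single vectorized gather: one large contiguous copy per batch."""
    return table[indices]

batch = np.random.randint(0, len(table), size=4096)
a = load_item_by_item(batch)
b = load_batched(batch)
assert np.array_equal(a, b)  # identical data, very different cost per batch
```

The batched path does the same work in one fancy-indexing call; production systems push this further by reading columnar files in large chunks directly into GPU memory.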
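The huge-embedding-table challenge is typically met by sharding the table's rows across GPUs (model parallelism) while the dense layers stay data-parallel. A toy NumPy sketch of row-wise sharding and lookup routing follows; the shard layout and names here are hypothetical, and in a real multi-GPU system the routing step is an all-to-all exchange over NVLink and the network.

```python
import numpy as np

NUM_DEVICES, VOCAB, DIM = 4, 1_000, 16

# Row-wise shard: device d owns all rows with row_id % NUM_DEVICES == d,
# so no single device has to hold the full VOCAB x DIM table.
rows_per_dev = (VOCAB + NUM_DEVICES - 1) // NUM_DEVICES
shards = [np.random.rand(rows_per_dev, DIM).astype(np.float32)
          for _ in range(NUM_DEVICES)]

def lookup(ids):
    """Route each id to its owning shard and gather the embedding rows."""
    out = np.empty((len(ids), DIM), dtype=np.float32)
    for pos, i in enumerate(ids):
        dev, local = i % NUM_DEVICES, i // NUM_DEVICES
        out[pos] = shards[dev][local]
    return out

emb = lookup(np.array([3, 42, 999]))
print(emb.shape)  # (3, 16)
```

Because every lookup batch touches shards on several devices, the exchange that reassembles the rows is latency- and bandwidth-sensitive, which is what motivates the interconnect discussion later in the deck.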
#page#
NVIDIA Ethernet Switches Address the Challenges
- Speed, feed and latency: fast interconnect.
- Fast dataset access: RDMA and RoCE.
- Low-latency access to GPU memory.
- Low-latency access to external datasets.
- Monitoring and management.
#page#
SPEED AND FEED - THE NEED FOR BANDWIDTH
[Diagram: intra-layer model parallelism vs. data parallelism]
- Intra-layer model parallelism leaves collectives exposed.
- Communication speedup must match math speedup; accelerating the math without accelerating communication runs into the basic Amdahl's law problem, and we achieve little E2E speedup.
- Typically collectives span the NVLink domain only.
- Allreduce spans both the NVLink and networking domains: bandwidth must be available in each.
#page#
NVIDIA'S MULTI-GPU, MULTI-NODE NETWORKING AND STORAGE IO OPTIMIZATION STACK
Build l
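The Amdahl's law point on the "Speed and Feed" slide can be made concrete: if a fixed fraction of each iteration is exposed communication, speeding up only the math caps the end-to-end gain. A small worked example (the fractions and speedups below are illustrative, not measured):

```python
def e2e_speedup(comm_frac, math_speedup, comm_speedup=1.0):
    """Amdahl's law: end-to-end speedup when each part of an
    iteration accelerates by a different factor.
    comm_frac: fraction of iteration time in exposed communication."""
    math_frac = 1.0 - comm_frac
    return 1.0 / (math_frac / math_speedup + comm_frac / comm_speedup)

# 30% of an iteration is exposed collectives; math gets 10x faster.
print(round(e2e_speedup(0.30, 10.0), 2))        # prints 2.7  (network untouched)
print(round(e2e_speedup(0.30, 10.0, 10.0), 2))  # prints 10.0 (network keeps pace)
```

With the network left as-is, a 10x math speedup delivers under 3x end to end; only when communication accelerates in step does the full speedup materialize.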
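The claim that allreduce needs bandwidth in both the NVLink and networking domains can be sized with the standard ring-allreduce traffic formula: each of N ranks sends about 2(N-1)/N times the gradient size per iteration. The model size and GPU count below are made-up numbers for illustration.

```python
def ring_allreduce_bytes_per_gpu(grad_bytes, n_ranks):
    """Bytes each rank sends (and receives) in one ring allreduce:
    the reduce-scatter plus all-gather phases move 2*(N-1)/N * S bytes per rank."""
    return 2 * (n_ranks - 1) / n_ranks * grad_bytes

# Hypothetical job: 100M fp32 dense parameters (400 MB of gradients), 32 GPUs.
traffic = ring_allreduce_bytes_per_gpu(400e6, 32)
print(f"{traffic / 1e6:.0f} MB per GPU per iteration")  # prints "775 MB per GPU per iteration"
```

At one iteration per second that is roughly 6.2 Gbit/s sustained per GPU, and since the ring crosses both NVLink and the network, that bandwidth must be available in whichever domain each hop traverses.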