Fight fire with fire: AI-assisted test/debug flow and log analysis for AI GPU systems
Tommy Yan, GPU Project Lead, Microsoft Azure
Anna Mary Mathew, Director, Microsoft Azure

TEST & VALIDATION

Problem statement
AI infrastructure scaling and the introduction of new technologies create a unique test validation framework that generates massive amounts of validation data for post-processing.
Validation data uses heterogeneous formats.
Debugging with massive data is becoming even more complex.
A few of the key areas of debug are:
o Rack-level connectivity issues
o Power envelope worst-case scenarios
o Performance variation at cluster level

AI-assisted System Test/Debug Flow and Log Analysis

Interested Logs: File patterns to scan (e.g., *BMCSELListDetail*.csv).
Error Match Type:
    ERROR: Flag if a log line contains any error_text keyword, excluding whitelist_text.
    PASS: All pass_text keywords must be present in each log entry.
Match Text: Keywords that indicate a problem (error, fail, critical,).
Whitelist Text: Known safe/irrelevant phrases to ignore (non-critical, Correctable error,).
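
The ERROR and PASS match types described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenters' implementation: the `ErrorSignature` class, its field names, and the helpers `applies_to` and `is_flagged` are hypothetical. Matching is assumed to be case-insensitive substring search, and a line containing any whitelist phrase is assumed to be skipped entirely.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class ErrorSignature:
    """Hypothetical in-memory form of one signature entry (names are assumptions)."""
    interested_logs: list            # file patterns to scan
    match_type: str                  # "ERROR" or "PASS"
    error_text: list = field(default_factory=list)      # problem keywords
    whitelist_text: list = field(default_factory=list)  # phrases to ignore
    pass_text: list = field(default_factory=list)       # keywords required for PASS


def applies_to(sig, filename):
    """True if the file name matches any of the signature's file patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in sig.interested_logs)


def is_flagged(sig, line):
    """True if the log line should be reported as a problem under this signature."""
    low = line.lower()
    if sig.match_type == "ERROR":
        # Assumption: any whitelist phrase exempts the whole line.
        if any(phrase.lower() in low for phrase in sig.whitelist_text):
            return False
        return any(keyword.lower() in low for keyword in sig.error_text)
    if sig.match_type == "PASS":
        # A line is a problem unless every pass_text keyword is present.
        return not all(keyword.lower() in low for keyword in sig.pass_text)
    raise ValueError(f"unknown match type: {sig.match_type!r}")


# Example using the keywords quoted in the signature definition.
sel_sig = ErrorSignature(
    interested_logs=["*BMCSELListDetail*.csv"],
    match_type="ERROR",
    error_text=["error", "fail", "critical"],
    whitelist_text=["non-critical", "Correctable error"],
)
```

The whitelist check runs first because whitelisted phrases such as "non-critical" and "Correctable error" themselves contain the keywords "critical" and "error"; without it, known-safe lines would be flagged.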
Stop-on-Fail Flag: Halt test flow on detection if true.

Define Error Signature File
For interested logs:

Pre-Search Treatment
07 00 ca 24 c2 96 68 37 01 00 02 02 10 00 ff ff
RecordName: DramTest Error OEM Event
EvtD 1: 1st Error ID (DimmDtrResult): DTR_STATUS_NO_FAILURE
EvtD 2: 2nd Error ID (Fail count): 255

Log Search
Known Good Log Compare

Result Categorization
    Group by message patterns
    Separate matched vs missing signatures

Result De-duplication
    Fuzzy matching (80% threshold)
    Merge similar error signatures

Matched Result Post-Search Treatment
Result Analysis
Stop-on-fail as behavior
Triggered as ne