Evaluation-Driven Development Workflows: Best Practices and Real-World Scenarios
Vivian (Wenwen) Xie, Arthur Dooner
June 9

About Us
Arthur Dooner, Specialist Solutions Architect
Vivian (Wenwen) Xie, Specialist Solutions Architect

Evaluation Foundations
Evaluation is a Mindset

What is Evaluation?
- Evaluation is a process of assessing the quality of an AI system in a repeatable and standardized way.
- Every evaluation you run of an AI system will have different definitions of success and “correct”, because the goals of each AI system are different.

Why do we Evaluate our AI Systems?
- We need to compare one version of our AI system to another, to see how a system is performing relative to its previous version.
- Without evaluation, we cannot prove that our iterations amount to development progress.

Types of Evaluation
Evaluation Exists in Many Forms - Identify which one you need to do!

Classification of Accuracy
- Get F1 scores or equivalents for an AI system to ask:
  - Is it labeling this record correctly?
  - What percentage of the time is the record labeled?
  - What labels?
  - What score makes sense (F1, RMSE, etc.)?

RAG System Evaluation
- Retrieval performance
- Retrieval quality
- Data structures (are chunks connected and meaningful?)
- Prompt expansion
- System and contextual prompts
- Latency and cost

Agent System Evaluation
- Mine traces to see what tools are used
- More compound retrieval
- Tool quality/performance individually
- Tool results together
- Agent system prompt
- Determinism

Evaluation-Driven Development
Definition: EDD integrates continuous evaluation into the AI development lifecycle, ensuring models meet quality, cost, and latency benchmarks.
Importance: Facilitates iterative improvements, aligns models with user needs, and ensures compliance with organizational standards.
Prerequisites:
- Gather requirements to validate gen AI fit and identify constraints
- Design your solution archi