Pallavi Koppol, Research Scientist
Jonathan Frankle, Chief AI Scientist
Thursday, June 12

AI Evaluation from First Principles: You Can't Manage What You Can't Measure

Motivation: A Metaphor
You're building a new software product.
You write a 1000-line script in Python.
You play with it a bit.
“Seems good.”
You ship it.

This would be crazy
You write a design doc.
You break into modules with abstraction boundaries.
You write unit tests for typical cases and corner cases.
You write integration tests.
You know it will work before you ship it.

Motivation: A Metaphor
You're building a new AI product.
You write a 1000-word prompt.
You check the vibes.
“Seems good.”
You ship it.

2025 is the year of AI Engineering
This is the year we move from AI demos to AI engineering.
The watchword is reliability.
How do you build an AI system that will still exist in a year?
How do you build an AI system that multiple people can work on simultaneously?
How do you build a “million line” equivalent AI system?
We only know one way to do this: modularity, abstraction, and specification.
AI isn't software, so we need to figure out what these concepts mean in this new context.
Our belief at Databricks: it all starts with evals.

Outline
1. Challenge: Measuring quality is important but difficult
2. Framework: 3x3 approach for understanding evaluation needs
3. Solution: Recipe for building gold standard evaluations
4. Takeaways: Discussion & next steps

Outline
1. Challenge: Me