Generating Laughter: Evaluating the Success of LLMs for Comedic Purposes
Erin Mikail Staples (she/her), Sr. DX Engineer, Galileo.ai
Confidential | 2025 Galileo Technologies, Inc.

Erin Mikail Staples (she/her)
Developer experience engineer, open source enthusiast, + AI-enhanced stand-up
eringalileo.ai

The roadmap
Nondeterminism Isn't a Bug, It's the Bit
Understanding Metrics
Building your metrics stack
Business Case Examples
Startup Simulator 3000

Nondeterminism Isn't a Bug, It's the Bit

LLMs are unpredictable by design
But how do you test something you don't expect to repeat?

Keeping One Foot on the Ground

Standard AI Evaluation Metrics
Accuracy: Is it truthful?
Token Usage: How many tokens did it burn?
Latency: How long did it take?
Safety: Did the model create harm?

[Diagram: Serious Mode and Silly Mode. Labels: Pitch Decks, News API, Serious Formatter, Hacker News, Satire, Silly Formatter]

[Diagram: Traditional AI Eval Metrics, Agentic Experience Metrics, What really matters]

[Diagram: Traditional AI Eval Metrics, Agentic Experience Metrics, What really matters, Custom Metrics (involve SME)]

Beyond Comedy: Custom metrics for any industry.

What should I measure?
Start with key metrics - Focus on metrics most relevant to your use case.
Establish baselines - Understand your current performance before making changes
Track trends over time - Monitor how met