Generating Laughter: Evaluating the Success of LLMs for Comedic Purposes


Generating Laughter
Evaluating the Success of LLMs for Comedic Purposes
Erin Mikail Staples (she/her), Sr. DX Engineer, Galileo.ai
Confidential | 2025 Galileo Technologies, Inc.

Erin Mikail Staples (she/her)
Developer experience engineer, open source enthusiast, + AI-enhanced stand-up
eringalileo.ai

The roadmap
- Nondeterminism Isn't a Bug, It's the Bit
- Understanding Metrics
- Building your metrics stack
- Business Case Examples
- Startup Simulator 3000

Nondeterminism Isn't a Bug, It's the Bit

LLMs are unpredictable by design. But how do you test something you don't expect to repeat?

Keeping One Foot on the Ground

Standard AI Evaluation Metrics
- Accuracy: Is it truthful?
- Token Usage: How many tokens did it burn?
- Latency: How long did it take?
- Safety: Did the model create harm?

[Diagram: Serious Mode, Silly Mode, Pitch Decks, News API, Serious Formatter, Hacker News, Satire, Silly Formatter]

[Diagram: Traditional AI Eval Metrics, Agentic Experience Metrics, What really matters, Custom Metrics (involve SME)]

Beyond Comedy: Custom metrics for any industry.

What should I measure?
- Start with key metrics: Focus on metrics most relevant to your use case.
- Establish baselines: Understand your current performance before making changes.
- Track trends over time: Monitor how met
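
The slides ask how to test output that is not expected to repeat, and suggest establishing a baseline before making changes. One possible approach, sketched below, is to sample the generator many times and compare aggregate metrics instead of asserting on any single output. Everything here is an assumption for illustration: `fake_joke_model` is a hypothetical stand-in for a real LLM call, token usage is approximated by a whitespace split, and the metric names are invented. Accuracy and safety are left out because they generally need a judge or a subject-matter expert rather than a counter.

```python
import random
import statistics
import time


def fake_joke_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: output varies between runs."""
    punchlines = [
        "We pivoted to AI before we had a product.",
        "Our burn rate is the only metric going up and to the right.",
        "It's like Uber, but for pitch decks.",
    ]
    return f"{prompt} {rng.choice(punchlines)}"


def evaluate(prompt: str, runs: int = 20, seed: int = 0) -> dict:
    """Sample the generator several times and aggregate simple metrics."""
    rng = random.Random(seed)
    latencies, token_counts, outputs = [], [], []
    for _ in range(runs):
        start = time.perf_counter()
        text = fake_joke_model(prompt, rng)
        latencies.append(time.perf_counter() - start)
        token_counts.append(len(text.split()))  # crude whitespace proxy for tokens
        outputs.append(text)
    return {
        "runs": runs,
        "mean_latency_s": statistics.mean(latencies),
        "mean_tokens": statistics.mean(token_counts),
        "distinct_outputs": len(set(outputs)),  # rough signal of variety
    }


def compare_to_baseline(current: dict, baseline: dict,
                        keys=("mean_latency_s", "mean_tokens")) -> dict:
    """Return current minus baseline for each tracked metric."""
    return {key: current[key] - baseline[key] for key in keys}


if __name__ == "__main__":
    baseline = evaluate("Startup idea:", runs=20, seed=0)
    current = evaluate("Startup idea:", runs=20, seed=1)
    print(baseline)
    print(compare_to_baseline(current, baseline))
```

Storing the dictionary returned by `evaluate` after each change would give a simple record for tracking these numbers over time.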
