GDPVAL: EVALUATING AI MODEL PERFORMANCE ON REAL-WORLD ECONOMICALLY VALUABLE TASKS

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek

OpenAI

ABSTRACT

We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (
Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improve model performance on
GDPval. Finally, we open-source a gold subset of 220 tasks and provide a public automated grading service to facilitate future research in understanding real-world model capabilities.

1 INTRODUCTION

There is growing debate about how increasingly capable AI models could affect the labor market, whether by automating specific tasks, replacing entire occupations, or creating entirely new kinds of work (Brynjolfsson et al., 2025; Chen et al., 2025). Current approaches to measure the economic impact of AI focus on indicators such as adoption rates, usage patterns, and GDP growth attributed to AI (Chatterji et al.,