From Code Completion to Autonomous Software Engineering Agents

1. Benchmarks
2. Agent Basics
3. SWE-agent
4. Training LMs & what's next

Benchmarks: HumanEval-style
Perfect benchmark for evaluating code generation & autocomplete!
But: saturated, little context, and most developer time is not spent writing code!

Benchmarks: Software engineering
[Workflow diagram: Bug, Reproduce, Find/read code, Edit, Test, Done]

Benchmarks: Software engineering
[Workflow diagram, extended: Done, Bug, Reproduce, Find/read code, Edit, Test, plus three "other bugs" nodes]

Benchmarks: SWE-bench
Solve real-world GitHub issues
Very challenging: extreme context, reasoning-heavy, enormous action space

Benchmarks: SWE-bench multimodal
Far from being saturated: 25% SotA

Benchmarks: SWE-bench
Taking results for best comparability, not best performance.
Generally the same scaffolding between '24 and '25, but different tools and prompts

Basics: Put the LM into the terminal
bash commands are hard
Humans use tools like VSCode or vim
Let's build & optimize an agent-computer interface (ACI)! (Yang et al., 2023)

Basics: Tools
[Figure values: 10%, 18% (Apr '24)]
Many things have changed since Apr '24
Most editing tools are now search & replace based
Open source models might still benefit from linting, but it is less needed for SoTA models
BUT: edit is still the most important part of the ACI to optimize
LMs got much better at using many tools

Basics: Prompts
Initial prompts
Explain mission, strategy & give tips
Can be very short with Claude 3.5+
Do not need to explain tools if using function c