![[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang](https://assets.flightcast.com/V2Uploads/nvaja2542wefzb8rjg5f519m/01K4D8FB4MNA071BM5ZDSMH34N/square.jpg)
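This episode was recorded on site at NeurIPS. The host sat down with SWE-bench creator John Yang and several other guests for an in-depth discussion of where AI coding evaluation benchmarks stand today and where they are heading. The conversation covers the evolution of SWE-bench, the wave of emerging benchmarks, and how novel competition platforms such as CodeClash are redefining long-horizon evaluation of AI coding ability.
Summary: Coding evaluation is evolving quickly, moving from static benchmarks toward dynamic, interactive, long-running competitive environments. The future winners may not be the AI systems that write perfect code in one shot, but the ones that are best at continually learning, optimizing, and collaborating seamlessly with human developers.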
From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents—trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale.
We caught up with John live at NeurIPS 2025 to dig into the state of code evals heading into 2026:

- Why SWE-bench went from ignored (October 2023) to the industry standard after Devin's launch, and how Walden emailed him two weeks before the big reveal
- How the benchmark evolved from Django-heavy to nine languages across 40 repos (JavaScript, Rust, Java, C, Ruby)
- Why unit tests as verification are limiting, and why long-running agent tournaments might be the future (CodeClash: agents maintain codebases, compete in arenas, and iterate over multiple rounds)
- The proliferation of SWE-bench variants (SWE-bench Pro, SWE-bench Live, SWE-Efficiency, AlgoTune, SciCode), and how benchmark authors are now justifying their splits with curation techniques instead of just "more repos"
- Why Tau-bench's "impossible tasks" controversy is actually a feature, not a bug: intentionally including impossible tasks flags cheating
- The tension between long autonomy (5-hour runs) and interactivity (Cognition's emphasis on fast back-and-forth)
- How Terminal-bench unlocked creativity by letting PhD students and non-coders design environments beyond GitHub issues and PRs
- The academic data problem: companies like Cognition and Cursor have rich user interaction data, while academics need user simulators or compelling products like LMArena to get similar signal
- His vision for CodeClash as a testbed for human-AI collaboration: freeze model capability, vary the collaboration setup (solo agent, multi-agent, human+agent), and measure how interaction patterns change as models climb the ladder from code completion to full codebase reasoning
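
The point about impossible tasks flagging cheating can be made concrete with a small sketch. This is not Tau-bench's actual harness: the `Task` class, the `solvable` flag, the `score_run` function, and the task IDs are all hypothetical names chosen for illustration. The assumed idea is that a few tasks are deliberately unsolvable "canaries", so any run that reports passing one is suspect, however good its headline score looks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    task_id: str
    solvable: bool  # False marks a deliberately impossible "canary" task (hypothetical field)


def score_run(tasks: list[Task], claimed_solved: set[str]) -> dict:
    """Score one agent run and flag any canary task it claims to have passed."""
    solvable = {t.task_id for t in tasks if t.solvable}
    canaries = {t.task_id for t in tasks if not t.solvable}

    # A canary cannot be legitimately solved, so a "pass" suggests the agent
    # gamed the checker (or the checker itself is broken).
    tripped = sorted(canaries & claimed_solved)

    # Report the pass rate over solvable tasks only.
    pass_rate = len(solvable & claimed_solved) / len(solvable) if solvable else 0.0
    return {
        "pass_rate": pass_rate,
        "tripped_canaries": tripped,
        "suspicious": bool(tripped),
    }


# Illustrative task set: two solvable tasks and one canary.
tasks = [
    Task("fix-login", solvable=True),
    Task("add-cache", solvable=True),
    Task("contradictory-spec", solvable=False),
]

honest = score_run(tasks, {"fix-login"})
gamed = score_run(tasks, {"fix-login", "add-cache", "contradictory-spec"})
print(honest)
print(gamed)
```

In this toy setup the second run has the higher pass rate but is the only one marked suspicious, which is the sense in which impossible tasks act as a feature rather than a bug.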