[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang

Latent Space: The AI Engineer Podcast
about 1 month ago · 17m

From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto standard for evaluating AI coding agents—trusted by Cognition (Devin), OpenAI, Anthropic, and every major lab racing to solve software engineering at scale.

We caught up with John live at NeurIPS 2025 to dig into the state of code evals heading into 2026: why SWE-bench went from ignored (October 2023) to the industry standard after Devin's launch (and how Walden emailed him two weeks before the big reveal), how the benchmark evolved from Django-heavy to nine languages across 40 repos (JavaScript, Rust, Java, C, Ruby), why unit tests as verification are limiting and long-running agent tournaments might be the future (CodeClash: agents maintain codebases, compete in arenas, and iterate over multiple rounds), the proliferation of SWE-bench variants (SWE-bench Pro, SWE-bench Live, SWE-Efficiency, AlgoTune, SciCode) and how benchmark authors are now justifying their splits with curation techniques instead of just "more repos," why Tau-bench's "impossible tasks" controversy is actually a feature not a bug (intentionally including impossible tasks flags cheating), the tension between long autonomy (5-hour runs) vs. interactivity (Cognition's emphasis on fast back-and-forth), how Terminal-bench unlocked creativity by letting PhD students and non-coders design environments beyond GitHub issues and PRs, the academic data problem (companies like Cognition and Cursor have rich user interaction data, academics need user simulators or compelling products like LMArena to get similar signal), and his vision for CodeClash as a testbed for human-AI collaboration—freeze model capability, vary the collaboration setup (solo agent, multi-agent, human+agent), and measure how interaction patterns change as models climb the ladder from code completion to full codebase reasoning.


Podcast Summary: The Future of Code Evals: From SWE-bench to CodeClash, Where Are AI Coding Competitions Headed?

Overview

This episode was recorded live at NeurIPS, where the host sat down with SWE-bench creator John Yang and several other guests for a deep dive into the current state and future direction of AI coding evaluation. The conversation covers the evolution of SWE-bench, the wave of emerging benchmarks, and how new competition platforms like CodeClash are redefining long-horizon evaluation of AI programming ability.

Core Topics and Discussion Points

1. The Evolution of SWE-bench and Its Expanding Ecosystem

  • Origins and tipping point: SWE-bench was released in October 2023, but it only drew broad attention after Cognition launched Devin, whose strong results kicked off a new phase in the race to build AI coding assistants.
  • Main directions of expansion
    • Derivative variants: Independent variants such as SWE-bench Pro and SWE-bench Live have appeared, reflecting active community participation.
    • Multimodal and multilingual: The team shipped versions supporting multimodality (image + code) and multiple languages (roughly nine languages across 40 repositories, including JavaScript, Rust, Java, C, and Ruby) to address the original benchmark's over-reliance on Django.
    • What's next: John is keen to see how the community curates the next generation of datasets, with justifications shifting from simply "more languages" to "harder tasks through better curation techniques".

2. Beyond Single Tasks: CodeClash and Long-Horizon Evaluation

  • Design rationale: CodeClash targets a limitation of benchmarks like SWE-bench, where tasks are independent and end after a single submission. It instead evaluates long-horizon development and how models hold up under sustained iteration.
  • How it works: Two or more language models play a programming tournament. Each round has two phases, autonomous codebase improvement followed by head-to-head codebase competition, with the arena (or an LLM judge) deciding the winner. This mimics a real, ongoing develop-and-optimize loop; see the sketch after this list.
  • Arena types: Early arenas reuse existing programming games (such as Halite); the plan is to build arenas with "economic value" that better reflect real-world utility and complexity.
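To make the round structure concrete, here is a minimal Python sketch of a CodeClash-style tournament loop under stated assumptions: the Agent and Arena classes, their improve/compete/feedback_for methods, and the toy "strength" scoring are hypothetical stand-ins for illustration, not the actual CodeClash implementation.

```python
# Hypothetical sketch of a CodeClash-style tournament loop (all names assumed).
import random
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Stand-in for an LLM coding agent that maintains and edits its own codebase."""
    name: str

    def improve(self, codebase: dict, feedback: str) -> dict:
        # A real agent would read the arena feedback and patch its code;
        # here we just nudge a toy "strength" score upward.
        updated = dict(codebase)
        updated["strength"] = updated.get("strength", 0.0) + random.random()
        return updated


@dataclass
class Arena:
    """Stand-in for a competitive arena (e.g. a programming game or an LLM judge)."""
    history: list = field(default_factory=list)

    def compete(self, codebases: list) -> int:
        # A real arena would run the submitted programs against each other;
        # here the codebase with the higher toy "strength" wins.
        winner = max(range(len(codebases)), key=lambda i: codebases[i]["strength"])
        self.history.append(winner)
        return winner

    def feedback_for(self, idx: int) -> str:
        if not self.history:
            return "no rounds played yet"
        return "won last round" if self.history[-1] == idx else "lost last round"


def run_tournament(agents, arena, rounds=3):
    """Each round: (1) every agent autonomously improves its codebase,
    (2) the codebases compete head-to-head in the arena, (3) the win is recorded."""
    codebases = [{"strength": 0.0} for _ in agents]
    wins = {a.name: 0 for a in agents}
    for _ in range(rounds):
        codebases = [
            agent.improve(cb, arena.feedback_for(i))
            for i, (agent, cb) in enumerate(zip(agents, codebases))
        ]
        winner_idx = arena.compete(codebases)
        wins[agents[winner_idx].name] += 1
    return wins


if __name__ == "__main__":
    print(run_tournament([Agent("model_a"), Agent("model_b")], Arena()))
```

The point of the structure is that each round interleaves an autonomous improvement phase with a competitive evaluation phase, so scores reflect sustained iteration rather than a single submission.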

3. Other Innovative Work in Coding Evaluation

  • Performance optimization: benchmarks such as SWE-Efficiency (speed up code while preserving behavior) and AlgoTune.
  • Scientific computing: SciCode is described as "a better HumanEval", offering a more effective completion-style benchmark.
  • Vertical domains: domain-specific benchmarks have emerged, such as SecBench (cybersecurity) and SREBench (site reliability engineering).
  • User-interaction simulation: Tau-bench and Vending-Bench attempt to simulate real user interaction, though their usefulness and realism remain debated. An interesting offshoot is Impossible Bench, which specifically tests whether models refuse impossible tasks.

4. Outlook and Challenges

  • Evaluation trends: Expect more SWE-bench variants. Terminal-bench is particularly anticipated because it injects more creativity and supports tasks beyond coding.
  • Balancing long-horizon autonomy with human-AI collaboration
    • The vision: agents that, given a goal, autonomously improve a codebase over long stretches (hours or even days).
    • The caution: guests were wary of over-indexing on full long-horizon autonomy, arguing that human-AI interaction (fast, repeated back-and-forth) is more important and more common in practice. The likely direction is offering different levels of abstraction to suit different tasks and developer preferences.
  • Data and evaluation bottlenecks: Academic research lacks high-quality user-interaction data, and neither building a compelling product nor building a good user simulator to collect it is easy.
  • A new frontier, code understanding: As companies like Cognition push code-based understanding and retrieval, how to evaluate an AI's "understanding" of a codebase becomes a new open question; simple question-answering formats saturate quickly.

Key Insights and Takeaways

  • Benchmark democratization: The emergence of independent variants like SWE-bench Pro shows that coding evaluation has become an active, community-driven field.
  • A shift in evaluation paradigm: The focus is moving from single-task completion toward long-horizon, multi-round, competitive evaluation of agent behavior.
  • Realism is the core value: Whether it is SWE-bench building on real GitHub issues or CodeClash introducing arenas, proximity to real development scenarios is what makes a benchmark valuable.
  • Human-AI collaboration is the future: Full automation is neither the only path nor necessarily the best one. Future tools and evaluations should focus more on optimizing and measuring effective human-AI collaboration in programming.
  • Calls to action
    1. More real interaction data: Academia needs more insight into what real human-AI programming interactions look like in order to improve evaluation simulators.
    2. Explore diverse experimental platforms: Frameworks like CodeClash provide a testbed for different human-AI combinations and collaboration modes, and researchers are invited to explore them.
    3. Focus on evaluating code understanding: How to effectively measure an AI's deep understanding of complex codebases is the next open evaluation problem.

In summary: the coding-evaluation field is evolving rapidly, moving from static benchmarks toward dynamic, interactive, long-horizon competition environments. Tomorrow's winners may not be the AIs that write perfect code in one shot, but the systems that are best at continuously learning, optimizing, and collaborating seamlessly with human developers.

