![[State of Evals] LMArena's $100M Vision — Anastasios Angelopoulos, LMArena](https://assets.flightcast.com/V2Uploads/nvaja2542wefzb8rjg5f519m/01K4D8FB4MNA071BM5ZDSMH34N/square.jpg)
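This episode features Anastasios Angelopoulos, co-founder of Arena, the AI model evaluation platform. He explains how Arena grew from a Berkeley academic project (LMSYS) into an independent company with $100M in funding, and lays out its core mission, its operating principles, the challenges it faces, and its impact on the AI evaluation ecosystem.
Call to action: Anastasios invites top talent from every field to join Arena, and welcomes AI companies such as Cognition (the creators of Devin) to plug their agent harnesses into platforms like Code Arena for public evaluation and help shape how AI capabilities are measured.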
From building LMArena in a Berkeley basement to raising $100M and becoming the de facto leaderboard for frontier AI, Anastasios Angelopoulos returns to Latent Space to recap 2025 at one of the most influential platforms in AI, trusted by millions of users, every major lab, and the entire industry to answer one question: which model is actually best for real-world use cases?
We caught up with Anastasios live at NeurIPS 2025 to dig into the origin story (spoiler: it started as an academic project incubated by Anjney Midha at a16z, who formed an entity and gave grants before they even committed to starting a company), why they decided to spin out instead of staying academic or nonprofit (the only way to scale was to build a company), and how they're spending that $100M (inference costs, the React migration off Gradio, and hiring world-class talent across ML, product, and go-to-market). We also cover the leaderboard delusion controversy and why their response demolished the paper's claims (factual errors, misrepresentation of open vs. closed source sampling, and ignoring the transparency of preview testing that the community loves), and why platform integrity comes first (the public leaderboard is a charity, not a pay-to-play system: models can't pay to get on, can't pay to get off, and scores reflect millions of real votes). From there, we get into how they're expanding into occupational verticals (medicine, legal, finance, creative marketing) and multimodal arenas (video coming soon), why consumer retention is earned every single day (sign-in and persistent history were the unlock, but users are fickle and can leave at any moment), and the Gemini Nano Banana moment that changed Google's market share overnight (and why multimodal models are becoming economically critical for marketing, design, and AI-for-science). Finally, Anastasios shares how they're thinking about agents and harnesses (Code Arena evaluates models, but maybe it should evaluate full agents like Devin) and his vision for Arena as the central evaluation platform that provides the North Star for the industry: constantly fresh, immune to overfitting, and grounded in millions of real-world conversations from real users.