![[NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al, Princeton](https://assets.flightcast.com/V2Uploads/nvaja2542wefzb8rjg5f519m/01K4D8FB4MNA071BM5ZDSMH34N/square.jpg)
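This podcast episode interviews the team behind the NeurIPS Best Paper, who used a novel self-supervised reinforcement learning approach to scale network depth to the thousand-layer level, breaking the long-standing limitation that RL could only use shallow networks. The research reveals new possibilities for combining reinforcement learning with self-supervised learning and opens a new path forward for fields such as robotics.
The work not only shows that deep networks are viable in reinforcement learning; more importantly, it offers a new methodological perspective: by redefining the learning objective and combining it carefully with architectural innovations, a field's traditional limits can be broken and new research directions opened. It is a reminder to researchers that the biggest breakthroughs sometimes come from questioning the most basic assumptions and boldly crossing the boundaries between fields.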
From undergraduate research seminars at Princeton to winning a Best Paper award at NeurIPS 2025, Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, and Benjamin Eysenbach defied conventional wisdom by scaling reinforcement learning networks to 1,000 layers deep, unlocking performance gains that the RL community thought impossible.
We caught up with the team live at NeurIPS to dig into the story behind RL1000:

- Why deep networks have worked in language and vision but failed in RL for over a decade (spoiler: it's not just about depth, it's about the objective)
- How they discovered that self-supervised RL (learning representations of states, actions, and future states via contrastive learning) scales where value-based methods collapse
- The critical architectural tricks that made it work: residual connections, layer normalization, and a shift from regression to classification
- Why scaling depth is more parameter-efficient than scaling width (linear vs. quadratic growth)
- How JAX and GPU-accelerated environments let them collect hundreds of millions of transitions in hours (the data abundance that unlocked scaling in the first place)
- The "critical depth" phenomenon, where performance doesn't just improve but multiplies once you cross 15M+ transitions and add the right architectural components
- Why this isn't just "make networks bigger" but a fundamental shift in RL objectives (their code doesn't have a line saying "maximize rewards"; it's pure self-supervised representation learning)
- How deep-teacher, shallow-student distillation could unlock deployment at scale (train frontier capabilities with 1000 layers, distill down to efficient inference models)
- The robotics implications: goal-conditioned RL without human supervision or demonstrations, scaling architecture instead of scaling manual data collection
- Their thesis that RL is finally ready to scale like language and vision, not by throwing compute at value functions, but by borrowing the self-supervised, representation-learning paradigms that made the rest of deep learning work
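
The "linear vs. quadratic" point about depth and width can be checked with a little arithmetic. The sketch below is an illustration, not the team's code: it assumes a plain stack of dense hidden layers, each mapping `width` to `width` with a bias, and it ignores input/output layers and normalization parameters. The helper `mlp_hidden_params` and the sizes used are hypothetical.

```python
def mlp_hidden_params(depth: int, width: int) -> int:
    """Count parameters in `depth` dense hidden layers of size width -> width.

    Each layer has a width x width weight matrix plus a bias of length width.
    Assumption: input/output layers and normalization parameters are ignored.
    """
    return depth * (width * width + width)


base = mlp_hidden_params(depth=4, width=256)
deeper = mlp_hidden_params(depth=8, width=256)   # double the depth
wider = mlp_hidden_params(depth=4, width=512)    # double the width

print(f"base   (depth=4, width=256): {base:,} parameters")
print(f"deeper (depth=8, width=256): {deeper / base:.2f}x the base")
print(f"wider  (depth=4, width=512): {wider / base:.2f}x the base")
```

Under these assumptions, doubling depth doubles the parameter count, while doubling width roughly quadruples it, because the weight matrices grow with the square of the width.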