![[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI](https://assets.flightcast.com/V2Uploads/nvaja2542wefzb8rjg5f519m/01K4D8FB4MNA071BM5ZDSMH34N/square.jpg)
From pre-training data curation to shipping GPT-4o, o1, o3, and now GPT-5 thinking and the shopping model, Josh McGrath has lived through the full arc of OpenAI's post-training evolution—from the PPO vs DPO debates of 2023 to today's RLVR era, where the real innovation isn't optimization methods but data quality, signal trust, and token efficiency.
We sat down with Josh at NeurIPS 2025 to dig into the state of post-training heading into 2026: why RLHF and RLVR are both just policy gradient methods (the difference is the input data, not the math), how GRPO from DeepSeek Math was underappreciated as a shift toward more trustworthy reward signals (math answers you can verify vs. human preferences you can't), why token efficiency matters more than wall-clock time (GPT-5 to 5.1 bumped evals and slashed tokens), and how Codex has changed his workflow so much he feels "trapped" by 40-minute design sessions.
This episode features OpenAI post-training researcher Josh for a deep dive into the technical arc from GPT-4.1 to GPT-5.1. The conversation covers the core challenges of post-training, the shopping model's novel interaction design, the limits of context-window scaling, and how resources are allocated between pre-training and post-training. Josh also shares what the job is really like as a researcher, from the complexity of debugging reinforcement learning systems late at night to how Codex has completely changed the way he writes code.
Josh's account points to a core reality: AI development has entered deep water, where engineering challenges matter as much as research innovation. From the shopping model's interruptible interactions, to probing the limits of the context window, to balancing resources between pre-training and post-training, every decision shapes how the technology actually lands. Perhaps the deepest takeaway from the conversation is that on the road to AGI, staying humble about the unknown and open to cross-disciplinary collaboration matters more than any single technical breakthrough.
Who should listen: AI researchers, engineers, product managers, and anyone interested in how large models are evolving. Through a frontline researcher's firsthand experience, you'll come away understanding the challenges, trade-offs, and future directions at the AI frontier.