arxiv: 2604.20733 · v1 · submitted 2026-04-22 · 💻 cs.LG


Near-Future Policy Optimization

Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, Jiaqi Wang

Pith reviewed 2026-05-10 00:47 UTC · model grok-4.3

classification 💻 cs.LG

keywords Near-Future Policy Optimization · NPO · AutoNPO · RLVR · mixed-policy · reinforcement learning with verifiable rewards · GRPO · policy optimization


The pith

A policy's own near-future checkpoint supplies auxiliary trajectories that are both higher-quality and lower-variance than external teachers or past replays, raising RLVR performance ceilings and accelerating convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that in reinforcement learning with verifiable rewards, mixing on-policy exploration with trajectories from the model's own later checkpoint addresses the core tension of mixed-policy training: off-policy data must be simultaneously strong enough and close enough. This near-future source is meant to maximize the effective learning signal by balancing new knowledge against ease of absorption, without the distribution shift that an external teacher introduces. The approach is validated through two timed manual interventions (early-stage bootstrapping and late-stage plateau-breaking), plus an adaptive AutoNPO variant that monitors training signals to select the guide checkpoint. On the evaluated vision-language model, Qwen3-VL-8B-Instruct trained with GRPO, the method raises average performance from 57.88 to 62.84 (NPO) and 63.15 (AutoNPO) while speeding training progress.

Core claim

Reinforcement learning with verifiable rewards accelerates when the off-policy trajectories mixed into training are both higher in quality (Q) and lower in variance cost (V) than the alternatives. A later checkpoint from the same run meets both conditions: it is stronger than the current policy, and therefore not capped in quality the way replayed past trajectories are, yet it is closer in distribution than any external teacher, which directly raises the effective learning signal S = Q/V. NPO implements this mixed-policy scheme with manual interventions at early and late stages, while AutoNPO automates triggering and checkpoint selection from online signals. On Qwen3-VL-8B-Instruct with GRPO, NPO raises average performance from 57.88 to 62.84 and AutoNPO reaches 63.15, improving both final ceiling and convergence speed.
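To make the mixed-policy idea concrete, here is a minimal sketch of GRPO-style group-relative advantages computed over a group that contains both on-policy rollouts and guide trajectories. The standardization within a group is the usual GRPO recipe; the mixing rule itself (simply appending guide rewards to the group) and the names `mixed_group` and `group_relative_advantages` are assumptions for illustration, since this page does not describe how NPO actually combines the two sources.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward within its group (GRPO-style advantage).

    Uses the population standard deviation; real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mixed_group(on_policy_rewards, guide_rewards):
    """Hypothetical mixing rule: append guide-trajectory rewards to the group."""
    rewards = list(on_policy_rewards) + list(guide_rewards)
    tags = ["on_policy"] * len(on_policy_rewards) + ["guide"] * len(guide_rewards)
    return rewards, tags


# Every on-policy rollout fails (reward 0), so the group alone carries no signal.
on_policy_only = group_relative_advantages([0.0, 0.0, 0.0])

# Adding one successful guide trajectory restores a non-zero learning signal.
rewards, tags = mixed_group([0.0, 0.0, 0.0], [1.0])
advantages = group_relative_advantages(rewards)
```

In this toy case the all-failure group yields zero advantages everywhere, while the mixed group gives the guide trajectory a positive advantage and the failed rollouts negative ones. That is one way to read the claim that suitable off-policy trajectories help when on-policy exploration stalls.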

What carries the argument

Near-Future Policy Optimization (NPO), a mixed-policy scheme that draws auxiliary trajectories from a later checkpoint of the same training run to maximize the ratio S = Q/V of knowledge gain to variance cost.
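The ratio S = Q/V can be illustrated with a small selection routine. This is only a sketch: the page does not say how Q and V are measured, so the numeric values below are made-up placeholders chosen to mirror the qualitative story (external teacher: high Q, high V; past replay: low Q, low V), and `Candidate`, `effective_signal`, and `select_guide` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    q: float  # hypothetical quality estimate (more new knowledge = higher)
    v: float  # hypothetical variance cost (harder to absorb = higher), must be > 0


def effective_signal(candidate):
    """Return S = Q / V for one trajectory source."""
    if candidate.v <= 0:
        raise ValueError("variance cost must be positive")
    return candidate.q / candidate.v


def select_guide(candidates):
    """Pick the trajectory source with the largest S = Q / V."""
    return max(candidates, key=effective_signal)


# Placeholder numbers for illustration only; none of them come from the paper.
candidates = [
    Candidate("external_teacher", q=0.9, v=3.0),
    Candidate("past_replay", q=0.2, v=0.5),
    Candidate("near_future_checkpoint", q=0.6, v=0.8),
]
best = select_guide(candidates)
```

With these placeholder values the near-future checkpoint wins because it pairs moderate quality with low variance cost; different measurements of Q and V could of course change the ranking.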

Load-bearing premise

A later checkpoint from the same training run is simultaneously stronger and closer than external teachers or past trajectories, maximizing S = Q/V without introducing new distribution shifts or instabilities.

What would settle it

Running the same GRPO training on Qwen3-VL-8B-Instruct but replacing NPO or AutoNPO with either no auxiliary trajectories or external-teacher mixing, and observing no gain or a drop in final average score or slower convergence, would falsify the central claim.

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the strong enough (higher $Q$, more new knowledge to learn) and close enough (lower $V$, more readily absorbed) conditions required to maximize the effective learning signal $\mathcal{S} = Q/V$. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes $\mathcal{S}$. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

NPO's near-future self-checkpoint idea balances quality and closeness better than prior mixes, but the early bootstrapping gains likely come from offline replay rather than true online acceleration.


The main thing here is that NPO pulls auxiliary trajectories from a later checkpoint of the same training run, which the authors position as stronger than past data yet closer than external teachers, and they report a clear lift on Qwen3-VL-8B-Instruct with GRPO. AutoNPO adds an adaptive selector that picks the checkpoint maximizing S = Q/V from online signals. That framing is new relative to the external-teacher and past-replay baselines mentioned in the abstract, and the manual interventions for early bootstrapping and late plateaus show the idea can be applied at different stages. The performance numbers (57.88 to 62.84 with NPO, then 63.15 with AutoNPO) are the most concrete part and suggest the method can raise the ceiling while speeding convergence on this task. The paper does a decent job spelling out the Q/V tradeoff without overclaiming theoretical novelty.

The soft spots are mostly around evidence and setup. The abstract gives no variance across runs, no full baseline tables, and no description of how Q and V are actually measured or how the future checkpoints are generated without extra forward passes. The stress-test point holds up: early-stage bootstrapping requires a checkpoint that only exists after more training, so the reported speed-up may reflect post-hoc data mixing rather than an online process that finishes faster. AutoNPO's online trigger helps a bit but does not remove the dependency.

This is aimed at people doing RLVR post-training on large models who already run GRPO-style loops and want a low-overhead way to improve trajectory quality. A reader focused on practical policy optimization will get usable ideas even if the claims need more backing. It deserves a serious referee to check the experimental controls and test whether the gains survive when the lookahead is removed. I would send it for peer review.

Referee Report

2 major / 1 minor

Summary. The paper introduces Near-Future Policy Optimization (NPO), a mixed-policy RLVR method that sources auxiliary trajectories from a later checkpoint of the same training run to balance higher quality (Q, more new knowledge) against lower variance (V, more readily absorbed) via the effective signal S = Q/V. It validates the idea via two manual interventions (early-stage bootstrapping and late-stage plateau breakthrough) and proposes AutoNPO, an adaptive variant that triggers interventions and selects the guide checkpoint online to maximize S. On Qwen3-VL-8B-Instruct with GRPO, NPO raises average performance from 57.88 to 62.84 while AutoNPO reaches 63.15, claiming both a higher performance ceiling and accelerated convergence.
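As a rough illustration of what "triggering interventions from online training signals" could look like, the sketch below fires when the mean pass rate stops improving between two consecutive windows. This is not the paper's AutoNPO procedure, whose actual criteria are not reproduced on this page; the class name `PlateauTrigger`, the window size, the `min_gain` threshold, and the pass-rate series are all assumptions.

```python
from collections import deque


class PlateauTrigger:
    """Hypothetical trigger: fire when the windowed mean pass rate stalls."""

    def __init__(self, window=3, min_gain=0.01):
        self.window = window
        self.min_gain = min_gain
        # Keep only the two most recent windows of observations.
        self.history = deque(maxlen=2 * window)

    def update(self, pass_rate):
        """Record one observation; return True if a plateau is detected."""
        self.history.append(pass_rate)
        if len(self.history) < 2 * self.window:
            return False  # not enough data to compare two windows yet
        values = list(self.history)
        older = sum(values[: self.window]) / self.window
        newer = sum(values[self.window:]) / self.window
        return (newer - older) < self.min_gain


# Placeholder pass-rate curve: steady improvement, then a flat stretch.
trigger = PlateauTrigger(window=3, min_gain=0.01)
signals = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.55, 0.55, 0.55, 0.55, 0.55]
fired_at = [step for step, s in enumerate(signals) if trigger.update(s)]
```

A windowed comparison is used here rather than a step-to-step difference so that a single noisy evaluation does not trigger an intervention; a real system would likely need to tune or replace this rule.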

Significance. If the results prove robust, NPO offers a simple, self-contained mechanism to improve RLVR efficiency by exploiting the model's own near-future states rather than external teachers or replay buffers. The concrete gains on a modern VLM and the adaptive AutoNPO design are practical strengths that could influence post-training recipes for large models.

major comments (2)

[Method] Method section (NPO early-stage bootstrapping): The central mechanism treats a later checkpoint as simultaneously higher-Q and lower-V, yet the manuscript does not specify how this checkpoint is obtained without lookahead; generating the required trajectories either demands a separate forward run or post-hoc replay, which risks converting the claimed online acceleration into offline data mixing and negates the convergence benefit.
[Experiments] Experiments section: The reported improvements (57.88 baseline to 62.84 NPO to 63.15 AutoNPO) are presented without variance across runs, statistical significance tests, exact baseline configurations, total compute accounting, or concrete definitions of how Q and V are measured, leaving the attribution of gains to NPO versus uncontrolled factors unverified and load-bearing for the ceiling and convergence claims.

minor comments (1)

[Abstract] Abstract: The phrase 'two manual interventions' is used without a one-sentence gloss of what they entail, which would improve immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and commit to revisions that strengthen the presentation without altering the core claims.


Referee: [Method] Method section (NPO early-stage bootstrapping): The central mechanism treats a later checkpoint as simultaneously higher-Q and lower-V, yet the manuscript does not specify how this checkpoint is obtained without lookahead; generating the required trajectories either demands a separate forward run or post-hoc replay, which risks converting the claimed online acceleration into offline data mixing and negates the convergence benefit.

Authors: We appreciate this observation. The manual early-stage bootstrapping experiment serves as a controlled validation of the NPO concept by using a checkpoint from a continued training run to demonstrate the value of near-future trajectories. This setup does involve lookahead for the purpose of the proof-of-concept experiment. In contrast, the AutoNPO algorithm is designed to be fully online: it uses real-time training signals to decide when to intervene and selects the guide checkpoint from the ongoing run (e.g., recent saved states) without a separate forward pass or post-hoc mixing. We will revise the Method section to explicitly distinguish the manual interventions from the AutoNPO procedure, add pseudocode clarifying checkpoint management within a single training run, and emphasize that the online acceleration claim applies to AutoNPO. revision: yes
Referee: [Experiments] Experiments section: The reported improvements (57.88 baseline to 62.84 NPO to 63.15 AutoNPO) are presented without variance across runs, statistical significance tests, exact baseline configurations, total compute accounting, or concrete definitions of how Q and V are measured, leaving the attribution of gains to NPO versus uncontrolled factors unverified and load-bearing for the ceiling and convergence claims.

Authors: We agree that additional statistical and methodological details are needed to support the claims. In the revised manuscript, we will report results with standard deviations across multiple independent runs, include statistical significance tests (e.g., paired t-tests) against the baseline, specify exact GRPO hyperparameters and baseline configurations, provide total compute accounting (GPU-hours and approximate FLOPs), and give precise definitions and measurement procedures for Q (e.g., via reward improvement or new knowledge metrics) and V (e.g., via distributional distance or absorption indicators). These additions will strengthen attribution of the observed gains to NPO. revision: yes
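The statistical reporting promised above can be sketched with the standard library alone. The function below computes the paired t statistic over per-seed score differences, i.e. the mean difference divided by its standard error, with n - 1 degrees of freedom. The per-seed scores are synthetic placeholders, not results from the paper, and `paired_t_statistic` is a hypothetical helper name; converting the statistic to a p-value needs the t distribution, which a library routine such as `scipy.stats.ttest_rel` provides.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(baseline, treated):
    """Return (t, degrees_of_freedom) for paired per-seed scores."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Synthetic placeholder scores for four seeds (NOT numbers from the paper).
baseline_scores = [1.0, 2.0, 3.0, 4.0]
treated_scores = [2.0, 2.5, 4.5, 5.0]

t_stat, dof = paired_t_statistic(baseline_scores, treated_scores)
spread = stdev(treated_scores)  # per-method spread across seeds, as would be reported
```

Pairing by seed matters here: it removes seed-to-seed variation shared by both methods, so the test targets the difference attributable to the method rather than to initialization.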

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper defines the effective learning signal as S = Q/V and hypothesizes that a later checkpoint from the same run simultaneously raises Q while lowering V relative to external teachers or past trajectories. This hypothesis is tested empirically via manual interventions and the AutoNPO selector that chooses the checkpoint maximizing S, with reported gains on Qwen3-VL-8B-Instruct under GRPO. No step reduces by construction to a fitted parameter, self-citation, or renamed input; the core claim remains an independent empirical assertion rather than a tautology. The lookahead dependency is a practical scheduling issue, not a logical reduction of the stated derivation to its inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard RL assumptions about value functions and trajectory distributions plus one domain-specific premise that near-future checkpoints naturally satisfy the Q/V trade-off; no new physical entities or ad-hoc constants are introduced beyond checkpoint timing.

free parameters (1)

guide checkpoint selection
Choice of which later checkpoint to use, either manually or by maximizing S in AutoNPO; this timing parameter directly affects the reported gains.

axioms (1)

domain assumption: Later checkpoints from the same run have both higher Q and lower V relative to the current policy
Invoked to justify why the near-future source outperforms external or replay sources; appears in the motivation for balancing quality against variance cost.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Co-Evolving Policy Distillation
cs.LG · 2026-04 · unverdicted · novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

Algorithm 1 AutoNPO: adaptive intervention for near-future policy optimization
Require: dataset D, base policy π(0), mistake threshold τerr, confirmation pass-rate thres...