pith. machine review for the scientific record.

arxiv: 2604.20733 · v1 · submitted 2026-04-22 · 💻 cs.LG

Recognition: unknown

Near-Future Policy Optimization

Pith reviewed 2026-05-10 00:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords Near-Future Policy Optimization · NPO · AutoNPO · RLVR · mixed-policy · reinforcement learning with verifiable rewards · GRPO · policy optimization

The pith

A policy's own near-future checkpoint supplies auxiliary trajectories that are stronger than the current policy yet closer in distribution than any external teacher, raising the RLVR performance ceiling and accelerating convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that, in reinforcement learning with verifiable rewards (RLVR), mixing on-policy exploration with trajectories from a later checkpoint of the same run resolves the core tension of mixed-policy training: off-policy data must be both strong enough to teach something new and close enough to be absorbed. This near-future source is meant to maximize the effective learning signal S = Q/V, trading new knowledge (Q) against variance cost (V) without the distribution shift that an external teacher brings. The idea is validated with two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and with AutoNPO, an adaptive variant that triggers interventions from online training signals and selects the guide checkpoint that maximizes S. On Qwen3-VL-8B-Instruct trained with GRPO, the paper reports higher average performance and faster convergence.

Core claim

Reinforcement learning with verifiable rewards converges faster and reaches a higher ceiling when the auxiliary off-policy trajectories are both high in quality (Q) and low in variance cost (V). A later checkpoint from the same run is claimed to meet both conditions: it is stronger than the current policy, whereas replayed past trajectories are capped in quality, and it is closer in distribution than any external teacher, so it maximizes S = Q/V. NPO implements this mixed-policy scheme with manual interventions at early and late stages, while AutoNPO automates both the triggering and the checkpoint selection from online signals. On Qwen3-VL-8B-Instruct with GRPO, NPO raises average performance from 57.88 to 62.84 and AutoNPO reaches 63.15, improving both the final ceiling and convergence speed.
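
One way to see why auxiliary trajectories can matter under GRPO is a minimal sketch of group-relative advantages, in which each reward is standardized within its rollout group. The function names, the use of the population standard deviation, and the reward values are assumptions made for illustration; how NPO actually weights or clips guide trajectories is not described on this page and is not modeled here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

def build_mixed_group(on_policy_rewards, guide_rewards):
    """Concatenate on-policy and guide rewards and tag each one's source."""
    rewards = list(on_policy_rewards) + list(guide_rewards)
    sources = (["on_policy"] * len(on_policy_rewards)
               + ["guide"] * len(guide_rewards))
    return rewards, sources

# Hypothetical verifiable rewards for one prompt (1 = correct, 0 = incorrect).
on_policy = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # current policy fails every rollout
guide = [1.0, 1.0]                           # hypothetical guide checkpoint succeeds

only_on_policy = group_relative_advantages(on_policy)
rewards, sources = build_mixed_group(on_policy, guide)
mixed = group_relative_advantages(rewards)

print(only_on_policy)  # all zeros: a uniform group carries no learning signal
print(list(zip(sources, mixed)))
```

In this toy group the on-policy rollouts alone produce zero advantages, while adding two successful guide trajectories gives the guide samples positive advantages and the failed rollouts negative ones.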

What carries the argument

Near-Future Policy Optimization (NPO), a mixed-policy scheme that draws auxiliary trajectories from a later checkpoint of the same training run to maximize the ratio S = Q/V of knowledge gain to variance cost.
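
The ratio S = Q/V can be illustrated with a small sketch that scores three candidate trajectory sources. The `Source` class and every number below are hypothetical, chosen only to mirror the qualitative story (external teacher: high Q, high V; replay: low Q, low V); they are not measurements from the paper, and different values could reverse the ordering.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    name: str
    q: float  # quality: how much new knowledge the trajectories carry
    v: float  # variance cost: how far they sit from the current policy

    @property
    def signal(self) -> float:
        """Effective learning signal S = Q / V."""
        if self.v <= 0:
            raise ValueError("V must be positive")
        return self.q / self.v

# Made-up values for illustration only.
candidates = [
    Source("external_teacher", q=0.9, v=3.0),
    Source("past_replay", q=0.1, v=0.5),
    Source("near_future_checkpoint", q=0.5, v=0.8),
]

best = max(candidates, key=lambda s: s.signal)
for s in candidates:
    print(f"{s.name}: S = {s.signal:.3f}")
print("selected:", best.name)
```

The sketch only shows the selection rule; whether a near-future checkpoint really lands in the favorable region is the empirical question the paper has to answer.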

Load-bearing premise

A later checkpoint from the same training run is simultaneously stronger than the current policy (and than quality-capped replay of past trajectories) and closer in distribution than external teachers, so that it maximizes S = Q/V without introducing new distribution shifts or instabilities.

What would settle it

Running the same GRPO training on Qwen3-VL-8B-Instruct with no auxiliary trajectories, or with external-teacher mixing, and finding that NPO or AutoNPO yields neither a higher final average score nor faster convergence than those controls would falsify the central claim.

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the strong enough (higher Q, more new knowledge to learn) and close enough (lower V, more readily absorbed) conditions required to maximize the effective learning signal S = Q/V. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes S. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Near-Future Policy Optimization (NPO), a mixed-policy RLVR method that sources auxiliary trajectories from a later checkpoint of the same training run to balance higher quality (Q, more new knowledge) against lower variance (V, more readily absorbed) via the effective signal S = Q/V. It validates the idea via two manual interventions (early-stage bootstrapping and late-stage plateau breakthrough) and proposes AutoNPO, an adaptive variant that triggers interventions and selects the guide checkpoint online to maximize S. On Qwen3-VL-8B-Instruct with GRPO, NPO raises average performance from 57.88 to 62.84 while AutoNPO reaches 63.15, claiming both a higher performance ceiling and accelerated convergence.
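
The two decisions attributed to AutoNPO, when to intervene and which guide checkpoint to use, can be sketched as follows. This is not the paper's algorithm: the plateau rule, the `PlateauTrigger` and `select_guide` names, the thresholds, and all numbers are assumptions for illustration, and the sketch does not model how candidate checkpoints become available during a run.

```python
from collections import deque

class PlateauTrigger:
    """Fire when the mean pass rate stops improving across two adjacent windows."""

    def __init__(self, window: int = 3, min_gain: float = 0.01):
        self.window = window
        self.min_gain = min_gain
        self.history = deque(maxlen=2 * window)

    def update(self, pass_rate: float) -> bool:
        self.history.append(pass_rate)
        if len(self.history) < 2 * self.window:
            return False  # not enough evidence yet
        values = list(self.history)
        older = sum(values[: self.window]) / self.window
        newer = sum(values[self.window:]) / self.window
        return (newer - older) < self.min_gain

def select_guide(candidates: dict[str, tuple[float, float]]) -> str:
    """Pick the checkpoint whose (Q, V) estimate maximizes S = Q / V."""
    return max(candidates, key=lambda name: candidates[name][0] / candidates[name][1])

# Hypothetical training curve that rises and then flattens.
pass_rates = [0.30, 0.36, 0.41, 0.45, 0.47, 0.48, 0.48, 0.48, 0.48, 0.48, 0.48]
trigger = PlateauTrigger(window=3, min_gain=0.01)
fired_at = [step for step, p in enumerate(pass_rates) if trigger.update(p)]

# Hypothetical (Q, V) estimates per candidate checkpoint.
guide = select_guide({"ckpt_a": (0.2, 0.4), "ckpt_b": (0.5, 0.8), "ckpt_c": (0.7, 2.0)})
print(fired_at, guide)
```

Comparing two adjacent windows rather than consecutive steps keeps a single noisy evaluation from triggering an intervention; the window size and minimum gain would need tuning in practice.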

Significance. If the results prove robust, NPO offers a simple, self-contained mechanism to improve RLVR efficiency by exploiting the model's own near-future states rather than external teachers or replay buffers. The concrete gains on a modern VLM and the adaptive AutoNPO design are practical strengths that could influence post-training recipes for large models.

major comments (2)
  1. [Method] Method section (NPO early-stage bootstrapping): The central mechanism treats a later checkpoint as simultaneously higher-Q and lower-V, yet the manuscript does not specify how this checkpoint is obtained without lookahead; generating the required trajectories either demands a separate forward run or post-hoc replay, which risks converting the claimed online acceleration into offline data mixing and negates the convergence benefit.
  2. [Experiments] Experiments section: The reported improvements (57.88 baseline to 62.84 NPO to 63.15 AutoNPO) are presented without variance across runs, statistical significance tests, exact baseline configurations, total compute accounting, or concrete definitions of how Q and V are measured, leaving the attribution of gains to NPO versus uncontrolled factors unverified and load-bearing for the ceiling and convergence claims.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'two manual interventions' is used without a one-sentence gloss of what they entail, which would improve immediate readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications and commit to revisions that strengthen the presentation without altering the core claims.

point-by-point responses
  1. Referee: [Method] Method section (NPO early-stage bootstrapping): The central mechanism treats a later checkpoint as simultaneously higher-Q and lower-V, yet the manuscript does not specify how this checkpoint is obtained without lookahead; generating the required trajectories either demands a separate forward run or post-hoc replay, which risks converting the claimed online acceleration into offline data mixing and negates the convergence benefit.

    Authors: We appreciate this observation. The manual early-stage bootstrapping experiment serves as a controlled validation of the NPO concept by using a checkpoint from a continued training run to demonstrate the value of near-future trajectories. This setup does involve lookahead for the purpose of the proof-of-concept experiment. In contrast, the AutoNPO algorithm is designed to be fully online: it uses real-time training signals to decide when to intervene and selects the guide checkpoint from the ongoing run (e.g., recent saved states) without a separate forward pass or post-hoc mixing. We will revise the Method section to explicitly distinguish the manual interventions from the AutoNPO procedure, add pseudocode clarifying checkpoint management within a single training run, and emphasize that the online acceleration claim applies to AutoNPO.
    revision: yes

  2. Referee: [Experiments] Experiments section: The reported improvements (57.88 baseline to 62.84 NPO to 63.15 AutoNPO) are presented without variance across runs, statistical significance tests, exact baseline configurations, total compute accounting, or concrete definitions of how Q and V are measured, leaving the attribution of gains to NPO versus uncontrolled factors unverified and load-bearing for the ceiling and convergence claims.

    Authors: We agree that additional statistical and methodological details are needed to support the claims. In the revised manuscript, we will report results with standard deviations across multiple independent runs, include statistical significance tests (e.g., paired t-tests) against the baseline, specify exact GRPO hyperparameters and baseline configurations, provide total compute accounting (GPU-hours and approximate FLOPs), and give precise definitions and measurement procedures for Q (e.g., via reward improvement or new knowledge metrics) and V (e.g., via distributional distance or absorption indicators). These additions will strengthen attribution of the observed gains to NPO.
    revision: yes
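
The run-to-run reporting promised in the second response could look like the sketch below, which computes a mean ± standard deviation and a paired t statistic using only the Python standard library. The function name and the per-seed scores are placeholders invented for illustration; they are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(baseline, treated):
    """t statistic for paired samples: mean difference over its standard error."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    # Raises ZeroDivisionError if every paired difference is identical.
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Placeholder scores for four seeds; NOT results from the paper.
baseline_runs = [50.0, 52.0, 51.0, 49.0]
treated_runs = [53.0, 54.0, 55.0, 52.0]

t_stat = paired_t_statistic(baseline_runs, treated_runs)
summary = f"{mean(treated_runs):.2f} ± {stdev(treated_runs):.2f}"
print(f"treated: {summary}, paired t = {t_stat:.3f}, df = {len(baseline_runs) - 1}")
```

The t statistic is compared against a Student t distribution with n - 1 degrees of freedom; pairing is only meaningful when each baseline and treated run share a seed or data order.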

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines the effective learning signal as S = Q/V and hypothesizes that a later checkpoint from the same run raises Q relative to replayed past trajectories while lowering V relative to external teachers. This hypothesis is tested empirically via manual interventions and the AutoNPO selector that chooses the checkpoint maximizing S, with reported gains on Qwen3-VL-8B-Instruct under GRPO. No step reduces by construction to a fitted parameter, self-citation, or renamed input; the core claim remains an independent empirical assertion rather than a tautology. The lookahead dependency is a practical scheduling issue, not a logical reduction of the stated derivation to its inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard RL assumptions about value functions and trajectory distributions plus one domain-specific premise that near-future checkpoints naturally satisfy the Q/V trade-off; no new physical entities or ad-hoc constants are introduced beyond checkpoint timing.

free parameters (1)
  • guide checkpoint selection
    Choice of which later checkpoint to use, either manually or by maximizing S in AutoNPO; this timing parameter directly affects the reported gains.
axioms (1)
  • domain assumption: a later checkpoint from the same run has higher Q than the current policy and lower V than external sources
    Invoked to justify why the near-future source outperforms external or replay sources; appears in the motivation for balancing quality against variance cost.

pith-pipeline@v0.9.0 · 5596 in / 1539 out tokens · 41705 ms · 2026-05-10T00:47:40.970139+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

Algorithm 1 AutoNPO: adaptive intervention for near-future policy optimization
Require: dataset D, base policy π^(0), mistake threshold τ_err, confirmation pass-rate thres...