pith. machine review for the scientific record.

arxiv: 2605.11505 · v2 · submitted 2026-05-12 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Selective Off-Policy Reference Tuning with Plan Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords reinforcement learning · reasoning models · off-policy tuning · plan guidance · token weighting · language model training · GRPO · failure handling

The pith

SORT derives a plan from the reference solution to selectively weight tokens that become more predictable under plan conditioning, turning all-failed rollouts into structure-aware learning signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses a key limitation of reinforcement learning methods for language model reasoning: hard prompts produce no successful rollouts, and standard approaches like GRPO stall. It introduces Selective Off-Policy Reference Tuning (SORT) to repair those failures without altering how rollouts are generated. The method extracts a plan from the reference solution, measures how each token's probability changes when conditioned on the plan, and assigns higher weight to tokens that gain predictability. This yields targeted updates focused on structural elements instead of uniform imitation of the reference. If correct, the approach would allow continued training progress on difficult reasoning tasks by converting otherwise useless all-wrong cases into useful signals.

Core claim

SORT derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation.

What carries the argument

The selective weighting rule that upweights tokens whose probability increases under conditioning on a plan derived from the reference solution.
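
To make that rule concrete, here is a minimal sketch assuming the ratio form w_t = p(x_t | prompt + plan) / p(x_t | prompt) quoted in the simulated rebuttal below. The clipping bound and tensor layout are illustrative assumptions, not the paper's specification.

    import torch

    def sort_token_weights(logp_base: torch.Tensor,
                           logp_plan: torch.Tensor,
                           clip: float = 5.0) -> torch.Tensor:
        """Per-token weights over a reference solution for an all-failed prompt.

        logp_base: log p(x_t | prompt) for each reference token
        logp_plan: log p(x_t | prompt + plan) for the same tokens
        The ratio is formed in log space and clipped (the bound is an
        assumption) so that no single token dominates the update.
        """
        log_ratio = torch.clamp(logp_plan - logp_base, min=-clip, max=clip)
        return torch.exp(log_ratio)

    # Toy check: tokens that become more predictable under the plan get w > 1.
    base = torch.tensor([-2.0, -1.5, -4.0])
    plan = torch.tensor([-1.0, -1.6, -2.0])
    print(sort_token_weights(base, plan))  # tensor([2.7183, 0.9048, 7.3891])

Tokens whose probability is unchanged or reduced by the plan stay at or below weight 1, which is what makes the update selective rather than uniform imitation.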

If this is right

  • Performance improves on reasoning benchmarks where all sampled rollouts fail.
  • Gains appear across three different model backbones and eight benchmarks, with larger benefits for weaker models.
  • The method requires no change to the rollout sampling process.
  • It outperforms both standard GRPO and existing guidance baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • SORT could be combined with standard success-based updates to handle mixed success-failure batches in a single training loop.
  • The plan-comparison idea might generalize to other structured generation tasks where a reference outline is available during training.
  • If plans can be approximated without full references, the approach might reduce dependence on high-quality labeled solutions at scale.

Load-bearing premise

A plan derived from the reference solution supplies conditioning that picks out genuinely structural tokens, rather than leaking the reference answer or introducing artifacts of the plan-extraction step.

What would settle it

Apply SORT using deliberately incorrect or random plans instead of reference-derived plans and check whether the performance gains on reasoning benchmarks disappear.
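
A few lines make the expected outcome of that control concrete, assuming the same log-ratio weighting sketched above. If the plan carries no real structure, the log-ratios should sit near zero and the weights near 1, erasing the selective signal; the numbers here are illustrative, not measurements.

    import torch

    # Log-probs of the same reference tokens without a plan vs. with a
    # shuffled, uninformative plan (illustrative values).
    logp_base     = torch.tensor([-2.0, -1.5, -4.0])
    logp_shuffled = torch.tensor([-2.1, -1.4, -3.9])

    # Under an uninformative plan the weights hover near 1, so the repair
    # update degenerates toward uniform imitation of the reference.
    weights = torch.exp(logp_shuffled - logp_base)
    print(weights)  # ~tensor([0.9048, 1.1052, 1.1052])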

Figures

Figures reproduced from arXiv: 2605.11505 by Duc Anh Le, Linh Ngo Van, Thien Huu Nguyen, Tien-Phat Nguyen, Trung Le.

Figure 1. Training dynamics across 500 optimization steps. Rows show reward, response length, and … (figures/full_fig_p008_1.png)
Figure 2. Token-level salience on a zero-reward reference trace. Colors indicate the plan-conditioned … (figures/full_fig_p009_2.png)
Figure 3. Number of all-wrong rollout groups during SORT training, where every sampled trajectory … (figures/full_fig_p023_3.png)
Figure 4. SFT-path loss trajectories across 500 training steps for the auxiliary reference-learning … (figures/full_fig_p024_4.png)
Figure 5. Full token salience visualization, part 1. (figures/full_fig_p025_5.png)
Figure 6. Full token salience visualization, part 2. (figures/full_fig_p026_6.png)
Figure 7. Full token salience visualization, part 3. (figures/full_fig_p027_7.png)
Figure 8. Full token salience visualization, part 4. (figures/full_fig_p028_8.png)
Original abstract

Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with largest gains on weaker models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Selective Off-Policy Reference Tuning (SORT) as an extension to GRPO-style RL for reasoning. For hard prompts where all rollouts fail, SORT derives a plan from the reference solution, compares token probabilities under prompt-only vs. prompt-plus-plan conditioning, and upweights tokens that become more predictable under the plan. This is claimed to convert uniform failure signals into selective, structure-aware updates without altering rollout generation. Empirical results across three backbones and eight reasoning benchmarks show gains over GRPO and guidance baselines, largest on weaker models.

Significance. If the weighting mechanism can be shown to extract general structure without reference leakage, SORT would usefully extend RLVR applicability to difficult cases that currently yield zero reward. The multi-backbone evaluation is a strength, as is the focus on turning complete failures into usable signals rather than discarding them.

major comments (2)
  1. [Abstract and §3, method description] The weighting rule is stated only qualitatively ('gives higher weight to tokens that become more predictable under plan conditioning'). No explicit equation, pseudocode, or definition of the plan injection (tokenization, masking, or conditioning format) is provided. Without this, it cannot be verified whether the ratio reduces to a fitted hyperparameter or introduces direct reference leakage, as the skeptic note flags.
  2. [§4, experiments] The abstract asserts gains on eight benchmarks but supplies no quantitative tables, error bars, or controls for plan quality. The central claim that the method yields 'structure-aware' rather than reference-imitation signals therefore lacks the load-bearing empirical support needed to assess whether the improvement is robust or artifactual.
minor comments (1)
  1. [Abstract] 'eight reasoning benchmarks' is mentioned without naming them or giving effect sizes; a one-sentence summary of the largest gains would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each point below and will update the manuscript accordingly to improve clarity and empirical support.

Point-by-point responses
  1. Referee: [Abstract and §3, method description] The weighting rule is stated only qualitatively ('gives higher weight to tokens that become more predictable under plan conditioning'). No explicit equation, pseudocode, or definition of the plan injection (tokenization, masking, or conditioning format) is provided. Without this, it cannot be verified whether the ratio reduces to a fitted hyperparameter or introduces direct reference leakage, as the skeptic note flags.

    Authors: We agree the description is insufficiently precise. In the revision we will insert the explicit weighting formula w_t = p(x_t | prompt + plan) / p(x_t | prompt) (with a log-space implementation for stability), together with pseudocode for plan extraction from the reference and the exact conditioning format used during probability computation. We will also add a short paragraph clarifying that the plan is used only for off-policy weighting on failed rollouts and does not appear in the policy gradient itself, thereby avoiding direct imitation (a sketch of this update appears after these responses). revision: yes

  2. Referee: [§4, experiments] The abstract asserts gains on eight benchmarks but supplies no quantitative tables, error bars, or controls for plan quality. The central claim that the method yields 'structure-aware' rather than reference-imitation signals therefore lacks the load-bearing empirical support needed to assess whether the improvement is robust or artifactual.

    Authors: The current §4 contains summary tables of accuracy improvements across the eight benchmarks and three backbones. To address the concern, we will expand these tables to include per-benchmark numbers with standard-error bars from three random seeds and add a new ablation that replaces reference-derived plans with random or noisy plans. The latter control directly tests whether gains require coherent plan structure rather than mere reference exposure. revision: yes
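
Read together, the two responses imply a repair loss of roughly the following shape. This is a hedged reconstruction from the rebuttal's description (log-space ratio weights, plan kept out of the gradient path), not the paper's published objective; the clipping bound, mean reduction, and detach placement are assumptions.

    import torch

    def sort_repair_loss(logp_base: torch.Tensor,
                         logp_plan: torch.Tensor,
                         clip: float = 5.0) -> torch.Tensor:
        """Weighted NLL over reference tokens of an all-failed prompt.

        logp_base: log p_theta(x_t | prompt, x_<t), carries gradients
        logp_plan: log p_theta(x_t | prompt + plan, x_<t), weights only
        """
        log_ratio = torch.clamp(logp_plan - logp_base, min=-clip, max=clip)
        # Weights are constants to the optimizer: the plan decides *which*
        # tokens get imitated but never appears in the gradient itself.
        weights = torch.exp(log_ratio).detach()
        return -(weights * logp_base).mean()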

Circularity Check

0 steps flagged

Derivation self-contained; no reduction to inputs by construction

Full rationale

The paper introduces SORT as a repair update that derives a plan from the reference solution and reweights tokens according to the change in their conditional probabilities when the plan is added as conditioning. This weighting rule is defined directly from the probability comparison operation itself rather than being fitted to a target or reduced to a prior self-citation. No equations, uniqueness theorems, or ansatzes are shown that would make the reweighting equivalent to the reference trajectory by construction. The central claim therefore rests on an independent computation performed on the model's own distribution and does not collapse into a tautology or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the method implicitly assumes plan derivation and probability comparison are well-defined but details are absent.

pith-pipeline@v0.9.0 · 5391 in / 1084 out tokens · 37534 ms · 2026-05-14T21:17:08.782334+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 14 internal anchors

  1. [1]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  2. [2]

    DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...

  3. [3]

    Learning to reason under off-policy guidance, 2025

    Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance, 2025. URL https://arxiv.org/abs/2504.14945

  4. [4]

    Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions, 2026

    Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, and Wentao Zhang. Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions, 2026. URL https://arxiv.org/abs/2506.07527

  5. [5]

    Scaf-grpo: Scaffolded group relative policy optimization for enhancing llm reasoning, 2026

    Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, and Jiaya Jia. Scaf-grpo: Scaffolded group relative policy optimization for enhancing llm reasoning, 2026. URL https://arxiv.org/abs/2510.19807

  6. [6]

    Expo: Unlocking hard reasoning with self-explanation-guided reinforcement learning, 2026

    Ruiyang Zhou, Shuozhe Li, Amy Zhang, and Liu Leqi. Expo: Unlocking hard reasoning with self-explanation-guided reinforcement learning, 2026. URL https://arxiv.org/abs/2507.02834

  7. [7]

    A minimalist approach to llm reasoning: from rejection sampling to reinforce, 2025

    Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, and Hanze Dong. A minimalist approach to llm reasoning: from rejection sampling to reinforce, 2025. URL https://arxiv.org/abs/2504.11343

  8. [8]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  9. [9]

    Knapsack rl: Unlocking exploration of llms via optimizing budget allocation

    Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, and Zhi-Quan Luo. Knapsack rl: Unlocking exploration of llms via optimizing budget allocation. URL https://arxiv.org/abs/2509.25849

  11. [11]

    Nudging the boundaries of llm reasoning

    Justin Chih-Yao Chen, Becky Xiangyu Peng, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Mohit Bansal, and Chien-Sheng Wu. Nudging the boundaries of llm reasoning. URL https://arxiv.org/abs/2509.25666

  13. [13]

    Kimi Team, Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong G...

  14. [14]

    On the generalization of sft: A reinforcement learning perspective with reward rectification, 2026

    Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, and Xu Yang. On the generalization of sft: A reinforcement learning perspective with reward rectification, 2026. URL https://arxiv.org/abs/2508.05629

  15. [15]

    No prompt left behind: Exploiting zero-variance prompts in llm reinforcement learning via entropy-guided advantage shaping

    Thanh-Long V. Le, Myeongho Jeon, Kim Trong Vu, Viet Dac Lai, and Eunho Yang. No prompt left behind: Exploiting zero-variance prompts in llm reinforcement learning via entropy-guided advantage shaping, 2026. URL https://arxiv.org/abs/2509.21880. ICLR 2026

  16. [16]

    Adaptive rollout allocation for online reinforcement learning with verifiable rewards, 2026

    Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She, and Viet Anh Nguyen. Adaptive rollout allocation for online reinforcement learning with verifiable rewards, 2026. URL https://arxiv.org/abs/2602.01601. ICLR 2026

  17. [17]

    Self-hinting language models enhance reinforcement learning, 2026

    Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, and Jiang Bian. Self-hinting language models enhance reinforcement learning, 2026. URL https://arxiv.org/abs/2602.03143

  18. [18]

    Adaptive guidance accelerates reinforcement learning of reasoning models, 2025

    Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte, and Sean Hendryx. Adaptive guidance accelerates reinforcement learning of reasoning models, 2025. URL https://arxiv.org/abs/2506.13923

  19. [19]

    Stepwise guided policy optimization: Coloring your incorrect reasoning in grpo, 2026

    Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, and Tianyi Lin. Stepwise guided policy optimization: Coloring your incorrect reasoning in grpo, 2026. URL https://arxiv.org/abs/2505.11595

  20. [20]

    First return, entropy-eliciting explore, 2025

    Tianyu Zheng, Tianshun Xing, Qingshui Gu, Taoran Liang, Xingwei Qu, Xin Zhou, Yizhi Li, Zhoufutu Wen, Chenghua Lin, Wenhao Huang, Qian Liu, Ge Zhang, and Zejun Ma. First return, entropy-eliciting explore, 2025. URL https://arxiv.org/abs/2507.07017

  21. [21]

    Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards, 2025

    Xiaoyuan Liu, Tian Liang, Zhiwei He, Jiahao Xu, Wenxuan Wang, Pinjia He, Zhaopeng Tu, Haitao Mi, and Dong Yu. Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards, 2025. URL https://arxiv.org/abs/2505.13445. NeurIPS 2025

  22. [22]

    Rubric as reward: Decomposing verification signals for logical reasoning in grpo

    Ishaan Gangwani and Aayam Bansal. Rubric as reward: Decomposing verification signals for logical reasoning in grpo. ICLR 2026 Workshop on LLM Reasoning, 2026. URL https://openreview.net/forum?id=LFBnQEGh23

  23. [23]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning, 2026. URL https://arxiv.org/abs/2601.19897

  24. [24]

    Rlpr: Extrapolating rlvr to general domains without verifiers, 2025

    Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, and Tat-Seng Chua. Rlpr: Extrapolating rlvr to general domains without verifiers, 2025. URL https://arxiv.org/abs/2506.18254

  25. [25]

    The lessons of developing process reward models in mathematical reasoning

    Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10495–10516. Association for Computational Linguistics, 2025. doi: 10.18653/v1/202...

  26. [26]

    S-grpo: Early exit via reinforcement learning in reasoning models

    Muzhi Dai, Chenxu Yang, and Qingyi Si. S-grpo: Early exit via reinforcement learning in reasoning models. arXiv preprint arXiv:2505.07686, 2025. URL https://arxiv.org/abs/2505.07686

  27. [27]

    Free process rewards without process labels

    Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. arXiv preprint arXiv:2412.01981, 2024. URL https://arxiv.org/abs/2412.01981

  28. [28]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025. URL https://arxiv.org/abs/2502.01456

  29. [29]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning, 2025. URL https://arxiv.org/abs...

  30. [30]

    Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning, 2025

    Jean Vassoyan, Nathanael Beau, and Roman Plaud. Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning, 2025. URL https://arxiv.org/abs/2502.06533

  31. [31]

    Preserving diversity in supervised fine-tuning of large language models, 2025

    Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Zhi-Quan Luo, and Ruoyu Sun. Preserving diversity in supervised fine-tuning of large language models, 2025. URL https://arxiv.org/abs/2408.16673

  32. [32]

    SED-SFT: Selectively encouraging diversity in supervised fine-tuning, 2026

    Yijie Chen, Yijin Liu, and Fandong Meng. SED-SFT: Selectively encouraging diversity in supervised fine-tuning, 2026. URL https://arxiv.org/abs/2602.07464

  33. [33]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

  34. [34]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  35. [35]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  36. [36]

    Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9, 2024

  37. [37]

    Aime problems and solutions

    MAA Committees. Aime problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions, 2024. Art of Problem Solving

  38. [38]

    Aime problems and solutions

    MAA Committees. Aime problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions, 2025. Art of Problem Solving

  39. [39]

    American mathematics competitions (amc) 2023

    Mathematical Association of America. American mathematics competitions (amc) 2023. https://maa.org, 2023. AMC 8, AMC 10, and AMC 12 competition problems

  40. [40]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks ...

  41. [41]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. O...

  42. [42]

    Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceed...

  43. [43]

    GPQA: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. CoRR, abs/2311.12022, 2023. doi: 10.48550/ARXIV.2311.12022. URL https://doi.org/10.48550/arXiv.2311.12022

  44. [44]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela ...

  45. [45]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, page 1279–1297. ACM, March 2025. doi: 10.1145/3689031.3696075. URL http://dx.doi.org/10.1145/3689031.3696075

  46. [46]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023. URL https://arxiv.org/abs/2309.06180

  47. [47]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

  48. [48]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

  49. [49]

    Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026. URL https://arxiv.org/abs/2603.24472