pith. machine review for the scientific record.

arxiv: 2605.11505 · v2 · submitted 2026-05-12 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Selective Off-Policy Reference Tuning with Plan Guidance

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:17 UTC · model grok-4.3

classification 💻 cs.AI
keywords reinforcement learning · reasoning models · off-policy tuning · plan guidance · token weighting · language model training · GRPO · failure handling

The pith

SORT derives a plan from the reference solution to selectively weight tokens that become more predictable under plan conditioning, turning all-failed rollouts into structure-aware learning signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses a key limitation of reinforcement learning methods for language model reasoning: hard prompts produce no successful rollouts, and standard approaches like GRPO stall. It introduces Selective Off-Policy Reference Tuning (SORT) to repair those failures without altering how rollouts are generated. The method extracts a plan from the reference solution, measures how each token's probability changes when conditioned on the plan, and assigns higher weight to tokens that gain predictability. This yields targeted updates focused on structural elements instead of uniform imitation of the reference. If correct, the approach would allow continued training progress on difficult reasoning tasks by converting otherwise useless all-wrong cases into useful signals.

Core claim

SORT derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation.

What carries the argument

The selective weighting rule that upweights tokens whose probability increases under conditioning on a plan derived from the reference solution.
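
To make that rule concrete, here is a minimal sketch assuming the ratio form w_t = p(x_t | prompt + plan) / p(x_t | prompt) quoted in the simulated rebuttal below. The clipping bound and tensor layout are illustrative assumptions, not the paper's specification.

    import torch

    def sort_token_weights(logp_base: torch.Tensor,
                           logp_plan: torch.Tensor,
                           clip: float = 5.0) -> torch.Tensor:
        """Per-token weights over a reference solution for an all-failed prompt.

        logp_base: log p(x_t | prompt) for each reference token
        logp_plan: log p(x_t | prompt + plan) for the same tokens
        The ratio is formed in log space and clipped (the bound is an
        assumption) so that no single token dominates the update.
        """
        log_ratio = torch.clamp(logp_plan - logp_base, min=-clip, max=clip)
        return torch.exp(log_ratio)

    # Toy check: tokens that become more predictable under the plan get w > 1.
    base = torch.tensor([-2.0, -1.5, -4.0])
    plan = torch.tensor([-1.0, -1.6, -2.0])
    print(sort_token_weights(base, plan))  # tensor([2.7183, 0.9048, 7.3891])

Tokens whose probability is unchanged or reduced by the plan stay at or below weight 1, which is what makes the update selective rather than uniform imitation.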

If this is right

  • Performance improves on reasoning benchmarks where all sampled rollouts fail.
  • Gains appear across three different model backbones and eight benchmarks, with larger benefits for weaker models.
  • The method requires no change to the rollout sampling process.
  • It outperforms both standard GRPO and existing guidance baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • SORT could be combined with standard success-based updates to handle mixed success-failure batches in a single training loop.
  • The plan-comparison idea might generalize to other structured generation tasks where a reference outline is available during training.
  • If plans can be approximated without full references, the approach might reduce dependence on high-quality labeled solutions at scale.

Load-bearing premise

A plan derived from the reference solution supplies conditioning that picks out genuinely structural tokens, rather than leaking the reference answer or introducing artifacts of the plan-extraction step.

What would settle it

Apply SORT using deliberately incorrect or random plans instead of reference-derived plans and check whether the performance gains on reasoning benchmarks disappear.
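
A few lines make the expected outcome of that control concrete, assuming the same log-ratio weighting sketched above. If the plan carries no real structure, the log-ratios should sit near zero and the weights near 1, erasing the selective signal; the numbers here are illustrative, not measurements.

    import torch

    # Log-probs of the same reference tokens without a plan vs. with a
    # shuffled, uninformative plan (illustrative values).
    logp_base     = torch.tensor([-2.0, -1.5, -4.0])
    logp_shuffled = torch.tensor([-2.1, -1.4, -3.9])

    # Under an uninformative plan the weights hover near 1, so the repair
    # update degenerates toward uniform imitation of the reference.
    weights = torch.exp(logp_shuffled - logp_base)
    print(weights)  # ~tensor([0.9048, 1.1052, 1.1052])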

Figures

Figures reproduced from arXiv: 2605.11505 by Duc Anh Le, Linh Ngo Van, Thien Huu Nguyen, Tien-Phat Nguyen, Trung Le.

Figure 1. Training dynamics across 500 optimization steps. Rows show reward, response length, and … (figures/full_fig_p008_1.png)
Figure 2. Token-level salience on a zero-reward reference trace. Colors indicate the plan-conditioned … (figures/full_fig_p009_2.png)
Figure 3. Number of all-wrong rollout groups during SORT training, where every sampled trajectory … (figures/full_fig_p023_3.png)
Figure 4. SFT-path loss trajectories across 500 training steps for the auxiliary reference-learning … (figures/full_fig_p024_4.png)
Figure 5. Full token salience visualization, part 1. (figures/full_fig_p025_5.png)
Figure 6. Full token salience visualization, part 2. (figures/full_fig_p026_6.png)
Figure 7. Full token salience visualization, part 3. (figures/full_fig_p027_7.png)
Figure 8. Full token salience visualization, part 4. (figures/full_fig_p028_8.png)
Original abstract

Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with largest gains on weaker models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Selective Off-Policy Reference Tuning (SORT) as an extension to GRPO-style RL for reasoning. For hard prompts where all rollouts fail, SORT derives a plan from the reference solution, compares token probabilities under prompt-only vs. prompt-plus-plan conditioning, and upweights tokens that become more predictable under the plan. This is claimed to convert uniform failure signals into selective, structure-aware updates without altering rollout generation. Empirical results across three backbones and eight reasoning benchmarks show gains over GRPO and guidance baselines, largest on weaker models.

Significance. If the weighting mechanism can be shown to extract general structure without reference leakage, SORT would usefully extend RLVR applicability to difficult cases that currently yield zero reward. The multi-backbone evaluation is a strength, as is the focus on turning complete failures into usable signals rather than discarding them.

major comments (2)
  1. [Abstract and §3, method description] The weighting rule is stated only qualitatively ('gives higher weight to tokens that become more predictable under plan conditioning'). No explicit equation, pseudocode, or definition of the plan injection (tokenization, masking, or conditioning format) is provided. Without this, it cannot be verified whether the ratio reduces to a fitted hyperparameter or introduces direct reference leakage, as the skeptic note flags.
  2. [§4, experiments] The abstract asserts gains on eight benchmarks but supplies no quantitative tables, error bars, or controls for plan quality. The central claim that the method yields 'structure-aware' rather than reference-imitation signals therefore lacks the load-bearing empirical support needed to assess whether the improvement is robust or artifactual.
minor comments (1)
  1. [Abstract] 'eight reasoning benchmarks' is mentioned without naming them or giving effect sizes; a one-sentence summary of the largest gains would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each point below and will update the manuscript accordingly to improve clarity and empirical support.

Point-by-point responses
  1. Referee: [Abstract and §3, method description] The weighting rule is stated only qualitatively ('gives higher weight to tokens that become more predictable under plan conditioning'). No explicit equation, pseudocode, or definition of the plan injection (tokenization, masking, or conditioning format) is provided. Without this, it cannot be verified whether the ratio reduces to a fitted hyperparameter or introduces direct reference leakage, as the skeptic note flags.

    Authors: We agree the description is insufficiently precise. In the revision we will insert the explicit weighting formula w_t = p(x_t | prompt + plan) / p(x_t | prompt) (with a log-space implementation for stability), together with pseudocode for plan extraction from the reference and the exact conditioning format used during probability computation. We will also add a short paragraph clarifying that the plan is used only for off-policy weighting on failed rollouts and does not appear in the policy gradient itself, thereby avoiding direct imitation (a sketch of this update appears after these responses). revision: yes

  2. Referee: [§4, experiments] The abstract asserts gains on eight benchmarks but supplies no quantitative tables, error bars, or controls for plan quality. The central claim that the method yields 'structure-aware' rather than reference-imitation signals therefore lacks the load-bearing empirical support needed to assess whether the improvement is robust or artifactual.

    Authors: The current §4 contains summary tables of accuracy improvements across the eight benchmarks and three backbones. To address the concern, we will expand these tables to include per-benchmark numbers with standard-error bars from three random seeds and add a new ablation that replaces reference-derived plans with random or noisy plans. The latter control directly tests whether gains require coherent plan structure rather than mere reference exposure. revision: yes
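
Read together, the two responses imply a repair loss of roughly the following shape. This is a hedged reconstruction from the rebuttal's description (log-space ratio weights, plan kept out of the gradient path), not the paper's published objective; the clipping bound, mean reduction, and detach placement are assumptions.

    import torch

    def sort_repair_loss(logp_base: torch.Tensor,
                         logp_plan: torch.Tensor,
                         clip: float = 5.0) -> torch.Tensor:
        """Weighted NLL over reference tokens of an all-failed prompt.

        logp_base: log p_theta(x_t | prompt, x_<t), carries gradients
        logp_plan: log p_theta(x_t | prompt + plan, x_<t), weights only
        """
        log_ratio = torch.clamp(logp_plan - logp_base, min=-clip, max=clip)
        # Weights are constants to the optimizer: the plan decides *which*
        # tokens get imitated but never appears in the gradient itself.
        weights = torch.exp(log_ratio).detach()
        return -(weights * logp_base).mean()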

Circularity Check

0 steps flagged

Derivation self-contained; no reduction to inputs by construction

Full rationale

The paper introduces SORT as a repair update that derives a plan from the reference solution and reweights tokens according to the change in their conditional probabilities when the plan is added as conditioning. This weighting rule is defined directly from the probability comparison operation itself rather than being fitted to a target or reduced to a prior self-citation. No equations, uniqueness theorems, or ansatzes are shown that would make the reweighting equivalent to the reference trajectory by construction. The central claim therefore rests on an independent computation performed on the model's own distribution and does not collapse into a tautology or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the method implicitly assumes plan derivation and probability comparison are well-defined but details are absent.

pith-pipeline@v0.9.0 · 5391 in / 1084 out tokens · 37534 ms · 2026-05-14T21:17:08.782334+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 14 internal anchors

  1. [1]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  2. [2]

    DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...

  3. [3]

    Learning to reason under off-policy guidance, 2025

    Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. Learning to reason under off-policy guidance, 2025. URL https://arxiv.org/abs/2504.14945

  4. [4]

    Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions, 2026

    Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, and Wentao Zhang. Learning what reinforcement learning can’t: Interleaved online fine-tuning for hardest questions, 2026. URL https://arxiv.org/abs/2506.07527

  5. [5]

    Scaf-grpo: Scaffolded group relative policy optimization for enhancing llm reasoning, 2026

    Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, and Jiaya Jia. Scaf-grpo: Scaffolded group relative policy optimization for enhancing llm reasoning, 2026. URL https://arxiv.org/abs/2510.19807

  6. [6]

    Expo: Unlocking hard reasoning with self-explanation-guided reinforcement learning, 2026

    Ruiyang Zhou, Shuozhe Li, Amy Zhang, and Liu Leqi. Expo: Unlocking hard reasoning with self-explanation-guided reinforcement learning, 2026. URL https://arxiv.org/abs/2507.02834

  7. [7]

    A minimalist approach to llm reasoning: from rejection sampling to reinforce, 2025

    Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, and Hanze Dong. A minimalist approach to llm reasoning: from rejection sampling to reinforce, 2025. URL https://arxiv.org/abs/2504.11343

  8. [8]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  9. [9]

    Knapsack rl: Unlocking exploration of llms via optimizing budget allocation

    Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, and Zhi-Quan Luo. Knapsack rl: Unlocking exploration of llms via optimizing budget allocation. URL https://arxiv.org/abs/2509.25849

  11. [11]

    Nudging the boundaries of llm reasoning

    Justin Chih-Yao Chen, Becky Xiangyu Peng, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Mohit Bansal, and Chien-Sheng Wu. Nudging the boundaries of llm reasoning. URL https://arxiv.org/abs/2509.25666

  13. [13]

    Kimi Team, Yifan Bai, Yiping Bao, Y. Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, Kelin Fu, Bofei Gao, Chenxiao Gao, Hongcheng Gao, Peizhong G...

  14. [14]

    On the generalization of sft: A reinforcement learning perspective with reward rectification, 2026

    Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, and Xu Yang. On the generalization of sft: A reinforcement learning perspective with reward rectification, 2026. URL https://arxiv.org/abs/2508.05629

  15. [15]

    No prompt left behind: Exploiting zero-variance prompts in llm reinforcement learning via entropy-guided advantage shaping

    Thanh-Long V. Le, Myeongho Jeon, Kim Trong Vu, Viet Dac Lai, and Eunho Yang. No prompt left behind: Exploiting zero-variance prompts in llm reinforcement learning via entropy-guided advantage shaping, 2026. URL https://arxiv.org/abs/2509.21880. ICLR 2026

  16. [16]

    Adaptive rollout allocation for online reinforcement learning with verifiable rewards, 2026

    Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She, and Viet Anh Nguyen. Adaptive rollout allocation for online reinforcement learning with verifiable rewards, 2026. URL https://arxiv.org/abs/2602.01601. ICLR 2026

  17. [17]

    Self-hinting language models enhance reinforcement learning, 2026

    Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, and Jiang Bian. Self-hinting language models enhance reinforcement learning, 2026. URL https://arxiv.org/abs/2602.03143

  18. [18]

    Adaptive guidance accelerates reinforcement learning of reasoning models, 2025

    Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte, and Sean Hendryx. Adaptive guidance accelerates reinforcement learning of reasoning models, 2025. URL https://arxiv.org/abs/2506.13923

  19. [19]

    Stepwise guided policy optimization: Coloring your incorrect reasoning in grpo, 2026

    Peter Chen, Xiaopeng Li, Ziniu Li, Xi Chen, and Tianyi Lin. Stepwise guided policy optimization: Coloring your incorrect reasoning in grpo, 2026. URL https://arxiv.org/abs/2505.11595

  20. [20]

    First return, entropy-eliciting explore, 2025

    Tianyu Zheng, Tianshun Xing, Qingshui Gu, Taoran Liang, Xingwei Qu, Xin Zhou, Yizhi Li, Zhoufutu Wen, Chenghua Lin, Wenhao Huang, Qian Liu, Ge Zhang, and Zejun Ma. First return, entropy-eliciting explore, 2025. URL https://arxiv.org/abs/2507.07017

  21. [21]

    Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards, 2025

    Xiaoyuan Liu, Tian Liang, Zhiwei He, Jiahao Xu, Wenxuan Wang, Pinjia He, Zhaopeng Tu, Haitao Mi, and Dong Yu. Trust, but verify: A self-verification approach to reinforcement learning with verifiable rewards, 2025. URL https://arxiv.org/abs/2505.13445. NeurIPS 2025

  22. [22]

    Rubric as reward: Decomposing verification signals for logical reasoning in grpo

    Ishaan Gangwani and Aayam Bansal. Rubric as reward: Decomposing verification signals for logical reasoning in grpo. ICLR 2026 Workshop on LLM Reasoning, 2026. URL https://openreview.net/forum?id=LFBnQEGh23

  23. [23]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning, 2026. URL https://arxiv.org/abs/2601.19897

  24. [24]

    Rlpr: Extrapolating rlvr to general domains without verifiers, 2025

    Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, and Tat-Seng Chua. Rlpr: Extrapolating rlvr to general domains without verifiers, 2025. URL https://arxiv.org/abs/2506.18254

  25. [25]

    The lessons of developing process reward models in mathematical reasoning

    Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 10495–10516. Association for Computational Linguistics, 2025. doi: 10.18653/v1/202...

  26. [26]

    S-grpo: Early exit via reinforcement learning in reasoning models

    Muzhi Dai, Chenxu Yang, and Qingyi Si. S-grpo: Early exit via reinforcement learning in reasoning models. arXiv preprint arXiv:2505.07686, 2025. URL https://arxiv.org/abs/2505.07686

  27. [27]

    Free process rewards without process labels

    Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. arXiv preprint arXiv:2412.01981, 2024. URL https://arxiv.org/abs/2412.01981

  28. [28]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025. URL https://arxiv.org/abs/2502.01456

  29. [29]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, and Junyang Lin. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning, 2025. URL https://arxiv.org/abs...

  30. [30]

    Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning, 2025

    Jean Vassoyan, Nathanael Beau, and Roman Plaud. Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning, 2025. URL https://arxiv.org/abs/2502.06533

  31. [31]

    Preserving diversity in supervised fine-tuning of large language models, 2025

    Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Zhi-Quan Luo, and Ruoyu Sun. Preserving diversity in supervised fine-tuning of large language models, 2025. URL https://arxiv.org/abs/2408.16673

  32. [32]

    SED-SFT: Selectively encouraging diversity in supervised fine-tuning, 2026

    Yijie Chen, Yijin Liu, and Fandong Meng. SED-SFT: Selectively encouraging diversity in supervised fine-tuning, 2026. URL https://arxiv.org/abs/2602.07464

  33. [33]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava...

  34. [34]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  35. [35]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  36. [36]

    Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9, 2024

  37. [37]

    Aime problems and solutions

    MAA Committees. Aime problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions, 2024. Art of Problem Solving

  38. [38]

    Aime problems and solutions

    MAA Committees. Aime problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions, 2025. Art of Problem Solving

  39. [39]

    American mathematics competitions (amc) 2023

    Mathematical Association of America. American mathematics competitions (amc) 2023. https://maa.org, 2023. AMC 8, AMC 10, and AMC 12 competition problems

  40. [40]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks ...

  41. [41]

    Solving quantitative reasoning problems with language models

    Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. O...

  42. [42]

    Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceed...

  43. [43]

    GPQA: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. CoRR, abs/2311.12022, 2023. doi: 10.48550/ARXIV.2311.12022. URL https://doi.org/10.48550/arXiv.2311.12022

  44. [44]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela ...

  45. [45]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, page 1279–1297. ACM, March 2025. doi: 10.1145/3689031.3696075. URL http://dx.doi.org/10.1145/3689031.3696075

  46. [46]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023. URL https://arxiv.org/abs/2309.06180

  47. [47]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

  48. [48]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

  49. [49]

    Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026. URL https://arxiv.org/abs/2603.24472