Pith · machine review for the scientific record

arxiv: 2605.10781 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.CL

Recognition: no theorem link

Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:16 UTC · model grok-4.3

classification: 💻 cs.LG · cs.CL
keywords: self-distillation · RLVR · reasoning exploration · information asymmetry · GRPO · LLM post-training · Qwen3

The pith

Reversing the teacher signal in self-distillation allows reinforcement of a student's own successful reasoning paths rather than overwriting them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard self-distillation harms performance on successful student rollouts because the teacher overwrites the student's independent token choices. By instead reversing the signal to strengthen only those tokens where the student succeeds along a path the teacher would not have taken, the proposed RLRT method creates targeted exploration inside the RLVR process. This reversal augments GRPO and yields higher performance than both plain self-distillation and existing exploration baselines. The result holds across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints. A sympathetic reader cares because the work treats information asymmetry between teacher and student as a resource for discovery instead of a defect to eliminate.

Core claim

The authors establish that reading the self-distillation signal in reverse—specifically reinforcing tokens on correct student rollouts that the teacher would not have predicted—augments GRPO to promote valuable exploration grounded in the student's own success. This RLRT approach substantially outperforms self-distillation and exploration-based baselines across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints and positions information asymmetry as a new, principled design axis for RLVR.

What carries the argument

RLRT (RLVR with Reversed Teacher), which identifies and reinforces student-chosen tokens on successful rollouts that diverge from the teacher's predictions.
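The mechanism above can be made concrete with a minimal sketch. The exact weighting rule is not given in this summary; the version below assumes the reverse signal is the student-minus-teacher log-probability gap on each sampled token, gated on verifiable success and clipped to a bounded range. Function and argument names are ours, not the paper's.

```python
def reverse_weights(student_logprobs, teacher_logprobs, rewards, clip=2.0):
    """Sketch of a reversed-teacher token weight (illustrative, not the
    paper's exact rule). Inputs are per-rollout lists of log-probs of
    the sampled tokens under the student and the information-rich
    teacher, plus verifiable rewards r in {0, 1} per rollout.

    A token counts as "self-driven" when the student assigns it more
    mass than the teacher (d_hat > 0); only such tokens on correct
    rollouts are up-weighted, mirroring the reversal described above.
    """
    weights = []
    for k, reward in enumerate(rewards):
        row = []
        for s, t in zip(student_logprobs[k], teacher_logprobs[k]):
            d_hat = s - t  # positive: a path the teacher would not predict
            if reward == 1 and d_hat > 0:
                row.append(1.0 + min(d_hat, clip))  # bounded up-weight
            else:
                row.append(1.0)  # leave the GRPO update unchanged
        weights.append(row)
    return weights
```

The reward gate is what distinguishes this from uniform entropy-style exploration: divergence from the teacher is only credited when the rollout actually succeeded.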

Load-bearing premise

Tokens on successful student rollouts that the teacher would not have predicted reliably reflect valuable self-driven reasoning rather than lucky or shallow paths that do not generalize.

What would settle it

A head-to-head comparison of RLRT against standard GRPO on a held-out reasoning benchmark, under identical training budgets and rollout lengths; a failure to produce higher accuracy there would undercut the central claim.

Figures

Figures reproduced from arXiv: 2605.10781 by Dongsheng Li, Jeonghye Kim, Jiwon Jeon, Yuqing Yang.

Figure 1: Reversing the teacher signal turns self-distillation into valuable exploration. (a) Training reward on Qwen3-4B-Base under GRPO, upweighting teacher-predicted tokens on correct rollouts, and upweighting self-driven tokens (RLRT). (b) Mean avg@16 score over six math benchmarks (AIME24/25/26, HMMT26, AMC23, MATH500) across four Qwen3 backbones. RLRT consistently outperforms baselines significantly. Full resu…
Figure 2: Critical positions and explore/exploit directions. Token shading shows the position-level asymmetry D̄_t = KL(P_t^S ∥ P_t^T). At each critical position (right panels), candidate tokens are taken as the union of the teacher's and student's top-100 tokens; we display the top four with the largest P_t^S − P_t^T (green, D̂_t > 0) and the top four with the largest P_t^T − P_t^S (pink, D̂_t < 0). …
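The position-level asymmetry in the Figure 2 caption, D̄_t = KL(P_t^S ∥ P_t^T), is an ordinary KL divergence between the student's and teacher's next-token distributions at position t. A minimal sketch (names ours, not the paper's):

```python
import math

def token_kl(p_student, p_teacher, eps=1e-12):
    """Position-level asymmetry D_bar_t = KL(P_t^S || P_t^T): the KL
    divergence between student and teacher next-token distributions at
    one position. Large values mark the sparse 'critical positions'
    highlighted in the figure. eps guards against zero teacher mass."""
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(p_student, p_teacher)
               if p > 0.0)
```

In practice this would be computed over the model vocabulary at every generation step, then thresholded to select the shaded positions.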
Figure 3: Reasoning markers in the explore/exploit population. (a) Volcano scatter of linguistic tokens: x-axis is the polarization log2(n_explore/n_exploit), y-axis is the total count log10(n_explore + n_exploit). Highlighted points (green = explore, pink = exploit) are categorized discourse markers; grey points are uncategorized tokens. (b) Per-marker polarization for these markers, sorted from most exploit-leaning. …
Figure 4: Overview of RLRT. (a) Conceptual illustration of the reversed-teacher signal. (b) Given a prompt x, the student policy π_θ generates K rollouts that receive verifiable rewards r ∈ {0, 1} and group-standardized advantages A(k). A reversed teacher provides token-level signals D̂_t that, on correct rollouts, up-weight tokens with D̂_t > 0. Reverse weight as token-level information-asymmetry credit: for a prom…
Figure 5: Training score curves across four backbones (Qwen3-4B-Base, Qwen3-8B-Base, Qwen3-4B-Instruct, Qwen3-8B). RLRT achieves faster exploration and higher training scores in all settings.
Figure 6: Reflection injected at max_kl (■), random (•), or min_kl (▲). (a) flip→R on the hard subset; (b) flip→W on the easy subset. Two findings emerge from …
Figure 7: Token-level distributional shifts of π_ft relative to π_base. (a) CCDF of JS(π_ft ∥ π_base) across all positions; dashed line marks the JS > 0.1 threshold for (b)–(c). (b) Top-k overlap |top-k(π_ft) ∩ top-k(π_base)|/k at high-divergence positions (k ∈ [1, 20]): how much the candidate set is reshuffled. (c) Fraction of high-divergence positions whose new top-1 token had base probability below each threshold: how de…
Figure 8: Pass@k comparison across exploration methods on AIME24 and AIME26. For each method, we evaluate performance on Qwen3-8B-Base by comparing the pass@k curve for k ∈ {1, 2, …, 256} on AIME24 and AIME26. We sample 256 responses per problem and compute pass@k using the unbiased estimator of Chen et al. [1]. …
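The unbiased pass@k estimator of Chen et al. [1] referenced in the Figure 8 caption has a standard closed form: with n samples per problem of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k). A minimal implementation:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator of Chen et al. [1]: the probability
    that at least one of k completions, drawn without replacement from
    n sampled completions of which c are correct, solves the problem.

    pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With the paper's setup, n = 256 and k ranges over {1, 2, …, 256}, so per-problem scores reduce to counting correct samples c and evaluating this formula.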
Figure 9: RLRT ablations. (a) Reward gating on Qwen3-4B-Instruct: RLRT vs. RLRT-all (no r=1 gating) on training score, response length, and actor entropy. (b) Clipping range ε_w on Qwen3-4B-Base and Qwen3-8B-Base, with GRPO as reference. RLRT without reward gating: as described in Section 5, RLRT applies the reverse weight only on correct rollouts (r=1). To isolate the effect of the reward gate, we compare against RL…
Figure 10: Additional example of critical positions and explore/exploit directions, complementing Figure 2.
Figure 11: Training reward (left) and response length (right) on Qwen3-8B-Base. SDPO collapses quickly: its reward drops while response length blows up. SDPO is a self-distillation method that uses the same model as both teacher and student under different conditioning contexts, and rapidly improves in-domain performance and induces more efficient reasoning by shortening response length [9]. However, it can becom…
Figure 12: Full-trajectory heatmap of D̄_t on the first example rollout. Critical positions (green) are sparse and concentrate at decision points, while long routine stretches (pink) carry little signal.
Figure 13: Full-trajectory heatmap of D̄_t on the second example rollout (same conventions as Figure 12).
Original abstract

Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrites the student's choices and suppresses it's own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes RLRT (RLVR with Reversed Teacher), an augmentation to GRPO in self-distilled RLVR. On successful student rollouts, it reverses the standard self-distillation signal by reinforcing tokens to which a teacher model (conditioned on extra information) assigns low probability. This is interpreted as encouraging valuable, self-driven exploration rather than overwriting student reasoning. The manuscript claims that RLRT substantially outperforms self-distillation and exploration baselines across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, positioning information asymmetry as a new principled axis for RLVR design.

Significance. If the empirical claims hold after proper isolation of the reversed-signal effect, the work could meaningfully extend RLVR methodology by providing a targeted alternative to uniform exploration or standard self-distillation. The core idea is conceptually lightweight and directly leverages existing GRPO infrastructure, which would make it easy to adopt if the gains prove robust and generalizable.

major comments (3)
  1. [Method] Method section: The augmentation of GRPO is described only at a conceptual level with no explicit loss function, reward modification rule, or pseudocode showing how low teacher-probability tokens on correct rollouts are selected and reinforced. Without this, it is impossible to determine whether the procedure reduces to standard RL on successful trajectories or introduces a distinct mechanism.
  2. [Experiments] Experiments section: The central claim of substantial outperformance over self-distillation and exploration baselines is unsupported by any reported metrics, tables, ablation results, or statistical details. In particular, no experiment compares reinforcing the reversed (low-probability) tokens against reinforcing high-probability teacher tokens on the same correct rollouts, leaving open whether gains arise from information asymmetry or generic RLVR effects.
  3. [Results] Results and analysis: The assumption that low teacher-probability tokens on successful student rollouts encode generalizable self-driven reasoning (rather than sampling variance, partial-credit paths, or dataset artifacts) is load-bearing for the contribution yet untested. No OOD generalization experiments after RL or controls that remove the reversal component are described.
minor comments (2)
  1. [Abstract] Abstract: 'suppresses it's own reasoning' contains a grammatical error ('it's' should be 'its').
  2. [Abstract] Abstract: The statement 'substantially outperforms' is made without any quantitative values or pointers to specific tables/figures, which is atypical and reduces the abstract's informativeness.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for improving methodological clarity, experimental rigor, and analysis of assumptions. We address each major comment point by point below, with planned revisions to strengthen the presentation of RLRT while preserving the core contribution regarding information asymmetry in RLVR.

Point-by-point responses
  1. Referee: [Method] Method section: The augmentation of GRPO is described only at a conceptual level with no explicit loss function, reward modification rule, or pseudocode showing how low teacher-probability tokens on correct rollouts are selected and reinforced. Without this, it is impossible to determine whether the procedure reduces to standard RL on successful trajectories or introduces a distinct mechanism.

    Authors: We agree that greater explicitness in the method description would aid reproducibility. The current manuscript presents the high-level idea in Section 3, but the implementation details are embedded in the experimental protocol. In the revised version, we will add the precise RLRT objective as a modification to the GRPO loss: on successful rollouts, tokens with teacher probability below a fixed threshold receive an additive positive advantage term before the policy gradient update. We will also include pseudocode in the appendix that details the token selection process (filtering correct trajectories, computing teacher probs, and applying the reversal only to low-prob tokens). This formulation is distinct from standard RL on successful trajectories because the reversal selectively amplifies student-generated paths that diverge from the teacher. revision: yes
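The rebuttal's formulation ("tokens with teacher probability below a fixed threshold receive an additive positive advantage term") can be sketched as follows. The threshold tau, the bonus beta, and all names here are illustrative placeholders, not values or identifiers from the paper:

```python
def rlrt_advantages(advantages, teacher_probs, rewards, tau=0.1, beta=0.5):
    """Token-level advantage modification per the rebuttal's description
    (a sketch; tau and beta are hypothetical hyperparameters).

    advantages: [K][T] group-standardized GRPO advantages broadcast per
    token; teacher_probs: [K][T] teacher probability of each sampled
    token; rewards: [K] verifiable rewards in {0, 1}.
    """
    out = []
    for k, reward in enumerate(rewards):
        out.append([
            # On correct rollouts, add a bonus where the teacher found
            # the student's token unlikely (the "reversed" direction).
            a + (beta if reward == 1 and p < tau else 0.0)
            for a, p in zip(advantages[k], teacher_probs[k])
        ])
    return out
```

This makes the referee's distinctness question testable: with beta = 0 the update reduces exactly to standard GRPO, so any gain must come from the reversed term.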

  2. Referee: [Experiments] Experiments section: The central claim of substantial outperformance over self-distillation and exploration baselines is unsupported by any reported metrics, tables, ablation results, or statistical details. In particular, no experiment compares reinforcing the reversed (low-probability) tokens against reinforcing high-probability teacher tokens on the same correct rollouts, leaving open whether gains arise from information asymmetry or generic RLVR effects.

    Authors: Section 4 of the manuscript reports quantitative results across Qwen3 checkpoints, including accuracy metrics on reasoning benchmarks that show consistent gains for RLRT over the listed baselines. We acknowledge, however, that a direct ablation isolating the reversal (low-prob tokens) versus reinforcing high-probability teacher tokens on identical correct rollouts is not presented. We will add this controlled comparison in the revised experiments, using the same set of successful student rollouts for both variants. This will clarify that the performance difference stems from the information-asymmetry mechanism rather than generic reinforcement on correct trajectories. revision: yes

  3. Referee: [Results] Results and analysis: The assumption that low teacher-probability tokens on successful student rollouts encode generalizable self-driven reasoning (rather than sampling variance, partial-credit paths, or dataset artifacts) is load-bearing for the contribution yet untested. No OOD generalization experiments after RL or controls that remove the reversal component are described.

    Authors: The manuscript includes qualitative case studies in Section 5 demonstrating that the reinforced low-probability tokens often correspond to novel reasoning steps absent from teacher outputs. The self-distillation baseline serves as a control for the reversal component, since it applies the standard (non-reversed) teacher signal on the same rollouts; the performance gap supports our interpretation. To further address generalizability concerns, we will add OOD evaluation on additional held-out reasoning tasks in the revised results section and include statistical significance reporting for the main comparisons. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical augmentation of GRPO with no derivations or self-referential reductions.

full rationale

The manuscript describes RLRT as a direct augmentation of GRPO that reinforces low teacher-probability tokens on correct student rollouts. No equations, parameter fits, or mathematical derivations appear. The method is defined explicitly from the reversed self-distillation observation without any claim that reduces a 'prediction' or result back to its inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim of outperformance is empirical (across Qwen3 checkpoints) rather than derived, so the derivation chain is self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the interpretation that reversed teacher signals on successful paths constitute valuable self-driven reasoning. No free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5484 in / 1022 out tokens · 32246 ms · 2026-05-12T04:16:04.065876+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 13 internal anchors

[1] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[2] Zhipeng Chen, Xiaobo Qin, Youbin Wu, Yue Ling, Qinghao Ye, Wayne Xin Zhao, and Guang Shi. Pass@k training for adaptively balancing exploration and exploitation of large reasoning models. arXiv preprint arXiv:2508.10751, 2025.

[3] Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 30377–30385, 2026.

[4] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025.

[5] Shihan Dou, Muling Wu, Jingwen Xu, Rui Zheng, Tao Gui, and Qi Zhang. Improving RL exploration for LLM reasoning through retrospective replay. In CCF International Conference on Natural Language Processing and Chinese Computing, pages 594–606. Springer, 2025.

[6] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[7] Zhezheng Hao, Hong Wang, Haoyang Liu, Jian Luo, Jiarui Yu, Hande Dong, Qiang Lin, Can Wang, and Jiawei Chen. Rethinking entropy interventions in RLVR: An entropy change perspective. arXiv preprint arXiv:2510.10150, 2025.

[8] Zican Hu, Shilin Zhang, Yafu Li, Jianhao Yan, Xuyang Hu, Leyang Cui, Xiaoye Qu, Chunlin Chen, Yu Cheng, and Zhi Wang. Diversity-incentivized exploration for versatile reasoning. arXiv preprint arXiv:2509.26209, 2025.

[9] Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026.

[10] Renren Jin, Pengzhi Gao, Yuqi Ren, Zhuowen Han, Tongxuan Zhang, Wuwei Huang, Wei Liu, Jian Luan, and Deyi Xiong. Revisiting entropy in reinforcement learning for large reasoning models. arXiv preprint arXiv:2511.05993, 2025.

[11] Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of LLMs? arXiv preprint arXiv:2603.24472, 2026.

[12] Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing. arXiv preprint arXiv:2604.02288, 2026.

[13] Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, and Yuqing Yang. Exploratory memory-augmented LLM agent via hybrid on- and off-policy optimization. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=UOzxviKVFO.

[14] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=5PAF7PAY2Y.

[15] Haoming Meng, Kexin Huang, Shaohang Wei, Chiyu Ma, Shuo Yang, Xue Wang, Guoyin Wang, Bolin Ding, and Jingren Zhou. Sparse but critical: A token-level analysis of distributional shifts in RLVR fine-tuning of LLMs. arXiv preprint arXiv:2603.22446, 2026.

[16] Burt L Monroe, Michael P Colaresi, and Kevin M Quinn. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403, 2008.

[17] Phuc Minh Nguyen, Chinh D La, Duy MH Nguyen, Nitesh V Chawla, Binh T Nguyen, and Khoa D Doan. The reasoning boundary paradox: How reinforcement learning constrains language models. arXiv preprint arXiv:2510.02230, 2025.

[18] Jaesung R Park, Junsu Kim, Gyeongman Kim, Jinyoung Jo, Sean Choi, Jaewoong Cho, and Ernest K Ryu. Clip-low increases entropy and clip-high decreases entropy in reinforcement learning of large language models. arXiv preprint arXiv:2509.26114, 2025.

[19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[20] Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026.

[21] Yuda Song, Julia Kempe, and Remi Munos. Outcome-based exploration for LLM reasoning. arXiv preprint arXiv:2509.06941, 2025.

[22] Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback. arXiv preprint arXiv:2602.02482, 2026.

[23] Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang, Xin Wang, Hui Shen, Jing Xiong, Chaofan Tao, Zixuan Zhong, et al. DSDR: Dual-scale diversity regularization for exploration in LLM reasoning. arXiv preprint arXiv:2602.19895, 2026.

[24] Chen Wang, Zhaochun Li, Jionghao Bai, Yuzhi Zhang, Shisheng Cui, Zhou Zhao, and Yue Wang. Arbitrary entropy policy optimization breaks the exploration bottleneck of reinforcement learning. arXiv preprint arXiv:2510.08141, 2025.

[25] Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled RLVR. arXiv preprint arXiv:2604.03128, 2026.

[26] Xinhao Yao, Lu Yu, Xiaolin Hu, Fengwei Teng, Qing Cui, Jun Zhou, and Yong Liu. The debate on RLVR reasoning capability boundary: Shrinkage, expansion, or both? A two-stage dynamic view. arXiv preprint arXiv:2510.04028, 2025.

[27] Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026.

[28] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

[29] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=4OsgYD7em5.

[30] Charlie Zhang, Graham Neubig, and Xiang Yue. On the interplay of pre-training, mid-training, and RL on reasoning language models. arXiv preprint arXiv:2512.07783, 2025.

[31] Rosie Zhao, Alexandru Meterez, Sham Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: RL post-training amplifies behaviors learned in pretraining. arXiv preprint arXiv:2504.07912, 2025.

[32] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.