pith. sign in

arxiv: 2605.28014 · v1 · pith:QDRCQ3OSnew · submitted 2026-05-27 · 💻 cs.CL · cs.LG

ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains

Pith reviewed 2026-06-29 13:01 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords on-policy self-distillationlanguage model reasoningout-of-domain generalizationreflective distillationerror localizationself-reflectorLLM reasoning
0
0 comments X

The pith

ROSD improves LLM reasoning by using a self-reflector to extract a corrective idea and restrict distillation to the first error span in each rollout.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy self-distillation conditions the teacher on verified solutions and applies updates across the entire response, which encourages imitation of reference trajectories and can overwrite valid reasoning steps. ROSD replaces this with a self-reflector that identifies a corrective idea and the initial erroneous span, then guides the teacher only on that span. The result is error-specific correction that keeps good prefixes intact. Experiments across in-domain and out-of-domain reasoning benchmarks show stronger overall in-domain results and markedly better generalization than prior OPSD approaches.

Core claim

ROSD turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, a self-reflector extracts a corrective idea and locates the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes.

What carries the argument

The self-reflector that extracts a corrective idea and locates the first erroneous span, enabling error-localized distillation instead of full-response updates.

If this is right

  • Stronger in-domain reasoning performance overall than standard OPSD.
  • Substantially better out-of-domain generalization than standard OPSD.
  • Preservation of valid reasoning prefixes during the distillation step.
  • Shift from imitation of reference trajectories to error-specific correction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reflection-plus-localization pattern could be tested on non-reasoning sequence tasks such as code generation or dialogue.
  • If the reflector works reliably, the method reduces dependence on large sets of verified reference solutions.
  • Error localization may lower the risk of reinforcing domain-specific patterns that hurt generalization.

Load-bearing premise

A self-reflector can reliably extract a corrective idea and accurately locate the first erroneous span without introducing new errors or misidentifying valid steps.

What would settle it

An ablation that supplies the self-reflector with deliberately incorrect error locations or corrective ideas, after which out-of-domain gains over standard OPSD disappear.

Figures

Figures reproduced from arXiv: 2605.28014 by Daiting Shi, Jingzhou He, Liu Yang, Xiao-Ming Wu, Xin Xin, Xinyu Ma, Yujie Feng, Zhaochun Ren, Ziqi Zhao.

Figure 1
Figure 1. Figure 1: Left: Pilot study. We use SDPO (Hübotter et al., 2026) as a representative OPSD method. All methods are trained on the Material dataset with Qwen3-8B, and evaluated on Material and ToolUse as the in-domain and out-of-domain benchmarks, respectively. Right: Comparison of post-training methods. Green and red tokens denote the valid prefix before the first error and the error-affected suffix, respectively. (a… view at source ↗
Figure 2
Figure 2. Figure 2: Self-reflection prompt templates used by ROSD. The system prompt provides task-level instructions to [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: In domain training dynamics on Material and Chemistry. We report the mean@16 test score, and the [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Rolling average of rollout accuracy over 10 [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Error localization dynamics during training. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Training efficiency. We report time(seconds) [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Response length dynamics. A rolling average [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: System prompt for ScienceQA. [User] {question} Please reason step by step [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: User prompt for ToolUse. [User] {question} Please reason step by step, and put your final answer within \boxed{} [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: Teacher input template with reflection. The self-teacher preserves the original system and user prompts [PITH_FULL_IMAGE:figures/full_fig_p012_12.png] view at source ↗
read the original abstract

On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard on-policy self-distillation (OPSD) yields limited in-domain gains and poor out-of-domain generalization because it encourages imitation of reference trajectories and applies distillation over full responses. ROSD addresses this by introducing a self-reflector that extracts a corrective idea from the reference and identifies the first erroneous span in each rollout; distillation is then restricted to that span under guidance from the corrective idea. Experiments on multiple in-domain and out-of-domain reasoning benchmarks are reported to show stronger overall in-domain performance and substantially better out-of-domain generalization than standard OPSD. Code is released.

Significance. If the central mechanism proves reliable, the targeted correction approach could meaningfully advance on-policy distillation methods for LLM reasoning by reducing overfitting to training-domain trajectories while preserving valid prefixes. The public code release is a clear strength that supports reproducibility.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (method description): the central claim that restricting distillation to the first erroneous span 'preserves valid prefixes' and supplies 'targeted correction' is load-bearing, yet the manuscript supplies no quantitative evidence (human agreement rates, precision/recall on span localization, or ablation removing the reflector) that the self-reflector meets the required reliability threshold at non-negligible rates.
  2. [§4] §4 (experiments): the reported gains in out-of-domain generalization are attributed to error-localized distillation, but without an ablation that compares ROSD against a version using the same corrective idea yet full-response distillation, it is unclear whether the localization step, rather than the corrective idea alone, drives the improvement.
minor comments (2)
  1. [§3] Notation for the self-reflector output (corrective idea and span) should be formalized with explicit symbols to avoid ambiguity when describing the modified distillation loss.
  2. [Abstract, §4] The abstract states 'multiple in-domain and out-of-domain reasoning benchmarks' but does not name them; the experimental section should list the exact datasets and splits used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point by point to the major comments and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method description): the central claim that restricting distillation to the first erroneous span 'preserves valid prefixes' and supplies 'targeted correction' is load-bearing, yet the manuscript supplies no quantitative evidence (human agreement rates, precision/recall on span localization, or ablation removing the reflector) that the self-reflector meets the required reliability threshold at non-negligible rates.

    Authors: We agree that the manuscript does not provide direct quantitative validation of the self-reflector's span localization (e.g., human agreement rates, precision/recall) or an ablation that removes the reflector. End-to-end gains serve as indirect evidence in the current version. In revision we will add both an ablation removing the reflector and human evaluation of localization accuracy on a sampled subset of rollouts. revision: yes

  2. Referee: [§4] §4 (experiments): the reported gains in out-of-domain generalization are attributed to error-localized distillation, but without an ablation that compares ROSD against a version using the same corrective idea yet full-response distillation, it is unclear whether the localization step, rather than the corrective idea alone, drives the improvement.

    Authors: We concur that the experiments do not isolate the localization step from the corrective idea. We will add the requested ablation (corrective idea with full-response distillation) in the revised manuscript to clarify the contribution of error localization. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent experimental validation

full rationale

The paper introduces ROSD as a modification to OPSD that incorporates a self-reflector for corrective idea extraction and error-span localization, then reports benchmark results showing improved in-domain and out-of-domain performance. No derivation step reduces by construction to its inputs: there are no self-definitional equations, no parameters fitted on a subset and then relabeled as predictions, and no load-bearing self-citations or uniqueness theorems. The mechanism is described procedurally and evaluated externally via experiments, making the central claims falsifiable and non-circular. The reader's assessment of score 1.0 aligns with this finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is abstract-only; no explicit free parameters, axioms, or invented entities are detailed in the provided text.

axioms (1)
  • domain assumption Standard assumptions of on-policy distillation and LLM fine-tuning hold for the proposed reflective mechanism.
    The method extends OPSD without challenging base training assumptions.

pith-pipeline@v0.9.1-grok · 5776 in / 1167 out tokens · 52755 ms · 2026-06-29T13:01:02.592910+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. UCOB: Learning to Utilize and Evolve Agentic Skills via Credit-Aware On-Policy Bidirectional Self-Distillation

    cs.AI 2026-06 unverdicted novelty 5.0

    UCOB improves agentic RL by using return-to-go comparisons between skill-conditioned and no-skill prompts as local teachers for bidirectional self-distillation and skill memory updates.

Reference graph

Works this paper leans on

9 extracted references · 5 canonical work pages · cited by 1 Pith paper · 5 internal anchors

  1. [1]

    Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

    Self-distillation zero: Self-revision turns bi- nary rewards into dense supervision.arXiv preprint arXiv:2604.12002. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531. Jonas Hübotter, Frederike Lübeck, Lejs Behric, An- ton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Id...

  2. [2]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783. Kevin Lu and Thinking Machines Lab. 2025. On- policy distillation.Thinking Machines Lab: Con- nectionism. Https://thinkingmachines.ai/blog/on- policy-distillation. Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wen- ha...

  3. [3]

    Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. CoRR. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow in- structions with human feedback.Advances in neural informat...

  4. [4]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Direct preference optimization: Your language model is secretly a reward model.Advances in neural information processing systems, 36:53728–53741. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open langua...

  5. [5]

    Kimi k1.5: Scaling Reinforcement Learning with LLMs

    Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others

  6. [6]

    Qwen3 Technical Report

    Qwen3 technical report.arXiv preprint arXiv:2505.09388. Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. 2025. Your efficient rl framework secretly brings you off-policy rl training, august 2025.URL https://fengyao. notion. site/off- policy-rl. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan...

  7. [7]

    We train for 10 epochs on the science question-answering tasks and 5 epochs on the ToolUse task, using a training batch size of

    that train under a fixed time budget, we use an epoch budget large enough for the methods to converge in our setting. We train for 10 epochs on the science question-answering tasks and 5 epochs on the ToolUse task, using a training batch size of

  8. [8]

    All algorithms are eval- uated every 10 training steps

    The maximum length of the reflection prompt is 8k tokens, and the maximum length of the gener- ated reflection is 4k tokens. All algorithms are eval- uated every 10 training steps. For self-distillation methods, we use Jensen–Shannon divergence with α= 0.5 , a distillation top-k of 100, and a frozen self-teacher and self-reflector during each run. C Addit...

  9. [9]

    Given a question and four options, please select the right answer

    The prompt used as input to the self-teacher is shown in Figure 12. Given a question and four options, please select the right answer. Respond in the following format: <reasoning> ... </reasoning> <answer> ... </answer> For the answer, only output the letter corresponding to the correct option (A, B, C, or D), and nothing else. Do not restate the answer t...