OISD: On-Policy Internal Self-Distillation of Language Models

Darryl Cherian Jacob; Jindong Wang; Pan He; Xinyu Liu; Yang Zhou

arXiv: 2605.29089 · v1 · pith:UFV56ISUnew · submitted 2026-05-27 · 💻 cs.LG · cs.AI · cs.CV

Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang, Pan He

Pith reviewed 2026-06-29 13:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords on-policy self-distillation · internal representations · language model reasoning · reinforcement learning · GRPO · logit alignment · attention alignment · mathematical reasoning

The pith

Language models improve reasoning by distilling final-layer signals into intermediate layers during on-policy RL training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces on-policy internal self-distillation (OISD) to move predictive signals from the final layer of a language model into selected intermediate layers. It does this inside the same rollouts used for Group Relative Policy Optimization (GRPO) by treating the final layer as a detached internal teacher. Alignment occurs through two channels: logit alignment transfers high-level reasoning behaviors, and attention alignment transfers consistent focus patterns. Both use a signed advantage-weighted Jensen-Shannon loss that is intended to preserve policy consistency under a single acting policy. Readers would care because standard RL post-training rewards only final answers through sparse signals and therefore leaves potentially useful intermediate representations under-exploited.
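For readers unfamiliar with the group-relative part of GRPO, here is a minimal sketch of how sparse outcome rewards for one prompt's rollouts are commonly turned into signed advantages. It is not taken from the paper: the function name `group_relative_advantages`, the `eps` constant, and the choice of the population standard deviation are illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of rollouts.

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps for stability).
    Hypothetical helper; real GRPO implementations may differ in detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts for the same prompt, rewarded 1.0 if the answer is correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

A rollout that beats its group's mean receives a positive advantage and one that falls below it receives a negative advantage. That sign is presumably what the "signed" weighting in the alignment loss consumes.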

Core claim

OISD uses the final layer as both the acting policy and a detached internal teacher during rollout and GRPO optimization. Selected intermediate layers are guided to match the final layer through logit alignment, which copies reasoning behaviors, and attention alignment, which copies focus patterns. The alignment employs signed advantage-weighted Jensen-Shannon divergence so that distillation occurs while preserving policy consistency under a single unified acting policy. Experiments show this produces substantial and consistent gains over strong reasoning RL baselines on four mathematical reasoning tasks.

What carries the argument

Signed advantage-weighted Jensen-Shannon alignment that distills logits and attention from the final layer to intermediate layers under a unified on-policy acting policy.
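To make the mechanism concrete, the sketch below computes a token-level Jensen-Shannon divergence between an intermediate-layer distribution and the final-layer distribution, then scales it by a signed advantage. This is one plausible reading of "signed advantage-weighted" and is an assumption, not the paper's stated loss. The names `softmax`, `js_divergence`, and `signed_weighted_js`, and the single shared `temperature`, are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q), skipping zero-probability entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence using the mixture m = (p + q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def signed_weighted_js(student_logits, teacher_logits, advantage, temperature=1.0):
    """Assumed per-token alignment term: advantage * JS(student, teacher).

    The teacher (final-layer) distribution is treated as a fixed target here;
    in an autodiff framework it would be detached from the graph.
    """
    p = softmax(student_logits, temperature)
    q = softmax(teacher_logits, temperature)
    return advantage * js_divergence(p, q)


if __name__ == "__main__":
    print(signed_weighted_js([2.0, 0.5, -1.0], [0.1, 0.2, 0.3], advantage=1.0))
```

Under this reading, a positive advantage makes minimizing the term pull the intermediate layer toward the final layer, while a negative advantage flips the sign of the term. Whether the paper clips, rescales, or otherwise treats negative advantages differently cannot be determined from the abstract.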

If this is right

The approach transfers high-level reasoning behaviors without any external privileged information or separate teacher models.
It enforces consistent attention patterns across layers while the model continues to act under one policy.
The method yields measurable gains on four separate mathematical reasoning tasks over strong RL baselines.
Distillation happens on-policy during the same rollouts used for policy optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same internal-teacher pattern could be tested on non-mathematical domains where intermediate layers already encode useful task signals.
OISD might be combined with other forms of auxiliary supervision inside the same RL loop to compound gains.
If the final layer's signals prove transferable, future post-training pipelines could routinely add lightweight internal alignment heads rather than only optimizing final outputs.

Load-bearing premise

The final layer's on-policy representations contain transferable predictive signals about reasoning that are worth distilling to intermediate layers without causing policy degradation.

What would settle it

Applying OISD during GRPO training and observing no improvement or outright degradation on the four mathematical reasoning benchmarks compared with GRPO alone would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.29089 by Darryl Cherian Jacob, Jindong Wang, Pan He, Xinyu Liu, Yang Zhou.

**Figure 2.** Layerwise logit-lens visualization of a reason…

**Figure 3.** Pass@k comparison between GRPO and OISD using Qwen3-4B.

… under the same setting as others. Qwen3-8B tests scaling within the Qwen3 family, while the OctoThinker models evaluate transfer to a different model family. We compare OISD against the corresponding Vanilla, PPO, Reinforce++, RLOO, GRPO, and BuPO (Tan et al., 2025). GRPO serves as the final-layer-only RL baseline, while BuPO represents direct intern…

**Figure 4.** Effect of logit and attention alignment on Qwen3-4B. We compare Vanilla, GRPO, …

**Figure 5.** Training dynamics of OISD over the full RL run on Qwen3-4B. (a) Average rollout reward of the final-layer acting policy increases steadily across training. (b) Response-token entropy of the middle-layer readout (layer 6). (c) The response length of generated trajectories grows as the policy develops longer reasoning chains.

Panel titles from an adjacent figure: (a) Layer-wise Attention Signals on Reasoning Evidence; (b) Training-Time …

**Figure 7.** Logit-lens comparison between OISD and GRPO on a reasoning trace. Each row shows the top token decoded from the final or layer-6 hidden state. Bold outlines indicate agreement between layer-6 and final-layer predictions. OISD exhibits earlier alignment on reasoning-critical tokens, whereas GRPO produces more local or incomplete intermediate predictions.

**Figure 8.** Checkpoint progression for Qwen3-4B OISD with logit and attention alignment. Results are reported with Avg@16 on AMC23 and MATH500, Avg@32 on AIME24 and AIME25, and the four-benchmark average.

… where $\mathrm{sg}(\cdot)$ denotes stop-gradient on the detached final-layer teacher. Define the mixture distribution $m_t^{l,L,\tau} = \tfrac{1}{2}\left(p_t^{l,\tau} + q_t^{L,\tau}\right)$. The token-level Jensen–Shannon divergence is $\mathrm{JS}\left(p_t^{l,\tau}, q_t^{L,\tau}\right) = 1$…

**Figure 9.** Gradient norm for the think and attention …

**Figure 10.** Reasoning comparison on the same AIME 2024 tetrahedron problem. The top box states the problem …

Original abstract

Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both without requiring external privileged information. Our OISD, together with GRPO, employs signed advantage-weighted Jensen-Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at https://github.com/THE-MALT-LAB/OISD

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

OISD adds a specific on-policy internal distillation step to GRPO, but the abstract reports no results and leaves the policy-consistency claim unverified.

The main takeaway is that this paper proposes on-policy internal self-distillation inside GRPO training: the final layer serves as a detached teacher for selected intermediate layers, aligning logits (to transfer reasoning behavior) and attention patterns (to enforce consistent focus) via signed advantage-weighted Jensen-Shannon divergence, all without external teachers. It claims this yields substantial gains on four math reasoning tasks.

What is new is the combination of logit-plus-attention alignment under a unified acting policy during rollout and optimization. The framing correctly notes that standard outcome-reward RL overlooks predictive signals in intermediate layers, and the two mechanisms target different aspects of that signal.

The paper does a reasonable job laying out why internal signals matter and why keeping the teacher on-policy could avoid distribution shift. The idea of using the model's own final layer as teacher is straightforward and avoids privileged information.

The soft spots are more central. The abstract supplies zero numbers, no baseline details, no statistical tests, and no equations showing how the alignment loss combines with the GRPO objective or where stop-gradients sit. Without those, the central claim cannot be evaluated. The stress-test concern lands: if the signed advantage weighting is not fully detached, the alignment term could alter the effective advantages that GRPO relies on, undermining the on-policy guarantee. The abstract gives no indication this interaction was checked.

This work is aimed at groups already running GRPO-style RL on reasoning models who want to experiment with internal distillation. A reader already familiar with the GRPO paper and basic distillation losses could extract the high-level recipe, but would still need the full methods to implement or judge it.

I would send it to peer review so the experiments and loss derivations can be examined, though the current abstract is too thin to assess impact on its own.

Referee Report

2 major / 2 minor

Summary. The paper introduces the OISD framework for on-policy internal self-distillation during RL post-training of language models with GRPO. The final layer serves as a detached teacher that aligns selected intermediate layers via logit alignment (transferring reasoning behaviors) and attention alignment (enforcing consistent patterns) using signed advantage-weighted Jensen-Shannon divergence, without external privileged information. The central claim is that this distills informative intermediate representations while preserving policy consistency under a unified acting policy, yielding substantial and consistent improvements over strong reasoning RL baselines on four mathematical reasoning tasks.

Significance. If the on-policy property is preserved and the improvements hold under proper controls, the approach would address a gap in outcome-only RL by leveraging internal predictive signals for better reasoning representations. The use of a unified policy and detached teacher is a clean design choice that could generalize beyond the reported tasks.

major comments (2)

[Abstract] Abstract: the claim of 'substantial and consistent improvements' and 'preserving policy consistency under a unified acting policy' is load-bearing, yet the abstract supplies no quantitative results, baseline names, effect sizes, statistical tests, or ablation numbers. Without these, the empirical support for the central claim cannot be assessed.
[OISD framework description] OISD + GRPO description (alignment mechanism paragraph): the signed advantage-weighted Jensen-Shannon alignment is jointly optimized with the GRPO objective, but no equation shows the combined loss, the coefficient on the alignment term, or stop-gradient placement on the advantage weights and final-layer teacher. If the advantage weighting is not fully detached, the gradient for the acting policy incorporates representation-matching terms, which would alter the on-policy assumption that advantages derive solely from outcome rewards.

minor comments (2)

[Abstract] The abstract mentions 'four mathematical reasoning tasks' but does not name them; adding the task names would improve clarity.
[Abstract] The code release link is welcome, but the manuscript should include a short reproducibility checklist (hyperparameters for the alignment loss, layer selection criteria, and exact JS weighting) to support the promised release.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation and clarify technical details.

Referee: [Abstract] Abstract: the claim of 'substantial and consistent improvements' and 'preserving policy consistency under a unified acting policy' is load-bearing, yet the abstract supplies no quantitative results, baseline names, effect sizes, statistical tests, or ablation numbers. Without these, the empirical support for the central claim cannot be assessed.

Authors: We agree that the abstract would benefit from quantitative support for the central claims. In the revised version we will incorporate specific results, including average performance gains over the GRPO baseline across the four mathematical reasoning tasks, the primary baseline names, and a brief reference to the evaluation protocol. revision: yes

Referee: [OISD framework description] OISD + GRPO description (alignment mechanism paragraph): the signed advantage-weighted Jensen-Shannon alignment is jointly optimized with the GRPO objective, but no equation shows the combined loss, the coefficient on the alignment term, or stop-gradient placement on the advantage weights and final-layer teacher. If the advantage weighting is not fully detached, the gradient for the acting policy incorporates representation-matching terms, which would alter the on-policy assumption that advantages derive solely from outcome rewards.

Authors: We acknowledge that the manuscript describes the alignment mechanisms but omits an explicit combined-loss equation and the precise stop-gradient placements. The intended design detaches both the final-layer teacher and the advantage weights (computed exclusively from outcome rewards) so that representation-matching gradients do not affect the policy update. We will add the equation L_total = L_GRPO + λ L_OISD, report the coefficient λ used in experiments, and explicitly document the stop-gradient operations to confirm that the on-policy property is preserved. revision: yes
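The stop-gradient point in this exchange can be illustrated with a toy forward-mode autodiff sketch of L_total = L_GRPO + λ L_OISD. Everything here is hypothetical: the `Dual` class, `stop_gradient`, `total_loss`, and all numeric values are invented for illustration and are not the authors' implementation. The weight is deliberately given a nonzero derivative only to show what detaching removes. In the design the authors describe, the advantage weights come from outcome rewards alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Toy forward-mode value: val and its derivative w.r.t. one parameter."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.val * other.val,
                    self.val * other.grad + self.grad * other.val)

    __rmul__ = __mul__


def _lift(x):
    """Wrap plain numbers as constants with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x))


def stop_gradient(x):
    """Keep the value but drop the derivative, like a detach operation."""
    return Dual(x.val, 0.0)


def total_loss(l_grpo, l_align, weight, lam, detach=True):
    """L_total = L_GRPO + lam * (weight * L_align), weight optionally detached."""
    w = stop_gradient(weight) if detach else weight
    return l_grpo + lam * (w * l_align)


if __name__ == "__main__":
    l_grpo, l_align, weight = Dual(0.8, 0.5), Dual(0.2, 0.3), Dual(-1.0, 0.4)
    print(total_loss(l_grpo, l_align, weight, lam=0.1, detach=True))
    print(total_loss(l_grpo, l_align, weight, lam=0.1, detach=False))
```

Both calls produce the same loss value, but the derivative differs: without detaching, the product rule adds a term proportional to the weight's own derivative. That extra term is the kind of leakage the referee asks the authors to rule out.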

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The paper introduces OISD as an on-policy internal self-distillation mechanism added to GRPO, using signed advantage-weighted Jensen-Shannon alignment with the final layer as detached teacher for logit and attention alignment. The abstract and description present this as an empirical augmentation that transfers predictive signals while preserving policy consistency, without any equations or definitions that reduce the claimed improvements, policy consistency, or advantages to self-referential fits or tautologies by construction. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the provided text. The central claims rest on the proposed mechanisms and experimental results rather than circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; full text would be required to audit hyperparameters in the alignment loss, any domain assumptions about layer representations, or new constructs.

pith-pipeline@v0.9.1-grok · 5754 in / 1223 out tokens · 40549 ms · 2026-06-29T13:41:00.279257+00:00 · methodology

Reference graph

Works this paper leans on

4 extracted references · 3 canonical work pages · 2 internal anchors

[1]

Proximal Policy Optimization Algorithms

Let’s verify step by step. In International Conference on Learning Representations, volume 2024, pages 39578–39601. MAA. 2023. American mathematics contest 12 (amc 12). MAA. 2024. American invitational mathematics examination (aime). MAA. 2025. American invitational mathematics examination (aime). María Luisa Menéndez, Julio Angel Pardo, Leandro Pard...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover

Reinforcement learning fine-tuning enhances activation intensity and diversity in the internal circuitry of llms. arXiv preprint arXiv:2509.21044. Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover

work page arXiv
[3]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734. Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, and 1 others. 2025a. Group sequence policy optimization. arXiv preprint arXiv:2507.18071. Chujie Zheng, Zhenru Zhang, Beichen Zha...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

This distance can be written in the form m√n/p, where m, n, p are positive integers, m and p are relatively prime, and n is not divisible by the square of any prime

There exists a point I inside the tetrahedron such that the distances from I to each of the faces of the tetrahedron are all equal. This distance can be written in the form m√n/p, where m, n, p are positive integers, m and p are relatively prime, and n is not divisible by the square of any prime. Find m+n+p. Ground truth: 104. Problem: Let ABCD be a te...

2024
