Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Huaibin Wang; Tiangang Zhang; Yilun Sun; Zihao Han

arXiv: 2605.11458 · v2 · pith:NGKYKIAKnew · submitted 2026-05-12 · 💻 cs.AI · cs.CL · cs.LO

Pith reviewed 2026-05-13 01:58 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LO

keywords: self-distillation, LLM reasoning, teacher exposure, adaptive policy, on-policy distillation, Beta distribution, math benchmarks

The pith

Adaptive control of how much reference reasoning the teacher sees during self-distillation improves LLM performance on math tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy self-distillation for LLM reasoning always gives the teacher the complete reference solution, yet experiments show this fixed full exposure is not reliably optimal and increases mismatch as the teacher sees more privileged steps. The paper treats teacher exposure instead as a learnable variable. A lightweight Beta-policy controller, conditioned on compact training statistics, samples a reveal ratio that stays fixed for a short window of student updates. The controller is optimized with a discounted reward that credits each choice by its measured effect on the student's future progress rather than immediate loss. Across Qwen3 models from 1.7B to 8B parameters, this adaptive schedule outperforms fixed-exposure self-distillation and RL baselines on AIME 24, AIME 25, and HMMT 25.

Core claim

Treating teacher exposure as a learnable control variable via a Beta-policy controller conditioned on training-state statistics and optimized by a discounted learning-progress reward produces higher student reasoning accuracy than the conventional choice of always revealing the full reference.

What carries the argument

A Beta-policy controller that samples the fraction of reference reasoning to expose to the teacher for a fixed hold window of student updates and receives a reward based on the student's subsequent improvement.
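
To make the sample-and-hold mechanism concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the linear map from state features to Beta concentrations, the class name `BetaExposureController`, the helper `run_schedule`, and the choice of features are all assumptions made for illustration.

```python
import math
import random


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


class BetaExposureController:
    """Hypothetical linear controller: state features -> Beta(a, b) on [0, 1]."""

    def __init__(self, n_features: int, seed: int = 0):
        self.rng = random.Random(seed)
        # One weight vector and bias per Beta concentration (assumed form).
        self.w_a = [0.0] * n_features
        self.w_b = [0.0] * n_features
        self.bias_a = 0.0
        self.bias_b = 0.0

    def concentrations(self, state):
        # Adding 1 keeps both concentrations above 1, so the density is unimodal.
        a = 1.0 + softplus(self.bias_a + sum(w * s for w, s in zip(self.w_a, state)))
        b = 1.0 + softplus(self.bias_b + sum(w * s for w, s in zip(self.w_b, state)))
        return a, b

    def sample_reveal_ratio(self, state) -> float:
        a, b = self.concentrations(state)
        return self.rng.betavariate(a, b)


def run_schedule(controller, states, hold_window: int):
    """Sample one reveal ratio per window and reuse it for `hold_window` updates."""
    schedule = []
    for state in states:
        ratio = controller.sample_reveal_ratio(state)
        schedule.extend([ratio] * hold_window)
    return schedule
```

Calling `run_schedule(controller, states, hold_window=4)` with two state vectors yields eight per-update reveal ratios, constant within each block of four. Holding the ratio fixed is what lets a later reward be attributed to a single exposure decision.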

If this is right

Full exposure of the reference reasoning is not reliably the best choice for student learning.
Mismatch between teacher targets and student competence grows monotonically with the amount of privileged reasoning shown.
Optimizing exposure with a future-progress reward addresses the delayed credit assignment problem in on-policy distillation.
The adaptive method delivers consistent gains over OPSD and other baselines on AIME 24, AIME 25, and HMMT 25 for models ranging from 1.7B to 8B parameters.
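
The delayed-credit point can be illustrated with a small sketch. The page does not give the exact reward formula, so the form below (a discounted sum of per-window changes in a student metric) is an assumption, as are the function name `discounted_progress_rewards` and the `horizon` parameter.

```python
def discounted_progress_rewards(scores, gamma=0.9, horizon=None):
    """Score each held exposure decision by discounted future progress.

    `scores` holds a student metric measured at the window boundaries, so it
    has one more entry than there are windows. Window k's immediate progress
    is scores[k + 1] - scores[k]; its reward also credits later windows,
    down-weighted by gamma per window of delay.
    """
    progress = [scores[k + 1] - scores[k] for k in range(len(scores) - 1)]
    n = len(progress)
    rewards = []
    for k in range(n):
        # Optionally truncate the look-ahead to `horizon` windows.
        end = n if horizon is None else min(n, k + horizon)
        rewards.append(sum(gamma ** (j - k) * progress[j] for j in range(k, end)))
    return rewards
```

With made-up boundary scores of 0.10, 0.10, 0.30 and `gamma=0.5`, the first decision shows zero immediate progress but receives a discounted reward of about 0.1, because the improvement that follows it is partly credited back. Setting `horizon=1` recovers an immediate-progress signal, under which that same decision would receive no credit.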

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same controller design could be applied to other training-time decisions such as rollout length or data filtering in reasoning pipelines.
If the compact training statistics omit key signals, the learned policy may fail to generalize beyond the training distribution of math problems.
The delayed-reward formulation might transfer to other credit-assignment settings in LLM post-training where immediate loss is a poor signal.

Load-bearing premise

A lightweight Beta-policy controller optimized via a discounted learning-progress reward on compact training-state statistics will reliably produce exposure decisions that improve long-term student performance without introducing training instability or benchmark-specific overfitting.

What would settle it

The claim would be undermined if an ablation that replaces the adaptive controller with fixed full exposure or random exposure sampling, and retrains the same models on the identical benchmarks, showed equal or higher scores.

Figures

Figures reproduced from arXiv: 2605.11458 by Huaibin Wang, Tiangang Zhang, Yilun Sun, Zihao Han.

**Figure 1.** Overview of ATESD. (A) Teacher-side exposure mismatch: on an easy problem (e.g. 2+3) the teacher’s privileged CoT stays within the student’s capability and distillation succeeds; on a hard problem (e.g. a quadratic equation) the full CoT far exceeds the student’s level, producing targets the student cannot absorb. (B) ATESD limits the privileged CoT via a learned exposure α: a Beta-policy controller π_ϕ sel…

**Figure 2.** Empirical analysis of teacher exposure on AIME 2024 with Qwen3-1.7B (3 seeds, …

**Figure 3.** Overview of ATESD. The OPSD backbone samples student continuations from the …

**Figure 4.** Mechanism ablations for exposure control.

Original abstract

On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
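
The abstract does not say how a reveal ratio is turned into the text the teacher sees. One plausible reading, sketched below purely as an assumption, is that the teacher receives a prefix of the reference reasoning steps. The function names, the step-level granularity, and the prompt layout are all hypothetical.

```python
def reveal_prefix(reference_steps, reveal_ratio):
    """Return the leading fraction of reference steps (assumed prefix reveal)."""
    if not 0.0 <= reveal_ratio <= 1.0:
        raise ValueError("reveal_ratio must lie in [0, 1]")
    # The small epsilon guards against float error just below an integer.
    n_revealed = int(reveal_ratio * len(reference_steps) + 1e-9)
    return reference_steps[:n_revealed]


def build_teacher_prompt(problem, reference_steps, reveal_ratio):
    """Assemble a teacher context containing only the revealed steps."""
    revealed = reveal_prefix(reference_steps, reveal_ratio)
    lines = [f"Problem: {problem}"]
    if revealed:
        lines.append("Reference reasoning (partial):")
        lines.extend(f"{i}. {step}" for i, step in enumerate(revealed, start=1))
    return "\n".join(lines)
```

Under this reading, a ratio of 1.0 reproduces the full-exposure default and a ratio of 0.0 leaves the teacher with the problem statement alone; token-level or step-importance reveals would be equally compatible with the text.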

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ATESD adds a Beta-policy controller for adaptive teacher exposure in self-distillation and shows modest gains over OPSD, but the results lack variance stats and the overfitting risk on these benchmarks is real.

The main thing to know is that the paper treats teacher exposure ratio as a trainable control variable instead of a fixed hyperparameter. They use a lightweight Beta-policy conditioned on training-state stats, sample one exposure per short hold window, and optimize it with a discounted reward based on the student's later improvement rather than immediate loss. That setup is new in this literature and directly addresses the mismatch they identify where full reference reasoning can be too strong for the current student. Their fixed-exposure sweep supports the premise that full exposure is not always best and that mismatch grows with more privileged context. The experiments on Qwen3 models across AIME 24/25 and HMMT 25 report consistent outperformance, with gains of roughly 1-2 Average@12 points over OPSD and other baselines. That is a concrete empirical result worth noting.

The soft spots are the absence of run variance, statistical significance, or baseline implementation details, which makes it hard to judge how robust the deltas really are. The stress-test concern about the controller overfitting to the difficulty curves of these specific problems via the progress reward also lands, because the reward is computed on the same evaluation distribution and no transfer tests or reward ablations are described. The gains are real but incremental, so the work does not upend existing recipes.

This is for people actively tuning on-policy self-distillation for reasoning models who want another knob to adjust during training. A reader in that subfield would get value from the method and the comparison even if they treat the numbers as preliminary. I would send it to peer review so referees can check the experimental controls and generality.

Referee Report

2 major / 2 minor

Summary. The paper argues that full teacher exposure to reference reasoning in on-policy self-distillation for LLMs creates an exposure mismatch that hinders student learning. It proposes ATESD, which replaces fixed exposure with a lightweight Beta-policy controller conditioned on compact training-state statistics; the controller is optimized via a discounted learning-progress reward that evaluates each exposure decision by its effect on future student improvement over a short hold window. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-1.7B/4B/8B models report consistent gains over OPSD and other self-distillation/RL baselines (+0.95 to +2.33 Average@12).

Significance. If the performance gains prove robust, the work identifies a previously unexamined axis—adaptive teacher exposure—in reasoning self-distillation and demonstrates that a simple learnable controller can outperform fixed-exposure and standard RL baselines. The delayed-credit formulation and use of compact state statistics are practical contributions that could generalize beyond the reported math benchmarks.

major comments (2)

[Experiments] Experimental section: the central performance claim (consistent outperformance of OPSD by +0.95/+2.05/+2.33 Average@12 on AIME 24/25 and HMMT 25) is presented as summarized deltas without reported standard deviations, number of independent runs, statistical significance tests, or exact baseline re-implementations and data splits. This leaves the robustness of the gains difficult to evaluate.
[Method] Method (controller optimization): the discounted learning-progress reward directly scores future improvement on the same AIME/HMMT evaluation distribution used for final reporting. No cross-benchmark transfer experiments, hold-out validation of the controller, or ablation isolating the reward from benchmark-specific difficulty curves are described, raising the possibility that reported gains reflect reduced mismatch on these particular rollouts rather than a general exposure principle.
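
The reporting the first major comment asks for is simple to produce. The sketch below uses only the Python standard library and computes a per-method mean and sample standard deviation plus a paired t statistic across seeds. The helper names are hypothetical, and no numbers from the paper are used.

```python
import math
import statistics


def summarize(runs):
    """Mean and sample standard deviation of one method's per-seed scores."""
    return statistics.mean(runs), statistics.stdev(runs)


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for seed-matched runs."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("scores must be paired seed by seed")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

For toy inputs `[3, 5, 7]` against `[1, 2, 3]` the paired differences are 2, 3, 4, giving t of about 5.196 on 2 degrees of freedom. With only three seeds the two-sided 5% critical value is 4.303, so the test has little power; turning t into a p-value needs a t-distribution CDF, which the standard library does not provide.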

minor comments (2)

[Abstract] The abstract and introduction refer to “competitive self-distillation and RL baselines” without naming the full set of comparators (e.g., specific RL variants or prior self-distillation methods); an explicit list would improve clarity.
[Method] Notation for the Beta-policy parameters and the exact form of the compact training-state statistics could be formalized in a single equation or table for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental reporting and the generality of the proposed controller. We address each major comment below and have updated the manuscript to incorporate additional details, statistical reporting, and new experiments where feasible.

Referee: [Experiments] Experimental section: the central performance claim (consistent outperformance of OPSD by +0.95/+2.05/+2.33 Average@12 on AIME 24/25 and HMMT 25) is presented as summarized deltas without reported standard deviations, number of independent runs, statistical significance tests, or exact baseline re-implementations and data splits. This leaves the robustness of the gains difficult to evaluate.

Authors: We agree that these details are necessary to properly evaluate robustness. The original manuscript omitted them for space reasons, but the experiments were run with multiple seeds. In the revised version we now report means and standard deviations over three independent runs for all main results, include paired t-test p-values demonstrating statistical significance of the reported gains, and provide full details on baseline re-implementations together with exact data splits and random seeds in a new appendix section. revision: yes
Referee: [Method] Method (controller optimization): the discounted learning-progress reward directly scores future improvement on the same AIME/HMMT evaluation distribution used for final reporting. No cross-benchmark transfer experiments, hold-out validation of the controller, or ablation isolating the reward from benchmark-specific difficulty curves are described, raising the possibility that reported gains reflect reduced mismatch on these particular rollouts rather than a general exposure principle.

Authors: The concern is well-taken: the learning-progress reward is computed from student accuracy on the target benchmark distributions during the short hold window. While the controller itself receives only compact, task-agnostic state features (recent loss, gradient statistics, rollout entropy), this still leaves open the question of whether the gains are benchmark-specific. In the revision we have added (i) cross-benchmark transfer results in which a controller trained on AIME rollouts is deployed on HMMT and vice versa, (ii) an internal hold-out split of the benchmark problems used solely for reward computation during controller updates, and (iii) an ablation replacing the delayed learning-progress reward with an immediate-loss baseline. These new results are presented in an expanded experimental section and support that the benefit arises from adaptive exposure rather than overfitting to particular benchmark difficulty curves. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation begins with an empirical fixed-exposure sweep establishing that full teacher reference is suboptimal and mismatch increases with exposure; this is an independent observation, not a fitted input. It then defines a new lightweight Beta-policy controller and a discounted learning-progress reward whose target (future student improvement over hold windows) is specified externally to any model parameters or prior results. The reported gains over OPSD and RL baselines are measured on held-out evaluation rollouts rather than quantities forced by construction or by a self-citation chain. No equations, uniqueness theorems, or ansatzes are shown to reduce to self-referential definitions, and the method introduces an independent optimization axis whose validity is tested rather than presupposed.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim depends on the empirical effectiveness of a newly introduced controller whose parameters are fitted during training and on background assumptions about on-policy distillation dynamics.

free parameters (2)

Beta-policy parameters: Weights of the lightweight controller that outputs the reveal-ratio distribution; learned end-to-end.
Reward discount factor: Hyperparameter controlling how far into the future the learning-progress signal is discounted.
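
How the controller parameters might be fitted from a reward can be sketched with a score-function (REINFORCE-style) update. This page does not state which policy-gradient estimator the paper uses, so the estimator, the log-parameterization of the Beta concentrations, and the finite-difference gradient below are all assumptions chosen to keep the example dependency-free.

```python
import math


def beta_log_prob(x, a, b):
    """Log density of Beta(a, b) at x, for x strictly inside (0, 1)."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x) - log_norm


def reinforce_step(log_a, log_b, ratio, reward, baseline=0.0, lr=0.1, eps=1e-5):
    """One ascent step on (log a, log b) for a single held decision.

    The gradient of the log density is taken by central finite differences;
    a real implementation would use automatic differentiation instead.
    """
    def logp(la, lb):
        return beta_log_prob(ratio, math.exp(la), math.exp(lb))

    grad_a = (logp(log_a + eps, log_b) - logp(log_a - eps, log_b)) / (2 * eps)
    grad_b = (logp(log_a, log_b + eps) - logp(log_a, log_b - eps)) / (2 * eps)
    advantage = reward - baseline
    return log_a + lr * advantage * grad_a, log_b + lr * advantage * grad_b
```

Starting from a symmetric Beta(2, 2), a positive reward for a sampled ratio of 0.9 shifts the distribution's mean above 0.5, and a negative reward shifts it below. The `reward` argument is where a discounted learning-progress signal would enter.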

axioms (2)

On-policy self-distillation with teacher conditioning on reference solutions is a viable base recipe for improving LLM reasoning (domain assumption). The paper takes this established approach as given and modifies only the exposure variable.
Compact training-state statistics are sufficient to condition an effective exposure policy (ad hoc to paper). Introduced without further justification in the method description.

invented entities (1)

Beta-policy controller (no independent evidence)
Purpose: Dynamically samples the fraction of reference reasoning revealed to the teacher.
New component proposed to replace the fixed full-exposure default.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC · 2026-05 · unverdicted · novelty 2.0

OPSD lets a single LLM distill its own reasoning by sampling trajectories from the student role while granting the teacher role privileged access to verified solutions, reducing memory needs versus separate-model dist...

A Brief Overview: On-Policy Self-Distillation In Large Language Models
cs.HC · 2026-05 · unverdicted · novelty 2.0

This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.