Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

Bing Wang; Jieping Ye; Kaiyuan Liu; Rongxiang Weng; Yang Bai; Ziyuan Zhuang

arxiv: 2605.13643 · v2 · pith:6CAYQMLCnew · submitted 2026-05-13 · 💻 cs.CL


Pith reviewed 2026-05-14 19:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords on-policy distillation · strong-to-weak distillation · local teachability collapse · change point detection · teacher margin · trajectory truncation · Qwen3 models


The pith

In strong-to-weak on-policy distillation, truncating supervision at the onset of local teachability collapse outperforms full-trajectory training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that in strong-to-weak on-policy distillation, full-trajectory supervision can fail because later tokens often lack the local contrast needed to guide learning effectively. The authors identify this as local teachability collapse and propose focusing supervision only on the prefix where the teacher's feedback remains discriminative. They detect the cutoff by tracking the teacher's margin over the student's top-K options, averaging it per sentence, and applying a BIC change-point test to find the downward shift. Experiments on Qwen3 models demonstrate that this selective approach outperforms standard full-sequence OPD on five in-domain benchmarks while better maintaining out-of-domain performance. The insight is that teacher guidance must be not only present but locally useful for the student.
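As a rough illustration of the per-token signal described above, the sketch below computes a teacher margin restricted to the student's top-K candidates at one position. It assumes the margin is the teacher's top-1 minus top-2 score taken in log-probability space, and that each model's distribution is available as a plain dictionary; the function name `teacher_margin_over_topk` and the input layout are hypothetical, not taken from the paper.

```python
import math


def teacher_margin_over_topk(student_logprobs, teacher_logprobs, k):
    """Teacher top-1 minus top-2 log-prob over the student's top-k tokens.

    Both arguments map token id -> log-probability at a single position.
    Assumption: the margin is measured in log-probability space.
    """
    if k < 2:
        raise ValueError("k must be at least 2 to form a top-1/top-2 margin")
    # The student's k most likely candidate tokens at this position.
    candidates = sorted(student_logprobs, key=student_logprobs.get, reverse=True)[:k]
    if len(candidates) < 2:
        raise ValueError("need at least two student candidates")
    # Teacher scores on exactly those candidates; unseen ids count as -inf.
    scores = sorted(
        (teacher_logprobs.get(tok, -math.inf) for tok in candidates),
        reverse=True,
    )
    return scores[0] - scores[1]
```

With toy inputs `{"a": -0.1, "b": -2.5, "c": -4.0}` for the student and `{"a": -3.0, "b": -0.2, "c": -2.0}` for the teacher, `k=2` restricts the comparison to tokens `a` and `b` and gives a margin of about 2.8. A large value means the teacher clearly separates the student's candidates; a value near zero means the feedback is locally flat.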

Core claim

The paper establishes that in strong-to-weak OPD, supervision should be truncated at the onset of local teachability collapse, detected via a BIC-style downward change point in NLTK-sentence-aggregated teacher margins over the student's top-K set. This trajectory-specific release rule delivers superior performance compared to full-trajectory supervision across multiple benchmarks and student scales, while also improving out-of-domain capability retention.
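One plausible way to write the BIC-style test is given below. It is a sketch under stated assumptions, not the paper's own formulation: the sentence-level margins $m_1, \dots, m_n$ are modeled as piecewise constant with Gaussian residuals, $\tau$ is a candidate split, and $p_0 < p_1$ are the parameter counts of the one-mean and two-mean models (their exact values are not given in the material here).

```latex
\begin{aligned}
\mathrm{RSS}_0 &= \sum_{i=1}^{n} \left(m_i - \bar m\right)^2,
&\mathrm{BIC}_0 &= n \ln\frac{\mathrm{RSS}_0}{n} + p_0 \ln n, \\
\mathrm{RSS}_1(\tau) &= \sum_{i \le \tau} \left(m_i - \bar m_{\le \tau}\right)^2
  + \sum_{i > \tau} \left(m_i - \bar m_{> \tau}\right)^2,
&\mathrm{BIC}_1(\tau) &= n \ln\frac{\mathrm{RSS}_1(\tau)}{n} + p_1 \ln n, \\
G &= \mathrm{BIC}_0 - \min_{\tau} \mathrm{BIC}_1(\tau).
\end{aligned}
```

Under this reading, supervision is released at $\tau^\ast = \arg\min_\tau \mathrm{BIC}_1(\tau)$ only when the gain $G$ is large enough and the shift is downward, that is, $\bar m_{> \tau^\ast} < \bar m_{\le \tau^\ast}$.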

What carries the argument

Trajectory-specific release rule that measures teacher margin over student's top-K set, aggregates across NLTK sentences, and truncates at the BIC-detected downward change point.
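The change-point step of this rule can be sketched as follows. This is a minimal, dependency-free illustration that assumes a two-segment piecewise-constant model scored with an RSS-based BIC; the parameter counts (1 for the null model, 3 for the split model), the `min_gain` threshold, and the function name `bic_downward_release` are assumptions made for the example, not values reported by the paper.

```python
import math


def bic_downward_release(segment_means, min_gain=0.0, eps=1e-12):
    """Return the index of the first released segment, or None.

    segment_means: one aggregated teacher margin per sentence segment.
    min_gain: hypothetical acceptance threshold on BIC_0 - BIC_1.
    """
    n = len(segment_means)
    if n < 4:
        return None  # too few segments to compare two regimes

    def rss(xs):
        mu = sum(xs) / len(xs)
        return sum((x - mu) ** 2 for x in xs)

    # Null model: a single constant mean (assumed 1 parameter).
    bic0 = n * math.log(rss(segment_means) / n + eps) + 1 * math.log(n)

    best_tau, best_bic1 = None, math.inf
    for tau in range(2, n - 1):  # keep at least two segments on each side
        left, right = segment_means[:tau], segment_means[tau:]
        if sum(right) / len(right) >= sum(left) / len(left):
            continue  # only downward shifts are candidates
        # Split model: two means plus the split location (assumed 3 parameters).
        bic1 = n * math.log((rss(left) + rss(right)) / n + eps) + 3 * math.log(n)
        if bic1 < best_bic1:
            best_tau, best_bic1 = tau, bic1

    if best_tau is None or bic0 - best_bic1 <= min_gain:
        return None  # keep full-trajectory supervision
    return best_tau
```

Segments with index below the returned value would keep dense supervision and the rest would be dropped. Note that this sketch selects the single best split over all candidates rather than scanning for the first one, which is one of several reasonable readings of the rule.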

If this is right

Truncating at the change point yields consistent gains over full-trajectory OPD on in-domain tasks.
The approach better preserves out-of-domain model capabilities than baseline distillation methods.
Performance improvements hold across different student scales within the Qwen3 model family.
Effective OPD requires assessing local utility of teacher feedback in addition to its availability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Dynamic per-trajectory monitoring of teachability could extend to other AI feedback methods such as RLHF.
Replacing sentence-level aggregation with token-level detection might allow even more precise cutoffs.
Similar collapse patterns may limit gains in self-improvement loops or iterative distillation.

Load-bearing premise

The BIC-style downward change point on sentence-aggregated teacher margins reliably marks where feedback stops being locally discriminative without cutting useful earlier supervision.

What would settle it

If the release rule produces lower scores than full-trajectory OPD when evaluated on the same five in-domain benchmarks and student scales, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.13643 by Bing Wang, Jieping Ye, Kaiyuan Liu, Rongxiang Weng, Yang Bai, Ziyuan Zhuang.

**Figure 1.** The presence of a guidance signal does not guarantee high-value local teachability. (a) The mean teacher-student advantage remains nonzero across late response regions, indicating that the teacher has not simply ceased guiding the student. (b) The within-bin standard deviation of $A_t$, normalized by its early-region value, steadily decays throughout the rollout. This decrease in sampled-path dispersion is a …

**Figure 2.** Teacher margin over student top-$K$ candidates. The normalized margin declines along the response, providing the local signal that is later aggregated over sentence segments for release.
Trajectory-specific release rule. The preceding diagnostic, combined with the established insight from outcome-driven RL that high-contrast advantages are more useful for policy updates while low-contrast advantages can of…

**Figure 3.** The release rule detects trajectory-specific drops in local contrast. Statistics are computed on the exported OPD rollouts using the teacher's top-1 and top-2 margin over the student's top-$K$ candidates. (a) The profiled RSS-BIC gain, $\mathrm{BIC}_0 - \min_\tau \mathrm{BIC}_1(\tau)$, is positive for most samples. The mean gain is 24.0, and 79% of the samples accept a significant downward BIC-improving release with a gain greater than …

**Figure 4.** Qualitative example: a representative rollout shows that while teacher and student log-probabilities diverge early on, they become highly aligned in later stages (log-probability correlation increases from -0.09 to 0.93). The resulting reduction in advantage variation (std(A) drops from 2.08 to 0.61) renders the dense feedback less actionable for guiding specific local policy improvements. Interpretation…

**Figure 5.** Support-size sensitivity. Left: The normalized teacher-margin curves remain similar across $K \in \{2, 4, 8, 16, 32, 64\}$, supporting the robustness of the local-contrast decline. Right: Early segments consistently exhibit larger teacher top-1–top-2 margins than late segments across support sizes, indicating that the decline in local teachability is not an artifact of a particular $K$.

**Figure 6.** Example BIC-style release. A representative rollout shows an accepted downward change point in the teacher top-1–top-2 margin over the student's top-$K$ candidates.

**Figure 7.** Diagnostic release positions and fixed-prefix performance. Each point is one fixed-prefix checkpoint; with only four checkpoints, this is descriptive rather than a correlation claim.
The following case studies illustrate the specific behaviors that the release rule is designed to exclude from supervision. These instances do not constitute inference-time truncation. Because the trajectory-specific rule exc…

**Figure 8.** Average response length during training. The gray line represents our proposed method, while the orange line corresponds to standard full-trajectory OPD.
We further evaluate the wall-clock training time of the 1.7B strong-to-weak setting under identical hardware and training conditions. On a single node equipped with eight H800 GPUs, standard full-trajectory OPD requires 14.28 hours for 100 training st…

Original abstract

On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags that full-trajectory OPD supervision often stops being useful once local contrast fades in strong-to-weak settings and gives a simple BIC change-point rule to truncate early, with reported gains on five benchmarks.


The main point is that in strong-to-weak on-policy distillation, keeping dense teacher feedback on the whole student rollout can backfire once later tokens lose local contrast, even when the teacher still holds an edge. The authors name this local teachability collapse and turn the observation into a concrete release rule: compute the teacher's margin over the student's top-K set, aggregate by NLTK sentences, and stop supervision at the first downward BIC change point. That rule is the operational contribution beyond prior full-sequence OPD work. They test it with Qwen3 models and report consistent wins over standard OPD on five in-domain benchmarks plus better out-of-domain retention. The idea is straightforward and directly addresses a practical pain point when distilling smaller models from stronger ones.

The experiments are the weakest part. The abstract asserts outperformance but supplies no error bars, no full hyperparameter tables, no baseline implementation details, and no statistical tests, so the size and reliability of the gains remain hard to judge from the given information. The change-point detector itself is presented as an empirical fix without token-level or gradient evidence that post-change tokens are genuinely less informative than earlier ones; any length-reducing heuristic could produce similar results. The NLTK sentence aggregation and BIC sensitivity are also free parameters that could be tuned to the test sets.

This work is aimed at people running strong-to-weak distillation pipelines who want a lightweight way to avoid wasting supervision on unteachable suffixes. A reader who already works on on-policy methods will get a usable heuristic to try, even if the current evidence is preliminary. Send it to peer review so the experimental claims can be checked with proper controls and the rule can be stress-tested on more models and tasks.

Referee Report

3 major / 2 minor

Summary. The paper argues that standard full-trajectory on-policy distillation (OPD) from strong to weak models can fail due to 'local teachability collapse,' where later tokens in a rollout retain non-zero teacher advantage but lack local contrast that makes dense feedback useful for student updates. The authors operationalize this via a trajectory-specific release rule: compute teacher margins over the student's top-K candidates, aggregate by NLTK sentence segments, and truncate supervision at the first downward BIC change point. Experiments on the Qwen3 family report that this rule outperforms full-trajectory OPD on five in-domain benchmarks at multiple scales and yields better out-of-domain preservation.

Significance. If the release rule correctly isolates regions of locally discriminative teacher feedback, the result would provide a practical, low-overhead improvement to strong-to-weak OPD that reduces unnecessary supervision while preserving or enhancing performance. The work also highlights a previously under-examined failure mode in on-policy distillation, which could influence how future distillation pipelines allocate teacher compute.

major comments (3)

[Abstract and §3, method] The claim that the BIC downward change point on NLTK-aggregated margins 'reliably identifies the onset of local teachability collapse' is not supported by any token-level or gradient-level evidence that post-change-point tokens are less informative for student updates than pre-change-point tokens. Outperformance versus full-trajectory OPD could therefore be explained by any length-reducing heuristic rather than by correctly locating a teachability boundary.
[§4, experiments] The reported consistent outperformance on five benchmarks lacks error bars, statistical significance tests, exact hyperparameter values, baseline implementation details, and ablation studies on the top-K and BIC sensitivity parameters. Without these, it is impossible to determine whether the gains are robust or sensitive to the two free parameters listed in the axiom ledger.
[§3.2, release rule] The rule is presented as an empirical heuristic motivated by observed failure modes, yet no derivation or controlled experiment shows that the BIC statistic on sentence-aggregated margins isolates loss of local discriminative utility rather than simply detecting a drop in margin magnitude.

minor comments (2)

[§3] Notation for the top-K set and margin aggregation should be defined with explicit equations rather than prose descriptions to improve reproducibility.
[§4] The OOD preservation claim would be strengthened by reporting the specific out-of-domain tasks and the magnitude of capability retention relative to the full-trajectory baseline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, providing additional evidence and clarifications where possible. The revised manuscript incorporates several changes to strengthen the claims and experimental reporting.


Referee: [Abstract and §3] the claim that the BIC downward change point on NLTK-aggregated margins 'reliably identifies the onset of local teachability collapse' is not supported by any token-level or gradient-level evidence... Outperformance versus full-trajectory OPD could therefore be explained by any length-reducing heuristic

Authors: We agree that direct token- or gradient-level evidence would provide stronger causal support. Our primary evidence remains the consistent performance gains over full-trajectory OPD. To rule out a generic length-reduction effect, we have added an ablation comparing BIC truncation against random truncation at matched lengths; the BIC rule outperforms random truncation, indicating it locates regions of reduced local utility rather than simply shortening sequences. A full gradient analysis lies outside the current revision scope but is noted as future work.
Revision: partial
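The rebuttal does not say how the matched-length random truncation was built. One possible construction, sketched below purely as an assumption, permutes the kept fractions across trajectories so that the control keeps the same distribution of supervised lengths while breaking the link between each trajectory and its own detected change point. The function name `matched_length_random_cuts` and its arguments are hypothetical.

```python
import random


def matched_length_random_cuts(release_positions, trajectory_lengths, seed=0):
    """Build a length-matched random-truncation control.

    release_positions: tokens kept per trajectory under the release rule.
    trajectory_lengths: total tokens per trajectory (all must be > 0).
    Returns one control cut per trajectory, clipped to [1, length].
    """
    if len(release_positions) != len(trajectory_lengths):
        raise ValueError("inputs must have the same length")
    # Fraction of each rollout that the release rule kept.
    fractions = [p / n for p, n in zip(release_positions, trajectory_lengths)]
    # Reassign those fractions to trajectories at random (seeded for repeatability).
    rng = random.Random(seed)
    rng.shuffle(fractions)
    return [
        max(1, min(n, round(f * n)))
        for f, n in zip(fractions, trajectory_lengths)
    ]
```

If the release rule still beats this control, the gain is harder to attribute to shorter supervision alone, which is the point the referee raised.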
Referee: [§4] the reported consistent outperformance on five benchmarks lacks error bars, statistical significance tests, exact hyperparameter values, baseline implementation details, or ablation studies on the top-K and BIC sensitivity parameters

Authors: We have revised §4 and the appendix to include: (i) error bars from 3 independent runs, (ii) paired t-test p-values for all reported gains, (iii) exact hyperparameter tables, (iv) full baseline implementation details, and (v) sensitivity ablations for top-K and BIC penalty showing stable performance across reasonable ranges. These additions confirm the gains are robust.
Revision: yes
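For readers who want to reproduce the kind of paired comparison mentioned here, the sketch below computes a paired t statistic from per-run scores using only the standard library. The function name `paired_t_statistic` is hypothetical, and any numbers passed to it in examples are placeholders, not results from the paper. Converting the statistic to a p-value needs the Student t distribution, which `scipy.stats.ttest_rel` provides if SciPy is available.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched runs.

    scores_a, scores_b: per-run scores of two methods, paired by run
    (for example, same seed and same benchmark).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired")
    if len(scores_a) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With only three runs the test has two degrees of freedom, so even a sizeable t statistic corresponds to a fairly wide confidence interval; this is worth keeping in mind when reading the reported p-values.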
Referee: [§3.2] the rule is presented as an empirical heuristic... yet no derivation or controlled experiment shows that the BIC statistic on sentence-aggregated margins isolates loss of local discriminative utility rather than simply detecting a drop in margin magnitude

Authors: We have expanded §3.2 with a controlled comparison: BIC applied to sentence-aggregated margins versus BIC applied directly to raw margin values. The margin-based BIC better predicts downstream performance degradation in suffix regions, supporting that it captures loss of local contrast. We also added a short derivation sketch linking the change-point detection to the point where teacher advantage ceases to be locally discriminative for the student's top-K set.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical truncation heuristic tested on held-out data


The paper observes that full-trajectory OPD can fail in strong-to-weak settings due to loss of local contrast in later tokens, introduces the term 'local teachability collapse,' and defines a release rule that computes teacher margins over the student's top-K set, aggregates at NLTK sentence level, and truncates at the first BIC downward change point. This rule is presented as an operationalization of an empirical principle rather than a derivation from first principles. No equations reduce the reported gains to quantities fitted from the evaluation data, no self-citations bear the central load, and the outperformance is measured on held-out in-domain and out-of-domain benchmarks. The construction is therefore self-contained and does not collapse to its inputs by definition.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that teacher feedback is dense and locally comparable, plus two implementation choices (top-K and sentence segmentation) whose values are not justified from first principles.

free parameters (2)

top-K: Size of the student's candidate set used to compute the teacher margin; the value is not stated in the abstract but is required for the release rule.
BIC change-point sensitivity: Threshold or penalty parameter controlling when a downward change point triggers truncation; not specified.

axioms (2)

Domain assumption: NLTK sentence tokenization produces segments that align with regions of stable teachability. Used to aggregate margins before change-point detection.
Domain assumption: Teacher margin over top-K remains a valid local proxy for teachability throughout the trajectory until the detected change point. Core premise of the release rule.
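The first assumption concerns how token-level margins are pooled into sentence segments. The sketch below illustrates that pooling step with a deliberately naive stand-in for sentence segmentation: a segment ends at any token whose text ends in `.`, `!`, or `?`. The paper uses NLTK sentence tokenization (for example `nltk.tokenize.sent_tokenize`), which handles abbreviations and decimals far better; the stand-in only keeps the example free of dependencies. The function name `aggregate_margins_by_sentence` is hypothetical.

```python
def aggregate_margins_by_sentence(tokens, margins, terminators=(".", "!", "?")):
    """Average per-token margins within naive sentence segments.

    tokens: decoded token strings of one rollout.
    margins: one teacher margin per token, aligned with `tokens`.
    Returns (segment_means, segment_end_indices), where each end index is
    the position of the last token in that segment.
    """
    if len(tokens) != len(margins):
        raise ValueError("tokens and margins must be aligned")
    means, ends, bucket = [], [], []
    for i, (tok, margin) in enumerate(zip(tokens, margins)):
        bucket.append(margin)
        # Naive boundary rule (assumption): a terminator closes the segment.
        if tok.strip().endswith(terminators):
            means.append(sum(bucket) / len(bucket))
            ends.append(i)
            bucket = []
    if bucket:  # trailing tokens without a terminator form a final segment
        means.append(sum(bucket) / len(bucket))
        ends.append(len(tokens) - 1)
    return means, ends
```

The returned end indices make it possible to map a segment-level change point back to a token position, which is what a truncation of supervision ultimately needs. How well segment boundaries track regions of stable teachability is exactly what this ledger entry flags as unverified.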

invented entities (1)

local teachability collapse (no independent evidence)
Purpose: Named failure mode describing loss of discriminative power in suffix segments.
Coined term for the observed phenomenon; no independent evidence supplied beyond the experiments.

pith-pipeline@v0.9.0 · 5576 in / 1460 out tokens · 75982 ms · 2026-05-14T19:50:03.308971+00:00 · methodology
