Multi-Rollout On-Policy Distillation via Peer Successes and Failures

Chen Henry Wu; Gaurav Mittal; Haixin Wang; Matt Fredrikson; Ruowang Zhang; Weichen Yu; Xiaomin Li; Xiaoze Liu; Yinyi Luo; Yizhou Zhao

arxiv: 2605.12652 · v2 · pith:43MK434Fnew · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Weichen Yu, Xiaomin Li, Yizhou Zhao, Xiaoze Liu, Ruowang Zhang, Haixin Wang, Yinyi Luo, Chen Henry Wu, Gaurav Mittal, Matt Fredrikson, Yu Hu

Pith reviewed 2026-05-14 21:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords on-policy distillation · multi-rollout · peer conditioning · success-failure contrast · language model post-training · reasoning benchmarks · verifier alignment

The pith

By conditioning teacher signals on both successful and failed peer rollouts from the same prompt, multi-rollout on-policy distillation supplies denser and better-aligned supervision than single-rollout baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard on-policy distillation wastes information by treating each student rollout in isolation, even though the student has already generated multiple attempts for the same prompt. Multi-Rollout On-Policy Distillation (MOPD) instead feeds the teacher both the successes and the failures within that local group so that positive patterns can be reinforced and plausible mistakes can be explicitly discouraged. According to the paper's teacher-signal analysis, this produces token-level targets that track external verifier rewards more closely than isolated distillation does. A sympathetic reader would care because sparse verifier rewards are a common training signal for reasoning models, and any method that extracts more signal from the same samples could reduce the cost of post-training. The experiments report gains across competitive programming, mathematical reasoning, scientific question answering, and tool-use tasks.

Core claim

MOPD constructs teacher signals by conditioning on the student's local rollout group, employing both positive peer imitation and contrastive success-failure conditioning; the resulting mixed contexts yield teacher scores that align more closely with verifier rewards and deliver consistent gains over standard on-policy distillation baselines on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks.

What carries the argument

The peer-conditioned distillation framework that builds teacher targets from the student's own multi-rollout group by contrasting successful and failed trajectories for the identical prompt.
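As a rough illustration of what building teacher targets from the student's own rollout group could look like, the sketch below assembles a teacher context from peer successes and failures. Everything in it is an assumption made for illustration: the `Rollout` type, the `build_peer_context` helper, the prompt wording, and the choice to exclude the target rollout from its own context are hypothetical, and the default of two successes and one failure simply mirrors the "2 success + 1 failure" setting mentioned in the Figure 5 excerpt. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    text: str      # student-generated response
    reward: float  # verifier reward, assumed binary: 1.0 success, 0.0 failure


def build_peer_context(prompt: str, group: List[Rollout], target_idx: int,
                       n_success: int = 2, n_failure: int = 1) -> str:
    """Assemble a hypothetical teacher context for the rollout at target_idx."""
    # Assumption: the rollout being scored is not shown to the teacher as a peer.
    peers = [r for i, r in enumerate(group) if i != target_idx]
    successes = [r for r in peers if r.reward >= 1.0][:n_success]
    failures = [r for r in peers if r.reward < 1.0][:n_failure]

    parts = [f"Problem:\n{prompt}"]
    for k, r in enumerate(successes, 1):
        parts.append(f"Correct attempt {k}:\n{r.text}")
    for k, r in enumerate(failures, 1):
        parts.append(f"Incorrect attempt {k} (avoid this mistake):\n{r.text}")
    return "\n\n".join(parts)


# Toy rollout group for one prompt (made-up data).
group = [
    Rollout("x = 4", 1.0),
    Rollout("x = 5", 0.0),
    Rollout("x = 4, checked by substitution", 1.0),
    Rollout("x = -4", 0.0),
]
context = build_peer_context("Solve 2x = 8.", group, target_idx=0)
print(context)
```

Setting `n_failure=0` in this sketch would correspond to a positive-only context, while the default mixes both kinds of evidence.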

If this is right

Distillation performance improves when the teacher sees both correct and incorrect student attempts for the same prompt rather than one attempt at a time.
Mixed success-failure contexts increase the correlation between the teacher's token-level scores and the external verifier's binary reward.
On-policy methods become more effective when they treat the student's trial-and-error set as a structured source of positive and negative evidence.
The gains appear across four distinct reasoning domains, suggesting the mechanism is not tied to any single task format.
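One way to make the second implication checkable is a discrimination metric such as AUROC: the probability that a randomly chosen successful rollout receives a higher teacher score than a randomly chosen failed one. The sketch below is a minimal pure-Python version; the function name and the score values are invented for illustration and are not numbers from the paper.

```python
from typing import Sequence


def auroc(scores: Sequence[float], rewards: Sequence[int]) -> float:
    """Fraction of (success, failure) pairs ranked correctly; ties count half."""
    pos = [s for s, r in zip(scores, rewards) if r == 1]
    neg = [s for s, r in zip(scores, rewards) if r == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up teacher scores for the same four rollouts under two context conditions.
rewards = [1, 0, 1, 0]
isolated_scores = [0.6, 0.7, 0.5, 0.4]  # hypothetical single-rollout teacher
mixed_scores = [0.9, 0.3, 0.8, 0.2]     # hypothetical mixed-context teacher

auroc_isolated = auroc(isolated_scores, rewards)
auroc_mixed = auroc(mixed_scores, rewards)
print(auroc_isolated, auroc_mixed)
```

In this toy data the mixed-context scores separate successes from failures perfectly while the isolated scores are at chance; whether real teacher scores behave this way is exactly what the paper's signal-quality analysis would need to show.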

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same peer-conditioning idea could be applied in other on-policy RL settings where multiple trajectories are sampled per state to sharpen value estimates.
If the peer-group construction is kept instance-adaptive, it may reduce the need for hand-crafted negative examples or additional preference data.
The alignment result suggests that future verifier design could be guided by how well its signals match what a multi-rollout teacher already discovers.

Load-bearing premise

The student's local set of rollouts for a given prompt supplies teacher signals that are more informative and better aligned with verifier rewards without injecting new selection biases.

What would settle it

An experiment in which mixed success-failure conditioning produces no improvement in task accuracy or no increase in correlation between teacher scores and verifier rewards compared with single-rollout distillation.

Figures

Figures reproduced from arXiv: 2605.12652 by Chen Henry Wu, Gaurav Mittal, Haixin Wang, Matt Fredrikson, Ruowang Zhang, Weichen Yu, Xiaomin Li, Xiaoze Liu, Yinyi Luo, Yizhou Zhao, Yu Hu.

**Figure 1.** MOPD Illustration. To directly examine whether peer conditioning improves the self-teacher signal itself, we introduce an analysis of self-teacher signal quality. For each prompt, we fix a set of student-generated rollouts containing both successful and failed attempts, vary only the context shown to the self-teacher, and compare the self-teacher’s normalized logits or scores with ground-truth verifier r…

**Figure 2.** MOPD Pipeline. … to the successes and failures observed in the other rollouts. This prevents the teacher from exploiting local, instance-specific evidence contained in the rollout group. Section 4 (Multi-Rollout On-Policy Distillation): We propose Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that exploits the local structure of multiple on-policy rollouts generated for the same…

**Figure 3.** Number of training data that have ever generated a correct answer in the N rollout during training. Case Study. During training, we save the generated rollouts and compare them on the same question across training steps to provide a case study. Additionally, after training for the same number of steps, we save checkpoints from both SDPO and MOPD, then sample from these checkpoints to evaluate whether each …

**Figure 4.** Self-teacher-signal quality across seven context conditions. Each panel reports an averaged prompt …

**Figure 5.** Diversity Analysis. … evidence sharpens decision boundaries that positive evidence alone leaves blurred. 4) Combining both types yields the best results: the “2 success + 1 failure” context achieves the highest score on 5 of the 6 ranking and discrimination metrics in the signal-quality analysis, with a competitive Brier score, and the highest LCB downstream mean@8 among the compact peer-context settings. 5)…

Abstract

Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions: positive peer imitation and contrastive success-failure conditioning. Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards, indicating that the gains arise from more faithful, instance-adaptive supervision. These results indicate that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treating rollouts as isolated samples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MOPD gets modest gains by turning same-prompt failures into contrastive signals for on-policy distillation, but the circularity risk from correlated rollouts is real and unaddressed in the reported results.

The main thing to know is that this paper takes the multi-rollout groups already generated during on-policy training and uses both the successes and failures within each group to build richer teacher signals. Positive peer imitation copies good patterns; contrastive success-failure conditioning adds explicit negative examples of plausible mistakes. Experiments across competitive programming, math reasoning, scientific QA, and tool use show consistent improvements over standard on-policy baselines, plus better alignment between teacher scores and verifier rewards when mixed contexts are used. That is the concrete advance over prior work that treats each rollout in isolation.

The approach is simple and fits naturally into existing post-training pipelines that already sample multiple trajectories per prompt. The empirical pattern is reported across four domains, which gives it some breadth.

The soft spot is exactly the one flagged in the stress test. All trajectories come from the current student policy, so they share token patterns, reasoning shortcuts, and failure modes. Conditioning the teacher on this correlated set can amplify those biases instead of supplying independent evidence. The abstract claims mixed contexts improve alignment with verifiers, but without ablations on rollout similarity, cross-prompt negatives, or diversity metrics, it is difficult to rule out that the gains are partly circular. If the full paper only shows aggregate benchmark lifts without those controls, the central claim rests on an assumption that intra-group variation is sufficient.

This work is for groups already running on-policy distillation or RL post-training on LLMs and looking for cheap ways to densify supervision. A reader who cares about practical distillation tricks will find the constructions and the multi-domain results useful to try.

It deserves a serious referee because the idea is well-motivated, the experiments are broad, and the circularity issue is fixable with targeted controls rather than fatal. I would send it out.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned framework that constructs teacher signals by conditioning on both successful and failed rollouts sampled from the student's local rollout group for each prompt. It evaluates two constructions—positive peer imitation and contrastive success-failure conditioning—on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks, claiming consistent improvements over standard on-policy baselines. Teacher-signal analysis is reported to show that mixed success-failure contexts produce better alignment between teacher scores and external verifier rewards.

Significance. If the empirical gains and alignment results hold under rigorous controls, the work indicates that exploiting intra-prompt multi-rollout diversity can yield more informative, instance-adaptive supervision in on-policy distillation for LLMs trained with sparse verifier rewards, without requiring additional external data.

major comments (3)

[Experiments] Experiments section: The central claim of 'consistent improvements' over on-policy baselines is load-bearing, yet the abstract and summary provide no quantitative deltas, ablation results (e.g., positive-only vs. contrastive), or statistical significance tests; this leaves the magnitude and reliability of the reported gains unassessable.
[Method] Method and Analysis sections: The assumption that local rollout groups supply sufficiently independent positive/negative evidence is untested; because all trajectories are drawn from the current student policy, they are likely to share systematic errors, and no rollout-similarity metrics, diversity controls, or cross-prompt negative-example baselines are described to rule out circular reinforcement of policy biases.
[Analysis] Teacher-signal analysis: The claim that mixed success-failure contexts 'better align teacher scores with verifier rewards' requires concrete metrics (e.g., correlation coefficients or alignment scores per construction); without these numbers or controls for rollout correlation, the interpretation that gains arise from 'more faithful' supervision remains qualitative.

minor comments (2)

[Abstract] Abstract: Specify the number of rollouts per prompt and the precise on-policy baselines (e.g., standard OPD, PPO variants) used for comparison.
[Method] Notation: Define the exact conditioning mechanism for contrastive success-failure (e.g., how failures are formatted as negative evidence) with a short illustrative example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We provide point-by-point responses to the major comments below and will update the manuscript to incorporate the suggested improvements.

Referee: [Experiments] Experiments section: The central claim of 'consistent improvements' over on-policy baselines is load-bearing, yet the abstract and summary provide no quantitative deltas, ablation results (e.g., positive-only vs. contrastive), or statistical significance tests; this leaves the magnitude and reliability of the reported gains unassessable.

Authors: The manuscript's experimental results section includes tables with performance numbers on all benchmarks, showing improvements over baselines and ablations for the two peer-context constructions. To make these more prominent and address the concern directly, we will add the quantitative deltas, specific ablation comparisons, and statistical significance tests (including p-values) to the abstract, introduction, and a new subsection on statistical analysis in the revised manuscript.
revision: yes
Referee: [Method] Method and Analysis sections: The assumption that local rollout groups supply sufficiently independent positive/negative evidence is untested; because all trajectories are drawn from the current student policy, they are likely to share systematic errors, and no rollout-similarity metrics, diversity controls, or cross-prompt negative-example baselines are described to rule out circular reinforcement of policy biases.

Authors: We agree that this is an important point to verify. Although the success and failure labels provide a natural distinction, we will add experiments reporting rollout similarity metrics (e.g., average pairwise BLEU scores or embedding cosine similarities within rollout groups) and diversity statistics. We will also include a control experiment using cross-prompt negative examples to rule out bias reinforcement and demonstrate the benefit of intra-prompt peer failures.
revision: yes
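For readers who want a feel for the kind of within-group similarity statistic proposed in this response, here is a minimal sketch. It swaps in bag-of-words cosine similarity as a dependency-free stand-in for the BLEU or learned-embedding similarities named above, so it is a different and cruder measure; the function names and example strings are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(rollouts: List[str]) -> float:
    """Average cosine similarity over all unordered pairs in one rollout group."""
    # Stand-in representation: whitespace tokens, lowercased, counted.
    vecs = [Counter(r.lower().split()) for r in rollouts]
    pairs = list(combinations(vecs, 2))
    if not pairs:
        raise ValueError("need at least two rollouts")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Made-up rollout groups: one with repeated text, one with no shared tokens.
identical = mean_pairwise_similarity(["add the terms", "add the terms"])
disjoint = mean_pairwise_similarity(["add the terms", "factor both sides"])
print(identical, disjoint)
```

A group-level value near 1 would indicate that peers add little independent evidence, which is the scenario the referee's circularity concern is about.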
Referee: [Analysis] Teacher-signal analysis: The claim that mixed success-failure contexts 'better align teacher scores with verifier rewards' requires concrete metrics (e.g., correlation coefficients or alignment scores per construction); without these numbers or controls for rollout correlation, the interpretation that gains arise from 'more faithful' supervision remains qualitative.

Authors: We will revise the teacher-signal analysis to include concrete quantitative metrics. Specifically, we will report correlation coefficients (Pearson and Spearman) between teacher-assigned scores and verifier rewards for positive-only, failure-only, and mixed constructions. We will also add controls accounting for rollout correlations and present per-construction alignment scores to substantiate the claim with numerical evidence.
revision: yes
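The two coefficients named in this response can be sketched in a few lines of pure Python. Because verifier rewards are binary, the Spearman version needs tie-aware (average) ranks, which the sketch handles explicitly. The helper names and the score values are illustrative assumptions, not figures from the paper.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks where tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up teacher scores and binary verifier rewards for one rollout group.
teacher_scores = [0.9, 0.2, 0.8, 0.1]
verifier_rewards = [1, 0, 1, 0]
r_pearson = pearson(teacher_scores, verifier_rewards)
r_spearman = spearman(teacher_scores, verifier_rewards)
print(round(r_pearson, 4), round(r_spearman, 4))
```

Computing these per construction (positive-only, failure-only, mixed) on the same fixed rollouts would give the side-by-side numbers the referee asks for.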

Circularity Check

0 steps flagged

No significant circularity; empirical method with external benchmark validation

The paper defines MOPD as a framework that constructs teacher signals from the student's own multi-rollout group for each prompt, then reports empirical gains on independent benchmarks (competitive programming, math reasoning, scientific QA, tool-use) against standard on-policy baselines. No equations, derivations, or fitted parameters are shown that reduce the claimed improvements or alignment metrics to quantities defined by the method inputs by construction. Teacher-signal analysis compares against external verifier rewards rather than self-referential quantities. No self-citations are invoked as load-bearing uniqueness theorems, and the method does not rename known results or smuggle ansatzes. The derivation chain is self-contained as an empirical proposal with measurable external outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that multi-rollout trial-and-error behavior contains structured positive and negative evidence that can be turned into faithful teacher signals; no free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption: On-policy distillation offers denser token-level supervision than sparse verifier rewards.
Stated directly in the opening of the abstract as the motivation for OPD.
domain assumption: Conditioning the teacher on both successful and failed peer rollouts produces more informative signals.
Core premise of the MOPD framework introduced in the abstract.

pith-pipeline@v0.9.0 · 5558 in / 1335 out tokens · 32483 ms · 2026-05-14T21:25:53.077522+00:00 · methodology
