Pith Number

pith:43MK434F

pith:2026:43MK434FGV2PB6DWQ25Y4V6ULU
not attested · not anchored · not stored · refs resolved

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

Chen Henry Wu, Gaurav Mittal, Haixin Wang, Matt Fredrikson, Ruowang Zhang, Weichen Yu, Xiaomin Li, Xiaoze Liu, Yinyi Luo, Yizhou Zhao, Yu Hu

By conditioning teacher signals on both successful and failed peer rollouts from the same prompt, multi-rollout on-policy distillation supplies denser and better-aligned supervision than single-rollout baselines.

arxiv:2605.12652 v1 · 2026-05-12 · cs.LG · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{43MK434FGV2PB6DWQ25Y4V6ULU}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards.

C2 · weakest assumption

The student's local rollout group can be used to construct teacher signals that are both more informative and better aligned with external verifier rewards, without the peer selection process introducing new biases.

C3 · one line summary

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
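
The summary above can be illustrated with a minimal sketch. This record does not describe how the paper builds its teacher input, so everything here is an assumption made for illustration: the `Rollout` type, the `build_teacher_context` helper, the pass/fail threshold on the verifier reward, and the prompt layout are all hypothetical. The sketch only shows the general idea of splitting a rollout group for one prompt into peer successes and peer failures and placing both in the context a teacher would see when scoring a target rollout.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str
    reward: float  # verifier reward; assumed here: 1.0 = pass, 0.0 = fail


def build_teacher_context(prompt, group, target_idx, threshold=0.5, max_peers=2):
    """Hypothetical helper: assemble the text a teacher is conditioned on
    when scoring group[target_idx]. Not the paper's actual procedure."""
    # Peers are all rollouts for the same prompt except the one being scored.
    peers = [r for i, r in enumerate(group) if i != target_idx]
    successes = [r.text for r in peers if r.reward >= threshold][:max_peers]
    failures = [r.text for r in peers if r.reward < threshold][:max_peers]

    parts = [f"Problem:\n{prompt}"]
    for s in successes:
        parts.append(f"Peer attempt (passed verifier):\n{s}")
    for f in failures:
        parts.append(f"Peer attempt (failed verifier):\n{f}")
    return "\n\n".join(parts)


group = [
    Rollout("return sorted(xs)", 1.0),
    Rollout("return xs", 0.0),
    Rollout("return list(sorted(xs))", 1.0),
]
print(build_teacher_context("Sort a list of integers.", group, target_idx=0))
```

Excluding the target rollout from its own context keeps the teacher from simply copying the answer it is asked to score; whether MOPD does this is not stated in this record.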

References

79 extracted · 79 resolved · 23 Pith anchors

[1] Evaluating Large Language Models Trained on Code · arXiv:2107.03374
[2] Distilling the Knowledge in a Neural Network · arXiv:1503.02531
[3] Rush, 2016
[4] A Survey on Knowledge Distillation of Large Language Models · arXiv:2402.13116
[5] arXiv:2305.15717

Formal links

1 machine-checked theorem link

Receipt and verification
First computed: 2026-05-18T03:09:50.759655Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

e6d8ae6f853574f0f87686bb8e57d45d14eaa6d5928503e5034afa5da6273372

Aliases

arxiv: 2605.12652 · arxiv_version: 2605.12652v1 · doi: 10.48550/arxiv.2605.12652 · pith_short_12: 43MK434FGV2P · pith_short_16: 43MK434FGV2PB6DW · pith_short_8: 43MK434F
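
The identifiers on this page appear to be related in a simple way: the full Pith Number matches the unpadded base32 encoding of the first 16 bytes of the canonical hash, and each short alias is a prefix of the full number. The page does not document this derivation, so treat the sketch below as an observation about this one record, not as the builder's specification. The function names are hypothetical.

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "e6d8ae6f853574f0f87686bb8e57d45d14eaa6d5928503e5034afa5da6273372"


def pith_number_from_hash(hex_digest: str) -> str:
    """Assumed derivation: base32 of the first 16 digest bytes, padding removed."""
    raw = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_alias(pith_number: str, length: int) -> str:
    """Short aliases on this page are plain prefixes of the full number."""
    return pith_number[:length]


full = pith_number_from_hash(CANONICAL_HASH)
print(full)  # 43MK434FGV2PB6DWQ25Y4V6ULU
for n in (8, 12, 16):
    print(f"pith_short_{n}: {short_alias(full, n)}")
```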
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/43MK434FGV2PB6DWQ25Y4V6ULU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: e6d8ae6f853574f0f87686bb8e57d45d14eaa6d5928503e5034afa5da6273372
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "593fa5067f258c404222fd96a88c5f1b645eb03f85ac6ce256a6dbdd8e7b3fcc",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-12T18:57:44Z",
    "title_canon_sha256": "e19006a81b64cfbc13bb3059452dcfc0320e92d0a44498802fc5f562d060bf42"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12652",
    "kind": "arxiv",
    "version": 1
  }
}
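
The verification one-liner above can be unpacked into a small offline script. This is a sketch that mirrors the same serialization settings (`sort_keys=True`, compact separators, `ensure_ascii=False`, UTF-8 encoding) and applies them to a copy of the canonical record shown on this page. The function name `canonical_sha256` is hypothetical, and the script prints the comparison result instead of asserting it, since the match depends on the record being transcribed exactly.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, then hash the bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Copy of the canonical record printed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "593fa5067f258c404222fd96a88c5f1b645eb03f85ac6ce256a6dbdd8e7b3fcc",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-12T18:57:44Z",
        "title_canon_sha256": "e19006a81b64cfbc13bb3059452dcfc0320e92d0a44498802fc5f562d060bf42",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.12652", "kind": "arxiv", "version": 1},
}

# Digest the page says to expect.
EXPECTED = "e6d8ae6f853574f0f87686bb8e57d45d14eaa6d5928503e5034afa5da6273372"

digest = canonical_sha256(record)
print(digest)
print("matches page:", digest == EXPECTED)
```

Sorting keys and fixing the separators is what makes the digest independent of how a mirror happens to order or indent the JSON it serves.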