Pith Number

pith:43MK434F

pith:2026:43MK434FGV2PB6DWQ25Y4V6ULU

not attested · not anchored · not stored · refs resolved

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

Chen Henry Wu, Gaurav Mittal, Haixin Wang, Matt Fredrikson, Ruowang Zhang, Weichen Yu, Xiaomin Li, Xiaoze Liu, Yinyi Luo, Yizhou Zhao, Yu Hu

By conditioning teacher signals on both successful and failed peer rollouts from the same prompt, multi-rollout on-policy distillation supplies denser and better-aligned supervision than single-rollout baselines.

arxiv:2605.12652 v1 · 2026-05-12 · cs.LG · cs.AI

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{43MK434FGV2PB6DWQ25Y4V6ULU}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that multi-rollout on-policy distillation (MOPD) consistently improves over standard on-policy baselines. Further analysis of the teacher signal shows that contexts mixing successes and failures align teacher scores more closely with verifier rewards.

C2 · weakest assumption

That the student's local rollout group can be used to construct teacher signals that are both more informative and better aligned with external verifier rewards without introducing new biases from the peer selection process.

C3 · one-line summary

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
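
As a rough illustration of the idea summarized in these claims, the sketch below shows one way a rollout group for a single prompt could be split into peer successes and peer failures to form the context a teacher is conditioned on. Everything in it is an assumption made for illustration: the `Rollout` type, the `build_peer_context` helper, the binary verifier reward, and the prompt layout are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical container for one sampled response and its verifier reward."""
    text: str
    reward: float  # assumed: 1.0 if the verifier accepts the response, else 0.0


def build_peer_context(prompt, rollouts, target_idx, max_peers=2):
    """Assemble a teacher context from the peers of rollouts[target_idx].

    Hypothetical helper: peers with reward > 0 are listed as successes,
    the rest as failures; the rollout being scored is excluded.
    """
    peers = [r for i, r in enumerate(rollouts) if i != target_idx]
    successes = [r.text for r in peers if r.reward > 0][:max_peers]
    failures = [r.text for r in peers if r.reward <= 0][:max_peers]

    parts = [f"Problem:\n{prompt}"]
    parts += [f"Successful attempt:\n{s}" for s in successes]
    parts += [f"Failed attempt:\n{f}" for f in failures]
    return "\n\n".join(parts), successes, failures


# Toy rollout group for one prompt; texts and rewards are made up.
group = [
    Rollout("x = 4", 1.0),
    Rollout("x = 5", 0.0),
    Rollout("x = 4 (checked by substitution)", 1.0),
]
context, successes, failures = build_peer_context("Solve 2x = 8.", group, target_idx=0)
print(context)
```

In this sketch the teacher would score rollout 0 with both a correct peer and an incorrect peer in view. How the paper actually selects peers and formats the context is not specified on this page.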

References

79 extracted · 79 resolved · 23 Pith anchors

Formal links

1 machine-checked theorem link

Receipt and verification

First computed	2026-05-18T03:09:50.759655Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

e6d8ae6f853574f0f87686bb8e57d45d14eaa6d5928503e5034afa5da6273372

Aliases

arxiv: 2605.12652 · arxiv_version: 2605.12652v1 · doi: 10.48550/arxiv.2605.12652 · pith_short_12: 43MK434FGV2P · pith_short_16: 43MK434FGV2PB6DW · pith_short_8: 43MK434F
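
The identifier and the canonical hash on this page appear to be related: the 26-character Pith Number matches the unpadded RFC 4648 Base32 encoding of the first 16 bytes of the SHA-256 digest, and the `pith_short_*` aliases are prefixes of it. The sketch below checks that observation for the values shown here. It is not a statement of the official derivation, and the helper name `base32_prefix` is hypothetical.

```python
import base64

# Values copied from this page.
canonical_hash = "e6d8ae6f853574f0f87686bb8e57d45d14eaa6d5928503e5034afa5da6273372"
pith_id = "43MK434FGV2PB6DWQ25Y4V6ULU"


def base32_prefix(hex_digest, n_bytes=16):
    """Base32-encode the first n_bytes of a hex digest and drop '=' padding.

    Assumption: 16 bytes is inferred from the 26-character identifier,
    not taken from a published specification.
    """
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


derived = base32_prefix(canonical_hash)
print(derived == pith_id)

# The short aliases listed above are plain prefixes of the full identifier.
short_aliases = {n: pith_id[:n] for n in (8, 12, 16)}
print(short_aliases)
```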

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/43MK434FGV2PB6DWQ25Y4V6ULU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: e6d8ae6f853574f0f87686bb8e57d45d14eaa6d5928503e5034afa5da6273372

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "593fa5067f258c404222fd96a88c5f1b645eb03f85ac6ce256a6dbdd8e7b3fcc",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-12T18:57:44Z",
    "title_canon_sha256": "e19006a81b64cfbc13bb3059452dcfc0320e92d0a44498802fc5f562d060bf42"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12652",
    "kind": "arxiv",
    "version": 1
  }
}
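
The same canonicalization can be run offline against the record printed above. The following is a minimal sketch that mirrors the `json.dumps` arguments of the verification command (sorted keys, compact separators, `ensure_ascii=False`). The function name `canonical_sha256` is hypothetical, and whether the digest equals the published canonical hash depends on this page reproducing the record exactly, so compare the printed value against the expected hash rather than assuming a match.

```python
import hashlib
import json

# Canonical record as printed on this page.
record_text = """
{
  "metadata": {
    "abstract_canon_sha256": "593fa5067f258c404222fd96a88c5f1b645eb03f85ac6ce256a6dbdd8e7b3fcc",
    "cross_cats_sorted": ["cs.AI"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-12T18:57:44Z",
    "title_canon_sha256": "e19006a81b64cfbc13bb3059452dcfc0320e92d0a44498802fc5f562d060bf42"
  },
  "schema_version": "1.0",
  "source": {"id": "2605.12652", "kind": "arxiv", "version": 1}
}
"""


def canonical_sha256(record):
    """Hash a record using the serialization from the verification command."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(record_text)
digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because keys are sorted and whitespace is fixed before hashing, the digest does not depend on key order or indentation in the input, which is what lets a mirror recompute the same value from its own copy.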