pith. machine review for the scientific record.

arxiv: 2605.01347 · v1 · submitted 2026-05-02 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 15:14 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords on-policy distillation · multi-agent debate · knowledge distillation · agentic tasks · code generation · large language models · model compression

The pith

Multi-agent debate among teachers supplies higher-quality token supervision than any single teacher for on-policy distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that on-policy distillation is limited because students copy errors from one teacher, and this limit is worse in agentic tasks where mistakes add up over many steps. It proposes MAD-OPD, which lets several teachers debate the student's current state and produce a weighted collective signal instead. The method adds step-level sampling for agentic stability and selects divergence measures according to task type. Across six teacher-student size pairs and five benchmarks, the debate version ranks highest and delivers measurable gains over the best single-teacher baseline.

Core claim

MAD-OPD replaces the single-teacher supervisor in on-policy distillation with a deliberative group whose debate yields an emergent collective intelligence; each teacher's contribution is weighted by its post-debate confidence, and the resulting token-level targets are used to train the student on its own trajectories. The paper also defines On-Policy Agentic Distillation with step-level sampling to counteract compounding errors and derives a task-adaptive divergence rule that favors Jensen-Shannon divergence for agentic work and reverse KL for code generation.
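
Read as an objective, the claim can be sketched as follows; the notation is reconstructed from the abstract and the Figure 2 caption (post-debate confidences c_k, debate transcript H^R_m, force-decoded teacher distributions p_k), so this is a hedged paraphrase rather than the paper's own equation:

$$
w_k=\frac{c_k}{\sum_{j=1}^{K}c_j},\qquad
\mathcal{L}(\theta)=\mathbb{E}_{a_{1:M}\sim\pi_\theta}\left[\sum_{m=1}^{M}\sum_{k=1}^{K} w_k\,D\big(p_k(\cdot\mid s_m,H^{R}_{m}),\ \pi_\theta(\cdot\mid s_m)\big)\right],
$$

where $D$ is Jensen-Shannon divergence for agentic tasks and reverse KL, i.e. $\mathrm{KL}(\pi_\theta\,\|\,p_k)$, for code generation, and gradients flow only into the student parameters $\theta$.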

What carries the argument

The multi-agent debate process that turns separate teacher outputs into a single weighted supervision signal by letting the teachers discuss the student's on-policy state and score their own contributions.
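
A minimal sketch of that loop, under the two-round schedule described in the Figure 7 caption; the `respond`/`confidence` interface and every identifier here are hypothetical, not the paper's code:

```python
def debate_weights(teachers, student_state, rounds=2):
    """Debate over the student's on-policy state and return normalized
    post-debate confidence weights (hypothetical sketch, not the paper's API)."""
    transcript = []                  # visible only to the teachers
    latest = {}
    for r in range(rounds):
        for k, teacher in enumerate(teachers):
            # Round 1: independent answers; later rounds also see the transcript.
            latest[k] = teacher.respond(student_state, transcript if r > 0 else None)
        transcript.extend(latest.values())
    confidences = [t.confidence(latest[k]) for k, t in enumerate(teachers)]
    total = sum(confidences)
    return [c / total for c in confidences]
```

The resulting weights would then scale each teacher's contribution to the token-level divergence sketched above.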

If this is right

  • Students can exceed the performance ceiling set by any individual teacher on both agentic and code tasks.
  • Step-level sampling prevents error accumulation from destabilizing training in long-horizon agentic settings.
  • Choosing Jensen-Shannon divergence for agentic tasks and reverse KL for code generation improves stability and final scores.
  • The ranking advantage holds across all tested model-size combinations from 1.7B students to 32B teachers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The debate mechanism could be inserted into other distillation pipelines that currently rely on a single teacher.
  • For very large models, collective supervision might reduce the need to train one enormous teacher first.
  • The same collective-intelligence idea might extend to reinforcement learning from human feedback where multiple reward models are available.
  • Dynamic selection of which teachers participate in each debate round could further reduce compute while preserving gains.

Load-bearing premise

The multi-agent debate process will consistently produce supervision that is better than the strongest single teacher without introducing new biases or training instability.

What would settle it

If the same teacher-student pairs and benchmarks show that MAD-OPD performance is equal to or lower than the best single-teacher OPD run, the claim that debate supplies superior supervision would be falsified.

Figures

Figures reproduced from arXiv: 2605.01347 by Hua Yang, Jianze Wang, Jinlong Chen, Jun Wang, Qianglong Chen, Qilong Zhang, Xuchun Hu, Ying Liu, Yong Xie, Yu Cao.

Figure 1
Figure 1. Left: a single teacher's erroneous tool call is faithfully inherited by the student (the single-teacher capability ceiling). Right: MAD-OPD's multi-round debate corrects each teacher's blind spots, producing supervision that outperforms individual teachers. A worked-out instance with the full debate trajectory is in App. E.
Figure 2
Figure 2. The MAD-OPD Pipeline. At trajectory step m, the student π_θ samples an on-policy action a_m; K teachers debate for R rounds to produce a transcript H^R_m visible only to the teachers, establishing the privileged p–q gap. Teachers then force-decode a_m and contribute to a confidence-weighted divergence D (JSD for agents, reverse KL for code); gradients update only π_θ. The dashed border marks the OPAD outer loop …
Figure 3
Figure 3. Scaling behavior. (a) Overall Avg (%) vs. student size on the Qwen3 family; solid lines connect within-pair points, dashed segments at 4B connect cross-pair points. (b) Δ(Base) Avg (%), the gain over the un-distilled Base, across all six configurations.
Figure 4
Figure 4. Ablation study (14B+8B→4B). Shading separates agentic/code benchmarks. (a) Component ablation: debate adds +4.6% Co-Avg vs. MT-OPD; confidence weighting adds +1.7% Ag-Avg. (b) Per-divergence ablation …
Figure 5
Figure 5. Accuracy–efficiency on LCB-v6 (14B+8B→4B, 16 seeds). Both panels share the y-axis (avg. output tokens, lower is better) for direct token-cost comparison. Each ellipse spans ±1 standard deviation; markers are seed means; teacher reference points (gray stars) are dashed. The vertical dotted line and shaded region mark the 14B-teacher ceiling. (a) pass@1 (single sample): only MAD-OPD crosses the 25.57% ceiling …
Figure 6
Figure 6. Training loss curves for MAD-OPD (rows 1–2) and OPD (rows 3–4) on agentic and code tasks under three divergences (Forward KL, JSD, Reverse KL), 14B+8B→4B. X-axes are gradient steps (gradient accumulation = 8, so absolute loss values are roughly 8× the per-micro-batch value), truncated to the first 100. Each panel overlays the per-step training loss (light) with an exponential moving average (solid) …
Figure 7
Figure 7. Effect of debate rounds R on the 14B+8B→4B configuration. R=0: no teacher generation (single-teacher OPD). R=1: independent only (MT-OPD). R=2 (hatched, default): one inter-teacher revision round. R=3 degrades sharply on agentic tasks as the extra round inflates the prompt context. Background shading separates agentic tasks (left, JSD) from code tasks (right, reverse KL). The wavy break compresses the y-axis …
Figure 8
Figure 8. Token-level supervision heatmap for the start_date argument. Color intensity encodes divergence: low (agreement), medium, high (strong corrective gradient). The single 14B teacher shares the student's wrong month, providing zero corrective signal. After debate, MAD-OPD teachers concentrate supervision on the erroneous token.
read the original abstract

On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error compounding. We additionally derive a task-adaptive divergence principle, selecting JSD (Jensen-Shannon divergence) for agentic stability and reverse KL (Kullback-Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranks first across all six configurations; on the 14B+8B$\to$4B setting it lifts the agentic average by $+2.4\%$ and the code average by $+3.7\%$ over the stronger single-teacher OPD.
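
To make the divergence rule concrete, the sketch below computes both divergences on a toy distribution pair; the JSD and reverse-KL definitions are standard, while routing by task type is an assumption about how the stated principle would be applied, not code from the paper.

```python
import torch

def jsd(p, q, eps=1e-8):
    """Jensen-Shannon divergence: symmetric and bounded, hence the agentic-stability argument."""
    m = 0.5 * (p + q)
    return 0.5 * (p * ((p + eps) / (m + eps)).log()).sum(-1) \
         + 0.5 * (q * ((q + eps) / (m + eps)).log()).sum(-1)

def reverse_kl(p_teacher, q_student, eps=1e-8):
    """KL(student || teacher): mode-seeking, penalizes student mass the teacher rules out."""
    return (q_student * ((q_student + eps) / (p_teacher + eps)).log()).sum(-1)

def task_adaptive_divergence(p_teacher, q_student, task):
    # Agentic tasks -> JSD for stability; code generation -> reverse KL.
    return jsd(p_teacher, q_student) if task == "agentic" else reverse_kl(p_teacher, q_student)

# Toy example over a four-token vocabulary.
p = torch.tensor([0.7, 0.1, 0.1, 0.1])   # (confidence-weighted) teacher distribution
q = torch.tensor([0.4, 0.3, 0.2, 0.1])   # student distribution
print(task_adaptive_divergence(p, q, "agentic"), task_adaptive_divergence(p, q, "code"))
```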

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MAD-OPD, which recasts on-policy distillation by using multi-agent debate among teachers to produce emergent collective token-level supervision for a student, weighted by post-debate confidence. It introduces OPAD with step-level sampling to stabilize agentic distillation under error compounding, and derives a task-adaptive divergence principle selecting JSD for agentic tasks and reverse KL for code generation. Across six teacher-student configurations on agentic and code benchmarks, MAD-OPD ranks first, with reported lifts of +2.4% agentic average and +3.7% code average over single-teacher OPD in the 14B+8B to 4B setting.

Significance. If the gains are shown to stem specifically from the debate mechanism rather than ensembling, the work could meaningfully advance on-policy distillation methods by addressing single-teacher ceilings and extending to agentic settings. The cross-configuration ranking and task-adaptive divergence choice offer a potentially generalizable framework, though this depends on stronger controls for the proposed mechanisms.

major comments (2)
  1. [Abstract / Results] Abstract and results summary: The central claim attributes the reported first-place ranking and specific lifts (+2.4% agentic, +3.7% code on 14B+8B→4B) to emergent collective intelligence from debate plus post-debate confidence weighting. However, the manuscript compares only to single-teacher OPD; no non-debate multi-teacher baseline (e.g., mean or max of teacher logits on identical on-policy trajectories) is reported. This is load-bearing for crediting the deliberative process, JSD/reverse-KL selection, and OPAD sampling, as the gains could arise from multi-teacher access alone.
  2. [Abstract] The task-adaptive divergence principle: The abstract states that JSD is selected for agentic stability and reverse KL for code generation, derived theoretically and verified empirically. Without the explicit derivation steps, assumptions about error compounding or distribution mismatch, or the theoretical verification (e.g., any inequality or stability analysis), it is difficult to evaluate whether the selection is principled or post-hoc.
minor comments (2)
  1. [Abstract] No implementation details are supplied for debate prompt construction, confidence calibration, or the exact OPAD step-sampling procedure, which hinders reproducibility.
  2. [Abstract] The reported averages lack error bars, statistical significance tests, or variance across runs, making it hard to assess the reliability of the +2.4% and +3.7% lifts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the major concerns regarding the need for additional baselines and the presentation of the theoretical derivation below, and we outline the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract / Results] Abstract and results summary: The central claim attributes the reported first-place ranking and specific lifts (+2.4% agentic, +3.7% code on 14B+8B→4B) to emergent collective intelligence from debate plus post-debate confidence weighting. However, the manuscript compares only to single-teacher OPD; no non-debate multi-teacher baseline (e.g., mean or max of teacher logits on identical on-policy trajectories) is reported. This is load-bearing for crediting the deliberative process, JSD/reverse-KL selection, and OPAD sampling, as the gains could arise from multi-teacher access alone.

    Authors: We acknowledge the importance of this control experiment to attribute the improvements specifically to the multi-agent debate mechanism rather than merely having access to multiple teachers. The manuscript focuses on comparisons against the standard single-teacher OPD baseline to highlight the ceiling-breaking aspect. However, we agree that a non-debate multi-teacher baseline is necessary for a complete evaluation. In the revised version, we will include results from ensembling teacher logits (mean and max) on the same on-policy trajectories without the debate process (a minimal sketch of such a baseline appears after these responses). This will allow us to demonstrate the additional benefit provided by the deliberative debate and post-debate confidence weighting. revision: yes

  2. Referee: [Abstract] The task-adaptive divergence principle: The abstract states that JSD is selected for agentic stability and reverse KL for code generation, derived theoretically and verified empirically. Without the explicit derivation steps, assumptions about error compounding or distribution mismatch, or the theoretical verification (e.g., any inequality or stability analysis), it is difficult to evaluate whether the selection is principled or post-hoc.

    Authors: The derivation and theoretical verification of the task-adaptive divergence principle are detailed in Section 3.3 of the full manuscript, including assumptions about error compounding in agentic settings and distribution mismatch in code generation tasks. We show through analysis that JSD offers superior stability for long-horizon agentic tasks due to its symmetry, while reverse KL better aligns with the mode-seeking behavior needed for code. Empirical verification is provided across the benchmarks. To improve clarity in the abstract, we will revise it to briefly summarize the key theoretical motivation and assumptions, with a pointer to the detailed derivation in the main text. revision: partial
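
The control both sides agree on, ensembling teacher distributions without any debate, can be sketched in a few lines; `mean_teacher_target` and its interface are hypothetical and only illustrate the baseline the referee describes, not the paper's implementation:

```python
import torch

def mean_teacher_target(teacher_logits, weights=None):
    """Average K teachers' token distributions on the student's own trajectory,
    with no debate and no post-debate confidences (hypothetical baseline sketch)."""
    probs = torch.softmax(torch.stack(teacher_logits), dim=-1)     # (K, T, V)
    if weights is None:
        return probs.mean(dim=0)                                   # uniform mean
    w = torch.tensor(weights, dtype=probs.dtype).view(-1, 1, 1)    # fixed weights, e.g. by teacher size
    return (w * probs).sum(dim=0) / w.sum()
```

Comparing MAD-OPD against this target on identical on-policy trajectories would isolate what the debate itself contributes.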

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes MAD-OPD by recasting the teacher as a multi-agent debate collective and introduces OPAD with step sampling, plus a task-adaptive divergence principle (JSD for agentic, reverse KL for code) that is stated to be theoretically derived and then verified empirically. No quoted equations or steps reduce by construction to the inputs; the divergence choice is not presented as a fit to the reported gains, and the empirical lifts are measured against single-teacher OPD baselines rather than being tautological. The mechanism is evaluated against external benchmarks and does not rely on load-bearing self-citations or self-definitional renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the central claim rests on one domain assumption about debate dynamics; no free parameters or new entities are explicitly introduced or fitted in the provided text.

axioms (1)
  • domain assumption A deliberative collective of teachers via debate produces emergent collective intelligence that supplies superior token-level supervision compared with any single teacher.
    This premise directly enables the claim of breaking the single-teacher capability ceiling.

pith-pipeline@v0.9.0 · 5625 in / 1327 out tokens · 45234 ms · 2026-05-09T15:14:19.038861+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 31 canonical work pages · 17 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal et al. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, 2024

  2. [2]

    SMAGDi: Socratic multi agent interaction graph distillation for efficient high accuracy reasoning

    Aayush Aluru et al. SMAGDi: Socratic multi agent interaction graph distillation for efficient high accuracy reasoning. arXiv preprint arXiv:2511.05528, 2025

  3. [3]

    Program Synthesis with Large Language Models

    Jacob Austin et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  4. [4]

    $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

    Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025

  5. [5]

    MAGDi: Structured distillation of multi-agent interaction graphs improves reasoning in smaller language models

    Justin Chen et al. MAGDi: Structured distillation of multi-agent interaction graphs improves reasoning in smaller language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  6. [6]

    ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs

    Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  7. [7]

    AMTSS: An adaptive multi-teacher single-student knowledge distillation framework for multilingual language inference

    Qianglong Chen, Feng Ji, Feng-Lin Li, Guohai Xu, Ming Yan, Ji Zhang, and Yin Zhang. AMTSS: An adaptive multi-teacher single-student knowledge distillation framework for multilingual language inference. arXiv preprint arXiv:2305.07928, 2023

  8. [8]

    Improving retrieval-augmented generation through multi-agent reinforcement learning

    Yiqun Chen et al. Improving retrieval-augmented generation through multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2025

  9. [9]

    Debate or vote: Which yields better decisions in multi-agent large language models?

    Hyeong Kyu Choi et al. Debate or vote: Which yields better decisions in multi-agent large language models? arXiv preprint arXiv:2508.17536, 2025

  10. [10]

    DeepSeek-V4 technical report

    DeepSeek-AI. DeepSeek-V4 technical report. Technical report, DeepSeek, 2026. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

  11. [11]

    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Yilun Du et al. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023

  12. [12]

    Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

    Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562, 2026

  13. [13]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu et al. MiniLLM: Knowledge distillation of large language models. In International Conference on Learning Representations, 2024

  14. [14]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025

  15. [15]

    VitaBench: Benchmarking LLM agents with versatile interactive tasks in real-world applications

    Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, et al. VitaBench: Benchmarking LLM agents with versatile interactive tasks in real-world applications. arXiv preprint arXiv:2509.26490, 2025

  16. [16]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  17. [17]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

  18. [18]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain et al. LiveCodeBench: Holistic and contamination-free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  19. [19]

    Stable On-Policy Distillation through Adaptive Target Reformulation

    Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation. arXiv preprint arXiv:2601.07155, 2026

  20. [20]

    Entropy-aware on-policy distillation of language models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026

  21. [21]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, 2016

  22. [22]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026

  23. [23]

    Encouraging divergent thinking in large language models through multi-agent debate

    Tian Liang et al. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  24. [24]

    Divergence measures based on the Shannon entropy

    Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991

  25. [25]

    Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems, 2023

  26. [26]

    ToolACE: Winning the points of LLM function calling

    Weiwen Liu et al. ToolACE: Winning the points of LLM function calling. arXiv preprint arXiv:2409.00920, 2024

  27. [27]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  28. [28]

    On-policy distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation/

  29. [29]

    Chain-of-Agents: End-to-end agent foundation models via multi-agent distillation and agentic RL

    OPPO AI Agent Team. Chain-of-Agents: End-to-end agent foundation models via multi-agent distillation and agentic RL. arXiv preprint arXiv:2508.13167, 2025

  30. [30]

    Privileged information distillation for language models

    Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models. arXiv preprint arXiv:2602.04942, 2026

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  32. [32]

    Merge-of-thought distillation: Distilling multi-agent reasoning into a single language model

    Zhanming Shen et al. Merge-of-thought distillation: Distilling multi-agent reasoning into a single language model. arXiv preprint arXiv:2509.08814, 2025

  33. [33]

    A Survey of On-Policy Distillation for Large Language Models

    Mingyang Song et al. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026

  34. [34]

    GKD: A general knowledge distillation framework for large-scale pre-trained language model

    Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Shu Zhao, Peng Zhang, and Jie Tang. GKD: A general knowledge distillation framework for large-scale pre-trained language model. arXiv preprint arXiv:2306.06629, 2023

  35. [35]

    The Rise and Potential of Large Language Model Based Agents: A Survey

    Zhiheng Xi et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023

  36. [36]

    Berkeley function calling leaderboard

    Fanjia Yan et al. Berkeley function calling leaderboard. arXiv preprint arXiv:2402.15671, 2024

  37. [37]

    Qwen3 Technical Report

    An Yang et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  38. [38]

    Self-Distilled RLVR

    Chenxu Yang et al. Self-distilled RLVR. arXiv preprint arXiv:2604.03128, 2026

  39. [39]

    Learning beyond teacher: Generalized on-policy distillation with reward extrapolation

    Wenkai Yang et al. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125, 2026

  40. [40]

    On-Policy Context Distillation for Language Models

    Tianzhu Ye et al. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026

  41. [41]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

  42. [42]

    Debate, reflect, and distill: Multi-agent feedback with tree-structured preference optimization for efficient language model enhancement

    Xiaofeng Zhou, Heyan Huang, and Lizi Liao. Debate, reflect, and distill: Multi-agent feedback with tree-structured preference optimization for efficient language model enhancement. arXiv preprint arXiv:2506.03541, 2025
