Recognition: unknown
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate
Pith reviewed 2026-05-09 15:14 UTC · model grok-4.3
The pith
Multi-agent debate among teachers supplies higher-quality token supervision than any single teacher for on-policy distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MAD-OPD replaces the single-teacher supervisor in on-policy distillation with a deliberative group whose debate yields an emergent collective intelligence; each teacher's contribution is weighted by its post-debate confidence, and the resulting token-level targets are used to train the student on its own trajectories. The paper also defines On-Policy Agentic Distillation with step-level sampling to counteract compounding errors and derives a task-adaptive divergence rule that favors Jensen-Shannon divergence for agentic work and reverse KL for code generation.
What carries the argument
The multi-agent debate process that turns separate teacher outputs into a single weighted supervision signal by letting the teachers discuss the student's on-policy state and score their own contributions.
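A minimal sketch of what such a weighted signal could look like, assuming the post-debate confidences are normalized into mixture weights over the teachers' token distributions; the abstract does not give the exact aggregation rule, so the function names, tensor shapes, and softmax normalization below are illustrative assumptions rather than the authors' method.

```python
import torch
import torch.nn.functional as F

def collective_target(teacher_logits: list[torch.Tensor],
                      confidences: torch.Tensor) -> torch.Tensor:
    """Blend per-teacher next-token distributions by post-debate confidence.

    teacher_logits: one [seq_len, vocab] tensor per teacher, scored on the
        student's own (on-policy) trajectory.
    confidences: [num_teachers] raw post-debate confidence scores.
    """
    weights = torch.softmax(confidences, dim=0)                           # normalize onto a simplex
    probs = torch.stack([F.softmax(l, dim=-1) for l in teacher_logits])   # [T, S, V]
    return torch.einsum("t,tsv->sv", weights, probs)                      # confidence-weighted mixture

def distill_loss(student_logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy of the student against the collective target."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    return -(target * log_p_student).sum(dim=-1).mean()
```

The student would then minimize a token-level divergence between its own distribution and this collective target on trajectories it generated itself.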
If this is right
- Students can exceed the performance ceiling set by any individual teacher on both agentic and code tasks.
- Step-level sampling prevents error accumulation from destabilizing training in long-horizon agentic settings.
- Choosing Jensen-Shannon divergence for agentic tasks and reverse KL for code generation improves stability and final scores (a minimal sketch of both objectives follows this list).
- The ranking advantage holds across all tested model-size combinations from 1.7B students to 32B teachers.
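To make the divergence choice concrete, here is a minimal sketch of the two objectives, assuming the student is matched token by token against a (possibly collective) teacher distribution; the paper's actual estimators, temperatures, and any clipping are not reproduced here.

```python
import torch
import torch.nn.functional as F

EPS = 1e-8  # numerical floor for logs; an implementation detail, not from the paper

def reverse_kl(student_logits: torch.Tensor, teacher_probs: torch.Tensor) -> torch.Tensor:
    """KL(student || teacher): mode-seeking, the behavior argued for in code generation."""
    p_s = F.softmax(student_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    return (p_s * (log_p_s - torch.log(teacher_probs + EPS))).sum(dim=-1).mean()

def jsd(student_logits: torch.Tensor, teacher_probs: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon divergence: symmetric and bounded, the stability argument
    for long-horizon agentic trajectories."""
    p_s = F.softmax(student_logits, dim=-1)
    m = 0.5 * (p_s + teacher_probs)
    def kl(p, q):
        return (p * (torch.log(p + EPS) - torch.log(q + EPS))).sum(dim=-1)
    return (0.5 * kl(p_s, m) + 0.5 * kl(teacher_probs, m)).mean()
```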
Where Pith is reading between the lines
- The debate mechanism could be inserted into other distillation pipelines that currently rely on a single teacher.
- For very large models, collective supervision might reduce the need to train one enormous teacher first.
- The same collective-intelligence idea might extend to reinforcement learning from human feedback where multiple reward models are available.
- Dynamic selection of which teachers participate in each debate round could further reduce compute while preserving gains.
Load-bearing premise
The multi-agent debate process will consistently produce supervision that is better than the strongest single teacher without introducing new biases or training instability.
What would settle it
If the same teacher-student pairs and benchmarks show that MAD-OPD performance is equal to or lower than the best single-teacher OPD run, the claim that debate supplies superior supervision would be falsified.
read the original abstract
On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error compounding. We additionally derive a task-adaptive divergence principle, selecting JSD (Jensen-Shannon divergence) for agentic stability and reverse KL (Kullback-Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranks first across all six configurations; on the 14B+8B$\to$4B setting it lifts the agentic average by $+2.4\%$ and the code average by $+3.7\%$ over the stronger single-teacher OPD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MAD-OPD, which recasts on-policy distillation by using multi-agent debate among teachers to produce emergent collective token-level supervision for a student, weighted by post-debate confidence. It introduces OPAD with step-level sampling to stabilize agentic distillation under error compounding, and derives a task-adaptive divergence principle selecting JSD for agentic tasks and reverse KL for code generation. Across six teacher-student configurations on agentic and code benchmarks, MAD-OPD ranks first, with reported lifts of +2.4% agentic average and +3.7% code average over single-teacher OPD in the 14B+8B to 4B setting.
Significance. If the gains are shown to stem specifically from the debate mechanism rather than ensembling, the work could meaningfully advance on-policy distillation methods by addressing single-teacher ceilings and extending to agentic settings. The cross-configuration ranking and task-adaptive divergence choice offer a potentially generalizable framework, though this depends on stronger controls for the proposed mechanisms.
major comments (2)
- [Abstract / Results] Abstract and results summary: The central claim attributes the reported first-place ranking and specific lifts (+2.4% agentic, +3.7% code on 14B+8B→4B) to emergent collective intelligence from debate plus post-debate confidence weighting. However, the manuscript compares only to single-teacher OPD; no non-debate multi-teacher baseline (e.g., mean or max of teacher logits on identical on-policy trajectories) is reported. This is load-bearing for crediting the deliberative process, JSD/reverse-KL selection, and OPAD sampling, as the gains could arise from multi-teacher access alone.
- [Abstract] The task-adaptive divergence principle: The abstract states that JSD is selected for agentic stability and reverse KL for code generation, derived theoretically and verified empirically. Without the explicit derivation steps, assumptions about error compounding or distribution mismatch, or the theoretical verification (e.g., any inequality or stability analysis), it is difficult to evaluate whether the selection is principled or post-hoc.
minor comments (2)
- [Abstract] No implementation details are supplied for debate prompt construction, confidence calibration, or the exact OPAD step-sampling procedure, which hinders reproducibility.
- [Abstract] The reported averages lack error bars, statistical significance tests, or variance across runs, making it hard to assess the reliability of the +2.4% and +3.7% lifts.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address the major concerns regarding the need for additional baselines and the presentation of the theoretical derivation below, and we outline the revisions we will make.
read point-by-point responses
-
Referee: [Abstract / Results] Abstract and results summary: The central claim attributes the reported first-place ranking and specific lifts (+2.4% agentic, +3.7% code on 14B+8B→4B) to emergent collective intelligence from debate plus post-debate confidence weighting. However, the manuscript compares only to single-teacher OPD; no non-debate multi-teacher baseline (e.g., mean or max of teacher logits on identical on-policy trajectories) is reported. This is load-bearing for crediting the deliberative process, JSD/reverse-KL selection, and OPAD sampling, as the gains could arise from multi-teacher access alone.
Authors: We acknowledge the importance of this control experiment to attribute the improvements specifically to the multi-agent debate mechanism rather than merely having access to multiple teachers. The manuscript focuses on comparisons against the standard single-teacher OPD baseline to highlight the ceiling-breaking aspect. However, we agree that a non-debate multi-teacher baseline is necessary for a complete evaluation. In the revised version, we will include results from ensembling teacher logits (mean and max) on the same on-policy trajectories without the debate process. This will allow us to demonstrate the additional benefit provided by the deliberative debate and post-debate confidence weighting. revision: yes
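For concreteness, a hedged sketch of what the promised non-debate controls could look like: uniform-mean and element-wise-max ensembling of teacher distributions on the same on-policy trajectories, with no debate round and no confidence weighting. Names and shapes are illustrative, not from the manuscript.

```python
import torch
import torch.nn.functional as F

def mean_ensemble_target(teacher_logits: list[torch.Tensor]) -> torch.Tensor:
    """Uniform average of per-teacher token distributions (no debate, no weighting)."""
    probs = torch.stack([F.softmax(l, dim=-1) for l in teacher_logits])  # [T, S, V]
    return probs.mean(dim=0)

def max_ensemble_target(teacher_logits: list[torch.Tensor]) -> torch.Tensor:
    """Element-wise max over teacher distributions, renormalized per token."""
    probs = torch.stack([F.softmax(l, dim=-1) for l in teacher_logits])
    unnormalized = probs.max(dim=0).values
    return unnormalized / unnormalized.sum(dim=-1, keepdim=True)
```

Distilling against these targets with the same divergence and sampling settings would isolate how much of the reported gain is due to the debate itself.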
-
Referee: [Abstract] The task-adaptive divergence principle: The abstract states that JSD is selected for agentic stability and reverse KL for code generation, derived theoretically and verified empirically. Without the explicit derivation steps, assumptions about error compounding or distribution mismatch, or the theoretical verification (e.g., any inequality or stability analysis), it is difficult to evaluate whether the selection is principled or post-hoc.
Authors: The derivation and theoretical verification of the task-adaptive divergence principle are detailed in Section 3.3 of the full manuscript, including assumptions about error compounding in agentic settings and distribution mismatch in code generation tasks. We show through analysis that JSD offers superior stability for long-horizon agentic tasks due to its symmetry, while reverse KL better aligns with the mode-seeking behavior needed for code. Empirical verification is provided across the benchmarks. To improve clarity in the abstract, we will revise it to briefly summarize the key theoretical motivation and assumptions, with a pointer to the detailed derivation in the main text. revision: partial
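For reference, the standard definitions behind the symmetry and boundedness argument (not a reproduction of the manuscript's Section 3.3 analysis; here q_theta denotes the student and p the collective teacher target):

```latex
% Reverse KL is asymmetric and unbounded; it penalizes student mass placed where
% the teacher assigns little probability (mode-seeking).
\mathrm{KL}(q_\theta \,\|\, p) = \sum_{y} q_\theta(y)\,\log\frac{q_\theta(y)}{p(y)}

% JSD is symmetric and bounded by log 2, the usual stability rationale for
% long-horizon (agentic) supervision.
\mathrm{JSD}(q_\theta \,\|\, p) = \tfrac{1}{2}\,\mathrm{KL}\big(q_\theta \,\|\, m\big)
  + \tfrac{1}{2}\,\mathrm{KL}\big(p \,\|\, m\big),
  \qquad m = \tfrac{1}{2}(q_\theta + p),
  \qquad 0 \le \mathrm{JSD} \le \log 2 .
```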
Circularity Check
No significant circularity in derivation chain
full rationale
The paper proposes MAD-OPD by recasting the teacher as a multi-agent debate collective and introduces OPAD with step-level sampling, plus a task-adaptive divergence principle (JSD for agentic tasks, reverse KL for code) that is stated to be theoretically derived and then verified empirically. None of the quoted equations or derivation steps reduces by construction to its own inputs; the divergence choice is not presented as a fit to the reported gains, and the empirical lifts are measured against single-teacher OPD baselines rather than being tautological. The mechanism is evaluated against external benchmarks and does not rely on load-bearing self-citation or self-definitional renaming.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A deliberative collective of teachers, via debate, produces emergent collective intelligence that supplies superior token-level supervision compared with any single teacher.
Reference graph
Works this paper leans on
- [1] Rishabh Agarwal et al. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, 2024.
- [2] Aayush Aluru et al. SMAGDi: Socratic multi agent interaction graph distillation for efficient high accuracy reasoning. arXiv preprint arXiv:2511.05528, 2025.
- [3] Jacob Austin et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- [4] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-Bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025.
- [5] Justin Chen et al. MAGDi: Structured distillation of multi-agent interaction graphs improves reasoning in smaller language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
- [6] Justin Chih-Yao Chen, Swarnadeep Saha, and Mohit Bansal. ReConcile: Round-table conference improves reasoning via consensus among diverse LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
- [7] Qianglong Chen, Feng Ji, Feng-Lin Li, Guohai Xu, Ming Yan, Ji Zhang, and Yin Zhang. AMTSS: An adaptive multi-teacher single-student knowledge distillation framework for multilingual language inference. arXiv preprint arXiv:2305.07928, 2023.
- [8] Yiqun Chen et al. Improving retrieval-augmented generation through multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2025.
- [9] Hyeong Kyu Choi et al. Debate or vote: Which yields better decisions in multi-agent large language models? arXiv preprint arXiv:2508.17536, 2025.
- [10] DeepSeek-AI. DeepSeek-V4 technical report. Technical report, DeepSeek, 2026. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
- [11] Yilun Du et al. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.
- [12] Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562, 2026.
- [13] Yuxian Gu et al. MiniLLM: Knowledge distillation of large language models. In International Conference on Learning Representations, 2024.
- [14] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025.
- [15] Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, et al. VitaBench: Benchmarking LLM agents with versatile interactive tasks in real-world applications. arXiv preprint arXiv:2509.26490, 2025.
- [16] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [17] Jonas Hübotter et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026.
- [18] Naman Jain et al. LiveCodeBench: Holistic and contamination-free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- [19] Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, and Taesup Kim. Stable on-policy distillation through adaptive target reformulation. arXiv preprint arXiv:2601.07155, 2026.
- [20] Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026.
- [21] Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, 2016.
- [22] Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026.
- [23] Tian Liang et al. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
- [24] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
- [25] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems, 2023.
- [26] Weiwen Liu et al. ToolACE: Winning the points of LLM function calling. arXiv preprint arXiv:2409.00920, 2024.
- [27] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [28] Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation/
- [29] OPPO AI Agent Team. Chain-of-agents: End-to-end agent foundation models via multi-agent distillation and agentic RL. arXiv preprint arXiv:2508.13167, 2025.
- [30] Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, and Massimo Caccia. Privileged information distillation for language models. arXiv preprint arXiv:2602.04942, 2026.
- [31] Zhihong Shao et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [32] Zhanming Shen et al. Merge-of-thought distillation: Distilling multi-agent reasoning into a single language model. arXiv preprint arXiv:2509.08814, 2025.
- [33] Mingyang Song et al. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026.
- [34] Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Shu Zhao, Peng Zhang, and Jie Tang. GKD: A general knowledge distillation framework for large-scale pre-trained language model. arXiv preprint arXiv:2306.06629, 2023.
- [35] Zhiheng Xi et al. The rise and potential of large language model based agents: A survey. arXiv preprint arXiv:2309.07864, 2023.
- [36] Fanjia Yan et al. Berkeley function calling leaderboard. arXiv preprint arXiv:2402.15671, 2024.
- [37] An Yang et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [38] Chenxu Yang et al. Self-distilled RLVR. arXiv preprint arXiv:2604.03128, 2026.
- [39] Wenkai Yang et al. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125, 2026.
- [40] Tianzhu Ye et al. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026.
- [41] Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026.
- [42] Xiaofeng Zhou, Heyan Huang, and Lizi Liao. Debate, reflect, and distill: Multi-agent feedback with tree-structured preference optimization for efficient language model enhancement. arXiv preprint arXiv:2506.03541, 2025.