"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework
Pith reviewed 2026-05-21 15:10 UTC · model grok-4.3
The pith
COMPACT weights Chain-of-Thought gradients from multiple teachers by measuring real-time student compatibility to improve small models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
COMPACT adaptively fuses supervisions from different teachers by dynamically weighting their gradients according to a student's real-time compatibility, measured along three axes: graph-based consensus that identifies mainstream reasoning paths, mutual-information adaptability that detects moments of genuine understanding, and loss-based difficulty that gauges receptivity and blocks negative transfer. This integration lets the student acquire diverse reasoning capabilities while preserving its original knowledge structure, yielding state-of-the-art benchmark results and reduced catastrophic forgetting compared with single-teacher baselines.
What carries the argument
The COMPACT framework's dynamic gradient weighting, driven by the three-dimensional compatibility metric (graph consensus, mutual-information adaptability, and loss-based difficulty) evaluated during each training step.
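As a rough illustration of the weighting step described above, the sketch below turns three per-teacher compatibility scores into normalized weights and fuses the per-teacher losses. The function names, the additive combination of the three scores, the softmax normalization, and the `temperature` parameter are all assumptions made for illustration; the source does not state how COMPACT combines its metrics. The third score is treated here as a receptivity value where higher means more compatible.

```python
import math

def compatibility_weights(scores, temperature=1.0):
    """Map per-teacher (consensus, adaptability, receptivity) scores to weights.

    Assumption: the three scores are simply summed and passed through a
    softmax. This is a hypothetical combination rule, not the paper's.
    """
    combined = [sum(s) / temperature for s in scores]
    peak = max(combined)  # subtract the max for numerical stability
    exps = [math.exp(c - peak) for c in combined]
    total = sum(exps)
    return [e / total for e in exps]

def fused_loss(teacher_losses, scores, temperature=1.0):
    """Weighted sum of per-teacher losses.

    If the weights are held constant during backpropagation, weighting the
    losses is equivalent to weighting the per-teacher gradients.
    """
    weights = compatibility_weights(scores, temperature)
    return sum(w * loss for w, loss in zip(weights, teacher_losses))

# Hypothetical scores for two teachers on one training step.
scores = [(0.9, 0.8, 0.7), (0.2, 0.1, 0.3)]
weights = compatibility_weights(scores)
```

In this sketch the first teacher, which scores higher on all three axes, receives the larger weight, so its rationale dominates the update for that step.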
If this is right
- Student models reach higher accuracy on reasoning benchmarks than those trained from any one teacher alone.
- The student's pre-existing capabilities remain intact after the multi-teacher process completes.
- Incompatible or misleading rationales from individual teachers exert less influence through the weighting step.
- Latent-space inspections show reasoning features added without disrupting the original knowledge geometry.
Where Pith is reading between the lines
- The same compatibility-checking idea could be applied to distill factual knowledge or instruction-following skills rather than only CoT traces.
- Real-time compatibility monitoring may prove useful in other multi-source training settings where source models have mismatched strengths.
- Testing the framework on models an order of magnitude smaller than current students would reveal how far the compatibility signals scale.
Load-bearing premise
The three compatibility metrics reliably detect genuine student understanding and prevent negative transfer when used to weight teacher gradients.
What would settle it
If a COMPACT-trained student model exhibits higher hallucination rates or larger drops on tasks it knew before distillation than a single-teacher control on the same benchmark suite, the claim that the metrics successfully avoid negative transfer would be contradicted.
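The falsification condition above can be written as a small check. This is a minimal sketch: the dictionary layout, the function names, and the numbers in the usage example are hypothetical and do not come from the paper.

```python
def mean_drop(before, after):
    """Average accuracy drop over the tasks the student knew before distillation."""
    return sum(before[task] - after[task] for task in before) / len(before)

def claim_contradicted(compact, control):
    """Return True if the COMPACT student looks worse than the control.

    Each argument is a dict with keys:
      'hallucination_rate': float in [0, 1]
      'before', 'after': dicts mapping task name -> accuracy
    Both students must be evaluated on the same benchmark suite.
    """
    more_hallucination = compact["hallucination_rate"] > control["hallucination_rate"]
    more_forgetting = (
        mean_drop(compact["before"], compact["after"])
        > mean_drop(control["before"], control["after"])
    )
    return more_hallucination or more_forgetting

# Hypothetical numbers, for illustration only.
prior = {"task_a": 0.80, "task_b": 0.60}
compact = {"hallucination_rate": 0.10, "before": prior,
           "after": {"task_a": 0.78, "task_b": 0.60}}
control = {"hallucination_rate": 0.12, "before": prior,
           "after": {"task_a": 0.70, "task_b": 0.55}}
```

A real comparison would also need repeated runs and a significance test before treating a small gap as a contradiction.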
Original abstract
Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student's potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervisions remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervisions from different teachers by dynamically weighting teacher gradients based on the student's real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect "epiphany moments" for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.
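To make the "Graph-based Consensus" idea more concrete, here is a minimal sketch of one way a mainstream reasoning path could be identified. It assumes a graph whose nodes are teacher rationales and whose edge weights are Jaccard token overlaps, and it scores each rationale by its average edge weight. The similarity measure and the scoring rule are assumptions chosen for simplicity; the abstract does not specify how COMPACT builds its graph.

```python
def jaccard(a, b):
    """Token-set overlap between two rationales (a stand-in similarity)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def consensus_scores(rationales):
    """Average similarity of each rationale to all the others.

    A low score marks a rationale that departs from the mainstream path.
    """
    n = len(rationales)
    if n < 2:
        return [1.0] * n
    scores = []
    for i in range(n):
        total = sum(jaccard(rationales[i], rationales[j])
                    for j in range(n) if j != i)
        scores.append(total / (n - 1))
    return scores

# Hypothetical rationales from three teachers for "What is 3 + 4?".
rationales = [
    "add 3 and 4 to get 7",
    "add 3 and 4 so the answer is 7",
    "multiply 3 by 4 to get 12",
]
scores = consensus_scores(rationales)
```

Here the third rationale receives the lowest score, so a consensus-based weight would downweight it. Token overlap is a crude proxy; an embedding-based similarity would be a natural replacement.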
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces COMPACT, a compatibility-aware multi-teacher CoT distillation framework for transferring reasoning capabilities from multiple LLMs to smaller student models. It proposes dynamically weighting teacher gradients according to the student's real-time compatibility, evaluated via three metrics: (1) Graph-based Consensus to identify mainstream reasoning paths, (2) Mutual-Information-based Adaptability to detect genuine understanding versus imitation, and (3) Loss-based Difficulty to assess receptivity and avoid negative transfer. The central claim is that this adaptive fusion integrates diverse teacher capabilities without damaging the student's original knowledge structure, achieves state-of-the-art results on benchmarks, and mitigates catastrophic forgetting, as evidenced by extensive experiments and latent-space analysis.
Significance. If the empirical claims hold after proper validation, the work would offer a meaningful advance in multi-teacher distillation by providing a principled mechanism to handle teacher-student incompatibility and negative transfer. The multi-dimensional metric approach could improve upon passive or single-teacher baselines in preserving reasoning fidelity while scaling to compact models. However, the current lack of quantitative support and independent metric validation substantially weakens the assessed significance.
major comments (3)
- Abstract: The abstract asserts that 'extensive experiments and latent space analysis demonstrate' SOTA performance and mitigation of catastrophic forgetting, yet supplies no quantitative results, baselines, error bars, ablation studies, or implementation details for the three metrics. This leaves the central empirical claim without verifiable support and makes it impossible to assess whether the metrics actually correlate with genuine reasoning internalization rather than surface statistics.
- Abstract (method description): The dynamic weighting of teacher gradients is defined directly in terms of the student's real-time performance on the same three compatibility metrics. This creates a circular dependency in which the supervision signal is derived from quantities computed from the student's current state during training, risking unstable dynamics, self-reinforcing biases, or reduction to a heuristic that still permits negative transfer.
- Abstract: The paper assumes without separate diagnostic validation that graph consensus, mutual-information adaptability, and loss-based difficulty reliably detect compatible rationales and block negative transfer. No experiments are described that test whether these metrics track ground-truth reasoning fidelity (e.g., via controlled probes correlating metric scores with actual task internalization or hallucination rates).
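The feedback loop raised in the second comment can be illustrated with a toy simulation. The sketch below is not a model of COMPACT: it assumes a single scalar student parameter, two teachers with conflicting targets, squared-error losses, and weights given by a softmax over negative losses. All names and constants are hypothetical.

```python
import math

def simulate(dynamic, steps=200, lr=0.1, tau=0.1, theta=0.4, targets=(0.0, 1.0)):
    """Gradient descent on a weighted sum of per-teacher squared errors.

    dynamic=True : weights are recomputed from the student's current losses
                   at every step (and treated as constants in the gradient).
    dynamic=False: weights stay uniform throughout training.
    """
    for _ in range(steps):
        losses = [(theta - t) ** 2 for t in targets]
        if dynamic:
            exps = [math.exp(-loss / tau) for loss in losses]
            total = sum(exps)
            weights = [e / total for e in exps]
        else:
            weights = [1.0 / len(targets)] * len(targets)
        grads = [2.0 * (theta - t) for t in targets]
        theta -= lr * sum(w * g for w, g in zip(weights, grads))
    return theta

theta_dynamic = simulate(dynamic=True)   # collapses toward the nearer teacher
theta_static = simulate(dynamic=False)   # settles midway between the teachers
```

In this toy setting the loss-dependent weights are self-reinforcing: the student starts slightly closer to the first teacher and ends up following it almost exclusively. Whether such a collapse helps or hurts depends on which teacher is right, which is the kind of evidence the referee asks for.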
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. We believe these clarifications and proposed revisions will strengthen the paper.
- Referee: Abstract: The abstract asserts that 'extensive experiments and latent space analysis demonstrate' SOTA performance and mitigation of catastrophic forgetting, yet supplies no quantitative results, baselines, error bars, ablation studies, or implementation details for the three metrics. This leaves the central empirical claim without verifiable support and makes it impossible to assess whether the metrics actually correlate with genuine reasoning internalization rather than surface statistics.
Authors: We agree that including specific quantitative results in the abstract would enhance verifiability. In the revised manuscript, we will update the abstract to include key performance metrics, such as the average improvement over baselines on reasoning benchmarks and specific forgetting reduction percentages from our experiments. We will also reference the main sections where ablation studies and implementation details for the metrics are provided. This addresses the concern about verifiable support while maintaining the abstract's conciseness.
revision: yes
- Referee: Abstract (method description): The dynamic weighting of teacher gradients is defined directly in terms of the student's real-time performance on the same three compatibility metrics. This creates a circular dependency in which the supervision signal is derived from quantities computed from the student's current state during training, risking unstable dynamics, self-reinforcing biases, or reduction to a heuristic that still permits negative transfer.
Authors: The adaptive nature of the weighting is intentional to allow real-time adjustment based on the student's evolving state, which we argue mitigates rather than exacerbates negative transfer. To demonstrate stability, our training curves show consistent convergence without oscillations. We will add a dedicated paragraph in the method section discussing potential circularity concerns, including theoretical justification and empirical evidence from ablations comparing dynamic vs. static weighting. This will clarify that the approach does not reduce to a simple heuristic.
revision: partial
- Referee: Abstract: The paper assumes without separate diagnostic validation that graph consensus, mutual-information adaptability, and loss-based difficulty reliably detect compatible rationales and block negative transfer. No experiments are described that test whether these metrics track ground-truth reasoning fidelity (e.g., via controlled probes correlating metric scores with actual task internalization or hallucination rates).
Authors: We recognize the value of explicit diagnostic validation. While our latent space analysis and main results provide supporting evidence through improved performance and reduced forgetting, we will incorporate additional experiments in the revised version. Specifically, we will include controlled probes where we correlate the metric scores with hallucination rates on held-out tasks and measure internalization via probing classifiers on student representations. These will be added to Section 4 or a new subsection.
revision: yes
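The controlled probe mentioned in the last response, correlating metric scores with hallucination rates, could start from a rank correlation. The sketch below implements Spearman's coefficient in plain Python. The choice of Spearman, the function names, and the sample values are assumptions for illustration; the rebuttal does not say which statistic the authors would use.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical per-example values: compatibility score vs. hallucination rate.
metric_scores = [0.1, 0.4, 0.9, 0.7]
hallucination = [0.8, 0.5, 0.1, 0.2]
rho = spearman(metric_scores, hallucination)
```

A strongly negative coefficient on held-out tasks would support the claim that high compatibility scores track low hallucination; a value near zero would suggest the metric mostly reflects surface statistics.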
Circularity Check
No significant circularity detected in COMPACT derivation
The COMPACT framework defines a dynamic weighting scheme for multi-teacher gradients using three explicitly constructed compatibility metrics (graph consensus, mutual-information adaptability, and loss-based difficulty) applied to the student's real-time state. These are design choices for adaptive fusion rather than derivations that reduce the final performance claims to the inputs by construction. The paper presents the approach as a solution to teacher-student incompatibility and supports its effectiveness through extensive experiments on benchmarks plus latent space analysis, without any load-bearing self-citation chains, fitted parameters renamed as predictions, or ansatzes smuggled via prior work. The central claims remain empirically grounded rather than tautological.
Axiom & Free-Parameter Ledger
free parameters (1)
- dynamic weighting coefficients for the three metrics
axioms (1)
- domain assumption: Individual LLMs exhibit distinct capability biases and may suffer from catastrophic forgetting when used as solitary teachers.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: dynamically weighting teacher gradients based on the student's real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus ... (2) Mutual-Information-based Adaptability ... (3) Loss-based Difficulty
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: detect 'epiphany moments' ... genuine logic comprehension from rote memorization
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Hongzhan Chen, Siyue Wu, Xiaojun Quan, Rui Wang, Ming Yan, and Ji Zhang. Mcc-kd: Multi-cot consistent knowledge distillation. arXiv preprint arXiv:2310.14747.
- [2] Xinghao Chen, Zhijing Sun, Wenjin Guo, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, and Xiaoyu Shen. Unveiling the key factors for distilling chain-of-thought reasoning. ArXiv, abs/2502.18001.
- [3] Arnav Gudibande, Eric Wallace, Charles Burton Snell, Xinyang Geng, Hao Liu, P. Abbeel, Sergey Levine, and Dawn Song. The false promise of imitating proprietary LLMs. ArXiv, abs/2305.15717.
- [4] Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017.
- [5] Zhuochun Li, Yuelyu Ji, Rui Meng, and Daqing He. Learning from committee: Reasoning distillation from a mixture of teachers with peer-review. ArXiv, abs/2410.03663.
- [6] Sohan Patnaik, Milan Aggarwal, Sumita Bhatia, and Balaji Krishnamurthy. Learning together to perform better: Teaching small-scale llms to collaborate via preferential rationale tuning. ArXiv, abs/2506.02519.
- [7] Chen Qian, Dongrui Liu, Hao Wen, Zhen Bai, Yong Liu, and Jing Shao. Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning. ArXiv, abs/2506.02867.
- [8] Peter Rein, Fabian Balsiger, et al. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.
- [9] Zhanming Shen, Zeyu Qin, Zenan Huang, Haoxing Chen, Jiaqi Hu, Yihong Zhuang, Guoshan Lu, Gang Chen, and Junbo Zhao. Merge-of-thought distillation. ArXiv, abs/2509.08814.
- [10] Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, and Xiang Ren. Scott: Self-consistent chain-of-thought distillation. arXiv preprint arXiv:2305.01879.
- [11] Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. ArXiv, abs/2406.04692.
- [12] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
- [13] Xiaojun Wu, Xiaoguang Jiang, Huiyang Li, Jucai Zhai, Dengfeng Liu, Qiaobo Hao, Huang Liu, Zhiguo Yang, Ji Xie, Ninglun Gu, Jin Yang, Kailai Zhang, Yelun Bao, and Jun Wang. Beyond scaling law: A data-efficient distillation framework for reasoning. ArXiv, abs/2508.09883.
- [14] Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, and Minxin Du. Unlearning isn’t deletion: Investigating reversibility of machine unlearning in llms. ArXiv, abs/2505.16831.
- [15] Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. Flask: Fine-grained language model evaluation based on alignment skill sets. ArXiv, abs/2307.10928.
- [16] Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. Extend model merging from fine-tuned to pre-trained large language models via weight disentanglement. ArXiv, abs/2408.03092.
- [17] A Appendix
A.1 Details of Tasks and Datasets
We select MATH500, GSM8K, SVAMP, CommonsenseQA, StrategyQA, and GPQA-Diamond to systematically evaluate model performance across two dimensions: mathematical reasoning and commonsense/knowledge reasoning. The basic statistics of these benchmarks are presented in Tables 2–5.
MATH500. MATH500 comprises 500 pr...
- [18] Table 2: Subject-area distribution of MATH500.
Subject Area              Size    Proportion
Algebra                   124     24.8%
Counting & Probability    38      7.6%
Geometry                  41      8.2%
Intermediate Algebra      97      19.4%
Number Theory             62      12.4%
Prealgebra                82      16.4%
Precalculus               56      11.2%
Total                     500     100.0%

Difficulty Level          Size    Proportion
Level 1                   43      8.6%
Level 2                   90      18.0%
Level 3                   105     21.0%
Level 4                   128     25.6%
Le...
- [19] Table 4: Operation-type distribution of the SVAMP test set.
Operation-type     Size    Proportion
Subtraction        160     53.33%
Addition           59      19.67%
Common-Division    48      16.00%
Multiplication     33      11.00%
Total              300     100.00%

CommonsenseQA. CommonsenseQA [Talmor et al., 2019] is a multiple-choice question answering benchmark containing 12,247 examples to test commonsense knowledg...
- [20] B Further Analysis
Task-Dependent Preference in Mathematical Reasoning. The training dynamics across mathematical benchmarks elu-

Parameter                  Value
General Settings
Optimizer                  AdamW
LoRA Target Modules        All linear layers (Attn. & FFN)
Batch Size (per Device)    4
Gradient Accumulation      4
Hardware                   2×NVIDIA A100 (80GB)
SLMs Optimization
Learning Rate (1.5B)       5×10^-5 ...
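The general settings listed above imply an effective batch size per optimizer step. The short calculation below assumes that both A100 GPUs act as data-parallel workers, which the excerpt does not confirm; the function name is hypothetical.

```python
def effective_batch_size(per_device, grad_accum_steps, num_devices):
    """Number of samples contributing to one optimizer step."""
    return per_device * grad_accum_steps * num_devices

# Values from the settings table; num_devices=2 assumes data parallelism.
print(effective_batch_size(per_device=4, grad_accum_steps=4, num_devices=2))  # 32
```

If the two GPUs were instead used to shard a single model replica, the effective batch size would be 16.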