Recognition: 3 theorem links
· Lean TheoremFrom Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?
Pith reviewed 2026-05-13 21:46 UTC · model grok-4.3
The pith
Skill distillation from multi-agent to single-agent succeeds or fails based on the evaluation metric's freedom, not the task itself.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that skill utility is governed not by the task but by the evaluation metric. The authors introduce Metric Freedom (F), quantified via Mantel test on diversity-score coupling, which strongly predicts distillation outcomes (r = -0.85). They show that identical agent trajectories produce opposite skill lifts under rigid versus free metrics, proving the property is metric-level. Based on this, they develop AdaSkill, a two-stage framework that extracts selectively on free metrics and refines iteratively to maximize headroom while matching or exceeding MAS performance with up to 8x cost reduction.
What carries the argument
Metric Freedom (F), which measures the topological rigidity of a metric's scoring landscape by quantifying the coupling between output diversity and score variance through a Mantel test; it serves as an a priori predictor that determines whether distillation preserves or harms performance.
Load-bearing premise
The topological rigidity captured by the Mantel test on diversity-score coupling is the primary causal driver of distillation success and generalizes beyond the 6 metrics and 11 datasets tested.
What would settle it
Finding a new collection of metrics or tasks where the correlation between Metric Freedom and observed skill utility falls below statistical significance or where identical trajectories no longer produce opposite skill lifts under rigid versus free metrics.
Figures
read the original abstract
Multi-agent systems (MAS) tackle complex tasks by distributing expertise, though this often comes at the cost of heavy coordination overhead, context fragmentation, and brittle phase ordering. Distilling a MAS into a single-agent skill can bypass these costs, but this conversion lacks a principled answer for when and what to distill. Instead, the empirical outcome is surprisingly inconsistent: skill lift ranges from a 28% improvement to a 2% degradation across metrics of the exact same task. In this work, we reveal that skill utility is governed not by the task, but by the evaluation metric. We introduce Metric Freedom (F), the first a priori predictor of skill utility. F measures the topological rigidity of a metric's scoring landscape by quantifying how output diversity couples with score variance via a Mantel test. Guided by F, we propose AdaSkill, a two-stage adaptive distillation framework. Stage 1 acts as a selective extraction mechanism, extracting tools and knowledge while discarding restrictive structures on "free" metrics to preserve exploration. Stage 2 applies iterative refinement selectively on free metrics, exploiting their forgiving scoring landscape to safely maximize remaining headroom. Evaluating across 4 tasks, 11 datasets, and 6 metrics, F strongly predicts skill utility (r=-0.85, p<0.0001). Strikingly, identical agent trajectories yield diametrically opposite skill lifts under rigid versus free metrics, demonstrating that skill utility is fundamentally a metric-level property. Driven by this signal, AdaSkill matches or exceeds the original MAS while reducing cost up to 8x and latency by up to 15x.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that distilling multi-agent systems into single-agent skills yields inconsistent results (28% lift to 2% degradation) that are governed by the evaluation metric rather than the task. It introduces Metric Freedom (F), computed via a Mantel test on the coupling between output diversity and score variance, as the first a priori predictor of skill utility (reported r=-0.85, p<0.0001). The authors further claim that identical trajectories produce opposite skill lifts under rigid vs. free metrics, and propose the two-stage AdaSkill framework that selectively extracts and refines to match or exceed MAS performance at up to 8x lower cost.
Significance. If the correlation is robust and F can be operationalized without full post-sampling, the work would be significant for MAS research by supplying a concrete, metric-level criterion for deciding when distillation is beneficial. The demonstration that utility is a property of the scoring landscape rather than the underlying trajectories is a useful reframing, and the reproducible correlation across 4 tasks/11 datasets/6 metrics plus the cost-reduction results of AdaSkill could influence practical deployment of agent systems.
major comments (2)
- [Abstract and §3] Abstract and §3 (Metric Freedom definition): The claim that F is an 'a priori predictor' that can be obtained 'before deciding whether or how to distill' is contradicted by the computation itself. F requires generating multiple outputs to compute pairwise distances (diversity) and score variance before applying the Mantel test; this sampling step cannot be performed without model inference, so F is necessarily post-sampling and cannot serve as a pre-distillation selector without already running the trajectories whose utility it is meant to predict.
- [Results] Results section (correlation reporting): The central r=-0.85 correlation is presented without error bars, sensitivity analysis to sampling seed or number of samples, or controls for confounders such as task length, dataset size, or metric scale. This weakens the load-bearing claim that F 'strongly predicts' utility and that the result generalizes beyond the 6 metrics tested.
minor comments (2)
- [§3] Notation for F and the Mantel statistic should be defined with an explicit equation (currently only described in prose) so that readers can reproduce the exact coupling measure.
- [Figure 4 or Table 2] The abstract states 'identical agent trajectories yield diametrically opposite skill lifts'; the corresponding figure or table should report the exact trajectories and metric pairs used for this demonstration.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments help clarify the practical scope of Metric Freedom and strengthen the statistical presentation of our results. We address each major comment below and indicate the corresponding revisions.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Metric Freedom definition): The claim that F is an 'a priori predictor' that can be obtained 'before deciding whether or how to distill' is contradicted by the computation itself. F requires generating multiple outputs to compute pairwise distances (diversity) and score variance before applying the Mantel test; this sampling step cannot be performed without model inference, so F is necessarily post-sampling and cannot serve as a pre-distillation selector without already running the trajectories whose utility it is meant to predict.
Authors: We agree that the original wording overstated the pre-inference nature of F. Computing F requires a small pilot sample (typically 10–20 trajectories per task), which necessarily involves model inference. However, this cost is substantially lower than full multi-agent execution or complete distillation. We will revise the abstract and §3 to describe F as a low-cost, post-pilot predictor that can be obtained before committing to full-scale distillation, rather than claiming it is strictly a priori. This adjustment preserves the practical utility while accurately reflecting the computation. revision: partial
-
Referee: [Results] Results section (correlation reporting): The central r=-0.85 correlation is presented without error bars, sensitivity analysis to sampling seed or number of samples, or controls for confounders such as task length, dataset size, or metric scale. This weakens the load-bearing claim that F 'strongly predicts' utility and that the result generalizes beyond the 6 metrics tested.
Authors: We accept this critique and will strengthen the statistical reporting. The revised Results section will include bootstrap-derived 95% confidence intervals on the reported correlation, sensitivity analyses across sample sizes (5–50 trajectories) and random seeds, and explicit controls for task length, dataset size, and metric scale. These additional checks confirm that the correlation remains stable (r ≈ −0.82 to −0.87) under the tested variations. revision: yes
Circularity Check
No significant circularity detected; F is independently computed and validated via correlation on held-out structure
full rationale
The paper defines Metric Freedom (F) via Mantel test on pairwise output diversity versus score variance matrices obtained from sampled trajectories. It then reports a cross-task correlation r=-0.85 between these F values and observed skill-utility lifts. No equation or procedure shows F being regressed, optimized, or algebraically reduced against the utility numbers themselves; the correlation is presented as an empirical validation rather than a fitted predictor. The sampling step needed to obtain diversity and variance is a computational prerequisite but does not make the reported relationship tautological or self-definitional. No self-citation load-bearing step, uniqueness theorem, or ansatz smuggling appears in the provided derivation chain. The central claim therefore remains non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Mantel test assumptions hold for the scoring landscapes of the evaluated metrics
invented entities (1)
-
Metric Freedom (F)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel (J uniqueness from functional equation) echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
F measures the topological rigidity of a metric’s scoring landscape by quantifying how output diversity couples with score variance via a Mantel test. ... FX = 1−rM(X).
-
IndisputableMonolith/Foundation/ (forcing chain)reality_from_one_distinction matches?
matchesMATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.
Lift(π)≤L0(1−F+Δn)·W̃1(Pπ,P0)
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
identical agent trajectories yield diametrically opposite skill lifts under rigid versus free metrics
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URLhttps://openreview.net/forum?id=VtmBAGCN7o. Guido W. Imbens and Donald B. Rubin.Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, USA, 2015. ISBN 0521885884. Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language mod...
work page 2015
-
[2]
When single-agent with skills replace multi-agent systems and when they fail,
URLhttps://openreview.net/forum?id=XmProj9cPs. Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls.Advances in Neural Information Processing Systems, 36:42330–42357, 2023. Xiaoxiao Li....
-
[3]
Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
URLhttps://arxiv.org/abs/2603.25158. Kun Ouyang, Haoyu Wang, and Dong Fang. Fela: A multi-agent evolutionary system for feature engineering of industrial event log data.arXiv preprint arXiv:2510.25223, 2025. Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, et al. Scaling large language mod...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.24432/c5859h 2025
-
[4]
Every approach MUST be methodologically sound and appropriate for the problem -- do not invent approaches just to fill the list
-
[5]
Approaches may be similar to each other; fine-grained differences are acceptable (e.g., same method with different covariate sets, different hyperparameters, or slightly different model specifications)
-
[6]
If the problem only supports a small number of truly reasonable strategies, generate variations within those strategies rather than forcing unrelated ones
-
[7]
Be specific about WHAT to do -- name the method, key steps, and any important implementation choices. Focus on the high-level methodology and strategy, not specific values, feature names, or implementation details (those are for the executor to decide)
-
[8]
Each approach should be self-contained and actionable Output format (use exactly this format): ## Approach 1: [Brief Name] [2-4 sentences describing the core idea and key steps] ## Approach 2: [Brief Name] ... --- ## Original Problem {problem} --- **IMPORTANT -- YOUR TASK RIGHT NOW:** Do NOT solve the problem above. Do NOT output JSON, SQL, a matrix, or a...
work page 1972
-
[9]
Run check_overlap (verify common support)
-
[10]
PSM estimates ATT only -- using PSM as ATE gives ~40% MRE when treated/control populations differ
Call estimate_aipw for final ATE <-- NEW AIPW = doubly-robust augmented IPW; estimates population ATE. PSM estimates ATT only -- using PSM as ATE gives ~40% MRE when treated/control populations differ. Do NOT report estimate_psm output as the final ate value. Theestimate_aipw()function was added toestimators.pywith bootstrap SE. Summary.Table 10 compares ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.