When Does Multi-Agent Collaboration Help? An Entropy Perspective
Pith reviewed 2026-05-16 07:34 UTC · model grok-4.3
The pith
A single agent outperforms multi-agent LLM collaboration in roughly 43 percent of cases, and the entropy dynamics that track the outcome are largely set during the first round.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multi-agent systems do not reliably improve upon single large language models; instead, their success is governed by entropy dynamics that are largely fixed during the first round of interaction. A single agent already beats the multi-agent system in approximately 43.3 percent of cases. Three regularities hold across topologies and tasks: certainty preference (stable low entropy benefits correctness while high peak entropy harms it), base entropy (lower-entropy base models causally improve MAS performance), and task awareness (entropy dynamics play different roles on different problems). These patterns enable a simple selection rule, the Entropy Judger, that improves accuracy on pass@k draws.
What carries the argument
The Entropy Judger, a selection algorithm that uses 245 token-, agent-, and round-level entropy features to choose the best solution from a multi-agent system's pass@k outputs.
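The idea of picking one pass@k draw by its entropy profile can be illustrated with a minimal sketch. This is not the paper's Entropy Judger, which draws on 245 features; it is a toy stand-in that assumes only two hand-picked features (mean and peak token entropy) and a hypothetical `peak_weight`. The `Candidate` container and all values are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass
class Candidate:
    """One pass@k draw (hypothetical container, not from the paper)."""
    answer: str
    token_entropies: Sequence[float]  # per-token entropies, must be non-empty


def uncertainty_score(c: Candidate, peak_weight: float = 0.5) -> float:
    """Lower is better: mean entropy plus a penalty on the peak entropy.

    peak_weight is an illustrative assumption, not a value from the paper.
    """
    return mean(c.token_entropies) + peak_weight * max(c.token_entropies)


def select_candidate(candidates: List[Candidate],
                     peak_weight: float = 0.5) -> Candidate:
    """Return the draw with the lowest uncertainty score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=lambda c: uncertainty_score(c, peak_weight))


# Toy draws: A is stable and low, B has one sharp spike, C is uniformly high.
draws = [
    Candidate("A", [0.2, 0.3, 0.2]),
    Candidate("B", [0.1, 2.5, 0.1]),
    Candidate("C", [0.9, 1.0, 1.1]),
]
best = select_candidate(draws)
```

The peak penalty mirrors the certainty-preference observation: draw B has the same mean as a moderately uncertain answer but is pushed down the ranking by its single high-entropy token.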
If this is right
- In roughly 43 percent of the cases studied, a single-agent baseline already outperforms the multi-agent system.
- Intervening only in the first round of interaction can steer most of the eventual performance outcome.
- Selecting among multiple agent runs with an entropy-based rule improves accuracy without retraining or altering the underlying models.
- Different tasks require different entropy tolerances, so one-size-fits-all multi-agent topologies are suboptimal.
Where Pith is reading between the lines
- Prompt engineering that keeps early entropy low could substitute for full multi-agent orchestration on many problems.
- Real-time entropy monitoring during agent exchanges could let systems decide on the fly whether to continue collaboration or switch to a single-agent answer.
- The same entropy-selection logic might apply to any multi-agent setup whose outputs can be scored for uncertainty, not just LLM-based ones.
Load-bearing premise
That 245 entropy features measured at token, agent, and round levels are enough to reveal the true drivers of multi-agent success or failure without being confounded by model-specific tokenization or prompt formatting.
What would settle it
Run the Entropy Judger on a fresh suite of models and tasks; if accuracy does not rise above the plain pass@k baseline or if first-round entropy no longer predicts final correctness, the claimed causal link fails.
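The comparison described above could be set up along the following lines. This is a sketch under stated assumptions: the "plain pass@k baseline" is read here as the expected accuracy of a uniformly random draw, the selector is a simple lowest-mean-entropy rule rather than the actual Entropy Judger, and the toy data is invented purely to exercise the code. None of it is a result from the paper.

```python
from typing import Callable, Sequence, Tuple

# Each draw is (mean_token_entropy, is_correct); both fields are hypothetical.
Draw = Tuple[float, bool]


def selector_accuracy(problems: Sequence[Sequence[Draw]],
                      select: Callable[[Sequence[Draw]], Draw]) -> float:
    """Fraction of problems on which the selected draw is correct."""
    hits = sum(1 for draws in problems if select(draws)[1])
    return hits / len(problems)


def random_draw_accuracy(problems: Sequence[Sequence[Draw]]) -> float:
    """Expected accuracy when one draw per problem is picked uniformly at random."""
    per_problem = [sum(1 for _, ok in draws if ok) / len(draws)
                   for draws in problems]
    return sum(per_problem) / len(per_problem)


def lowest_entropy(draws: Sequence[Draw]) -> Draw:
    """Stand-in selector: pick the draw with the lowest mean entropy."""
    return min(draws, key=lambda d: d[0])


# Invented toy data: three problems with two draws each.
toy = [
    [(0.2, True), (1.4, False)],
    [(0.9, False), (0.3, True)],
    [(0.5, False), (0.8, True)],
]
gain = selector_accuracy(toy, lowest_entropy) - random_draw_accuracy(toy)
```

A `gain` at or below zero on fresh models and tasks would be the failure signal the paragraph describes.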
Original abstract
Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of entropy, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies, six reasoning benchmarks, and two agentic tasks. By analyzing 245 features spanning token-, agent-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that entropy dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: peak entropy directly harms and stable entropy directly benefits MAS correctness; 2) Base Entropy: base models with lower entropy during problem-solving causally drive MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS's pass@k results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at https://github.com/AgenticFinLab/multiagent-entropy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates when multi-agent systems (MAS) outperform single agents on LLM reasoning and agentic tasks by analyzing entropy transitions at the token, agent, and round levels. Using a large-scale study across six reasoning benchmarks, two agentic tasks, multiple topologies, and 245 entropy features, it reports that single agents win in 43.3% of cases and that entropy dynamics are largely fixed in the first round. It also offers three observations: certainty preference, base entropy causally driving performance, and task awareness. Finally, it introduces the Entropy Judger, which selects from pass@k outputs for consistent accuracy gains, and releases public code.
Significance. If the empirical patterns hold, the work provides useful observational insights into MAS effectiveness via entropy, showing collaboration does not always help and highlighting first-round dominance. The scale of the evaluation (multiple benchmarks and topologies) and the practical Entropy Judger algorithm add value, while the open-source code supports reproducibility.
major comments (2)
- [Abstract and §4.3] The claim that base models with lower entropy 'causally drive' MAS performance is not supported by interventional evidence. The analysis rests on cross-model correlations on fixed tasks; no experiments fix the model while varying entropy (e.g., via temperature, top-p, or logit perturbation), and no causal identification strategy is applied to separate entropy from model capability. This limits the strength of the Base Entropy observation.
- [§4] The 245 entropy features are presented as capturing key drivers, but the manuscript would benefit from an explicit ablation or sensitivity check on potential confounding from model-specific tokenization and prompt formatting choices, as noted in the analysis of feature sufficiency.
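The kind of intervention the first comment asks for, varying entropy while holding the model fixed, can be illustrated with temperature scaling on a fixed logit vector. The sketch below is a standalone illustration and not an experiment from the paper; the logits and the temperature grid are arbitrary assumptions.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Same "model" (fixed logits), different sampling temperatures.
logits = [2.0, 1.0, 0.1, -1.0]  # arbitrary illustrative values
sweep: Dict[float, float] = {
    t: shannon_entropy(softmax(logits, t)) for t in (0.5, 1.0, 2.0)
}
```

Because the logits never change, any downstream difference across the sweep can be attributed to the entropy shift rather than to model capability, which is the separation the cross-model correlations cannot provide.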
minor comments (2)
- [Methods] Provide a concise table or appendix listing the exact definitions and extraction procedures for the 245 features at each level to aid reader understanding and replication.
- [Figures] Ensure all entropy transition plots include clear legends, axis labels, and descriptions of the topologies and tasks shown.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the recommendation of minor revision. We address the two major comments point by point below, indicating the changes we will incorporate.
- Referee: [Abstract and §4.3] The claim that base models with lower entropy 'causally drive' MAS performance is not supported by interventional evidence. The analysis rests on cross-model correlations on fixed tasks; no experiments fix the model while varying entropy (e.g., via temperature, top-p, or logit perturbation), and no causal identification strategy is applied to separate entropy from model capability. This limits the strength of the Base Entropy observation.
  Authors: We agree that the phrasing 'causally drive' is too strong given the observational, cross-model correlational nature of the analysis. No interventional experiments (temperature sweeps, logit perturbation, etc.) were performed to isolate entropy from model capability. In the revised manuscript we will replace 'causally drive' with 'are strongly associated with' in both the abstract and §4.3, add an explicit statement that the relationship is correlational, and include a limitations paragraph discussing the absence of causal identification.
  Revision: yes
- Referee: [§4] The 245 entropy features are presented as capturing key drivers, but the manuscript would benefit from an explicit ablation or sensitivity check on potential confounding from model-specific tokenization and prompt formatting choices, as noted in the analysis of feature sufficiency.
  Authors: We acknowledge that model-specific tokenization and prompt formatting could introduce confounding. Although the current feature-sufficiency analysis already examines predictive power, we will add a new sensitivity subsection in the revision that reports results under alternative tokenizers (where feasible) and standardized prompt templates, together with an ablation that removes or normalizes tokenizer-dependent components of the entropy features.
  Revision: yes
Circularity Check
No circularity: empirical observations from direct feature analysis
The paper extracts 245 entropy features at token/agent/round levels from held-out benchmark runs, reports observational statistics (e.g., single-agent superiority in 43.3% of cases, first-round dominance), and derives three descriptive patterns (Certainty Preference, Base Entropy, Task Awareness). The Entropy Judger is presented as a post-hoc rule-based selector whose selection criterion is stated independently of the final accuracy numbers. No equations reduce a claimed prediction to a fitted parameter by construction, no self-citation chain supports a uniqueness theorem or ansatz, and no renaming of known results occurs. All central claims remain falsifiable against external model outputs and benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard Shannon entropy definition, applied to token probabilities
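For reference, the standard definition this axiom points to can be written as below. The symbols are notational choices made here for illustration ($p_t$ for the model's next-token distribution at position $t$, $\mathcal{V}$ for the vocabulary) and are not taken from the paper.

```latex
H_t = -\sum_{v \in \mathcal{V}} p_t(v)\,\log p_t(v),
\qquad 0 \le H_t \le \log\lvert\mathcal{V}\rvert
```

The upper bound depends on vocabulary size, which is one reason raw entropies from models with different tokenizers may not be directly comparable.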