When Does Multi-Agent Collaboration Help? An Entropy Perspective

Ningxin Su; Sijia Chen; Yuxuan Zhao

arxiv: 2602.04234 · v5 · submitted 2026-02-04 · 💻 cs.MA

Pith reviewed 2026-05-16 07:34 UTC · model grok-4.3

keywords: multi-agent systems · entropy dynamics · large language models · reasoning tasks · single vs multi-agent · pass@k selection · uncertainty analysis · agent collaboration

The pith

A single agent outperforms multi-agent LLM collaboration in roughly 43 percent of cases, and the entropy dynamics that separate success from failure are largely set in the first round of interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks uncertainty, measured as entropy, inside multi-agent systems built from large language models while they solve reasoning tasks. It reports that a single agent outperforms the multi-agent system in roughly 43 percent of cases across six reasoning benchmarks and two agentic tasks. Most of the entropy shifts that decide success or failure occur in the opening round of agent exchanges; later rounds change the picture little. Three regularities appear: peak entropy harms final accuracy while stable entropy helps, base models with lower entropy yield better multi-agent results, and the same entropy signals matter differently depending on the task type. Using these signals, the authors build a lightweight selector that picks an answer from a multi-agent system's pass@k outputs and, by their account, raises accuracy on every tested configuration.

Core claim

Multi-agent systems do not reliably improve upon single large language models; instead, their success is governed by entropy dynamics that are largely fixed during the first round of interaction. A single agent already beats the multi-agent system in approximately 43.3 percent of cases. Three regularities hold across topologies and tasks: certainty preference (stable low entropy benefits correctness while high peak entropy harms it), base entropy (lower-entropy base models causally improve MAS performance), and task awareness (entropy dynamics play different roles on different problems). These patterns enable a simple selection rule, the Entropy Judger, that improves accuracy on pass@k draws.
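To make the "certainty preference" pattern concrete, the sketch below computes per-token Shannon entropy and summarises a generation by its peak, mean, and spread. The function names and the two toy distributions are assumptions made for illustration; they are not taken from the paper's code or its 245-feature set.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_summary(step_distributions):
    """Summarise a generation by peak, mean, and spread of per-token entropy."""
    entropies = [token_entropy(p) for p in step_distributions]
    return {
        "peak": max(entropies),
        "mean": mean(entropies),
        "std": pstdev(entropies),
    }


# Hypothetical next-token distributions for two short generations.
steady = [[0.9, 0.05, 0.05], [0.85, 0.1, 0.05], [0.9, 0.05, 0.05]]
spiky = [[0.98, 0.01, 0.01], [1 / 3, 1 / 3, 1 / 3], [0.98, 0.01, 0.01]]

steady_stats = entropy_summary(steady)
spiky_stats = entropy_summary(spiky)
```

In this toy case the spiky run is confident on most tokens but has a much higher peak, which is the kind of profile the paper associates with incorrect answers.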

What carries the argument

The Entropy Judger, a selection algorithm that uses 245 token-, agent-, and round-level entropy features to choose the best solution from a multi-agent system's pass@k outputs.
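The summary does not spell out the Entropy Judger's decision rule, so the following is only a minimal sketch of entropy-based selection among k candidate runs. The `Candidate` fields, the linear score, and the unit weights are all hypothetical stand-ins; the actual Judger draws on 245 features and may work quite differently.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    peak_entropy: float  # highest per-token entropy seen in the run
    entropy_std: float   # spread of per-token entropy across the run


def certainty_score(c, peak_weight=1.0, std_weight=1.0):
    """Higher is better: penalise entropy spikes and unstable entropy.

    The weights are illustrative, not fitted values from the paper.
    """
    return -(peak_weight * c.peak_entropy + std_weight * c.entropy_std)


def select_answer(candidates):
    """Pick the candidate with the best certainty score among k runs."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=certainty_score)


# Hypothetical pass@3 outputs from one multi-agent configuration.
runs = [
    Candidate("A", peak_entropy=0.42, entropy_std=0.08),
    Candidate("B", peak_entropy=1.35, entropy_std=0.40),
    Candidate("A", peak_entropy=0.61, entropy_std=0.15),
]
best = select_answer(runs)
```

The design point is that selection happens after generation, so nothing about the underlying models or prompts has to change.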

If this is right

Single-agent baselines are already sufficient in roughly 43 percent of the cases where multi-agent collaboration is currently tried.
Intervening only in the first round of interaction can steer most of the eventual performance outcome.
Selecting among multiple agent runs with an entropy-based rule improves accuracy without retraining or altering the underlying models.
Different tasks require different entropy tolerances, so one-size-fits-all multi-agent topologies are suboptimal.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Prompt engineering that keeps early entropy low could substitute for full multi-agent orchestration on many problems.
Real-time entropy monitoring during agent exchanges could let systems decide on the fly whether to continue collaboration or switch to a single-agent answer.
The same entropy-selection logic might apply to any multi-agent setup whose outputs can be scored for uncertainty, not just LLM-based ones.
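The real-time monitoring idea in the list above could be sketched as a small decision rule over the mean entropy of each completed round. Everything here is an assumption made for illustration: the function name, the action labels, and both thresholds are hypothetical and would need tuning per task.

```python
def next_action(round_mean_entropy, spike_threshold=1.0, min_drop=0.05):
    """Decide what to do after the latest completed round.

    round_mean_entropy: mean per-token entropy of each finished round, in order.
    spike_threshold and min_drop are hypothetical values, not from the paper.
    """
    if not round_mean_entropy:
        return "continue"  # nothing observed yet
    latest = round_mean_entropy[-1]
    if latest > spike_threshold:
        # Agents are highly uncertain: fall back to a single-agent answer.
        return "fallback_single_agent"
    if len(round_mean_entropy) >= 2:
        drop = round_mean_entropy[-2] - latest
        if drop < min_drop:
            # Entropy has stopped falling: another round is unlikely to help.
            return "stop"
    return "continue"
```

For example, `next_action([0.6, 0.58])` returns `"stop"` because the second round reduced entropy by less than `min_drop`.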

Load-bearing premise

That 245 entropy features measured at token, agent, and round levels are enough to reveal the true drivers of multi-agent success or failure without being confounded by model-specific tokenization or prompt formatting.

What would settle it

Run the Entropy Judger on a fresh suite of models and tasks; if accuracy does not rise above the plain pass@k baseline or if first-round entropy no longer predicts final correctness, the claimed causal link fails.
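One way to run that check on paired outcomes is an exact McNemar test, the same family of test the Figure 18 caption mentions for the temperature study. The sketch below assumes per-sample correctness flags for a baseline and for the selector on the same items; the function name and the toy data are hypothetical.

```python
from math import comb


def mcnemar_exact(baseline_correct, selector_correct):
    """Exact two-sided McNemar test on paired correctness flags.

    Returns (baseline_only, selector_only, p_value), where the first two
    values count samples that only one of the two methods got right.
    """
    if len(baseline_correct) != len(selector_correct):
        raise ValueError("inputs must be paired")
    pairs = list(zip(baseline_correct, selector_correct))
    baseline_only = sum(1 for b, s in pairs if b and not s)
    selector_only = sum(1 for b, s in pairs if s and not b)
    n = baseline_only + selector_only
    if n == 0:
        return baseline_only, selector_only, 1.0
    k = min(baseline_only, selector_only)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return baseline_only, selector_only, min(1.0, 2 * tail)


# Hypothetical outcomes on six samples.
baseline = [True, False, False, True, False, False]
selector = [True, True, True, True, False, True]
result = mcnemar_exact(baseline, selector)
```

With only three discordant pairs the p-value stays large, which illustrates why a fresh evaluation suite would need enough samples to separate the selector from the baseline.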

Figures

Figures reproduced from arXiv: 2602.04234 by Ningxin Su, Sijia Chen, Yuxuan Zhao.

**Figure 1.** Accuracy comparison of SAS and MAS across models and datasets. For brevity, LLaMA-3.2-3B-Instruct and LLaMA-3.1-8B-Instruct are denoted as L-3 and L-8, respectively; Qwen3-0.6B, Qwen3-4B, and Qwen3-8B are denoted as Q-0.6, Q-4, and Q-8. The base denotes the accuracy of a single Mbase on each dataset.

**Figure 2.** Base model uncertainty limits MAS effectiveness. The left two subfigures show results for LLaMA; the right two for Qwen. (a) Relationship between feature values and SHAP values for the most important entropy features on Gbase-H, sorted by Ī_j and annotated with ρ_j. (b) MAS performance across deciles of Mbase entropy: Mbase entropy is partitioned into ten equal-sized bins, and average MAS accuracy, aggrega…

**Figure 3.** MAS mainly fails on inter-agent misalignment. The left two subfigures show results for LLaMA; the right two for Qwen. (a) Same as …

**Figure 4.** Uncertainty in MAS exerts distinct effects depending on task difficulty and the coordination architecture. (a, c) Feature-SHAP relationships for top entropy features in GMAS, grouped by dataset (a) and architecture (c). (b, d) Corresponding box plots across all models, annotated with average MAS correctness per dataset (b) or per architecture (d).

**Figure 5.** More rounds do not necessarily improve MAS performance. (a) Accuracy and token consumption for different MAS architectures with R = 2 and R = 5 on two benchmarks. (b) Evolution of three key entropy metrics across rounds. (c) The impact of two prominent entropy features, notable for their high importance (Ī) and strong correlation (|ρ|) with sample correctness.

**Figure 6.** The role of uncertainty is reshaped in MAS built on Qwen2.5-7B-SimpleRL-Zoo. (a) The performance of different MAS architectures across datasets. (b) Relationship between base-model entropy and MAS accuracy. (c) Most predictive features in GMAS.

**Figure 7.** PCA variance explained for the 245-dimensional entropy feature space. The curve shows cumulative explained variance as a function of the number of principal components. Achieving 95% explained variance requires 43 components, indicating that information is distributed across many dimensions rather than concentrated in a few dominant directions.

**Figure 8.** Recursive feature elimination (RFE) performance curve. The optimal subset of 25 features achieves the highest accuracy (86.77%) and F1 (88.92%). Performance plateaus and slightly decreases as more features are added, indicating that the model does not overfit to the high-dimensional space.

**Figure 9.** Cross-method feature importance comparison across tree-based (Random Forest), logistic regression, chi-square, mutual information, and F-statistic methods. Despite fundamentally different mechanisms, the methods produce consistent top-feature rankings, validating the robustness of our feature importance findings.

**Figure 10.** Top 20 features on Gbase-H for Qwen (a) and LLaMA (b), ranked by mean normalized importance Ī. Each panel is divided into four subplots: top-left shows feature importance from XGBoost and LightGBM; bottom-left shows mean SHAP impact S̄, representing the average contribution of each feature to model predictions; right column displays scatter plots of feature values versus SHAP values, with Pearson correla…

**Figure 11.** SHAP waterfall plots on Gbase-H for representative samples: Qwen and LLaMA, with LightGBM and XGBoost. Each bar shows the contribution of a feature to the predicted MAS correctness.

**Figure 12.** Top: Feature importance and SHAP analysis on Gbase-full for Qwen (a) and LLaMA (b). Both show that base_model_is_finally_correct achieves Ī = 1.0 and ρ ≈ 0.96, vastly surpassing all other features with nearly linear correlation to MAS correctness. Bottom: Top 20 features on GMAS for MAS failure analysis: Qwen (c) and LLaMA (d). Qwen's top predictor is entropy variance, while LLaMA is dominated by answer-…

**Figure 13.** SHAP waterfall plots on GMAS for representative MAS failure samples. Qwen (a-b) shows entropy dispersion features (variance, Q3 agent) as dominant contributors; LLaMA (c-d) reveals answer-level features (token count, answer entropy) driving failure predictions.

**Figure 14.** Top 20 features on GMAS for mathematical reasoning tasks grouped by difficulty: (a) GSM8K (easy, |ρ| ≤ 0.15 for top features), (b) MATH500 (medium, positive ρ and S̄ for average entropy), (c-d) AIME2024/AIME2025 (hard, round-2 uncertainty harms performance).

**Figure 15.** Top 20 features on GMAS for (a) code generation (HumanEval) and (b) knowledge Q&A (MMLU). HumanEval shows negative ρ and S̄ for answer-level features; MMLU shows that more agents hurt performance (ρ < 0, S̄ < 0 for sample_num_agents).

**Figure 16.** Top 20 features on GMAS for multi-agent architectures: (a) Centralized (verbose answers harm performance), (b) Debate (cumulative entropy benefits once agents align), (c) Hybrid (extended deliberation helps), and (d) Sequential (answer-level entropy is the primary failure mode).

**Figure 17.** Top 20 features for (a) SAS and (b) MAS with R = 5 rounds on GMAS, and for MAS using MRL-base on (c) GMAS and (d) Gbase-H. Early-round uncertainty dominates prediction in all cases. In (c), round-2 entropy shows positive ρ but negative S̄, suggesting moderate later-round uncertainty is optimal. In (d), increased entropy from base to MAS still harms performance.

**Figure 18.** MAS accuracy across temperatures τ ∈ {0.4, 0.6, 0.8} for all five architectures on MATH500. Accuracy remains remarkably stable: the maximum variation within any architecture is 3.2% (Single), and the multi-agent average varies by only 0.5%. McNemar's test yields p > 0.37 for all 15 pairwise comparisons, confirming statistical invariance.

**Figure 19.** Entropy distribution statistics across temperatures for all architectures. Absolute entropy values scale approximately 2× from τ = 0.4 to τ = 0.8 (mean entropy: Centralized 0.048 → 0.075 → 0.101), but the relative ordering of architectures is preserved: Centralized and Single consistently exhibit higher entropy than Sequential, Debate, and Hybrid across all temperatures.

**Figure 20.** Feature importance and SHAP analysis for Qwen3-14B across two feature groups. (a) On GMAS, round-1 entropy dominates with approximately 70% of top-20 features being entropy-related (LightGBM accuracy: 83.2%, F1: 90.3%). (b) On Gbase-H, base model answer length emerges as the top predictor, while answer token entropy change (ρ ≈ −0.84) signals that entropy increase from base to MAS predicts failure (accura…

**Figure 21.** Top 20 features on FinanceAgent across two feature groups: (a) MAS-only features (GMAS), where architecture (ρ ≈ 0.83) and step-level entropy dominate; (b) including base model entropy (Gbase-H), where architecture remains the top predictor (ρ ≈ 0.84) and step 0 mean entropy shows moderate negative correlation (ρ ≈ −0.56). On the full feature set (Gbase-full), base_model_is_finally_correct (ρ ≈ 0.96) dom…

**Figure 22.** Reliability diagrams for all five models across six datasets. Each subplot shows observed accuracy (blue bars) versus entropy-derived confidence (x-axis), with the red dashed diagonal indicating perfect calibration. Bars above (below) the diagonal indicate under-confidence (over-confidence). Bin sample counts are annotated above each bar. Qwen3-4B and Qwen3-8B achieve near-perfect calibration on GSM8K (EC…

**Figure 23.** Heatmap of the confidently wrong proportion across all model–dataset combinations. Each cell reports the fraction of samples where the model exhibits low entropy (high confidence) yet answers incorrectly. Darker red indicates higher overconfident error rates. Qwen3-4B and Qwen3-8B maintain confidently wrong rates below 10% on most datasets, while LLaMA models and competition-level tasks exhibit rates exce…

**Figure 24.** Three-way entropy comparison across all 30 model-dataset combinations. Each subplot shows violin plots of per-token entropy distributions for SAS (teal), MAS Round 1 (red), and MAS Round 2 (blue), with mean µ annotated above each violin and the Wilcoxon signed-rank p-value for SAS vs. MAS R1 in the subplot title. The systematic shift from SAS to MAS R1 demonstrates that role assignment alone constitutes a…

**Figure 25.** Left: Mean entropy change from MAS Round 1 to Round 2 (H_R2 − H_R1) across all model-architecture-dataset combinations. Blue cells indicate entropy decrease (consensus formation); red cells indicate entropy increase. Right: Mean accuracy change (MAS − SAS) for the same combinations. Green cells indicate accuracy improvement; red cells indicate degradation. Cells that are blue on the left but red on the right…

**Figure 26.** Paired entropy scatter plots (MAS Round 1 vs. Round 2) across datasets. Each point represents a single sample; green circles (◦) denote correct answers and red crosses (×) denote incorrect answers. The dashed diagonal (y = x) separates entropy decrease (below) from entropy increase (above). The predominance of points below the diagonal confirms systematic entropy reduction, while the similar spatial distr…

**Figure 27.** Token-level entropy dynamics for Qwen3-0.6B across six datasets. High entropy persistence and frequent spikes characterize this smaller model, with entropy either remaining elevated or collapsing abruptly to zero in round 2.

**Figure 28.** Token-level entropy dynamics for Qwen3-4B. Increased model capacity yields more stable entropy on easier tasks, while harder tasks still induce erratic uncertainty patterns.

**Figure 29.** Token-level entropy dynamics for Qwen3-8B. The largest Qwen model shows structured deliberation with controlled exploration in round 1 and smooth convergence in round 2.

**Figure 30.** Token-level entropy dynamics for LLaMA-3.2-3B-Instruct. LLaMA exhibits lower round-2 entropy compared to Qwen, often collapsing to near-zero, reflecting a more decisive but potentially overconfident reasoning style.

**Figure 31.** Token-level entropy dynamics for LLaMA-3.1-8B-Instruct. Scaling improves calibration, but the characteristic rapid entropy reduction in round 2 persists compared to Qwen models.

**Figure 32.** Feature correlation heatmap for GMAS on LLaMA models. The lower triangle shows pairwise Pearson correlations; the upper-right inset lists the top 20 most strongly correlated feature pairs.

Original abstract

Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However, the mechanisms governing the effectiveness of MAS built upon publicly available LLMs, specifically the underlying rationales for their success or failure, remain largely unexplored. In this paper, we revisit MAS through the perspective of entropy, considering both intra- and inter-agent dynamics by investigating entropy transitions during problem-solving across various topologies, six reasoning benchmarks, and two agentic tasks. By analyzing 245 features spanning token-, agent-, and round-level entropy, we counterintuitively find that a single agent outperforms MAS in approximately 43.3% of cases, and that entropy dynamics are largely determined during the first round of interaction. Furthermore, we provide three key observations: 1) Certainty Preference: peak entropy directly harms and stable entropy directly benefits MAS correctness; 2) Base Entropy: base models with lower entropy during problem-solving causally drive MAS performance; and 3) Task Awareness: entropy dynamics of MAS play varying roles across different tasks. Building on these insights, we introduce a simple yet effective algorithm, the Entropy Judger, to select solutions from MAS's pass@k results, leading to consistent accuracy improvements across all MAS configurations and tasks. Our source code is available at https://github.com/AgenticFinLab/multiagent-entropy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates when multi-agent systems (MAS) outperform single agents on LLM reasoning and agentic tasks by analyzing entropy transitions at token, agent, and round levels. Using a large-scale study across six reasoning benchmarks, two agentic tasks, multiple topologies, and 245 entropy features, it reports that single agents win in 43.3% of cases, that entropy dynamics are largely fixed in the first round, and three observations (certainty preference, base entropy causally driving performance, task awareness). It introduces the Entropy Judger to select from pass@k outputs for consistent accuracy gains, with public code.

Significance. If the empirical patterns hold, the work provides useful observational insights into MAS effectiveness via entropy, showing collaboration does not always help and highlighting first-round dominance. The scale of the evaluation (multiple benchmarks and topologies) and the practical Entropy Judger algorithm add value, while the open-source code supports reproducibility.

major comments (2)

[Abstract and §4.3] The claim that base models with lower entropy 'causally drive' MAS performance is not supported by interventional evidence. The analysis rests on cross-model correlations on fixed tasks; no experiments fix the model while varying entropy (e.g., via temperature, top-p, or logit perturbation), and no causal identification strategy is applied to separate entropy from model capability. This limits the strength of the Base Entropy observation.
[§4] The 245 entropy features are presented as capturing key drivers, but the manuscript would benefit from an explicit ablation or sensitivity check on potential confounding from model-specific tokenization and prompt formatting choices, as noted in the analysis of feature sufficiency.
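The intervention named in the first major comment, fixing the model and varying entropy through temperature, can be illustrated on a single logit vector. This is a minimal sketch under stated assumptions: the logits are invented, and the temperatures 0.4, 0.6, and 0.8 are borrowed from the sweep described in the Figure 18 caption.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical next-token logits from one fixed model state.
logits = [2.0, 1.0, 0.5, -1.0]

# Same logits, different temperatures: only the entropy changes.
sweep = {t: entropy(softmax(logits, t)) for t in (0.4, 0.6, 0.8)}
```

Because the logits are held fixed, any downstream change in accuracy across such a sweep could be attributed to sampling entropy rather than to model capability, which is the separation the referee asks for.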

minor comments (2)

[Methods] Provide a concise table or appendix listing the exact definitions and extraction procedures for the 245 features at each level to aid reader understanding and replication.
[Figures] Ensure all entropy transition plots include clear legends, axis labels, and descriptions of the topologies and tasks shown.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and the recommendation of minor revision. We address the two major comments point by point below, indicating the changes we will incorporate.

Referee: [Abstract and §4.3] The claim that base models with lower entropy 'causally drive' MAS performance is not supported by interventional evidence. The analysis rests on cross-model correlations on fixed tasks; no experiments fix the model while varying entropy (e.g., via temperature, top-p, or logit perturbation), and no causal identification strategy is applied to separate entropy from model capability. This limits the strength of the Base Entropy observation.

Authors: We agree that the phrasing 'causally drive' is too strong given the observational, cross-model correlational nature of the analysis. No interventional experiments (temperature sweeps, logit perturbation, etc.) were performed to isolate entropy from model capability. In the revised manuscript we will replace 'causally drive' with 'are strongly associated with' in both the abstract and §4.3, add an explicit statement that the relationship is correlational, and include a limitations paragraph discussing the absence of causal identification.
Revision: yes.

Referee: [§4] The 245 entropy features are presented as capturing key drivers, but the manuscript would benefit from an explicit ablation or sensitivity check on potential confounding from model-specific tokenization and prompt formatting choices, as noted in the analysis of feature sufficiency.

Authors: We acknowledge that model-specific tokenization and prompt formatting could introduce confounding. Although the current feature-sufficiency analysis already examines predictive power, we will add a new sensitivity subsection in the revision that reports results under alternative tokenizers (where feasible) and standardized prompt templates, together with an ablation that removes or normalizes tokenizer-dependent components of the entropy features.
Revision: yes.

Circularity Check

0 steps flagged

No circularity: empirical observations from direct feature analysis

The paper extracts 245 entropy features at token/agent/round levels from held-out benchmark runs, reports observational statistics (e.g., single-agent superiority in 43.3% of cases, first-round dominance), and derives three descriptive patterns (Certainty Preference, Base Entropy, Task Awareness). The Entropy Judger is presented as a post-hoc rule-based selector whose selection criterion is stated independently of the final accuracy numbers. No equations reduce a claimed prediction to a fitted parameter by construction, no self-citation chain supports a uniqueness theorem or ansatz, and no renaming of known results occurs. All central claims remain falsifiable against external model outputs and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is almost entirely empirical. Entropy is used in its standard information-theoretic definition with no new axioms introduced. No free parameters are fitted to produce the central claims; the Judger appears to be a threshold or ranking rule derived from observed patterns rather than optimized constants. No new physical or computational entities are postulated.

axioms (1)

[standard math] Standard Shannon entropy definition applied to token probabilities. Used throughout the feature extraction without re-derivation.
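
For reference, the standard definition this axiom relies on can be written as follows. The symbols are assumptions chosen for illustration (vocabulary $\mathcal{V}$, next-token distribution $p_t$ at generation step $t$); the paper's own notation may differ.

```latex
H_t \;=\; -\sum_{v \in \mathcal{V}} p_t(v)\,\log p_t(v),
\qquad 0 \;\le\; H_t \;\le\; \log\lvert\mathcal{V}\rvert .
```

The lower bound is reached when all probability mass sits on one token, and the upper bound when the distribution over the vocabulary is uniform.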

