Can Small Agents Collaborate to Beat a Single Large Language Model?

Agata Żywot; Anders Søgaard; Maarten de Rijke; Xinyi Chen; Yifei Yuan

arxiv: 2601.11327 · v2 · submitted 2026-01-16 · 💻 cs.MA

Pith reviewed 2026-05-16 13:38 UTC · model grok-4.3

keywords: multi-agent systems · large language models · orchestrator · tool use · multi-step reasoning · benchmarks · model scaling · agent collaboration

The pith

Multi-agent systems built from smaller models can outperform much larger single language models on reasoning and tool-use tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether organizing several small language models into a multi-agent system can beat a single much larger model on tasks that need multiple steps, tool use, and collaboration. The authors use a simple setup in which one orchestrator directs a few specialized sub-agents with restricted communication. The small teams win on several benchmarks, and the orchestrator's reasoning ability, not the sub-agents', drives most of the improvement. This matters because it suggests that careful organization of models can substitute for raw increases in model size in agentic applications.

Core claim

On tool-intensive benchmarks spanning factual retrieval, multi-hop reasoning, scientific question answering, and mathematical problem solving, minimally designed multi-agent systems using smaller models outperform substantially larger single-agent models even when the latter have direct access to tools. Reasoning at the orchestrator yields the largest gains while enabling reasoning in sub-agents provides limited or negative benefits. Overall system performance is driven primarily by orchestrator capacity rather than sub-agent capacity.

What carries the argument

A minimally designed multi-agent system with a single orchestrator and a small set of specialized sub-agents with restricted communication, where the orchestrator directs coordination and reasoning.
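
To make the architecture concrete, here is a minimal Python sketch of a single orchestrator calling specialized sub-agents that never talk to each other. It is an illustration under stated assumptions, not the paper's implementation: the `Orchestrator` class, the `plan` routing rule, and the two toy sub-agents are all hypothetical, and the language models are replaced by plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical sub-agent interface: a function from a request to a reply.
SubAgent = Callable[[str], str]


@dataclass
class Orchestrator:
    """Single coordinator; sub-agents only ever see the request it sends them."""
    sub_agents: Dict[str, SubAgent]
    transcript: List[Tuple[str, str, str]] = field(default_factory=list)

    def plan(self, task: str) -> List[Tuple[str, str]]:
        # Stand-in for the orchestrator model's reasoning (an assumption):
        # send the task to every sub-agent, in registration order.
        return [(name, task) for name in self.sub_agents]

    def run(self, task: str) -> str:
        replies = []
        for name, request in self.plan(task):
            # Restricted communication: no sub-agent receives another's reply.
            reply = self.sub_agents[name](request)
            self.transcript.append((name, request, reply))
            replies.append(f"{name}: {reply}")
        return " | ".join(replies)


# Toy sub-agents standing in for tool-using small models (hypothetical).
def search_agent(request: str) -> str:
    return f"looked up '{request}'"


def math_agent(request: str) -> str:
    return f"computed '{request}'"


orchestrator = Orchestrator({"search": search_agent, "math": math_agent})
answer = orchestrator.run("2 + 2")
print(answer)
```

In this sketch all coordination lives in `plan` and `run`, which is one way to read the claim that the orchestrator, rather than the sub-agents, carries the system.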

If this is right

Architectural orchestration can substitute for raw model scaling on multi-step and tool-use tasks.
Performance gains come mainly from strengthening the orchestrator rather than the sub-agents.
Adding reasoning capabilities to sub-agents often yields little or negative returns.
Future agent systems should prioritize coordination mechanisms over uniform model enlargement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar minimal orchestration designs could extend to domains like planning or creative generation where coordination matters more than individual model size.
The results suggest testing whether even smaller orchestrators paired with many narrow sub-agents produce further efficiency gains.
This approach may reduce overall compute costs by avoiding the need to scale every component uniformly.

Load-bearing premise

The assumption that a simple single-orchestrator design with restricted sub-agent communication fairly tests the general potential of multi-agent collaboration against single large models.

What would settle it

A single large model achieving equal or higher performance on the same tool-intensive benchmarks when given equivalent prompting and tool access would falsify the central claim.

Original abstract

Recent progress in language modeling has largely relied on scaling model size, yet larger models do not reliably improve performance on tasks requiring multi-step reasoning and tool use. Multi-agent collaboration offers a potential alternative, raising a key question: can well-organized systems built from smaller models outperform much larger language models? We address this question using a minimally designed multi-agent system with a single orchestrator and a small set of specialized sub-agents with restricted communication. On tool-intensive benchmarks spanning factual retrieval, multi-hop reasoning, scientific question answering, and mathematical problem solving, we conduct controlled comparisons between small multi-agent systems and large single-agent models. We find that small multi-agent systems can outperform substantially larger single-agent models, even when the latter have direct access to tools. Reasoning at the orchestrator yields the largest gains, while enabling reasoning in sub-agents provides limited or negative benefits. Overall system performance is driven primarily by orchestrator capacity rather than sub-agent capacity. These results suggest that improved agentic performance depends more on architectural orchestration than on raw model scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Small multi-agent systems beat larger single models on tool tasks, but the gains track orchestrator capacity more than true small-agent collaboration.

The main result is that a minimal multi-agent setup using smaller models outperforms substantially larger single models on benchmarks for factual retrieval, multi-hop reasoning, scientific QA, and math problem solving. The experiments show that reasoning at the orchestrator level produces the biggest lift, while giving reasoning ability to the sub-agents adds little or even reduces performance. Overall, system success tracks the orchestrator's capacity far more than the sub-agents' capacity.

This is a clean empirical check on whether architecture can substitute for raw scale on agentic tasks. The minimal design, one orchestrator plus a few specialized sub-agents with restricted communication, keeps the comparison straightforward and avoids the usual multi-agent sprawl. That simplicity is a strength; it makes the negative finding on sub-agent reasoning easier to interpret. The paper also runs the comparisons with tool access for the single large models, which tightens the test.

The soft spot is the framing around “small agents.” Because performance is driven primarily by orchestrator capacity, the result risks being read as a capable coordinator plus helpers rather than collaboration among uniformly small models beating a large one. The paper needs to spell out the exact model sizes used for the orchestrator versus the single-model baselines so readers can judge how much scaling is still happening inside the “small” system. Beyond that, the evidence is direct benchmark work with no circular definitions or fitted parameters.

It is for people working on agent orchestration and cost-efficient LLM use. A reader focused on whether we can trade model size for better coordination will find concrete numbers to think about. The question is timely enough and the controls are tight enough that it deserves a serious referee rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that minimally designed multi-agent systems consisting of a single orchestrator and a small set of specialized sub-agents with restricted communication can outperform substantially larger single-agent LLMs on tool-intensive benchmarks (factual retrieval, multi-hop reasoning, scientific QA, and mathematical problem solving). Controlled comparisons show that reasoning at the orchestrator drives the largest gains, sub-agent reasoning provides limited or negative benefits, and overall performance depends primarily on orchestrator capacity rather than sub-agent capacity.

Significance. If the empirical results hold under rigorous controls, the work provides evidence that architectural orchestration in multi-agent setups can yield better agentic performance than raw model scaling, with potential implications for designing more efficient systems that leverage smaller models. The controlled benchmark comparisons and focus on tool use are strengths that could inform future agent research.

major comments (2)

[Abstract] Abstract: The headline claim that small multi-agent systems outperform substantially larger single-agent models is load-bearing on the assumption that the orchestrator itself qualifies as small; however, the finding that performance is driven primarily by orchestrator capacity (rather than sub-agents) creates a risk that the orchestrator model size is comparable to the large baseline, which would reframe the result as large-model orchestration plus small helpers rather than collaboration among small agents.
[Abstract] Abstract and experimental description: The abstract outlines controlled comparisons but provides no details on exact model names, parameter counts for orchestrator vs. baseline, benchmark names, or statistical significance testing; these omissions are load-bearing because they prevent verification that the multi-agent system as a whole is 'small' relative to the single large model and that confounds (e.g., tool access parity) are fully addressed.

minor comments (2)

[Abstract] Abstract: The specific benchmarks are described only generically as 'tool-intensive'; naming them explicitly would aid reproducibility and reader assessment.
[Abstract] Abstract: The quantitative support for 'limited or negative benefits' from enabling reasoning in sub-agents should be referenced with effect sizes or table citations in the main text for clarity.
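
The request for statistical significance testing in the major comments can be illustrated with a paired test on per-question correctness. The sketch below uses an exact McNemar (sign) test; this is an assumption chosen for illustration, since the review does not say which test the paper uses. The function names and the eight-question toy data are hypothetical and are not results from the paper.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(system_a: Sequence[bool],
                      system_b: Sequence[bool]) -> Tuple[int, int]:
    """Count questions that exactly one of the two systems answers correctly."""
    if len(system_a) != len(system_b):
        raise ValueError("both systems must be scored on the same questions")
    only_a = sum(1 for a, b in zip(system_a, system_b) if a and not b)
    only_b = sum(1 for a, b in zip(system_a, system_b) if b and not a)
    return only_a, only_b


def exact_mcnemar_p(only_a: int, only_b: int) -> float:
    """Two-sided exact p-value under the null that a disagreement on a
    question is equally likely to favor either system."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree, so there is no evidence
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up per-question outcomes (True = correct), for illustration only.
multi_agent = [True, True, True, True, False, True, True, True]
single_model = [False, True, False, True, False, False, True, False]

only_multi, only_single = discordant_counts(multi_agent, single_model)
p_value = exact_mcnemar_p(only_multi, only_single)
print(only_multi, only_single, p_value)
```

A paired test fits here because both systems answer the same benchmark questions, so only the questions on which they disagree carry information about which one is better.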

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important points about clarity in the abstract regarding model sizes and experimental details. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract] Abstract: The headline claim that small multi-agent systems outperform substantially larger single-agent models is load-bearing on the assumption that the orchestrator itself qualifies as small; however, the finding that performance is driven primarily by orchestrator capacity (rather than sub-agents) creates a risk that the orchestrator model size is comparable to the large baseline, which would reframe the result as large-model orchestration plus small helpers rather than collaboration among small agents.

Authors: We agree that the abstract must make the relative sizes explicit to avoid any ambiguity in the headline claim. In our experiments the orchestrator is a smaller model than the single-agent baselines (which are substantially larger), while the sub-agents are even smaller specialized models; the multi-agent system as a whole therefore remains smaller than the large single-agent baselines. To eliminate the risk of misinterpretation, we will revise the abstract to state the concrete model families and approximate parameter counts for the orchestrator versus the baselines, thereby confirming that the performance gains arise from orchestration of smaller models rather than from an orchestrator that is itself comparable in scale to the large baseline.
revision: yes

Referee: [Abstract] Abstract and experimental description: The abstract outlines controlled comparisons but provides no details on exact model names, parameter counts for orchestrator vs. baseline, benchmark names, or statistical significance testing; these omissions are load-bearing because they prevent verification that the multi-agent system as a whole is 'small' relative to the single large model and that confounds (e.g., tool access parity) are fully addressed.

Authors: We accept that the abstract is currently too high-level and should supply the missing specifics to allow readers to verify the 'small' claim and the controls. In the revised manuscript we will expand the abstract to name the exact models and their parameter scales for both the orchestrator and the single-agent baselines, list the four benchmark families (factual retrieval, multi-hop reasoning, scientific QA, and mathematical problem solving), and note that all comparisons were performed with identical tool access. We will also add a brief statement on the statistical testing used to establish significance of the reported gains. These additions will be placed in the abstract while preserving its length constraints; fuller experimental details already appear in the main text and will be cross-referenced.
revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark comparison with no derivations or self-referential reductions

The paper reports controlled experimental comparisons between multi-agent systems and single large models on tool-intensive benchmarks. No mathematical derivations, equations, fitted parameters, or first-principles results are present that could reduce to inputs by construction. Claims rest on observed performance differences rather than any self-definitional, fitted-prediction, or self-citation load-bearing steps. The central finding that orchestrator capacity drives gains is an empirical observation, not a tautology derived from the paper's own definitions or prior self-citations. This matches the expected non-finding for a benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical comparison study; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5495 in / 1160 out tokens · 103515 ms · 2026-05-16T13:38:20.548293+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Overall system performance is driven primarily by orchestrator capacity rather than sub-agent capacity... Tool usage provides the largest and most consistent gains."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We find that small multi-agent systems can outperform substantially larger single-agent models"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.