Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Pith reviewed 2026-05-10 02:32 UTC · model grok-4.3
The pith
Single-agent systems with tools give small language models the best performance-to-cost ratio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For open-source models under 10B parameters, equipping a single agent with tools produces the strongest balance of accuracy gains against added latency and resource use, while multi-agent collaboration introduces measurable overhead with only marginal performance improvements over the single-agent case.
What carries the argument
The argument rests on a comparison of three paradigms across tasks: base-model inference, single-agent tool-augmented execution, and multi-agent collaboration with shared context and hand-offs. Performance and cost metrics are tracked for each paradigm to quantify the trade-offs.
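The three paradigms can be sketched as three call patterns. Everything below (the `llm` callable, the tool names, the `FINAL:` stopping convention, the role list) is a hypothetical illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of the three paradigms compared in the study.
# `llm` stands in for any <10B model as a plain callable; the "FINAL:"
# convention and tool-call syntax are assumptions made here for clarity.

def base_model(query, llm):
    # Paradigm 1: bare inference, a single model call and nothing else.
    return llm(query)

def single_agent(query, llm, tools, max_steps=5):
    # Paradigm 2: one agent interleaves model calls with tool calls,
    # appending each observation to a scratchpad until it answers.
    scratchpad = query
    for _ in range(max_steps):
        action = llm(scratchpad)
        if action.startswith("FINAL:"):
            return action[len("FINAL:"):].strip()
        name, _, arg = action.partition(" ")
        observation = tools.get(name, lambda a: "error: unknown tool")(arg)
        scratchpad += f"\n{action} -> {observation}"
    return llm(scratchpad + "\nFINAL:")

def multi_agent(query, llm, roles=("planner", "solver", "critic")):
    # Paradigm 3: each role adds its own model call and hand-off, so total
    # cost grows with agent count even when accuracy gains are marginal.
    context = query
    for role in roles:
        context = llm(f"[{role}] {context}")
    return context
```

With a scripted `llm` stub, `single_agent` above spends one tool call plus two model calls per task, while `multi_agent` spends one model call per role; counting calls like this is the kind of per-task cost accounting the study tracks.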
If this is right
- Deployment decisions for models under 10B parameters should favor single-agent tool integration over both bare inference and multi-agent setups.
- Cost and latency budgets can be allocated more efficiently by adding targeted tool access rather than expanding agent count.
- Resource-constrained environments gain a concrete path to usable performance without scaling model size.
- Further agent design effort should target reducing coordination overhead rather than assuming collaboration always helps.
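The budgeting implications above reduce to simple arithmetic: compare paradigms by accuracy gain per unit of cost added over bare inference. The numbers below are invented for illustration and are not the paper's measurements.

```python
# Toy trade-off arithmetic: accuracy gain per unit of added cost, relative
# to bare inference. All figures are made up to illustrate the comparison;
# they do not come from the paper.

def gain_per_added_cost(base, variant):
    (base_acc, base_cost), (acc, cost) = base, variant
    return (acc - base_acc) / (cost - base_cost)

# (accuracy, normalized cost); base cost is 1.0 by construction.
base         = (0.48, 1.0)
single_agent = (0.62, 1.6)   # tools add modest cost, sizable gain
multi_agent  = (0.64, 3.2)   # coordination overhead, marginal extra gain

single_eff = gain_per_added_cost(base, single_agent)  # ~0.23 per cost unit
multi_eff  = gain_per_added_cost(base, multi_agent)   # ~0.07 per cost unit
```

Under this framing the single-agent column dominates whenever extra agents multiply cost faster than they raise accuracy, which is exactly the shape of the result the paper reports.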
Where Pith is reading between the lines
- This framing could redirect attention from parameter count toward integration patterns when choosing models for production.
- Privacy-sensitive applications might adopt small models with local tool access more readily if single-agent overhead remains low.
- Task-specific tool libraries could be tuned to amplify the single-agent advantage observed here.
Load-bearing premise
The tested models, tasks, and agent implementations are representative enough that the observed multi-agent overhead would appear under other coordination designs as well.
What would settle it
Re-running the same tasks with an alternative multi-agent coordination protocol that delivers substantially higher accuracy at similar or lower total cost would falsify the central claim.
Original abstract
Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of <10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi-agent system with collaborative capabilities. Our results show that single-agent systems achieve the best balance between performance and cost, while multi-agent setups add overhead with limited gains. Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the first large-scale study of open-source SLMs (<10B parameters) under three paradigms: base models, single-agent systems with tools, and multi-agent collaboration. It claims that single-agent setups achieve the optimal performance-cost balance while multi-agent systems incur overhead with only limited gains, advocating agent-centric designs for efficient and trustworthy deployment in resource-constrained environments.
Significance. If the empirical comparisons hold under rigorous scrutiny, the findings would offer practical value for SLM deployment by demonstrating that targeted agent paradigms can offset small-model limitations more efficiently than scaling or complex multi-agent coordination. This shifts emphasis toward paradigm choice over model size, with direct relevance to latency-sensitive, privacy-focused, and low-resource applications.
Major comments (2)
- [Abstract] The central claim that single-agent systems achieve the best performance-cost balance while multi-agent setups add overhead with limited gains is load-bearing for the paper's contribution. However, the evaluation does not include ablations of alternative coordination mechanisms (e.g., different task decomposition, communication protocols, or hierarchical structures) or comparisons to more structured multi-agent frameworks. Without these, the overhead may reflect the specific implementation rather than an inherent property of multi-agent paradigms for the chosen tasks and models.
- [Experimental sections] The abstract references clear comparative results across paradigms, but the absence of detailed baselines, statistical significance tests, data exclusion criteria, cost metric definitions, and full implementation specifics for the agent/tool setups makes it impossible to verify whether the data rigorously support the trade-off conclusions. These details are necessary to assess generalizability beyond the tested open-source models and tasks.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We have addressed each major comment point-by-point below, making revisions to strengthen the manuscript where possible while maintaining the integrity of our large-scale empirical study.
Point-by-point responses
- Referee: [Abstract] The central claim that single-agent systems achieve the best performance-cost balance while multi-agent setups add overhead with limited gains is load-bearing for the paper's contribution. However, the evaluation does not include ablations of alternative coordination mechanisms (e.g., different task decomposition, communication protocols, or hierarchical structures) or comparisons to more structured multi-agent frameworks. Without these, the overhead may reflect the specific implementation rather than an inherent property of multi-agent paradigms for the chosen tasks and models.
Authors: We agree that the specific multi-agent implementation could influence the observed overhead. Our study deliberately used a representative standard collaboration protocol to enable consistent, large-scale evaluation across models and tasks. In the revision, we have added a dedicated discussion subsection analyzing how alternative mechanisms (e.g., hierarchical structures or varied communication protocols) might affect results, supported by qualitative reasoning on communication costs. We maintain that the core trade-off findings hold for typical deployments, but we now explicitly note this as a limitation and direction for future work. Revision: partial.
- Referee: [Experimental sections] The abstract references clear comparative results across paradigms, but the absence of detailed baselines, statistical significance tests, data exclusion criteria, cost metric definitions, and full implementation specifics for the agent/tool setups makes it impossible to verify whether the data rigorously support the trade-off conclusions. These details are necessary to assess generalizability beyond the tested open-source models and tasks.
Authors: We appreciate this observation and have substantially expanded the experimental sections and appendix in the revised manuscript. Additions include: explicit descriptions of all baselines, statistical significance testing (with p-values from appropriate tests for key comparisons), data exclusion criteria, precise cost metric definitions (token usage, latency, and energy estimates), and complete implementation details for agent/tool configurations. These changes enable full verification and better assessment of generalizability. Revision: yes.
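The promised significance testing could take many forms; one common choice for paired per-item accuracy comparisons is a paired bootstrap, sketched below on synthetic data. The specific test the authors used is not stated in this excerpt, so this is an assumed procedure, not theirs.

```python
# Minimal paired-bootstrap test for a per-item accuracy comparison between
# two paradigms evaluated on the same task set. The data and resample count
# are illustrative; this is an assumed procedure, not the paper's documented one.
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    # scores_*: per-item 0/1 correctness for the same items, same order.
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]       # resample items with replacement
        diff = sum(scores_a[i] - scores_b[i] for i in idx) / n
        if diff > 0:                                     # resampled accuracy favors A
            wins += 1
    return 1.0 - wins / n_resamples                      # approximate one-sided p-value
```

A paradigm difference that survives this resampling (small p-value) is less likely to be an artifact of which tasks happened to be in the benchmark, which is the concern the referee raises.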
Circularity Check
No circularity: empirical results from direct model evaluations
Full rationale
The paper conducts a large-scale empirical comparison of <10B SLMs under base, single-agent-with-tools, and multi-agent paradigms, reporting observed performance-cost trade-offs. No equations, fitted parameters, or derivations are described; claims rest on experimental measurements rather than reducing to self-definitions, renamed inputs, or self-citation chains. The central finding (single-agent balance) is a direct outcome of the reported runs and does not collapse to any prior assumption or fit within the paper itself. This is a standard non-circular empirical study.