pith. machine review for the scientific record.

arxiv: 2506.02153 · v2 · submitted 2025-06-02 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Small Language Models are the Future of Agentic AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:50 UTC · model grok-4.3

classification 💻 cs.AI
keywords: small language models · agentic AI · large language models · AI agents · model efficiency · agent architectures · deployment costs · heterogeneous systems

The pith

Small language models will replace large ones in most agentic AI applications due to better suitability and economy for specialized tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that agentic AI systems use language models mainly for a small number of specialized tasks performed repetitively with little variation, rather than for open-ended general conversation. In this setting, small language models already deliver enough capability while matching the task structure more closely and costing far less to invoke repeatedly. The authors therefore position small models as the future of agentic AI and recommend heterogeneous systems that combine different model sizes only when broad conversational abilities are required. They back the claim with an analysis of current capabilities, common agent architectures, and deployment economics, plus an algorithm for converting existing large-model agents to small-model versions.

Core claim

Agentic AI systems perform a small number of specialized tasks repetitively and with little variation. For these systems, small language models are sufficiently powerful, inherently more suitable, and necessarily more economical than large language models, establishing them as the future of agentic AI. In cases where general-purpose conversational abilities remain essential, heterogeneous agentic systems that invoke multiple different models offer the natural solution.

What carries the argument

The position that small language models are sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, supported by an LLM-to-SLM agent conversion algorithm.
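The paper presents the conversion algorithm only at a conceptual level. As an editorial illustration (function names, the keyword-based clustering heuristic, and the step labels are our assumptions, not the paper's), such a pipeline might collect usage logs from the deployed LLM agent, cluster invocations into recurring task types, and plan a fine-tune of a small model for each sufficiently large cluster:

```python
from collections import defaultdict

# Hypothetical sketch of an LLM-to-SLM conversion pipeline.
# Step names and heuristics are illustrative, not taken from the paper.

def collect_usage_logs(agent_calls):
    """Step 1: record (prompt, response) pairs from the deployed LLM agent."""
    return [(c["prompt"], c["response"]) for c in agent_calls]

def cluster_by_task(logs):
    """Step 2: group invocations into recurring task types.
    A trivial keyword heuristic stands in for real clustering."""
    clusters = defaultdict(list)
    for prompt, response in logs:
        key = "tool_call" if "CALL:" in response else "summarize"
        clusters[key].append((prompt, response))
    return dict(clusters)

def select_and_finetune(clusters, min_examples=2):
    """Step 3: for each sufficiently large cluster, pick an SLM and
    plan a fine-tune on the cluster's data."""
    plans = {}
    for task, examples in clusters.items():
        if len(examples) >= min_examples:
            plans[task] = {"model": "slm-base", "n_train": len(examples)}
    return plans

calls = [
    {"prompt": "look up weather", "response": "CALL: get_weather(city='Oslo')"},
    {"prompt": "look up stock", "response": "CALL: get_price(ticker='NVDA')"},
    {"prompt": "condense this report", "response": "The report says..."},
]
plans = select_and_finetune(cluster_by_task(collect_usage_logs(calls)))
print(plans)  # {'tool_call': {'model': 'slm-base', 'n_train': 2}}
```

The sketch makes the paper's structural point concrete: the conversion hinges on usage data revealing a small set of repetitive task types, each narrow enough for a specialized small model.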

If this is right

  • Agentic systems can reach comparable performance at a fraction of current inference costs.
  • Development efforts will shift toward fine-tuning small models for specific agent roles rather than scaling model size.
  • Heterogeneous designs will become standard, routing routine tasks to small models and reserving larger models for complex reasoning.
  • Industry-wide operational expenses for running AI agents will drop even as the number of deployed agents grows.
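The heterogeneous routing in the third point can be sketched as a minimal dispatcher; the task taxonomy and model labels below are illustrative assumptions, not taken from the paper:

```python
# Illustrative router for a heterogeneous agentic system: routine,
# well-characterized task types go to a specialized SLM; anything
# else escalates to a generalist LLM. Task names and model labels
# are hypothetical.

ROUTINE_TASKS = {"extract_fields", "format_output", "call_tool"}

def route(task_type):
    """Return the model tier to invoke for a given task type."""
    return "slm-specialist" if task_type in ROUTINE_TASKS else "llm-generalist"

def dispatch(task_types):
    """Route a batch of invocations and tally tier usage."""
    tally = {"slm-specialist": 0, "llm-generalist": 0}
    for t in task_types:
        tally[route(t)] += 1
    return tally

print(dispatch(["call_tool", "extract_fields", "open_dialogue", "call_tool"]))
# {'slm-specialist': 3, 'llm-generalist': 1}
```

If the paper's premise holds, most real traffic resembles the three routine invocations here, so the expensive tier is touched only for the occasional open-ended request.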

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Local execution of agents on smaller models could reduce latency and improve data privacy by limiting cloud dependence.
  • Specialized small models may accelerate creation of domain-specific agents for common subtasks such as tool use or planning.
  • Overall compute requirements for large-scale agent deployments may stabilize despite continued growth in agent numbers.

Load-bearing premise

The specialized, low-variation tasks in current and near-future agentic systems do not require the full general capabilities that only large models currently provide.

What would settle it

A controlled study showing that replacing the language-model component in representative agentic workflows with small models produces substantially lower task-completion rates or requires frequent human intervention to correct errors.
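One minimal shape for such a study is a paired comparison over shared workflows; the stub agents and workflow names below are hypothetical stand-ins for real model backends:

```python
# Sketch of a paired evaluation harness: run two agent backends on
# the same workflows and compare task-completion rates. Agents are
# stand-in callables returning True on successful completion.

def run_study(workflows, llm_agent, slm_agent):
    n = len(workflows)
    done_llm = sum(llm_agent(w) for w in workflows)
    done_slm = sum(slm_agent(w) for w in workflows)
    return {"llm_rate": done_llm / n, "slm_rate": done_slm / n}

# Stubs: the LLM completes everything; the SLM fails one
# hypothetical long-horizon planning workflow.
workflows = ["extract", "call_tool", "format", "long_horizon_plan"]
llm_agent = lambda w: True
slm_agent = lambda w: w != "long_horizon_plan"
print(run_study(workflows, llm_agent, slm_agent))
# {'llm_rate': 1.0, 'slm_rate': 0.75}
```

A substantial, reproducible gap in completion rates on representative workflows would count against the position; near-parity would support it.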

read the original abstract

Large language models (LLMs) are often praised for exhibiting near-human performance on a wide range of tasks and valued for their ability to hold a general conversation. The rise of agentic AI systems is, however, ushering in a mass of applications in which language models perform a small number of specialized tasks repetitively and with little variation. Here we lay out the position that small language models (SLMs) are sufficiently powerful, inherently more suitable, and necessarily more economical for many invocations in agentic systems, and are therefore the future of agentic AI. Our argumentation is grounded in the current level of capabilities exhibited by SLMs, the common architectures of agentic systems, and the economy of LM deployment. We further argue that in situations where general-purpose conversational abilities are essential, heterogeneous agentic systems (i.e., agents invoking multiple different models) are the natural choice. We discuss the potential barriers for the adoption of SLMs in agentic systems and outline a general LLM-to-SLM agent conversion algorithm. Our position, formulated as a value statement, highlights the significance of the operational and economic impact even a partial shift from LLMs to SLMs is to have on the AI agent industry. We aim to stimulate the discussion on the effective use of AI resources and hope to advance the efforts to lower the costs of AI of the present day. Calling for both contributions to and critique of our position, we commit to publishing all such correspondence at https://research.nvidia.com/labs/lpr/slm-agents.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom-and-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and the analysis below is the friction.

Referee Report

3 major / 2 minor

Summary. The paper argues that small language models (SLMs) are sufficiently powerful, inherently more suitable, and necessarily more economical than large language models (LLMs) for the specialized, repetitive, low-variation tasks typical in agentic AI systems, positioning SLMs as the future of agentic AI. It grounds the position in current SLM capabilities, common agent architectures, and deployment economics; recommends heterogeneous systems (invoking multiple model sizes) when general conversational abilities are required; discusses adoption barriers; and outlines a high-level LLM-to-SLM agent conversion algorithm. The work is framed as a value statement intended to stimulate discussion on efficient AI resource use.

Significance. If the central position holds, the paper could have meaningful operational and economic impact by encouraging a shift toward lower-cost SLM deployments in agentic systems, reducing overall AI inference expenses across the industry. The explicit commitment to publishing all correspondence on the position at a public URL is a constructive element that supports open scientific dialogue.

major comments (3)
  1. [Abstract and sections on current capabilities and agent architectures] The core claim that SLMs are 'sufficiently powerful' for many agentic invocations (abstract and opening sections) rests entirely on qualitative assessment of 'current level of capabilities' without any quantitative benchmarks, controlled head-to-head comparisons, task decompositions, or failure-mode analyses showing where SLM performance remains adequate inside real agent loops.
  2. [Sections grounding the position in capabilities and architectures] The assertion that specialized low-variation tasks 'do not require the full general capabilities that only large models currently provide' (abstract and argumentation sections) is presented as an observational premise but receives no empirical support via metrics on distributional shift, ambiguity handling, or multi-step error accumulation in agentic settings.
  3. [Section outlining the LLM-to-SLM conversion algorithm] The outlined LLM-to-SLM agent conversion algorithm (section on conversion) is described only at a conceptual level with no pseudocode, concrete steps, implementation details, or validation examples, rendering it non-actionable for the practical adoption the paper advocates.
minor comments (2)
  1. [Abstract and introduction] The phrase 'inherently more suitable' is used repeatedly without an explicit definition or list of suitability criteria (e.g., latency, memory footprint, fine-tuning ease) that would allow readers to evaluate the claim.
  2. [Section on potential barriers] Barriers to SLM adoption are enumerated but not ranked by severity or illustrated with concrete deployment scenarios, which would strengthen the practical discussion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential operational impact of our position. Our manuscript is explicitly framed as a value statement to stimulate discussion on efficient AI resource use, rather than an empirical study. We address each major comment below and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract and sections on current capabilities and agent architectures] The core claim that SLMs are 'sufficiently powerful' for many agentic invocations (abstract and opening sections) rests entirely on qualitative assessment of 'current level of capabilities' without any quantitative benchmarks, controlled head-to-head comparisons, task decompositions, or failure-mode analyses showing where SLM performance remains adequate inside real agent loops.

    Authors: We acknowledge that the claims rely on qualitative assessment of existing SLM capabilities rather than new quantitative benchmarks or controlled experiments. This aligns with the paper's purpose as a position statement grounded in observed capabilities, common agent architectures, and deployment economics, not a benchmark paper. We will revise the abstract and opening sections to explicitly cite relevant existing literature on SLM performance in specialized and repetitive tasks, while clarifying that the position is intended to highlight trends and economics rather than prove sufficiency through new data. revision: partial

  2. Referee: [Sections grounding the position in capabilities and architectures] The assertion that specialized low-variation tasks 'do not require the full general capabilities that only large models currently provide' (abstract and argumentation sections) is presented as an observational premise but receives no empirical support via metrics on distributional shift, ambiguity handling, or multi-step error accumulation in agentic settings.

    Authors: The assertion is presented as an observational premise based on the repetitive, low-variation nature of tasks that dominate agentic systems. We do not provide new empirical metrics on distributional shift or error accumulation, as the work is not an empirical evaluation. In revision we will expand the relevant sections with additional examples from current agent architectures and citations to studies showing effective handling of such tasks by smaller models, to better support the premise without altering the position-paper framing. revision: partial

  3. Referee: [Section outlining the LLM-to-SLM conversion algorithm] The outlined LLM-to-SLM agent conversion algorithm (section on conversion) is described only at a conceptual level with no pseudocode, concrete steps, implementation details, or validation examples, rendering it non-actionable for the practical adoption the paper advocates.

    Authors: We agree that the high-level description of the conversion algorithm would benefit from greater concreteness to support the practical adoption we advocate. We will revise the section to include pseudocode, a list of concrete steps, and a brief illustrative example based on a standard agent task to make the algorithm more actionable. revision: yes

Circularity Check

0 steps flagged

No circularity: observational position paper with no derivations or self-referential steps

full rationale

The manuscript is a position paper that advances an argumentative claim about SLMs in agentic systems. It supplies no equations, no fitted parameters, no uniqueness theorems, and no derivations that could reduce to their own inputs. All grounding is stated as observational (current SLM capabilities, common agent architectures, deployment economics) without any self-citation load-bearing on the central thesis or any renaming of known results as new predictions. The argument is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Position paper containing no formal parameters, axioms, or invented entities; the central claim rests on informal assessment of current model capabilities and deployment costs.

pith-pipeline@v0.9.0 · 5602 in / 938 out tokens · 35950 ms · 2026-05-16T11:50:04.174682+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

    cs.CR 2026-04 unverdicted novelty 7.0

    A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

  2. Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

    cs.CL 2026-04 unverdicted novelty 7.0

    Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

  3. Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity

    cs.CL 2026-04 unverdicted novelty 7.0

    AMuFC improves multimodal fact-checking accuracy by adaptively determining visual evidence necessity via a dedicated Analyzer before verification rather than always incorporating images.

  4. Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity

    cs.CL 2026-04 unverdicted novelty 7.0

    An adaptive multimodal fact-checking system improves accuracy by having an Analyzer determine when visual evidence is necessary before the Verifier assesses claim veracity.

  5. Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA

    cs.IR 2026-01 unverdicted novelty 7.0

    SearchFireSafety benchmark shows graph-guided retrieval improves statute-centric legal QA but domain-adapted models hallucinate more when statutory evidence is missing.

  6. SANet: A Semantic-aware Agentic AI Networking Framework for Cross-layer Optimization in 6G

    cs.AI 2025-12 unverdicted novelty 7.0

    SANet uses semantic-aware AI agents for cross-layer 6G optimization, achieving up to 14.61% performance gains with 44.37% of the FLOPs of prior methods via model partitioning and decentralized multi-objective algorithms.

  7. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  8. GRAIL: A Deep-Granularity Hybrid Resonance Framework for Real-Time Agent Discovery via SLM-Enhanced Indexing

    cs.AI 2026-05 unverdicted novelty 6.0

    GRAIL achieves over 79 times lower latency than LLM-parsing baselines and higher Recall@10 than vector search by combining SLM-enhanced prediction, pseudo-document expansion, and MaxSim resonance on the new AgentTaxo-...

  9. AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?

    cs.AI 2026-05 unverdicted novelty 6.0

    Small open-weight models match GPT-5 on routine agent tool-use tasks but lag on long-horizon planning, supporting tiered routing to reduce costs in agentic systems.

  10. SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    Token-wise INT4 KV-cache quantization plus block-diagonal Hadamard rotation recovers nearly all accuracy lost by naive INT4 while adding zero end-to-end overhead under paged serving constraints.

  11. Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    A fine-tuning policy trains small language models to search reliably and use evidence, improving multi-hop QA performance by 15-17 points to reach large-model levels.

  12. Language Markers of Emotion Flexibility Predict Depression and Anxiety Treatment Outcomes

    cs.CL 2026-01 unverdicted novelty 6.0

    Emotion dynamics from therapy transcripts, extracted via transformers and clustered with state-space models, distinguish improving patients from non-responders who show higher odds of symptom worsening.

  13. EmbeddingGemma: Powerful and Lightweight Text Representations

    cs.CL 2025-09 unverdicted novelty 6.0

    A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.

  14. Cognitive Agent Compilation for Explicit Problem Solver Modeling

    cs.CL 2026-05 unverdicted novelty 5.0

    Cognitive Agent Compilation uses a teacher LLM to create explicit, inspectable problem-solving agents by separating knowledge, policy, and verification components for educational applications.

  15. SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification

    cs.CL 2026-05 conditional novelty 5.0

    SHIELD dataset and distilled DeBERTa v3 model achieve 0.88 micro precision and 0.86 recall on PHI de-identification while matching teacher performance on structured categories.

  16. Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

    cs.AI 2026-04 unverdicted novelty 5.0

    A hybrid system augments LLMs with an automated external RDF/OWL ontology layer for long-term memory, SHACL/OWL validation, and improved multi-step reasoning on tasks like Tower of Hanoi.

  17. A pragmatic approach to regulating AI agents

    cs.CY 2026-04 unverdicted novelty 5.0

    AI agents require distinct regulation as AI systems under the EU AI Act with orchestration-layer oversight and a risk-based traffic light authorization system in contract law to preserve human accountability.

  18. AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent

    cs.LG 2026-04 unverdicted novelty 5.0

    AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-...

  19. Security Threat Modeling for Emerging AI-Agent Protocols: A Comparative Analysis of MCP, A2A, Agora, and ANP

    cs.CR 2026-02 unverdicted novelty 5.0

    The paper identifies twelve protocol-level security risks across MCP, A2A, Agora, and ANP and quantifies wrong-provider tool execution risk in MCP via a measurement-driven case study on multi-server composition.

  20. TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains

    cs.CL 2026-05 unverdicted novelty 4.0

    TRACE is a metrologically-grounded four-layer engineering framework for trustworthy agentic AI that enforces an ML-LLM split, stateful policies, human supervision, and a parsimony metric across critical domains.

  21. Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

    cs.AI 2026-05 unverdicted novelty 4.0

    A fine-tuned 4B model matches or exceeds frontier LLMs in terminal execution subagent tasks for coding agents, reducing main agent token usage by 30% with no performance loss.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 20 Pith papers · 9 internal anchors

  1. [1]

    Small language models vs

    Aashima. Small language models vs. llms: Finding the right fit for your needs, October 2024. Accessed: 2025-05-09

  2. [2]

    Small language models vs

    ABBYY. Small language models vs. large language models, November 2024. Accessed: 2025-05-09

  3. [3]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone.arXiv preprint arXiv:2404.14219, 2024

  4. [4]

    The economics of ai training and inference: How deepseek broke the cost curve, February 2025

    Adyog. The economics of ai training and inference: How deepseek broke the cost curve, February 2025. Accessed: 2025-05-09

  5. [5]

    Delift: Data efficient language model instruction fine tuning.arXiv preprint arXiv:2411.04425, 2024

    Ishika Agarwal, Krishnateja Killamsetty, Lucian Popa, and Marina Danilevksy. Delift: Data efficient language model instruction fine tuning.arXiv preprint arXiv:2411.04425, 2024

  6. [6]

    Smollm2: When smol goes big – data-centric training of a small language model, 2025

    Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíˇcek, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Wer...

  7. [7]

    Minifinetuning: Low-data gen- eration domain adaptation through corrective self-distillation.arXiv preprint arXiv:2506.15702, 2025

    Peter Belcak, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Minifinetuning: Low-data gen- eration domain adaptation through corrective self-distillation.arXiv preprint arXiv:2506.15702, 2025

  8. [8]

    Tiny transformers excel at sentence compression.arXiv preprint arXiv:2410.23510, 2024

    Peter Belcak and Roger Wattenhofer. Tiny transformers excel at sentence compression.arXiv preprint arXiv:2410.23510, 2024

  9. [9]

    Nemotron-h: A family of accurate and efficient hybrid mamba-transformer models.arXiv preprint arXiv:2504.03624, 2025a

    Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, et al. Nemotron-h: A family of accurate and efficient hybrid mamba- transformer models.arXiv preprint arXiv:2504.03624, 2025

  10. [10]

    Rae, Erich Elsen, and Laurent Sifre

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Bogdan Damoc, Aidan Clark, Jan Kramár, et al. Improving language models by retrieving from trillions of tokens.arXiv preprint arXiv:2112.04426, 2022

  11. [11]

    Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity.ACM Transactions on Information and System Security (TISSEC), 15(3):1–22, 2012

    Michael Brennan, Sadia Afroz, and Rachel Greenstadt. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity.ACM Transactions on Information and System Security (TISSEC), 15(3):1–22, 2012

  12. [12]

    Flextron: Many-in-one flexible large language model

    Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, and Pavlo Molchanov. Flextron: Many-in-one flexible large language model. InProceedings of the 41st International Conference on Machine Learning (ICML 2024), 2024

  13. [13]

    The state of ai in 2022—and a half decade in review, December 2022

    Michael Chui, Bryce Hall, Helen Mayhew, Alex Singla, and Alexander Sukharevsky. The state of ai in 2022—and a half decade in review, December 2022. Accessed: 2025-05-09

  14. [14]

    96% of enterprises are expanding use of ai agents, according to latest data from cloudera, April 2025

    Cloudera, Inc. 96% of enterprises are expanding use of ai agents, according to latest data from cloudera, April 2025. Accessed: 2025-05-08

  15. [15]

    Planck 2018 results

    Planck Collaboration et al. Planck 2018 results. vi. cosmological parameters.Astronomy & Astrophysics, 641:A6, 2020

  16. [16]

    2025 data center marketplace: Balancing unprecedented opportunity with strategic risk

    Colliers. 2025 data center marketplace: Balancing unprecedented opportunity with strategic risk. U.s. research report, Colliers, 2025

  17. [17]

    Llm agents, April 2024

    DAIR.AI. Llm agents, April 2024. Accessed: 2025-05-08. 10

  18. [18]

    Security and privacy challenges of large language models: A survey.ACM Computing Surveys, 57(6):1–39, 2025

    Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. Security and privacy challenges of large language models: A survey.ACM Computing Surveys, 57(6):1–39, 2025

  19. [19]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

  20. [20]

    Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems, 36:10088– 10115, 2023

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems, 36:10088– 10115, 2023

  21. [21]

    Climb: Clustering-based iterative data mixture bootstrapping for language model pre-training.arXiv preprint arXiv:2504.13161, 2025

    Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, et al. Climb: Clustering-based iterative data mixture bootstrapping for language model pre-training.arXiv preprint arXiv:2504.13161, 2025

  22. [22]

    Parameter-efficient fine-tuning of large-scale pre-trained language models.Nature Machine Intelligence, 5(3):220–235, 2023

    Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models.Nature Machine Intelligence, 5(3):220–235, 2023

  23. [23]

    press/v235/dao24a.html

    Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabalesh- warkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. Hymba: A hybrid-head architecture for small language models.arXiv preprint arXiv:2411.13676, 2024

  24. [24]

    Introducing nvidia dynamo, a low-latency distributed inference framework for scaling reasoning ai models, March 2025

    Amr Elmeleegy et al. Introducing nvidia dynamo, a low-latency distributed inference framework for scaling reasoning ai models, March 2025. NVIDIA Technical Blog

  25. [25]

    Henry Evans. Llms vs. slms: Balancing comprehensiveness and smart resource-saving, April

  26. [27]

    Barbara A Ferguson, Timothy A Dreisbach, Catherine G Parks, Gregory M Filip, and Craig L Schmitt. Coarse-scale population structure of pathogenic armillaria species in a mixed-conifer forest in the blue mountains of northeast oregon.Canadian Journal of Forest Research, 33(4):612–623, 2003

  27. [28]

    Amoeballm: Constructing any-shape large language models for efficient and instant deployment

    Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, and Yingyan Celine Lin. Amoeballm: Constructing any-shape large language models for efficient and instant deployment. InProceedings of the 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024), 2024

  28. [29]

    GitHub - google/A2A: An open protocol enabling communication and interoperability between opaque agentic applications

    google. GitHub - google/A2A: An open protocol enabling communication and interoperability between opaque agentic applications

  29. [30]

    Text compression for efficient language generation.arXiv preprint arXiv:2503.11426, 2025

    David Gu, Peter Belcak, and Roger Wattenhofer. Text compression for efficient language generation.arXiv preprint arXiv:2503.11426, 2025

  30. [31]

    Large language models vs

    Harrison Clarke. Large language models vs. small language models, March 2024. Accessed: 2025-05-09

  31. [32]

    arXiv preprint arXiv:2102.01293 , url=

    Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer.arXiv preprint arXiv:2102.01293, 2021

  32. [33]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

  33. [34]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arxiv 2021. arXiv preprint arXiv:2106.09685, 2021

  34. [35]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  35. [36]

    Unsupervised fine-tuning for text clustering

    Shaohan Huang, Furu Wei, Lei Cui, Xingxing Zhang, and Ming Zhou. Unsupervised fine-tuning for text clustering. InProceedings of the 28th international conference on computational linguistics, pages 5530–5534, 2020. 11

  36. [37]

    How small language models can outperform llms, March 2025

    Invisible Technologies. How small language models can outperform llms, March 2025. Ac- cessed: 2025-05-21

  37. [38]

    Phi-2: The surprising power of small language models,

    Mojan Javaheripi and Sébastien Bubeck. Phi-2: The surprising power of small language models,

  38. [39]

    Microsoft Research Blog

  39. [40]

    Artificial intelligence and democracy: A conceptual framework.Social media+ society, 9(3):20563051231186353, 2023

    Andreas Jungherr. Artificial intelligence and democracy: A conceptual framework.Social media+ society, 9(3):20563051231186353, 2023

  40. [41]

    Understanding the total cost of inferencing large language models

    Aviv Kaufmann. Understanding the total cost of inferencing large language models. Technical report, Enterprise Strategy Group, April 2024. Commissioned by Dell Technologies. Accessed: 2025-05-09

  41. [42]

    Matformer: Nested transformer for elastic inference.arXiv preprint arXiv:2310.07707, 2023

    Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, et al. Matformer: Nested transformer for elastic inference.arXiv preprint arXiv:2310.07707, 2023

  42. [43]

    From large to small: The rise of small language models (slms) in text analytics

    Akshi Kumar. From large to small: The rise of small language models (slms) in text analytics. 2025

  43. [44]

    A comparative study on unsupervised feature selection methods for text clustering

    Luying Liu, Jianchu Kang, Jing Yu, and Zhongliang Wang. A comparative study on unsupervised feature selection methods for text clustering. In2005 International Conference on Natural Language Processing and Knowledge Engineering, pages 597–601. IEEE, 2005

  44. [45]

    Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024.

  45. [46]

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In International Conference on Machine Learning, pages 22137–22176. PMLR, 2023.

  46. [47]

    Nunzio Lore, Sepehr Ilami, and Babak Heydari. Large model strategic thinking, small model efficiency: transferring theory of mind in large language models. arXiv preprint arXiv:2408.05241, 2024.

  47. [48]

    Jeff Loucks, Gillian Crossan, Baris Sarer, China Widener, and Ariane Bucaille. Autonomous generative ai agents: Under development. Deloitte Insights, November 2024. Accessed: 2025-05-08.

  48. [49]

    Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D Lane, and Mengwei Xu. Small language models: Survey, measurements, and insights. arXiv preprint arXiv:2409.15790, 2024.

  49. [50]

    Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025.

  50. [51]

    Georgina M Mace, Paul H Harvey, and Timothy H Clutton-Brock. Brain size and ecology in small mammals. Journal of Zoology, 193(3):333–354, 1981.

  51. [52]

    Tobias Mann. A closer look at dynamo, nvidia’s ’operating system’ for ai inference, March. Accessed: 2025-05-09.

  53. [54]

    Market.us. Global agentic ai market size, share analysis by product type, agent role, agent system, end user, region and companies – industry segment outlook, market assessment, competition scenario, trends and forecast 2025–2034, March 2025. Accessed: 2025-05-08.

  54. [55]

    Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey. arXiv preprint arXiv:2404.11584, 2024.

  55. [56]

    Sourabh Mehta. How much energy do llms consume? unveiling the power behind ai, July 2024. Accessed: 2025-05-21.

  56. [57]

    Meta Platforms, Inc. Model cards and prompt formats: Llama 3.3, April 2025. Accessed: 2025-05-08.

  57. [58]

    Metomic. Understanding ai agents & data security, 2025. Accessed: 2025-05-13.

  58. [59]

    Erik Miehling, Karthikeyan Natesan Ramamurthy, Kush R Varshney, Matthew Riemer, Djallel Bouneffouf, John T Richards, Amit Dhurandhar, Elizabeth M Daly, Michael Hind, Prasanna Sattigeri, et al. Agentic ai needs a systems theory. arXiv preprint arXiv:2503.00237, 2025.

  59. [60]

    Morgan Stanley. Genai revenue growth and profitability, April 2025. Accessed: 2025-05-08.

  60. [61]

    Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435, 2023.

  61. [62]

    NVIDIA. Chatrtx, 2024. NVIDIA AI Product.

  62. [63]

    NVIDIA. Nvidia dynamo: A datacenter scale distributed inference serving framework. https://github.com/ai-dynamo/dynamo, 2025. Accessed: 2025-05-09.

  63. [64]

    Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating llms with fewer examples. arXiv preprint arXiv:2402.14992, 2024.

  64. [65]

    Lakshmi Radhakrishnan, Gundolf Schenk, Kathleen Muenzen, Boris Oskotsky, Habibeh Ashouri Choshali, Thomas Plunkett, Sharat Israni, and Atul J Butte. A certified de-identification system for all clinical text documents for information extraction at scale. JAMIA Open, 6(3):ooad045, 2023.

  65. [66]

    Martin J Rees. Before the Beginning: Our Universe and Others. Addison-Wesley, 1997.

  66. [67]

    Judith Sáinz-Pardo Díaz and Álvaro López García. An open source python library for anonymizing sensitive data. Scientific Data, 11(1):1289, 2024.

  67. [68]

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

  68. [69]

    J William Schopf. Microfossils of the early archean apex chert: New evidence of the antiquity of life. Science, 260(5108):640–646, 1993.

  69. [70]

    Tanya Seda. Cloud llm cost model: Breakdown for mid-market businesses, 2024. Accessed: 2025-05-09.

  70. [71]

    Olivia Shone. Explore ai models: Key differences between small language models and large language models, November 2024. Accessed: 2025-05-21.

  71. [72]

    Yixin Song, Zeyu Mi, Haotong Xie, and Haibo Chen. Powerinfer: Fast large language model serving with a consumer-grade gpu. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 590–606, 2024.

  72. [73]

    Shreyas Subramanian, Vikram Elango, and Mecit Gungor. Small language models (slms) can still pack a punch: A survey. arXiv preprint arXiv:2501.05465, 2025.

  73. [74]

    Synergy Technical. Small language models vs. large language models, 2025. Accessed: 2025-05-09.

  74. [75]

    Brian G. Thamm. Trustworthy and secure ai: How small language models strengthen data security. Service Contractor Magazine, October 2024. Accessed: 2025-05-08.

  75. [76]

    Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, et al. A comprehensive survey of small language models in the era of large language models: Techniques, enhancements, applications, collaboration with llms, and trustworthiness. arXiv preprint arXiv:2411.03350, 2024.

  76. [77]

    WorkOS. Build secure ai agents, 2025. Accessed: 2025-05-13.

  77. [78]

    Zhenliang Xue, Yixin Song, Zeyu Mi, Xinrui Zheng, Yubin Xia, and Haibo Chen. Powerinfer-2: Fast large language model inference on a smartphone. arXiv preprint arXiv:2406.06282, 2024.

  78. [79]

    Biwei Yan, Kun Li, Minghui Xu, Yueyan Dong, Yue Zhang, Zhaochun Ren, and Xiuzhen Cheng. On protecting the data privacy of large language models (llms): A survey. arXiv preprint arXiv:2403.05156, 2024.

  79. [80]

    Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html, 2024.

  80. [81]

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. Tau-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.

Showing first 80 references.