arxiv: 2605.16508 · v1 · pith:G6KX4YV6new · submitted 2026-05-15 · 💻 cs.CL · cs.AI

The Scaling Laws of Skills in LLM Agent Systems

Pith reviewed 2026-05-20 18:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords scaling laws · LLM agents · skill libraries · routing accuracy · logarithmic decay · execution rescue · library optimization · agent systems

The pith

A single decay slope parameter links routing accuracy loss in large skill libraries to how much execution rescues downstream performance in LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how skills accumulate into large reusable libraries for LLM agent systems and identifies two coupled scaling laws. Routing accuracy decays logarithmically with library size across 15 frontier models, with errors shifting from local competition to capture by overly general skills. Execution follows a multiplicative pattern before state realization, but correct prior steps can improve difficult downstream decisions by about four times. The logarithmic decay slope b from routing fits predicts the rescue effect in execution across models. This coupling shows that the same library property controls both pre-execution collapse and downstream recoverability, allowing law-guided optimizations that raise routing accuracy and transfer to execution benchmarks.

Core claim

Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, single-step routing accuracy decays logarithmically with library size. Errors progress from local skill competition to cross-family drift and capture by overly general black-hole skills. Before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about 4 times. A single parameter, the routing logarithmic decay slope b, couples the two laws: routing-side fits predict execution-side rescue across models.
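
The two execution-side quantities in this claim can be sketched as follows: an independence baseline for a two-step route, and a rescue ratio for a hard downstream decision. The function names and every number here are hypothetical, and defining rescue as a ratio of downstream accuracies is an assumption, since this summary does not say how the paper computes its "about 4 times" figure.

```python
def independence_baseline(p_first: float, p_second: float) -> float:
    """Joint success of two routing steps if they are independent (multiplicative)."""
    return p_first * p_second


def rescue_factor(acc_with_state: float, acc_without_state: float) -> float:
    """Ratio of downstream accuracy with correct realized state to accuracy without it.

    Assumption: a plain accuracy ratio; the paper's own definition may differ.
    """
    if acc_without_state <= 0:
        raise ValueError("accuracy without state must be positive")
    return acc_with_state / acc_without_state


# Hypothetical single-step accuracies for a two-step route, before any execution.
joint = independence_baseline(0.9, 0.8)

# Hypothetical accuracies on a hard downstream decision, without and with correct state.
rescue = rescue_factor(acc_with_state=0.60, acc_without_state=0.15)

print(f"independence baseline: {joint:.2f}")
print(f"rescue factor: {rescue:.1f}x")
```

Comparing an observed joint accuracy against `independence_baseline` is one way to see whether a pipeline behaves multiplicatively or loses more than independence would predict.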

What carries the argument

The routing logarithmic decay slope b, which couples the routing law to the execution rescue effect and allows routing data to predict downstream recoverability.
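
This summary does not give the exact functional form of the routing law, so the sketch below assumes the simplest reading, accuracy(N) = a − b·ln N, and recovers b by ordinary least squares. The function name, the intercept a, and the synthetic data are illustrative assumptions, not the paper's fitting procedure.

```python
import math


def fit_log_decay(sizes, accuracies):
    """Least-squares fit of accuracy = a - b * ln(N). Returns (a, b, r2)."""
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    a = mean_y - slope * mean_x
    b = -slope  # decay slope: positive when accuracy falls as the library grows
    ss_res = sum((y - (a - b * x)) ** 2 for x, y in zip(xs, accuracies))
    ss_tot = sum((y - mean_y) ** 2 for y in accuracies)
    return a, b, 1.0 - ss_res / ss_tot


# Synthetic, noise-free accuracies generated from a = 0.98, b = 0.05 (illustrative values).
sizes = [10, 50, 100, 500, 1141]
accs = [0.98 - 0.05 * math.log(n) for n in sizes]

a, b, r2 = fit_log_decay(sizes, accs)
print(f"a={a:.3f}, b={b:.3f}, R^2={r2:.3f}")
```

Under this assumed form, a larger b means accuracy drops faster per unit of ln N, which is the sense in which b summarizes routing collapse.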

If this is right

  • Law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%.
  • It reduces hijack from 22.4% to 4.1%.
  • Improvements transfer directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark.
  • Agent performance depends not only on model capability but also on the structure, granularity, and exposure policy of the skill library.
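
The reported before and after figures can be turned into absolute and relative changes with a few lines of arithmetic. The numbers are the ones quoted in the bullets above; the dictionary layout and function name are illustrative only.

```python
# Before/after percentages as quoted in this summary.
reported = {
    "held-out routing accuracy": (71.3, 91.7),
    "hijack rate": (22.4, 4.1),
    "ClawBench mean pass rate": (49.3, 61.6),
    "ClawMark mean pass rate": (28.4, 34.5),
}


def deltas(before: float, after: float):
    """Return (absolute change in percentage points, relative change as a fraction)."""
    absolute = after - before
    return absolute, absolute / before


for name, (before, after) in reported.items():
    absolute, relative = deltas(before, after)
    print(f"{name}: {absolute:+.1f} points ({relative:+.1%} relative)")
```

The absolute changes work out to +20.4, −18.3, +12.3, and +6.1 percentage points respectively, so the downstream execution gains are smaller than the routing gain, consistent with the summary's wording that the improvements transfer "directionally".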

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the coupling through b generalizes, skill libraries could be engineered with controlled granularity to both reduce routing collapse and improve recoverability, without changing the underlying model.
  • The laws suggest that exposure policies can be tuned using the decay slope to support longer multi-step agent chains more reliably.
  • Applying the same fitting procedure to non-frontier models or entirely new task domains would test whether the functional form and predictive power of b remain stable.

Load-bearing premise

The logarithmic decay form and the coupling through the slope b observed on the specific 15 frontier LLMs, 1,141 skills, and 3M decisions will continue to hold for other models, skill distributions, and task domains without requiring re-fitting or new functional forms.

What would settle it

A test on a new model or skill set in which routing accuracy does not follow a logarithmic decay with library size, or in which the fitted b from routing data fails to predict the observed execution rescue factor, would show the claimed coupling does not hold.
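
One way to run the first half of such a test is to fit the logarithmic form and a competing form to the same routing data and compare goodness of fit. The sketch below assumes accuracy = a − b·ln N for the logarithmic form and uses a power law as the alternative; the power law is an illustrative choice of competitor, not one the paper is stated to test, and the data are synthetic.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def r_squared(ys, preds):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def compare_forms(sizes, accuracies):
    """R^2 in accuracy space for a log-decay fit and a power-law fit.

    Accuracies must be strictly positive because the power law is fitted in log-log space.
    """
    logs = [math.log(n) for n in sizes]
    a, slope = linear_fit(logs, accuracies)
    log_preds = [a + slope * x for x in logs]
    c, k = linear_fit(logs, [math.log(y) for y in accuracies])
    power_preds = [math.exp(c + k * x) for x in logs]
    return {"log": r_squared(accuracies, log_preds),
            "power": r_squared(accuracies, power_preds)}


# Synthetic accuracies that follow the assumed logarithmic form exactly.
sizes = [10, 50, 100, 500, 1141]
accs = [0.98 - 0.08 * math.log(n) for n in sizes]
scores = compare_forms(sizes, accs)
print(scores)
```

On real routing data, a clearly better fit for the alternative form on a new model or skill set would count against the logarithmic law.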

Figures

Figures reproduced from arXiv: 2605.16508 by Carl Che, Charles Chen, Dengyun Peng, Ethan Qin, Fanqing Meng, Hanjing Li, Hongyu Liu, Jiangyi Wang, Jinhao Liu, Mengkang Hu, Qiming Yu, Simin Liu, Yuhang Gu, Zheng Yan, Zhuoye Huang.

Figure 1: Skill-library scaling laws and optimization implications.
Figure 2: Library scale induces logarithmic single-step decay, super-independent pipeline loss, and a mid-chain routing trough.
Figure 3: Local Competition Organizes Error. (a) Accuracy decays with $N$ in all description regimes, but better descriptions reduce the decay rate. (b) Top-50 introduces more errors than Top-20/10. (c) Danger band [0.55, 0.75) shows strongest negative $\rho$ with accuracy. (d) The Boltzmann fit between the competition index and accuracy within the danger zone, where $R^2_{\mathrm{CI}}=0.55$, outperforms the fit with $N$ alone, where $R^2$…
Figure 4: Weak task anchors and vague skill descriptions convert local skill-cluster drift into black-hole skill capture.
Figure 5: Before state realization, paired routing is approximately multiplicative; after correct execution, realized state rescues difficult downstream routes.
Figure 6: Incorrect state propagates through tight dependencies, while joint execution synergy changes sign with dependency structure and capability gap $G$.
Figure 7: Cross-law coupling: routing slope predicts rescue gain.
Figure 8: The performance of the law-guided automatic skill manager.
Figure 9: Cascade Penalty Decomposition. Pipeline depth amplifies the gap between observed end-to-end accuracy and the independence baseline.
Figure 10: State-Transition Audit. Downstream routing accuracy depends on whether the preceding step remains correct, linking transition state to cascade behavior (annotated by human annotators).
Figure 11: ToolBench Validity. ToolBench routing accuracy and hallucination rates follow the same qualitative library-size pattern on an external tool corpus.
Original abstract

As agent systems scale, skills accumulate into large reusable libraries, yet their scaling laws remain poorly understood. Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, we identify two coupled laws. Routing law: single-step routing accuracy decays logarithmically with library size ($R^2{>}0.97$ for all models), with errors progressing from local skill competition to cross-family drift and capture by overly general "black-hole skills". Execution law: before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about $4{\times}$. A single parameter, the routing logarithmic decay slope $b$, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability. The laws are actionable: law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and transfers directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark. These results show that agent performance depends not only on model capability, but also on the structure, granularity, and exposure policy of the skill library.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to identify two coupled scaling laws in LLM agent systems across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions. The routing law states that single-step routing accuracy decays logarithmically with library size (R² > 0.97 for all models), with errors progressing from local skill competition to cross-family drift and capture by general 'black-hole skills'. The execution law indicates that joint routing is approximately multiplicative before state realization, while correct execution can improve difficult downstream decisions by about 4×. A single parameter—the routing logarithmic decay slope b—couples the laws such that routing-side fits predict execution-side rescue across models. Law-guided optimization is shown to raise held-out routing accuracy from 71.3% to 91.7%, reduce hijack from 22.4% to 4.1%, and improve mean pass rates on ClawBench (49.3% to 61.6%) and ClawMark (28.4% to 34.5%).

Significance. If the central claims hold after addressing methodological details, the work provides a valuable empirical framework for understanding how skill library size and structure affect agent performance, shifting focus from model capability alone to library properties. The scale of the experiments and the identification of a coupling parameter b that links pre-execution routing collapse to downstream recoverability represent a strength, as does the demonstration of actionable improvements that transfer directionally to held-out benchmarks. This could inform more principled design of reusable skill libraries in agent systems.

major comments (2)
  1. The abstract reports high R² fits for the logarithmic decay and concrete lifts such as the 4× rescue factor and optimization gains, but provides no information on how library sizes were varied, whether exclusions were post-hoc, how error bars were computed, or whether the 4× rescue factor was measured on held-out data. This is load-bearing for the central claim that the laws are robust and coupled, as the soundness of the fits and predictions cannot be assessed without these details.
  2. The coupling claim—that the routing logarithmic decay slope b fitted from routing data predicts execution-side rescue—relies on the same overall experimental setup for both regimes. This introduces a moderate circularity risk that requires explicit clarification on the independence of the routing and execution measurements, any cross-validation, and whether b was derived without reference to execution outcomes.
minor comments (2)
  1. Provide the precise functional form and fitting procedure for the logarithmic decay (including any constants or normalizations) and confirm whether the form was selected a priori or post-hoc.
  2. Include more detail on skill categorization, library construction, and the definition of 'black-hole skills' to support reproducibility of the error progression analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped us improve the clarity and transparency of our work. We address each major comment below and have revised the manuscript accordingly to provide the requested methodological details while preserving the integrity of our empirical claims.

Point-by-point responses
  1. Referee: The abstract reports high R² fits for the logarithmic decay and concrete lifts such as the 4× rescue factor and optimization gains, but provides no information on how library sizes were varied, whether exclusions were post-hoc, how error bars were computed, or whether the 4× rescue factor was measured on held-out data. This is load-bearing for the central claim that the laws are robust and coupled, as the soundness of the fits and predictions cannot be assessed without these details.

    Authors: We agree that the abstract would benefit from greater methodological transparency to allow readers to assess robustness directly. The full manuscript (Section 3) details that library sizes were varied systematically from 10 to 1,141 skills via repeated random subsampling of the fixed skill pool, with means and standard errors computed over five independent seeds per size point; no post-hoc exclusions occurred and all data are reported. Error bars throughout are standard errors from this repeated sampling. The 4× rescue factor and optimization gains were evaluated on held-out tasks and benchmarks separate from the fitting data. We have revised the abstract to include a concise clause on these points and added a short statistical methods paragraph in Section 3.2. revision: yes

  2. Referee: The coupling claim—that the routing logarithmic decay slope b fitted from routing data predicts execution-side rescue—relies on the same overall experimental setup for both regimes. This introduces a moderate circularity risk that requires explicit clarification on the independence of the routing and execution measurements, any cross-validation, and whether b was derived without reference to execution outcomes.

    Authors: We appreciate the referee's caution regarding potential circularity. Routing measurements used to fit the decay law and slope b were obtained exclusively from isolated single-step routing queries with no execution component. Execution data were collected independently from complete multi-step agent trajectories on downstream tasks. Parameter b was fitted solely on the routing regime and then tested for predictive validity on execution outcomes using a cross-validation split across models and task sets; execution results were never used to adjust or select b. We have inserted a dedicated clarification paragraph in Section 4.3 describing the separation of data collection pipelines and the cross-validation protocol. revision: yes
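
The subsampling protocol described in the first response (repeated random subsampling of a fixed skill pool, five seeds per size point, standard errors over seeds) can be sketched as below. Everything concrete here is an assumption for illustration: the function names, the integer stand-ins for skills, and especially `toy_evaluate`, which is a placeholder for a real routing evaluation and does not model any LLM.

```python
import math
import random
import statistics


def subsample_accuracy(skill_pool, library_size, seeds, evaluate):
    """Mean and standard error of evaluate(library) over seeded random subsamples."""
    scores = []
    for seed in seeds:
        rng = random.Random(seed)  # one independent, reproducible generator per seed
        library = rng.sample(skill_pool, library_size)
        scores.append(evaluate(library))
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


def toy_evaluate(library):
    """Placeholder score: share of skills whose id is not a multiple of 7."""
    return sum(1 for skill in library if skill % 7 != 0) / len(library)


skill_pool = list(range(1141))  # integer ids standing in for the 1,141 skills
for size in (10, 100, 1141):
    mean, sem = subsample_accuracy(skill_pool, size, range(5), toy_evaluate)
    print(f"N={size}: mean={mean:.3f}, SE={sem:.3f}")
```

At the full pool size every seed draws the same set of skills, so the standard error from subsampling alone collapses to zero; any remaining variance at that point would have to come from the evaluation itself.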

Circularity Check

1 step flagged

Fitted routing parameter b used to 'predict' execution rescue on same data

specific steps
  1. fitted input called prediction [Abstract]
    "A single parameter, the routing logarithmic decay slope b, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability."

    b is obtained by fitting the routing law (logarithmic decay of accuracy with library size) to the observed routing decisions; the same scalar is then asserted to predict the magnitude of execution rescue. Because both routing and execution measurements come from the same 3M decisions on the same skill library, the 'prediction' is a re-labeling of the fitted parameter rather than an out-of-sample or independently derived result.

full rationale

The central coupling claim rests on fitting the logarithmic decay slope b exclusively from routing accuracy data across the 15 models and 1,141 skills, then invoking that same fitted b to explain and 'predict' execution-side rescue effects. This matches the fitted-input-called-prediction pattern: the execution improvement is not an independent derivation but a re-use of the routing fit on the identical experimental corpus. No external validation or parameter-free derivation is shown in the provided text; the coupling is therefore partially circular by construction. The remainder of the scaling laws (logarithmic form, error progression) appear independently measured and are not reduced to the fit.
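
An out-of-sample version of the coupling test could hold out one model at a time, fit the relationship between b and rescue on the remaining models, and score the prediction for the held-out model. The sketch below assumes a linear relationship between b and the rescue factor, which this summary does not confirm; the function names and all per-model values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + gradient * x."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - gradient * mean_x, gradient


def leave_one_out_errors(slopes, rescues):
    """Absolute prediction error for each model when it is excluded from the fit."""
    errors = []
    for i in range(len(slopes)):
        train_x = slopes[:i] + slopes[i + 1:]
        train_y = rescues[:i] + rescues[i + 1:]
        intercept, gradient = fit_line(train_x, train_y)
        predicted = intercept + gradient * slopes[i]
        errors.append(abs(predicted - rescues[i]))
    return errors


# Hypothetical per-model routing slopes b and measured rescue factors.
slopes = [0.03, 0.05, 0.06, 0.08, 0.10]
rescues = [2.9, 3.6, 3.8, 4.5, 5.1]

errors = leave_one_out_errors(slopes, rescues)
print(f"mean absolute held-out error: {sum(errors) / len(errors):.3f}")
```

Because each prediction uses a fit that never saw the held-out model's rescue value, small errors here would be evidence of genuine predictive power rather than a relabeled fit.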

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on empirical fits to a particular collection of models and skills; the slope b is the main fitted quantity, and the assumption that the observed functional forms generalize is the primary domain assumption.

free parameters (1)
  • routing logarithmic decay slope b
    Single parameter fitted from routing accuracy versus library size data; used to couple routing and execution regimes and to guide optimization.
axioms (1)
  • domain assumption: The 1,141 skills and routing/execution decisions collected are representative of real-world agent skill libraries.
    Invoked when claiming the laws are actionable beyond the tested set.


Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 21 internal anchors

  1. [1]

    A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

  2. [2]

    Llm-based agents for tool learning: A survey.Data Science and Engineering, 2025

    Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. Llm-based agents for tool learning: A survey.Data Science and Engineering, 2025

  3. [3]

    Survey on Evaluation of LLM-based Agents

    Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli- Scheuer. Survey on evaluation of LLM-based agents.arXiv preprint arXiv:2503.16416, 2025

  4. [4]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023

  5. [5]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world APIs.arXiv preprint arXiv:2307.16789, 2023

  6. [6]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive APIs.arXiv preprint arXiv:2305.15334, 2023

  7. [7]

    API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A comprehensive benchmark for tool-augmented LLMs.arXiv preprint arXiv:2304.08244, 2023. evolvent.co 11 The Scaling Laws of Skills in LLM Agent Systems

  8. [8]

    name" - concise snake_case identifier -

    Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, and Yueting Zhuang. TaskBench: Benchmarking large language models for task automation.arXiv preprint arXiv:2311.18760, 2023

  9. [9]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X...

  10. [10]

    Skillrouter: Retrieve-and-rerank skill selection for llm agents at scale,

    YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuai Zhu, Yong Wu, Tianze Xu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, and Gang Yu. Skillrouter: Skill routing for LLM agents at scale.arXiv preprint arXiv:2603.22455, 2026

  11. [11]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.arXiv preprint arXiv:2503.09567, 2025

  12. [12]

    Ai4research: A survey of artificial intelligence for scientific research.arXiv preprint arXiv:2507.01903, 2025

    Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, et al. Ai4research: A survey of artificial intelligence for scientific research.arXiv preprint arXiv:2507.01903, 2025

  13. [13]

    A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence.arXiv preprint arXiv:2507.21046, 2025

  14. [14]

    Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

    Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward.arXiv preprint arXiv:2602.12430, 2026

  15. [15]

    ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

    Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases.arXiv preprint arXiv:2306.05301, 2023

  16. [16]

    Openagents: An open platform for language agents in the wild,

    Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, et al. OpenAgents: An open platform for language agents in the wild.arXiv preprint arXiv:2310.10634, 2023

  17. [17]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents.arXiv preprint arXiv:2308.03688, 2023

  18. [18]

    SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. SoK: Agentic skills – beyond tool use in LLM agents.arXiv preprint arXiv:2602.20867, 2026

  19. [19]

    ComplexFuncBench: Exploring multi-step and constrained function calling under long-context scenario.arXiv preprint arXiv:2501.10132, 2025

    Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, and Jie Tang. ComplexFuncBench: Exploring multi-step and constrained function calling under long-context scenario.arXiv preprint arXiv:2501.10132, 2025

  20. [20]

    BiasBusters: Uncovering and mitigating tool selection bias in large language models

    Thierry Blankenstein, Jialin Yu, Zixuan Li, Vassilis Plachouras, Sunando Sengupta, Philip Torr, Yarin Gal, Alasdair Paren, and Adel Bibi. BiasBusters: Uncovering and mitigating tool selection bias in large language models. InInternational Conference on Learning Representations, 2026. URL https://openreview.net/forum? id=DEg4vvElYu. evolvent.co 12 The Scal...

  21. [21]

    xRouter: Training cost-aware LLMs orchestration system via reinforcement learning.arXiv preprint arXiv:2510.08439, 2025

    Cheng Qian, Zuxin Liu, Shirley Kokane, Akshara Prabhakar, Jielin Qiu, Haolin Chen, Zhiwei Liu, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, and Huan Wang. xRouter: Training cost-aware LLMs orchestration system via reinforcement learning.arXiv preprint arXiv:2510.08439, 2025

  22. [22]

    EvoRoute: Experience-driven self-routing LLM agent systems.arXiv preprint arXiv:2601.02695, 2026

    Guibin Zhang, Haiyang Yu, Kaiming Yang, Bingli Wu, Fei Huang, Yongbin Li, and Shuicheng Yan. EvoRoute: Experience-driven self-routing LLM agent systems.arXiv preprint arXiv:2601.02695, 2026

  23. [23]

    Autotool: Efficient tool selection for large language model agents.arXiv preprint arXiv:2511.14650, 2025

    Jingyi Jia and Qinbin Li. Autotool: Efficient tool selection for large language model agents.arXiv preprint arXiv:2511.14650, 2025

  24. [24]

    Meta- toolagent: Towards generalizable tool usage in llms through meta-learning.arXiv preprint arXiv:2601.12680, 2026

    Zheng Fang, Wolfgang Mayer, Zeyu Zhang, Jian Wang, Hong-Yu Zhang, Wanli Li, and Zaiwen Feng. Meta- toolagent: Towards generalizable tool usage in llms through meta-learning.arXiv preprint arXiv:2601.12680, 2026

  25. [25]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023

  26. [26]

    Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models

    Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long ...

  27. [27]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.arXiv preprint arXiv:2303.11366, 2023

  28. [28]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving AI tasks with ChatGPT and its friends in Hugging Face.arXiv preprint arXiv:2303.17580, 2023

  29. [29]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework.arXiv preprint arXiv:2308.00352, 2023

  30. [30]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anand- kumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023

  31. [31]

    Taskweaver: A code-first agent framework.arXiv preprint arXiv:2311.17541,

    Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, et al. TaskWeaver: A code-first agent framework.arXiv preprint arXiv:2311.17541, 2023

  32. [32]

    MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback.arXiv preprint arXiv:2309.10691, 2023

    Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback.arXiv preprint arXiv:2309.10691, 2023

  33. [33]

    Anytool: Self-reflective, hierarchical agents for large-scale api calls.arXiv preprint arXiv:2402.04253, 2024

    Yu Du, Fangyun Wei, and Hongyang Zhang. AnyTool: Self-reflective, hierarchical agents for large-scale API calls.arXiv preprint arXiv:2402.04253, 2024

  34. [34]

    Evaluation and benchmarking of LLM agents: A survey,

    Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. Evaluation and benchmarking of llm agents: A survey.arXiv preprint arXiv:2507.21504, 2025. URLhttps://arxiv.org/abs/2507.21504. evolvent.co 13 The Scaling Laws of Skills in LLM Agent Systems

  35. [35]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  36. [36]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

  37. [37]

    Emergent abilities of large language models.Transactions on Machine Learning Research, 2022

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.Transactions on Machine Learning Research, 2022

  38. [38]

    Chatterji, Sharan Narang, Mike Lewis, and Dieuwke Hupkes

    Nicholas Roberts, Niladri S. Chatterji, Sharan Narang, Mike Lewis, and Dieuwke Hupkes. Compute optimal scaling of skills: Knowledge vs reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 13295–13316, 2025. doi: 10.18653/v1/2025.findings-acl.688. URL https://aclanthology.org/ 2025.findings-acl.688/

  39. [39]

    Scaling laws for educational AI agents.arXiv preprint arXiv:2603.11709, 2026

    Mengsong Wu, Hao Hao, Shuzhen Bi, Keqian Li, Wentao Liu, Siyu Song, Hongbo Zhao, and Aimin Zhou. Scaling laws for educational AI agents.arXiv preprint arXiv:2603.11709, 2026. URL https://arxiv.org/abs/ 2603.11709

  40. [40]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen- tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6769–6781, 2020

  41. [41]

    Retrieval-augmented generation for knowledge- intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge- intensive NLP tasks. InAdvances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020

  42. [42]

    Sentence-BERT: Sentence embeddings using siamese BERT-networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019

  43. [43]

    MTEB: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023

  44. [44]

    William E. Hick. On the rate of gain of information. Quarterly Journal of Experimental Psychology, 4(1):11–26, 1952

  45. [45]

    Stimulus information as a determinant of reaction time. Journal of Experimental Psychology, 45(3):188–196, 1953

    Ray Hyman. Stimulus information as a determinant of reaction time. Journal of Experimental Psychology, 45(3):188–196, 1953

  46. [46]

    Github - anthropics/skills: Public repository for agent skills

    Anthropic. Github - anthropics/skills: Public repository for agent skills. https://github.com/anthropics/skills, 2025. Accessed: 2026-04-17

  47. [47]

    Claude code

    Anthropic. Claude code. https://www.anthropic.com/product/claude-code, 2025. Accessed: 2026-04-16

  48. [48]

    Introducing the model context protocol

    Anthropic. Introducing the model context protocol. https://www.anthropic.com/news/model-context-protocol, 2024. Accessed: 2026-04-16

  49. [49]

    Github - trailofbits/skills: Trail of bits claude code skills for security research, vulnerability detection, and audit workflows. https://github.com/trailofbits/skills, 2025

    Trail of Bits. Github - trailofbits/skills: Trail of bits claude code skills for security research, vulnerability detection, and audit workflows. https://github.com/trailofbits/skills, 2025. Accessed: 2026-04-16

  50. [50]

    Claude code skills marketplace

    daymade. Claude code skills marketplace. https://github.com/daymade/claude-code-skills, 2025. Accessed: 2026-04-16

  51. [51]

    .NET skills for claude code

    Aaronontheweb. .NET skills for claude code. https://github.com/Aaronontheweb/dotnet-skills, 2025. Accessed: 2026-04-16

  52. [52]

    PinchBench github organization

    PinchBench Team. PinchBench github organization. https://github.com/pinchbench, 2026. Accessed: 2026-04-16

  53. [53]

    Agentdeals. https://github.com/robhunter/agentdeals, 2026

    Robhunter. Agentdeals. https://github.com/robhunter/agentdeals, 2026. Accessed: 2026-04-16

  54. [54]

    M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation

    Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2318–2335, 2024. doi: 10.18653/v1/2024.findings-acl

  55. [55]

    URL https://aclanthology.org/2024.findings-acl.137/

  56. [56]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026

  57. [57]

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

  58. [58]

    Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025

  59. [59]

    FlowBench: Revisiting and benchmarking workflow-guided planning for LLM-based agents

    Ruixuan Xiao, Wentao Ma, Ke Wang, Yuchuan Wu, Junbo Zhao, Haobo Wang, Fei Huang, and Yongbin Li. FlowBench: Revisiting and benchmarking workflow-guided planning for LLM-based agents. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10883–10900, Miami, Florida, USA,...

  60. [60]

    Gpt-4o mini: advancing cost-efficient intelligence

    OpenAI. Gpt-4o mini: advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024. Accessed: 2026-04-17

  61. [61]

    Gpt-5 system card

    OpenAI. Gpt-5 system card. OpenAI, August 2025. URL https://openai.com/index/gpt-5-system-card/. Accessed: 2026-05-04

  62. [62]

    Introducing gpt-5.4 mini and nano

    OpenAI. Introducing gpt-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, 2026. Accessed: 2026-04-17

  63. [63]

    Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026

    OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-04-17

  64. [64]

    Introducing claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, 2026

    Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, 2026

  65. [65]

    Introducing claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026

    Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026

  66. [66]

    Gemini 3. Google DeepMind, 2025

    Google DeepMind. Gemini 3. Google DeepMind, 2025. URL https://deepmind.google/models/gemini/

  67. [67]

    Glm-5: from vibe coding to agentic engineering, 2026

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering, 2026

  68. [68]

    Glm-4.7: Advancing the coding capability. https://z.ai/blog/glm-4.7, 2025

    Zhipu AI. Glm-4.7: Advancing the coding capability. https://z.ai/blog/glm-4.7, 2025

  69. [69]

    Kimi k2.5: Visual agentic intelligence, 2026

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence, 2026

  70. [70]

    Kimi k2.6: From code to creation, from one to many

    Moonshot AI. Kimi k2.6: From code to creation, from one to many. https://www.kimi.com/ai-models/kimi-k2-6, 2026

  71. [71]

    Seed 2.0 official launch

    ByteDance Seed Team. Seed 2.0 official launch. https://seed.bytedance.com/en/blog/seed2-0-%25E6%25AD%25A3%25E5%25BC%258F%25E5%258F%2591%25E5%25B8%2583, 2026. Accessed: 2026-04-16

  72. [72]

    Deepseek-v4: towards highly efficient million-token context intelligence

    DeepSeek-AI. Deepseek-v4: towards highly efficient million-token context intelligence. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf, 2026

  73. [73]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

    Appendix

    This appendix is organized as a reader’s map from theory to implementation evidence. It firs...

  74. [74]

    Finite-capacity margin model. For task $q$, the router assigns each skill a latent score $Z_i = m_i + \varepsilon_i$, where $m_\star - m_j = \Delta_j$ is the semantic margin of the gold skill against distractor $j$, and $\varepsilon_i$ is zero-mean sub-Gaussian noise with scale $\sigma$

  75. [75]

    This is the scale-free local-crowding condition tested by the CI and semantic-gap diagnostics

    Rank-regular clustered library. Within a functional cluster, distractor skills can be ordered by semantic rank $r = 1, \dots, N-1$ around the target, and their effective overtake weights satisfy $w_r = \kappa/r + O(r^{-1-\xi})$ for some $\kappa, \xi > 0$ over the measured range. This is the scale-free local-crowding condition tested by the CI and semantic-gap diagnostics

  76. [76]

    Irrelevant distractors have negligible $w_j$; local near-misses dominate $C(N)$

    Logarithmic effective competition. The total effective pressure from plausible distractors is $C(N) = \sum_{j \neq \star} w_j(N) = C_0 + C_1 \ln N + o(\ln N)$ over the exposed-library range. Irrelevant distractors have negligible $w_j$; local near-misses dominate $C(N)$

  77. [77]

    Small-error linearization regime. Over the measured operating range, the router is not saturated at 0 or 1, so first-order Taylor expansions of the error odds are valid

  78. [78]

    State-gated rescue. Correct upstream execution produces a concrete artifact with rescue coefficient $\alpha \in [0,1]$; rescue can only occur when upstream routing/execution is correct and only to the extent that the downstream route has remaining headroom

  79. [79]

    Lemma 1 (Clustered libraries imply logarithmic effective competition). Under Assumption 2, Assumption 3 follows with $C_1 = \kappa$

    No pre-state leakage. In the no-state condition, paired routing prompts do not contain an execution artifact, so any interaction before execution is bounded by a protocol term $\epsilon_{\mathrm{proto}}$.

    Lemma 1 (Clustered libraries imply logarithmic effective competition). Under Assumption 2, Assumption 3 follows with $C_1 = \kappa$.

  80. [80]

    Concrete object: the task identifies or clearly implies the file, repository, API, dataset, table, document, command, state, or other object to operate on

Showing first 80 references.
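
The statement of Lemma 1 in the appendix entries above can be checked numerically. The sketch below is not the authors' code: it assumes one hypothetical form for the $O(r^{-1-\xi})$ remainder (a constant `c` times $r^{-1-\xi}$) and illustrative values for `kappa`, `xi`, and `c`. It sums the rank-regular weights $w_r = \kappa/r + O(r^{-1-\xi})$ over $r = 1, \dots, N-1$ and fits the slope of $C(N)$ against $\ln N$, which should come out close to $\kappa$ because the harmonic sum grows like $\ln N$ while the remainder converges to a constant.

```python
import math


def effective_competition(n, kappa=1.0, xi=0.5, c=0.3):
    """C(N) = sum over ranks r = 1..N-1 of w_r = kappa/r + c * r**(-1 - xi).

    The term c * r**(-1 - xi) is a hypothetical instance of the
    O(r^{-1-xi}) remainder; kappa, xi, and c are illustrative values only.
    """
    return sum(kappa / r + c * r ** (-1.0 - xi) for r in range(1, n))


def fitted_log_slope(sizes, kappa=1.0):
    """Ordinary least-squares slope of C(N) regressed on ln N."""
    xs = [math.log(n) for n in sizes]
    ys = [effective_competition(n, kappa=kappa) for n in sizes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


# Library sizes chosen for illustration; they are not the paper's grid.
sizes = [100, 200, 400, 800, 1600]
slope = fitted_log_slope(sizes, kappa=2.0)
print(f"fitted C_1 = {slope:.3f} (kappa = 2.0)")
```

Under these assumptions the fitted slope plays the role of $C_1$, and changing `kappa` should move it proportionally, which is the content of $C_1 = \kappa$.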