The Scaling Laws of Skills in LLM Agent Systems
Pith reviewed 2026-05-20 18:04 UTC · model grok-4.3
The pith
A single decay slope parameter links routing accuracy loss in large skill libraries to how much execution rescues downstream performance in LLM agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, single-step routing accuracy decays logarithmically with library size. Errors progress from local skill competition to cross-family drift and capture by overly general black-hole skills. Before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about 4 times. A single parameter, the routing logarithmic decay slope b, couples the two laws: routing-side fits predict execution-side rescue across models.
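A minimal sketch of what fitting the routing law could look like, assuming the functional form accuracy(N) = a - b * ln(N). This page does not state the paper's exact parameterization, so the form, the helper name `fit_log_decay`, and the synthetic data points are all illustrative assumptions rather than the authors' procedure.

```python
import math

def fit_log_decay(sizes, accuracies):
    """Ordinary least squares for accuracy = a - b * ln(N).

    Returns (a, b, r_squared). The functional form is assumed for illustration.
    """
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    a = mean_y - slope * mean_x
    b = -slope  # report the decay slope as a positive number
    ss_res = sum((y - (a - b * x)) ** 2 for x, y in zip(xs, accuracies))
    ss_tot = sum((y - mean_y) ** 2 for y in accuracies)
    return a, b, 1.0 - ss_res / ss_tot

# Synthetic points generated from a = 0.98, b = 0.05 (not data from the paper).
sizes = [10, 30, 100, 300, 1141]
accuracies = [0.98 - 0.05 * math.log(n) for n in sizes]
a, b, r_squared = fit_log_decay(sizes, accuracies)
```

On noiseless synthetic input the fit recovers the generating a and b; on real measurements the R² value is what would be compared against the reported 0.97 threshold.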
What carries the argument
The routing logarithmic decay slope b, which couples the routing law to the execution rescue effect and allows routing data to predict downstream recoverability.
If this is right
- Law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%.
- It reduces hijack from 22.4% to 4.1%.
- Improvements transfer directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark.
- Agent performance depends not only on model capability but also on the structure, granularity, and exposure policy of the skill library.
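The reported gains above mix percentage-point changes with relative changes, so a short arithmetic sketch may help keep them apart. The numbers are the ones quoted on this page; the helper `summarize` is a hypothetical name.

```python
# (before, after) values in percent, as quoted above.
results = {
    "routing accuracy": (71.3, 91.7),
    "hijack rate": (22.4, 4.1),
    "ClawBench pass rate": (49.3, 61.6),
    "ClawMark pass rate": (28.4, 34.5),
}

def summarize(before, after):
    """Return (change in percentage points, relative change in percent)."""
    delta = after - before
    relative = delta / before * 100.0
    return round(delta, 1), round(relative, 1)

summary = {name: summarize(before, after) for name, (before, after) in results.items()}

for name, (points, relative) in summary.items():
    print(f"{name}: {points:+.1f} points ({relative:+.1f}% relative)")
```

Read this way, the routing gain is 20.4 points, the hijack reduction is 18.3 points, and the two benchmark gains are 12.3 and 6.1 points respectively.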
Where Pith is reading between the lines
- If the coupling through b generalizes, skill libraries could be engineered with controlled granularity to both reduce routing collapse and improve recoverability without changing the underlying model.
- The laws suggest that exposure policies can be tuned using the decay slope to support longer multi-step agent chains more reliably.
- Applying the same fitting procedure to non-frontier models or entirely new task domains would test whether the functional form and predictive power of b remain stable.
Load-bearing premise
The logarithmic decay form and the coupling through the slope b observed on the specific 15 frontier LLMs, 1,141 skills, and 3M decisions will continue to hold for other models, skill distributions, and task domains without requiring re-fitting or new functional forms.
What would settle it
A test on a new model or skill set in which routing accuracy does not follow a logarithmic decay with library size, or in which the fitted b from routing data fails to predict the observed execution rescue factor, would show the claimed coupling does not hold.
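The first half of that test can be sketched as a comparison of functional forms: fit accuracy against ln(N) and against N, then check whether the logarithmic form clears the reported R² > 0.97 bar and beats the alternative. The linear alternative, the helper names, and the synthetic data are assumptions for illustration; the paper may compare against other forms.

```python
import math

def r_squared_linear(xs, ys):
    """R^2 of a simple least-squares line y = c0 + c1 * x."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

def log_form_preferred(sizes, accuracies, threshold=0.97):
    """True if the ln(N) fit meets the threshold and beats a fit in raw N."""
    r2_log = r_squared_linear([math.log(n) for n in sizes], accuracies)
    r2_lin = r_squared_linear([float(n) for n in sizes], accuracies)
    return (r2_log >= threshold and r2_log > r2_lin), r2_log, r2_lin

sizes = [10, 30, 100, 300, 1000]
# Synthetic curves (not paper data): one logarithmic, one linear in N.
log_curve = [0.98 - 0.05 * math.log(n) for n in sizes]
linear_curve = [0.95 - 0.0004 * n for n in sizes]

supports_law, _, _ = log_form_preferred(sizes, log_curve)
contradicts_law, _, _ = log_form_preferred(sizes, linear_curve)
```

A new model whose curve behaves like `linear_curve` would fail the check and count as evidence against the routing law.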
Original abstract
As agent systems scale, skills accumulate into large reusable libraries, yet their scaling laws remain poorly understood. Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, we identify two coupled laws. Routing law: single-step routing accuracy decays logarithmically with library size ($R^2{>}0.97$ for all models), with errors progressing from local skill competition to cross-family drift and capture by overly general "black-hole skills". Execution law: before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about $4{\times}$. A single parameter, the routing logarithmic decay slope $b$, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability. The laws are actionable: law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and transfers directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark. These results show that agent performance depends not only on model capability, but also on the structure, granularity, and exposure policy of the skill library.
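The execution law in the abstract can be illustrated with two small calculations: a product of per-step routing accuracies for the no-state case, and a ratio for the rescue factor. Both function names and all numbers are hypothetical, chosen only to show the arithmetic; defining rescue as this particular ratio is an assumption, not the paper's stated definition.

```python
def joint_routing_no_state(step_accuracies):
    """Approximate joint accuracy as the product of per-step accuracies."""
    joint = 1.0
    for accuracy in step_accuracies:
        joint *= accuracy
    return joint

def rescue_factor(accuracy_with_state, accuracy_without_state):
    """Ratio of downstream accuracy with vs. without a realized upstream state."""
    return accuracy_with_state / accuracy_without_state

# Three hypothetical routing steps at 90%, 80%, and 70% accuracy.
joint = joint_routing_no_state([0.9, 0.8, 0.7])  # about 0.504

# A hypothetical hard downstream decision: 10% without state, 40% with it.
factor = rescue_factor(0.40, 0.10)  # about 4
```

The product shows why multi-step chains collapse quickly before any state is realized, and the ratio shows what a rescue of about 4 times would mean for a single difficult decision.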
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to identify two coupled scaling laws in LLM agent systems across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions. The routing law states that single-step routing accuracy decays logarithmically with library size (R² > 0.97 for all models), with errors progressing from local skill competition to cross-family drift and capture by general 'black-hole skills'. The execution law indicates that joint routing is approximately multiplicative before state realization, while correct execution can improve difficult downstream decisions by about 4×. A single parameter—the routing logarithmic decay slope b—couples the laws such that routing-side fits predict execution-side rescue across models. Law-guided optimization is shown to raise held-out routing accuracy from 71.3% to 91.7%, reduce hijack from 22.4% to 4.1%, and improve mean pass rates on ClawBench (49.3% to 61.6%) and ClawMark (28.4% to 34.5%).
Significance. If the central claims hold after addressing methodological details, the work provides a valuable empirical framework for understanding how skill library size and structure affect agent performance, shifting focus from model capability alone to library properties. The scale of the experiments and the identification of a coupling parameter b that links pre-execution routing collapse to downstream recoverability represent a strength, as does the demonstration of actionable improvements that transfer directionally to held-out benchmarks. This could inform more principled design of reusable skill libraries in agent systems.
major comments (2)
- The abstract reports high R² fits for the logarithmic decay and concrete lifts such as the 4× rescue factor and optimization gains, but provides no information on how library sizes were varied, whether exclusions were post-hoc, how error bars were computed, or whether the 4× rescue factor was measured on held-out data. This is load-bearing for the central claim that the laws are robust and coupled, as the soundness of the fits and predictions cannot be assessed without these details.
- The coupling claim—that the routing logarithmic decay slope b fitted from routing data predicts execution-side rescue—relies on the same overall experimental setup for both regimes. This introduces a moderate circularity risk that requires explicit clarification on the independence of the routing and execution measurements, any cross-validation, and whether b was derived without reference to execution outcomes.
minor comments (2)
- Provide the precise functional form and fitting procedure for the logarithmic decay (including any constants or normalizations) and confirm whether the form was selected a priori or post-hoc.
- Include more detail on skill categorization, library construction, and the definition of 'black-hole skills' to support reproducibility of the error progression analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which has helped us improve the clarity and transparency of our work. We address each major comment below and have revised the manuscript accordingly to provide the requested methodological details while preserving the integrity of our empirical claims.
-
Referee: The abstract reports high R² fits for the logarithmic decay and concrete lifts such as the 4× rescue factor and optimization gains, but provides no information on how library sizes were varied, whether exclusions were post-hoc, how error bars were computed, or whether the 4× rescue factor was measured on held-out data. This is load-bearing for the central claim that the laws are robust and coupled, as the soundness of the fits and predictions cannot be assessed without these details.
Authors: We agree that the abstract would benefit from greater methodological transparency to allow readers to assess robustness directly. The full manuscript (Section 3) details that library sizes were varied systematically from 10 to 1,141 skills via repeated random subsampling of the fixed skill pool, with means and standard errors computed over five independent seeds per size point; no post-hoc exclusions occurred and all data are reported. Error bars throughout are standard errors from this repeated sampling. The 4× rescue factor and optimization gains were evaluated on held-out tasks and benchmarks separate from the fitting data. We have revised the abstract to include a concise clause on these points and added a short statistical methods paragraph in Section 3.2. revision: yes
-
Referee: The coupling claim—that the routing logarithmic decay slope b fitted from routing data predicts execution-side rescue—relies on the same overall experimental setup for both regimes. This introduces a moderate circularity risk that requires explicit clarification on the independence of the routing and execution measurements, any cross-validation, and whether b was derived without reference to execution outcomes.
Authors: We appreciate the referee's caution regarding potential circularity. Routing measurements used to fit the decay law and slope b were obtained exclusively from isolated single-step routing queries with no execution component. Execution data were collected independently from complete multi-step agent trajectories on downstream tasks. Parameter b was fitted solely on the routing regime and then tested for predictive validity on execution outcomes using a cross-validation split across models and task sets; execution results were never used to adjust or select b. We have inserted a dedicated clarification paragraph in Section 4.3 describing the separation of data collection pipelines and the cross-validation protocol. revision: yes
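The subsampling protocol described in the first response (library sizes from 10 to 1,141, five seeds per size, mean and standard error) could be sketched as follows. This is an illustration under stated assumptions: `toy_evaluate` is a hypothetical stand-in for a real routing evaluation, and the skill names are placeholders.

```python
import math
import random
import statistics

def subsample_accuracy(skill_pool, size, seeds, evaluate):
    """Return (mean, standard error) of accuracy over random sub-libraries."""
    scores = []
    for seed in seeds:
        rng = random.Random(seed)
        library = rng.sample(skill_pool, size)  # sample without replacement
        scores.append(evaluate(library, rng))
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, std_err

def toy_evaluate(library, rng):
    # Hypothetical stand-in: accuracy falls with ln(size), plus small noise.
    return 0.98 - 0.05 * math.log(len(library)) + rng.gauss(0.0, 0.01)

skill_pool = [f"skill_{i}" for i in range(1141)]  # placeholder skill names
seeds = range(5)                                   # five seeds per size point
curve = {
    size: subsample_accuracy(skill_pool, size, seeds, toy_evaluate)
    for size in (10, 100, 1141)
}
```

Seeding each run separately keeps every size point reproducible, and the standard error shrinks with the square root of the number of seeds.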
Circularity Check
Fitted routing parameter b used to 'predict' execution rescue on same data
specific steps
-
fitted input called prediction
[Abstract]
"A single parameter, the routing logarithmic decay slope b, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability."
b is obtained by fitting the routing law (logarithmic decay of accuracy with library size) to the observed routing decisions; the same scalar is then asserted to predict the magnitude of execution rescue. Because both routing and execution measurements come from the same 3M decisions on the same skill library, the 'prediction' is a re-labeling of the fitted parameter rather than an out-of-sample or independently derived result.
full rationale
The central coupling claim rests on fitting the logarithmic decay slope b exclusively from routing accuracy data across the 15 models and 1,141 skills, then invoking that same fitted b to explain and 'predict' execution-side rescue effects. This matches the fitted-input-called-prediction pattern: the execution improvement is not an independent derivation but a re-use of the routing fit on the identical experimental corpus. No external validation or parameter-free derivation is shown in the provided text; the coupling is therefore partially circular by construction. The remainder of the scaling laws (logarithmic form, error progression) appear independently measured and are not reduced to the fit.
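One way to turn the coupling into an out-of-sample prediction is a leave-one-model-out check: fit the link between b and rescue on all models but one, then predict the held-out model. The sketch below assumes a linear link and uses invented values; neither the link nor the numbers come from the paper.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def leave_one_out_errors(b_values, rescue_values):
    """Absolute prediction error for each model when it is held out."""
    errors = []
    for i in range(len(b_values)):
        train_b = b_values[:i] + b_values[i + 1:]
        train_rescue = rescue_values[:i] + rescue_values[i + 1:]
        slope, intercept = fit_line(train_b, train_rescue)
        predicted = slope * b_values[i] + intercept
        errors.append(abs(predicted - rescue_values[i]))
    return errors

# Hypothetical per-model slopes and rescue factors (exactly linear by design).
b_values = [0.03, 0.04, 0.05, 0.06, 0.07]
rescue_values = [2.0 + 40.0 * b for b in b_values]
errors = leave_one_out_errors(b_values, rescue_values)
```

Because the held-out model never contributes to its own fit, small errors here would be evidence of prediction rather than re-labeling of a fitted parameter.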
Axiom & Free-Parameter Ledger
free parameters (1)
- routing logarithmic decay slope b
axioms (1)
- domain assumption: The 1,141 skills and routing/execution decisions collected are representative of real-world agent skill libraries.
Appendix excerpts
- Assumption 1 (Finite-capacity margin model). For task $q$, the router assigns each skill a latent score $Z_i = m_i + \varepsilon_i$, where $m_\star - m_j = \Delta_j$ is the semantic margin of the gold skill against distractor $j$, and $\varepsilon_i$ is zero-mean sub-Gaussian noise with scale $\sigma$.
- Assumption 2 (Rank-regular clustered library). Within a functional cluster, distractor skills can be ordered by semantic rank $r = 1, \dots, N-1$ around the target, and their effective overtake weights satisfy $w_r = \kappa/r + O(r^{-1-\xi})$ for some $\kappa, \xi > 0$ over the measured range. This is the scale-free local-crowding condition tested by the CI and semantic-gap diagnostics.
- Assumption 3 (Logarithmic effective competition). The total effective pressure from plausible distractors is $C(N) = \sum_{j \neq \star} w_j(N) = C_0 + C_1 \ln N + o(\ln N)$ over the exposed-library range. Irrelevant distractors have negligible $w_j$; local near-misses dominate $C(N)$.
- Assumption 4 (Small-error linearization regime). Over the measured operating range, the router is not saturated at 0 or 1, so first-order Taylor expansions of the error odds are valid.
- Assumption 5 (State-gated rescue). Correct upstream execution produces a concrete artifact with rescue coefficient $\alpha \in [0,1]$; rescue can only occur when upstream routing/execution is correct and only to the extent that the downstream route has remaining headroom.
- Assumption 6 (No pre-state leakage). In the no-state condition, paired routing prompts do not contain an execution artifact, so any interaction before execution is bounded by a protocol term $\epsilon_{\mathrm{proto}}$.
Lemma 1 (Clustered libraries imply logarithmic effective competition). Under Assumption 2, Assumption 3 follows with $C_1 = \kappa$.
- Concrete object: the task identifies or clearly implies the file, repository, API, dataset, table, document, command, state, or other object to operate on.
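Lemma 1 above can be checked numerically in its leading-order form: if the overtake weights are exactly $\kappa/r$, their sum over ranks $1, \dots, N-1$ is a harmonic sum, which grows like $\kappa \ln N$ plus a constant. The sketch below drops the $O(r^{-1-\xi})$ correction and uses a hypothetical value of $\kappa$; it is a sanity check on the scaling, not the paper's proof.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def effective_competition(library_size, kappa):
    """C(N) for weights w_r = kappa / r over ranks r = 1 .. N-1."""
    return sum(kappa / rank for rank in range(1, library_size))

kappa = 0.5  # hypothetical value, not taken from the paper

# Offset C(N) - kappa * ln(N) should settle near the constant kappa * gamma,
# which plays the role of C_0 while kappa plays the role of C_1.
offsets = {
    n: effective_competition(n, kappa) - kappa * math.log(n)
    for n in (10, 100, 1141)
}
```

The offsets approach `kappa * EULER_GAMMA` as N grows, which is consistent with $C(N) = C_0 + C_1 \ln N + o(\ln N)$ and $C_1 = \kappa$.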