The Scaling Laws of Skills in LLM Agent Systems

Charles Chen; Qiming Yu; Yuhang Gu; Zhuoye Huang; Hanjing Li; Hongyu Liu; Simin Liu; Jinhao Liu; Dengyun Peng; Jiangyi Wang; Zheng Yan; Fanqing Meng; Ethan Qin; Carl Che; Mengkang Hu

arxiv: 2605.16508 · v1 · pith:G6KX4YV6new · submitted 2026-05-15 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-20 18:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: scaling laws, LLM agents, skill libraries, routing accuracy, logarithmic decay, execution rescue, library optimization, agent systems

The pith

A single decay slope parameter links routing accuracy loss in large skill libraries to how much execution rescues downstream performance in LLM agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how skills accumulate into large reusable libraries for LLM agent systems and identifies two coupled scaling laws. Routing accuracy decays logarithmically with library size across 15 frontier models, with errors shifting from local competition to capture by overly general skills. Execution follows a multiplicative pattern before state realization, but correct prior steps can improve difficult downstream decisions by about four times. The logarithmic decay slope b from routing fits predicts the rescue effect in execution across models. This coupling shows that the same library property controls both pre-execution collapse and downstream recoverability, allowing law-guided optimizations that raise routing accuracy and transfer to execution benchmarks.

Core claim

Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, single-step routing accuracy decays logarithmically with library size. Errors progress from local skill competition to cross-family drift and capture by overly general black-hole skills. Before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about 4 times. A single parameter, the routing logarithmic decay slope b, couples the two laws: routing-side fits predict execution-side rescue across models.
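
The routing law can be illustrated with a minimal fitting sketch. It assumes the functional form acc(N) = a - b ln N with a natural logarithm, which is one reading of "decays logarithmically" and is not confirmed by the text above. The function name `fit_log_decay` and all numbers are hypothetical and synthetic.

```python
import numpy as np

def fit_log_decay(sizes, accuracies):
    """Least-squares fit of acc = a - b * ln(N). Returns (a, b, r2).

    The functional form is an assumption made for illustration.
    """
    x = np.log(np.asarray(sizes, dtype=float))
    y = np.asarray(accuracies, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # highest-degree coefficient first
    pred = intercept + slope * x
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return intercept, -slope, 1.0 - ss_res / ss_tot

# Synthetic data generated from the assumed form; not values from the paper.
sizes = [10, 50, 100, 500, 1141]
true_a, true_b = 0.98, 0.05
acc = [true_a - true_b * np.log(n) for n in sizes]

a, b, r2 = fit_log_decay(sizes, acc)
print(f"a={a:.3f}  b={b:.3f}  R^2={r2:.4f}")
```

Under this convention a larger b means accuracy falls faster as the library grows, so b can be compared across models that were fitted on the same library sizes.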

What carries the argument

The routing logarithmic decay slope b, which couples the routing law to the execution rescue effect and allows routing data to predict downstream recoverability.

If this is right

Law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%.
Law-guided optimization reduces the hijack rate from 22.4% to 4.1%.
Improvements transfer directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark.
Agent performance depends not only on model capability but also on the structure, granularity, and exposure policy of the skill library.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the coupling through b generalizes, skill libraries could be engineered with controlled granularity to both reduce routing collapse and improve recoverability without changing the underlying model.
The laws suggest that exposure policies can be tuned using the decay slope to support longer multi-step agent chains more reliably.
Applying the same fitting procedure to non-frontier models or entirely new task domains would test whether the functional form and predictive power of b remain stable.

Load-bearing premise

The logarithmic decay form and the coupling through the slope b observed on the specific 15 frontier LLMs, 1,141 skills, and 3M decisions will continue to hold for other models, skill distributions, and task domains without requiring re-fitting or new functional forms.

What would settle it

A test on a new model or skill set in which routing accuracy does not follow a logarithmic decay with library size, or in which the fitted b from routing data fails to predict the observed execution rescue factor, would show the claimed coupling does not hold.
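
One way to run the second half of this test is a rank correlation between per-model slopes and per-model rescue factors. The sketch below is only an illustration: `spearman_rho` is a hypothetical helper that assumes no tied values, and the slope and rescue numbers are invented. The sign and strength of the relationship in this toy data are arbitrary and say nothing about the paper's result.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for inputs without ties."""
    rank_x = np.argsort(np.argsort(x))  # rank of each element, 0-based
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Hypothetical per-model values, one entry per model.
slopes = [0.03, 0.05, 0.04, 0.08, 0.06]   # fitted routing slope b
rescue = [4.6, 3.9, 4.2, 2.8, 3.1]        # measured execution rescue factor

rho = spearman_rho(slopes, rescue)
print(f"Spearman rho between b and rescue factor: {rho:.2f}")
```

A correlation near zero on a new model set would count against the coupling claim, while a strong monotone relationship would be consistent with it.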

Figures

Figures reproduced from arXiv: 2605.16508 by Carl Che, Charles Chen, Dengyun Peng, Ethan Qin, Fanqing Meng, Hanjing Li, Hongyu Liu, Jiangyi Wang, Jinhao Liu, Mengkang Hu, Qiming Yu, Simin Liu, Yuhang Gu, Zheng Yan, Zhuoye Huang.

**Figure 2.** Library scale induces logarithmic single-step decay, super-independent pipeline loss, and a mid-chain routing trough. … into the plain-text hallucination or hijack analyses. Prompts are in Appendix B. Optimization experiments use the same held-out routing protocol as the diagnostic studies. This section analyzes the routing law before execution: the model observes a task and skill descriptions, chooses $\hat{s}$, a…

**Figure 3.** Local Competition Organizes Error. (a) Accuracy decays with $N$ in all description regimes, but better descriptions reduce the decay rate. (b) Top-50 introduces more errors than Top-20/10. (c) Danger band [0.55, 0.75) shows the strongest negative $\rho$ with accuracy. (d) The Boltzmann fit between the competition index and accuracy within the danger zone, where $R^2_{\mathrm{CI}}=0.55$, outperforms the fit with $N$ alone, where $R^2$…

**Figure 4.** Weak task anchors and vague skill descriptions convert local skill-cluster drift into black-hole skill capture. Drift and attractor mechanism. With strong anchors, errors are geometrically local substitutions within the same functional family. Weak anchors break this locality: accuracy collapses, but the model still prefers existing skills over hallucinated names. Black-hole capture is conjunctive, not add…

**Figure 5.** Before state realization, paired routing is approximately multiplicative; after correct execution, realized state rescues difficult downstream routes. We attribute this to the fact that the router is modeled as a conditional distribution over skills given the prompt, and the no-state isolation setup removes dependence between the two steps before any execution artifact is observed (Appendix A). No-state mu…

**Figure 6.** Incorrect state propagates through tight dependencies, while joint execution synergy changes sign with dependency structure and capability gap $G$. … output, the error is carried forward (…

**Figure 7.** Cross-law coupling: routing slope predicts rescue gain. The previous two sections establish the routing and execution laws separately. The remaining question is whether the two laws share a state variable. They do: the routing slope $b$ is not only the rate at which a model loses accuracy as the library grows; it also determines how often correct upstream state is available for execution-side rescue. Concret…

**Figure 8.** The performance of the law-guided automatic skill manager. … models ($\rho=0.74$, $p<0.001$; …

**Figure 9.** Cascade Penalty Decomposition. Pipeline depth amplifies the gap between observed end-to-end accuracy and the independence baseline. … stratify pairs by dependency strength $\kappa$, where tight dependencies require the upstream artifact for the downstream checker and loose dependencies do not. Artifact correctness is scored by task-specific checkers or human verification when a checker is unavailable. B.6. Fitting …

**Figure 10.** State-Transition Audit. Downstream routing accuracy depends on whether the preceding step remains correct, linking transition state to cascade behavior (annotated by human annotators). Using log probabilities makes the excess penalty additive across depth and comparable between models with different absolute single-step accuracies. Transition audit. To check whether the excess loss is consistent with loca…

**Figure 11.** ToolBench Validity. ToolBench routing accuracy and hallucination rates follow the same qualitative library-size pattern on an external tool corpus.
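
The cascade penalty mentioned in the captions for Figures 9 and 10 can be sketched as the gap, in log space, between an independence baseline and the observed end-to-end accuracy. This is a rough illustration under stated assumptions: the baseline is taken to be the product of single-step accuracies, the sign convention (positive means worse than independence) is chosen here, and the function names and numbers are hypothetical.

```python
import math

def independence_baseline(step_accuracies):
    """Expected end-to-end accuracy if every step succeeded independently."""
    return math.prod(step_accuracies)

def excess_log_penalty(step_accuracies, observed_end_to_end):
    """log(baseline) - log(observed).

    Positive when the pipeline loses more than independence predicts.
    Working in log space turns the product over steps into a sum.
    """
    baseline = independence_baseline(step_accuracies)
    return math.log(baseline) - math.log(observed_end_to_end)

# Hypothetical three-step pipeline; not measurements from the paper.
steps = [0.9, 0.8, 0.85]
observed = 0.50

baseline = independence_baseline(steps)
penalty = excess_log_penalty(steps, observed)
print(f"baseline={baseline:.3f}  excess log penalty={penalty:.3f}")
```

Because the log of a product is a sum of logs, per-step contributions to the penalty add up across depth, which is the property the Figure 10 caption points to.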

Original abstract

As agent systems scale, skills accumulate into large reusable libraries, yet their scaling laws remain poorly understood. Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, we identify two coupled laws. Routing law: single-step routing accuracy decays logarithmically with library size ($R^2{>}0.97$ for all models), with errors progressing from local skill competition to cross-family drift and capture by overly general "black-hole skills". Execution law: before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about $4{\times}$. A single parameter, the routing logarithmic decay slope $b$, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability. The laws are actionable: law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and transfers directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark. These results show that agent performance depends not only on model capability, but also on the structure, granularity, and exposure policy of the skill library.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper finds a logarithmic decay in routing accuracy with library size that couples to execution rescue via one slope parameter b, backed by large-scale runs but light on method specifics.

The main takeaway is that routing accuracy drops logarithmically as skill libraries grow, and the decay slope b from those fits lines up with how much correct execution can rescue downstream decisions across models. The authors also show this leads to practical library tweaks that lift held-out routing accuracy from 71.3% to 91.7% and cut hijacks from 22.4% to 4.1%, with some carryover to ClawBench and ClawMark pass rates. That coupling and the actionable optimization are the clearest new pieces.

The scale helps: fifteen frontier models, over a thousand real skills, and three million decisions give the log fits R-squared values above 0.97 and let the authors track error shifts from local competition to black-hole skills. The multiplicative pre-state effect and the 4x rescue claim on hard cases are concrete enough to test.

The soft spots sit in the details that are missing from the write-up. It is not clear how library sizes were sampled or varied, whether the b parameter was derived from the same runs used to measure execution rescue, or how error bars and statistical significance were handled. That opens a moderate risk that the coupling looks tighter than it would on fully independent data. The improvements are reported on held-out routing, which is good, but the execution transfer numbers would be stronger with more detail on controls and splits.

This is aimed at people building or scaling agent skill libraries who need levers beyond raw model size. A reader focused on multi-step reliability would find testable ideas here. It has enough empirical grounding and practical payoff to deserve a serious referee, though the methods section will need expansion to hold up under review.

Referee Report

2 major / 2 minor

Summary. The paper claims to identify two coupled scaling laws in LLM agent systems across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions. The routing law states that single-step routing accuracy decays logarithmically with library size (R² > 0.97 for all models), with errors progressing from local skill competition to cross-family drift and capture by general 'black-hole skills'. The execution law indicates that joint routing is approximately multiplicative before state realization, while correct execution can improve difficult downstream decisions by about 4×. A single parameter—the routing logarithmic decay slope b—couples the laws such that routing-side fits predict execution-side rescue across models. Law-guided optimization is shown to raise held-out routing accuracy from 71.3% to 91.7%, reduce hijack from 22.4% to 4.1%, and improve mean pass rates on ClawBench (49.3% to 61.6%) and ClawMark (28.4% to 34.5%).

Significance. If the central claims hold after addressing methodological details, the work provides a valuable empirical framework for understanding how skill library size and structure affect agent performance, shifting focus from model capability alone to library properties. The scale of the experiments and the identification of a coupling parameter b that links pre-execution routing collapse to downstream recoverability represent a strength, as does the demonstration of actionable improvements that transfer directionally to held-out benchmarks. This could inform more principled design of reusable skill libraries in agent systems.

major comments (2)

The abstract reports high R² fits for the logarithmic decay and concrete lifts such as the 4× rescue factor and optimization gains, but provides no information on how library sizes were varied, whether exclusions were post-hoc, how error bars were computed, or whether the 4× rescue factor was measured on held-out data. This is load-bearing for the central claim that the laws are robust and coupled, as the soundness of the fits and predictions cannot be assessed without these details.
The coupling claim—that the routing logarithmic decay slope b fitted from routing data predicts execution-side rescue—relies on the same overall experimental setup for both regimes. This introduces a moderate circularity risk that requires explicit clarification on the independence of the routing and execution measurements, any cross-validation, and whether b was derived without reference to execution outcomes.

minor comments (2)

Provide the precise functional form and fitting procedure for the logarithmic decay (including any constants or normalizations) and confirm whether the form was selected a priori or post-hoc.
Include more detail on skill categorization, library construction, and the definition of 'black-hole skills' to support reproducibility of the error progression analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped us improve the clarity and transparency of our work. We address each major comment below and have revised the manuscript accordingly to provide the requested methodological details while preserving the integrity of our empirical claims.

Referee: The abstract reports high R² fits for the logarithmic decay and concrete lifts such as the 4× rescue factor and optimization gains, but provides no information on how library sizes were varied, whether exclusions were post-hoc, how error bars were computed, or whether the 4× rescue factor was measured on held-out data. This is load-bearing for the central claim that the laws are robust and coupled, as the soundness of the fits and predictions cannot be assessed without these details.

Authors: We agree that the abstract would benefit from greater methodological transparency to allow readers to assess robustness directly. The full manuscript (Section 3) details that library sizes were varied systematically from 10 to 1,141 skills via repeated random subsampling of the fixed skill pool, with means and standard errors computed over five independent seeds per size point; no post-hoc exclusions occurred and all data are reported. Error bars throughout are standard errors from this repeated sampling. The 4× rescue factor and optimization gains were evaluated on held-out tasks and benchmarks separate from the fitting data. We have revised the abstract to include a concise clause on these points and added a short statistical methods paragraph in Section 3.2. revision: yes
Referee: The coupling claim—that the routing logarithmic decay slope b fitted from routing data predicts execution-side rescue—relies on the same overall experimental setup for both regimes. This introduces a moderate circularity risk that requires explicit clarification on the independence of the routing and execution measurements, any cross-validation, and whether b was derived without reference to execution outcomes.

Authors: We appreciate the referee's caution regarding potential circularity. Routing measurements used to fit the decay law and slope b were obtained exclusively from isolated single-step routing queries with no execution component. Execution data were collected independently from complete multi-step agent trajectories on downstream tasks. Parameter b was fitted solely on the routing regime and then tested for predictive validity on execution outcomes using a cross-validation split across models and task sets; execution results were never used to adjust or select b. We have inserted a dedicated clarification paragraph in Section 4.3 describing the separation of data collection pipelines and the cross-validation protocol. revision: yes
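
The subsampling protocol described in the first simulated response (random sub-libraries drawn from a fixed pool, with a mean and standard error over five seeds per size) could look roughly like the sketch below. Everything here is a hypothetical stand-in: `subsample_accuracy` and `toy_evaluate` are illustrative names, and the toy evaluator simply scores the fraction of even-numbered skill IDs instead of calling a real router.

```python
import random
import statistics

def subsample_accuracy(skill_pool, size, seeds, evaluate):
    """Mean and standard error of `evaluate` over random sub-libraries."""
    scores = []
    for seed in seeds:
        rng = random.Random(seed)              # one independent RNG per seed
        library = rng.sample(skill_pool, size)  # sample without replacement
        scores.append(evaluate(library))
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem

def toy_evaluate(library):
    """Placeholder for a routing evaluation; NOT the paper's metric."""
    return sum(1 for skill_id in library if skill_id % 2 == 0) / len(library)

pool = list(range(1141))  # skill IDs standing in for the 1,141-skill pool
mean_acc, sem_acc = subsample_accuracy(pool, 100, range(5), toy_evaluate)
print(f"size=100  mean={mean_acc:.3f}  SE={sem_acc:.3f}")
```

Seeding each draw separately keeps every size point reproducible, which matters if exclusions are later audited for post-hoc selection.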

Circularity Check

1 step flagged

Fitted routing parameter b used to 'predict' execution rescue on same data

specific steps

fitted input called prediction [Abstract]
"A single parameter, the routing logarithmic decay slope b, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability."

b is obtained by fitting the routing law (logarithmic decay of accuracy with library size) to the observed routing decisions; the same scalar is then asserted to predict the magnitude of execution rescue. Because both routing and execution measurements come from the same 3M decisions on the same skill library, the 'prediction' is a re-labeling of the fitted parameter rather than an out-of-sample or independently derived result.

full rationale

The central coupling claim rests on fitting the logarithmic decay slope b exclusively from routing accuracy data across the 15 models and 1,141 skills, then invoking that same fitted b to explain and 'predict' execution-side rescue effects. This matches the fitted-input-called-prediction pattern: the execution improvement is not an independent derivation but a re-use of the routing fit on the identical experimental corpus. No external validation or parameter-free derivation is shown in the provided text; the coupling is therefore partially circular by construction. The remainder of the scaling laws (logarithmic form, error progression) appears to be independently measured and is not reduced to the fit.
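
An out-of-sample version of the coupling test, of the kind this audit finds missing, might hold out one model at a time: fit a relationship between b and rescue on the remaining models, then score the prediction for the held-out one. The sketch below assumes a linear relationship purely for illustration. The function name and all data are hypothetical, and this is not the paper's protocol.

```python
import numpy as np

def leave_one_model_out_errors(slopes, rescue):
    """Prediction error for each model when it is excluded from the fit."""
    slopes = np.asarray(slopes, dtype=float)
    rescue = np.asarray(rescue, dtype=float)
    errors = []
    for i in range(len(slopes)):
        keep = np.arange(len(slopes)) != i   # mask out the held-out model
        coef, intercept = np.polyfit(slopes[keep], rescue[keep], 1)
        predicted = coef * slopes[i] + intercept
        errors.append(rescue[i] - predicted)
    return np.array(errors)

# Hypothetical per-model values; not results from the paper.
demo_slopes = [0.03, 0.04, 0.05, 0.06, 0.08]
demo_rescue = [4.7, 4.3, 4.1, 3.5, 2.9]

loo_errors = leave_one_model_out_errors(demo_slopes, demo_rescue)
print("mean absolute held-out error:", float(np.mean(np.abs(loo_errors))))
```

Comparing the held-out error against a baseline that ignores b, such as predicting the mean rescue factor of the other models, would show whether the slope carries real predictive information.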

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on empirical fits to a particular collection of models and skills; the slope b is the main fitted quantity, and the assumption that the observed functional forms generalize is the primary domain assumption.

free parameters (1)

routing logarithmic decay slope b
Single parameter fitted from routing accuracy versus library size data; used to couple routing and execution regimes and to guide optimization.

axioms (1)

domain assumption: The 1,141 skills and routing/execution decisions collected are representative of real-world agent skill libraries.
Invoked when claiming the laws are actionable beyond the tested set.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 21 internal anchors

[1]

A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

work page 2024
[2]

Llm-based agents for tool learning: A survey.Data Science and Engineering, 2025

Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. Llm-based agents for tool learning: A survey.Data Science and Engineering, 2025

work page 2025
[3]

Survey on Evaluation of LLM-based Agents

Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli- Scheuer. Survey on evaluation of LLM-based agents.arXiv preprint arXiv:2503.16416, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023

work page 2023
[5]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world APIs.arXiv preprint arXiv:2307.16789, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Gorilla: Large Language Model Connected with Massive APIs

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive APIs.arXiv preprint arXiv:2305.15334, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-Bank: A comprehensive benchmark for tool-augmented LLMs.arXiv preprint arXiv:2304.08244, 2023. evolvent.co 11 The Scaling Laws of Skills in LLM Agent Systems

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

name" - concise snake_case identifier -

Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, and Yueting Zhuang. TaskBench: Benchmarking large language models for task automation.arXiv preprint arXiv:2311.18760, 2023

work page arXiv 2023
[9]

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, X...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

Skillrouter: Retrieve-and-rerank skill selection for llm agents at scale,

YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuai Zhu, Yong Wu, Tianze Xu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, and Gang Yu. Skillrouter: Skill routing for LLM agents at scale.arXiv preprint arXiv:2603.22455, 2026

work page arXiv 2026
[11]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.arXiv preprint arXiv:2503.09567, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Ai4research: A survey of artificial intelligence for scientific research.arXiv preprint arXiv:2507.01903, 2025

Qiguang Chen, Mingda Yang, Libo Qin, Jinhao Liu, Zheng Yan, Jiannan Guan, Dengyun Peng, Yiyan Ji, Hanjing Li, Mengkang Hu, et al. Ai4research: A survey of artificial intelligence for scientific research.arXiv preprint arXiv:2507.01903, 2025

work page arXiv 2025
[13]

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: What, when, how, and where to evolve on the path to artificial super intelligence.arXiv preprint arXiv:2507.21046, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward.arXiv preprint arXiv:2602.12430, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases.arXiv preprint arXiv:2306.05301, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Openagents: An open platform for language agents in the wild,

Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, et al. OpenAgents: An open platform for language agents in the wild.arXiv preprint arXiv:2310.10634, 2023

work page arXiv 2023
[17]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents.arXiv preprint arXiv:2308.03688, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. SoK: Agentic skills – beyond tool use in LLM agents.arXiv preprint arXiv:2602.20867, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

ComplexFuncBench: Exploring multi-step and constrained function calling under long-context scenario.arXiv preprint arXiv:2501.10132, 2025

Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, and Jie Tang. ComplexFuncBench: Exploring multi-step and constrained function calling under long-context scenario.arXiv preprint arXiv:2501.10132, 2025

work page arXiv 2025
[20]

BiasBusters: Uncovering and mitigating tool selection bias in large language models

Thierry Blankenstein, Jialin Yu, Zixuan Li, Vassilis Plachouras, Sunando Sengupta, Philip Torr, Yarin Gal, Alasdair Paren, and Adel Bibi. BiasBusters: Uncovering and mitigating tool selection bias in large language models. InInternational Conference on Learning Representations, 2026. URL https://openreview.net/forum? id=DEg4vvElYu. evolvent.co 12 The Scal...

work page 2026
[21]

xRouter: Training cost-aware LLMs orchestration system via reinforcement learning.arXiv preprint arXiv:2510.08439, 2025

Cheng Qian, Zuxin Liu, Shirley Kokane, Akshara Prabhakar, Jielin Qiu, Haolin Chen, Zhiwei Liu, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, and Huan Wang. xRouter: Training cost-aware LLMs orchestration system via reinforcement learning.arXiv preprint arXiv:2510.08439, 2025

work page arXiv 2025
[22]

EvoRoute: Experience-driven self-routing LLM agent systems.arXiv preprint arXiv:2601.02695, 2026

Guibin Zhang, Haiyang Yu, Kaiming Yang, Bingli Wu, Fei Huang, Yongbin Li, and Shuicheng Yan. EvoRoute: Experience-driven self-routing LLM agent systems.arXiv preprint arXiv:2601.02695, 2026

work page arXiv 2026
[23]

Autotool: Efficient tool selection for large language model agents.arXiv preprint arXiv:2511.14650, 2025

Jingyi Jia and Qinbin Li. Autotool: Efficient tool selection for large language model agents.arXiv preprint arXiv:2511.14650, 2025

work page arXiv 2025
[24]

Meta- toolagent: Towards generalizable tool usage in llms through meta-learning.arXiv preprint arXiv:2601.12680, 2026

Zheng Fang, Wolfgang Mayer, Zeyu Zhang, Jian Wang, Hong-Yu Zhang, Wanli Li, and Zaiwen Feng. Meta- toolagent: Towards generalizable tool usage in llms through meta-learning.arXiv preprint arXiv:2601.12680, 2026

work page arXiv 2026
[25]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023

work page 2023
[26]

Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models

Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors,Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long ...

work page doi:10.18653/v1/2023.acl-long.147 2023
[27]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.arXiv preprint arXiv:2303.11366, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580, 2023.

[29] Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.

[30] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.

[31] Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, et al. TaskWeaver: A code-first agent framework. arXiv preprint arXiv:2311.17541, 2023.

[32] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. arXiv preprint arXiv:2309.10691, 2023.

[33] Yu Du, Fangyun Wei, and Hongyang Zhang. AnyTool: Self-reflective, hierarchical agents for large-scale API calls. arXiv preprint arXiv:2402.04253, 2024.

[34] Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. Evaluation and benchmarking of LLM agents: A survey. arXiv preprint arXiv:2507.21504, 2025. URL https://arxiv.org/abs/2507.21504.
[35] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[36] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

[37] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.

[38] Nicholas Roberts, Niladri S. Chatterji, Sharan Narang, Mike Lewis, and Dieuwke Hupkes. Compute optimal scaling of skills: Knowledge vs reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pages 13295–13316, 2025. doi: 10.18653/v1/2025.findings-acl.688. URL https://aclanthology.org/2025.findings-acl.688/.

[39] Mengsong Wu, Hao Hao, Shuzhen Bi, Keqian Li, Wentao Liu, Siyu Song, Hongbo Zhao, and Aimin Zhou. Scaling laws for educational AI agents. arXiv preprint arXiv:2603.11709, 2026. URL https://arxiv.org/abs/2603.11709.

[40] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6769–6781, 2020.

[41] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.

[42] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.
[43] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023.

[44] William E. Hick. On the rate of gain of information. Quarterly Journal of Experimental Psychology, 4(1):11–26, 1952.

[45] Ray Hyman. Stimulus information as a determinant of reaction time. Journal of Experimental Psychology, 45(3):188–196, 1953.

[46] Anthropic. Github - anthropics/skills: Public repository for agent skills. https://github.com/anthropics/skills, 2025. Accessed: 2026-04-17.

[47] Anthropic. Claude code. https://www.anthropic.com/product/claude-code, 2025. Accessed: 2026-04-16.

[48] Anthropic. Introducing the model context protocol. https://www.anthropic.com/news/model-context-protocol, 2024. Accessed: 2026-04-16.

[49] Trail of Bits. Github - trailofbits/skills: Trail of bits claude code skills for security research, vulnerability detection, and audit workflows. https://github.com/trailofbits/skills, 2025. Accessed: 2026-04-16.
[50] daymade. Claude code skills marketplace. https://github.com/daymade/claude-code-skills, 2025. Accessed: 2026-04-16.

[51] Aaronontheweb. .NET skills for claude code. https://github.com/Aaronontheweb/dotnet-skills, 2025. Accessed: 2026-04-16.

[52] PinchBench Team. PinchBench github organization. https://github.com/pinchbench, 2026. Accessed: 2026-04-16.

[53] Robhunter. Agentdeals. https://github.com/robhunter/agentdeals, 2026. Accessed: 2026-04-16.

[54] Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2318–2335, 2024. doi: 10.18653/v1/2024.findings-acl.137. URL https://aclanthology.org/2024.findings-acl.137/.
[56] Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026.

[57] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.

[58] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025.

[59] Ruixuan Xiao, Wentao Ma, Ke Wang, Yuchuan Wu, Junbo Zhao, Haobo Wang, Fei Huang, and Yongbin Li. FlowBench: Revisiting and benchmarking workflow-guided planning for LLM-based agents. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10883–10900, Miami, Florida, USA,... doi: 10.18653/v1/2024.findings-emnlp.638.

[60] OpenAI. Gpt-4o mini: advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024. Accessed: 2026-04-17.

[61] OpenAI. Gpt-5 system card. OpenAI, August 2025. URL https://openai.com/index/gpt-5-system-card/. Accessed: 2026-05-04.
[62] OpenAI. Introducing gpt-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, 2026. Accessed: 2026-04-17.

[63] OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-04-17.

[64] Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, 2026.

[65] Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026.

[66] Google DeepMind. Gemini 3. Google DeepMind, 2025. URL https://deepmind.google/models/gemini/.

[67] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering, 2026.
[68] Zhipu AI. Glm-4.7: Advancing the coding capability. https://z.ai/blog/glm-4.7, 2025.

[69] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence, 2026.

[70] Moonshot AI. Kimi k2.6: From code to creation, from one to many. https://www.kimi.com/ai-models/kimi-k2-6, 2026.

[71] ByteDance Seed Team. Seed 2.0 official launch. https://seed.bytedance.com/en/blog/seed2-0-%25E6%25AD%25A3%25E5%25BC%258F%25E5%258F%2591%25E5%25B8%2583, 2026. Accessed: 2026-04-16.

[72] DeepSeek-AI. Deepseek-v4: towards highly efficient million-token context intelligence. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf, 2026.

[73] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.

Appendix

This appendix is organized as a reader’s map from theory to implementation evidence. It firs...
Assumption 1 (Finite-capacity margin model). For task $q$, the router assigns each skill a latent score $Z_i = m_i + \varepsilon_i$, where $m_\star - m_j = \Delta_j$ is the semantic margin of the gold skill against distractor $j$, and $\varepsilon_i$ is zero-mean sub-Gaussian noise with scale $\sigma$.

Assumption 2 (Rank-regular clustered library). Within a functional cluster, distractor skills can be ordered by semantic rank $r = 1, \dots, N-1$ around the target, and their effective overtake weights satisfy $w_r = \kappa/r + O(r^{-1-\xi})$ for some $\kappa, \xi > 0$ over the measured range. This is the scale-free local-crowding condition tested by the CI and semantic-gap diagnostics.

Assumption 3 (Logarithmic effective competition). The total effective pressure from plausible distractors is $C(N) = \sum_{j \neq \star} w_j(N) = C_0 + C_1 \ln N + o(\ln N)$ over the exposed-library range. Irrelevant distractors have negligible $w_j$; local near-misses dominate $C(N)$.

Assumption 4 (Small-error linearization regime). Over the measured operating range, the router is not saturated at 0 or 1, so first-order Taylor expansions of the error odds are valid.

Assumption 5 (State-gated rescue). Correct upstream execution produces a concrete artifact with rescue coefficient $\alpha \in [0, 1]$; rescue can only occur when upstream routing/execution is correct and only to the extent that the downstream route has remaining headroom.

Assumption 6 (No pre-state leakage). In the no-state condition, paired routing prompts do not contain an execution artifact, so any interaction before execution is bounded by a protocol term $\epsilon_{\mathrm{proto}}$.

Lemma 1 (Clustered libraries imply logarithmic effective competition). Under Assumption 2, Assumption 3 follows with $C_1 = \kappa$.
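The statement of Lemma 1 can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes a concrete instance of the rank-regular weights, $w_r = \kappa/r + c\,r^{-1-\xi}$, where the constant `c_corr` and the values of `kappa` and `xi` are hypothetical choices. It sums the weights over ranks $1, \dots, N-1$ and fits the total against $\ln N$ by ordinary least squares. Because the sum of $\kappa/r$ is a harmonic sum and the correction term converges, the fitted slope should come out close to $\kappa$.

```python
import math


def effective_competition(n_skills, kappa=0.5, xi=1.0, c_corr=0.2):
    """Sum the assumed weights w_r = kappa/r + c_corr * r**(-1 - xi) over r = 1..N-1.

    kappa, xi and c_corr are hypothetical values chosen for illustration.
    """
    return sum(kappa / r + c_corr * r ** (-1.0 - xi) for r in range(1, n_skills))


def fit_log_slope(sizes, values):
    """Least-squares fit of values against ln(sizes); returns (intercept, slope)."""
    xs = [math.log(n) for n in sizes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(values) / len(values)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


# Library sizes are arbitrary illustration values, not the paper's grid.
sizes = [50, 100, 200, 400, 800, 1600]
values = [effective_competition(n) for n in sizes]
c0, c1 = fit_log_slope(sizes, values)
print(f"fitted C0 = {c0:.3f}, fitted C1 = {c1:.3f}, kappa = 0.5")
```

The fitted intercept absorbs the Euler-Mascheroni constant and the convergent correction sum, which is why only the slope is compared with $\kappa$.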
Concrete object: the task identifies or clearly implies the file, repository, API, dataset, table, document, command, state, or other object to operate on.

[28] [28]

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving AI tasks with ChatGPT and its friends in Hugging Face.arXiv preprint arXiv:2303.17580, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[29] [29]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework.arXiv preprint arXiv:2308.00352, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [30]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anand- kumar. Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[31] [31]

Taskweaver: A code-first agent framework.arXiv preprint arXiv:2311.17541,

Bo Qiao, Liqun Li, Xu Zhang, Shilin He, Yu Kang, Chaoyun Zhang, Fangkai Yang, Hang Dong, Jue Zhang, Lu Wang, et al. TaskWeaver: A code-first agent framework.arXiv preprint arXiv:2311.17541, 2023

work page arXiv 2023

[32] [32]

MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback.arXiv preprint arXiv:2309.10691, 2023

Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback.arXiv preprint arXiv:2309.10691, 2023

work page arXiv 2023

[33] [33]

Anytool: Self-reflective, hierarchical agents for large-scale api calls.arXiv preprint arXiv:2402.04253, 2024

Yu Du, Fangyun Wei, and Hongyang Zhang. AnyTool: Self-reflective, hierarchical agents for large-scale API calls.arXiv preprint arXiv:2402.04253, 2024

work page arXiv 2024

[34] [34]

Evaluation and benchmarking of LLM agents: A survey,

Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. Evaluation and benchmarking of llm agents: A survey.arXiv preprint arXiv:2507.21504, 2025. URLhttps://arxiv.org/abs/2507.21504. evolvent.co 13 The Scaling Laws of Skills in LLM Agent Systems

work page arXiv 2025

[35] [35]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[36] [36]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[37] [37]

Emergent abilities of large language models.Transactions on Machine Learning Research, 2022

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.Transactions on Machine Learning Research, 2022

work page 2022

[38] [38]

Chatterji, Sharan Narang, Mike Lewis, and Dieuwke Hupkes

Nicholas Roberts, Niladri S. Chatterji, Sharan Narang, Mike Lewis, and Dieuwke Hupkes. Compute optimal scaling of skills: Knowledge vs reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 13295–13316, 2025. doi: 10.18653/v1/2025.findings-acl.688. URL https://aclanthology.org/ 2025.findings-acl.688/

work page doi:10.18653/v1/2025.findings-acl.688 2025

[39] [39]

Scaling laws for educational AI agents.arXiv preprint arXiv:2603.11709, 2026

Mengsong Wu, Hao Hao, Shuzhen Bi, Keqian Li, Wentao Liu, Siyu Song, Hongbo Zhao, and Aimin Zhou. Scaling laws for educational AI agents.arXiv preprint arXiv:2603.11709, 2026. URL https://arxiv.org/abs/ 2603.11709

work page arXiv 2026

[40] [40]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen- tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6769–6781, 2020

work page 2020

[41] [41]

Retrieval-augmented generation for knowledge- intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge- intensive NLP tasks. InAdvances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020

work page 2020

[42] [42]

Sentence-BERT: Sentence embeddings using siamese BERT-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019

work page 2019

[43] [43]

MTEB: Massive text embedding benchmark

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023

work page 2023

[44] [44]

William E. Hick. On the rate of gain of information.Quarterly Journal of Experimental Psychology, 4(1):11–26, 1952

work page 1952

[45] [45]

Stimulus information as a determinant of reaction time.Journal of Experimental Psychology, 45 (3):188–196, 1953

Ray Hyman. Stimulus information as a determinant of reaction time.Journal of Experimental Psychology, 45 (3):188–196, 1953

work page 1953

[46] [46]

Github - anthropics/skills: Public repository for agent skills

Anthropic. Github - anthropics/skills: Public repository for agent skills. https://github.com/anthropics/ skills, 2025. Accessed: 2026-04-17

work page 2025

[47] [47]

Claude code

Anthropic. Claude code. https://www.anthropic.com/product/claude-code, 2025. Accessed: 2026-04-16

work page 2025

[48] [48]

Introducing the model context protocol

Anthropic. Introducing the model context protocol. https://www.anthropic.com/news/ model-context-protocol, 2024. Accessed: 2026-04-16

work page 2024

[49] [49]

Github - trailofbits/skills: Trail of bits claude code skills for security research, vulnerability detection, and audit workflows.https://github.com/trailofbits/skills, 2025

Trail of Bits. Github - trailofbits/skills: Trail of bits claude code skills for security research, vulnerability detection, and audit workflows.https://github.com/trailofbits/skills, 2025. Accessed: 2026-04-16. evolvent.co 14 The Scaling Laws of Skills in LLM Agent Systems

work page 2025

[50] [50]

Claude code skills marketplace

daymade. Claude code skills marketplace. https://github.com/daymade/claude-code-skills, 2025. Accessed: 2026-04-16

work page 2025

[51] [51]

NET skills for claude code

Aaronontheweb. .NET skills for claude code. https://github.com/Aaronontheweb/dotnet-skills, 2025. Accessed: 2026-04-16

work page 2025

[52] [52]

PinchBench github organization

PinchBench Team. PinchBench github organization. https://github.com/pinchbench, 2026. Accessed: 2026-04-16

work page 2026

[53] [53]

Agentdeals.https://github.com/robhunter/agentdeals, 2026

Robhunter. Agentdeals.https://github.com/robhunter/agentdeals, 2026. Accessed: 2026-04-16

work page 2026

[54] [54]

M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation

Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 2318–2335, 2024. doi: 10.18653/v1/2024.findings-acl

work page doi:10.18653/v1/2024.findings-acl 2024

[55] [55]

URLhttps://aclanthology.org/2024.findings-acl.137/

work page 2024

[56] [56]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[57] [57]

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan.𝜏-bench: A benchmark for tool-agent-user interaction in real-world domains.arXiv preprint arXiv:2406.12045, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[58] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025.

[59] Ruixuan Xiao, Wentao Ma, Ke Wang, Yuchuan Wu, Junbo Zhao, Haobo Wang, Fei Huang, and Yongbin Li. FlowBench: Revisiting and benchmarking workflow-guided planning for LLM-based agents. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10883–10900, Miami, Florida, USA,... doi: 10.18653/v1/2024.findings-emnlp.638.

[60] OpenAI. Gpt-4o mini: advancing cost-efficient intelligence. https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/, 2024. Accessed: 2026-04-17.

[61] OpenAI. Gpt-5 system card. OpenAI, August 2025. URL https://openai.com/index/gpt-5-system-card/. Accessed: 2026-05-04.

[62] OpenAI. Introducing gpt-5.4 mini and nano. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/, 2026. Accessed: 2026-04-17.

[63] OpenAI. Introducing gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-04-17.

[64] Anthropic. Introducing claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, 2026.

[65] Anthropic. Introducing claude opus 4.6. https://www.anthropic.com/news/claude-opus-4-6, 2026.

[66] Google DeepMind. Gemini 3. Google DeepMind, 2025. URL https://deepmind.google/models/gemini/.

[67] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering, 2026.

[68] Zhipu AI. Glm-4.7: Advancing the coding capability. https://z.ai/blog/glm-4.7, 2025.

[69] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence, 2026.

[70] Moonshot AI. Kimi k2.6: From code to creation, from one to many. https://www.kimi.com/ai-models/kimi-k2-6, 2026.

[71] ByteDance Seed Team. Seed 2.0 official launch. https://seed.bytedance.com/en/blog/seed2-0-%25E6%25AD%25A3%25E5%25BC%258F%25E5%258F%2591%25E5%25B8%2583, 2026. Accessed: 2026-04-16.

[72] DeepSeek-AI. Deepseek-v4: towards highly efficient million-token context intelligence. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf, 2026.

[73] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.

Appendix

This appendix is organized as a reader’s map from theory to implementation evidence. It firs...

1. Finite-capacity margin model. For task $q$, the router assigns each skill a latent score $Z_i = m_i + \varepsilon_i$, where $m_\star - m_j = \Delta_j$ is the semantic margin of the gold skill against distractor $j$, and $\varepsilon_i$ is zero-mean sub-Gaussian noise with scale $\sigma$.

2. Rank-regular clustered library. Within a functional cluster, distractor skills can be ordered by semantic rank $r = 1, \dots, N-1$ around the target, and their effective overtake weights satisfy $w_r = \kappa/r + O(r^{-1-\xi})$ for some $\kappa, \xi > 0$ over the measured range. This is the scale-free local-crowding condition tested by the CI and semantic-gap diagnostics.

3. Logarithmic effective competition. The total effective pressure from plausible distractors is $C(N) = \sum_{j \neq \star} w_j(N) = C_0 + C_1 \ln N + o(\ln N)$ over the exposed-library range. Irrelevant distractors have negligible $w_j$; local near-misses dominate $C(N)$.

4. Small-error linearization regime. Over the measured operating range, the router is not saturated at 0 or 1, so first-order Taylor expansions of the error odds are valid.

5. State-gated rescue. Correct upstream execution produces a concrete artifact with rescue coefficient $\alpha \in [0,1]$; rescue can only occur when upstream routing/execution is correct and only to the extent that the downstream route has remaining headroom.

6. No pre-state leakage. In the no-state condition, paired routing prompts do not contain an execution artifact, so any interaction before execution is bounded by a protocol term $\epsilon_{\mathrm{proto}}$.

Lemma 1 (Clustered libraries imply logarithmic effective competition). Under Assumption 2, Assumption 3 follows with $C_1 = \kappa$.
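Lemma 1 can be checked numerically with a minimal sketch. It rests on the fact that the partial sums of $\kappa/r$ grow like $\kappa \ln N$, while a remainder of order $r^{-1-\xi}$ sums to a bounded constant. The sketch assumes a concrete remainder of the form `c_rem * r**(-1 - xi)`. The function names, the values of `kappa`, `xi`, and `c_rem`, and the library sizes are illustrative choices, not constants or code from the paper.

```python
import math


def effective_competition(n, kappa=0.5, xi=1.0, c_rem=0.2):
    """C(N) = sum over ranks r = 1..N-1 of w_r.

    Assumed weight form: w_r = kappa / r + c_rem * r**(-1 - xi).
    kappa, xi, and c_rem are hypothetical, not fitted values.
    """
    return sum(kappa / r + c_rem * r ** (-1.0 - xi) for r in range(1, n))


def fitted_log_slope(sizes, **weights):
    """Ordinary least-squares slope of C(N) regressed on ln N."""
    xs = [math.log(n) for n in sizes]
    ys = [effective_competition(n, **weights) for n in sizes]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Illustrative library sizes; the slope should approach kappa.
sizes = [100, 200, 400, 800, 1600, 3200]
slope = fitted_log_slope(sizes, kappa=0.5)
print(f"fitted C1 = {slope:.4f}, kappa = 0.5")
```

Under these assumptions, the fitted slope stays close to `kappa` whatever value `c_rem` takes, because the remainder only shifts the intercept $C_0$.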

Concrete object: the task identifies or clearly implies the file, repository, API, dataset, table, document, command, state, or other object to operate on.