ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Bill Qian; Dahai Li; Jie Zhou; Kunlun Zhu; Lan Yan; Lauren Hong; Maosong Sun; Mark Gerstein; Runchu Tian; Ruobing Xie

arXiv: 2307.16789 · v2 · submitted 2023-07-31 · cs.AI · cs.CL · cs.LG

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun

Pith reviewed 2026-05-24 07:46 UTC · model grok-4.3

classification: cs.AI · cs.CL · cs.LG

keywords: tool use, large language models, APIs, instruction tuning, ToolBench, ToolLLaMA, generalization, multi-tool scenarios

The pith

Fine-tuning LLaMA on ChatGPT-generated solution paths for more than 16,000 APIs yields an open model that performs comparably to ChatGPT on complex tool-use tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that open-source LLMs can acquire strong tool-use abilities by training on a large automatically constructed dataset of real-world API instructions. It collects over 16,000 RESTful APIs, uses ChatGPT to create diverse single-tool and multi-tool instructions along with valid solution paths, and introduces a depth-first search decision tree to improve reasoning during both data annotation and model inference. The resulting ToolLLaMA model, paired with an API retriever, executes complex instructions, generalizes to APIs absent from training, and reaches performance levels comparable to ChatGPT on an automatic evaluator. This matters because it shows a route for open models to handle external function calls without relying on closed systems at inference time. The same model also shows zero-shot transfer to a separate out-of-distribution tool-use benchmark.

Core claim

ToolLLaMA, produced by fine-tuning LLaMA on the ToolBench dataset constructed via ChatGPT and equipped with a neural API retriever, executes complex instructions involving chains of API calls and generalizes to unseen APIs, achieving performance comparable to ChatGPT while also demonstrating strong zero-shot results on the APIBench dataset.

What carries the argument

The depth-first search-based decision tree algorithm that lets the model evaluate multiple candidate reasoning traces and expand the search space to locate valid sequences of API calls.
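
As a rough illustration of that idea, the sketch below runs a depth-first search over chains of API calls and backtracks when a branch fails. It is a minimal sketch under stated assumptions, not the paper's algorithm: `propose` and `is_solution` are hypothetical stand-ins for the LLM calls that suggest next steps and judge completion, the API names are invented, and the prompts and node-expansion rules of the original method are not reproduced.

```python
from typing import Callable, List, Optional, Sequence


def dfs_solution_path(
    propose: Callable[[Sequence[str]], List[str]],
    is_solution: Callable[[Sequence[str]], bool],
    max_depth: int = 4,
) -> Optional[List[str]]:
    """Depth-first search over chains of API calls.

    propose(path)     -> candidate next calls (hypothetical stand-in for an LLM).
    is_solution(path) -> True when the chain fulfils the instruction.
    Returns the first valid chain found, or None if the tree is exhausted.
    """

    def visit(path: List[str]) -> Optional[List[str]]:
        if is_solution(path):
            return list(path)
        if len(path) >= max_depth:
            return None  # dead end: too deep, so backtrack
        for call in propose(path):
            path.append(call)
            found = visit(path)
            path.pop()  # undo the step before trying a sibling branch
            if found is not None:
                return found
        return None

    return visit([])


# Toy example with invented API names: the first branch fails,
# so the search backtracks and tries the second one.
def toy_propose(path: Sequence[str]) -> List[str]:
    options = {
        (): ["search_flights", "get_weather"],
        ("search_flights",): ["book_hotel"],
        ("get_weather",): ["suggest_activity"],
    }
    return options.get(tuple(path), [])


def toy_is_solution(path: Sequence[str]) -> bool:
    return list(path) == ["get_weather", "suggest_activity"]


result = dfs_solution_path(toy_propose, toy_is_solution)
print(result)
```

A single left-to-right trace would stop at the failed `search_flights` branch; the backtracking step is what lets the search reach the second branch at all.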

If this is right

Open models can handle both single-API and multi-API instructions across dozens of categories without access to proprietary systems at runtime.
Automatic construction of tool-use training data scales to thousands of real RESTful APIs spanning many domains.
A dedicated search algorithm during inference improves the model's ability to find correct API sequences compared to standard prompting.
Zero-shot generalization holds on separate tool-use benchmarks that differ in distribution from the training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This training recipe could be iterated by using the fine-tuned model itself to generate higher-quality paths for further rounds of data creation.
The same data-construction and search approach might transfer to other sequential decision tasks that require calling external functions or services.
Widespread adoption would lower the barrier for building applications that combine language models with live web services and databases.

Load-bearing premise

The solution paths and instructions generated by ChatGPT are accurate, diverse, and free of systematic errors that would mislead the fine-tuned open model.

What would settle it

Human verification of a sample of ToolLLaMA outputs on instructions where the training solution paths contain undetected errors, showing success rates well below ChatGPT levels.

Original abstract

Despite the advancements of open-source large language models (LLMs), e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ToolLLM scales tool-use data to 16k real APIs via ChatGPT, but the supervision loop creates a real circularity risk.

The core contribution is a concrete pipeline that pulls 16,464 real REST APIs, prompts ChatGPT to write single- and multi-tool instructions, then uses a DFS decision-tree search to annotate solution paths before fine-tuning LLaMA plus a retriever. That combination of scale and automatic multi-trace search is new at this size and gives an open model that reportedly matches ChatGPT on their internal benchmarks while showing some zero-shot transfer to APIBench. The engineering is straightforward and the release of ToolBench plus the model is useful for anyone who wants a larger starting point than the smaller hand-curated sets that came before.

The obvious weakness is that both the training targets and the automatic evaluator are generated by the same closed model. No human validation, no execution checks against live APIs, and no diversity or error-rate numbers are described, so the reported parity could partly reflect imitation of ChatGPT’s biases rather than independent competence. The out-of-distribution claim inherits the same limitation. If the full paper adds independent verification or shows that the DFS paths are measurably better than single-shot ChatGPT outputs, the concern shrinks; otherwise it stays central.

This is worth sending to referees who work on tool-augmented agents. They can pressure the authors on the evaluation gap and decide whether the scale alone justifies the work even with noisy labels. I would bring it to a reading group for the data-construction details but would not cite the performance numbers without further checks.

Referee Report

3 major / 2 minor

Summary. The paper introduces the ToolLLM framework to improve open-source LLMs' tool-use capabilities. It constructs the ToolBench dataset by collecting 16,464 real-world RESTful APIs, using ChatGPT to generate diverse instructions (single- and multi-tool), and annotating solution paths via a novel depth-first search decision tree algorithm. LLaMA is fine-tuned into ToolLLaMA equipped with a neural API retriever; an automatic evaluator ToolEval is developed. Experiments claim that ToolLLaMA executes complex instructions, generalizes to unseen APIs, matches ChatGPT performance, and shows strong zero-shot results on the out-of-distribution APIBench dataset.

Significance. If the central claims hold after addressing verification concerns, the work would be significant for demonstrating a scalable, largely automatic pipeline to create large-scale tool-use supervision and for showing that open models can reach closed-source levels on complex, multi-tool tasks. The DFS-based search and ToolBench scale are notable technical contributions that could be reused if the data quality is independently validated.

major comments (3)

[§3.3] Solution Path Annotation: The DFS-based decision tree relies on ChatGPT to produce and validate solution paths, yet no human review, execution success rate against live APIs, or error analysis is reported; this is load-bearing because ToolLLaMA is trained directly on these paths and must later operate without ChatGPT.
[§5.1] ToolEval: The automatic evaluator is used to claim comparability with ChatGPT, but its correlation with human judgments is not quantified (e.g., via Cohen's kappa or agreement rates on a held-out set); without this, the performance numbers cannot be interpreted as independent evidence of tool-use mastery.
[§5.2] Generalization Experiments: Both the in-distribution ToolBench results and the out-of-distribution APIBench results inherit supervision from the same ChatGPT-generated paths; the generalization claim therefore requires an explicit control (e.g., comparison against a model trained on human-annotated paths or an error-injection study) to distinguish reproduction of teacher behavior from genuine tool-use competence.
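
For readers unfamiliar with the agreement statistic named in the §5.1 comment, the sketch below computes Cohen's kappa between an automatic evaluator and a human annotator. It is an illustration only: the pass/fail labels are invented for the example and are not results from the paper or from ToolEval.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels for eight instructions (not data from the paper).
evaluator = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

kappa = cohens_kappa(evaluator, human)
print(f"raw agreement = 0.75, kappa = {kappa:.3f}")
```

In this toy case the raters agree on 6 of 8 items (75%), but kappa is noticeably lower because much of that agreement is expected by chance when both raters say "pass" most of the time. That gap is why the comment asks for kappa rather than raw agreement alone.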

minor comments (2)

[§3.2] The exact split between single-tool and multi-tool instructions in ToolBench is not tabulated; adding a breakdown table would clarify the diversity of the training distribution.
[§4] Notation for the neural API retriever (e.g., how top-k retrieval is combined with the LLM input) is described only at a high level; a short pseudocode block or equation would improve reproducibility.
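
To make the §4 request concrete, the sketch below shows one plausible way top-k retrieval could feed the LLM input: score every API description against the instruction, keep the k best, and prepend them to the prompt. Everything here is an assumption for illustration. The bag-of-words `embed` function is a toy stand-in for a learned dense encoder, and the API names, descriptions, and prompt layout are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a learned dense encoder."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(instruction: str, api_docs: Dict[str, str], k: int = 2) -> List[str]:
    """Return the names of the k API descriptions most similar to the instruction."""
    query = embed(instruction)
    ranked = sorted(
        api_docs,
        key=lambda name: cosine(query, embed(api_docs[name])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction: str, api_docs: Dict[str, str], k: int = 2) -> str:
    """Prepend the retrieved API descriptions to the instruction (hypothetical layout)."""
    names = retrieve_top_k(instruction, api_docs, k)
    lines = [f"- {name}: {api_docs[name]}" for name in names]
    return "Available APIs:\n" + "\n".join(lines) + f"\n\nInstruction: {instruction}"


# Hypothetical API catalogue for the example.
docs = {
    "weather_forecast": "get the weather forecast for a city",
    "flight_search": "search flights between two airports",
    "currency_convert": "convert an amount between two currencies",
}
instruction = "what is the weather forecast for paris"

top = retrieve_top_k(instruction, docs, k=1)
prompt = build_prompt(instruction, docs, k=2)
print(top)
print(prompt)
```

The point of the sketch is the interface: the retriever narrows thousands of candidate APIs to a short list, so the model only has to choose among the k descriptions that fit in its context window.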

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with proposed revisions to improve the manuscript's rigor where feasible.

Referee: [§3.3] Solution Path Annotation: The DFS-based decision tree relies on ChatGPT to produce and validate solution paths, yet no human review, execution success rate against live APIs, or error analysis is reported; this is load-bearing because ToolLLaMA is trained directly on these paths and must later operate without ChatGPT.

Authors: We acknowledge the importance of validating the automatically generated paths. The DFS algorithm relies on ChatGPT for exploration and validation at scale, which precluded exhaustive human review or live-API execution for all instances in the original work. In revision we will add a dedicated error analysis subsection reporting (i) the fraction of instructions for which the DFS successfully returned a path and (ii) results of manual inspection on a random sample of 100 paths, together with a discussion of observed failure modes. We will also note the practical constraints of live-API validation at this scale.
Revision: partial

Referee: [§5.1] ToolEval: The automatic evaluator is used to claim comparability with ChatGPT, but its correlation with human judgments is not quantified (e.g., via Cohen's kappa or agreement rates on a held-out set); without this, the performance numbers cannot be interpreted as independent evidence of tool-use mastery.

Authors: We agree that quantifying ToolEval's agreement with humans is necessary for interpreting the reported numbers. We will conduct a human evaluation on a held-out set of 200 instructions, obtain independent ratings from multiple annotators, and report agreement metrics including Cohen's kappa and raw agreement rates in the revised manuscript. This addition will directly address the concern.
Revision: yes

Referee: [§5.2] Generalization Experiments: Both the in-distribution ToolBench results and the out-of-distribution APIBench results inherit supervision from the same ChatGPT-generated paths; the generalization claim therefore requires an explicit control (e.g., comparison against a model trained on human-annotated paths or an error-injection study) to distinguish reproduction of teacher behavior from genuine tool-use competence.

Authors: The referee correctly notes that both training and evaluation ultimately trace back to ChatGPT-generated supervision. While an explicit control experiment (human-annotated paths or error injection) would be the strongest disambiguation, constructing such a dataset at the scale of ToolBench is resource-prohibitive. We will nevertheless expand §5.2 with additional discussion of this limitation, emphasize the fully out-of-distribution character of APIBench (different APIs, different instruction distribution, no overlap with our generation pipeline), and present the zero-shot results as supporting, albeit indirect, evidence of generalization beyond simple reproduction.
Revision: partial
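
The proposed manual inspection of 100 sampled paths raises a practical question: how tightly does a sample of that size pin down the true validity rate? The sketch below computes a Wilson score interval for a binomial proportion. The count of 90 valid paths is a hypothetical placeholder chosen for the arithmetic, not a figure reported by the authors.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical outcome: 90 of 100 inspected paths judged valid.
low, high = wilson_interval(90, 100)
print(f"95% interval for the validity rate: {low:.3f} to {high:.3f}")
```

Under that hypothetical outcome the interval spans roughly twelve percentage points, which suggests a 100-path sample can rule out gross label noise but cannot distinguish, say, 85% from 93% validity.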

Circularity Check

1 step flagged

ToolLLaMA comparability to ChatGPT partly forced by training on ChatGPT-generated solution paths

specific steps

fitted input called prediction [Abstract]
"we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. ... Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA ... Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT."

Solution paths used as training targets are generated entirely by ChatGPT (via prompting + DFS search); the subsequent claim that the fine-tuned model achieves comparable performance therefore measures how faithfully the student reproduces the teacher's outputs on the same distribution, rather than demonstrating independent tool-use competence.

full rationale

The paper constructs its entire training corpus (ToolBench) by prompting ChatGPT for both instructions and solution paths, then fine-tunes LLaMA on those paths and reports that the resulting model reaches 'comparable performance to ChatGPT' while generalizing to unseen APIs. This matches the 'fitted_input_called_prediction' pattern: the supervision targets are produced by the reference model, so measured parity is partly a reproduction of the teacher's behavior rather than an independent derivation. The out-of-distribution APIBench result supplies limited external grounding, keeping the circularity partial rather than total (score 6, not 8-10). No self-citation or definitional loops are present.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework depends on the unverified quality of LLM-generated supervision and on the assumption that automatic evaluation captures true tool-use competence.

axioms (1)

domain assumption: ChatGPT can reliably generate diverse, correct instructions and valid multi-step API solution paths at scale
Invoked in the three-stage ToolBench construction process described in the abstract.

invented entities (2)

ToolBench (no independent evidence)
purpose: Large-scale instruction-tuning corpus for tool use
Automatically constructed via ChatGPT; no independent verification mentioned.
ToolEval (no independent evidence)
purpose: Automatic metric for tool-use success
Developed internally based on ToolBench; correlation with human judgment unspecified.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
cs.CL 2026-05 unverdicted novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
cs.CR 2026-04 unverdicted novelty 8.0

Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.
MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
cs.SE 2026-01 accept novelty 8.0

MCP-Atlas is a new benchmark with 1000 tasks on production MCP servers that uses claim-level scoring to evaluate LLM agents on realistic multi-step tool-use competency.
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
cs.CR 2024-06 unverdicted novelty 8.0

AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
cs.CL 2026-05 unverdicted novelty 7.0

Boiling the Frog is a new stateful multi-turn benchmark for agentic safety that reports an aggregate strict attack success rate of 44.4% across nine models, with rates ranging from 20.5% to 92.9% depending on the mode...
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
cs.CL 2026-05 unverdicted novelty 7.0

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents
cs.AI 2026-05 unverdicted novelty 7.0

Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.
PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations
cs.GT 2026-05 accept novelty 7.0

PrefBench benchmark shows zero-shot LLMs achieve deal rates above 0.99 but seller profits only slightly above random and far below a simple concession heuristic across 7,500 episodes.
Do Coding Agents Understand Least-Privilege Authorization?
cs.CR 2026-05 unverdicted novelty 7.0

Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15...
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 unverdicted novelty 7.0

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...
SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems
cs.SE 2026-05 unverdicted novelty 7.0

SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...
RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
cs.AI 2026-05 unverdicted novelty 7.0

RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.
SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces
cs.AI 2026-05 unverdicted novelty 7.0

SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on S...
RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents
cs.IR 2026-05 unverdicted novelty 7.0

RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.
TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 7.0

TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over str...
LLM Agents Already Know When to Call Tools -- Even Without Reasoning
cs.CL 2026-05 accept novelty 7.0

LLM agents encode tool necessity in pre-generation hidden states with high linear decodability (AUROC 0.89-0.96); Probe&Prefill uses this to reduce tool calls 48% with 1.7% accuracy loss.
LLM Agents Already Know When to Call Tools -- Even Without Reasoning
cs.CL 2026-05 conditional novelty 7.0

LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.
RewardHarness: Self-Evolving Agentic Post-Training
cs.AI 2026-05 unverdicted novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
Why Retrying Fails: Context Contamination in LLM Agent Pipelines
cs.AI 2026-05 conditional novelty 7.0

A Context-Contaminated Restart Model derives exact success probabilities and an optimal pipeline depth T* = sqrt(B * log(1/(1-ε1)) / log(1/(1-ε0))) for fixed budget B, validated on SWE-bench where it fits data far bet...
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
cs.SE 2026-05 unverdicted novelty 7.0

TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data
cs.AI 2026-04 unverdicted novelty 7.0

TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design matter...
Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents
cs.CR 2026-04 unverdicted novelty 7.0

A parameterized DFA firewall enforces safe tool sequences for structured AI agents, reducing attack success rates to 2.2% in tested workflows with low added latency.
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
cs.CL 2026-04 accept novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49%
cs.SE 2026-04 unverdicted novelty 7.0

Adding product context retrieval to AI coding agents raises decision compliance from 46% to 95% on a new benchmark of 8 tasks with 41 weighted decision points.
Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft
cs.AI 2026-04 unverdicted novelty 7.0

Current AI agents achieve only 26% success on SciCrafter's redstone tasks requiring causal discovery and application, indicating the discovery-to-application loop remains challenging with shifting bottlenecks.
Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents
cs.SE 2026-04 unverdicted novelty 7.0

TraceToChain models LLM agent traces as absorbing DTMCs using automatic clustering and smoothed MLE, with KS and AIC validation, to reconcile pass@k, pass^k, and RDC as projections of a single first-passage success-ti...
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
AgenTEE: Confidential LLM Agent Execution on Edge Devices
cs.CR 2026-04 unverdicted novelty 7.0

AgenTEE isolates LLM agent runtime, inference, and apps in independently attested cVMs on Arm-based edge devices, achieving under 5.15% overhead versus commodity OS deployments.
Autogenesis: A Self-Evolving Agent Protocol
cs.AI 2026-04 unverdicted novelty 7.0

Autogenesis Protocol defines structured resource management and closed-loop self-evolution for multi-agent LLM systems, with the resulting AGS showing gains over baselines on long-horizon benchmarks.
SemiFA: An Agentic Multi-Modal Framework for Autonomous Semiconductor Failure Analysis Report Generation
cs.CV 2026-04 unverdicted novelty 7.0

SemiFA is a four-agent LangGraph pipeline that combines DINOv2 and LLaVA image analysis with SECS/GEM telemetry and vector retrieval to produce complete FA reports in 48 seconds.
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
cs.MA 2026-04 unverdicted novelty 7.0

VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
cs.AI 2026-04 unverdicted novelty 7.0

FinTrace supplies trajectory-level metrics for LLM financial tool calling, exposing gaps in information use and output quality, while its preference dataset enables DPO training that boosts intermediate metrics.
Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0

COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...
SAGE: A Service Agent Graph-guided Evaluation Benchmark
cs.AI 2026-04 unverdicted novelty 7.0

SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...
IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling
cs.AI 2026-04 unverdicted novelty 7.0

IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents
cs.CL 2026-04 unverdicted novelty 7.0

EpiBench is a new episodic multi-turn multimodal benchmark where even leading AI agents score only 29.23% on hard tasks requiring cross-paper evidence integration from figures and tables.
ClawArena: Benchmarking AI Agents in Evolving Information Environments
cs.LG 2026-04 unverdicted novelty 7.0

ClawArena introduces a benchmark with hidden ground truth, noisy multi-channel traces, and 14-category questions to evaluate multi-source conflict reasoning, dynamic belief revision, and implicit personalization in AI agents.
SKILLFOUNDRY: Building Self-Evolving Agent Skill Libraries from Heterogeneous Scientific Resources
cs.AI 2026-04 unverdicted novelty 7.0

SkillFoundry mines heterogeneous scientific resources into a self-evolving library of validated agent skills, with 71.1% novelty versus prior libraries and measurable gains on coding benchmarks plus two genomics tasks.
PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis
eess.SY 2026-03 unverdicted novelty 7.0

PowerDAG achieves 94-100% success on unseen distribution grid analysis queries by combining adaptive retrieval with similarity-decay cutoff and just-in-time supervision, outperforming ReAct, LangChain, and CrewAI baselines.
GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration
cs.AI 2026-03 unverdicted novelty 7.0

GraphBit is a DAG-based engine-orchestrated framework for agentic LLMs that achieves 67.6% accuracy with zero hallucinations on GAIA benchmarks.
Watermarking LLM Agent Trajectories
cs.CR 2026-02 unverdicted novelty 7.0

ActHook watermarks LLM agent trajectories by embedding key-activated hook actions for black-box detection at 94.3 AUC with negligible performance degradation.
MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
cs.SE 2026-01 unverdicted novelty 7.0

MCP-Atlas introduces a benchmark of 36 real MCP servers, 220 tools, and 1,000 natural-language tasks to measure LLM tool-use competency in multi-server workflows.
Agentic Search in the Wild: Intents and Trajectory Dynamics from 14M+ Real Search Requests
cs.IR 2026-01 unverdicted novelty 7.0

Large-scale log study of 14M+ agentic searches finds short sessions, intent-specific repetition patterns, and that 54% of new query terms trace to prior retrieved evidence.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
Dynamic Tool Dependency Retrieval for Lightweight Function Calling
cs.LG 2025-12 unverdicted novelty 7.0

DTDR dynamically retrieves relevant tools by modeling dependencies from demonstrations and conditioning on the evolving agent plan, improving function calling success rates by 23-104% over static retrievers across benchmarks.
ToolPRM: Fine-Grained Inference Scaling of Structured Outputs for Function Calling
cs.AI 2025-10 unverdicted novelty 7.0

ToolPRM provides fine-grained intra-call process supervision via a new dataset and reward model, outperforming outcome and coarse-grained alternatives on function-calling benchmarks.
Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data
cs.LG 2025-09 unverdicted novelty 7.0

Reasoning LLMs with minimal tools for tree construction and analysis induce decision trees that outperform CART, compete with ensembles on low-resource tabular data, and provide human-readable reasoning traces.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
Prompt Injection Attack to Tool Selection in LLM Agents
cs.CR 2025-04 conditional novelty 7.0

ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
cs.CL 2026-05 unverdicted novelty 6.0

SynAE is a multi-metric framework that evaluates how well synthetic benchmarks replicate real data characteristics for multi-turn tool-calling agent testing.
Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
cs.LG 2026-05 unverdicted novelty 6.0

Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and...
PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
cs.AI 2026-05 unverdicted novelty 6.0

PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.
The Scaling Laws of Skills in LLM Agent Systems
cs.CL 2026-05 unverdicted novelty 6.0

Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations...
Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents
cs.LG 2026-05 unverdicted novelty 6.0

LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...
SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference
cs.LG 2026-05 unverdicted novelty 6.0

Spherical KV introduces angle-domain attention with spherical key parameterization and rate-distortion retention to cut KV cache residency while preserving efficient paged decoding.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
cs.AI 2026-05 unverdicted novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
cs.CR 2026-05 unverdicted novelty 6.0

SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
cs.AI 2026-05 unverdicted novelty 6.0

FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.