super hub Canonical reference

Toolformer: Language Models Can Teach Themselves to Use Tools

Jane Dwivedi-Yu, Luke Zettlemoyer, Maria Lomeli, Roberta Raileanu, Timo Schick · 2023 · cs.CL · arXiv 2302.04761

Canonical reference. 87% of citing Pith papers cite this work as background.

192 Pith papers citing it

Background 87% of classified citations

open full Pith review browse 192 citing papers more from Jane Dwivedi-Yu arXiv PDF

abstract

Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q\&A system, two different search engines, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 44 baseline 1 dataset 1 method 1

citation-polarity summary

background 41 support 2 unclear 2 baseline 1 use method 1

claims ledger

abstract Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. Thi

authors

Jane Dwivedi-Yu Luke Zettlemoyer Maria Lomeli Roberta Raileanu Roberto Dess\`i Timo Schick

co-cited works

representative citing papers

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 3 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

The Khipu Problem: Institutional Legibility Under Distributed Cognition

cs.CY · 2026-05-06 · unverdicted · novelty 8.0

The khipu problem frames a governance failure in distributed AI where interpretive continuity is lost even when traces remain, requiring infrastructure to preserve reading practices rather than only data retention.

Agentic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator

physics.acc-ph · 2025-09-21 · unverdicted · novelty 8.0

A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.

RepairAgent: An Autonomous, LLM-Based Agent for Program Repair

cs.SE · 2024-03-25 · conditional · novelty 8.0

RepairAgent autonomously repairs 164 bugs on Defects4J including 39 not fixed by prior techniques by treating an LLM as an agent that invokes tools via a finite state machine and dynamic prompts.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Mind2Web: Towards a Generalist Agent for the Web

cs.CL · 2023-06-09 · accept · novelty 8.0

Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

cs.CL · 2023-04-14 · conditional · novelty 8.0

API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

TO-Master: an LLM-agent framework for automated topology optimization

cs.CE · 2026-07-02 · unverdicted · novelty 7.0

TO-Master is an LLM agent framework that orchestrates finite-element topology optimization from conversational inputs, supporting 2D/3D compliance, thermal, stress-constrained, and multi-load cases while reproducing benchmarks without user code.

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

TRIAGE augments GRPO with role-typed segment rewards derived from a judge that detects regression and exploration, yielding higher success rates and fewer turns on ALFWorld, Search-QA, and WebShop.

Entity Binding Failures in Tool-Augmented Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

The paper defines entity binding failures as a distinct error category in tool-augmented agents separate from tool selection errors and evaluates entity-aware mechanisms that eliminate such failures in a controlled diagnostic setting.

Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents

cs.CR · 2026-06-08 · unverdicted · novelty 7.0

Introduces brain-prompt injection attacks on BCI-to-LLM agent pipelines and defines a Route-Safety Audit Contract with separation theorem, C3 decomposition, and split-conformal calibration to bound false-accept rates on EEG data.

Lightfall: An API-first, LLM-addressable control platform for synchrotron beamlines

physics.ins-det · 2026-06-04 · unverdicted · novelty 7.0

Lightfall provides an API-first, LLM-addressable control platform for synchrotron beamlines enabling natural language extensions by scientists through plugin skills.

AIP: A Graph Representation for Learning and Governing Agent Skills

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

AIP models skills as graphs of discrete steps connected by typed I/O edges under a validated schema, raising agent mean reward from 0.60 to 0.71 and pass rate from 53% to 67% on 27 SkillsBench tasks while enabling node-level fixes.

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

Sandboxed coding agents with text+image access match or outperform native omnimodal models on audio-video benchmarks by converting tasks into code-driven retrieval and processing.

Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets

cs.SE · 2026-05-24 · unverdicted · novelty 7.0

Tool schema compression by 44-50% enables agentic RAG at 8K context where uncompressed schemas fail, with +20.5 pp exact match lift across models and scaling to over 800 tools.

Hylos: Operability Contracts for Model-Native Spatial Intelligence

cs.AI · 2026-05-23 · unverdicted · novelty 7.0

Hylos proposes operability contracts and SpatialTransactions to maintain scene-scale state and validate changes in model-generated 3D, shifting evaluation from visual quality to practical operability.

MemGym: a Long-Horizon Memory Environment for LLM Agents

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

LQM-ContextRoute routes LLM tool calls via latency-quality matching in a contextual bandit, improving F1 by 2.18 pp, accuracy by up to 18 pp, and NDCG by 2.91-3.22 pp over SW-UCB on web-search, StrategyQA, and retriever benchmarks.

Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

cs.MA · 2026-05-05 · unverdicted · novelty 7.0

MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

BIM Information Extraction Through LLM-based Adaptive Exploration

cs.CL · 2026-05-03 · unverdicted · novelty 7.0

LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.

Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey

cs.SE · 2026-05-02 · unverdicted · novelty 7.0

Developers use LLMs like ChatGPT mainly for knowledge acquisition and code generation at the detailed design level, reporting benefits such as better technology selection and early flaw detection alongside limitations like lengthy outputs, incorrect code, and hallucinations.

citing papers explorer

Showing 40 of 40 citing papers after filters.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines cs.CL · 2023-10-05 · conditional · none · ref 48 · internal anchor
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
Mind2Web: Towards a Generalist Agent for the Web cs.CL · 2023-06-09 · accept · none · ref 29 · internal anchor
Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs cs.CL · 2023-04-14 · conditional · none · ref 14 · internal anchor
API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.
Sandboxed Coding Agents are Competitive Omni-modal Task Solvers cs.CL · 2026-05-30 · unverdicted · none · ref 5 · internal anchor
Sandboxed coding agents with text+image access match or outperform native omnimodal models on audio-video benchmarks by converting tasks into code-driven retrieval and processing.
MemGym: a Long-Horizon Memory Environment for LLM Agents cs.CL · 2026-05-20 · unverdicted · none · ref 40 · internal anchor
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
BIM Information Extraction Through LLM-based Adaptive Exploration cs.CL · 2026-05-03 · unverdicted · none · ref 42 · internal anchor
LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems cs.CL · 2026-04-19 · unverdicted · none · ref 13 · 2 links · internal anchor
Compositional selective specificity (CSS) decomposes generated answers into claims and emits each at the most specific level supported by evidence, raising overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity.
Transactional Attention: Semantic Sponsorship for KV-Cache Retention cs.CL · 2026-04-13 · unverdicted · none · ref 19 · internal anchor
Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 63 · 2 links · internal anchor
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models cs.CL · 2023-07-20 · unverdicted · none · ref 4 · internal anchor
SciBench shows current LLMs reach at most 43.22% accuracy on curated collegiate scientific problems and reveals no prompting strategy dominates across all required skills.
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks cs.CL · 2022-11-22 · unverdicted · none · ref 27 · internal anchor
PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents cs.CL · 2026-06-10 · unverdicted · none · ref 12 · internal anchor
Autopilot enforces verifiable termination via a gated FSM scheduler and hard floor, proving that termination implies goal achievement under gate soundness, floor enforcement, and plan coverage, while cutting fabrication rates to 0.95% vs. 8-25% in baselines on 3150 paired cells including SWE-bench L
SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories cs.CL · 2026-05-31 · unverdicted · none · ref 55 · internal anchor
SkillAdaptor introduces step-level failure attribution and targeted skill updates for LLM agents, yielding performance gains on WebShop, PinchBench, and Claw-Eval benchmarks.
PaT: Planning-after-Trial for Efficient Test-Time Code Generation cs.CL · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
PaT defers planning until after failed trials in LLM code generation, enabling heterogeneous cheap-plus-powerful model setups that match large-model performance at roughly 69% lower cost.
BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models cs.CL · 2026-05-07 · unverdicted · none · ref 69 · internal anchor
BioTool dataset enables fine-tuning a 4B-parameter LLM to outperform GPT-5.1 in biomedical tool calling while improving downstream answer quality per human experts.
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills cs.CL · 2026-04-27 · unverdicted · none · ref 23 · 2 links · internal anchor
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning cs.CL · 2026-04-09 · unverdicted · none · ref 19 · internal anchor
ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion cs.CL · 2026-04-07 · conditional · none · ref 2 · internal anchor
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
Querying Structured Data Through Natural Language Using Language Models cs.CL · 2026-04-03 · conditional · none · ref 15 · internal anchor
Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.
Measuring Representation Robustness in Large Language Models for Geometry cs.CL · 2026-04-03 · unverdicted · none · ref 28 · internal anchor
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capacity models.
ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints cs.CL · 2026-02-24 · unverdicted · none · ref 3 · internal anchor
ToolMATH converts MATH solutions into controlled tool environments with gold tools and graded distractors to diagnose LLM adaptability, robustness, and long-horizon tool connectivity.
MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding cs.CL · 2025-10-09 · unverdicted · none · ref 44 · internal anchor
MOSAIC is a training-free multi-agent LLM framework with rationale, coding, reflection, and debugging agents plus a consolidated context window that outperforms prior methods on scientific coding benchmarks.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents cs.CL · 2024-10-30 · unverdicted · none · ref 85 · internal anchor
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
Learning to Ask: When LLM Agents Meet Unclear Instruction cs.CL · 2024-08-31 · unverdicted · none · ref 22 · internal anchor
Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection cs.CL · 2023-10-17 · unverdicted · none · ref 149 · internal anchor
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving cs.CL · 2023-09-29 · conditional · none · ref 38 · internal anchor
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
Gorilla: Large Language Model Connected with Massive APIs cs.CL · 2023-05-24 · conditional · none · ref 33 · internal anchor
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
Reasoning with Language Model is Planning with World Model cs.CL · 2023-05-24 · unverdicted · none · ref 163 · internal anchor
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models cs.CL · 2023-05-23 · conditional · none · ref 15 · internal anchor
ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face cs.CL · 2023-03-30 · unverdicted · none · ref 24 · internal anchor
HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
Language Models can Solve Computer Tasks cs.CL · 2023-03-30 · accept · none · ref 59 · internal anchor
Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.
ART: Automatic multi-step reasoning and tool-use for large language models cs.CL · 2023-03-16 · unverdicted · none · ref 146 · internal anchor
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs cs.CL · 2026-05-19 · unverdicted · none · ref 33 · internal anchor
Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.
InternLM2 Technical Report cs.CL · 2024-03-26 · unverdicted · none · ref 196 · internal anchor
InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations cs.CL · 2023-12-07 · unverdicted · none · ref 8 · internal anchor
Llama Guard is an instruction-tuned Llama2-7b model that performs multi-class safety classification on prompts and responses, matching or exceeding existing moderation tools on benchmarks while supporting taxonomy customization.
Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study cs.CL · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
In a 30-task SkillsBench study, skill availability boosts GPT-5.5 and DeepSeek V4-Flash agent pass rates substantially while presentation-granularity variations yield small uncertain effects.
Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning cs.CL · 2026-05-18 · unverdicted · none · ref 9 · internal anchor
QLoRA fine-tuning on tool-use data enables 4B-parameter models to perform structured planning without tool catalogs in prompts, outperforming informed baselines on AssetOpsBench while reducing input length by 82.6%.
Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models cs.CL · 2026-04-22 · unverdicted · none · ref 46 · internal anchor
A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.
Retrieval-Augmented Generation for Large Language Models: A Survey cs.CL · 2023-12-18 · unverdicted · none · ref 108 · internal anchor
A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.
A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 82 · internal anchor
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Toolformer: Language Models Can Teach Themselves to Use Tools

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer