super hub Canonical reference

Toolformer: Language Models Can Teach Themselves to Use Tools

Jane Dwivedi-Yu, Luke Zettlemoyer, Maria Lomeli, Roberta Raileanu, Timo Schick · 2023 · cs.CL · arXiv 2302.04761

Canonical reference. 87% of citing Pith papers cite this work as background.

149 Pith papers citing it

Background 87% of classified citations

open full Pith review browse 149 citing papers more from Jane Dwivedi-Yu arXiv PDF

abstract

Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q\&A system, two different search engines, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 44 baseline 1 dataset 1 method 1

citation-polarity summary

background 41 support 2 unclear 2 baseline 1 use method 1

claims ledger

abstract Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. Thi

authors

Jane Dwivedi-Yu Luke Zettlemoyer Maria Lomeli Roberta Raileanu Roberto Dess\`i Timo Schick

co-cited works

representative citing papers

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 3 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

Agentic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator

physics.acc-ph · 2025-09-21 · unverdicted · novelty 8.0

A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.

RepairAgent: An Autonomous, LLM-Based Agent for Program Repair

cs.SE · 2024-03-25 · conditional · novelty 8.0

RepairAgent autonomously repairs 164 bugs on Defects4J including 39 not fixed by prior techniques by treating an LLM as an agent that invokes tools via a finite state machine and dynamic prompts.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Mind2Web: Towards a Generalist Agent for the Web

cs.CL · 2023-06-09 · accept · novelty 8.0

Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

cs.CL · 2023-04-14 · conditional · novelty 8.0

API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

cs.CL · 2026-05-30 · unverdicted · novelty 7.0

Sandboxed coding agents with text+image access match or outperform native omnimodal models on audio-video benchmarks by converting tasks into code-driven retrieval and processing.

MemGym: a Long-Horizon Memory Environment for LLM Agents

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

cs.MA · 2026-05-05 · unverdicted · novelty 7.0

MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

BIM Information Extraction Through LLM-based Adaptive Exploration

cs.CL · 2026-05-03 · unverdicted · novelty 7.0

LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.

Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey

cs.SE · 2026-05-02 · unverdicted · novelty 7.0

Developers use LLMs like ChatGPT mainly for knowledge acquisition and code generation at the detailed design level, reporting benefits such as better technology selection and early flaw detection alongside limitations like lengthy outputs, incorrect code, and hallucinations.

Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents

cs.SE · 2026-04-27 · unverdicted · novelty 7.0

TraceToChain models LLM agent traces as absorbing DTMCs using automatic clustering and smoothed MLE, with KS and AIC validation, to reconcile pass@k, pass^k, and RDC as projections of a single first-passage success-time distribution.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

The Platform Is Mostly Not a Platform: Token Economies and Agent Discourse on Moltbook

cs.CY · 2026-04-23 · unverdicted · novelty 7.0

Moltbook operates as two largely separate layers: a dominant transactional token economy using protocols like MBC-20 and a thinner discursive conversation layer with only 3.6% agent overlap.

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems

cs.CL · 2026-04-19 · unverdicted · novelty 7.0 · 2 refs

Compositional selective specificity (CSS) decomposes generated answers into claims and emits each at the most specific level supported by evidence, raising overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity.

AgileLog: A Forkable Shared Log for Agents on Data Streams

cs.DC · 2026-04-16 · unverdicted · novelty 7.0

AgileLog introduces forkable shared logs with cheap forking and isolation to support AI agents on data streams.

Transactional Attention: Semantic Sponsorship for KV-Cache Retention

cs.CL · 2026-04-13 · unverdicted · novelty 7.0

Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.

Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software

cs.SE · 2026-04-06 · conditional · novelty 7.0

LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with control flow and high-precision literals.

Co-Disclosing the Computer: LLM-Mediated Computing through Reflective Conversation

cs.HC · 2026-02-27 · unverdicted · novelty 7.0

Introduces LLM-mediated computing as a paradigm of reflective conversation and co-disclosure where the computer emerges through human-LLM interaction.

Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

cs.LG · 2025-09-25 · unverdicted · novelty 7.0

Reasoning LLMs with minimal tools for tree construction and analysis induce decision trees that outperform CART, compete with ensembles on low-resource tabular data, and provide human-readable reasoning traces.

citing papers explorer

Showing 50 of 149 citing papers.

BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models cs.CL · 2026-05-07 · unverdicted · none · ref 69 · internal anchor
BioTool dataset enables fine-tuning a 4B-parameter LLM to outperform GPT-5.1 in biomedical tool calling while improving downstream answer quality per human experts.
AgentWard: A Lifecycle Security Architecture for Autonomous AI Agents cs.CR · 2026-04-27 · conditional · none · ref 3 · internal anchor
AgentWard organizes stage-specific security controls with cross-layer coordination to intercept threats across the full lifecycle of autonomous AI agents.
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills cs.CL · 2026-04-27 · unverdicted · none · ref 23 · 2 links · internal anchor
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation cs.CR · 2026-04-24 · unverdicted · none · ref 11 · internal anchor
Automation-Exploit is a multi-agent LLM system that uses conditional digital-twin validation to perform risk-mitigated exploitation of logical, web, and memory-corruption vulnerabilities in black-box targets.
An AI Agent Execution Environment to Safeguard User Data cs.CR · 2026-04-21 · unverdicted · none · ref 59 · internal anchor
GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack-free models.
QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance cs.MA · 2026-04-20 · unverdicted · none · ref 23 · internal anchor
QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.
Preregistered Belief Revision Contracts cs.AI · 2026-04-16 · unverdicted · none · ref 36 · internal anchor
PBRC is a contract protocol that enforces evidential belief updates in deliberative multi-agent systems and proves it prevents conformity-driven false cascades under conservative fallbacks.
In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach cs.AI · 2026-04-10 · unverdicted · none · ref 91 · internal anchor
A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning cs.CL · 2026-04-09 · unverdicted · none · ref 19 · internal anchor
ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents cs.CV · 2026-04-08 · unverdicted · none · ref 57 · internal anchor
GameWorld is a new benchmark providing standardized interfaces, 34 games, 170 tasks, and verifiable outcome metrics to evaluate multimodal large language model agents in video game environments.
A Physical Agentic Loop for Language-Guided Grasping with Execution-State Monitoring cs.RO · 2026-04-08 · unverdicted · none · ref 4 · internal anchor
A physical agentic loop with execution-state monitoring improves robustness of language-guided grasping over open-loop execution by converting noisy telemetry into discrete outcome events that trigger retries or user escalation.
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion cs.CL · 2026-04-07 · conditional · none · ref 2 · internal anchor
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
Uncertainty-Guided Latent Diagnostic Trajectory Learning for Sequential Clinical Diagnosis cs.AI · 2026-04-06 · unverdicted · none · ref 4 · internal anchor
LDTL trains LLM agents to follow uncertainty-guided latent diagnostic trajectories, outperforming baselines on MIMIC-CDM with higher accuracy and fewer tests.
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics cs.SE · 2026-04-06 · unverdicted · none · ref 6 · internal anchor
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
Towards Predicting Multi-Vulnerability Attack Chains in Software Supply Chains from Software Bill of Materials Graphs cs.SE · 2026-04-04 · unverdicted · none · ref 25 · internal anchor
The paper shows that heterogeneous graph attention networks can classify vulnerable components in real SBOMs at 91% accuracy and that a simple MLP can predict documented multi-vulnerability chains with 0.93 ROC-AUC.
Querying Structured Data Through Natural Language Using Language Models cs.CL · 2026-04-03 · conditional · none · ref 15 · internal anchor
Fine-tuning an 8B LLM with synthetic data enables accurate natural language querying of structured datasets like accessibility services in Spain, generalizing to new locations.
Measuring Representation Robustness in Large Language Models for Geometry cs.CL · 2026-04-03 · unverdicted · none · ref 28 · internal anchor
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capacity models.
Mind Your HEARTBEAT! Claw Background Execution Inherently Enables Silent Memory Pollution cs.CR · 2026-03-24 · unverdicted · none · ref 10 · internal anchor
Claw AI agents' heartbeat background execution shares memory context with user sessions, allowing ordinary social misinformation to silently pollute long-term memory and shape behavior at rates up to 76% across sessions.
Tracking Capabilities for Safer Agents cs.AI · 2026-03-01 · unverdicted · none · ref 71 · internal anchor
AI agents can generate code in a capability-safe Scala dialect that statically prevents information leakage and malicious side effects while preserving task performance.
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents cs.CR · 2026-02-24 · unverdicted · none · ref 22 · internal anchor
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
ToolMATH: A Diagnostic Benchmark for Long-Horizon Tool Use under Systematic Tool-Catalog Constraints cs.CL · 2026-02-24 · unverdicted · none · ref 3 · internal anchor
ToolMATH converts MATH solutions into controlled tool environments with gold tools and graded distractors to diagnose LLM adaptability, robustness, and long-horizon tool connectivity.
MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding cs.CL · 2025-10-09 · unverdicted · none · ref 44 · internal anchor
MOSAIC is a training-free multi-agent LLM framework with rationale, coding, reflection, and debugging agents plus a consolidated context window that outperforms prior methods on scientific coding benchmarks.
Learning Adapter Rank via Symmetry Breaking cs.LG · 2025-06-28 · unverdicted · none · ref 1 · internal anchor
BayesLoRA applies diagonal rank-wise variational inference to break LoRA gauge symmetry and learn adapter rank with O(r) parameters.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents cs.CL · 2024-10-30 · unverdicted · none · ref 85 · internal anchor
OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.
Learning to Ask: When LLM Agents Meet Unclear Instruction cs.CL · 2024-08-31 · unverdicted · none · ref 22 · internal anchor
Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.
SGLang: Efficient Execution of Structured Language Model Programs cs.AI · 2023-12-12 · conditional · none · ref 41 · internal anchor
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection cs.CL · 2023-10-17 · unverdicted · none · ref 149 · internal anchor
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
MemGPT: Towards LLMs as Operating Systems cs.AI · 2023-10-12 · unverdicted · none · ref 18 · internal anchor
MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving cs.CL · 2023-09-29 · conditional · none · ref 38 · internal anchor
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
Cognitive Architectures for Language Agents cs.AI · 2023-09-05 · accept · none · ref 70 · internal anchor
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory cs.AI · 2023-05-25 · conditional · none · ref 20 · internal anchor
GITM uses LLMs to generate action plans from text knowledge and memory, enabling agents to complete long-horizon Minecraft tasks at much higher success rates than prior RL methods.
Gorilla: Large Language Model Connected with Massive APIs cs.CL · 2023-05-24 · conditional · none · ref 33 · internal anchor
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
Reasoning with Language Model is Planning with World Model cs.CL · 2023-05-24 · unverdicted · none · ref 163 · internal anchor
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models cs.CL · 2023-05-23 · conditional · none · ref 15 · internal anchor
ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
ChemCrow: Augmenting large-language models with chemistry tools physics.chem-ph · 2023-04-11 · conditional · none · ref 10 · internal anchor
ChemCrow augments LLMs with 18 expert chemistry tools to autonomously plan and execute syntheses and guide molecular discoveries in organic synthesis, drug discovery, and materials design.
CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society cs.AI · 2023-03-31 · conditional · none · ref 100 · internal anchor
CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face cs.CL · 2023-03-30 · unverdicted · none · ref 24 · internal anchor
HuggingGPT is an agent system where ChatGPT plans and orchestrates calls to Hugging Face models to solve complex multi-modal AI tasks.
Language Models can Solve Computer Tasks cs.CL · 2023-03-30 · accept · none · ref 59 · internal anchor
Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action cs.CV · 2023-03-20 · unverdicted · none · ref 24 · internal anchor
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
ART: Automatic multi-step reasoning and tool-use for large language models cs.CL · 2023-03-16 · unverdicted · none · ref 146 · internal anchor
ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
Grounded Scaling: Why Agentic AI Needs Deterministic Environments cs.AI · 2026-06-21 · unverdicted · none · ref 9 · internal anchor
Agentic AI scaling requires deterministic environments because per-step success probability below 1 causes exponential degradation in k-step chains, addressed via new metrics SCI and DMM plus formal bounds.
pcbGPT: Automatic PCB Schematic Synthesis from Natural Language Requirements cs.HC · 2026-05-31 · unverdicted · none · ref 27 · internal anchor
pcbGPT generates KiCad schematics from natural language with 0.90 pass@1 on 20 embedded tasks via tool-augmented LLM synthesis and multi-stage validation.
Toward Agentic Governance: What Shapes LLM-Agent Intervention in Public Forums? cs.CY · 2026-05-30 · unverdicted · none · ref 60 · internal anchor
Four deployment choices—model version, open/closed weight status, provider, and system prompt—each alter LLM-agent intervention rates on forum posts, with closed-weight models declining more on visible challenges than open-weight models.
Sophrosyne: Agentic Exploration of Relational Data Systems Needs Moderation cs.DB · 2026-05-29 · unverdicted · none · ref 41 · internal anchor
Sophrosyne augments fine-grained data APIs with directives to curb over-exploration by Text2SQL agents, reducing it 4.6x and improving accuracy up to 4 points.
LLM Based Web Accessibility Repair: An Empirical Study of Detection, Remediation, and Cost cs.SE · 2026-05-26 · unverdicted · none · ref 16 · internal anchor
LLM agents detect accessibility issues with F1 around 0.65 and generate valid fixes that reduce violations in 80% of cases but fully resolve under 26%, indicating need for hybrid approaches.
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators cs.MA · 2026-05-21 · unverdicted · none · ref 33 · internal anchor
Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.
Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs cs.CL · 2026-05-19 · unverdicted · none · ref 33 · internal anchor
Mix-Quant quantizes prefilling to NVFP4 and keeps BF16 for decoding in agentic LLMs, achieving up to 3x prefilling speedup while largely preserving task performance on long-context and agentic benchmarks.
Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents cs.AI · 2026-05-17 · unverdicted · none · ref 28 · internal anchor
A dual-process memory architecture for scientific AI agents maintains 70-85% accuracy over 15,000 messages by using a constant 10-message episodic window and domain-specific semantic consolidation, consuming 62% fewer tokens than full-context baselines.
Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay cs.AI · 2026-05-14 · unverdicted · none · ref 2 · internal anchor
The LOOP Skill Engine records one LLM-powered run of a periodic task and converts it into a deterministic replay template that eliminates further LLM usage while maintaining high success rates.
Discovery of Interpretable Surrogates via Agentic AI: Application to Gravitational Waves gr-qc · 2026-05-11 · unverdicted · none · ref 15 · internal anchor
GWAgent agentic workflow produces analytic surrogates for eccentric BBH waveforms with 6.9e-4 median mismatch and 8.4x speedup, outperforming baselines, and infers eccentricity for GW200129.

Toolformer: Language Models Can Teach Themselves to Use Tools

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer