arxiv: 2305.15334 · v1 · submitted 2023-05-24 · 💻 cs.CL · cs.AI

Gorilla: Large Language Model Connected with Massive APIs

Pith reviewed 2026-05-11 23:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · API calls · tool use · fine-tuning · retrieval augmentation · machine learning libraries · hallucination

The pith

A fine-tuned LLaMA model generates more accurate API calls than GPT-4 for machine learning libraries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases Gorilla, a model created by fine-tuning LLaMA on documentation for Hugging Face, TorchHub, and TensorHub APIs. It shows higher accuracy than GPT-4 at producing correct API calls from natural language instructions. Pairing the model with a retriever lets it pull fresh documentation at inference time, which helps it adjust to API updates and cuts down on invented API usages. The authors introduce APIBench as a test set to measure this performance across the three libraries. The work demonstrates that targeted fine-tuning plus retrieval can make LLMs more dependable when they need to call external tools.

Core claim

Gorilla is a LLaMA-based model fine-tuned on API documentation from Hugging Face, TorchHub, and TensorHub. It surpasses GPT-4 in accuracy when writing API calls on the APIBench dataset. When combined with a document retriever, the model adapts to changes in API documentation at test time and reduces hallucinations of incorrect API usage.

What carries the argument

The fine-tuned Gorilla model that maps instructions to API calls, augmented by a retriever that supplies relevant and current API documentation.
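
One way to picture the retrieval step, as a minimal sketch rather than the authors' implementation: a retriever ranks documentation entries against the instruction, and the best match is appended to the prompt sent to the model. The token-overlap scorer, the `API_DOCS` entries, the prompt wording, and the helper names `retrieve` and `build_prompt` are all assumptions made for illustration; a real system would use a stronger retriever over the full documentation corpus.

```python
import re
from collections import Counter

# Illustrative documentation entries (assumed for this sketch, not taken from APIBench).
API_DOCS = [
    {"api_call": "torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)",
     "description": "Image classification with a ResNet-50 model."},
    {"api_call": "pipeline('translation_en_to_fr', model='t5-small')",
     "description": "Translate English text to French with a T5 model."},
    {"api_call": "hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')",
     "description": "Encode sentences into embedding vectors."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, docs, top_k=1):
    """Rank docs by token overlap with the query (a stand-in for a real retriever)."""
    query_tokens = Counter(tokenize(query))

    def score(doc):
        doc_tokens = Counter(tokenize(doc["description"]))
        return sum((query_tokens & doc_tokens).values())

    return sorted(docs, key=score, reverse=True)[:top_k]


def build_prompt(instruction, docs):
    """Append the retrieved documentation to the user instruction."""
    reference = "\n".join(f"- {d['api_call']}: {d['description']}" for d in docs)
    return f"{instruction}\nUse this API documentation for reference:\n{reference}"


instruction = "Translate this English sentence to French."
top_docs = retrieve(instruction, API_DOCS)
prompt = build_prompt(instruction, top_docs)
print(prompt)
```

Because the documentation is supplied at inference time, editing an entry in `API_DOCS` changes the prompt without any retraining, which is the adaptation behaviour the summary describes.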

If this is right

  • LLMs become more reliable for calling machine learning tools without manual prompt engineering.
  • Models can remain current with frequently updated libraries through retrieval instead of full retraining.
  • Hallucination of invalid API parameters and signatures is substantially reduced.
  • A single open model can handle multiple large API collections without needing closed-source scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend beyond machine learning libraries to web services and software engineering tools.
  • Gorilla-style models might serve as the tool-use layer inside larger agent systems.
  • Retrieval-augmented fine-tuning may lower the cost of keeping tool-using LLMs up to date compared with scaling model size alone.

Load-bearing premise

Fine-tuning on collected API documentation produces a model that generalizes to new or updated APIs when the retriever supplies accurate and relevant information.

What would settle it

Test Gorilla on APIs or documentation versions introduced after its training cutoff, and check whether its accuracy drops or its hallucination rate rises relative to GPT-4 on the same instructions.

Original abstract

Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Gorilla, a LLaMA-based model fine-tuned on API documentation from HuggingFace, TorchHub, and TensorHub. It claims to surpass GPT-4 on API call generation accuracy using the new APIBench benchmark, and when paired with a document retriever, to adapt to test-time documentation changes while substantially reducing hallucinations relative to direct LLM prompting.

Significance. If the headline performance claims hold under controlled conditions, the work demonstrates practical value in domain-specific fine-tuning and retrieval-augmented generation for reliable tool use. The public release of the model, training data, code, and demo is a clear strength that enables direct reproducibility and follow-on research.

major comments (2)
  1. [Abstract and §4 (Evaluation)] The central claim that Gorilla 'surpasses the performance of GPT-4 on writing API calls' is load-bearing, yet the GPT-4 baseline prompting strategy is not described in sufficient detail. It is unclear whether GPT-4 received the same retrieved API documentation provided to the Gorilla+retriever variant or was evaluated zero-shot/few-shot without equivalent context; this distinction is required to rule out information-access artifacts rather than fine-tuning gains.
  2. [§4 (Results) and APIBench description] The reported gains on hallucination mitigation and adaptation to document updates lack explicit quantitative controls. For instance, the fraction of invalid API calls, incorrect arguments, or retrieval-induced errors should be broken down by model variant with the same retrieval corpus and statistical tests; without these, the 'substantially mitigates' claim cannot be fully assessed.
minor comments (2)
  1. [Figure 2 and §3] The retriever architecture diagram and accompanying text could more clearly label the exact retrieval top-k and embedding model used at inference time.
  2. [§5 (Limitations)] The discussion of generalization to entirely unseen APIs is brief; adding a held-out API category in the experimental tables would strengthen the claim.
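
The error breakdown requested in major comment 2 could be tallied along the following lines. This is a sketch under stated assumptions, not the APIBench evaluation procedure: `KNOWN_APIS` is a hypothetical registry whose keyword sets are illustrative rather than the libraries' real signatures, the category names are chosen here, and the sample model outputs are invented.

```python
import ast
from collections import Counter

# Hypothetical registry: callable name -> keyword arguments treated as valid.
KNOWN_APIS = {
    "torch.hub.load": {"repo_or_dir", "model", "pretrained"},
    "pipeline": {"task", "model"},
}


def dotted_name(node):
    """Rebuild a dotted name such as 'torch.hub.load' from an AST node."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def classify_call(source):
    """Assign one generated API call to a coarse error category."""
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError:
        return "unparseable"
    call = tree.body
    if not isinstance(call, ast.Call):
        return "unparseable"
    name = dotted_name(call.func)
    if name not in KNOWN_APIS:
        return "invalid_api_name"
    keywords = {kw.arg for kw in call.keywords if kw.arg is not None}
    if not keywords <= KNOWN_APIS[name]:
        return "incorrect_argument"
    return "ok"


# Invented model outputs, one per category.
SAMPLE_OUTPUTS = [
    "torch.hub.load('pytorch/vision', 'resnet50', pretrained=True)",
    "torch.hub.fetch('pytorch/vision')",
    "pipeline('translation_en_to_fr', modle='t5-small')",
    "pipeline('translation_en_to_fr'",
]

breakdown = Counter(classify_call(s) for s in SAMPLE_OUTPUTS)
print(dict(breakdown))
```

Running the same classifier over each model variant's outputs, with the retrieval corpus held fixed, would give the per-variant table the referee asks for.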

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where clarifications or additional analyses are needed, we have revised the manuscript accordingly to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract and §4 (Evaluation)] The central claim that Gorilla 'surpasses the performance of GPT-4 on writing API calls' is load-bearing, yet the GPT-4 baseline prompting strategy is not described in sufficient detail. It is unclear whether GPT-4 received the same retrieved API documentation provided to the Gorilla+retriever variant or was evaluated zero-shot/few-shot without equivalent context; this distinction is required to rule out information-access artifacts rather than fine-tuning gains.

    Authors: We thank the referee for highlighting this important point. In the original submission, GPT-4 baselines were evaluated using standard zero-shot and few-shot prompting without access to the document retriever, while the Gorilla+retriever variant uses retrieved documentation at inference time. The headline claim that Gorilla surpasses GPT-4 refers to the fine-tuned model versus GPT-4 under comparable (non-retrieval) prompting conditions. To remove any ambiguity, we have expanded Section 4 to explicitly describe the prompting setup for every baseline, including the exact context (or lack thereof) provided to GPT-4. We have also added a controlled comparison in which GPT-4 is given the same retrieved documents as Gorilla+retriever, allowing readers to isolate the contribution of fine-tuning from information access.
    revision: yes

  2. Referee: [§4 (Results) and APIBench description] The reported gains on hallucination mitigation and adaptation to document updates lack explicit quantitative controls. For instance, the fraction of invalid API calls, incorrect arguments, or retrieval-induced errors should be broken down by model variant with the same retrieval corpus and statistical tests; without these, the 'substantially mitigates' claim cannot be fully assessed.

    Authors: We agree that a finer-grained error analysis and statistical support would make the claims more robust. In the revised manuscript we have added a new table in Section 4 that breaks down error types (invalid API names, incorrect arguments, retrieval-induced errors, and other hallucinations) for every model variant, always using the identical retrieval corpus for fair comparison. We also report bootstrap confidence intervals and paired significance tests (McNemar’s test) on the key metrics of hallucination rate and adaptation accuracy. These additions directly address the request for quantitative controls and are now included in both the main results and the APIBench description.
    revision: yes
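
The paired tests named in this response can be sketched with the standard library alone. The code below is illustrative, not the authors' analysis: the per-example correctness flags are made up, and `mcnemar_exact` and `bootstrap_ci` are hypothetical helper names. It uses the exact (binomial) form of McNemar's test on the discordant pairs and a percentile bootstrap for the accuracy interval.

```python
import math
import random


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-example correctness flags."""
    # Discordant pairs: examples where exactly one of the two models is correct.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, the discordant count follows Binomial(n, 0.5).
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bootstrap_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for an accuracy."""
    rng = random.Random(seed)
    n = len(flags)
    means = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up flags for 20 paired test instructions (1 = correct API call).
model_a = [1] * 8 + [0] * 1 + [1] * 11
model_b = [0] * 8 + [1] * 1 + [1] * 11

p_value = mcnemar_exact(model_a, model_b)
ci_low, ci_high = bootstrap_ci(model_a)
print(f"McNemar exact p-value: {p_value:.4f}")
print(f"Bootstrap 95% CI for model A accuracy: [{ci_low:.2f}, {ci_high:.2f}]")
```

McNemar's test suits this setting because both models are scored on the same instructions, so only the examples where they disagree carry information about which is better.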

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning and evaluation with public releases

The paper is an empirical contribution describing fine-tuning of LLaMA on API documentation from HuggingFace, TorchHub, and TensorHub to create Gorilla, followed by evaluation on the introduced APIBench dataset and integration with a document retriever. No mathematical derivations, equations, or first-principles predictions exist that could reduce to inputs by construction. Claims rest on experimental performance metrics rather than any self-referential definitions or fitted parameters renamed as predictions. Public release of model, data, and code allows independent verification, rendering the work self-contained without load-bearing self-citations or ansatzes that define the result tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced beyond standard practices in LLM fine-tuning and evaluation.

pith-pipeline@v0.9.0 · 5532 in / 1085 out tokens · 91878 ms · 2026-05-11T23:17:08.808487+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Revisable by Design: A Theory of Streaming LLM Agent Execution

    cs.LG 2026-04 unverdicted novelty 8.0

    LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...

  2. Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

    cs.CR 2026-04 unverdicted novelty 8.0

    Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

  3. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  4. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  5. RepairAgent: An Autonomous, LLM-Based Agent for Program Repair

    cs.SE 2024-03 conditional novelty 8.0

    RepairAgent autonomously repairs 164 bugs on Defects4J including 39 not fixed by prior techniques by treating an LLM as an agent that invokes tools via a finite state machine and dynamic prompts.

  6. Mind2Web: Towards a Generalist Agent for the Web

    cs.CL 2023-06 accept novelty 8.0

    Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.

  7. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    cs.CL 2023-04 conditional novelty 8.0

    API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

  8. Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.

  9. To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents

    cs.LG 2026-05 conditional novelty 7.0

    LLM agents have an intrinsic over-calling bias diagnosed via SAE activation margins and corrected by adaptive margin-calibrated steering, improving overall decision accuracy.

  10. RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

    cs.IR 2026-05 unverdicted novelty 7.0

    RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.

  11. Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    The Agent-First Tool API paradigm raises AI agent task success from 64% to 88% and cuts human interventions by 72.7% through semantic phases, structured contracts, and risk governance in a production enterprise system.

  12. RewardHarness: Self-Evolving Agentic Post-Training

    cs.AI 2026-05 unverdicted novelty 7.0

    RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

  13. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  14. TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

    cs.SE 2026-05 unverdicted novelty 7.0

    TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

  15. The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

    cs.CL 2026-04 accept novelty 7.0

    SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

  16. Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

    cs.GR 2026-04 unverdicted novelty 7.0

    Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.

  17. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  18. Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

    cs.CR 2026-04 unverdicted novelty 7.0

    A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

  19. SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

  20. GraSP: Graph-Structured Skill Compositions for LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    GraSP introduces executable skill graphs that improve LLM agent rewards by up to 19 points and reduce steps by up to 41% over ReAct, Reflexion, ExpeL, and flat-skill baselines across ALFWorld, ScienceWorld, WebShop, a...

  21. Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

  22. Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

    cs.CR 2026-04 unverdicted novelty 7.0

    The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.

  23. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  24. Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries

    cs.SE 2025-09 unverdicted novelty 7.0

    A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%,...

  25. From REST to MCP: An Empirical Study of API Wrapping and Automated Server Generation for LLM Agents

    cs.SE 2025-07 unverdicted novelty 7.0

    First large-scale empirical analysis of MCP server construction shows predominant REST wrapping with low operation exposure, plus an AutoMCP pipeline that improves automated generation success and reduces tool complexity.

  26. Prompt Injection Attack to Tool Selection in LLM Agents

    cs.CR 2025-04 conditional novelty 7.0

    ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.

  27. Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

    cs.AI 2025-03 conditional novelty 7.0

    Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.

  28. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    cs.CR 2024-10 unverdicted novelty 7.0

    ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...

  29. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  30. An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

    cs.CR 2026-05 unverdicted novelty 6.0

    Empirical demonstration that prompt injection combined with web-tool use creates a feasible privacy-leakage chain in deployed black-box chatbot agents.

  31. PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship

    cs.HC 2026-05 unverdicted novelty 6.0

    PULSE demonstrates that agentic LLM-based investigation of passive smartphone sensing data achieves balanced accuracies of 0.743 (with diary) and 0.713 (sensing-only) for predicting emotion regulation desire and inter...

  32. The Scaling Laws of Skills in LLM Agent Systems

    cs.CL 2026-05 unverdicted novelty 6.0

    Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations...

  33. Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...

  34. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

  35. Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation

    cs.AI 2026-05 unverdicted novelty 6.0

    A single configuration file generates causally coherent synthetic MES data across domains and guarantees zero tool-parameter hallucination when AI tools are ontology-constrained.

  36. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  37. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  38. EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

    cs.AI 2026-05 unverdicted novelty 6.0

    EnvSimBench reveals that state-of-the-art LLMs exhibit a universal state change cliff in environment simulation, with a new constraint-driven pipeline raising synthesis yield by 6.8% and cutting costs over 90%.

  39. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 6.0

    Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.

  40. Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis

    cs.CR 2026-05 unverdicted novelty 6.0

    Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...

  41. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

    cs.SE 2026-04 unverdicted novelty 6.0

    Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

  42. From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

    cs.CL 2026-04 unverdicted novelty 6.0

    SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.

  43. Time Series Augmented Generation for Financial Applications

    cs.AI 2026-04 unverdicted novelty 6.0

    TSAG lets LLMs use external tools for financial time series analysis, with a new benchmark showing capable agents achieve near-perfect tool accuracy and minimal hallucination.

  44. When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis

    cs.AI 2026-04 unverdicted novelty 6.0

    LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.

  45. Auditable Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms f...

  46. From Data to Theory: Autonomous Large Language Model Agents for Materials Science

    cs.AI 2026-04 unverdicted novelty 6.0

    An LLM agent autonomously selects, codes, and validates materials equations from data, recovering known laws reliably but requiring checks for new or specialized cases.

  47. Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference

    cs.DC 2026-01 unverdicted novelty 6.0

    Sutradhara co-designs orchestrator and LLM serving to overlap tool execution with prefill, stream tool dispatch during decode, and use semantic hints for cache management, yielding up to 77% higher load at fixed media...

  48. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 6.0

    Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and ...

  49. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

  50. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    cs.LG 2024-10 accept novelty 6.0

    AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.

  51. Learning to Ask: When LLM Agents Meet Unclear Instruction

    cs.CL 2024-08 unverdicted novelty 6.0

    Introduces NoisyToolBench benchmark and Ask-when-Needed framework to improve LLM tool-use performance when user instructions are unclear or incomplete.

  52. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  53. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  54. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  55. HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools

    cs.AI 2026-05 unverdicted novelty 5.0

    HarnessAPI derives streaming HTTP endpoints, OpenAPI UI, and MCP tools from a single handler.py plus Pydantic schemas, cutting framework boilerplate by 74%.

  56. Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning

    cs.CL 2026-05 unverdicted novelty 5.0

    QLoRA fine-tuning on ~1700 examples internalizes tool knowledge in Gemma-4B and Qwen3-4B, enabling description-free inference that cuts input length by 82.6% and raises planning scores above an informed baseline.

  57. Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

    cs.CL 2026-05 unverdicted novelty 5.0

    Grep retrieval generally outperforms vector retrieval in agentic search tasks, with performance varying strongly by agent harness and tool-calling style.

  58. Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay

    cs.AI 2026-05 unverdicted novelty 5.0

    The LOOP Skill Engine records one LLM-powered run of a periodic task and converts it into a deterministic replay template that eliminates further LLM usage while maintaining high success rates.

  59. The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems

    cs.AI 2026-05 unverdicted novelty 5.0

    Ontology-grounded tool architectures eliminate hallucination of domain identifiers in industrial AI agents by enforcing semantic constraints through a typed relational configuration and three-operation interface.

  60. Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

    cs.AI 2026-05 unverdicted novelty 5.0

    A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 75 Pith papers · 24 internal anchors

  1. [1]

    Ahn, M., Brohan, A., Brown, N., Chebotar, Y ., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. (2022). Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691

  2. [2]

    Andor, D., He, L., Lee, K., and Pitler, E. (2019). Giving bert a calculator: Finding operations and arguments with reading comprehension. arXiv preprint arXiv:1909.00109

  3. [3]

    Anthropic, h.-c. (2022). Claude

  4. [4]

    Bavishi, R., Lemieux, C., Fox, R., Sen, K., and Stoica, I. (2019). Autopandas: neural- backed generators for program synthesis. Proceedings of the ACM on Programming Languages , 3(OOPSLA):1–27

  5. [5]

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901

  6. [6]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712

  7. [7]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

  8. [8]

    Chen, X., Lin, M., Schärli, N., and Zhou, D. (2023). Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128

  9. [9]

    Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. (2023). Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

  10. [10]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2022). Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311

  11. [11]

    Scaling Instruction-Finetuned Language Models

    Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y ., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. (2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416

  12. [12]

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  13. [13]

    Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A.-r., and Kohli, P. (2017). Robustfill: Neural program learning under noisy i/o. In International conference on machine learning, pages 990–998. PMLR

  14. [14]

    Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. (2022). Pal: Program-aided language models. arXiv preprint arXiv:2211.10435

  15. [15]

    OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

    Iyer, S., Lin, X. V., Pasunuru, R., Mihaylov, T., Simig, D., Yu, P., Shuster, K., Wang, T., Liu, Q., Koura, P. S., et al. (2022). Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017

  16. [16]

    Jain, N., Vaidyanath, S., Iyer, A., Natarajan, N., Parthasarathy, S., Rajamani, S., and Sharma, R. (2022). Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering, pages 1219–1231

  17. [17]

    Kim, G., Baldi, P., and McAleer, S. (2023). Language models can solve computer tasks. arXiv preprint arXiv:2303.17491

  18. [18]

    Large Language Models are Zero-Shot Reasoners

    Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916

  19. [19]

    Komeili, M., Shuster, K., and Weston, J. (2021). Internet-augmented dialogue generation. arXiv preprint arXiv:2107.07566

  20. [20]

    Lachaux, M.-A., Roziere, B., Chanussot, L., and Lample, G. (2020). Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511

  21. [21]

    Lazaridou, A., Gribovskaya, E., Stokowiec, W., and Grigorev, N. (2022). Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115

  22. [22]

    StarCoder: may the source be with you!

    Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. (2023). Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161

  23. [23]

    Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. (2022). Competition-level code generation with alphacode. Science, 378(6624):1092–1097

  24. [24]

    Liang, Y., Wu, C., Song, T., Wu, W., Xia, Y., Liu, Y., Ou, Y., Lu, S., Ji, L., Mao, S., et al. (2023). Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. arXiv preprint arXiv:2303.16434

  25. [25]

    Menon, A., Tamuz, O., Gulwani, S., Lampson, B., and Kalai, A. (2013). A machine learning framework for programming by example. In International Conference on Machine Learning, pages 187–195. PMLR

  26. [26]

    Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332

  27. [27]

    Nijkamp, E., Hayashi, H., Xiong, C., Savarese, S., and Zhou, Y. (2023). Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint arXiv:2305.02309

  28. [28]

    Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. (2022). Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474

  29. [29]

    Gpt-4 technical report

    OpenAI (2023). Gpt-4 technical report

  30. [30]

    OpenAI (2022). Chatgpt. https://openai.com/blog/chatgpt

  31. [31]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. (2021). Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207

  32. [32]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. (2022). Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100

  33. [33]

    Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761

  34. [34]

    Schick, T. and Schütze, H. (2020). Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676

  35. [35]

    Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2023). Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580

  36. [36]

    Shinn, N., Labash, B., and Gopinath, A. (2023). Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366

  37. [37]

    BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage

    Shuster, K., Xu, J., Komeili, M., Ju, D., Smith, E. M., Roller, S., Ung, M., Chen, M., Arora, K., Lane, J., et al. (2022). Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv preprint arXiv:2208.03188

  38. [38]

    Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca

  39. [39]

    Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. (2022). Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239

  40. [40]

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  41. [41]

    Vemprala, S., Bonatti, R., Bucker, A., and Kapoor, A. (2023). Chatgpt for robotics: Design principles and model abilities

  42. [42]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. (2022a). Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560

  43. [43]

    Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Naik, A., Ashok, A., Dhanasekaran, A. S., Arunkumar, A., Stap, D., et al. (2022b). Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109

  44. [44]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903

  45. [45]

    Xu, F. F., Alon, U., Neubig, G., and Hellendoorn, V. J. (2022). A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 1–10

  46. [46]

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2022). React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629

  47. [47]

    Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. (2022). Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414

  48. [48]

    OPT: Open Pre-trained Transformer Language Models

    Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. (2022). Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068

    8 Appendix

    8.1 Dataset Details

    Our dataset is multi-faceted, comprising three distinct domains: Torch Hub, Tensor Hub, and HuggingFace. Each entry...