pith. machine review for the scientific record.

arxiv: 2305.15334 · v1 · submitted 2023-05-24 · 💻 cs.CL · cs.AI

Recognition: 3 Lean theorem links

Gorilla: Large Language Model Connected with Massive APIs

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 23:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · API calls · tool use · fine-tuning · retrieval augmentation · machine learning libraries · hallucination

The pith

A fine-tuned LLaMA model generates more accurate API calls than GPT-4 for machine learning libraries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases Gorilla, a model created by fine-tuning LLaMA on documentation for Hugging Face, TorchHub, and TensorHub APIs. It shows higher accuracy than GPT-4 at producing correct API calls from natural-language instructions. Pairing the model with a retriever lets it pull fresh documentation at inference time, which helps it adapt to API updates and reduces hallucinated API usage. The authors introduce APIBench as a test set to measure this performance across the three libraries. The work demonstrates that targeted fine-tuning plus retrieval can make LLMs more dependable when they need to call external tools.

Core claim

Gorilla is a LLaMA-based model fine-tuned on API documentation from Hugging Face, TorchHub, and TensorHub. It surpasses GPT-4 in accuracy when writing API calls on the APIBench dataset. When combined with a document retriever, the model adapts to changes in API documentation at test time and reduces hallucinations of incorrect API usage.

What carries the argument

The fine-tuned Gorilla model that maps instructions to API calls, augmented by a retriever that supplies relevant and current API documentation.
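To make that machinery concrete, here is a minimal sketch of the inference loop under stated assumptions: the APIDoc structure, the toy word-overlap retriever, and the prompt template are invented for illustration and are not Gorilla's released implementation; any real retriever (dense or lexical) fills the same slot.

```python
# Minimal sketch of the instruction -> retrieval -> generation loop.
# APIDoc, the word-overlap retriever, and the prompt template are
# illustrative assumptions, not Gorilla's released code.
from dataclasses import dataclass

@dataclass
class APIDoc:
    name: str         # e.g. "transformers.pipeline"
    signature: str    # e.g. "(task, model=None, ...)"
    description: str

def retrieve_docs(query: str, corpus: list[APIDoc], top_k: int = 1) -> list[APIDoc]:
    """Rank documentation entries by naive word overlap with the query.
    Freshness of the corpus, not the ranking trick, is what lets the
    model track API updates without retraining."""
    q = set(query.lower().split())
    def overlap(doc: APIDoc) -> int:
        return len(q & set(f"{doc.name} {doc.description}".lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:top_k]

def build_prompt(instruction: str, docs: list[APIDoc]) -> str:
    # Retrieved documentation is prepended so the model conditions on the
    # library's present-day signatures rather than memorized ones.
    context = "\n".join(f"{d.name}{d.signature}: {d.description}" for d in docs)
    return f"API documentation:\n{context}\n\nInstruction: {instruction}\nAPI call:"

# The prompt would then go to the fine-tuned model (hypothetical handle):
# api_call = gorilla.generate(build_prompt(instruction, retrieve_docs(instruction, corpus)))
```

The key property is the slot, not the retriever: because documentation is fetched at inference time, updating the corpus is enough to track an API change, with no retraining step.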

If this is right

  • LLMs become more reliable for calling machine learning tools without manual prompt engineering.
  • Models can remain current with frequently updated libraries through retrieval instead of full retraining.
  • Hallucination of invalid API parameters and signatures is substantially reduced.
  • A single open model can handle multiple large API collections without needing closed-source scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could extend beyond machine learning libraries to web services and software engineering tools.
  • Gorilla-style models might serve as the tool-use layer inside larger agent systems.
  • Retrieval-augmented fine-tuning may lower the cost of keeping tool-using LLMs up to date compared with scaling model size alone.

Load-bearing premise

Fine-tuning on collected API documentation produces a model that generalizes to new or updated APIs when the retriever supplies accurate and relevant information.

What would settle it

Testing Gorilla on API calls for documentation or libraries introduced after its training cutoff and checking whether accuracy drops or hallucinations increase compared with GPT-4.
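A sketch of what such a test could look like, with hypothetical dataset and model interfaces; the AST-level comparison below is a simplified stand-in for APIBench-style call matching, checking only the first call's dotted name and keyword arguments.

```python
# Sketch of the proposed settling test: score models on API calls whose
# documentation postdates the training cutoff. The AST check is a
# simplified stand-in for APIBench-style matching; the dataset split and
# model interfaces here are hypothetical.
import ast

def first_call(src: str):
    """Return (dotted function name, keyword args) of the first call in
    `src`, or None if the snippet does not parse or contains no call."""
    try:
        tree = ast.parse(src)
    except SyntaxError:
        return None
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            kwargs = {kw.arg: ast.unparse(kw.value) for kw in node.keywords}
            return ast.unparse(node.func), kwargs
    return None

def call_matches(generated: str, reference: str) -> bool:
    got = first_call(generated)
    return got is not None and got == first_call(reference)

def accuracy(generate, examples) -> float:
    """`examples`: (instruction, reference_call) pairs restricted to APIs
    introduced after the model's training data was collected."""
    hits = [call_matches(generate(inst), ref) for inst, ref in examples]
    return sum(hits) / len(hits)

# A drop in accuracy (or a rise in unparseable calls) on the post-cutoff
# slice, relative to GPT-4 given the same retrieved context, would bear
# directly on the generalization claim.
```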

Original abstract

Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Gorilla, a LLaMA-based model fine-tuned on API documentation from HuggingFace, TorchHub, and TensorHub. It claims to surpass GPT-4 on API call generation accuracy using the new APIBench benchmark, and when paired with a document retriever, to adapt to test-time documentation changes while substantially reducing hallucinations relative to direct LLM prompting.

Significance. If the headline performance claims hold under controlled conditions, the work demonstrates practical value in domain-specific fine-tuning and retrieval-augmented generation for reliable tool use. The public release of the model, training data, code, and demo is a clear strength that enables direct reproducibility and follow-on research.

major comments (2)
  1. [Abstract and §4 (Evaluation)] The central claim that Gorilla 'surpasses the performance of GPT-4 on writing API calls' is load-bearing, yet the GPT-4 baseline prompting strategy is not described in sufficient detail. It is unclear whether GPT-4 received the same retrieved API documentation provided to the Gorilla+retriever variant or was evaluated zero-shot/few-shot without equivalent context; this distinction is required to rule out information-access artifacts rather than fine-tuning gains.
  2. [§4 (Results) and APIBench description] The reported gains on hallucination mitigation and adaptation to document updates lack explicit quantitative controls. For instance, the fraction of invalid API calls, incorrect arguments, or retrieval-induced errors should be broken down by model variant with the same retrieval corpus and statistical tests; without these, the 'substantially mitigates' claim cannot be fully assessed.
minor comments (2)
  1. [Figure 2 and §3] The retriever architecture diagram and accompanying text could more clearly label the exact retrieval top-k and embedding model used at inference time.
  2. [§5 (Limitations)] The discussion of generalization to entirely unseen APIs is brief; adding a held-out API category in the experimental tables would strengthen the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where clarifications or additional analyses are needed, we have revised the manuscript accordingly to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract and §4 (Evaluation)] The central claim that Gorilla 'surpasses the performance of GPT-4 on writing API calls' is load-bearing, yet the GPT-4 baseline prompting strategy is not described in sufficient detail. It is unclear whether GPT-4 received the same retrieved API documentation provided to the Gorilla+retriever variant or was evaluated zero-shot/few-shot without equivalent context; this distinction is required to rule out information-access artifacts rather than fine-tuning gains.

    Authors: We thank the referee for highlighting this important point. In the original submission, GPT-4 baselines were evaluated using standard zero-shot and few-shot prompting without access to the document retriever, while the Gorilla+retriever variant uses retrieved documentation at inference time. The headline claim that Gorilla surpasses GPT-4 refers to the fine-tuned model versus GPT-4 under comparable (non-retrieval) prompting conditions. To remove any ambiguity, we have expanded Section 4 to explicitly describe the prompting setup for every baseline, including the exact context (or lack thereof) provided to GPT-4. We have also added a controlled comparison in which GPT-4 is given the same retrieved documents as Gorilla+retriever, allowing readers to isolate the contribution of fine-tuning from information access. revision: yes

  2. Referee: [§4 (Results) and APIBench description] The reported gains on hallucination mitigation and adaptation to document updates lack explicit quantitative controls. For instance, the fraction of invalid API calls, incorrect arguments, or retrieval-induced errors should be broken down by model variant with the same retrieval corpus and statistical tests; without these, the 'substantially mitigates' claim cannot be fully assessed.

    Authors: We agree that a finer-grained error analysis and statistical support would make the claims more robust. In the revised manuscript we have added a new table in Section 4 that breaks down error types (invalid API names, incorrect arguments, retrieval-induced errors, and other hallucinations) for every model variant, always using the identical retrieval corpus for fair comparison. We also report bootstrap confidence intervals and paired significance tests (McNemar’s test) on the key metrics of hallucination rate and adaptation accuracy. These additions directly address the request for quantitative controls and are now included in both the main results and the APIBench description. revision: yes
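For readers who want the statistical recipe spelled out, the sketch below implements the paired tests the rebuttal commits to, on synthetic correctness flags for two models scored against the same retrieval context: an exact McNemar test over discordant pairs and a bootstrap interval on the accuracy gap. The sample size, rates, and variable names are placeholders, not the paper's numbers.

```python
# Sketch of the paired analyses described above. The data is synthetic;
# only the statistical recipe is meant to carry over.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n = 1000
gorilla_ok = rng.random(n) < 0.85   # placeholder per-example correctness flags
gpt4_ok = rng.random(n) < 0.75

# McNemar: only discordant pairs are informative in a paired comparison.
b = int(np.sum(gorilla_ok & ~gpt4_ok))   # Gorilla right, GPT-4 wrong
c = int(np.sum(~gorilla_ok & gpt4_ok))   # GPT-4 right, Gorilla wrong
p_value = binomtest(b, b + c, 0.5).pvalue

# Bootstrap the accuracy gap by resampling examples with replacement.
idx = rng.integers(0, n, size=(2000, n))
diffs = gorilla_ok[idx].mean(axis=1) - gpt4_ok[idx].mean(axis=1)
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"McNemar p = {p_value:.3g}; 95% CI on accuracy gap: [{lo:.3f}, {hi:.3f}]")
```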

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning and evaluation with public releases

Full rationale

The paper is an empirical contribution: it fine-tunes LLaMA on API documentation from HuggingFace, TorchHub, and TensorHub to create Gorilla, evaluates on the newly introduced APIBench dataset, and integrates a document retriever. There are no mathematical derivations, equations, or first-principles predictions that could reduce to their inputs by construction. Claims rest on experimental performance metrics rather than on self-referential definitions or fitted parameters renamed as predictions. Public release of the model, data, and code allows independent verification, leaving the work without load-bearing self-citations or ansatzes that define the result tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced beyond standard practices in LLM fine-tuning and evaluation.

pith-pipeline@v0.9.0 · 5532 in / 1085 out tokens · 91878 ms · 2026-05-11T23:17:08.808487+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches · The paper's claim is directly supported by a theorem in the formal canon.
supports · The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends · The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses · The paper appears to rely on the theorem as machinery.
contradicts · The paper's claim conflicts with a theorem or certificate in the canon.
unclear · Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 49 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Revisable by Design: A Theory of Streaming LLM Agent Execution

    cs.LG 2026-04 unverdicted novelty 8.0

    LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...

  2. Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

    cs.CR 2026-04 unverdicted novelty 8.0

    Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

  3. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  4. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  5. Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    The Agent-First Tool API paradigm raises AI agent task success from 64% to 88% and cuts human interventions by 72.7% through semantic phases, structured contracts, and risk governance in a production enterprise system.

  6. RewardHarness: Self-Evolving Agentic Post-Training

    cs.AI 2026-05 unverdicted novelty 7.0

    RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

  7. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  8. TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

    cs.SE 2026-05 unverdicted novelty 7.0

    TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

  9. The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

    cs.CL 2026-04 accept novelty 7.0

    SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

  10. Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

    cs.GR 2026-04 unverdicted novelty 7.0

    Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.

  11. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  12. Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

    cs.CR 2026-04 unverdicted novelty 7.0

    A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

  13. SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

  14. GraSP: Graph-Structured Skill Compositions for LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    GraSP introduces executable skill graphs that improve LLM agent rewards by up to 19 points and reduce steps by up to 41% over ReAct, Reflexion, ExpeL, and flat-skill baselines across ALFWorld, ScienceWorld, WebShop, a...

  15. Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

  16. Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

    cs.CR 2026-04 unverdicted novelty 7.0

    The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.

  17. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  18. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    cs.CR 2024-10 unverdicted novelty 7.0

    ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...

  19. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  20. Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...

  21. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

  22. Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation

    cs.AI 2026-05 unverdicted novelty 6.0

    A single configuration file generates causally coherent synthetic MES data across domains and guarantees zero tool-parameter hallucination when AI tools are ontology-constrained.

  23. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  24. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  25. EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

    cs.AI 2026-05 unverdicted novelty 6.0

    EnvSimBench reveals that state-of-the-art LLMs exhibit a universal state change cliff in environment simulation, with a new constraint-driven pipeline raising synthesis yield by 6.8% and cutting costs over 90%.

  26. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 6.0

    Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.

  27. Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis

    cs.CR 2026-05 unverdicted novelty 6.0

    Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...

  28. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

    cs.SE 2026-04 unverdicted novelty 6.0

    Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

  29. From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

    cs.CL 2026-04 unverdicted novelty 6.0

    SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.

  30. Time Series Augmented Generation for Financial Applications

    cs.AI 2026-04 unverdicted novelty 6.0

    TSAG lets LLMs use external tools for financial time series analysis, with a new benchmark showing capable agents achieve near-perfect tool accuracy and minimal hallucination.

  31. When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis

    cs.AI 2026-04 unverdicted novelty 6.0

    LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.

  32. Auditable Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms f...

  33. From Data to Theory: Autonomous Large Language Model Agents for Materials Science

    cs.AI 2026-04 unverdicted novelty 6.0

    An LLM agent autonomously selects, codes, and validates materials equations from data, recovering known laws reliably but requiring checks for new or specialized cases.

  34. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

  35. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    cs.LG 2024-10 accept novelty 6.0

    AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.

  36. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  37. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  38. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  39. Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

    cs.CL 2026-05 unverdicted novelty 5.0

    Grep retrieval generally outperforms vector retrieval in agentic search tasks, with performance varying strongly by agent harness and tool-calling style.

  40. Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay

    cs.AI 2026-05 unverdicted novelty 5.0

    The LOOP Skill Engine records one LLM-powered run of a periodic task and converts it into a deterministic replay template that eliminates further LLM usage while maintaining high success rates.

  41. The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems

    cs.AI 2026-05 unverdicted novelty 5.0

    Ontology-grounded tool architectures eliminate hallucination of domain identifiers in industrial AI agents by enforcing semantic constraints through a typed relational configuration and three-operation interface.

  42. Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

    cs.AI 2026-05 unverdicted novelty 5.0

    A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.

  43. Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution

    cs.SE 2026-04 conditional novelty 5.0

    Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.

  44. LLM-Rosetta: A Hub-and-Spoke Intermediate Representation for Cross-Provider LLM API Translation

    cs.SE 2026-04 unverdicted novelty 5.0

    A hub-and-spoke IR with a 9-type content model and 10-type stream schema enables bidirectional, lossless translation between major LLM APIs with sub-100 microsecond overhead.

  45. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

  46. Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work

    cs.AI 2026-04 unverdicted novelty 4.0

    Vibe Medicine proposes directing AI agents via natural language for end-to-end biomedical workflows using LLMs, agent frameworks, and a curated collection of over 1,000 medical skills.

  47. Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

    cs.CL 2026-04 unverdicted novelty 4.0

    A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.

  48. Empirical Comparison of Agent Communication Protocols for Task Orchestration

    cs.AI 2026-03 unverdicted novelty 4.0

    This work provides an empirical comparison of tool integration, multi-agent delegation, and hybrid architectures for LLM task orchestration, measuring response time, context consumption, cost, error recovery, and impl...

  49. Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

    cs.CL 2026-05 unverdicted novelty 3.0

    EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 49 Pith papers · 22 internal anchors

  1. [1]

    Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691

  2. [2]

    Andor, D., He, L., Lee, K., and Pitler, E. (2019). Giving BERT a calculator: Finding operations and arguments with reading comprehension. arXiv preprint arXiv:1909.00109

  3. [3]

    Anthropic (2022). Claude

  4. [4]

    Bavishi, R., Lemieux, C., Fox, R., Sen, K., and Stoica, I. (2019). AutoPandas: neural-backed generators for program synthesis. Proceedings of the ACM on Programming Languages, 3(OOPSLA):1–27

  5. [5]

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901

  6. [6]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712

  7. [7]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

  8. [8]

    Chen, X., Lin, M., Schärli, N., and Zhou, D. (2023). Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128

  9. [9]

    Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. (2023). Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

  10. [10]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2022). Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311

  11. [11]

    Scaling Instruction-Finetuned Language Models

    Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. (2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416

  12. [12]

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  13. [13]

    Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A.-r., and Kohli, P. (2017). Robustfill: Neural program learning under noisy I/O. In International Conference on Machine Learning, pages 990–998. PMLR

  14. [14]

    Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. (2022). Pal: Program-aided language models. arXiv preprint arXiv:2211.10435

  15. [15]

    OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

    Iyer, S., Lin, X. V., Pasunuru, R., Mihaylov, T., Simig, D., Yu, P., Shuster, K., Wang, T., Liu, Q., Koura, P. S., et al. (2022). Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017

  16. [16]

    Jain, N., Vaidyanath, S., Iyer, A., Natarajan, N., Parthasarathy, S., Rajamani, S., and Sharma, R. (2022). Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering, pages 1219–1231

  17. [17]

    Kim, G., Baldi, P., and McAleer, S. (2023). Language models can solve computer tasks. arXiv preprint arXiv:2303.17491

  18. [18]

    Large Language Models are Zero-Shot Reasoners

    Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916

  19. [19]

    Komeili, M., Shuster, K., and Weston, J. (2021). Internet-augmented dialogue generation. arXiv preprint arXiv:2107.07566

  20. [20]

    Lachaux, M.-A., Roziere, B., Chanussot, L., and Lample, G. (2020). Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511

  21. [21]

    Lazaridou, A., Gribovskaya, E., Stokowiec, W., and Grigorev, N. (2022). Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115

  22. [22]

    StarCoder: may the source be with you!

    Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. (2023). Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161

  23. [23]

    Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. (2022). Competition-level code generation with alphacode. Science, 378(6624):1092–1097

  24. [24]

    Liang, Y., Wu, C., Song, T., Wu, W., Xia, Y., Liu, Y., Ou, Y., Lu, S., Ji, L., Mao, S., et al. (2023). Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. arXiv preprint arXiv:2303.16434

  25. [25]

    Menon, A., Tamuz, O., Gulwani, S., Lampson, B., and Kalai, A. (2013). A machine learning framework for programming by example. In International Conference on Machine Learning, pages 187–195. PMLR

  26. [26]

    Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332

  27. [27]

    Nijkamp, E., Hayashi, H., Xiong, C., Savarese, S., and Zhou, Y. (2023). Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint arXiv:2305.02309

  28. [28]

    Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. (2022). Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474

  29. [29]

    GPT-4 Technical Report

    OpenAI (2023). GPT-4 technical report

  30. [30]

    OpenAI (2022). ChatGPT. https://openai.com/blog/chatgpt

  31. [31]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. (2021). Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207

  32. [32]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. (2022). Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100

  33. [33]

    Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761

  34. [34]

    Schick, T. and Schütze, H. (2020). Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676

  35. [35]

    Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2023). Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580

  36. [36]

    Shinn, N., Labash, B., and Gopinath, A. (2023). Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366

  37. [37]

    Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage

    Shuster, K., Xu, J., Komeili, M., Ju, D., Smith, E. M., Roller, S., Ung, M., Chen, M., Arora, K., Lane, J., et al. (2022). Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv preprint arXiv:2208.03188

  38. [38]

    Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca

  39. [39]

    Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. (2022). Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239

  40. [40]

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  41. [41]

    Vemprala, S., Bonatti, R., Bucker, A., and Kapoor, A. (2023). Chatgpt for robotics: Design principles and model abilities

  42. [42]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. (2022a). Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560

  43. [43]

    Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Naik, A., Ashok, A., Dhanasekaran, A. S., Arunkumar, A., Stap, D., et al. (2022b). Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109

  44. [44]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903

  45. [45]

    Xu, F. F., Alon, U., Neubig, G., and Hellendoorn, V. J. (2022). A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 1–10

  46. [46]

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2022). React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629

  47. [47]

    Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. (2022). Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414

  48. [48]

    OPT: Open Pre-trained Transformer Language Models

    Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. (2022). Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068