pith. machine review for the scientific record.

arxiv: 2305.15334 · v1 · submitted 2023-05-24 · 💻 cs.CL · cs.AI

Recognition: 3 Lean theorem links

Gorilla: Large Language Model Connected with Massive APIs

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 23:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · API calls · tool use · fine-tuning · retrieval augmentation · machine learning libraries · hallucination

The pith

A fine-tuned LLaMA model generates more accurate API calls than GPT-4 for machine learning libraries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper releases Gorilla, a model created by fine-tuning LLaMA on documentation for Hugging Face, TorchHub, and TensorHub APIs. It shows higher accuracy than GPT-4 at producing correct API calls from natural-language instructions. Pairing the model with a retriever lets it pull fresh documentation at inference time, which helps it adapt to API updates and reduces hallucinated API usage. The authors introduce APIBench as a test set to measure this performance across the three libraries. The work demonstrates that targeted fine-tuning plus retrieval can make LLMs more dependable when they need to call external tools.

Core claim

Gorilla is a LLaMA-based model fine-tuned on API documentation from Hugging Face, TorchHub, and TensorHub. It surpasses GPT-4 in accuracy when writing API calls on the APIBench dataset. When combined with a document retriever, the model adapts to changes in API documentation at test time and reduces hallucinations of incorrect API usage.

What carries the argument

The fine-tuned Gorilla model that maps instructions to API calls, augmented by a retriever that supplies relevant and current API documentation.
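To make that machinery concrete, here is a minimal sketch of the inference loop under stated assumptions: the APIDoc structure, the toy word-overlap retriever, and the prompt template are invented for illustration and are not Gorilla's released implementation; any real retriever (dense or lexical) fills the same slot.

```python
# Minimal sketch of the instruction -> retrieval -> generation loop.
# APIDoc, the word-overlap retriever, and the prompt template are
# illustrative assumptions, not Gorilla's released code.
from dataclasses import dataclass

@dataclass
class APIDoc:
    name: str         # e.g. "transformers.pipeline"
    signature: str    # e.g. "(task, model=None, ...)"
    description: str

def retrieve_docs(query: str, corpus: list[APIDoc], top_k: int = 1) -> list[APIDoc]:
    """Rank documentation entries by naive word overlap with the query.
    Freshness of the corpus, not the ranking trick, is what lets the
    model track API updates without retraining."""
    q = set(query.lower().split())
    def overlap(doc: APIDoc) -> int:
        return len(q & set(f"{doc.name} {doc.description}".lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:top_k]

def build_prompt(instruction: str, docs: list[APIDoc]) -> str:
    # Retrieved documentation is prepended so the model conditions on the
    # library's present-day signatures rather than memorized ones.
    context = "\n".join(f"{d.name}{d.signature}: {d.description}" for d in docs)
    return f"API documentation:\n{context}\n\nInstruction: {instruction}\nAPI call:"

# The prompt would then go to the fine-tuned model (hypothetical handle):
# api_call = gorilla.generate(build_prompt(instruction, retrieve_docs(instruction, corpus)))
```

The key property is the slot, not the retriever: because documentation is fetched at inference time, updating the corpus is enough to track an API change, with no retraining step.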

If this is right

  • LLMs become more reliable for calling machine learning tools without manual prompt engineering.
  • Models can remain current with frequently updated libraries through retrieval instead of full retraining.
  • Hallucination of invalid API parameters and signatures is substantially reduced.
  • A single open model can handle multiple large API collections without needing closed-source scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could extend beyond machine learning libraries to web services and software engineering tools.
  • Gorilla-style models might serve as the tool-use layer inside larger agent systems.
  • Retrieval-augmented fine-tuning may lower the cost of keeping tool-using LLMs up to date compared with scaling model size alone.

Load-bearing premise

Fine-tuning on collected API documentation produces a model that generalizes to new or updated APIs when the retriever supplies accurate and relevant information.

What would settle it

Testing Gorilla on API calls for documentation or libraries introduced after its training cutoff and checking whether accuracy drops or hallucinations increase compared with GPT-4.
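A sketch of what such a test could look like, with hypothetical dataset and model interfaces; the AST-level comparison below is a simplified stand-in for APIBench-style call matching, checking only the first call's dotted name and keyword arguments.

```python
# Sketch of the proposed settling test: score models on API calls whose
# documentation postdates the training cutoff. The AST check is a
# simplified stand-in for APIBench-style matching; the dataset split and
# model interfaces here are hypothetical.
import ast

def first_call(src: str):
    """Return (dotted function name, keyword args) of the first call in
    `src`, or None if the snippet does not parse or contains no call."""
    try:
        tree = ast.parse(src)
    except SyntaxError:
        return None
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            kwargs = {kw.arg: ast.unparse(kw.value) for kw in node.keywords}
            return ast.unparse(node.func), kwargs
    return None

def call_matches(generated: str, reference: str) -> bool:
    got = first_call(generated)
    return got is not None and got == first_call(reference)

def accuracy(generate, examples) -> float:
    """`examples`: (instruction, reference_call) pairs restricted to APIs
    introduced after the model's training data was collected."""
    hits = [call_matches(generate(inst), ref) for inst, ref in examples]
    return sum(hits) / len(hits)

# A drop in accuracy (or a rise in unparseable calls) on the post-cutoff
# slice, relative to GPT-4 given the same retrieved context, would bear
# directly on the generalization claim.
```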

Original abstract

Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Gorilla, a LLaMA-based model fine-tuned on API documentation from HuggingFace, TorchHub, and TensorHub. It claims to surpass GPT-4 on API call generation accuracy using the new APIBench benchmark, and when paired with a document retriever, to adapt to test-time documentation changes while substantially reducing hallucinations relative to direct LLM prompting.

Significance. If the headline performance claims hold under controlled conditions, the work demonstrates practical value in domain-specific fine-tuning and retrieval-augmented generation for reliable tool use. The public release of the model, training data, code, and demo is a clear strength that enables direct reproducibility and follow-on research.

major comments (2)
  1. [Abstract and §4 (Evaluation)] The central claim that Gorilla 'surpasses the performance of GPT-4 on writing API calls' is load-bearing, yet the GPT-4 baseline prompting strategy is not described in sufficient detail. It is unclear whether GPT-4 received the same retrieved API documentation provided to the Gorilla+retriever variant or was evaluated zero-shot/few-shot without equivalent context; this distinction is required to rule out information-access artifacts rather than fine-tuning gains.
  2. [§4 (Results) and APIBench description] The reported gains on hallucination mitigation and adaptation to document updates lack explicit quantitative controls. For instance, the fraction of invalid API calls, incorrect arguments, or retrieval-induced errors should be broken down by model variant with the same retrieval corpus and statistical tests; without these, the 'substantially mitigates' claim cannot be fully assessed.
minor comments (2)
  1. [Figure 2 and §3] The retriever architecture diagram and accompanying text could more clearly label the exact retrieval top-k and embedding model used at inference time.
  2. [§5 (Limitations)] The discussion of generalization to entirely unseen APIs is brief; adding a held-out API category in the experimental tables would strengthen the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where clarifications or additional analyses are needed, we have revised the manuscript accordingly to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract and §4 (Evaluation)] The central claim that Gorilla 'surpasses the performance of GPT-4 on writing API calls' is load-bearing, yet the GPT-4 baseline prompting strategy is not described in sufficient detail. It is unclear whether GPT-4 received the same retrieved API documentation provided to the Gorilla+retriever variant or was evaluated zero-shot/few-shot without equivalent context; this distinction is required to rule out information-access artifacts rather than fine-tuning gains.

    Authors: We thank the referee for highlighting this important point. In the original submission, GPT-4 baselines were evaluated using standard zero-shot and few-shot prompting without access to the document retriever, while the Gorilla+retriever variant uses retrieved documentation at inference time. The headline claim that Gorilla surpasses GPT-4 refers to the fine-tuned model versus GPT-4 under comparable (non-retrieval) prompting conditions. To remove any ambiguity, we have expanded Section 4 to explicitly describe the prompting setup for every baseline, including the exact context (or lack thereof) provided to GPT-4. We have also added a controlled comparison in which GPT-4 is given the same retrieved documents as Gorilla+retriever, allowing readers to isolate the contribution of fine-tuning from information access. revision: yes

  2. Referee: [§4 (Results) and APIBench description] The reported gains on hallucination mitigation and adaptation to document updates lack explicit quantitative controls. For instance, the fraction of invalid API calls, incorrect arguments, or retrieval-induced errors should be broken down by model variant with the same retrieval corpus and statistical tests; without these, the 'substantially mitigates' claim cannot be fully assessed.

    Authors: We agree that a finer-grained error analysis and statistical support would make the claims more robust. In the revised manuscript we have added a new table in Section 4 that breaks down error types (invalid API names, incorrect arguments, retrieval-induced errors, and other hallucinations) for every model variant, always using the identical retrieval corpus for fair comparison. We also report bootstrap confidence intervals and paired significance tests (McNemar’s test) on the key metrics of hallucination rate and adaptation accuracy. These additions directly address the request for quantitative controls and are now included in both the main results and the APIBench description. revision: yes
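For readers who want the statistical recipe spelled out, the sketch below implements the paired tests the rebuttal commits to, on synthetic correctness flags for two models scored against the same retrieval context: an exact McNemar test over discordant pairs and a bootstrap interval on the accuracy gap. The sample size, rates, and variable names are placeholders, not the paper's numbers.

```python
# Sketch of the paired analyses described above. The data is synthetic;
# only the statistical recipe is meant to carry over.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n = 1000
gorilla_ok = rng.random(n) < 0.85   # placeholder per-example correctness flags
gpt4_ok = rng.random(n) < 0.75

# McNemar: only discordant pairs are informative in a paired comparison.
b = int(np.sum(gorilla_ok & ~gpt4_ok))   # Gorilla right, GPT-4 wrong
c = int(np.sum(~gorilla_ok & gpt4_ok))   # GPT-4 right, Gorilla wrong
p_value = binomtest(b, b + c, 0.5).pvalue

# Bootstrap the accuracy gap by resampling examples with replacement.
idx = rng.integers(0, n, size=(2000, n))
diffs = gorilla_ok[idx].mean(axis=1) - gpt4_ok[idx].mean(axis=1)
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"McNemar p = {p_value:.3g}; 95% CI on accuracy gap: [{lo:.3f}, {hi:.3f}]")
```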

Circularity Check

0 steps flagged

No circularity: empirical fine-tuning and evaluation with public releases

Full rationale

The paper is an empirical contribution: it fine-tunes LLaMA on API documentation from HuggingFace, TorchHub, and TensorHub to create Gorilla, evaluates on the newly introduced APIBench dataset, and integrates a document retriever. There are no mathematical derivations, equations, or first-principles predictions that could reduce to their inputs by construction. Claims rest on experimental performance metrics rather than on self-referential definitions or fitted parameters renamed as predictions. Public release of the model, data, and code allows independent verification, leaving the work without load-bearing self-citations or ansatzes that define the result tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced beyond standard practices in LLM fine-tuning and evaluation.

pith-pipeline@v0.9.0 · 5532 in / 1085 out tokens · 91878 ms · 2026-05-11T23:17:08.808487+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches · The paper's claim is directly supported by a theorem in the formal canon.
supports · The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends · The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses · The paper appears to rely on the theorem as machinery.
contradicts · The paper's claim conflicts with a theorem or certificate in the canon.
unclear · Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 49 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Revisable by Design: A Theory of Streaming LLM Agent Execution

    cs.LG 2026-04 unverdicted novelty 8.0

    LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...

  2. Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain

    cs.CR 2026-04 unverdicted novelty 8.0

    Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.

  3. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  4. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  5. Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    The Agent-First Tool API paradigm raises AI agent task success from 64% to 88% and cuts human interventions by 72.7% through semantic phases, structured contracts, and risk governance in a production enterprise system.

  6. RewardHarness: Self-Evolving Agentic Post-Training

    cs.AI 2026-05 unverdicted novelty 7.0

    RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

  7. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  8. TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

    cs.SE 2026-05 unverdicted novelty 7.0

    TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

  9. The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

    cs.CL 2026-04 accept novelty 7.0

    SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

  10. Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

    cs.GR 2026-04 unverdicted novelty 7.0

    Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.

  11. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  12. Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

    cs.CR 2026-04 unverdicted novelty 7.0

    A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.

  13. SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

  14. GraSP: Graph-Structured Skill Compositions for LLM Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    GraSP introduces executable skill graphs that improve LLM agent rewards by up to 19 points and reduce steps by up to 41% over ReAct, Reflexion, ExpeL, and flat-skill baselines across ALFWorld, ScienceWorld, WebShop, a...

  15. Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

  16. Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents

    cs.CR 2026-04 unverdicted novelty 7.0

    The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.

  17. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  18. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    cs.CR 2024-10 unverdicted novelty 7.0

    ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...

  19. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  20. Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

    cs.LG 2026-05 unverdicted novelty 6.0

    LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...

  21. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

  22. Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation

    cs.AI 2026-05 unverdicted novelty 6.0

    A single configuration file generates causally coherent synthetic MES data across domains and guarantees zero tool-parameter hallucination when AI tools are ontology-constrained.

  23. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...

  24. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  25. EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

    cs.AI 2026-05 unverdicted novelty 6.0

    EnvSimBench reveals that state-of-the-art LLMs exhibit a universal state change cliff in environment simulation, with a new constraint-driven pipeline raising synthesis yield by 6.8% and cutting costs over 90%.

  26. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 6.0

    Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.

  27. Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis

    cs.CR 2026-05 unverdicted novelty 6.0

    Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...

  28. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

    cs.SE 2026-04 unverdicted novelty 6.0

    Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

  29. From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills

    cs.CL 2026-04 unverdicted novelty 6.0

    SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.

  30. Time Series Augmented Generation for Financial Applications

    cs.AI 2026-04 unverdicted novelty 6.0

    TSAG lets LLMs use external tools for financial time series analysis, with a new benchmark showing capable agents achieve near-perfect tool accuracy and minimal hallucination.

  31. When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis

    cs.AI 2026-04 unverdicted novelty 6.0

    LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.

  32. Auditable Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms f...

  33. From Data to Theory: Autonomous Large Language Model Agents for Materials Science

    cs.AI 2026-04 unverdicted novelty 6.0

    An LLM agent autonomously selects, codes, and validates materials equations from data, recovering known laws reliably but requiring checks for new or specialized cases.

  34. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

  35. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    cs.LG 2024-10 accept novelty 6.0

    AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.

  36. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  37. SGLang: Efficient Execution of Structured Language Model Programs

    cs.AI 2023-12 conditional novelty 6.0

    SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

  38. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  39. Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

    cs.CL 2026-05 unverdicted novelty 5.0

    Grep retrieval generally outperforms vector retrieval in agentic search tasks, with performance varying strongly by agent harness and tool-calling style.

  40. Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay

    cs.AI 2026-05 unverdicted novelty 5.0

    The LOOP Skill Engine records one LLM-powered run of a periodic task and converts it into a deterministic replay template that eliminates further LLM usage while maintaining high success rates.

  41. The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems

    cs.AI 2026-05 unverdicted novelty 5.0

    Ontology-grounded tool architectures eliminate hallucination of domain identifiers in industrial AI agents by enforcing semantic constraints through a typed relational configuration and three-operation interface.

  42. Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

    cs.AI 2026-05 unverdicted novelty 5.0

    A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.

  43. Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution

    cs.SE 2026-04 conditional novelty 5.0

    Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.

  44. LLM-Rosetta: A Hub-and-Spoke Intermediate Representation for Cross-Provider LLM API Translation

    cs.SE 2026-04 unverdicted novelty 5.0

    A hub-and-spoke IR with a 9-type content model and 10-type stream schema enables bidirectional, lossless translation between major LLM APIs with sub-100 microsecond overhead.

  45. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

  46. Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work

    cs.AI 2026-04 unverdicted novelty 4.0

    Vibe Medicine proposes directing AI agents via natural language for end-to-end biomedical workflows using LLMs, agent frameworks, and a curated collection of over 1,000 medical skills.

  47. Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

    cs.CL 2026-04 unverdicted novelty 4.0

    A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.

  48. Empirical Comparison of Agent Communication Protocols for Task Orchestration

    cs.AI 2026-03 unverdicted novelty 4.0

    This work provides an empirical comparison of tool integration, multi-agent delegation, and hybrid architectures for LLM task orchestration, measuring response time, context consumption, cost, error recovery, and impl...

  49. Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

    cs.CL 2026-05 unverdicted novelty 3.0

    EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · cited by 49 Pith papers · 22 internal anchors

  1. [1]

    Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691

  2. [2]

    Andor, D., He, L., Lee, K., and Pitler, E. (2019). Giving BERT a calculator: Finding operations and arguments with reading comprehension. arXiv preprint arXiv:1909.00109

  3. [3]

    Anthropic (2022). Claude

  4. [4]

    Bavishi, R., Lemieux, C., Fox, R., Sen, K., and Stoica, I. (2019). AutoPandas: neural-backed generators for program synthesis. Proceedings of the ACM on Programming Languages, 3(OOPSLA):1–27

  5. [5]

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901

  6. [6]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712

  7. [7]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

  8. [8]

    Chen, X., Lin, M., Schärli, N., and Zhou, D. (2023). Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128

  9. [9]

    Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. (2023). Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

  10. [10]

    PaLM: Scaling Language Modeling with Pathways

    Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2022). Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311

  11. [11]

    Scaling Instruction-Finetuned Language Models

    Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. (2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416

  12. [12]

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

  13. [13]

    Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A.-r., and Kohli, P. (2017). Robustfill: Neural program learning under noisy I/O. In International Conference on Machine Learning, pages 990–998. PMLR

  14. [14]

    Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. (2022). Pal: Program-aided language models. arXiv preprint arXiv:2211.10435

  15. [15]

    OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

    Iyer, S., Lin, X. V., Pasunuru, R., Mihaylov, T., Simig, D., Yu, P., Shuster, K., Wang, T., Liu, Q., Koura, P. S., et al. (2022). Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017

  16. [16]

    Jain, N., Vaidyanath, S., Iyer, A., Natarajan, N., Parthasarathy, S., Rajamani, S., and Sharma, R. (2022). Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering, pages 1219–1231

  17. [17]

    Kim, G., Baldi, P., and McAleer, S. (2023). Language models can solve computer tasks. arXiv preprint arXiv:2303.17491

  18. [18]

    Large Language Models are Zero-Shot Reasoners

    Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916

  19. [19]

    Komeili, M., Shuster, K., and Weston, J. (2021). Internet-augmented dialogue generation. arXiv preprint arXiv:2107.07566

  20. [20]

    Lachaux, M.-A., Roziere, B., Chanussot, L., and Lample, G. (2020). Unsupervised translation of programming languages. arXiv preprint arXiv:2006.03511

  21. [21]

    Lazaridou, A., Gribovskaya, E., Stokowiec, W., and Grigorev, N. (2022). Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115

  22. [22]

    StarCoder: may the source be with you!

    Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. (2023). Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161

  23. [23]

    Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. (2022). Competition-level code generation with alphacode. Science, 378(6624):1092–1097

  24. [24]

    Liang, Y., Wu, C., Song, T., Wu, W., Xia, Y., Liu, Y., Ou, Y., Lu, S., Ji, L., Mao, S., et al. (2023). Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. arXiv preprint arXiv:2303.16434

  25. [25]

    Menon, A., Tamuz, O., Gulwani, S., Lampson, B., and Kalai, A. (2013). A machine learning framework for programming by example. In International Conference on Machine Learning, pages 187–195. PMLR

  26. [26]

    Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332

  27. [27]

    Nijkamp, E., Hayashi, H., Xiong, C., Savarese, S., and Zhou, Y. (2023). Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint arXiv:2305.02309

  28. [28]

    Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. (2022). Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474

  29. [29]

    GPT-4 Technical Report

    OpenAI (2023). GPT-4 technical report

  30. [30]

    OpenAI (2022). ChatGPT. https://openai.com/blog/chatgpt

  31. [31]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. (2021). Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207

  32. [32]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. (2022). Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100

  33. [33]

    Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761

  34. [34]

    Schick, T. and Schütze, H. (2020). Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676

  35. [35]

    Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2023). Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580

  36. [36]

    Shinn, N., Labash, B., and Gopinath, A. (2023). Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366

  37. [37]

    Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage

    Shuster, K., Xu, J., Komeili, M., Ju, D., Smith, E. M., Roller, S., Ung, M., Chen, M., Arora, K., Lane, J., et al. (2022). Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv preprint arXiv:2208.03188

  38. [38]

    Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca

  39. [39]

    Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. (2022). Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239

  40. [40]

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  41. [41]

    Vemprala, S., Bonatti, R., Bucker, A., and Kapoor, A. (2023). Chatgpt for robotics: Design principles and model abilities

  42. [42]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. (2022a). Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560

  43. [43]

    Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Naik, A., Ashok, A., Dhanasekaran, A. S., Arunkumar, A., Stap, D., et al. (2022b). Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109

  44. [44]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903

  45. [45]

    Xu, F. F., Alon, U., Neubig, G., and Hellendoorn, V. J. (2022). A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 1–10

  46. [46]

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2022). React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629

  47. [47]

    Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. (2022). Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414

  48. [48]

    OPT: Open Pre-trained Transformer Language Models

    Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. (2022). Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068