Recognition: 3 theorem links
Gorilla: Large Language Model Connected with Massive APIs
Pith reviewed 2026-05-11 23:17 UTC · model grok-4.3
The pith
A fine-tuned LLaMA model generates more accurate API calls than GPT-4 for machine learning libraries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gorilla is a LLaMA-based model fine-tuned on API documentation from Hugging Face, TorchHub, and TensorHub. It surpasses GPT-4 in accuracy when writing API calls on the APIBench dataset. When combined with a document retriever, the model adapts to changes in API documentation at test time and reduces hallucinations of incorrect API usage.
What carries the argument
The fine-tuned Gorilla model that maps instructions to API calls, augmented by a retriever that supplies relevant and current API documentation.
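The division of labor described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the lexical scoring function, the prompt template wording, and the mini documentation corpus are all stand-ins for the real retriever, prompt format, and HuggingFace/TorchHub/TensorHub documentation.

```python
# Sketch of retrieval-augmented API-call generation: a retriever scores
# API documentation entries against the user instruction, and the top
# hit is prepended to the prompt the fine-tuned model sees.

def score(query: str, doc: str) -> int:
    """Toy lexical-overlap score (a real system would use BM25 or embeddings)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: dict[str, str], k: int = 1) -> list[str]:
    """Return the names of the k best-matching API documentation entries."""
    ranked = sorted(corpus, key=lambda name: score(query, corpus[name]), reverse=True)
    return ranked[:k]

def build_prompt(instruction: str, docs: list[str]) -> str:
    """Assemble the model input; the template wording here is hypothetical."""
    context = "\n".join(docs)
    return (f"Use this API documentation for reference:\n{context}\n\n"
            f"Instruction: {instruction}\nAPI call:")

# Hypothetical mini-corpus standing in for the three API hubs.
corpus = {
    "torchhub/resnet50": "torch.hub.load for image classification with resnet50",
    "huggingface/bart-large-cnn": "summarization pipeline to summarize an article of text",
}

hits = retrieve("summarize a news article", corpus)
prompt = build_prompt("summarize a news article", [corpus[h] for h in hits])
```

The key design point is that the model never has to memorize current documentation: swapping updated doc text into the corpus changes what the model conditions on at test time without retraining.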
If this is right
- LLMs become more reliable for calling machine learning tools without manual prompt engineering.
- Models can remain current with frequently updated libraries through retrieval instead of full retraining.
- Hallucination of invalid API parameters and signatures is substantially reduced.
- A single open model can handle multiple large API collections without needing closed-source scale.
Where Pith is reading between the lines
- The approach could extend beyond machine learning libraries to web services and software engineering tools.
- Gorilla-style models might serve as the tool-use layer inside larger agent systems.
- Retrieval-augmented fine-tuning may lower the cost of keeping tool-using LLMs up to date compared with scaling model size alone.
Load-bearing premise
Fine-tuning on collected API documentation produces a model that generalizes to new or updated APIs when the retriever supplies accurate and relevant information.
What would settle it
Testing Gorilla on API calls for documentation or libraries introduced after its training cutoff and checking whether accuracy drops or hallucinations increase compared with GPT-4.
Original abstract
Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla's code, model, data, and demo are available at https://gorilla.cs.berkeley.edu
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Gorilla, a LLaMA-based model fine-tuned on API documentation from HuggingFace, TorchHub, and TensorHub. It claims to surpass GPT-4 on API call generation accuracy using the new APIBench benchmark, and when paired with a document retriever, to adapt to test-time documentation changes while substantially reducing hallucinations relative to direct LLM prompting.
Significance. If the headline performance claims hold under controlled conditions, the work demonstrates practical value in domain-specific fine-tuning and retrieval-augmented generation for reliable tool use. The public release of the model, training data, code, and demo is a clear strength that enables direct reproducibility and follow-on research.
Major comments (2)
- Abstract and §4 (Evaluation): The central claim that Gorilla 'surpasses the performance of GPT-4 on writing API calls' is load-bearing, yet the GPT-4 baseline prompting strategy is not described in sufficient detail. It is unclear whether GPT-4 received the same retrieved API documentation provided to the Gorilla+retriever variant or was evaluated zero-shot/few-shot without equivalent context; this distinction is required to rule out information-access artifacts rather than fine-tuning gains.
- §4 (Results) and APIBench description: The reported gains on hallucination mitigation and adaptation to document updates lack explicit quantitative controls. For instance, the fraction of invalid API calls, incorrect arguments, or retrieval-induced errors should be broken down by model variant with the same retrieval corpus and statistical tests; without these, the 'substantially mitigates' claim cannot be fully assessed.
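The breakdown the referee asks for amounts to bucketing each predicted call against a registry of valid API names and argument names. A minimal sketch, with an invented registry and invented predictions (none of these names come from APIBench):

```python
from collections import Counter

def classify_call(pred_api: str, pred_args: dict, registry: dict[str, set]) -> str:
    """Bucket one predicted call into an error category.

    registry maps known API names to their set of valid argument names.
    """
    if pred_api not in registry:
        return "invalid_api_name"       # hallucinated API
    if not set(pred_args) <= registry[pred_api]:
        return "incorrect_arguments"    # real API, bad argument names
    return "ok"

# Hypothetical registry and model outputs for illustration.
registry = {"torch.hub.load": {"repo_or_dir", "model", "pretrained"}}
preds = [
    ("torch.hub.load", {"model": "resnet50"}),
    ("torch.hub.fetch", {}),                  # no such API
    ("torch.hub.load", {"weights": "v2"}),    # invalid argument name
]
counts = Counter(classify_call(api, args, registry) for api, args in preds)
```

Tallying these counts per model variant over the same retrieval corpus is exactly the controlled comparison the comment requests.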
Minor comments (2)
- Figure 2 and §3: The retriever architecture diagram and accompanying text could more clearly label the exact retrieval top-k and embedding model used at inference time.
- §5 (Limitations): The discussion of generalization to entirely unseen APIs is brief; adding a held-out API category in the experimental tables would strengthen the claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where clarifications or additional analyses are needed, we have revised the manuscript accordingly to strengthen the presentation of our results.
Point-by-point responses
-
Referee: Abstract and §4 (Evaluation): The central claim that Gorilla 'surpasses the performance of GPT-4 on writing API calls' is load-bearing, yet the GPT-4 baseline prompting strategy is not described in sufficient detail. It is unclear whether GPT-4 received the same retrieved API documentation provided to the Gorilla+retriever variant or was evaluated zero-shot/few-shot without equivalent context; this distinction is required to rule out information-access artifacts rather than fine-tuning gains.
Authors: We thank the referee for highlighting this important point. In the original submission, GPT-4 baselines were evaluated using standard zero-shot and few-shot prompting without access to the document retriever, while the Gorilla+retriever variant uses retrieved documentation at inference time. The headline claim that Gorilla surpasses GPT-4 refers to the fine-tuned model versus GPT-4 under comparable (non-retrieval) prompting conditions. To remove any ambiguity, we have expanded Section 4 to explicitly describe the prompting setup for every baseline, including the exact context (or lack thereof) provided to GPT-4. We have also added a controlled comparison in which GPT-4 is given the same retrieved documents as Gorilla+retriever, allowing readers to isolate the contribution of fine-tuning from information access. revision: yes
-
Referee: §4 (Results) and APIBench description: The reported gains on hallucination mitigation and adaptation to document updates lack explicit quantitative controls. For instance, the fraction of invalid API calls, incorrect arguments, or retrieval-induced errors should be broken down by model variant with the same retrieval corpus and statistical tests; without these, the 'substantially mitigates' claim cannot be fully assessed.
Authors: We agree that a finer-grained error analysis and statistical support would make the claims more robust. In the revised manuscript we have added a new table in Section 4 that breaks down error types (invalid API names, incorrect arguments, retrieval-induced errors, and other hallucinations) for every model variant, always using the identical retrieval corpus for fair comparison. We also report bootstrap confidence intervals and paired significance tests (McNemar’s test) on the key metrics of hallucination rate and adaptation accuracy. These additions directly address the request for quantitative controls and are now included in both the main results and the APIBench description. revision: yes
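The paired significance test the rebuttal cites can be sketched directly. McNemar's test considers only discordant pairs, i.e. examples where exactly one of the two models is correct; the correctness vectors below are invented for illustration, not APIBench results.

```python
from math import comb

def mcnemar_exact(a_correct: list, b_correct: list) -> float:
    """Exact (binomial) McNemar test on paired per-example correctness.

    Returns the two-sided p-value for the null hypothesis that both
    models are equally likely to win a discordant pair.
    """
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if not a and b)
    n = a_only + b_only            # number of discordant pairs
    if n == 0:
        return 1.0                 # no evidence either way
    k = min(a_only, b_only)
    # Two-sided exact p-value under Binomial(n, 0.5).
    p = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * p)

# Hypothetical per-example correctness for two model variants.
gpt4    = [True, False, False, False, True, False, True, False]
gorilla = [True, True,  True,  False, True, True,  True, True]
p_value = mcnemar_exact(gpt4, gorilla)
```

Because the test conditions on discordant pairs, it is the natural choice when both models are evaluated on the identical set of APIBench examples, as the revised comparison requires.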
Circularity Check
No circularity: empirical fine-tuning and evaluation with public releases
Full rationale
The paper is an empirical contribution describing fine-tuning of LLaMA on API documentation from HuggingFace, TorchHub, and TensorHub to create Gorilla, followed by evaluation on the introduced APIBench dataset and integration with a document retriever. No mathematical derivations, equations, or first-principles predictions exist that could reduce to inputs by construction. Claims rest on experimental performance metrics rather than any self-referential definitions or fitted parameters renamed as predictions. Public release of model, data, and code allows independent verification, rendering the work self-contained without load-bearing self-citations or ansatzes that define the result tautologically.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel — unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes...
-
IndisputableMonolith.Foundation.DimensionForcing.alexander_duality_circle_linking — unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs.
-
IndisputableMonolith.Foundation.PhiForcing.phi_equation — unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Gorilla's retriever-aware training enables it to react to changes in the APIs.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 50 Pith papers
-
Revisable by Design: A Theory of Streaming LLM Agent Execution
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...
-
Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain
Malicious LLM API routers actively perform payload injection and secret exfiltration, with 9 of 428 tested routers showing malicious behavior and further poisoning risks from leaked credentials.
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems
The Agent-First Tool API paradigm raises AI agent task success from 64% to 88% and cuts human interventions by 72.7% through semantic phases, structured contracts, and risk governance in a production enterprise system.
-
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
-
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
-
Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models
A novel function hijacking attack achieves 70-100% success rates in forcing specific function calls across five LLMs on the BFCL benchmark and is robust to context semantics.
-
SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation
SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.
-
GraSP: Graph-Structured Skill Compositions for LLM Agents
GraSP introduces executable skill graphs that improve LLM agent rewards by up to 19 points and reduce steps by up to 41% over ReAct, Reflexion, ExpeL, and flat-skill baselines across ALFWorld, ScienceWorld, WebShop, a...
-
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
-
Causality Laundering: Denial-Feedback Leakage in Tool-Calling LLM Agents
The paper defines causality laundering as an attack leaking information from denial outcomes in LLM tool calls and proposes the Agentic Reference Monitor to block it using denial-aware provenance graphs.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents
LQM-ContextRoute routes tool calls by expected quality per service cycle using contextual bandits and LLM-as-judge feedback, yielding +2.18 pp F1, up to +18 pp accuracy, and +2.91-3.22 pp NDCG gains over SW-UCB on web...
-
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
-
Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation
A single configuration file generates causally coherent synthetic MES data across domains and guarantees zero tool-parameter hallucination when AI tools are ontology-constrained.
-
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
-
Tool Calling is Linearly Readable and Steerable in Language Models
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
-
EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
EnvSimBench reveals that state-of-the-art LLMs exhibit a universal state change cliff in environment simulation, with a new constraint-driven pipeline raising synthesis yield by 6.8% and cutting costs over 90%.
-
An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration
Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.
-
Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis
Semia synthesizes Datalog representations of agent skills via constraint-guided loops to enable reachability queries for semantic risks, finding critical issues in over half of 13,728 real skills with 97.7% recall on ...
-
Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
-
From Skill Text to Skill Structure: The Scheduling-Structural-Logical Representation for Agent Skills
SSL representation disentangles skill scheduling, structure, and logic using an LLM normalizer, improving skill discovery MRR@50 from 0.649 to 0.729 and risk assessment macro F1 from 0.409 to 0.509 over text baselines.
-
Time Series Augmented Generation for Financial Applications
TSAG lets LLMs use external tools for financial time series analysis, with a new benchmark showing capable agents achieve near-perfect tool accuracy and minimal hallucination.
-
When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis
LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.
-
Auditable Agents
No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms f...
-
From Data to Theory: Autonomous Large Language Model Agents for Materials Science
An LLM agent autonomously selects, codes, and validates materials equations from data, recovering known laws reliably but requiring checks for new or specialized cases.
-
ToolRL: Reward is All Tool Learning Needs
A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.
-
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
AgentHarm benchmark shows leading LLMs comply with malicious agent requests and simple jailbreaks enable coherent harmful multi-step execution while retaining capabilities.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
SGLang: Efficient Execution of Structured Language Model Programs
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
-
A Survey on Large Language Model based Autonomous Agents
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
-
Is Grep All You Need? How Agent Harnesses Reshape Agentic Search
Grep retrieval generally outperforms vector retrieval in agentic search tasks, with performance varying strongly by agent harness and tool-calling style.
-
Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay
The LOOP Skill Engine records one LLM-powered run of a periodic task and converts it into a deterministic replay template that eliminates further LLM usage while maintaining high success rates.
-
The Semantic Training Gap: Ontology-Grounded Tool Architectures for Industrial AI Agent Systems
Ontology-grounded tool architectures eliminate hallucination of domain identifiers in industrial AI agents by enforcing semantic constraints through a typed relational configuration and three-operation interface.
-
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
-
Bounded Autonomy for Enterprise AI: Typed Action Contracts and Consumer-Side Execution
Bounded autonomy using typed action contracts and consumer-side execution lets LLMs safely operate enterprise systems, achieving 23 of 25 tasks with zero unsafe executions versus 17 for unconstrained AI across 25 trials.
-
LLM-Rosetta: A Hub-and-Spoke Intermediate Representation for Cross-Provider LLM API Translation
A hub-and-spoke IR with a 9-type content model and 10-type stream schema enables bidirectional, lossless translation between major LLM APIs with sub-100 microsecond overhead.
-
Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
-
Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work
Vibe Medicine proposes directing AI agents via natural language for end-to-end biomedical workflows using LLMs, agent frameworks, and a curated collection of over 1,000 medical skills.
-
Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models
A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.
-
Empirical Comparison of Agent Communication Protocols for Task Orchestration
This work provides an empirical comparison of tool integration, multi-agent delegation, and hybrid architectures for LLM task orchestration, measuring response time, context consumption, cost, error recovery, and impl...
-
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.
Reference graph
Works this paper leans on
- [1] Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. (2022). Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
- [2]
- [3] Anthropic (2022). Claude.
- [4] Bavishi, R., Lemieux, C., Fox, R., Sen, K., and Stoica, I. (2019). AutoPandas: neural-backed generators for program synthesis. Proceedings of the ACM on Programming Languages, 3(OOPSLA):1–27.
- [5] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- [6] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.
- [7] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- [8] Chen, X., Lin, M., Schärli, N., and Zhou, D. (2023). Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128.
- [9] Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., Stoica, I., and Xing, E. P. (2023). Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- [10] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2022). PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311.
- [11] Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. (2022). Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- [12] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [13] Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A.-r., and Kohli, P. (2017). RobustFill: Neural program learning under noisy I/O. In International Conference on Machine Learning, pages 990–998. PMLR.
- [14] Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. (2022). PAL: Program-aided language models. arXiv preprint arXiv:2211.10435.
- [15] Iyer, S., Lin, X. V., Pasunuru, R., Mihaylov, T., Simig, D., Yu, P., Shuster, K., Wang, T., Liu, Q., Koura, P. S., et al. (2022). OPT-IML: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017.
- [16] Jain, N., Vaidyanath, S., Iyer, A., Natarajan, N., Parthasarathy, S., Rajamani, S., and Sharma, R. (2022). Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering, pages 1219–1231.
- [17]
- [18] Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
- [19]
- [20]
- [21]
- [22] Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. (2023). StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161.
- [23] Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. (2022). Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.
- [24]
- [25] Menon, A., Tamuz, O., Gulwani, S., Lampson, B., and Kalai, A. (2013). A machine learning framework for programming by example. In International Conference on Machine Learning, pages 187–195. PMLR.
- [26] Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. (2021). WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- [27]
- [28] Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. (2022). CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474.
- [29]
- [30] OpenAI (2022). ChatGPT. https://openai.com/blog/chatgpt.
- [31] Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., Chaffin, A., Stiegler, A., Scao, T. L., Raja, A., et al. (2021). Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
- [32] Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. (2022). BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
- [33] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
- [34] Schick, T. and Schütze, H. (2020). Exploiting cloze questions for few shot text classification and natural language inference. arXiv preprint arXiv:2001.07676.
- [35] Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2023). HuggingGPT: Solving AI tasks with ChatGPT and its friends in HuggingFace. arXiv preprint arXiv:2303.17580.
- [36] Shinn, N., Labash, B., and Gopinath, A. (2023). Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366.
- [37] Shuster, K., Xu, J., Komeili, M., Ju, D., Smith, E. M., Roller, S., Ung, M., Chen, M., Arora, K., Lane, J., et al. (2022). BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv preprint arXiv:2208.03188.
- [38] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
- [39] Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. (2022). LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
- [40] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- [41] Vemprala, S., Bonatti, R., Bucker, A., and Kapoor, A. (2023). ChatGPT for robotics: Design principles and model abilities.
- [42] Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. (2022a). Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.
- [43] Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Naik, A., Ashok, A., Dhanasekaran, A. S., Arunkumar, A., Stap, D., et al. (2022b). Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109.
- [44] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.
- [45] Xu, F. F., Alon, U., Neubig, G., and Hellendoorn, V. J. (2022). A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, pages 1–10.
- [46] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2022). ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
- [47] Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. (2022). GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
- [48] Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. (2022). OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.