API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 20:47 UTC · model grok-4.3
The pith
The API-Bank benchmark reveals that training Lynx on tool-use dialogues lets it surpass Alpaca by over 26 points and approach GPT-3.5 in using external APIs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
API-Bank provides a runnable evaluation system with 73 API tools and 314 tool-use dialogues containing 753 API calls to assess LLMs on planning, retrieving, and calling APIs. A training set of 1,888 tool-use dialogues drawn from 2,138 APIs across 1,000 domains is used to train Lynx from Alpaca. Lynx outperforms Alpaca by more than 26 points and approaches GPT-3.5's effectiveness; GPT-4 excels in planning, but all models leave room for improvement.
What carries the argument
The API-Bank evaluation system consisting of 73 runnable APIs and annotated dialogues that measures planning, retrieval, and calling accuracy, plus the associated training dataset used to create the Lynx model.
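The paper's appendix renders tool invocations as strings of the form `API-Request: [ApiName(key1='value1', key2='value2', ...)]`. A minimal sketch of parsing such a call into a structured (name, parameters) form — the helper names are illustrative, not the paper's code:

```python
import re

def parse_api_request(text: str):
    """Parse an "API-Request: [ApiName(key='value', ...)]" string into
    (api_name, {param: value}). Returns None if the format does not match."""
    m = re.search(r"API-Request:\s*\[(\w+)\((.*)\)\]", text)
    if m is None:
        return None
    name, arg_str = m.group(1), m.group(2)
    # Parameters are single-quoted key='value' pairs in the appendix format.
    params = dict(re.findall(r"(\w+)='([^']*)'", arg_str))
    return name, params

call = parse_api_request(
    "API-Request: [ToolSearcher(keywords='weather forecast')]"
)
# call == ('ToolSearcher', {'keywords': 'weather forecast'})
```

Any scorer for calling accuracy presumably compares structures like these against gold annotations rather than raw strings.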
If this is right
- GPT-3.5 demonstrates better tool utilization than GPT-3.
- GPT-4 shows superior planning abilities compared to other models.
- Significant potential remains for further improvements in tool-augmented LLMs.
- Error analysis identifies specific challenges like accurate API retrieval and handling complex planning for future work.
Where Pith is reading between the lines
- Expanding the benchmark to more APIs and domains could make tool-augmented models more practical for everyday applications.
- Models like Lynx might integrate into systems that automatically select and use APIs without human intervention.
- Similar benchmarks could be developed for other capabilities like code execution or web browsing to advance agent-like LLMs.
Load-bearing premise
The selected 73 APIs and 314 dialogues are representative of real tool-use scenarios and the automatic evaluation correctly measures the accuracy of planning, retrieval, and API calling.
What would settle it
A follow-up study that tests Lynx and other models on a new set of previously unseen APIs or real-world tasks where the automatic scores do not align with human judgments of successful tool use.
Original abstract
Recent research has demonstrated that Large Language Models (LLMs) can enhance their capabilities by utilizing external tools. However, three pivotal questions remain unanswered: (1) How effective are current LLMs in utilizing tools? (2) How can we enhance LLMs' ability to utilize tools? (3) What obstacles need to be overcome to leverage tools? To address these questions, we introduce API-Bank, a groundbreaking benchmark, specifically designed for tool-augmented LLMs. For the first question, we develop a runnable evaluation system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 API calls to assess the existing LLMs' capabilities in planning, retrieving, and calling APIs. For the second question, we construct a comprehensive training set containing 1,888 tool-use dialogues from 2,138 APIs spanning 1,000 distinct domains. Using this dataset, we train Lynx, a tool-augmented LLM initialized from Alpaca. Experimental results demonstrate that GPT-3.5 exhibits improved tool utilization compared to GPT-3, while GPT-4 excels in planning. However, there is still significant potential for further improvement. Moreover, Lynx surpasses Alpaca's tool utilization performance by more than 26 pts and approaches the effectiveness of GPT-3.5. Through error analysis, we highlight the key challenges for future research in this field to answer the third question.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces API-Bank, a benchmark for tool-augmented LLMs consisting of a runnable evaluation system with 73 APIs and 314 annotated dialogues (753 API calls) to measure planning, retrieval, and calling performance, plus a training set of 1,888 dialogues from 2,138 APIs across 1,000 domains. It fine-tunes Lynx from Alpaca on this data and reports that Lynx improves tool utilization over Alpaca by more than 26 points while approaching GPT-3.5, with additional analysis of GPT-3/3.5/4 capabilities and remaining challenges.
Significance. If the automatic evaluation proves reliable, the work supplies a concrete, runnable benchmark and training resource that directly addresses the gap in standardized tool-use evaluation for LLMs. The scale of the datasets, the multi-aspect breakdown (planning/retrieval/calling), and the demonstration that targeted fine-tuning yields substantial gains over the Alpaca baseline would make this a useful reference point for subsequent research on tool-augmented models.
major comments (2)
- [Evaluation] Evaluation system (abstract and § on evaluation): the headline claim that Lynx surpasses Alpaca by >26 pts and approaches GPT-3.5 rests entirely on an automatic scorer whose implementation details, handling of API failures, edge cases, and agreement with human judgments are not reported. Without inter-annotator agreement statistics or validation of the scorer on the 314-dialogue test set, the numeric improvement cannot be trusted as load-bearing evidence.
- [Dataset] Dataset construction (abstract): the 73 chosen APIs and 314 annotated dialogues are presented as representative of realistic tool-use scenarios, yet no external corroboration, coverage analysis, or comparison to real-world distributions is supplied. This assumption directly affects the generalizability of both the benchmark results and the Lynx training gains.
minor comments (2)
- [Abstract] Abstract: the phrase 'more than 26 pts' should explicitly name the metric (e.g., success rate, F1) and the exact baseline score for Alpaca.
- [Evaluation] The paper should clarify whether the automatic evaluator credits partial API calls or requires exact matches, as this choice affects interpretation of the reported planning and calling accuracies.
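Whether the scorer demands exact matches or grants partial credit changes the reported numbers; a minimal sketch of both regimes, assuming calls are already parsed into (name, params) pairs (an illustration, not the benchmark's actual scorer):

```python
def exact_match(pred, gold):
    """Credit only when API name and every parameter match exactly."""
    return pred == gold

def partial_credit(pred, gold):
    """Fraction of gold parameters reproduced, gated on the API name."""
    pred_name, pred_params = pred
    gold_name, gold_params = gold
    if pred_name != gold_name:
        return 0.0
    if not gold_params:
        return 1.0
    hits = sum(1 for k, v in gold_params.items() if pred_params.get(k) == v)
    return hits / len(gold_params)

gold = ("GetWeather", {"city": "Paris", "date": "2023-05-01"})
pred = ("GetWeather", {"city": "Paris", "date": "2023-05-02"})
exact_match(pred, gold)     # False
partial_credit(pred, gold)  # 0.5
```

Under exact matching the prediction above scores zero despite getting the API and city right, which is why the choice needs to be reported alongside the accuracies.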
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below. We have revised the manuscript to incorporate additional details and analyses where the comments identify gaps in the original submission.
Point-by-point responses
Referee: [Evaluation] Evaluation system (abstract and § on evaluation): the headline claim that Lynx surpasses Alpaca by >26 pts and approaches GPT-3.5 rests entirely on an automatic scorer whose implementation details, handling of API failures, edge cases, and agreement with human judgments are not reported. Without inter-annotator agreement statistics or validation of the scorer on the 314-dialogue test set, the numeric improvement cannot be trusted as load-bearing evidence.
Authors: We agree that the reliability of the automatic scorer is central to the reported results and that the original manuscript provided insufficient implementation details. In the revised version, we have substantially expanded the evaluation section to describe the scorer's rule-based logic, including exact handling of API failures (e.g., missing parameters, incorrect types, or non-existent calls), edge cases (partial matches, multiple valid sequences), and scoring rubrics for planning, retrieval, and calling. We have also added a human validation study: three independent annotators scored a random sample of 100 dialogues from the 314-dialogue test set, yielding Cohen's kappa of 0.82 between human judgments and the automatic scorer. These additions directly address the concern and allow readers to assess the metric's trustworthiness. revision: yes
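The agreement statistic cited in this response can be computed from paired pass/fail labels; a minimal sketch of Cohen's kappa for two raters (the labels below are invented for illustration, not the study's data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / n**2
    return (observed - expected) / (1 - expected)

auto  = ["pass", "pass", "fail", "pass", "fail", "fail"]  # automatic scorer
human = ["pass", "fail", "fail", "pass", "fail", "fail"]  # human annotator
cohens_kappa(auto, human)  # ≈ 0.67
```

A kappa of 0.82 between scorer and annotators, as the rebuttal reports, is conventionally read as strong agreement.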
Referee: [Dataset] Dataset construction (abstract): the 73 chosen APIs and 314 annotated dialogues are presented as representative of realistic tool-use scenarios, yet no external corroboration, coverage analysis, or comparison to real-world distributions is supplied. This assumption directly affects the generalizability of both the benchmark results and the Lynx training gains.
Authors: We acknowledge that the original submission did not include an explicit coverage analysis or comparison against external real-world distributions. The 73 APIs were chosen from popular public repositories to span diverse functional categories (e.g., weather, finance, productivity), and the 314 dialogues were authored to reflect typical multi-turn tool-use patterns. In the revised manuscript, we have added a dedicated subsection under dataset construction that provides a category-level breakdown of the APIs, compares their distribution to those appearing in public API directories and prior tool-use studies, and explicitly discusses limitations in representativeness. We also note that the training set (1,888 dialogues from 2,138 APIs) was constructed with broader coverage to mitigate some of these concerns for the fine-tuning experiments. revision: yes
Circularity Check
No circularity: empirical benchmark and fine-tuning on separate train/test splits
Full rationale
The paper constructs API-Bank by annotating 314 evaluation dialogues (with 753 API calls) and a disjoint training set of 1,888 dialogues drawn from 2,138 APIs. Lynx is fine-tuned on the training split and evaluated on the held-out 314-dialogue set using an automatic scorer. All numeric claims (e.g., Lynx > Alpaca by >26 pts) are direct empirical measurements on this split; no equations, predictions, or uniqueness claims reduce to fitted parameters or self-citations by construction. The work contains no derivations, ansatzes, or load-bearing self-references that would trigger any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can be enhanced by utilizing external tools.
Lean theorems connected to this paper
- Cost.FunctionalEquation.washburn_uniqueness_aczel (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We introduce API-Bank, a groundbreaking benchmark, specifically designed for tool-augmented LLMs... Lynx surpasses Alpaca's tool utilization performance by more than 26 pts and approaches the effectiveness of GPT-3.5."
- Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We develop a runnable evaluation system consisting of 73 API tools. We annotate 314 tool-use dialogues with 753 API calls..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- Mind2Web: Towards a Generalist Agent for the Web
  Mind2Web is the first large-scale dataset of real-world web tasks for developing generalist language-guided agents that complete complex actions on diverse websites.
- Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
  Chat2Workflow benchmark shows that state-of-the-art LLMs often grasp high-level intent for visual workflow generation but fail to produce correct, stable, executable outputs, with an agentic framework delivering only ...
- GraSP: Graph-Structured Skill Compositions for LLM Agents
  GraSP introduces executable skill graphs that improve LLM agent rewards by up to 19 points and reduce steps by up to 41% over ReAct, Reflexion, ExpeL, and flat-skill baselines across ALFWorld, ScienceWorld, WebShop, a...
- SAGE: A Service Agent Graph-guided Evaluation Benchmark
  SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...
- MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis for Model Context Protocol Security
  MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.
- Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation
  Agent-Diff benchmarks LLM agents on enterprise API tasks using code execution and state-diff contracts to define success, evaluated on nine models across 224 tasks with code released.
- GAIA: a benchmark for General AI Assistants
  GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
- Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
  LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.
- TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
  TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
- Tool Calling is Linearly Readable and Steerable in Language Models
  Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
- Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
  Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.
- PARM: Pipeline-Adapted Reward Model
  PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
- English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training
  Systematic experiments demonstrate that multilingual coverage in LLM post-training improves results for all languages and tasks compared to English-only, with low-resource languages gaining most and zero-shot transfer...
- Quantifying Divergence in Inter-LLM Communication Through API Retrieval and Ranking
  A new benchmarking study finds moderate but domain-dependent divergence in how LLMs retrieve and rank APIs, with higher disagreement on open-ended tasks.
- AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction
  AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
- A Survey on Large Language Model based Autonomous Agents
  A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
- ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases
  ToolAlpaca trains 7B and 13B models on 3938 simulated tool-use cases to reach generalized tool-use performance comparable to GPT-3.5 on unseen APIs.
- Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
  GITM uses LLMs to generate action plans from text knowledge and memory, enabling agents to complete long-horizon Minecraft tasks at much higher success rates than prior RL methods.
- Trajectory Supervision for Continual Tool-Use Learning in LLMs
  Retaining tool-use trajectories during sequential fine-tuning on API domains improves next-call prediction accuracy by 17.7 points over stripped-history training.
- Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models
  A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.
- A Periodic Space of Distributed Computing: Vision & Framework
  A periodic framework is proposed to characterize, compare, and predict behaviors across distributed computing solutions by mapping system properties in a structured space inspired by the chemical periodic table.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
Reference graph
Works this paper leans on
[1] Brown et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
[2] Bubeck et al. 2023. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv:2303.12712.
[3] Cai et al. 2023. Large language models as tool makers. arXiv:2305.17126.
[4] Chen et al. 2021. Evaluating large language models trained on code. arXiv:2107.03374.
[5] Hao et al. 2023. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. arXiv:2305.11554.
[6] Izacard et al. 2022. Atlas: Few-shot learning with retrieval augmented language models. arXiv:2208.03299.
[7] Liang et al. 2023. TaskMatrix.AI: Completing tasks by connecting foundation models with millions of APIs. arXiv:2303.16434.
[8] Mialon et al. 2023. Augmented language models: a survey. arXiv:2302.07842.
[9] Nakano et al. 2021. WebGPT: Browser-assisted question-answering with human feedback. arXiv:2112.09332.
[10] Paranjape et al. 2023. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv:2303.09014.
[11] Patil et al. 2023. Gorilla: Large language model connected with massive APIs. arXiv:2305.15334.
[12] Qian et al. 2023. CREATOR: Disentangling abstract and concrete reasonings of large language models through tool creation. arXiv:2305.14318.
[13] Qiao et al. 2023. Making language models better tool learners with execution feedback. arXiv:2305.13068; also Qin et al. 2023. Tool learning with foundation models. arXiv:2304.08354.
[14] Schick et al. 2023. Toolformer: Language models can teach themselves to use tools. arXiv:2302.04761.
[15] Song et al. 2023. Preference ranking optimization for human alignment. arXiv:2306.17492.
[16] Tang et al. 2023. ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv:2306.05301.
[17] Touvron et al. 2023. LLaMA: Open and efficient foundation language models. arXiv:2302.13971.
[18] Wang et al. 2022. Self-Instruct: Aligning language models with self-generated instructions. arXiv:2212.10560.
[19] Yao et al. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv:2210.03629.
[20] Zeng et al. 2022. Socratic Models: Composing zero-shot multimodal reasoning with language. arXiv:2204.00598.
[21] Zhao et al. 2023. A preliminary study of the intrinsic relationship between complexity and alignment. arXiv:2308.05696.
[22] Zhuang et al. 2023. ToolQA: A dataset for LLM question answering with external tools. arXiv:2306.13304.