Pith · machine review for the scientific record

arxiv: 2305.18323 · v1 · submitted 2023-05-23 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:10 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords ReWOO · augmented language models · reasoning without observation · token efficiency · LLM tool use · model offloading · HotpotQA · multi-step reasoning

The pith

ReWOO generates a full reasoning plan without tool observations first, then executes it in one pass to cut token use and allow smaller models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReWOO to fix the inefficiency of augmented language models that interleave reasoning steps with tool responses. By generating an entire plan of tool calls and logic upfront and then running it once observations arrive, the method avoids repeated prompts and redundant computation. Evaluations across six public NLP benchmarks and a curated dataset show consistent gains in efficiency and accuracy. The separation also makes it possible to fine-tune smaller models to take over reasoning tasks originally handled by much larger ones.

Core claim

ReWOO decouples the reasoning process from external observations by first producing a complete plan of tool calls and subsequent logic without any tool responses, then executing that plan in a single pass once observations are fetched. This modular structure lowers overall token consumption compared to interleaved approaches while preserving or improving task performance on multi-step reasoning problems.
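The plan-then-execute flow described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact prompt format: the tool names, the `#E` placeholder convention (modeled on ReWOO's evidence variables), and the stub search results are assumptions for demonstration.

```python
# Minimal sketch of a ReWOO-style Planner/Worker split (illustrative only).

def plan(question):
    # In ReWOO, a Planner LLM emits the complete plan before any tool runs.
    # Hard-coded here to isolate the execution mechanics.
    return [
        ("#E1", "search", "director of Inception"),
        ("#E2", "search", "birth year of #E1"),  # references a prior step symbolically
    ]

def execute(steps, tools):
    # Worker: one pass over the plan, substituting earlier observations
    # for placeholders before each tool call.
    evidence = {}
    for var, tool, arg in steps:
        for k, v in evidence.items():
            arg = arg.replace(k, v)
        evidence[var] = tools[tool](arg)
    return evidence

# Stub tool standing in for a real retriever.
tools = {"search": lambda q: {
    "director of Inception": "Christopher Nolan",
    "birth year of Christopher Nolan": "1970",
}.get(q, "unknown")}

ev = execute(plan("When was the director of Inception born?"), tools)
```

In the actual system the plan is one LLM generation and a separate Solver LLM composes the final answer from the collected evidence; both are stubbed out here to highlight the single-pass substitution step.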

What carries the argument

The ReWOO paradigm that generates a full reasoning plan independently before any tool observations arrive and then executes the plan in one pass.

If this is right

  • Reduces token consumption by a factor of five on multi-step reasoning benchmarks such as HotpotQA.
  • Improves accuracy by about four percent on HotpotQA relative to interleaved methods.
  • Preserves performance even when external tools return no response or fail.
  • Enables instruction fine-tuning that transfers reasoning ability from a 175B model to a 7B model.
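A back-of-envelope token count shows where the efficiency headroom comes from. The numbers below are illustrative assumptions, not the paper's measurements: interleaved prompting re-sends the growing history at every step, while the decoupled setup pays for the context roughly twice, once for the Planner and once for the Solver.

```python
# Toy token accounting: interleaved (ReAct-style) vs. decoupled (ReWOO-style).
base, step, n = 500, 100, 5  # context/exemplar tokens, tokens added per step, steps

# Interleaved: step k re-sends the context plus all k prior thought/observation pairs.
interleaved = sum(base + k * step for k in range(1, n + 1))

# Decoupled: Planner sees the context once; Solver sees context + plan + observations once.
rewoo = base + (base + n * step)

ratio = interleaved / rewoo
```

With these toy numbers the ratio is about 2.7; in practice the long few-shot exemplars of interleaved prompts make `base` dominate, which is how the gap can widen toward the reported 5x.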

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The upfront planning structure could support running tool-augmented systems on hardware with tight memory limits by reducing per-step context size.
  • Partial observation feedback could be added later without fully reverting to interleaved prompting, offering a middle ground for adaptive tasks.
  • Specialized fine-tuning on distinct tool sets becomes simpler because the reasoning module no longer needs to be retrained alongside every tool change.

Load-bearing premise

That a complete reasoning plan can be generated without any intermediate tool results and that single-pass execution will not miss information that interleaved observations would have provided.

What would settle it

Compare ReWOO against an interleaved baseline on multi-hop questions where each tool call depends on the exact output of the prior call; if ReWOO accuracy falls sharply on cases the baseline solves correctly, the decoupling claim is challenged.

read the original abstract

Augmented Language Models (ALMs) blend the reasoning capabilities of Large Language Models (LLMs) with tools that allow for knowledge retrieval and action execution. Existing ALM systems trigger LLM thought processes while pulling observations from these tools in an interleaved fashion. Specifically, an LLM reasons to call an external tool, gets halted to fetch the tool's response, and then decides the next action based on all preceding response tokens. Such a paradigm, though straightforward and easy to implement, often leads to huge computation complexity from redundant prompts and repeated execution. This study addresses such challenges for the first time, proposing a modular paradigm ReWOO (Reasoning WithOut Observation) that detaches the reasoning process from external observations, thus significantly reducing token consumption. Comprehensive evaluations across six public NLP benchmarks and a curated dataset reveal consistent performance enhancements with our proposed methodology. Notably, ReWOO achieves 5x token efficiency and 4% accuracy improvement on HotpotQA, a multi-step reasoning benchmark. Furthermore, ReWOO demonstrates robustness under tool-failure scenarios. Beyond prompt efficiency, decoupling parametric modules from non-parametric tool calls enables instruction fine-tuning to offload LLMs into smaller language models, thus substantially reducing model parameters. Our illustrative work offloads reasoning ability from 175B GPT3.5 into 7B LLaMA, demonstrating the significant potential for truly efficient and scalable ALM systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ReWOO, a modular paradigm for augmented language models that decouples reasoning from tool observations by first generating a complete static plan via a Planner and then executing it in a single pass via a Worker module. It reports consistent gains across six NLP benchmarks plus a curated dataset, including 5x token efficiency and 4% accuracy improvement on HotpotQA, robustness under tool failures, and successful offloading of reasoning from 175B GPT-3.5 to 7B LLaMA via instruction fine-tuning.

Significance. If the core claims hold, ReWOO could enable substantially more efficient ALM deployments by eliminating interleaved prompting overhead and supporting model compression, with direct implications for scalable multi-step reasoning systems.

major comments (3)
  1. [§3.2] Planner description: The generation of a non-adaptive plan is presented without addressing how entity references for subsequent hops (e.g., second-hop queries in HotpotQA) are resolved without the first observation; this directly underpins the reported accuracy and efficiency numbers.
  2. [Table 3] HotpotQA row: The 4% accuracy lift and 5x token reduction are reported without an ablation on plan completeness or cases where the first observation deviates from the Planner's implicit template, leaving the central decoupling assumption untested for the benchmark that most requires adaptive branching.
  3. [§5.2] Tool-failure robustness: The robustness test is described at a high level but provides no quantitative breakdown of recovery success when the Worker must interpret a plan that was generated without the actual observations, which is required to substantiate the claim.
minor comments (2)
  1. [Abstract] The abstract states 'six public NLP benchmarks' but does not enumerate them; the full list and per-benchmark breakdowns should appear in §4.
  2. [§3] Notation for Planner output format and Worker execution trace could be formalized with a short pseudocode block or equations in §3 to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on ReWOO. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3.2] Planner description: The generation of a non-adaptive plan is presented without addressing how entity references for subsequent hops (e.g., second-hop queries in HotpotQA) are resolved without the first observation; this directly underpins the reported accuracy and efficiency numbers.

    Authors: The Planner employs symbolic placeholders (e.g., 'result_of_step_1') to reference prior steps in the static plan. The Worker resolves these by substituting actual tool observations at execution time. We will revise §3.2 to include an explicit HotpotQA example demonstrating this reference resolution mechanism. revision: yes

  2. Referee: [Table 3] HotpotQA row: The 4% accuracy lift and 5x token reduction are reported without an ablation on plan completeness or cases where the first observation deviates from the Planner's implicit template, leaving the central decoupling assumption untested for the benchmark that most requires adaptive branching.

    Authors: We agree an ablation on plan completeness and deviation cases is needed to validate the core assumption. We will add this analysis to the revised Table 3 (or a new subsection), reporting accuracy and token metrics when the first observation deviates from the plan template on HotpotQA. revision: yes

  3. Referee: [§5.2] Tool-failure robustness: The robustness test is described at a high level but provides no quantitative breakdown of recovery success when the Worker must interpret a plan that was generated without the actual observations, which is required to substantiate the claim.

    Authors: The §5.2 experiments simulate failures by supplying erroneous or null observations to the Worker. We will expand this section with a quantitative breakdown, including recovery success rates across failure scenarios (e.g., null vs. incorrect results), to substantiate the robustness claim. revision: yes
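The mechanism defended in responses 1 and 3, symbolic placeholder resolution plus graceful degradation on tool failure, can be sketched as follows. The `#E` variable syntax, tool names, and failure markers are hypothetical stand-ins for the Worker behavior the authors describe.

```python
# Hypothetical sketch: the Worker substitutes an explicit failure marker so
# single-pass execution always completes and the Solver can still attempt an
# answer from partial evidence.

def run_step(tool, arg):
    try:
        out = tool(arg)
        return out if out is not None else "[no evidence]"
    except Exception:
        return "[tool error]"

def worker(steps, tools):
    evidence = {}
    for var, tool_name, arg in steps:
        # Resolve symbolic references to earlier steps before calling the tool.
        for k, v in evidence.items():
            arg = arg.replace(k, v)
        # Missing tools degrade to a null observation rather than halting the pass.
        evidence[var] = run_step(tools.get(tool_name, lambda _: None), arg)
    return evidence

steps = [("#E1", "search", "capital of France"),
         ("#E2", "broken_tool", "population of #E1")]
tools = {"search": lambda q: "Paris"}
ev = worker(steps, tools)
```

The degraded trace still reaches the Solver with every `#E` slot filled, which is the property the robustness claim in §5.2 depends on.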

Circularity Check

0 steps flagged

No circularity in ReWOO's empirical paradigm

full rationale

The paper introduces ReWOO as a structural decoupling of reasoning-plan generation from tool observations. All performance claims (5x token efficiency, the 4% HotpotQA accuracy lift, offloading from 175B to 7B) rest on direct empirical measurements across benchmarks rather than on any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the core method; the efficiency advantage follows immediately from avoiding interleaved prompting, which is a design choice evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes standard LLM next-token prediction and tool APIs function as black boxes; no new mathematical axioms or invented physical entities are introduced.

axioms (1)
  • domain assumption LLMs can generate coherent multi-step plans without intermediate observations
    Invoked when the model is asked to produce the full reasoning chain before any tool results are supplied.

pith-pipeline@v0.9.0 · 5562 in / 1251 out tokens · 26486 ms · 2026-05-15T18:10:30.004133+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

    cs.SE 2026-05 unverdicted novelty 7.0

    SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...

  2. TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across...

  3. Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

    cs.AI 2026-05 unverdicted novelty 7.0

    DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data,...

  4. Evaluating Plan Compliance in Autonomous Programming Agents

    cs.SE 2026-04 unverdicted novelty 7.0

    Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...

  5. KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

    cs.RO 2026-04 unverdicted novelty 7.0

    KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.

  6. Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

  7. LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.

  8. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  9. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 6.0

    Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...

  10. QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

    cs.MA 2026-04 unverdicted novelty 6.0

    QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.

  11. Complete Cyclic Subtask Graphs for Tool-Using LLM Agents: Flexibility, Cost, and Bottlenecks in Multi-Agent Workflows

    cs.MA 2026-04 unverdicted novelty 6.0

    Complete cyclic subtask graphs offer a lens to measure when multi-agent revisitation aids recovery and exploration versus when it increases costs or is dominated by other bottlenecks in LLM agent workflows.

  12. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  13. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  14. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  15. SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks

    cs.AI 2026-05 conditional novelty 5.0

    SPIN enforces DAG-valid plans and prefix-based stopping for LLM agents, cutting executed tasks from 1061 to 623 and tool calls from 11.81 to 6.82 per run on AssetOpsBench while raising success from 0.638 to 0.706.

  16. Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling

    cs.CL 2026-05 unverdicted novelty 5.0

    Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.

  17. RealRoute: Dynamic Query Routing System via Retrieve-then-Verify Paradigm

    cs.IR 2026-03 unverdicted novelty 5.0

    RealRoute uses parallel source-agnostic retrieval followed by dynamic verification to improve accuracy over predictive LLM routers in heterogeneous multi-hop RAG tasks.

  18. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 4.0

    Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...

  19. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 18 Pith papers · 18 internal anchors

  1. [1]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

  2. [2]

    Augmented language models: a survey

    Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023

  3. [3]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023

  4. [4]

    Taskmatrix

    Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. Taskmatrix. ai: Completing tasks by connecting foundation models with millions of apis. arXiv preprint arXiv:2303.16434, 2023

  5. [5]

    Tool learning with foundation models

    Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023

  6. [6]

    Language models can solve computer tasks

    Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023

  7. [7]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

  8. [8]

    ChemCrow: Augmenting large-language models with chemistry tools

    Andres M Bran, Sam Cox, Andrew D White, and Philippe Schwaller. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376, 2023

  9. [9]

    Do as i can, not as i say: Grounding language in robotic affordances

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on Robot Learning, pages 287–318. PMLR, 2023

  10. [10]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  11. [11]

    Auto-GPT: An autonomous GPT-4 experiment

    Auto-GPT: An autonomous GPT-4 experiment. https://github.com/Significant-Gravitas/Auto-GPT, 2023. [Online; accessed 13-May-2023]

  12. [12]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  13. [13]

    Alpaca: A strong, replicable instruction-following model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023

  14. [14]

    Specializing smaller language models towards multi-step reasoning

    Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023

  15. [15]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

  16. [16]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

  17. [17]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  18. [18]

    STaR: Bootstrapping Reasoning with Reasoning

    Eric Zelikman, Jesse Mu, Noah D Goodman, and Yuhuai Tony Wu. STaR: Bootstrapping reasoning with reasoning. 2022

  19. [19]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017

  20. [20]

    Sports understanding

    Ethan Kim. Sports understanding. https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/sports_understanding, 2022. [Online; accessed 13-May-2023]

  21. [21]

    Bigbench: Towards an industry standard benchmark for big data analytics

    Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, and Hans-Arno Jacobsen. Bigbench: Towards an industry standard benchmark for big data analytics. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1197–1208, 2013

  22. [22]

    Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies

    Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021

  23. [23]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  24. [24]

    Designing effective questions for classroom response system teaching

    Ian D Beatty, William J Gerace, William J Leonard, and Robert J Dufresne. Designing effective questions for classroom response system teaching. American Journal of Physics, 74(1):31–39, 2006

  25. [25]

    PAL: Program-aided Language Models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022

  26. [26]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  27. [27]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020

  28. [28]

    https://github.com/hwchase17/langchain, 2023

    Langchain. https://github.com/hwchase17/langchain, 2023. [Online; accessed 13-May-2023]

  29. [29]

    Internet-augmented language models through few-shot prompting for open-domain question answering

    Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022

  30. [30]

    Code as Policies: Language Model Programs for Embodied Control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753, 2022

  31. [31]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  32. [32]

    Chatgpt for robotics: Design principles and model abilities

    Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. Chatgpt for robotics: Design principles and model abilities. 2023

  33. [33]

    Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

    Albert Q Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothée Lacroix, Yuhuai Wu, and Guillaume Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283, 2022

  34. [34]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023

  35. [35]

    Art: Automatic multi-step reasoning and tool-use for large language models

    Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023

  36. [36]

    Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022

  37. [37]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023

  38. [38]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021

  39. [39]

    Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models

    David Wingate, Mohammad Shoeybi, and Taylor Sorensen. Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models. arXiv preprint arXiv:2210.03162, 2022

  40. [40]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

  41. [41]

    Towards a unified view of parameter-efficient transfer learning

    Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021

  42. [42]

    Batch Prompting: Efficient Inference with Large Language Model APIs

    Zhoujun Cheng, Jungo Kasai, and Tao Yu. Batch prompting: Efficient inference with large language model apis. arXiv preprint arXiv:2301.08721, 2023