Pith · machine review for the scientific record

arxiv: 2305.18323 · v1 · submitted 2023-05-23 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:10 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI
keywords ReWOO · augmented language models · reasoning without observation · token efficiency · LLM tool use · model offloading · HotpotQA · multi-step reasoning

The pith

ReWOO generates a full reasoning plan without tool observations first, then executes it in one pass to cut token use and allow smaller models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReWOO to fix the inefficiency of augmented language models that interleave reasoning steps with tool responses. By generating an entire plan of tool calls and logic upfront and then running it once observations arrive, the method avoids repeated prompts and redundant computation. Evaluations across six public NLP benchmarks and a curated dataset show consistent gains in efficiency and accuracy. The separation also makes it possible to fine-tune smaller models to take over reasoning tasks originally handled by much larger ones.

Core claim

ReWOO decouples the reasoning process from external observations by first producing a complete plan of tool calls and subsequent logic without any tool responses, then executing that plan in a single pass once observations are fetched. This modular structure lowers overall token consumption compared to interleaved approaches while preserving or improving task performance on multi-step reasoning problems.
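The plan-then-execute flow described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's exact prompt format: the tool names, the `#E` placeholder convention (modeled on ReWOO's evidence variables), and the stub search results are assumptions for demonstration.

```python
# Minimal sketch of a ReWOO-style Planner/Worker split (illustrative only).

def plan(question):
    # In ReWOO, a Planner LLM emits the complete plan before any tool runs.
    # Hard-coded here to isolate the execution mechanics.
    return [
        ("#E1", "search", "director of Inception"),
        ("#E2", "search", "birth year of #E1"),  # references a prior step symbolically
    ]

def execute(steps, tools):
    # Worker: one pass over the plan, substituting earlier observations
    # for placeholders before each tool call.
    evidence = {}
    for var, tool, arg in steps:
        for k, v in evidence.items():
            arg = arg.replace(k, v)
        evidence[var] = tools[tool](arg)
    return evidence

# Stub tool standing in for a real retriever.
tools = {"search": lambda q: {
    "director of Inception": "Christopher Nolan",
    "birth year of Christopher Nolan": "1970",
}.get(q, "unknown")}

ev = execute(plan("When was the director of Inception born?"), tools)
```

In the actual system the plan is one LLM generation and a separate Solver LLM composes the final answer from the collected evidence; both are stubbed out here to highlight the single-pass substitution step.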

What carries the argument

The ReWOO paradigm that generates a full reasoning plan independently before any tool observations arrive and then executes the plan in one pass.

If this is right

  • Reduces token consumption by a factor of five on multi-step reasoning benchmarks such as HotpotQA.
  • Improves accuracy by about four percent on HotpotQA relative to interleaved methods.
  • Preserves performance even when external tools return no response or fail.
  • Enables instruction fine-tuning that transfers reasoning ability from a 175B model to a 7B model.
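A back-of-envelope token count shows where the efficiency headroom comes from. The numbers below are illustrative assumptions, not the paper's measurements: interleaved prompting re-sends the growing history at every step, while the decoupled setup pays for the context roughly twice, once for the Planner and once for the Solver.

```python
# Toy token accounting: interleaved (ReAct-style) vs. decoupled (ReWOO-style).
base, step, n = 500, 100, 5  # context/exemplar tokens, tokens added per step, steps

# Interleaved: step k re-sends the context plus all k prior thought/observation pairs.
interleaved = sum(base + k * step for k in range(1, n + 1))

# Decoupled: Planner sees the context once; Solver sees context + plan + observations once.
rewoo = base + (base + n * step)

ratio = interleaved / rewoo
```

With these toy numbers the ratio is about 2.7; in practice the long few-shot exemplars of interleaved prompts make `base` dominate, which is how the gap can widen toward the reported 5x.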

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The upfront planning structure could support running tool-augmented systems on hardware with tight memory limits by reducing per-step context size.
  • Partial observation feedback could be added later without fully reverting to interleaved prompting, offering a middle ground for adaptive tasks.
  • Specialized fine-tuning on distinct tool sets becomes simpler because the reasoning module no longer needs to be retrained alongside every tool change.

Load-bearing premise

That a complete reasoning plan can be generated without any intermediate tool results and that single-pass execution will not miss information that interleaved observations would have provided.

What would settle it

Compare ReWOO against an interleaved baseline on multi-hop questions where each tool call depends on the exact output of the prior call; if ReWOO accuracy falls sharply on cases the baseline solves correctly, the decoupling claim is challenged.

read the original abstract

Augmented Language Models (ALMs) blend the reasoning capabilities of Large Language Models (LLMs) with tools that allow for knowledge retrieval and action execution. Existing ALM systems trigger LLM thought processes while pulling observations from these tools in an interleaved fashion. Specifically, an LLM reasons to call an external tool, gets halted to fetch the tool's response, and then decides the next action based on all preceding response tokens. Such a paradigm, though straightforward and easy to implement, often leads to huge computation complexity from redundant prompts and repeated execution. This study addresses such challenges for the first time, proposing a modular paradigm ReWOO (Reasoning WithOut Observation) that detaches the reasoning process from external observations, thus significantly reducing token consumption. Comprehensive evaluations across six public NLP benchmarks and a curated dataset reveal consistent performance enhancements with our proposed methodology. Notably, ReWOO achieves 5x token efficiency and 4% accuracy improvement on HotpotQA, a multi-step reasoning benchmark. Furthermore, ReWOO demonstrates robustness under tool-failure scenarios. Beyond prompt efficiency, decoupling parametric modules from non-parametric tool calls enables instruction fine-tuning to offload LLMs into smaller language models, thus substantially reducing model parameters. Our illustrative work offloads reasoning ability from 175B GPT3.5 into 7B LLaMA, demonstrating the significant potential for truly efficient and scalable ALM systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ReWOO, a modular paradigm for augmented language models that decouples reasoning from tool observations by first generating a complete static plan via a Planner and then executing it in a single pass via a Worker module. It reports consistent gains across six NLP benchmarks plus a curated dataset, including 5x token efficiency and 4% accuracy improvement on HotpotQA, robustness under tool failures, and successful offloading of reasoning from 175B GPT-3.5 to 7B LLaMA via instruction fine-tuning.

Significance. If the core claims hold, ReWOO could enable substantially more efficient ALM deployments by eliminating interleaved prompting overhead and supporting model compression, with direct implications for scalable multi-step reasoning systems.

major comments (3)
  1. [§3.2] Planner description: The generation of a non-adaptive plan is presented without addressing how entity references for subsequent hops (e.g., second-hop queries in HotpotQA) are resolved without the first observation; this directly underpins the reported accuracy and efficiency numbers.
  2. [Table 3] HotpotQA row: The 4% accuracy lift and 5x token reduction are reported without an ablation on plan completeness or cases where the first observation deviates from the Planner's implicit template, leaving the central decoupling assumption untested for the benchmark that most requires adaptive branching.
  3. [§5.2] Tool-failure robustness: The robustness test is described at a high level but provides no quantitative breakdown of recovery success when the Worker must interpret a plan that was generated without the actual observations, which is required to substantiate the claim.
minor comments (2)
  1. [Abstract] The abstract states 'six public NLP benchmarks' but does not enumerate them; the full list and per-benchmark breakdowns should appear in §4.
  2. [§3] Notation for Planner output format and Worker execution trace could be formalized with a short pseudocode block or equations in §3 to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on ReWOO. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3.2] Planner description: The generation of a non-adaptive plan is presented without addressing how entity references for subsequent hops (e.g., second-hop queries in HotpotQA) are resolved without the first observation; this directly underpins the reported accuracy and efficiency numbers.

    Authors: The Planner employs symbolic placeholders (e.g., 'result_of_step_1') to reference prior steps in the static plan. The Worker resolves these by substituting actual tool observations at execution time. We will revise §3.2 to include an explicit HotpotQA example demonstrating this reference resolution mechanism. revision: yes

  2. Referee: [Table 3] HotpotQA row: The 4% accuracy lift and 5x token reduction are reported without an ablation on plan completeness or cases where the first observation deviates from the Planner's implicit template, leaving the central decoupling assumption untested for the benchmark that most requires adaptive branching.

    Authors: We agree an ablation on plan completeness and deviation cases is needed to validate the core assumption. We will add this analysis to the revised Table 3 (or a new subsection), reporting accuracy and token metrics when the first observation deviates from the plan template on HotpotQA. revision: yes

  3. Referee: [§5.2] Tool-failure robustness: The robustness test is described at a high level but provides no quantitative breakdown of recovery success when the Worker must interpret a plan that was generated without the actual observations, which is required to substantiate the claim.

    Authors: The §5.2 experiments simulate failures by supplying erroneous or null observations to the Worker. We will expand this section with a quantitative breakdown, including recovery success rates across failure scenarios (e.g., null vs. incorrect results), to substantiate the robustness claim. revision: yes
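The mechanism defended in responses 1 and 3, symbolic placeholder resolution plus graceful degradation on tool failure, can be sketched as follows. The `#E` variable syntax, tool names, and failure markers are hypothetical stand-ins for the Worker behavior the authors describe.

```python
# Hypothetical sketch: the Worker substitutes an explicit failure marker so
# single-pass execution always completes and the Solver can still attempt an
# answer from partial evidence.

def run_step(tool, arg):
    try:
        out = tool(arg)
        return out if out is not None else "[no evidence]"
    except Exception:
        return "[tool error]"

def worker(steps, tools):
    evidence = {}
    for var, tool_name, arg in steps:
        # Resolve symbolic references to earlier steps before calling the tool.
        for k, v in evidence.items():
            arg = arg.replace(k, v)
        # Missing tools degrade to a null observation rather than halting the pass.
        evidence[var] = run_step(tools.get(tool_name, lambda _: None), arg)
    return evidence

steps = [("#E1", "search", "capital of France"),
         ("#E2", "broken_tool", "population of #E1")]
tools = {"search": lambda q: "Paris"}
ev = worker(steps, tools)
```

The degraded trace still reaches the Solver with every `#E` slot filled, which is the property the robustness claim in §5.2 depends on.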

Circularity Check

0 steps flagged

No circularity in ReWOO's empirical paradigm

full rationale

The paper introduces ReWOO as a structural decoupling of reasoning-plan generation from tool observations. All performance claims (5x token efficiency, the 4% HotpotQA accuracy lift, offloading from 175B to 7B) rest on direct empirical measurements across benchmarks rather than on any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the core method; the efficiency advantage follows immediately from avoiding interleaved prompting, which is a design choice evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes standard LLM next-token prediction and tool APIs function as black boxes; no new mathematical axioms or invented physical entities are introduced.

axioms (1)
  • domain assumption LLMs can generate coherent multi-step plans without intermediate observations
    Invoked when the model is asked to produce the full reasoning chain before any tool results are supplied.

pith-pipeline@v0.9.0 · 5562 in / 1251 out tokens · 26486 ms · 2026-05-15T18:10:30.004133+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems

    cs.SE 2026-05 unverdicted novelty 7.0

    SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero librar...

  2. TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across...

  3. Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

    cs.AI 2026-05 unverdicted novelty 7.0

    DORA is the first end-to-end agentic benchmark for LLM-based disaster response, covering perception, spatial analysis, evacuation planning, temporal reasoning, and report generation over heterogeneous geospatial data,...

  4. Evaluating Plan Compliance in Autonomous Programming Agents

    cs.SE 2026-04 unverdicted novelty 7.0

    Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...

  5. KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

    cs.RO 2026-04 unverdicted novelty 7.0

    KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.

  6. Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.

  7. LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.

  8. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  9. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 6.0

    Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...

  10. QRAFTI: An Agentic Framework for Empirical Research in Quantitative Finance

    cs.MA 2026-04 unverdicted novelty 6.0

    QRAFTI is a multi-agent framework using tool-calling and reflection-based planning to emulate quant research tasks like factor replication and signal testing on financial data.

  11. Complete Cyclic Subtask Graphs for Tool-Using LLM Agents: Flexibility, Cost, and Bottlenecks in Multi-Agent Workflows

    cs.MA 2026-04 unverdicted novelty 6.0

    Complete cyclic subtask graphs offer a lens to measure when multi-agent revisitation aids recovery and exploration versus when it increases costs or is dominated by other bottlenecks in LLM agent workflows.

  12. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  13. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  14. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  15. SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks

    cs.AI 2026-05 conditional novelty 5.0

    SPIN enforces DAG-valid plans and prefix-based stopping for LLM agents, cutting executed tasks from 1061 to 623 and tool calls from 11.81 to 6.82 per run on AssetOpsBench while raising success from 0.638 to 0.706.

  16. Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling

    cs.CL 2026-05 unverdicted novelty 5.0

    Full-horizon planning with on-demand replanning achieves accuracy parity with single-step planning in tool-calling agents for knowledge base and multi-hop question answering while consuming 2-3 times fewer tokens.

  17. RealRoute: Dynamic Query Routing System via Retrieve-then-Verify Paradigm

    cs.IR 2026-03 unverdicted novelty 5.0

    RealRoute uses parallel source-agnostic retrieval followed by dynamic verification to improve accuracy over predictive LLM routers in heterogeneous multi-hop RAG tasks.

  18. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 4.0

    Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...

  19. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 18 Pith papers · 18 internal anchors

  1. [1]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

  2. [2]

    Augmented language models: a survey

    Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023

  3. [3]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023

  4. [4]

    Taskmatrix

    Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. Taskmatrix. ai: Completing tasks by connecting foundation models with millions of apis. arXiv preprint arXiv:2303.16434, 2023

  5. [5]

    Tool learning with foundation models

    Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023

  6. [6]

    Language models can solve computer tasks

    Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023

  7. [7]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

  8. [8]

    ChemCrow: Augmenting large-language models with chemistry tools

    Andres M Bran, Sam Cox, Andrew D White, and Philippe Schwaller. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376, 2023

  9. [9]

    Do as i can, not as i say: Grounding language in robotic affordances

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on Robot Learning, pages 287–318. PMLR, 2023

  10. [10]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  11. [11]

    Auto-GPT: An autonomous GPT-4 experiment

    Auto-GPT: An autonomous GPT-4 experiment. https://github.com/Significant-Gravitas/Auto-GPT, 2023. [Online; accessed 13-May-2023]

  12. [12]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  13. [13]

    Alpaca: A strong, replicable instruction-following model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023

  14. [14]

    Specializing smaller language models towards multi-step reasoning

    Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726, 2023

  15. [15]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

  16. [16]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

  17. [17]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  18. [18]

    STaR: Bootstrapping Reasoning with Reasoning

    Eric Zelikman, Jesse Mu, Noah D Goodman, and Yuhuai Tony Wu. STaR: Bootstrapping reasoning with reasoning. 2022

  19. [19]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017

  20. [20]

    Sports understanding

    Ethan Kim. Sports understanding. https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/sports_understanding, 2022. [Online; accessed 13-May-2023]

  21. [21]

    Bigbench: Towards an industry standard benchmark for big data analytics

    Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, and Hans-Arno Jacobsen. Bigbench: Towards an industry standard benchmark for big data analytics. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 1197–1208, 2013

  22. [22]

    Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies

    Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021

  23. [23]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  24. [24]

    Designing effective questions for classroom response system teaching

    Ian D Beatty, William J Gerace, William J Leonard, and Robert J Dufresne. Designing effective questions for classroom response system teaching. American Journal of Physics, 74(1):31–39, 2006

  25. [25]

    PAL: Program-aided Language Models

    Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022

  26. [26]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  27. [27]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020

  28. [28]

    https://github.com/hwchase17/langchain, 2023

    Langchain. https://github.com/hwchase17/langchain, 2023. [Online; accessed 13-May-2023]

  29. [29]

    Internet-augmented language models through few-shot prompting for open-domain question answering

    Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115, 2022

  30. [30]

    Code as Policies: Language Model Programs for Embodied Control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753, 2022

  31. [31]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

  32. [32]

    Chatgpt for robotics: Design principles and model abilities

    Sai Vemprala, Rogerio Bonatti, Arthur Bucker, and Ashish Kapoor. Chatgpt for robotics: Design principles and model abilities. 2023

  33. [33]

    Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

    Albert Q Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothée Lacroix, Yuhuai Wu, and Guillaume Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283, 2022

  34. [34]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023

  35. [35]

    Art: Automatic multi-step reasoning and tool-use for large language models

    Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023

  36. [36]

    Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022

  37. [37]

    Instruction Tuning with GPT-4

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023

  38. [38]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021

  39. [39]

    Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models

    David Wingate, Mohammad Shoeybi, and Taylor Sorensen. Prompt compression and contrastive conditioning for controllability and toxicity reduction in language models. arXiv preprint arXiv:2210.03162, 2022

  40. [40]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

  41. [41]

    Towards a unified view of parameter-efficient transfer learning

    Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366, 2021

  42. [42]

    Batch Prompting: Efficient Inference with Large Language Model APIs

    Zhoujun Cheng, Jungo Kasai, and Tao Yu. Batch prompting: Efficient inference with large language model apis. arXiv preprint arXiv:2301.08721, 2023