arxiv: 2303.17491 · v3 · pith:RIN4LFODnew · submitted 2023-03-30 · 💻 cs.CL · cs.AI · cs.HC · cs.LG

Language Models can Solve Computer Tasks

Geunwoo Kim, Pierre Baldi, Stephen McAleer

Pith reviewed 2026-05-17 12:11 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.HC · cs.LG

keywords language models · computer agents · prompting methods · MiniWoB++ · task automation · recursive self-critique · few-shot learning · reasoning enhancement

The pith

Pre-trained language models solve novel computer tasks by recursively criticizing and improving their own outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a pre-trained large language model can carry out general computer tasks described in natural language when guided by a simple prompting scheme in which the model repeatedly criticizes and refines its own proposed actions. This approach requires only a handful of demonstrations per task and no task-specific reward function, in contrast to earlier methods that depend on tens of thousands of expert examples or custom reinforcement signals. On the MiniWoB++ benchmark, the method reaches state-of-the-art performance when run with the InstructGPT-3+RLHF model, and it also improves results on a range of natural-language reasoning tasks when used alone or together with chain-of-thought prompting.

Core claim

A pre-trained LLM agent can execute computer tasks guided by natural language using a simple prompting scheme where the agent Recursively Criticizes and Improves its output (RCI). The RCI approach significantly outperforms existing LLM methods for automating computer tasks and surpasses supervised learning and reinforcement learning approaches on the MiniWoB++ benchmark, using only a handful of demonstrations per task rather than tens of thousands and without a task-specific reward function.

What carries the argument

The RCI prompting scheme, in which the model is instructed to critique its own previous output and then produce an improved version, applied recursively until a satisfactory action sequence is reached.
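
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `llm` is a hypothetical callable that maps a prompt string to a completion, the two prompt strings are illustrative wording (this page does not quote the paper's templates), and the `is_satisfactory` check and fixed `max_rounds` budget are assumed stopping rules.

```python
from typing import Callable

# Illustrative prompt wording (an assumption, not the paper's quoted template).
CRITIQUE_PROMPT = "Review your previous answer and find problems with it."
IMPROVE_PROMPT = "Based on the problems you found, improve your answer."


def rci(
    llm: Callable[[str], str],
    task: str,
    max_rounds: int = 2,
    is_satisfactory: Callable[[str], bool] = lambda _answer: False,
) -> str:
    """Generate an answer, then repeatedly critique and improve it."""
    answer = llm(task)
    for _ in range(max_rounds):
        if is_satisfactory(answer):
            break
        # Step 1: ask the model to criticize its own previous output.
        critique = llm(f"{task}\nAnswer: {answer}\n{CRITIQUE_PROMPT}")
        # Step 2: ask the model to produce an improved output from the critique.
        answer = llm(
            f"{task}\nAnswer: {answer}\nCritique: {critique}\n{IMPROVE_PROMPT}"
        )
    return answer


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs without network access."""
    if prompt.endswith(IMPROVE_PROMPT):
        return "click the Submit button"
    if prompt.endswith(CRITIQUE_PROMPT):
        return "The answer clicks Cancel instead of Submit."
    return "click the Cancel button"


print(rci(toy_llm, "Task: submit the form.", max_rounds=1))
```

With a real model, `toy_llm` would be replaced by a call to whichever completion API is in use; the loop itself does not depend on the provider.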

If this is right

New computer tasks can be automated without collecting large expert demonstration sets or designing per-task reward functions.
The same RCI procedure also raises accuracy on pure natural-language reasoning benchmarks when used by itself or combined with chain-of-thought prompting.
Combining RCI with chain-of-thought prompting performs better than using either technique alone.
A single pre-trained model can be reused across many distinct web-based tasks after seeing only a few examples of each.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same self-critique loop could be extended to longer-horizon desktop tasks that involve multiple applications rather than single web pages.
If the underlying model is updated with more recent training data, the number of demonstrations needed per task might drop even further.
RCI could be combined with external verification tools, such as executing proposed actions in a sandbox, to catch errors the model itself does not notice.
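
The last extension in this list can be sketched as follows. This is a hypothetical variant, not something the paper describes: `verify` stands in for a sandbox that runs a proposed plan and returns an error message, or `None` on success, and `llm` is again an assumed prompt-to-completion callable. The prompt wording is illustrative.

```python
from typing import Callable, Optional


def rci_with_verifier(
    llm: Callable[[str], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> str:
    """Revise a plan until the external verifier reports no error."""
    plan = llm(task)
    for _ in range(max_rounds):
        error = verify(plan)  # None means the sandbox run succeeded
        if error is None:
            break
        # Feed the external error back in place of a self-generated critique.
        plan = llm(
            f"{task}\nPrevious plan: {plan}\n"
            f"Sandbox error: {error}\nRevise the plan to fix this error."
        )
    return plan


def toy_llm(prompt: str) -> str:
    """Stand-in model: adds the missing step once it sees an error report."""
    if "Sandbox error" in prompt:
        return "type 'hello' then press Enter"
    return "type 'hello'"


def toy_verify(plan: str) -> Optional[str]:
    """Stand-in sandbox: the task counts as done only if Enter is pressed."""
    return None if "press Enter" in plan else "form was never submitted"


print(rci_with_verifier(toy_llm, toy_verify, "Task: send the message 'hello'."))
```

The design choice here is that the verifier's message replaces the model's own critique, so the loop can stop on an external success signal rather than on the model's judgment.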

Load-bearing premise

The pre-trained language model already possesses enough built-in knowledge about computer interfaces and the ability to critique its own reasoning so that a few demonstrations plus the RCI template suffice for it to generate correct actions on new tasks.

What would settle it

Finding that RCI prompting produces no measurable gain over ordinary few-shot prompting when the same InstructGPT-3 model is tested on a fresh set of MiniWoB++ tasks whose required mouse and keyboard sequences are absent from its training distribution.
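
One way to make "no measurable gain" concrete is to compare the two success rates with a pooled two-proportion z-test. The sketch below is an assumption about how such a check could be run; neither the paper nor this page specifies a statistical test, and the episode counts are placeholders, not reported results.

```python
import math


def two_proportion_z(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Pooled z statistic for the difference between two success rates."""
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0.0:
        # All episodes succeeded or all failed, so the two rates are identical.
        return 0.0
    return (succ_a / n_a - succ_b / n_b) / se


def two_sided_p(z: float) -> float:
    """Two-sided p-value under a standard normal null distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Placeholder counts for illustration only, not results from the paper:
# condition A = RCI prompting, condition B = ordinary few-shot prompting.
z = two_proportion_z(succ_a=60, n_a=100, succ_b=40, n_b=100)
print(f"z = {z:.3f}, p = {two_sided_p(z):.4f}")
```

A large p-value on a fresh task set would count as evidence for the null result described above; a per-task breakdown would still be needed, since pooling episodes across tasks can hide task-level differences.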

Original abstract

Agents capable of carrying out general tasks on a computer can improve efficiency and productivity by automating repetitive tasks and assisting in complex problem-solving. Ideally, such agents should be able to solve new computer tasks presented to them through natural language commands. However, previous approaches to this problem require large amounts of expert demonstrations and task-specific reward functions, both of which are impractical for new tasks. In this work, we show that a pre-trained large language model (LLM) agent can execute computer tasks guided by natural language using a simple prompting scheme where the agent Recursively Criticizes and Improves its output (RCI). The RCI approach significantly outperforms existing LLM methods for automating computer tasks and surpasses supervised learning (SL) and reinforcement learning (RL) approaches on the MiniWoB++ benchmark. We compare multiple LLMs and find that RCI with the InstructGPT-3+RLHF LLM is state-of-the-art on MiniWoB++, using only a handful of demonstrations per task rather than tens of thousands, and without a task-specific reward function. Furthermore, we demonstrate RCI prompting's effectiveness in enhancing LLMs' reasoning abilities on a suite of natural language reasoning tasks, outperforming chain of thought (CoT) prompting with external feedback. We find that RCI combined with CoT performs better than either separately. Our code can be found here: https://github.com/posgnu/rci-agent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RCI prompting reaches SOTA on MiniWoB++ with InstructGPT using few demos and no task-specific rewards. That's the core result worth knowing here.

The paper applies a recursive self-critique loop to LLM agents for computer tasks: the model generates an action, critiques it, and refines it until the result is satisfactory. This is a straightforward extension of chain-of-thought ideas, tuned for action sequences in a GUI environment. It shows consistent gains over prior LLM prompting, supervised learning, and reinforcement learning baselines across multiple models. The comparisons are explicit, the benchmark is standard, and the code release lets others check the implementation directly. The authors also run the same prompting on reasoning tasks and note that combining RCI with CoT works better than either alone, which adds a bit of generality without overreaching.

The soft spots are limited but real. Success still rests on the base model already encoding enough world knowledge and critique ability; the recursion helps but does not create new capabilities from scratch. MiniWoB++ tasks are constrained and scripted, so the results do not yet address messier, open-ended computer use. More runs or variance numbers would strengthen the claims, though the pattern across LLMs is reassuring.

This work is for people building practical LLM agents or testing prompting variants. A reader focused on digital automation or efficient few-shot methods will find the benchmark numbers and setup useful. It deserves a serious referee because the experiments are transparent, the code is available, and the central performance claim is directly testable.

Referee Report

0 major / 3 minor

Summary. The paper introduces Recursive Criticism and Improvement (RCI) prompting for pre-trained LLMs to solve computer tasks. It claims that RCI with InstructGPT-3+RLHF reaches state-of-the-art on the MiniWoB++ benchmark using only a handful of demonstrations per task and no task-specific reward function, outperforming prior LLM methods as well as supervised learning and reinforcement learning baselines. The work also reports that RCI improves LLM reasoning on natural language tasks and that combining RCI with chain-of-thought prompting yields further gains.

Significance. If the reported results hold, this is a significant empirical demonstration that general-purpose LLMs can automate a range of computer tasks with minimal task-specific data or engineering. The explicit multi-LLM comparisons, direct contrasts against SL/RL baselines, and public release of code at https://github.com/posgnu/rci-agent are clear strengths that support reproducibility and allow the community to verify and build on the findings.

minor comments (3)

The abstract refers to 'a handful of demonstrations per task' without a precise count; the main experimental section should state the exact number of demonstrations used for each MiniWoB++ task.
The prompting templates for RCI and the RCI+CoT variant are described at a high level; including the full template text or pseudocode in an appendix would improve reproducibility.
Table captions and axis labels in the result figures would benefit from explicit mention of the evaluation metric (success rate) and the number of evaluation episodes per task.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review, the recognition of our work's significance, and the recommendation to accept. We appreciate the note on the strengths of our multi-LLM comparisons, contrasts to SL/RL baselines, and public code release.

Circularity Check

0 steps flagged

No significant circularity

The paper presents a purely empirical prompting technique (RCI) evaluated on the MiniWoB++ benchmark against SL/RL baselines and other LLMs. No mathematical derivations, first-principles results, or equations are claimed; the central results consist of direct experimental comparisons using a handful of demonstrations and a fixed prompting template. All load-bearing claims are supported by reported benchmark gains rather than any self-referential definitions, fitted parameters renamed as predictions, or self-citation chains, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that current LLMs possess latent reasoning and self-correction abilities that prompting can reliably surface for sequential decision tasks.

axioms (1)

domain assumption: Large language models possess general reasoning capabilities that can be elicited through prompting.
The RCI method assumes the base model can generate useful self-critiques and improvements without task-specific fine-tuning.

pith-pipeline@v0.9.0 · 5555 in / 1262 out tokens · 36587 ms · 2026-05-17T12:11:24.124658+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Instruction Tuning with GPT-4
cs.CL 2023-04 unverdicted novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
cs.LG 2024-03 unverdicted novelty 7.0

WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Large Language Models as Optimizers
cs.LG 2023-09 unverdicted novelty 7.0

Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-des...
Reflexion: Language Agents with Verbal Reinforcement Learning
cs.AI 2023-03 conditional novelty 7.0

Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
cs.CL 2026-04 conditional novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
Training Language Models to Self-Correct via Reinforcement Learning
cs.LG 2024-09 unverdicted novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
cs.HC 2024-01 unverdicted novelty 6.0

SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
GPT-4V(ision) is a Generalist Web Agent, if Grounded
cs.IR 2024-01 conditional novelty 6.0

GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.
Cognitive Architectures for Language Agents
cs.AI 2023-09 accept novelty 6.0

CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic de...
Gorilla: Large Language Model Connected with Massive APIs
cs.CL 2023-05 conditional novelty 6.0

Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
cs.CL 2023-05 conditional novelty 6.0

ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
Teaching Large Language Models to Self-Debug
cs.CL 2023-04 unverdicted novelty 6.0

Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
cs.CL 2025-03 unverdicted novelty 5.0

Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.
Understanding the planning of LLM agents: A survey
cs.AI 2024-02 accept novelty 4.0

A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
cs.CV 2023-09 conditional novelty 4.0

GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

Reference graph

Works this paper leans on

102 extracted references · 102 canonical work pages · cited by 18 Pith papers · 19 internal anchors

[1]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022. 10

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems , 35:23716–23736, 2022

work page 2022
[3]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Video pretraining (vpt): Learning to act by watching unlabeled online videos

Bowen Baker, Ilge Akkaya, Peter Zhokov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video pretraining (vpt): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems , 35:24639–24654, 2022

work page 2022
[5]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems , 33:1877–1901, 2020

work page 1901
[6]

Sparks of Artificial General Intelligence: Early experiments with GPT-4

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Grounding large language models in interactive environments with online reinforcement learning

Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre- Yves Oudeyer. Grounding large language models in interactive environments with online reinforcement learning. arXiv preprint arXiv:2302.02662, 2023

work page arXiv 2023
[8]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Faithful reasoning using large language models

Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022

work page arXiv 2022
[11]

Selection-inference: Exploiting large language models for interpretable logical reasoning.arXiv preprint arXiv:2205.09712,

Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712, 2022

work page arXiv 2022
[12]

Why can GPT learn in-context? Language models secretly perform gradient descent as meta optimizers

Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? Language models secretly perform gradient descent as meta optimizers. arXiv preprint arXiv:2212.10559, 2022

work page arXiv 2022
[13]

Collaborating with language models for embodied reasoning

Ishita Dasgupta, Christine Kaeser-Chen, Kenneth Marino, Arun Ahuja, Sheila Babayan, Felix Hill, and Rob Fergus. Collaborating with language models for embodied reasoning. In Second Workshop on Language and Reinforcement Learning, 2022

work page 2022
[14]

Language model cascades

David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A Saurous, Jascha Sohl-Dickstein, et al. Language model cascades. arXiv preprint arXiv:2207.10342, 2022

work page arXiv 2022
[15]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations, 2020

work page 2020
[16]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

GLaM: Efficient scaling of language models with mixture-of-experts

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning , pages 5547–5569. PMLR, 2022. 11

work page 2022
[18]

Minedojo: Building open-ended em- bodied agents with internet-scale knowledge

Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended em- bodied agents with internet-scale knowledge. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022

work page 2022
[19]

Instruction-finetuned foundation models for multimodal web navigation

Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, and Izzeddin Gur. Instruction-finetuned foundation models for multimodal web navigation. In Workshop on Reincarnating Reinforcement Learning at ICLR , 2023

work page 2023
[20]

arXiv preprint arXiv:2302.07459 , year=

Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas Liao, Kamil˙e Lukoši¯ut˙e, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, et al. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459, 2023

work page arXiv 2023
[21]

Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics , 9:346–361, 2021

work page 2021
[22]

Improving alignment of dialogue agents via targeted human judgements

Amelia Glaese, Nat McAleese, Maja Tr˛ ebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Environment generation for zero-shot compositional reinforcement learning

Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Manoj Tiwari, Honglak Lee, and Aleksandra Faust. Environment generation for zero-shot compositional reinforcement learning. Advances in Neural Information Processing Systems , 34:4157–4169, 2021

work page 2021
[24]

Understanding html with large language models

Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowd- hery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding HTML with large language models. arXiv preprint arXiv:2210.03945, 2022

work page arXiv 2022
[25]

Learning to navigate the web

Izzeddin Gur, Ulrich Rueckert, Aleksandra Faust, and Dilek Hakkani-Tur. Learning to navigate the web. In International Conference on Learning Representations , 2019

work page 2019
[26]

An empirical analysis of compute-optimal large language model training

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems, 35:30016–30030, 2022

work page 2022
[27]

Learning to solve arithmetic word problems with verb categorization

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In EMNLP, pages 523–533, 2014

work page 2014
[28]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pages 9118–9147. PMLR, 2022

work page 2022
[29]

Inner monologue: Embodied reasoning through planning with language models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In6th Annual Conference on Robot Learning, 2022

work page 2022
[30]

A data-driven approach for learning to control computers

Peter C Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. A data-driven approach for learning to control computers. InInternational Conference on Machine Learning, pages 9466–9482. PMLR, 2022

work page 2022
[31]

Do BERTs learn to use browser user interface? Exploring multi-step tasks with unified vision-and-language berts

Taichi Iki and Akiko Aizawa. Do BERTs learn to use browser user interface? Exploring multi-step tasks with unified vision-and-language berts. arXiv preprint arXiv:2203.07828, 2022

work page arXiv 2022
[32]

DOM-Q-NET: Grounded RL on structured language

Sheng Jia, Jamie Ryan Kiros, and Jimmy Ba. DOM-Q-NET: Grounded RL on structured language. In International Conference on Learning Representations , 2019

work page 2019
[33]

Large language models are zero-shot reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, 2022. 12

work page 2022
[34]

Parsing algebraic word problems into equations

Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Du- mas Ang. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597, 2015

work page 2015
[35]

Program induction by rationale generation: Learning to solve and explain algebraic word problems

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. Proceedings of ACL, 2017

work page 2017
[36]

Reinforcement learning on web interfaces using workflow-guided exploration

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations, 2018

work page 2018
[37]

Mind’s eye: Grounded language model reasoning through simulation

Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, Soroush V osoughi, Claire Cui, Denny Zhou, and Andrew M Dai. Mind’s eye: Grounded language model reasoning through simulation. In International Conference on Learning Representations , 2023

work page 2023
[38]

Self-Refine: Iterative Refinement with Self-Feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Text and patterns: For effective chain of thought, it takes two to tango

Aman Madaan and Amir Yazdanbakhsh. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686, 2022

work page arXiv 2022
[40]

Teaching language models to support answers with verified quotes

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chad- wick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[41]

Augmented Language Models: a Survey

Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[43]

End-to-end goal-driven web navigation

Rodrigo Nogueira and Kyunghyun Cho. End-to-end goal-driven web navigation. Advances in Neural Information Processing Systems, 29, 2016

work page 2016
[44]

Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling

Kolby Nottingham, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hannaneh Hajishirzi, Sameer Singh, and Roy Fox. Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling. arXiv preprint arXiv:2301.12050, 2023

work page arXiv 2023
[45]

Show your work: Scratchpads for intermediate computation with language models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop at ICLR, 2022

work page 2022
[46]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023
[47]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022

work page 2022
[48]

ART: Automatic multi-step reasoning and tool-use for large language models

Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023

[49]

Mapping natural language commands to web elements

Panupong Pasupat, Tian-Shun Jiang, Evan Liu, Kelvin Guu, and Percy Liang. Mapping natural language commands to web elements. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4970–4976, 2018

[50]

Zero-shot entity extraction from web pages

Panupong Pasupat and Percy Liang. Zero-shot entity extraction from web pages. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 391–401, 2014

[51]

Are NLP models really able to solve simple math word problems?

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? Proceedings of NAACL, 2021

[53]

Measuring and Narrowing the Compositionality Gap in Language Models

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022

[54]

Planning with large language models via corrective re-prompting

Shreyas Sundara Raman, Vanya Cohen, Eric Rosen, Ifrah Idrees, David Paulius, and Stefanie Tellex. Planning with large language models via corrective re-prompting. Foundation Models for Decision Making workshop at NeurIPS, 2022

[55]

A generalist agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. Transactions on Machine Learning Research, 2022

[56]

Solving general arithmetic word problems

Subhro Roy and Dan Roth. Solving general arithmetic word problems. EMNLP, 2016

[57]

Multitask prompted training enables zero-shot task generalization

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022

[58]

Self-critiquing models for assisting human evaluators

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022

[59]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023

[60]

Memory augmented large language models are computationally universal

Dale Schuurmans. Memory augmented large language models are computationally universal. arXiv preprint arXiv:2301.04589, 2023

[61]

World of bits: An open-domain platform for web-based agents

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017

[62]

Reflexion: Language Agents with Verbal Reinforcement Learning

Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint arXiv:2303.11366, 2023

[63]

CLIPort: What and where pathways for robotic manipulation

Mohit Shridhar, Lucas Manuelli, and Dieter Fox. CLIPort: What and where pathways for robotic manipulation. In Conference on Robot Learning, pages 894–906. PMLR, 2022

[64]

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. Using deepspeed and megatron to train megatron-turing NLG 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022

[65]

Learning web-based procedures by reasoning over explanations and demonstrations in context

Shashank Srivastava, Oleksandr Polozov, Nebojsa Jojic, and Christopher Meek. Learning web-based procedures by reasoning over explanations and demonstrations in context. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7652–7662, 2020

[66]

Recitation-augmented language models

Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. Recitation-augmented language models. In International Conference on Learning Representations, 2023

[67]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. Proceedings of NAACL-HLT, 2019

[68]

LaMDA: Language Models for Dialog Applications

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022

[69]

Transformers learn in-context by gradient descent

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677, 2022

[70]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023

[71]

Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

Zihao Wang, Shaofei Cai, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560, 2023

[72]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022

[73]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022

[74]

Generating sequences by learning to self-correct

Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, and Yejin Choi. Generating sequences by learning to self-correct. arXiv preprint arXiv:2211.00053, 2022

[75]

Chain of thought imitation with procedure cloning

Mengjiao Sherry Yang, Dale Schuurmans, Pieter Abbeel, and Ofir Nachum. Chain of thought imitation with procedure cloning. Advances in Neural Information Processing Systems, 35:36366–36381, 2022

[76]

Foundation models for decision making: Problems, methods, and opportunities

Sherry Yang, Ofir Nachum, Yilun Du, Jason Wei, Pieter Abbeel, and Dale Schuurmans. Foundation models for decision making: Problems, methods, and opportunities. arXiv preprint arXiv:2303.04129, 2023

[77]

Webshop: Towards scalable real-world web interaction with grounded language agents

Shunyu Yao, Howard Chen, John Yang, and Karthik R Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems, 2022

[78]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

[79]

STaR: Bootstrapping reasoning with reasoning

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

[80]

Socratic models: Composing zero-shot multimodal reasoning with language

Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. In International Conference on Learning Representations, 2023

[81]

TEMPERA: Test-time prompt editing via reinforcement learning

Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E. Gonzalez. TEMPERA: Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023

