arxiv: 2408.07199 · v1 · pith:2L4FAKGG · submitted 2024-08-13 · 💻 cs.AI · cs.LG

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Pith reviewed 2026-05-20 09:36 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords LLM agents · Monte Carlo Tree Search · Direct Preference Optimization · self-critique · web navigation · interactive environments · autonomous agents

The pith

Guided Monte Carlo Tree Search plus self-critique and off-policy preference updates let language-model agents learn complex web tasks from their own interaction data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models can be turned into effective autonomous agents for multi-step interactive tasks by combining guided tree search, automatic critique of past attempts, and repeated preference-based fine-tuning on the resulting trajectories. This approach enables the model to improve from both successful and unsuccessful runs without needing large sets of expert demonstrations. In practice it raises a Llama-3 70B model from 18.6 percent to 81.7 percent success on real booking scenarios after one day of data collection and to 95.4 percent when online search is added. The same method outperforms standard behavior-cloning and reinforcement baselines inside a simulated shopping environment and exceeds average human performance once search is allowed. The core promise is that agents can acquire reliable decision-making policies through self-generated experience rather than static supervised data.

Core claim

By interleaving guided Monte Carlo Tree Search for exploration, a self-critique step that labels trajectories as preferred or dispreferred, and iterative off-policy Direct Preference Optimization on the collected agent interactions, an LLM policy can be refined to solve long-horizon web navigation and booking tasks at high success rates without external reward models or large expert datasets.
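The exploration step can be illustrated with the textbook UCB1 selection rule used in UCT-style Monte Carlo Tree Search. This is a minimal sketch, not the paper's guided variant: the exact guidance function is not given in the material reviewed here, and the names `ucb1_score`, `select_action`, and the action strings are hypothetical.

```python
import math


def ucb1_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCB1 score: mean value plus an exploration bonus.

    Unvisited children get an infinite score so they are tried first.
    """
    if visits == 0:
        return math.inf
    mean_value = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_value + exploration


def select_action(children, c=1.4):
    """Pick the child action with the highest UCB1 score.

    `children` maps an action name to a (total_value, visits) pair.
    """
    parent_visits = max(1, sum(visits for _, visits in children.values()))
    return max(
        children,
        key=lambda a: ucb1_score(children[a][0], children[a][1], parent_visits, c),
    )


# Hypothetical web actions with accumulated value and visit counts.
children = {"click_search": (9.0, 10), "open_filters": (0.5, 1)}
chosen = select_action(children)
```

With these toy statistics the rarely visited action wins because its exploration bonus outweighs the higher mean value of the frequently visited one; lowering `c` shifts the balance toward exploitation.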

What carries the argument

Guided Monte Carlo Tree Search paired with self-critique that generates preference pairs for off-policy DPO updates on the agent's own interaction history.
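For orientation, the per-pair objective of standard Direct Preference Optimization can be written in a few lines. This is a sketch of the usual DPO loss under the assumption that sequence log-probabilities for the preferred and dispreferred trajectory are already available; the paper's off-policy variant may differ in detail, and the function name and the default `beta` are illustrative choices.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * ((logp_chosen - ref_logp_chosen)
                                - (logp_rejected - ref_logp_rejected))))
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the preferred trajectory's log-probability lowers the loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss depends only on how far the policy has moved relative to the reference on each side of the pair, which is why no separate reward model appears in the update.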

If this is right

  • The agent policy improves on both successful and failed attempts collected during its own rollouts.
  • In WebShop, performance exceeds the average human level once online search is added to the same trained model.
  • The method works in both simulated shopping environments and real booking interfaces after modest interaction data.
  • Iterative off-policy updates remain stable enough to support continued improvement across multiple rounds of collection and fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loop could be applied to other interactive domains such as code execution or robotic manipulation where self-generated trajectories are cheap to obtain.
  • Because the method relies on the base model's ability to critique its own outputs, gains may shrink for weaker base models that cannot reliably label their failures.
  • Combining the learned policy with external search or tool use appears to compound the benefit, suggesting a hybrid architecture for future agent systems.

Load-bearing premise

The model's own self-critique can accurately separate successful from unsuccessful trajectories without any external ground-truth reward signal.
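One way to probe this premise is to compare self-critique labels against an independent reference (for example, human judgments) on a sample of trajectories. The sketch below assumes binary success labels; the function name and the toy label lists are hypothetical and are not data from the paper.

```python
def audit_labels(critic_labels, reference_labels):
    """Compare binary self-critique labels (1 = success) with reference labels."""
    if len(critic_labels) != len(reference_labels) or not critic_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    tp = fp = tn = fn = 0
    for critic, reference in zip(critic_labels, reference_labels):
        if critic and reference:
            tp += 1
        elif critic and not reference:
            fp += 1  # critic called a failed trajectory a success
        elif not critic and reference:
            fn += 1  # critic called a successful trajectory a failure
        else:
            tn += 1
    total = tp + fp + tn + fn
    return {
        "agreement": (tp + tn) / total,
        "false_success_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_failure_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Toy labels for illustration only.
report = audit_labels([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```

The false-success rate is the figure most relevant to preference learning here, since a failed trajectory labeled as preferred pushes the policy toward the wrong behavior.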

What would settle it

Run the full pipeline for one day of data collection on a fresh set of booking or shopping tasks and measure whether success rate stays below 30 percent or fails to exceed behavior-cloning baselines.

Original abstract

Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Previous attempts to bridge this gap through supervised fine-tuning on curated expert demonstrations often suffer from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. Our method allows LLM agents to learn effectively from both successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning tasks. We validate our approach in the WebShop environment, a simulated e-commerce platform, where it consistently outperforms behavior cloning and reinforced fine-tuning baseline, and beats average human performance when equipped with the capability to do online search. In real-world booking scenarios, our methodology boosts Llama-3 70B model's zero-shot performance from 18.6% to 81.7% success rate (a 340% relative increase) after a single day of data collection and further to 95.4% with online search. We believe this represents a substantial leap forward in the capabilities of autonomous agents, paving the way for more sophisticated and reliable decision-making in real-world settings.
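The relative-increase figure quoted in the abstract can be checked with simple arithmetic. The helper name below is illustrative; the two success rates are the ones stated in the abstract.

```python
def relative_increase(before, after):
    """Relative increase of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Success rates from the abstract, in percent.
zero_shot = 18.6
after_training = 81.7

absolute_gain = after_training - zero_shot  # percentage points
relative_gain = relative_increase(zero_shot, after_training)  # percent
```

The absolute gain is 63.1 percentage points and the relative gain works out to roughly 339 percent, which matches the "340% relative increase" in the abstract after rounding.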

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Agent Q, a framework combining guided Monte Carlo Tree Search (MCTS), a self-critique mechanism, and iterative off-policy Direct Preference Optimization (DPO) to improve LLM agents on multi-step reasoning tasks in interactive environments. It reports large empirical gains, including raising Llama-3 70B zero-shot success from 18.6% to 81.7% (340% relative) in real booking scenarios after one day of data collection and to 95.4% with online search, while outperforming behavior-cloning and reinforced fine-tuning baselines and average human performance in WebShop when search is enabled.

Significance. If the results hold, the work demonstrates a practical route to agent improvement that learns from both successful and unsuccessful trajectories without heavy reliance on expert demonstrations, addressing compounding errors in dynamic settings. Credit is due for the concrete cross-domain validation (simulated WebShop plus real booking) and for the explicit performance deltas reported.

major comments (2)
  1. [§4 and §3.2] §4 (Experiments) and §3.2 (Self-Critique): The headline gains (18.6% → 81.7% zero-shot, then 95.4% with search) rest on the unvalidated assumption that LLM self-critique can accurately label trajectories into preferred/dispreferred pairs for off-policy DPO. No external ground-truth reward, human audit of labeling accuracy, or error-rate analysis is provided; in path-dependent tasks such as booking, mislabeling of partial failures (e.g., correct item but wrong shipping) could inject noise that directly destabilizes the claimed 340% relative improvement after a single day of collection.
  2. [§4.1] §4.1 (WebShop results) and booking evaluation: The manuscript states consistent outperformance over behavior cloning and reinforced fine-tuning yet supplies neither error bars, number of trials, nor statistical significance tests for the reported success rates. This omission makes it impossible to assess whether the central performance claim is robust or could be explained by variance in the self-generated data.
minor comments (2)
  1. The abstract and experimental sections omit implementation specifics (exact MCTS guidance function, DPO hyperparameters, data-collection protocol) that would be required for reproducibility.
  2. [§3] Notation for the off-policy DPO variant and the precise form of the self-critique prompt should be formalized in a dedicated subsection or appendix to clarify how the preference pairs are constructed.
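To make the request for error bars concrete, a success rate can be reported with a Wilson score interval. This is a generic sketch: the number of evaluation trials is not stated in the material reviewed here, so the counts in the example are placeholders rather than the paper's figures.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


# Placeholder counts: 50 successes out of 100 trials.
low, high = wilson_interval(50, 100)
```

With small evaluation sets the interval is wide, which is the practical reason the trial count needs to be reported alongside any headline success rate.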

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validation and statistical rigor that we address below. We have revised the manuscript to incorporate additional analysis and reporting as described.

Point-by-point responses
  1. Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Self-Critique): The headline gains (18.6% → 81.7% zero-shot, then 95.4% with search) rest on the unvalidated assumption that LLM self-critique can accurately label trajectories into preferred/dispreferred pairs for off-policy DPO. No external ground-truth reward, human audit of labeling accuracy, or error-rate analysis is provided; in path-dependent tasks such as booking, mislabeling of partial failures (e.g., correct item but wrong shipping) could inject noise that directly destabilizes the claimed 340% relative improvement after a single day of collection.

    Authors: We agree that explicit validation of the self-critique labeling would strengthen the paper. The large gains provide indirect support, but to directly address potential noise from partial failures in path-dependent tasks, we will add a new analysis subsection. This includes a human audit of labeling accuracy on a random sample of 150 booking trajectories (reporting agreement rate and error breakdown by failure type) plus discussion of how iterative off-policy DPO and MCTS guidance help filter noise across rounds. These additions will be in the revised §3.2 and §4. revision: yes

  2. Referee: [§4.1] §4.1 (WebShop results) and booking evaluation: The manuscript states consistent outperformance over behavior cloning and reinforced fine-tuning yet supplies neither error bars, number of trials, nor statistical significance tests for the reported success rates. This omission makes it impossible to assess whether the central performance claim is robust or could be explained by variance in the self-generated data.

    Authors: We acknowledge the omission of these details in the submitted version. The WebShop results were averaged across 5 independent random seeds with 200 episodes each, and the booking results across 3 separate data-collection days with 50 held-out test cases per day. In the revision we will report standard errors as error bars, explicitly state the number of trials and seeds, and add paired t-test p-values comparing Agent Q against the behavior-cloning and reinforced fine-tuning baselines in both §4.1 and the booking evaluation. revision: yes
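The paired comparison mentioned in the second response can be sketched as follows. The code computes only the paired t statistic and its degrees of freedom with the standard library; turning that into a p-value needs a t-distribution (for example via `scipy.stats.ttest_rel`). The per-seed scores are placeholders for illustration, not results from the paper.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched runs (e.g. same seeds)."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t statistic undefined")
    standard_error = sd / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error, len(diffs) - 1


# Placeholder per-seed scores for two methods evaluated on the same seeds.
method_scores = [3.0, 5.0, 7.0]
baseline_scores = [1.0, 2.0, 3.0]
t_stat, dof = paired_t_statistic(method_scores, baseline_scores)
```

Pairing by seed removes seed-to-seed variance that both methods share, so it is usually more sensitive than an unpaired test when the same seeds and episodes are used for every method.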

Circularity Check

0 steps flagged

No significant circularity; empirical results are independent of inputs

Full rationale

The paper's central claims consist of measured performance gains (e.g., 18.6% to 81.7% success) obtained from new data collection and evaluation runs in WebShop and booking environments. These outcomes are produced by running the full pipeline on held-out tasks rather than by algebraic reduction of any fitted parameter or self-citation to the reported metric. No equations are presented that define a quantity in terms of itself, and the DPO and MCTS components are standard external methods whose correctness does not presuppose the final success rates. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes a composite method built from existing components (MCTS, self-critique, DPO) with modifications; no explicit free parameters, domain axioms, or new postulated entities are introduced beyond standard machine-learning assumptions about trajectory evaluation and policy updates.

pith-pipeline@v0.9.0 · 5829 in / 1321 out tokens · 58462 ms · 2026-05-20T09:36:10.019645+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.

  2. Don't Click That: Teaching Web Agents to Resist Deceptive Interfaces

    cs.AI 2026-05 unverdicted novelty 7.0

    DUDE framework reduces web agents' susceptibility to deceptive UIs by 53.8% on a new 1,407-scenario benchmark while preserving task performance.

  3. Weak-Link Optimization for Multi-Agent Reasoning and Collaboration

    cs.AI 2026-04 unverdicted novelty 7.0

    WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.

  4. Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

  5. Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

    cs.AI 2025-06 unverdicted novelty 7.0

    Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.

  6. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  7. Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficien...

  8. GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.

  9. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

  10. QuantClaw: Precision Where It Matters for OpenClaw

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

  11. DynaWeb: Model-Based Reinforcement Learning of Web Agents

    cs.CL 2026-01 unverdicted novelty 6.0

    DynaWeb introduces a model-based RL framework that trains web agents via imagined rollouts in a learned web world model interleaved with real expert trajectories, yielding consistent gains on WebArena and WebVoyager b...

  12. Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training

    cs.AI 2025-06 unverdicted novelty 6.0

    Mobile-R1 introduces a hierarchical three-stage curriculum that combines format alignment, verifiable action feedback, and multi-turn environment training to improve exploration and self-correction in VLM-based mobile...

  13. A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback

    cs.SE 2026-05 unverdicted novelty 5.0

    A-ProS uses a hybrid multi-model feedback framework with stateful refinement to improve success rates on competitive programming problems, achieving over 2x gains compared to baseline agent loops.

  14. StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

    cs.CL 2026-05 unverdicted novelty 5.0

    StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

  15. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...

  16. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

  17. GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

    cs.AI 2026-04 unverdicted novelty 5.0

    The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

  18. RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

    cs.RO 2026-04 unverdicted novelty 5.0

    RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.

  19. Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives

    cs.CV 2026-03 unverdicted novelty 5.0

    Empirical study finds background semantics, random pruning, and recency-based allocation improve token efficiency for GUI visual agents.

  20. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

    cs.AI 2025-08 unverdicted novelty 5.0

    A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.

  21. Agentic Reasoning for Large Language Models

    cs.AI 2026-01 unverdicted novelty 4.0

    The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

  22. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

  23. Large Language Model-Brained GUI Agents: A Survey

    cs.AI 2024-11 unverdicted novelty 4.0

    A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

Reference graph

Works this paper leans on

298 extracted references · 298 canonical work pages · cited by 20 Pith papers · 15 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [3]

    Introducing the next generation of claude, 2024

    Anthropic. Introducing the next generation of claude, 2024. URL Introducing the next generation of Claude

  3. [4]

    Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024

    Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024

  4. [5]

    Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

  5. [6]

    Graph of thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38 0 (16): 0 17682–17690, March 202...

  6. [7]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787

  7. [8]

    Superhuman ai for multiplayer poker

    Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 2019

  8. [9]

    Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions, 2023

    Yevgen Chebotar, Quan Vuong, Alex Irpan, Karol Hausman, Fei Xia, Yao Lu, Aviral Kumar, Tianhe Yu, Alexander Herzog, Karl Pertsch, Keerthana Gopalakrishnan, Julian Ibarz, Ofir Nachum, Sumedh Sontakke, Grecia Salazar, Huong T Tran, Jodilyn Peralta, Clayton Tan, Deeksha Manjunath, Jaspiar Singht, Brianna Zitkovich, Tomas Jackson, Kanishka Rao, Chelsea Finn, ...

  9. [10]

    Step-level value preference optimization for mathematical reasoning, 2024

    Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning, 2024. URL https://arxiv.org/abs/2406.10858

  10. [11]

    Octopus v2: On-device language model for super agent, 2024

    Wei Chen and Zhiyuan Li. Octopus v2: On-device language model for super agent, 2024

  11. [12]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In NeurIPS Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=kiYqbO3wqw

  12. [13]

    Hashimoto

    Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback, 2024

  13. [14]

    Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D. Goodman. Stream of search (sos): Learning to search in language, 2024. URL https://arxiv.org/abs/2404.03683

  14. [15]

    Human-level performance in no-press diplomacy via equilibrium search, 2021

    Jonathan Gray, Adam Lerer, Anton Bakhtin, and Noam Brown. Human-level performance in no-press diplomacy via equilibrium search, 2021

  15. [16]

    Reinforced self-training (rest) for language modeling, 2023

    Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (rest) for language modeling, 2023

  16. [17]

    A real-world webagent with planning, long context understanding, and program synthesis

    Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. In ICLR, 2024. URL https://openreview.net/forum?id=9JQtrumvg8

  17. [18]

    Reasoning with language model is planning with world model, 2023

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model, 2023

  18. [19]

    Webvoyager: Building an end-to-end web agent with large multimodal models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. ArXiv, 2024. URL https://api.semanticscholar.org/CorpusID:267211622

  19. [20]

    Bradley Knox, and Dorsa Sadigh

    Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, and Dorsa Sadigh. Contrastive preference learning: Learning from human feedback without reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=iX1RjVQODj

  20. [21]

    L2mac: Large language model automatic computer for extensive code generation, 2024

    Samuel Holt, Max Ruiz Luyten, and Mihaela van der Schaar. L2mac: Large language model automatic computer for extensive code generation, 2024

  21. [22]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. ArXiv, 2023

  22. [23]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022 a

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022 a

  23. [24]

    Inner Monologue: Embodied Reasoning through Planning with Language Models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022 b . URL https://arxiv.org/abs/2207.05608

  24. [25]

    Ortega, Yee Whye Teh, and Nicolas Heess

    Jan Humplik, Alexandre Galashov, Leonard Hasenclever, Pedro A. Ortega, Yee Whye Teh, and Nicolas Heess. Meta reinforcement learning as task inference, 2019. URL https://arxiv.org/abs/1905.06424

  25. [26]

    Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards, 2024

    Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, and Minjoon Seo. Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards, 2024

  26. [27]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

  27. [28]

    Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024

  28. [29]

    Introducing chatgpt, 2022

    Barret Zoph John Schulman, Christina Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Michael Pokorny Luke Metz, Rapha Gontijo Lopes, Shengjia Zhao, Arun Vijayvergiya, Eric Sigler, Adam Perelman, Chelsea Voss, Mike Heaton, Joel Parish, Dave Cummings, Rajeev Nayak, Valerie Balcom, David Schnurr, Tomer Kaftan, Chris Hallacy,...

  29. [30]

    Bandit based monte-carlo planning

    Levente Kocsis and Csaba Szepesv \'a ri. Bandit based monte-carlo planning. In Machine Learning: ECML 2006: 17th European Conference on Machine Learning Berlin, Germany, September 18-22, 2006 Proceedings, pages 282--293. Springer, 2006

  30. [31]

    Tree search for language agent models, 2024

    Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language agent models, 2024. URL https://jykoh.com/search-agents/paper.pdf

  31. [32]

    Large language models are zero-shot reasoners, 2022

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2022

  32. [33]

    Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent, 2024 a

    Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent, 2024 a

  33. [34]

    Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

    Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms, 2024 b . URL https://arxiv.org/abs/2406.18629

  34. [35]

    Beyond a*: Better planning with transformers via search dynamics bootstrapping, 2024

    Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul Mcvay, Michael Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search dynamics bootstrapping, 2024. URL https://arxiv.org/abs/2402.14083

  35. [36]

    Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018

    Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018

  36. [37]

    Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020

  37. [38]

    Let's verify step by step, 2023

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step, 2023

  38. [39]

    Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents, 2023

    Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents, 2023

  39. [40]

    Agentlite: A lightweight library for building and advancing task-oriented llm agent system, 2024

    Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K. Choubey, Tian Lan, Jason Wu, Huan Wang, Shelby Heinecke, Caiming Xiong, and Silvio Savarese. Agentlite: A lightweight library for building and advancing task-oriented llm agent system, 2024

  40. [41]

    Step-controlled dpo: Leveraging stepwise error for enhanced mathematical reasoning, 2024

    Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, and Hongsheng Li. Step-controlled dpo: Leveraging stepwise error for enhanced mathematical reasoning, 2024. URL https://arxiv.org/abs/2407.00782

  41. [42]

    The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey, 2024

    Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey, 2024

  42. [44]

    Webgpt: Browser-assisted question-answering with human feedback, 2022

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022

  43. [45]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  44. [46]

    Autonomous evaluation and refinement of digital agents, 2024

    Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents, 2024

  45. [47]

    Iterative reasoning preference optimization, 2024

    Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization, 2024

  46. [48]

    Reasoning with language model prompting: A survey, 2023

    Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey, 2023

  47. [49]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2305.18290

  48. [50]

    From r to Q^*: Your language model is secretly a Q-function, 2024

    Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to Q^*: Your language model is secretly a Q-function, 2024

  49. [51]

    Proximal policy optimization algorithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  50. [52]

    Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024a

    Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024a

  51. [53]

    Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024b

    Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024b. URL https://arxiv.org/abs/2406.14532

  52. [54]

    Mastering the game of go with deep neural networks and tree search

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 2017a

  53. [55]

    Mastering the game of go without human knowledge

    David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354--359, 2017b

  54. [56]

    Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J

    Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri...

  55. [57]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314

  56. [58]

    Offline rl for natural language generation with implicit language q learning

    Charlie Victor Snell, Ilya Kostrikov, Yi Su, Sherry Yang, and Sergey Levine. Offline rl for natural language generation with implicit language q learning. In The Eleventh International Conference on Learning Representations, 2022

  57. [59]

    Trial and error: Exploration-based trajectory optimization for llm agents, 2024

    Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization for llm agents, 2024

  58. [60]

    Learning to summarize from human feedback, 2022

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022

  59. [61]

    Cognitive architectures for language agents, 2024

    Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. Cognitive architectures for language agents, 2024

  60. [62]

    Preference fine-tuning of llms should leverage suboptimal, on-policy data, 2024

    Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of llms should leverage suboptimal, on-policy data, 2024

  61. [63]

    Toward self-improvement of llms via imagination, searching, and criticizing, 2024

    Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing, 2024

  62. [64]

    Llama 2: Open foundation and fine-tuned chat models, 2023

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

  63. [65]

    Solving math word problems with process- and outcome-based feedback, 2022

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022

  64. [66]

    Mobile-agent: Autonomous multi-modal mobile device agent with visual perception

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv, 2024a

  65. [67]

    Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024b

    Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024b

  66. [68]

    Self-consistency improves chain of thought reasoning in language models, 2023

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023

  67. [69]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Neural Information Processing Systems, 2022

  68. [70]

    Os-copilot: Towards generalist computer agents with self-improvement

    Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement. arXiv, 2024. URL https://api.semanticscholar.org/CorpusID:265149992

  69. [71]

    Agentgym: Evolving large language model-based agents across diverse environments, 2024

    Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang Liao, Xin Guo, Wei He, Songyang Gao, Lu Chen, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. Agentgym: Evolving large language model-based agents across diverse environments, 2024. URL https://arxiv.org...

  70. [72]

    Monte carlo tree search boosts reasoning via iterative preference learning, 2024

    Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning, 2024

  71. [73]

    Swe-agent: Agent-computer interfaces enable automated software engineering, 2024

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024

  72. [74]

    Webshop: Towards scalable real-world web interaction with grounded language agents, 2022

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2022

  73. [75]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023a. URL https://openreview.net/forum?id=5Xc1ecxO1h

  74. [76]

    React: Synergizing reasoning and acting in language models, 2023b

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023b

  75. [77]

    Self-rewarding language models, 2024

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models, 2024

  76. [78]

    Scaling relationship on learning mathematical reasoning with large language models, 2023

    Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023

  77. [79]

    Star: Bootstrapping reasoning with reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476--15488, 2022

  78. [80]

    Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024

    Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024

  79. [81]

    Ufo: A ui-focused agent for windows os interaction

    Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, and Saravan Rajmohan. Ufo: A ui-focused agent for windows os interaction. arXiv, 2024a. URL https://api.semanticscholar.org/CorpusID:267211622

  80. [82]

    Appagent: Multimodal agents as smartphone users, 2023

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users, 2023

Showing first 80 references.