Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Pith reviewed 2026-05-20 09:36 UTC · model grok-4.3
pith:2L4FAKGG Add to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{2L4FAKGG}
Prints a linked pith:2L4FAKGG badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
The pith
Guided Monte Carlo Tree Search plus self-critique and off-policy preference updates let language-model agents learn complex web tasks from their own interaction data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By interleaving guided Monte Carlo Tree Search for exploration, a self-critique step that labels trajectories as preferred or dispreferred, and iterative off-policy Direct Preference Optimization on the collected agent interactions, an LLM policy can be refined to solve long-horizon web navigation and booking tasks at high success rates without external reward models or large expert datasets.
What carries the argument
Guided Monte Carlo Tree Search paired with self-critique that generates preference pairs for off-policy DPO updates on the agent's own interaction history.
If this is right
- The agent policy improves on both successful and failed attempts collected during its own rollouts.
- Performance reaches above average human levels once online search is added to the same trained model.
- The method works in both simulated shopping environments and real booking interfaces after modest interaction data.
- Iterative off-policy updates remain stable enough to support continued improvement across multiple rounds of collection and fine-tuning.
Where Pith is reading between the lines
- The same loop could be applied to other interactive domains such as code execution or robotic manipulation where self-generated trajectories are cheap to obtain.
- Because the method relies on the base model's ability to critique its own outputs, gains may shrink for weaker base models that cannot reliably label their failures.
- Combining the learned policy with external search or tool use appears to compound the benefit, suggesting a hybrid architecture for future agent systems.
Load-bearing premise
The model's own self-critique can accurately separate successful from unsuccessful trajectories without any external ground-truth reward signal.
What would settle it
Run the full pipeline for one day of data collection on a fresh set of booking or shopping tasks and measure whether success rate stays below 30 percent or fails to exceed behavior-cloning baselines.
read the original abstract
Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Previous attempts to bridge this ga-through supervised fine-tuning on curated expert demonstrations-often suffer from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. Our method allows LLM agents to learn effectively from both successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning tasks. We validate our approach in the WebShop environment-a simulated e-commerce platform where it consistently outperforms behavior cloning and reinforced fine-tuning baseline, and beats average human performance when equipped with the capability to do online search. In real-world booking scenarios, our methodology boosts Llama-3 70B model's zero-shot performance from 18.6% to 81.7% success rate (a 340% relative increase) after a single day of data collection and further to 95.4% with online search. We believe this represents a substantial leap forward in the capabilities of autonomous agents, paving the way for more sophisticated and reliable decision-making in real-world settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agent Q, a framework combining guided Monte Carlo Tree Search (MCTS), a self-critique mechanism, and iterative off-policy Direct Preference Optimization (DPO) to improve LLM agents on multi-step reasoning tasks in interactive environments. It reports large empirical gains, including raising Llama-3 70B zero-shot success from 18.6% to 81.7% (340% relative) in real booking scenarios after one day of data collection and to 95.4% with online search, while outperforming behavior-cloning and reinforced fine-tuning baselines and average human performance in WebShop when search is enabled.
Significance. If the results hold, the work demonstrates a practical route to agent improvement that learns from both successful and unsuccessful trajectories without heavy reliance on expert demonstrations, addressing compounding errors in dynamic settings. Credit is due for the concrete cross-domain validation (simulated WebShop plus real booking) and for the explicit performance deltas reported.
major comments (2)
- [§4 and §3.2] §4 (Experiments) and §3.2 (Self-Critique): The headline gains (18.6% → 81.7% zero-shot, then 95.4% with search) rest on the unvalidated assumption that LLM self-critique can accurately label trajectories into preferred/dispreferred pairs for off-policy DPO. No external ground-truth reward, human audit of labeling accuracy, or error-rate analysis is provided; in path-dependent tasks such as booking, mislabeling of partial failures (e.g., correct item but wrong shipping) could inject noise that directly destabilizes the claimed 340% relative improvement after a single day of collection.
- [§4.1] §4.1 (WebShop results) and booking evaluation: The manuscript states consistent outperformance over behavior cloning and reinforced fine-tuning yet supplies neither error bars, number of trials, nor statistical significance tests for the reported success rates. This omission makes it impossible to assess whether the central performance claim is robust or could be explained by variance in the self-generated data.
minor comments (2)
- The abstract and experimental sections omit implementation specifics (exact MCTS guidance function, DPO hyperparameters, data-collection protocol) that would be required for reproducibility.
- [§3] Notation for the off-policy DPO variant and the precise form of the self-critique prompt should be formalized in a dedicated subsection or appendix to clarify how the preference pairs are constructed.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validation and statistical rigor that we address below. We have revised the manuscript to incorporate additional analysis and reporting as described.
read point-by-point responses
-
Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Self-Critique): The headline gains (18.6% → 81.7% zero-shot, then 95.4% with search) rest on the unvalidated assumption that LLM self-critique can accurately label trajectories into preferred/dispreferred pairs for off-policy DPO. No external ground-truth reward, human audit of labeling accuracy, or error-rate analysis is provided; in path-dependent tasks such as booking, mislabeling of partial failures (e.g., correct item but wrong shipping) could inject noise that directly destabilizes the claimed 340% relative improvement after a single day of collection.
Authors: We agree that explicit validation of the self-critique labeling would strengthen the paper. The large gains provide indirect support, but to directly address potential noise from partial failures in path-dependent tasks, we will add a new analysis subsection. This includes a human audit of labeling accuracy on a random sample of 150 booking trajectories (reporting agreement rate and error breakdown by failure type) plus discussion of how iterative off-policy DPO and MCTS guidance help filter noise across rounds. These additions will be in the revised §3.2 and §4. revision: yes
-
Referee: [§4.1] §4.1 (WebShop results) and booking evaluation: The manuscript states consistent outperformance over behavior cloning and reinforced fine-tuning yet supplies neither error bars, number of trials, nor statistical significance tests for the reported success rates. This omission makes it impossible to assess whether the central performance claim is robust or could be explained by variance in the self-generated data.
Authors: We acknowledge the omission of these details in the submitted version. The WebShop results were averaged across 5 independent random seeds with 200 episodes each, and the booking results across 3 separate data-collection days with 50 held-out test cases per day. In the revision we will report standard errors as error bars, explicitly state the number of trials and seeds, and add paired t-test p-values comparing Agent Q against the behavior-cloning and reinforced fine-tuning baselines in both §4.1 and the booking evaluation. revision: yes
Circularity Check
No significant circularity; empirical results are independent of inputs
full rationale
The paper's central claims consist of measured performance gains (e.g., 18.6% to 81.7% success) obtained from new data collection and evaluation runs in WebShop and booking environments. These outcomes are produced by running the full pipeline on held-out tasks rather than by algebraic reduction of any fitted parameter or self-citation to the reported metric. No equations are presented that define a quantity in terms of itself, and the DPO and MCTS components are standard external methods whose correctness does not presuppose the final success rates. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanalpha_pin_under_high_calibration unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We use the base model to produce a feedback score for each action by asking it to rank the generated actions by its perceived utility... Q(h_t, a_i_t) = α Q̃(h_t, a_i_t) + (1 − α) Q̂(h_t, a_i_t)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.
-
Don't Click That: Teaching Web Agents to Resist Deceptive Interfaces
DUDE framework reduces web agents' susceptibility to deceptive UIs by 53.8% on a new 1,407-scenario benchmark while preserving task performance.
-
Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
-
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
-
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
-
Group-in-Group Policy Optimization for LLM Agent Training
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
-
Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents
ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficien...
-
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...
-
QuantClaw: Precision Where It Matters for OpenClaw
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.
-
DynaWeb: Model-Based Reinforcement Learning of Web Agents
DynaWeb introduces a model-based RL framework that trains web agents via imagined rollouts in a learned web world model interleaved with real expert trajectories, yielding consistent gains on WebArena and WebVoyager b...
-
Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training
Mobile-R1 introduces a hierarchical three-stage curriculum that combines format alignment, verifiable action feedback, and multi-turn environment training to improve exploration and self-correction in VLM-based mobile...
-
A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback
A-ProS uses a hybrid multi-model feedback framework with stateful refinement to improve success rates on competitive programming problems, achieving over 2x gains compared to baseline agent loops.
-
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...
-
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...
-
GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...
-
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.
-
Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
Empirical study finds background semantics, random pruning, and recency-based allocation improve token efficiency for GUI visual agents.
-
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
-
Agentic Reasoning for Large Language Models
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
Large Language Model-Brained GUI Agents: A Survey
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Introducing the next generation of claude, 2024
Anthropic. Introducing the next generation of claude, 2024. URL Introducing the next generation of Claude
work page 2024
-
[4]
Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024
Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024
work page 2024
-
[5]
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...
work page 2022
-
[6]
Graph of thoughts: Solving elaborate problems with large language models
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38 0 (16): 0 17682–17690, March 202...
-
[7]
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Superhuman ai for multiplayer poker
Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 2019
work page 2019
-
[9]
Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions, 2023
Yevgen Chebotar, Quan Vuong, Alex Irpan, Karol Hausman, Fei Xia, Yao Lu, Aviral Kumar, Tianhe Yu, Alexander Herzog, Karl Pertsch, Keerthana Gopalakrishnan, Julian Ibarz, Ofir Nachum, Sumedh Sontakke, Grecia Salazar, Huong T Tran, Jodilyn Peralta, Clayton Tan, Deeksha Manjunath, Jaspiar Singht, Brianna Zitkovich, Tomas Jackson, Kanishka Rao, Chelsea Finn, ...
work page 2023
-
[10]
Step-level value preference optimization for mathematical reasoning, 2024
Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning, 2024. URL https://arxiv.org/abs/2406.10858
-
[11]
Octopus v2: On-device language model for super agent, 2024
Wei Chen and Zhiyuan Li. Octopus v2: On-device language model for super agent, 2024
work page 2024
-
[12]
Mind2web: Towards a generalist agent for the web
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In NeurIPS Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=kiYqbO3wqw
work page 2023
- [13]
- [14]
-
[15]
Human-level performance in no-press diplomacy via equilibrium search, 2021
Jonathan Gray, Adam Lerer, Anton Bakhtin, and Noam Brown. Human-level performance in no-press diplomacy via equilibrium search, 2021
work page 2021
-
[16]
Reinforced self-training (rest) for language modeling, 2023
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (rest) for language modeling, 2023
work page 2023
-
[17]
A real-world webagent with planning, long context understanding, and program synthesis
Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. In ICLR, 2024. URL https://openreview.net/forum?id=9JQtrumvg8
work page 2024
-
[18]
Reasoning with language model is planning with world model, 2023
Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model, 2023
work page 2023
-
[19]
Webvoyager: Building an end-to-end web agent with large multimodal models
Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. ArXiv, 2024. URL https://api.semanticscholar.org/CorpusID:267211622
work page 2024
-
[20]
Bradley Knox, and Dorsa Sadigh
Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, and Dorsa Sadigh. Contrastive preference learning: Learning from human feedback without reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=iX1RjVQODj
work page 2024
-
[21]
L2mac: Large language model automatic computer for extensive code generation, 2024
Samuel Holt, Max Ruiz Luyten, and Mihaela van der Schaar. L2mac: Large language model automatic computer for extensive code generation, 2024
work page 2024
-
[22]
Cogagent: A visual language model for gui agents
Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. ArXiv, 2023
work page 2023
-
[23]
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022 a
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022 a
work page 2022
-
[24]
Inner Monologue: Embodied Reasoning through Planning with Language Models
Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022 b . URL https://arxiv.org/abs/2207.05608
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[25]
Ortega, Yee Whye Teh, and Nicolas Heess
Jan Humplik, Alexandre Galashov, Leonard Hasenclever, Pedro A. Ortega, Yee Whye Teh, and Nicolas Heess. Meta reinforcement learning as task inference, 2019. URL https://arxiv.org/abs/1905.06424
-
[26]
Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, and Minjoon Seo. Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards, 2024
work page 2024
-
[27]
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...
work page 2024
-
[28]
Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024
work page 2024
-
[29]
Barret Zoph John Schulman, Christina Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Michael Pokorny Luke Metz, Rapha Gontijo Lopes, Shengjia Zhao, Arun Vijayvergiya, Eric Sigler, Adam Perelman, Chelsea Voss, Mike Heaton, Joel Parish, Dave Cummings, Rajeev Nayak, Valerie Balcom, David Schnurr, Tomer Kaftan, Chris Hallacy,...
work page 2022
-
[30]
Bandit based monte-carlo planning
Levente Kocsis and Csaba Szepesv \'a ri. Bandit based monte-carlo planning. In Machine Learning: ECML 2006: 17th European Conference on Machine Learning Berlin, Germany, September 18-22, 2006 Proceedings, pages 282--293. Springer, 2006
work page 2006
-
[31]
Tree search for language agent models, 2024
Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language agent models, 2024. URL https://jykoh.com/search-agents/paper.pdf
work page 2024
-
[32]
Large language models are zero-shot reasoners, 2022
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2022
work page 2022
-
[33]
Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent, 2024 a
Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent, 2024 a
work page 2024
-
[34]
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms, 2024 b . URL https://arxiv.org/abs/2406.18629
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[35]
Beyond a*: Better planning with transformers via search dynamics bootstrapping, 2024
Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul Mcvay, Michael Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search dynamics bootstrapping, 2024. URL https://arxiv.org/abs/2402.14083
-
[36]
Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018
Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018
work page 2018
-
[37]
Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020
Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020
work page 2020
-
[38]
Let's verify step by step, 2023
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step, 2023
work page 2023
-
[39]
Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents, 2023
Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents, 2023
work page 2023
-
[40]
Choubey, Tian Lan, Jason Wu, Huan Wang, Shelby Heinecke, Caiming Xiong, and Silvio Savarese
Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K. Choubey, Tian Lan, Jason Wu, Huan Wang, Shelby Heinecke, Caiming Xiong, and Silvio Savarese. Agentlite: A lightweight library for building and advancing task-oriented llm agent system, 2024
work page 2024
-
[41]
Step-controlled dpo: Leveraging stepwise error for enhanced mathematical reasoning, 2024
Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, and Hongsheng Li. Step-controlled dpo: Leveraging stepwise error for enhanced mathematical reasoning, 2024. URL https://arxiv.org/abs/2407.00782
-
[42]
Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey, 2024
work page 2024
-
[44]
Webgpt: Browser-assisted question-answering with human feedback, 2022
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022
work page 2022
-
[45]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...
work page 2022
-
[46]
Autonomous evaluation and refinement of digital agents, 2024
Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents, 2024
work page 2024
-
[47]
Iterative reasoning preference optimization, 2024
Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization, 2024
work page 2024
-
[48]
Reasoning with language model prompting: A survey, 2023
Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey, 2023
work page 2023
-
[49]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2305.18290
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
From r to q^* : Your language model is secretly a q-function, 2024
Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to q^* : Your language model is secretly a q-function, 2024
work page 2024
-
[51]
Proximal policy optimization algorithms, 2017
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017
work page 2017
-
[52]
Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024 a
Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024 a
work page 2024
-
[53]
Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024 b
Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024 b . URL https://arxiv.org/abs/2406.14532
-
[54]
Mastering the game of go with deep neural networks and tree search
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 2017 a
work page 2017
-
[55]
Mastering the game of go without human knowledge
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550 0 (7676): 0 354--359, 2017 b
work page 2017
-
[56]
Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J
Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri...
work page 2024
-
[57]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[58]
Offline rl for natural language generation with implicit language q learning
Charlie Victor Snell, Ilya Kostrikov, Yi Su, Sherry Yang, and Sergey Levine. Offline rl for natural language generation with implicit language q learning. In The Eleventh International Conference on Learning Representations, 2022
work page 2022
-
[59]
Trial and error: Exploration-based trajectory optimization for llm agents, 2024
Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization for llm agents, 2024
work page 2024
-
[60]
Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022
work page 2022
-
[61]
Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L
Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. Cognitive architectures for language agents, 2024
work page 2024
-
[62]
Preference fine-tuning of llms should leverage suboptimal, on-policy data, 2024
Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of llms should leverage suboptimal, on-policy data, 2024
work page 2024
-
[63]
Toward self-improvement of llms via imagination, searching, and criticizing, 2024
Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing, 2024
work page 2024
-
[64]
Llama 2: Open foundation and fine-tuned chat models, 2023
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
work page 2023
-
[65]
Solving math word problems with process- and outcome-based feedback, 2022
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022
work page 2022
-
[66]
Mobile-agent: Autonomous multi-modal mobile device agent with visual perception
Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv, 2024 a
work page 2024
-
[67]
Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024 b
work page 2024
-
[68]
Self-consistency improves chain of thought reasoning in language models, 2023
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023
work page 2023
-
[69]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Neural Information Processing Systems, 2022
work page 2022
-
[70]
Os-copilot: Towards generalist computer agents with self-improvement
Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement. arXiv, 2024. URL https://api.semanticscholar.org/CorpusID:265149992
work page 2024
-
[71]
Agentgym: Evolving large language model-based agents across diverse environments, 2024
Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang Liao, Xin Guo, Wei He, Songyang Gao, Lu Chen, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. Agentgym: Evolving large language model-based agents across diverse environments, 2024. URL https://arxiv.org...
-
[72]
Lillicrap, Kenji Kawaguchi, and Michael Shieh
Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning, 2024
work page 2024
-
[73]
Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024
work page 2024
-
[74]
Webshop: Towards scalable real-world web interaction with grounded language agents, 2022
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2022
work page 2022
-
[75]
Griffiths, Yuan Cao, and Karthik R
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023 a . URL https://openreview.net/forum?id=5Xc1ecxO1h
work page 2023
-
[76]
React: Synergizing reasoning and acting in language models, 2023 b
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023 b
work page 2023
-
[77]
Self-rewarding language models, 2024
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models, 2024
work page 2024
-
[78]
Scaling relationship on learning mathematical reasoning with large language models, 2023
Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023
work page 2023
-
[79]
Star: Bootstrapping reasoning with reasoning
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35: 0 15476--15488, 2022
work page 2022
-
[80]
Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024
Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024
work page 2024
-
[81]
Ufo: A ui-focused agent for windows os interaction
Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, and Saravan Rajmohan. Ufo: A ui-focused agent for windows os interaction. arXiv, 2024 a . URL https://api.semanticscholar.org/CorpusID:267211622
work page 2024
-
[82]
Appagent: Multimodal agents as smartphone users, 2023
Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users, 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.