Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

arxiv: 2408.07199 · v1 · pith:2L4FAKGG · submitted 2024-08-13 · 💻 cs.AI · cs.LG

Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov

Pith reviewed 2026-05-20 09:36 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords LLM agents · Monte Carlo Tree Search · Direct Preference Optimization · self-critique · web navigation · interactive environments · autonomous agents

The pith

Guided Monte Carlo Tree Search plus self-critique and off-policy preference updates let language-model agents learn complex web tasks from their own interaction data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models can be turned into effective autonomous agents for multi-step interactive tasks by combining guided tree search, automatic critique of past attempts, and repeated preference-based fine-tuning on the resulting trajectories. This approach lets the model improve from both successful and unsuccessful runs without large sets of expert demonstrations. In practice it raises a Llama-3 70B model from 18.6 percent to 81.7 percent success on real booking scenarios after one day of data collection, and to 95.4 percent when online search is added. The same method outperforms behavior-cloning and reinforced fine-tuning baselines in the simulated WebShop shopping environment and exceeds average human performance there once search is allowed. The core promise is that agents can acquire reliable decision-making policies through self-generated experience rather than static supervised data.

Core claim

By interleaving guided Monte Carlo Tree Search for exploration, a self-critique step that labels trajectories as preferred or dispreferred, and iterative off-policy Direct Preference Optimization on the collected agent interactions, an LLM policy can be refined to solve long-horizon web navigation and booking tasks at high success rates without external reward models or large expert datasets.
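
The exploration step in the claim above can be illustrated with a UCB1-style selection rule, the usual way tree search trades off trying new actions against exploiting good ones. This is a minimal sketch, not the paper's implementation: the function name `ucb1_select`, the dictionary layout, the action strings, and the exploration constant `c` are all assumptions made for illustration.

```python
import math

def ucb1_select(children, c=1.0):
    """Return the action with the highest UCB1-style score.

    children: list of dicts with keys "action", "q" (mean value so far)
    and "n" (visit count). All names here are hypothetical.
    """
    total_visits = sum(ch["n"] for ch in children)

    def score(ch):
        if ch["n"] == 0:
            return math.inf  # unvisited actions are always tried first
        # value estimate plus an exploration bonus that shrinks with visits
        return ch["q"] + c * math.sqrt(math.log(total_visits) / ch["n"])

    return max(children, key=score)["action"]

# Made-up node statistics for a booking page.
children = [
    {"action": "click[search]", "q": 0.60, "n": 10},
    {"action": "type[date]", "q": 0.55, "n": 2},
    {"action": "go_back", "q": 0.10, "n": 8},
]
explore_choice = ucb1_select(children, c=1.0)
greedy_choice = ucb1_select(children, c=0.0)
```

With the exploration bonus enabled, the rarely visited `type[date]` branch wins despite its slightly lower mean value; with `c=0.0` the rule degenerates to greedy selection of `click[search]`.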

What carries the argument

Guided Monte Carlo Tree Search paired with self-critique that generates preference pairs for off-policy DPO updates on the agent's own interaction history.
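
For readers unfamiliar with how a preference pair turns into a training signal, the sketch below computes the standard DPO loss for a single (preferred, dispreferred) pair from sequence log-probabilities. It shows the textbook objective only; the details of the paper's off-policy variant are not given on this page, and the function name, argument names, and `beta` value are assumptions.

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained (logp_*) and under a frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), computed in a numerically stable way
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the loss equals log(2).
loss_neutral = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)
# Policy already favours the preferred response: the loss is lower.
loss_aligned = dpo_pair_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss falls as the policy raises the preferred response's likelihood relative to the reference and lowers the dispreferred one's, which is how failed trajectories contribute signal alongside successful ones.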

If this is right

The agent policy improves on both successful and failed attempts collected during its own rollouts.
Performance exceeds the average human level once online search is added to the same trained model.
The method works in both simulated shopping environments and real booking interfaces after modest interaction data.
Iterative off-policy updates remain stable enough to support continued improvement across multiple rounds of collection and fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same loop could be applied to other interactive domains such as code execution or robotic manipulation where self-generated trajectories are cheap to obtain.
Because the method relies on the base model's ability to critique its own outputs, gains may shrink for weaker base models that cannot reliably label their failures.
Combining the learned policy with external search or tool use appears to compound the benefit, suggesting a hybrid architecture for future agent systems.

Load-bearing premise

The model's own self-critique can accurately separate successful from unsuccessful trajectories without any external ground-truth reward signal.

What would settle it

Run the full pipeline for one day of data collection on a fresh set of booking or shopping tasks and measure whether success rate stays below 30 percent or fails to exceed behavior-cloning baselines.

Original abstract

Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Previous attempts to bridge this gap, through supervised fine-tuning on curated expert demonstrations, often suffer from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. Our method allows LLM agents to learn effectively from both successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning tasks. We validate our approach in the WebShop environment, a simulated e-commerce platform, where it consistently outperforms behavior cloning and reinforced fine-tuning baselines, and beats average human performance when equipped with the capability to do online search. In real-world booking scenarios, our methodology boosts the Llama-3 70B model's zero-shot performance from 18.6% to 81.7% success rate (a 340% relative increase) after a single day of data collection and further to 95.4% with online search. We believe this represents a substantial leap forward in the capabilities of autonomous agents, paving the way for more sophisticated and reliable decision-making in real-world settings.
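
The relative-increase figure in the abstract can be checked with a few lines of arithmetic. The helper name `relative_increase` is an illustrative choice; the two success rates are the ones quoted above.

```python
def relative_increase(before, after):
    """Percentage change from `before` to `after`, relative to `before`."""
    return (after - before) / before * 100.0

# Success rates quoted in the abstract, in percent.
absolute_gain = 81.7 - 18.6                 # percentage points
relative_gain = relative_increase(18.6, 81.7)  # roughly 339 percent
```

The absolute gain is 63.1 percentage points and the relative gain comes out at roughly 339 percent, which the abstract rounds to 340%.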

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Agent Q reports large gains on booking and WebShop by pairing guided MCTS data collection with self-critique labeling and off-policy DPO, but the labeling step is the part that still needs stronger checks.

The main thing to know is that this framework lifts Llama-3 70B zero-shot success on real booking tasks from 18.6% to 81.7% after one day of data collection and reaches 95.4% when online search is added. On WebShop it beats the behavior cloning and reinforced fine-tuning baselines and exceeds average human performance once search is enabled. The method collects trajectories with guided MCTS, uses the LLM itself to critique and label them as preferred or dispreferred, then applies iterative off-policy DPO to update the policy from both successes and failures.
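
One way to picture the labeling step described in the letter is to turn scored candidate actions at a single page into preferred/dispreferred pairs. The sketch below is an assumption-laden illustration, not the paper's procedure: the scores are made up, and the `min_gap` rule (skip pairs whose scores are too close to call) is a hypothetical design choice.

```python
from itertools import combinations

def build_preference_pairs(state, scored_actions, min_gap=0.2):
    """Form (chosen, rejected) pairs from actions scored at one state.

    scored_actions: list of (action, score) tuples.
    Pairs whose scores differ by less than `min_gap` are skipped,
    since the ordering between them is too uncertain to train on.
    """
    pairs = []
    for (a, score_a), (b, score_b) in combinations(scored_actions, 2):
        if abs(score_a - score_b) < min_gap:
            continue
        chosen, rejected = (a, b) if score_a > score_b else (b, a)
        pairs.append({"prompt": state, "chosen": chosen, "rejected": rejected})
    return pairs

# Hypothetical scores for three candidate actions on a booking page.
pairs = build_preference_pairs(
    "booking page: pick a date",
    [("click[May 22]", 0.9), ("click[May 23]", 0.8), ("go_back", 0.1)],
)
```

Here the two date clicks are too close to separate, so only the pairs that contrast each of them with `go_back` are kept. Each resulting dictionary has the prompt/chosen/rejected shape that preference-optimization trainers commonly expect.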

Referee Report

2 major / 2 minor

Summary. The paper introduces Agent Q, a framework combining guided Monte Carlo Tree Search (MCTS), a self-critique mechanism, and iterative off-policy Direct Preference Optimization (DPO) to improve LLM agents on multi-step reasoning tasks in interactive environments. It reports large empirical gains, including raising Llama-3 70B zero-shot success from 18.6% to 81.7% (340% relative) in real booking scenarios after one day of data collection and to 95.4% with online search, while outperforming behavior-cloning and reinforced fine-tuning baselines and average human performance in WebShop when search is enabled.

Significance. If the results hold, the work demonstrates a practical route to agent improvement that learns from both successful and unsuccessful trajectories without heavy reliance on expert demonstrations, addressing compounding errors in dynamic settings. Credit is due for the concrete cross-domain validation (simulated WebShop plus real booking) and for the explicit performance deltas reported.

major comments (2)

[§4 and §3.2] §4 (Experiments) and §3.2 (Self-Critique): The headline gains (18.6% → 81.7% zero-shot, then 95.4% with search) rest on the unvalidated assumption that LLM self-critique can accurately label trajectories into preferred/dispreferred pairs for off-policy DPO. No external ground-truth reward, human audit of labeling accuracy, or error-rate analysis is provided; in path-dependent tasks such as booking, mislabeling of partial failures (e.g., correct item but wrong shipping) could inject noise that directly destabilizes the claimed 340% relative improvement after a single day of collection.
[§4.1] §4.1 (WebShop results) and booking evaluation: The manuscript states consistent outperformance over behavior cloning and reinforced fine-tuning yet supplies neither error bars, number of trials, nor statistical significance tests for the reported success rates. This omission makes it impossible to assess whether the central performance claim is robust or could be explained by variance in the self-generated data.
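
To make the second comment concrete, here is a minimal sketch of the kind of error bar the referee is asking for: a normal-approximation standard error and 95% interval for a success rate. The counts in the example are invented for illustration and are not results from the paper.

```python
import math

def success_rate_interval(successes, trials, z=1.96):
    """Success rate, its standard error, and a normal-approximation interval."""
    p = successes / trials
    se = math.sqrt(p * (1.0 - p) / trials)
    low = max(0.0, p - z * se)
    high = min(1.0, p + z * se)
    return p, se, (low, high)

# Invented example: 40 successes out of 50 evaluation tasks.
rate, std_err, (low, high) = success_rate_interval(40, 50)
```

With only 50 tasks, an observed 80% success rate carries a standard error of about 5.7 percentage points, which is why the number of trials matters when comparing methods.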

minor comments (2)

The abstract and experimental sections omit implementation specifics (exact MCTS guidance function, DPO hyperparameters, data-collection protocol) that would be required for reproducibility.
[§3] Notation for the off-policy DPO variant and the precise form of the self-critique prompt should be formalized in a dedicated subsection or appendix to clarify how the preference pairs are constructed.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of validation and statistical rigor that we address below. We have revised the manuscript to incorporate additional analysis and reporting as described.

Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Self-Critique): The headline gains (18.6% → 81.7% zero-shot, then 95.4% with search) rest on the unvalidated assumption that LLM self-critique can accurately label trajectories into preferred/dispreferred pairs for off-policy DPO. No external ground-truth reward, human audit of labeling accuracy, or error-rate analysis is provided; in path-dependent tasks such as booking, mislabeling of partial failures (e.g., correct item but wrong shipping) could inject noise that directly destabilizes the claimed 340% relative improvement after a single day of collection.

Authors: We agree that explicit validation of the self-critique labeling would strengthen the paper. The large gains provide indirect support, but to directly address potential noise from partial failures in path-dependent tasks, we will add a new analysis subsection. This includes a human audit of labeling accuracy on a random sample of 150 booking trajectories (reporting agreement rate and error breakdown by failure type) plus discussion of how iterative off-policy DPO and MCTS guidance help filter noise across rounds. These additions will be in the revised §3.2 and §4.

Revision: yes

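
A labeling audit of the kind proposed in this response could be summarized as an agreement rate with a Wilson score interval, which behaves better than the normal approximation when agreement is near 100%. This is a sketch under assumptions: the function names are hypothetical and the ten labels are toy data, not audit results.

```python
import math

def agreement_count(model_labels, human_labels):
    """Count positions where the model's label matches the human label."""
    if len(model_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches, len(model_labels)

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a proportion k / n."""
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Toy data: the model and the human auditor disagree on one trajectory.
model = ["preferred"] * 6 + ["dispreferred"] * 4
human = ["preferred"] * 5 + ["dispreferred"] * 5
k, n = agreement_count(model, human)
low, high = wilson_interval(k, n)
```

Breaking the disagreements down by failure type, as the response proposes, would amount to running the same count separately per category.
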
Referee: [§4.1] §4.1 (WebShop results) and booking evaluation: The manuscript states consistent outperformance over behavior cloning and reinforced fine-tuning yet supplies neither error bars, number of trials, nor statistical significance tests for the reported success rates. This omission makes it impossible to assess whether the central performance claim is robust or could be explained by variance in the self-generated data.

Authors: We acknowledge the omission of these details in the submitted version. The WebShop results were averaged across 5 independent random seeds with 200 episodes each, and the booking results across 3 separate data-collection days with 50 held-out test cases per day. In the revision we will report standard errors as error bars, explicitly state the number of trials and seeds, and add paired t-test p-values comparing Agent Q against the behavior-cloning and reinforced fine-tuning baselines in both §4.1 and the booking evaluation.

Revision: yes
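
The paired comparison mentioned in this response can be sketched with the standard library alone. The code computes the paired t statistic and its degrees of freedom across matched seeds; the per-seed numbers are invented placeholders, not values from the paper, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for matched runs."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    # mean difference divided by its standard error
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1

# Placeholder per-seed success rates for five matched seeds.
method = [0.52, 0.50, 0.55, 0.51, 0.53]
baseline = [0.40, 0.41, 0.42, 0.39, 0.43]
t_stat, dof = paired_t_statistic(method, baseline)
```

Converting the statistic to a p-value needs the t-distribution, which the standard library does not provide; a routine such as `scipy.stats.ttest_rel` returns both in one call.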

Circularity Check

0 steps flagged

No significant circularity; empirical results are independent of inputs

The paper's central claims consist of measured performance gains (e.g., 18.6% to 81.7% success) obtained from new data collection and evaluation runs in WebShop and booking environments. These outcomes are produced by running the full pipeline on held-out tasks rather than by algebraic reduction of any fitted parameter or self-citation to the reported metric. No equations are presented that define a quantity in terms of itself, and the DPO and MCTS components are standard external methods whose correctness does not presuppose the final success rates. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes a composite method built from existing components (MCTS, self-critique, DPO) with modifications; no explicit free parameters, domain axioms, or new postulated entities are introduced beyond standard machine-learning assumptions about trajectory evaluation and policy updates.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

We use the base model to produce a feedback score for each action by asking it to rank the generated actions by its perceived utility... $Q(h_t, a_t^i) = \alpha \tilde{Q}(h_t, a_t^i) + (1 - \alpha) \hat{Q}(h_t, a_t^i)$
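
The formula in the quoted passage is a convex combination of two value estimates for the same action, weighted by α. A minimal sketch follows; the passage does not say here which estimate is which, so treating one as search-derived and the other as the model's feedback score is an assumption, as are the function name and the example numbers.

```python
def blend_q(q_tilde, q_hat, alpha=0.5):
    """Blend two value estimates: alpha * q_tilde + (1 - alpha) * q_hat."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * q_tilde + (1.0 - alpha) * q_hat

# Illustrative values: one estimate is pessimistic, the other optimistic.
q_even = blend_q(0.2, 0.8, alpha=0.5)     # equal weight on both estimates
q_first = blend_q(0.2, 0.8, alpha=1.0)    # trusts only the first estimate
q_second = blend_q(0.2, 0.8, alpha=0.0)   # trusts only the second estimate
```

Setting α to 1 or 0 recovers either estimate on its own, so α controls how much each source of evidence is trusted.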

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
cs.LG 2026-05 unverdicted novelty 7.0

GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.

Don't Click That: Teaching Web Agents to Resist Deceptive Interfaces
cs.AI 2026-05 unverdicted novelty 7.0

DUDE framework reduces web agents' susceptibility to deceptive UIs by 53.8% on a new 1,407-scenario benchmark while preserving task performance.

Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
cs.AI 2026-04 unverdicted novelty 7.0

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.

Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
cs.LG 2026-04 unverdicted novelty 7.0

RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
cs.AI 2025-06 unverdicted novelty 7.0

Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.

Group-in-Group Policy Optimization for LLM Agent Training
cs.LG 2025-05 unverdicted novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents
cs.CL 2026-05 unverdicted novelty 6.0

ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points higher success and 2.1x better sample efficien...

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

QuantClaw: Precision Where It Matters for OpenClaw
cs.AI 2026-04 unverdicted novelty 6.0

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

DynaWeb: Model-Based Reinforcement Learning of Web Agents
cs.CL 2026-01 unverdicted novelty 6.0

DynaWeb introduces a model-based RL framework that trains web agents via imagined rollouts in a learned web world model interleaved with real expert trajectories, yielding consistent gains on WebArena and WebVoyager b...

Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training
cs.AI 2025-06 unverdicted novelty 6.0

Mobile-R1 introduces a hierarchical three-stage curriculum that combines format alignment, verifiable action feedback, and multi-turn environment training to improve exploration and self-correction in VLM-based mobile...

A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback
cs.SE 2026-05 unverdicted novelty 5.0

A-ProS uses a hybrid multi-model feedback framework with stateful refinement to improve success rates on competitive programming problems, achieving over 2x gains compared to baseline agent loops.

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
cs.CL 2026-05 unverdicted novelty 5.0

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 co-evolves skill selection, utilization, and distillation inside a single policy using only task-outcome reward, with low-frequency trends crediting selection and high-frequency variation crediting distillation...

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 5.0

Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI 2026-04 unverdicted novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

RoboAgent: Chaining Basic Capabilities for Embodied Task Planning
cs.RO 2026-04 unverdicted novelty 5.0

RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.

Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
cs.CV 2026-03 unverdicted novelty 5.0

Empirical study finds background semantics, random pruning, and recency-based allocation improve token efficiency for GUI visual agents.

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
cs.AI 2025-08 unverdicted novelty 5.0

A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.

Agentic Reasoning for Large Language Models
cs.AI 2026-01 unverdicted novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
cs.AI 2025-07 accept novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

Large Language Model-Brained GUI Agents: A Survey
cs.AI 2024-11 unverdicted novelty 4.0

A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

Reference graph

Works this paper leans on

298 extracted references · 298 canonical work pages · cited by 20 Pith papers · 15 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Introducing the next generation of claude, 2024

Anthropic. Introducing the next generation of claude, 2024. URL Introducing the next generation of Claude

work page 2024
[4]

Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024

Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024

work page 2024
[5]

Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

work page 2022
[6]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38 0 (16): 0 17682–17690, March 202...

work page doi:10.1609/aaai.v38i16.29720 2024
[7]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Superhuman ai for multiplayer poker

Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 2019

work page 2019
[9]

Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions, 2023

Yevgen Chebotar, Quan Vuong, Alex Irpan, Karol Hausman, Fei Xia, Yao Lu, Aviral Kumar, Tianhe Yu, Alexander Herzog, Karl Pertsch, Keerthana Gopalakrishnan, Julian Ibarz, Ofir Nachum, Sumedh Sontakke, Grecia Salazar, Huong T Tran, Jodilyn Peralta, Clayton Tan, Deeksha Manjunath, Jaspiar Singht, Brianna Zitkovich, Tomas Jackson, Kanishka Rao, Chelsea Finn, ...

work page 2023
[10]

Step-level value preference optimization for mathematical reasoning, 2024

Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning, 2024. URL https://arxiv.org/abs/2406.10858

work page arXiv 2024
[11]

Octopus v2: On-device language model for super agent, 2024

Wei Chen and Zhiyuan Li. Octopus v2: On-device language model for super agent, 2024

work page 2024
[12]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In NeurIPS Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=kiYqbO3wqw

work page 2023
[13]

Hashimoto

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback, 2024

work page 2024
[14]

Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D. Goodman. Stream of search (sos): Learning to search in language, 2024. URL https://arxiv.org/abs/2404.03683

work page arXiv 2024
[15]

Human-level performance in no-press diplomacy via equilibrium search, 2021

Jonathan Gray, Adam Lerer, Anton Bakhtin, and Noam Brown. Human-level performance in no-press diplomacy via equilibrium search, 2021

work page 2021
[16]

Reinforced self-training (rest) for language modeling, 2023

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (rest) for language modeling, 2023

work page 2023
[17]

A real-world webagent with planning, long context understanding, and program synthesis

Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. In ICLR, 2024. URL https://openreview.net/forum?id=9JQtrumvg8

work page 2024
[18]

Reasoning with language model is planning with world model, 2023

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model, 2023

work page 2023
[19]

Webvoyager: Building an end-to-end web agent with large multimodal models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. ArXiv, 2024. URL https://api.semanticscholar.org/CorpusID:267211622

work page 2024
[20]

Bradley Knox, and Dorsa Sadigh

Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, and Dorsa Sadigh. Contrastive preference learning: Learning from human feedback without reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=iX1RjVQODj

work page 2024
[21]

L2mac: Large language model automatic computer for extensive code generation, 2024

Samuel Holt, Max Ruiz Luyten, and Mihaela van der Schaar. L2mac: Large language model automatic computer for extensive code generation, 2024

work page 2024
[22]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. ArXiv, 2023

work page 2023
[23]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022 a

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022 a

work page 2022
[24]

Inner Monologue: Embodied Reasoning through Planning with Language Models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022 b . URL https://arxiv.org/abs/2207.05608

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

Ortega, Yee Whye Teh, and Nicolas Heess

Jan Humplik, Alexandre Galashov, Leonard Hasenclever, Pedro A. Ortega, Yee Whye Teh, and Nicolas Heess. Meta reinforcement learning as task inference, 2019. URL https://arxiv.org/abs/1905.06424

work page arXiv 2019
[26]

Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards, 2024

Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, and Minjoon Seo. Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards, 2024

work page 2024
[27]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

work page 2024
[28]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024

work page 2024
[29]

Introducing chatgpt, 2022

Barret Zoph John Schulman, Christina Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Michael Pokorny Luke Metz, Rapha Gontijo Lopes, Shengjia Zhao, Arun Vijayvergiya, Eric Sigler, Adam Perelman, Chelsea Voss, Mike Heaton, Joel Parish, Dave Cummings, Rajeev Nayak, Valerie Balcom, David Schnurr, Tomer Kaftan, Chris Hallacy,...

work page 2022
[30]

Bandit based monte-carlo planning

Levente Kocsis and Csaba Szepesv \'a ri. Bandit based monte-carlo planning. In Machine Learning: ECML 2006: 17th European Conference on Machine Learning Berlin, Germany, September 18-22, 2006 Proceedings, pages 282--293. Springer, 2006

work page 2006
[31]

Tree search for language agent models, 2024

Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language agent models, 2024. URL https://jykoh.com/search-agents/paper.pdf

work page 2024
[32]

Large language models are zero-shot reasoners, 2022

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2022

work page 2022
[33]

Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent, 2024a

Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent, 2024a

[34]

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms, 2024b. URL https://arxiv.org/abs/2406.18629

[35]

Beyond a*: Better planning with transformers via search dynamics bootstrapping, 2024

Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul Mcvay, Michael Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search dynamics bootstrapping, 2024. URL https://arxiv.org/abs/2402.14083

[36]

Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018

[37]

Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020

[38]

Let's verify step by step, 2023

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step, 2023

[39]

Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents, 2023

Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents, 2023

[40]

Agentlite: A lightweight library for building and advancing task-oriented llm agent system, 2024

Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K. Choubey, Tian Lan, Jason Wu, Huan Wang, Shelby Heinecke, Caiming Xiong, and Silvio Savarese. Agentlite: A lightweight library for building and advancing task-oriented llm agent system, 2024

[41]

Step-controlled dpo: Leveraging stepwise error for enhanced mathematical reasoning, 2024

Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, and Hongsheng Li. Step-controlled dpo: Leveraging stepwise error for enhanced mathematical reasoning, 2024. URL https://arxiv.org/abs/2407.00782

[42]

The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey, 2024

Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey, 2024

[44]

Webgpt: Browser-assisted question-answering with human feedback, 2022

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022

[45]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

[46]

Autonomous evaluation and refinement of digital agents, 2024

Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents, 2024

[47]

Iterative reasoning preference optimization, 2024

Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization, 2024

[48]

Reasoning with language model prompting: A survey, 2023

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey, 2023

[49]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2305.18290

[50]

From r to Q*: Your language model is secretly a Q-function, 2024

Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to Q*: Your language model is secretly a Q-function, 2024

[51]

Proximal policy optimization algorithms, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

[52]

Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024a

Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024a

[53]

Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024b

Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024b. URL https://arxiv.org/abs/2406.14532

[54]

Mastering the game of go with deep neural networks and tree search

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 2017a

[55]

Mastering the game of go without human knowledge

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017b

[56]

Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri...

[57]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314

[58]

Offline rl for natural language generation with implicit language q learning

Charlie Victor Snell, Ilya Kostrikov, Yi Su, Sherry Yang, and Sergey Levine. Offline rl for natural language generation with implicit language q learning. In The Eleventh International Conference on Learning Representations, 2022

[59]

Trial and error: Exploration-based trajectory optimization for llm agents, 2024

Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization for llm agents, 2024

[60]

Learning to summarize from human feedback, 2022

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022

[61]

Cognitive architectures for language agents, 2024

Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. Cognitive architectures for language agents, 2024

[62]

Preference fine-tuning of llms should leverage suboptimal, on-policy data, 2024

Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of llms should leverage suboptimal, on-policy data, 2024

[63]

Toward self-improvement of llms via imagination, searching, and criticizing, 2024

Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing, 2024

[64]

Llama 2: Open foundation and fine-tuned chat models, 2023

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

[65]

Solving math word problems with process- and outcome-based feedback, 2022

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022

[66]

Mobile-agent: Autonomous multi-modal mobile device agent with visual perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv, 2024a

[67]

Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024b

[68]

Self-consistency improves chain of thought reasoning in language models, 2023

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023

[69]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Neural Information Processing Systems, 2022

[70]

Os-copilot: Towards generalist computer agents with self-improvement

Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement. arXiv, 2024. URL https://api.semanticscholar.org/CorpusID:265149992

[71]

Agentgym: Evolving large language model-based agents across diverse environments, 2024

Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang Liao, Xin Guo, Wei He, Songyang Gao, Lu Chen, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. Agentgym: Evolving large language model-based agents across diverse environments, 2024. URL https://arxiv.org...

[72]

Monte carlo tree search boosts reasoning via iterative preference learning, 2024

Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning, 2024

[73]

Swe-agent: Agent-computer interfaces enable automated software engineering, 2024

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024

[74]

Webshop: Towards scalable real-world web interaction with grounded language agents, 2022

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2022

[75]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023a. URL https://openreview.net/forum?id=5Xc1ecxO1h

[76]

React: Synergizing reasoning and acting in language models, 2023b

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023b

[77]

Self-rewarding language models, 2024

Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models, 2024

[78]

Scaling relationship on learning mathematical reasoning with large language models, 2023

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023

[79]

Star: Bootstrapping reasoning with reasoning

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

[80]

Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024

Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024

[81]

Ufo: A ui-focused agent for windows os interaction

Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, and Saravan Rajmohan. Ufo: A ui-focused agent for windows os interaction. arXiv, 2024a. URL https://api.semanticscholar.org/CorpusID:267211622

[82]

Appagent: Multimodal agents as smartphone users, 2023

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users, 2023


Showing first 80 references.

[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [3]

Introducing the next generation of claude, 2024

Anthropic. Introducing the next generation of claude, 2024. URL Introducing the next generation of Claude

work page 2024

[3] [4]

Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024

Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024

work page 2024

[4] [5]

Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, K...

work page 2022

[5] [6]

Graph of thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38 0 (16): 0 17682–17690, March 202...

work page doi:10.1609/aaai.v38i16.29720 2024

[6] [7]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [8]

Superhuman ai for multiplayer poker

Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 2019

work page 2019

[8] [9]

Q-transformer: Scalable offline reinforcement learning via autoregressive q-functions, 2023

Yevgen Chebotar, Quan Vuong, Alex Irpan, Karol Hausman, Fei Xia, Yao Lu, Aviral Kumar, Tianhe Yu, Alexander Herzog, Karl Pertsch, Keerthana Gopalakrishnan, Julian Ibarz, Ofir Nachum, Sumedh Sontakke, Grecia Salazar, Huong T Tran, Jodilyn Peralta, Clayton Tan, Deeksha Manjunath, Jaspiar Singht, Brianna Zitkovich, Tomas Jackson, Kanishka Rao, Chelsea Finn, ...

work page 2023

[9] [10]

Step-level value preference optimization for mathematical reasoning, 2024

Guoxin Chen, Minpeng Liao, Chengxi Li, and Kai Fan. Step-level value preference optimization for mathematical reasoning, 2024. URL https://arxiv.org/abs/2406.10858

work page arXiv 2024

[10] [11]

Octopus v2: On-device language model for super agent, 2024

Wei Chen and Zhiyuan Li. Octopus v2: On-device language model for super agent, 2024

work page 2024

[11] [12]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In NeurIPS Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=kiYqbO3wqw

work page 2023

[12] [13]

Hashimoto

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback, 2024

work page 2024

[13] [14]

Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D. Goodman. Stream of search (sos): Learning to search in language, 2024. URL https://arxiv.org/abs/2404.03683

work page arXiv 2024

[14] [15]

Human-level performance in no-press diplomacy via equilibrium search, 2021

Jonathan Gray, Adam Lerer, Anton Bakhtin, and Noam Brown. Human-level performance in no-press diplomacy via equilibrium search, 2021

work page 2021

[15] [16]

Reinforced self-training (rest) for language modeling, 2023

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (rest) for language modeling, 2023

work page 2023

[16] [17]

A real-world webagent with planning, long context understanding, and program synthesis

Izzeddin Gur, Hiroki Furuta, Austin V Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. In ICLR, 2024. URL https://openreview.net/forum?id=9JQtrumvg8

work page 2024

[17] [18]

Reasoning with language model is planning with world model, 2023

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model, 2023

work page 2023

[18] [19]

Webvoyager: Building an end-to-end web agent with large multimodal models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. ArXiv, 2024. URL https://api.semanticscholar.org/CorpusID:267211622

work page 2024

[19] [20]

Bradley Knox, and Dorsa Sadigh

Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, and Dorsa Sadigh. Contrastive preference learning: Learning from human feedback without reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=iX1RjVQODj

work page 2024

[20] [21]

L2mac: Large language model automatic computer for extensive code generation, 2024

Samuel Holt, Max Ruiz Luyten, and Mihaela van der Schaar. L2mac: Large language model automatic computer for extensive code generation, 2024

work page 2024

[21] [22]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents. ArXiv, 2023

work page 2023

[22] [23]

Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022 a

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022 a

work page 2022

[23] [24]

Inner Monologue: Embodied Reasoning through Planning with Language Models

Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. Inner monologue: Embodied reasoning through planning with language models, 2022 b . URL https://arxiv.org/abs/2207.05608

work page internal anchor Pith review Pith/arXiv arXiv 2022

[24] [25]

Ortega, Yee Whye Teh, and Nicolas Heess

Jan Humplik, Alexandre Galashov, Leonard Hasenclever, Pedro A. Ortega, Yee Whye Teh, and Nicolas Heess. Meta reinforcement learning as task inference, 2019. URL https://arxiv.org/abs/1905.06424

work page arXiv 2019

[25] [26]

Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards, 2024

Hyeonbin Hwang, Doyoung Kim, Seungone Kim, Seonghyeon Ye, and Minjoon Seo. Self-explore to avoid the pit: Improving the reasoning capabilities of language models with fine-grained rewards, 2024

work page 2024

[26] [27]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

work page 2024

[27] [28]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024

work page 2024

[28] [29]

Introducing chatgpt, 2022

Barret Zoph John Schulman, Christina Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Michael Pokorny Luke Metz, Rapha Gontijo Lopes, Shengjia Zhao, Arun Vijayvergiya, Eric Sigler, Adam Perelman, Chelsea Voss, Mike Heaton, Joel Parish, Dave Cummings, Rajeev Nayak, Valerie Balcom, David Schnurr, Tomer Kaftan, Chris Hallacy,...

work page 2022

[29] [30]

Bandit based monte-carlo planning

Levente Kocsis and Csaba Szepesv \'a ri. Bandit based monte-carlo planning. In Machine Learning: ECML 2006: 17th European Conference on Machine Learning Berlin, Germany, September 18-22, 2006 Proceedings, pages 282--293. Springer, 2006

work page 2006

[30] [31]

Tree search for language agent models, 2024

Jing Yu Koh, Stephen McAleer, Daniel Fried, and Ruslan Salakhutdinov. Tree search for language agent models, 2024. URL https://jykoh.com/search-agents/paper.pdf

work page 2024

[31] [32]

Large language models are zero-shot reasoners, 2022

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2022

work page 2022

[32] [33]

Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent, 2024 a

Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent, 2024 a

work page 2024

[33] [34]

Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, and Jiaya Jia. Step-dpo: Step-wise preference optimization for long-chain reasoning of llms, 2024 b . URL https://arxiv.org/abs/2406.18629

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [35]

Beyond a*: Better planning with transformers via search dynamics bootstrapping, 2024

Lucas Lehnert, Sainbayar Sukhbaatar, DiJia Su, Qinqing Zheng, Paul Mcvay, Michael Rabbat, and Yuandong Tian. Beyond a*: Better planning with transformers via search dynamics bootstrapping, 2024. URL https://arxiv.org/abs/2402.14083

work page arXiv 2024

[35] [36]

Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018

work page 2018

[36] [37]

Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems, 2020

work page 2020

[37] [38]

Let's verify step by step, 2023

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step, 2023

work page 2023

[38] [39]

Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents, 2023

Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, and Silvio Savarese. Bolaa: Benchmarking and orchestrating llm-augmented autonomous agents, 2023

work page 2023

[39] [40]

Choubey, Tian Lan, Jason Wu, Huan Wang, Shelby Heinecke, Caiming Xiong, and Silvio Savarese

Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K. Choubey, Tian Lan, Jason Wu, Huan Wang, Shelby Heinecke, Caiming Xiong, and Silvio Savarese. Agentlite: A lightweight library for building and advancing task-oriented llm agent system, 2024

work page 2024

[40] [41]

Step-controlled dpo: Leveraging stepwise error for enhanced mathematical reasoning, 2024

Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, and Hongsheng Li. Step-controlled dpo: Leveraging stepwise error for enhanced mathematical reasoning, 2024. URL https://arxiv.org/abs/2407.00782

work page arXiv 2024

[41] [42]

The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey, 2024

Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey, 2024

work page 2024

[42] [44]

Webgpt: Browser-assisted question-answering with human feedback, 2022

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022

work page 2022

[43] [45]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

work page 2022

[44] [46]

Autonomous evaluation and refinement of digital agents, 2024

Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents, 2024

work page 2024

[45] [47]

Iterative reasoning preference optimization, 2024

Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization, 2024

work page 2024

[46] [48]

Reasoning with language model prompting: A survey, 2023

Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey, 2023

work page 2023

[47] [49]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2305.18290

work page internal anchor Pith review Pith/arXiv arXiv 2023

[48] [50]

From r to q^* : Your language model is secretly a q-function, 2024

Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to q^* : Your language model is secretly a q-function, 2024

work page 2024

[49] [51]

Proximal policy optimization algorithms, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

work page 2017

[50] [52]

Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024 a

Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024 a

work page 2024

[51] [53]

Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024 b

Amrith Setlur, Saurabh Garg, Xinyang Geng, Naman Garg, Virginia Smith, and Aviral Kumar. Rl on incorrect synthetic data scales the efficiency of llm math reasoning by eight-fold, 2024 b . URL https://arxiv.org/abs/2406.14532

work page arXiv 2024

[52] [54]

Mastering the game of go with deep neural networks and tree search

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 2017 a

work page 2017

[53] [55]

Mastering the game of go without human knowledge

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550 0 (7676): 0 354--359, 2017 b

[54] [56]

Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri...

[55] [57]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314

[56] [58]

Offline rl for natural language generation with implicit language q learning

Charlie Victor Snell, Ilya Kostrikov, Yi Su, Sherry Yang, and Sergey Levine. Offline rl for natural language generation with implicit language q learning. In The Eleventh International Conference on Learning Representations, 2022

[57] [59]

Trial and error: Exploration-based trajectory optimization for llm agents, 2024

Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization for llm agents, 2024

[58] [60]

Learning to summarize from human feedback, 2022

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022

[59] [61]

Cognitive architectures for language agents, 2024

Theodore R. Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas L. Griffiths. Cognitive architectures for language agents, 2024

[60] [62]

Preference fine-tuning of llms should leverage suboptimal, on-policy data, 2024

Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of llms should leverage suboptimal, on-policy data, 2024

[61] [63]

Toward self-improvement of llms via imagination, searching, and criticizing, 2024

Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing, 2024

[62] [64]

Llama 2: Open foundation and fine-tuned chat models, 2023

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...

[63] [65]

Solving math word problems with process- and outcome-based feedback, 2022

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022

[64] [66]

Mobile-agent: Autonomous multi-modal mobile device agent with visual perception

Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv, 2024a

[65] [67]

Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations, 2024b

[66] [68]

Self-consistency improves chain of thought reasoning in language models, 2023

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023

[67] [69]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. Neural Information Processing Systems, 2022

[68] [70]

Os-copilot: Towards generalist computer agents with self-improvement

Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement. arXiv, 2024. URL https://api.semanticscholar.org/CorpusID:265149992

[69] [71]

Agentgym: Evolving large language model-based agents across diverse environments, 2024

Zhiheng Xi, Yiwen Ding, Wenxiang Chen, Boyang Hong, Honglin Guo, Junzhe Wang, Dingwen Yang, Chenyang Liao, Xin Guo, Wei He, Songyang Gao, Lu Chen, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. Agentgym: Evolving large language model-based agents across diverse environments, 2024. URL https://arxiv.org...

[70] [72]

Monte carlo tree search boosts reasoning via iterative preference learning, 2024

Yuxi Xie, Anirudh Goyal, Wenyue Zheng, Min-Yen Kan, Timothy P. Lillicrap, Kenji Kawaguchi, and Michael Shieh. Monte carlo tree search boosts reasoning via iterative preference learning, 2024

[71] [73]

Swe-agent: Agent-computer interfaces enable automated software engineering, 2024

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024

[72] [74]

Webshop: Towards scalable real-world web interaction with grounded language agents, 2022

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2022

[73] [75]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023a. URL https://openreview.net/forum?id=5Xc1ecxO1h

[74] [76]

React: Synergizing reasoning and acting in language models, 2023b

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023b

[75] [77]

Self-rewarding language models, 2024

Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models, 2024

[76] [78]

Scaling relationship on learning mathematical reasoning with large language models, 2023

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023

[77] [79]

Star: Bootstrapping reasoning with reasoning

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476--15488, 2022

[78] [80]

Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024

Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024

[79] [81]

Ufo: A ui-focused agent for windows os interaction

Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, and Saravan Rajmohan. Ufo: A ui-focused agent for windows os interaction. arXiv, 2024a. URL https://api.semanticscholar.org/CorpusID:267211622

[80] [82]

Appagent: Multimodal agents as smartphone users, 2023

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users, 2023
