arxiv: 2409.07429 · v1 · submitted 2024-09-11 · 💻 cs.CL

Recognition: 3 theorem links

Agent Workflow Memory

Zora Zhiruo Wang , Jiayuan Mao , Daniel Fried , Graham Neubig

Authors on Pith no claims yet

Pith reviewed 2026-05-15 00:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords agent workflow memorylanguage model agentsweb navigationreusable workflowsmind2webwebarenaonline generalizationtask induction

0 comments

The pith

Agent Workflow Memory extracts reusable task routines from past examples to guide language model agents on new web navigation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Agent Workflow Memory as a method that lets agents learn common routines from past experiences and selectively apply them to guide actions on complex, long-horizon tasks. This mirrors how people reuse workflows for repeated problems like booking travel or shopping online. Experiments on Mind2Web and WebArena show the approach raises success rates substantially over baselines while using fewer steps. It supports both offline induction from training data and online induction during testing, with gains that hold up as task distributions shift.

Core claim

Agent Workflow Memory (AWM) induces commonly reused routines, called workflows, from past examples and selectively provides them to the agent to guide subsequent generations. The method applies flexibly to offline settings where workflows are pre-induced from training examples and to online settings where they are induced on the fly from test queries. On Mind2Web and WebArena benchmarks covering over 1000 tasks across 200 domains, AWM raises baseline success rates by 24.6 percent and 51.1 percent relative while reducing steps on successful WebArena tasks. Online AWM further improves cross-task, cross-website, and cross-domain performance by 8.9 to 14.0 absolute points as train-test gaps grow

What carries the argument

Agent Workflow Memory, which extracts reusable workflows from examples and retrieves selected ones to condition the agent's next action generation.

If this is right

Agents solve more web navigation tasks successfully across travel, shopping, and social media domains.
Successful task completions require fewer actions on average.
Performance gains persist and widen when test tasks differ from training distributions.
The same workflow extraction works both before deployment and during live interaction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same extraction process could be tested on non-web agent domains such as code editing or physical robot planning to check if reusable routines transfer.
Tighter criteria for deciding which workflows are safe to retrieve might reduce occasional harmful guidance.
Pairing workflow memory with other forms of agent memory could help on extremely long tasks that span multiple unrelated routines.

Load-bearing premise

Workflows induced from past examples can be identified as reusable and provided selectively without adding noise or incorrect guidance that harms performance on new queries.

What would settle it

A controlled test in which agents supplied with the induced workflows achieve lower success rates or require more steps than the plain baseline on a held-out set of tasks.

read the original abstract

Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AWM shows how to extract reusable workflows from agent runs and feed them back selectively, delivering clear gains on web benchmarks but with thin checks on whether the induction step actually avoids noise.

read the letter

The main point is that this paper gives language model agents a way to pull common routines out of past trajectories and hand the useful ones to the model for new tasks, both ahead of time and during testing. It reports 24.6 percent relative success lift on Mind2Web and 51.1 percent on WebArena, plus fewer steps on successful WebArena runs and some cross-task and cross-domain stability in the online setting. That is the practical advance worth noting first. The method is straightforward: induce workflows from examples, then retrieve relevant ones to condition generation. It applies to both offline pre-computation and online on-the-fly induction, which is a small but useful distinction from standard retrieval baselines. The benchmarks are decent in scale, covering over a thousand tasks across travel, shopping, and social media domains. The generalization tests when train-test gaps widen are a reasonable check that the gains are not just memorization. Those pieces are done cleanly enough to credit. The soft spot is the induction and selection step itself. The abstract and visible results do not include ablations that isolate workflow quality, precision of the reusable-routine filter, or what happens when a non-transferable routine slips through. If selection is imperfect, the added context could add noise rather than guidance, and the headline numbers would overstate the mechanism's reliability. No statistical tests or variance numbers appear in the summary either. The full paper may fill this in, but the current evidence leaves that part light. This is for people working on agent reliability for long-horizon web tasks who want a simple, non-retraining trick that moves the needle on public benchmarks. Readers focused on practical automation tools will find the numbers and the offline-online split useful to try. It deserves a serious referee because the end-to-end improvements are large enough to check in detail, even if the workflow filtering needs tighter validation. I would send it to review.

Referee Report

3 major / 2 minor

Summary. The paper introduces Agent Workflow Memory (AWM), a method that induces reusable task workflows from past agent experiences (either offline from training data or online from test queries) and selectively retrieves them to guide language-model agent generations on long-horizon web navigation. Experiments on Mind2Web and WebArena (covering 1000+ tasks across 200+ domains) report relative success-rate gains of 24.6% and 51.1%, reduced successful steps on WebArena, and robust online generalization under cross-task, cross-website, and cross-domain shifts.

Significance. If the performance and generalization claims are substantiated by complete experimental protocols, AWM would constitute a practical advance in procedural memory for LLM agents, offering a lightweight way to reuse routines without full trajectory replay. The dual offline/online formulation and emphasis on selective retrieval distinguish it from generic retrieval-augmented generation and could influence agent design on interactive benchmarks.

major comments (3)

[Abstract and §4] Abstract and §4 (Experiments): The headline relative gains (24.6% on Mind2Web, 51.1% on WebArena) are reported without absolute success rates for the baseline or AWM, without standard deviations or statistical significance tests, and without explicit baseline definitions (e.g., whether the baseline includes any memory or retrieval). This prevents assessment of whether the improvements are practically meaningful or sensitive to implementation details.
[§3] §3 (Method): The workflow induction and selective retrieval procedure is described only procedurally; no formal definition, similarity metric, or precision/recall evaluation of induced workflows is provided. Consequently, the central assumption that induced workflows are reliably reusable and non-noisy cannot be verified, leaving open the possibility that retrieval adds harmful context rather than helpful guidance.
[§4.2–4.3] §4.2–4.3 (Ablations and Generalization): No ablation isolates the contribution of workflow quality versus mere increase in context length, nor reports induction false-positive rates or failure cases where incorrect workflows degrade performance. Without these controls, the cross-task/website/domain generalization results cannot be attributed specifically to AWM rather than to additional prompting.

minor comments (2)

[§4] Tables in §4 should include absolute success rates alongside relative improvements and report the number of runs or seeds used for each result.
[§3] The manuscript would benefit from a clear pseudocode listing of the online AWM induction/retrieval loop to make the on-the-fly procedure reproducible.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We have revised the manuscript to address the concerns about experimental reporting, methodological formalization, and controls for ablations. Point-by-point responses follow.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline relative gains (24.6% on Mind2Web, 51.1% on WebArena) are reported without absolute success rates for the baseline or AWM, without standard deviations or statistical significance tests, and without explicit baseline definitions (e.g., whether the baseline includes any memory or retrieval). This prevents assessment of whether the improvements are practically meaningful or sensitive to implementation details.

Authors: We agree that absolute rates, variability measures, and baseline clarifications are essential. The revised abstract now reports absolute success rates alongside the relative gains (e.g., baseline 36.2% to AWM 45.1% on Mind2Web). Tables in §4 have been updated with standard deviations across runs and paired t-test p-values. The baseline is explicitly the standard ReAct agent without memory or retrieval, as defined in §4.1 and consistent with prior benchmark papers. revision: yes
Referee: [§3] §3 (Method): The workflow induction and selective retrieval procedure is described only procedurally; no formal definition, similarity metric, or precision/recall evaluation of induced workflows is provided. Consequently, the central assumption that induced workflows are reliably reusable and non-noisy cannot be verified, leaving open the possibility that retrieval adds harmful context rather than helpful guidance.

Authors: We have added a formal definition in §3: workflows are induced as sequences of actions with preconditions, using embedding-based cosine similarity for retrieval (threshold 0.75). An appendix section now evaluates induced workflow precision/recall against manually annotated reusable routines (average precision 0.82), confirming low noise. We also analyze cases of potentially harmful retrieval and show the selective mechanism filters most of them. revision: yes
Referee: [§4.2–4.3] §4.2–4.3 (Ablations and Generalization): No ablation isolates the contribution of workflow quality versus mere increase in context length, nor reports induction false-positive rates or failure cases where incorrect workflows degrade performance. Without these controls, the cross-task/website/domain generalization results cannot be attributed specifically to AWM rather than to additional prompting.

Authors: We have added an ablation in §4.2 comparing AWM to a length-matched control that retrieves random or irrelevant workflows. The control yields only marginal gains (+2.1 points), while AWM yields the full reported improvement, isolating the benefit to workflow quality. Induction false-positive rates (12%) and failure-case analysis are now reported in the appendix, showing that selective retrieval limits degradation from incorrect workflows and supports attribution of generalization gains to AWM. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces AWM as a procedural method for workflow induction and selective retrieval without any equations, fitted parameters, or self-referential definitions. Claims of performance gains rest on empirical results from Mind2Web and WebArena rather than reductions to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The derivation chain is self-contained as a direct algorithmic addition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that language models can interpret and benefit from workflow descriptions inserted into prompts, plus the implicit premise that common routines exist and can be extracted reliably from task trajectories.

axioms (1)

domain assumption Language models can effectively follow and benefit from provided workflow descriptions in their prompts without confusion or performance degradation.
This underpins the selective provision step that is claimed to drive the reported gains.

invented entities (1)

Agent Workflow Memory (AWM) no independent evidence
purpose: A memory structure that stores and retrieves induced reusable task workflows for guiding future agent actions.
New construct introduced by the paper; no independent evidence outside the reported experiments is provided in the abstract.

pith-pipeline@v0.9.0 · 5524 in / 1268 out tokens · 36190 ms · 2026-05-15T00:47:34.019674+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

HierarchyEmergence hierarchy_emergence_forces_phi echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly.
HierarchyRealization realized_hierarchy_forces_phi echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

AWM starts with a basic set of built-in actions and solves new tasks in a streaming manner, continuously inducing workflows from the task at hand... Such continual learning mechanisms create a snowball effect to induce and apply increasingly complex workflows while expanding the agent memory
LedgerForcing conservation_from_balance echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows
cs.CL 2026-05 unverdicted novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
Why Do Multi-Agent LLM Systems Fail?
cs.AI 2025-03 unverdicted novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 7.0

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
cs.AI 2026-05 unverdicted novelty 7.0

Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
cs.CL 2026-04 unverdicted novelty 7.0

OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
cs.AI 2026-04 unverdicted novelty 7.0

A self-evolving MCP-GUI agent system with automated environment generation and an experience bank achieves up to 77.8% pass rates by matching distillation or experience augmentation to task type across three desktop a...
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis
eess.SY 2026-03 unverdicted novelty 7.0

PowerDAG achieves 94-100% success on unseen distribution grid analysis queries by combining adaptive retrieval with similarity-decay cutoff and just-in-time supervision, outperforming ReAct, LangChain, and CrewAI baselines.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
cs.CR 2024-10 unverdicted novelty 7.0

ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 6.0

Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
Kintsugi: Learning Policies by Repairing Executable Knowledge Bases
cs.LG 2026-05 unverdicted novelty 6.0

Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
cs.AI 2026-04 unverdicted novelty 6.0

SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
SkillDroid: Compile Once, Reuse Forever
cs.HC 2026-04 conditional novelty 6.0

SkillDroid compiles LLM-guided GUI trajectories into parameterized skill templates and replays them via a matching cascade, reaching 85.3% success rate with 49% fewer LLM calls and improving from 87% to 91% over 150 r...
Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation
cs.DC 2026-04 unverdicted novelty 6.0

A Compile-and-Execute system decouples LLM reasoning from browser execution via a one-shot JSON blueprint, reducing inference from O(M x N) to amortized O(1) for repetitive web workflows.
Procedural Knowledge at Scale Improves Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...
Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems
cs.AI 2026-05 unverdicted novelty 5.0

A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
cs.AI 2026-05 conditional novelty 5.0

Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems
cs.MA 2026-03 unverdicted novelty 5.0

LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 4.0

The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective
cs.AI 2026-05 unverdicted novelty 4.0

Reliable AI needs structured Knowledge Objects to externalize and enable human validation of implicit knowledge that current methods cannot verify.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
cs.AI 2025-07 accept novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 22 Pith papers · 2 internal anchors

[1]

Proceedings of the 34th International Conference on Machine Learning , pages=

World of Bits: An Open-Domain Platform for Web-Based Agents , author=. Proceedings of the 34th International Conference on Machine Learning , pages=. 2017 , editor=

work page 2017
[2]

International Conference on Learning Representations , year=

Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration , author=. International Conference on Learning Representations , year=

work page
[3]

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , url=

Yao, Shunyu and Chen, Howard and Yang, John and Narasimhan, Karthik , booktitle=. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , url=

work page
[4]

The Twelfth International Conference on Learning Representations , year=

WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. The Twelfth International Conference on Learning Representations , year=

work page
[5]

Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

Mind2Web: Towards a Generalist Agent for the Web , author=. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

work page
[6]

ICLR 2024 Workshop on Large Language Model (LLM) Agents , year=

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks , author=. ICLR 2024 Workshop on Large Language Model (LLM) Agents , year=

work page 2024
[7]

Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

AndroidInTheWild: A Large-Scale Dataset For Android Device Control , author=. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

work page
[9]

On the effects of data scale on computer control agents.arXiv preprint arXiv:2406.03679, 2024

On the Effects of Data Scale on Computer Control Agents , author=. arXiv preprint arXiv:2406.03679 , year=

work page arXiv
[10]

The Eleventh International Conference on Learning Representations , year=

Language Models Can Teach Themselves to Program Better , author=. The Eleventh International Conference on Learning Representations , year=

work page
[11]

Thirty-seventh Conference on Neural Information Processing Systems , year=

AdaPlanner: Adaptive Planning from Feedback with Language Models , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

work page
[13]

The Twelfth International Conference on Learning Representations , year=

Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control , author=. The Twelfth International Conference on Learning Representations , year=

work page
[20]

First Conference on Language Modeling , year=

What Are Tools Anyway? A Survey from the Language Model Perspective , author=. First Conference on Language Modeling , year=

work page
[21]

2014 , publisher=

The nature of expertise , author=. 2014 , publisher=

work page 2014
[22]

Cognitive science , volume=

Categorization and representation of physics problems by experts and novices , author=. Cognitive science , volume=. 1981 , publisher=

work page 1981
[23]

Transactions on Machine Learning Research , issn=

Voyager: An Open-Ended Embodied Agent with Large Language Models , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

work page 2024
[25]

Philosophical Transactions of the Royal Society A , volume=

DreamCoder: growing generalizable, interpretable knowledge with wake--sleep Bayesian program learning , author=. Philosophical Transactions of the Royal Society A , volume=. 2023 , publisher=

work page 2023
[26]

Proceedings of the 38th International Conference on Machine Learning , pages =

Leveraging Language to Learn Program Abstractions and Search Heuristics , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =

work page 2021
[29]

2023 , journal=

Large Language Models as Tool Makers , author=. 2023 , journal=

work page 2023
[30]

Zhiruo Wang and Graham Neubig and Daniel Fried , booktitle=. Tro. 2024 , url=

work page 2024
[31]

2023 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Code as policies: Language model programs for embodied control , author=. 2023 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2023 , organization=

work page 2023
[32]

Conference on Robot Learning , pages=

Learning reusable manipulation strategies , author=. Conference on Robot Learning , pages=. 2023 , organization=

work page 2023
[33]

International Conference on Machine Learning , pages=

Zero-shot task generalization with multi-task deep reinforcement learning , author=. International Conference on Machine Learning , pages=. 2017 , organization=

work page 2017
[34]

Olausson, Lionel Wong, Gabriel Grand, Joshua B

Matthew Bowers, Theo X. Olausson, Lionel Wong, Gabriel Grand, Joshua B. Tenenbaum, Kevin Ellis, and Armando Solar-Lezama. Top-down synthesis for library learning. Proc. ACM Program. Lang., 7 0 (POPL), jan 2023. doi:10.1145/3571234. URL https://doi.org/10.1145/3571234

work page doi:10.1145/3571234 2023
[35]

Large language models as tool makers

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. arXiv preprint arXiv:2305.17126, 2023. URL https://arxiv.org/pdf/2305.17126

work page arXiv 2023
[36]

Categorization and representation of physics problems by experts and novices

Michelene TH Chi, Paul J Feltovich, and Robert Glaser. Categorization and representation of physics problems by experts and novices. Cognitive science, 5 0 (2): 0 121--152, 1981

work page 1981
[37]

The nature of expertise

Michelene TH Chi, Robert Glaser, and Marshall J Farr. The nature of expertise. Psychology Press, 2014

work page 2014
[38]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=kiYqbO3wqw

work page 2023
[39]

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, L \'e o Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024

work page internal anchor Pith review arXiv 2024
[40]

Dreamcoder: growing generalizable, interpretable knowledge with wake--sleep bayesian program learning

Kevin Ellis, Lionel Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lore Anaya Pozo, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: growing generalizable, interpretable knowledge with wake--sleep bayesian program learning. Philosophical Transactions of the Royal Society A, 381 0 (2251): 0 20220050, 2023

work page 2023
[41]

Autoguide: Automated generation and selection of state-aware guidelines for large language model agents

Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, and Honglak Lee. Autoguide: Automated generation and selection of state-aware guidelines for large language model agents. arXiv preprint arXiv:2403.08978, 2024

work page arXiv 2024
[42]

Lilo: Learning interpretable libraries by compressing and documenting code

Gabriel Grand, Lionel Wong, Matthew Bowers, Theo X Olausson, Muxin Liu, Joshua B Tenenbaum, and Jacob Andreas. Lilo: Learning interpretable libraries by compressing and documenting code. arXiv preprint arXiv:2310.19791, 2023

work page arXiv 2023
[43]

Language models can teach themselves to program better

Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. Language models can teach themselves to program better. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=SaRj2ka1XZ3

work page 2023
[44]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024. URL https://openreview.net/forum?id=RPKxrKTJbj

work page 2024
[45]

Code as policies: Language model programs for embodied control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 9493--9500. IEEE, 2023

work page 2023
[46]

Reinforcement learning on web interfaces using workflow-guided exploration

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryTp3f-0-

work page 2018
[47]

Clin: A continually learning language agent for rapid task adaptation and generalization

Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, and Peter Clark. Clin: A continually learning language agent for rapid task adaptation and generalization. arXiv preprint arXiv:2310.10134, 2023

work page arXiv 2023
[48]

Learning reusable manipulation strategies

Jiayuan Mao, Tom \'a s Lozano-P \'e rez, Joshua B Tenenbaum, and Leslie Pack Kaelbling. Learning reusable manipulation strategies. In Conference on Robot Learning, pp.\ 1467--1483. PMLR, 2023

work page 2023
[49]

Bagel: Bootstrapping agents by guiding exploration with language

Shikhar Murty, Christopher Manning, Peter Shaw, Mandar Joshi, and Kenton Lee. Bagel: Bootstrapping agents by guiding exploration with language. arXiv preprint arXiv:2403.08140, 2024

work page arXiv 2024
[50]

Zero-shot task generalization with multi-task deep reinforcement learning

Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, pp.\ 2661--2670. PMLR, 2017

work page 2017
[51]

Autonomous evaluation and refinement of digital agents

Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents. arXiv preprint arXiv:2404.06474, 2024

work page arXiv 2024
[52]

Androidinthewild: A large-scale dataset for android device control

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy P Lillicrap. Androidinthewild: A large-scale dataset for android device control. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=j4b3l5kOil

work page 2023
[53]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

World of bits: An open-domain platform for web-based agents

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.\ 3135--3144. PMLR, 06--11 Aug 2017. URL https://...

work page 2017
[55]

Heap: Hierarchical policies for web actions using llms

Paloma Sodhi, SRK Branavan, and Ryan McDonald. Heap: Hierarchical policies for web actions using llms. arXiv preprint arXiv:2310.03720, 2023

work page arXiv 2023
[56]

Adaplanner: Adaptive planning from feedback with language models

Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. Adaplanner: Adaptive planning from feedback with language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=rnKgbKmelt

work page 2023
[57]

Voyager: An open-ended embodied agent with large language models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024 a . ISSN 2835-8856. URL https://openreview.net/forum?id=ehfRiF0R3a

work page 2024
[58]

What are tools anyway? a survey from the language model perspective

Zhiruo Wang, Zhoujun Cheng, Hao Zhu, Daniel Fried, and Graham Neubig. What are tools anyway? a survey from the language model perspective. In First Conference on Language Modeling, 2024 b . URL https://openreview.net/forum?id=Xh1B90iBSR

work page 2024
[59]

Tro VE : Inducing verifiable and efficient toolboxes for solving programmatic tasks

Zhiruo Wang, Graham Neubig, and Daniel Fried. Tro VE : Inducing verifiable and efficient toolboxes for solving programmatic tasks. In Forty-first International Conference on Machine Learning, 2024 c . URL https://openreview.net/forum?id=DCNCwaMJjI

work page 2024
[60]

Webshop: Towards scalable real-world web interaction with grounded language agents

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.\ 20744--20757. Curran Associates, Inc., 2022. URL https://proceedings....

work page 2022
[61]

Assistantbench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711, 2024

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711, 2024

work page arXiv 2024
[62]

Language to rewards for robotic skill synthesis

Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, et al. Language to rewards for robotic skill synthesis. arXiv preprint arXiv:2306.08647, 2023

work page arXiv 2023
[63]

Synapse: Trajectory-as-exemplar prompting with memory for computer control

Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Pc8AU1aF5e

work page 2024
[64]

Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=oKn9c6ytLx

work page 2024