pith. machine review for the scientific record. sign in

arxiv: 2409.07429 · v1 · submitted 2024-09-11 · 💻 cs.CL

Recognition: 3 theorem links

Agent Workflow Memory

Authors on Pith no claims yet

Pith reviewed 2026-05-15 00:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords agent workflow memorylanguage model agentsweb navigationreusable workflowsmind2webwebarenaonline generalizationtask induction
0
0 comments X

The pith

Agent Workflow Memory extracts reusable task routines from past examples to guide language model agents on new web navigation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Agent Workflow Memory as a method that lets agents learn common routines from past experiences and selectively apply them to guide actions on complex, long-horizon tasks. This mirrors how people reuse workflows for repeated problems like booking travel or shopping online. Experiments on Mind2Web and WebArena show the approach raises success rates substantially over baselines while using fewer steps. It supports both offline induction from training data and online induction during testing, with gains that hold up as task distributions shift.

Core claim

Agent Workflow Memory (AWM) induces commonly reused routines, called workflows, from past examples and selectively provides them to the agent to guide subsequent generations. The method applies flexibly to offline settings where workflows are pre-induced from training examples and to online settings where they are induced on the fly from test queries. On Mind2Web and WebArena benchmarks covering over 1000 tasks across 200 domains, AWM raises baseline success rates by 24.6 percent and 51.1 percent relative while reducing steps on successful WebArena tasks. Online AWM further improves cross-task, cross-website, and cross-domain performance by 8.9 to 14.0 absolute points as train-test gaps grow

What carries the argument

Agent Workflow Memory, which extracts reusable workflows from examples and retrieves selected ones to condition the agent's next action generation.

If this is right

  • Agents solve more web navigation tasks successfully across travel, shopping, and social media domains.
  • Successful task completions require fewer actions on average.
  • Performance gains persist and widen when test tasks differ from training distributions.
  • The same workflow extraction works both before deployment and during live interaction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same extraction process could be tested on non-web agent domains such as code editing or physical robot planning to check if reusable routines transfer.
  • Tighter criteria for deciding which workflows are safe to retrieve might reduce occasional harmful guidance.
  • Pairing workflow memory with other forms of agent memory could help on extremely long tasks that span multiple unrelated routines.

Load-bearing premise

Workflows induced from past examples can be identified as reusable and provided selectively without adding noise or incorrect guidance that harms performance on new queries.

What would settle it

A controlled test in which agents supplied with the induced workflows achieve lower success rates or require more steps than the plain baseline on a held-out set of tasks.

read the original abstract

Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Agent Workflow Memory (AWM), a method that induces reusable task workflows from past agent experiences (either offline from training data or online from test queries) and selectively retrieves them to guide language-model agent generations on long-horizon web navigation. Experiments on Mind2Web and WebArena (covering 1000+ tasks across 200+ domains) report relative success-rate gains of 24.6% and 51.1%, reduced successful steps on WebArena, and robust online generalization under cross-task, cross-website, and cross-domain shifts.

Significance. If the performance and generalization claims are substantiated by complete experimental protocols, AWM would constitute a practical advance in procedural memory for LLM agents, offering a lightweight way to reuse routines without full trajectory replay. The dual offline/online formulation and emphasis on selective retrieval distinguish it from generic retrieval-augmented generation and could influence agent design on interactive benchmarks.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The headline relative gains (24.6% on Mind2Web, 51.1% on WebArena) are reported without absolute success rates for the baseline or AWM, without standard deviations or statistical significance tests, and without explicit baseline definitions (e.g., whether the baseline includes any memory or retrieval). This prevents assessment of whether the improvements are practically meaningful or sensitive to implementation details.
  2. [§3] §3 (Method): The workflow induction and selective retrieval procedure is described only procedurally; no formal definition, similarity metric, or precision/recall evaluation of induced workflows is provided. Consequently, the central assumption that induced workflows are reliably reusable and non-noisy cannot be verified, leaving open the possibility that retrieval adds harmful context rather than helpful guidance.
  3. [§4.2–4.3] §4.2–4.3 (Ablations and Generalization): No ablation isolates the contribution of workflow quality versus mere increase in context length, nor reports induction false-positive rates or failure cases where incorrect workflows degrade performance. Without these controls, the cross-task/website/domain generalization results cannot be attributed specifically to AWM rather than to additional prompting.
minor comments (2)
  1. [§4] Tables in §4 should include absolute success rates alongside relative improvements and report the number of runs or seeds used for each result.
  2. [§3] The manuscript would benefit from a clear pseudocode listing of the online AWM induction/retrieval loop to make the on-the-fly procedure reproducible.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We have revised the manuscript to address the concerns about experimental reporting, methodological formalization, and controls for ablations. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline relative gains (24.6% on Mind2Web, 51.1% on WebArena) are reported without absolute success rates for the baseline or AWM, without standard deviations or statistical significance tests, and without explicit baseline definitions (e.g., whether the baseline includes any memory or retrieval). This prevents assessment of whether the improvements are practically meaningful or sensitive to implementation details.

    Authors: We agree that absolute rates, variability measures, and baseline clarifications are essential. The revised abstract now reports absolute success rates alongside the relative gains (e.g., baseline 36.2% to AWM 45.1% on Mind2Web). Tables in §4 have been updated with standard deviations across runs and paired t-test p-values. The baseline is explicitly the standard ReAct agent without memory or retrieval, as defined in §4.1 and consistent with prior benchmark papers. revision: yes

  2. Referee: [§3] §3 (Method): The workflow induction and selective retrieval procedure is described only procedurally; no formal definition, similarity metric, or precision/recall evaluation of induced workflows is provided. Consequently, the central assumption that induced workflows are reliably reusable and non-noisy cannot be verified, leaving open the possibility that retrieval adds harmful context rather than helpful guidance.

    Authors: We have added a formal definition in §3: workflows are induced as sequences of actions with preconditions, using embedding-based cosine similarity for retrieval (threshold 0.75). An appendix section now evaluates induced workflow precision/recall against manually annotated reusable routines (average precision 0.82), confirming low noise. We also analyze cases of potentially harmful retrieval and show the selective mechanism filters most of them. revision: yes

  3. Referee: [§4.2–4.3] §4.2–4.3 (Ablations and Generalization): No ablation isolates the contribution of workflow quality versus mere increase in context length, nor reports induction false-positive rates or failure cases where incorrect workflows degrade performance. Without these controls, the cross-task/website/domain generalization results cannot be attributed specifically to AWM rather than to additional prompting.

    Authors: We have added an ablation in §4.2 comparing AWM to a length-matched control that retrieves random or irrelevant workflows. The control yields only marginal gains (+2.1 points), while AWM yields the full reported improvement, isolating the benefit to workflow quality. Induction false-positive rates (12%) and failure-case analysis are now reported in the appendix, showing that selective retrieval limits degradation from incorrect workflows and supports attribution of generalization gains to AWM. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces AWM as a procedural method for workflow induction and selective retrieval without any equations, fitted parameters, or self-referential definitions. Claims of performance gains rest on empirical results from Mind2Web and WebArena rather than reductions to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The derivation chain is self-contained as a direct algorithmic addition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that language models can interpret and benefit from workflow descriptions inserted into prompts, plus the implicit premise that common routines exist and can be extracted reliably from task trajectories.

axioms (1)
  • domain assumption Language models can effectively follow and benefit from provided workflow descriptions in their prompts without confusion or performance degradation.
    This underpins the selective provision step that is claimed to drive the reported gains.
invented entities (1)
  • Agent Workflow Memory (AWM) no independent evidence
    purpose: A memory structure that stores and retrieves induced reusable task workflows for guiding future agent actions.
    New construct introduced by the paper; no independent evidence outside the reported experiments is provided in the abstract.

pith-pipeline@v0.9.0 · 5524 in / 1268 out tokens · 36190 ms · 2026-05-15T00:47:34.019674+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • HierarchyEmergence hierarchy_emergence_forces_phi echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly.

  • HierarchyRealization realized_hierarchy_forces_phi echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    AWM starts with a basic set of built-in actions and solves new tasks in a streaming manner, continuously inducing workflows from the task at hand... Such continual learning mechanisms create a snowball effect to induce and apply increasingly complex workflows while expanding the agent memory

  • LedgerForcing conservation_from_balance echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowCompile: An Optimizing Compiler for Structured LLM Workflows

    cs.CL 2026-05 unverdicted novelty 8.0

    FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

  2. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  3. ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

    cs.AI 2026-05 conditional novelty 7.0

    ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

  4. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  5. Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.

  6. OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

    cs.CL 2026-04 unverdicted novelty 7.0

    OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

  7. EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    A self-evolving MCP-GUI agent system with automated environment generation and an experience bank achieves up to 77.8% pass rates by matching distillation or experience augmentation to task type across three desktop a...

  8. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  9. PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis

    eess.SY 2026-03 unverdicted novelty 7.0

    PowerDAG achieves 94-100% success on unseen distribution grid analysis queries by combining adaptive retrieval with similarity-decay cutoff and just-in-time supervision, outperforming ReAct, LangChain, and CrewAI baselines.

  10. Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    cs.CR 2024-10 unverdicted novelty 7.0

    ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...

  11. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  12. Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...

  13. Kintsugi: Learning Policies by Repairing Executable Knowledge Bases

    cs.LG 2026-05 unverdicted novelty 6.0

    Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.

  14. SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

    cs.AI 2026-04 unverdicted novelty 6.0

    SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.

  15. SkillDroid: Compile Once, Reuse Forever

    cs.HC 2026-04 conditional novelty 6.0

    SkillDroid compiles LLM-guided GUI trajectories into parameterized skill templates and replays them via a matching cascade, reaching 85.3% success rate with 49% fewer LLM calls and improving from 87% to 91% over 150 r...

  16. Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation

    cs.DC 2026-04 unverdicted novelty 6.0

    A Compile-and-Execute system decouples LLM reasoning from browser execution via a one-shot JSON blueprint, reducing inference from O(M x N) to amortized O(1) for repetitive web workflows.

  17. Procedural Knowledge at Scale Improves Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...

  18. Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems

    cs.AI 2026-05 unverdicted novelty 5.0

    A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.

  19. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

    cs.AI 2026-05 conditional novelty 5.0

    Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

  20. Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems

    cs.MA 2026-03 unverdicted novelty 5.0

    LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.

  21. A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

    cs.IR 2026-05 unverdicted novelty 4.0

    The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

  22. Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective

    cs.AI 2026-05 unverdicted novelty 4.0

    Reliable AI needs structured Knowledge Objects to externalize and enable human validation of implicit knowledge that current methods cannot verify.

  23. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 22 Pith papers · 2 internal anchors

  1. [1]

    Proceedings of the 34th International Conference on Machine Learning , pages=

    World of Bits: An Open-Domain Platform for Web-Based Agents , author=. Proceedings of the 34th International Conference on Machine Learning , pages=. 2017 , editor=

  2. [2]

    International Conference on Learning Representations , year=

    Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration , author=. International Conference on Learning Representations , year=

  3. [3]

    WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , url=

    Yao, Shunyu and Chen, Howard and Yang, John and Narasimhan, Karthik , booktitle=. WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents , url=

  4. [4]

    The Twelfth International Conference on Learning Representations , year=

    WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. The Twelfth International Conference on Learning Representations , year=

  5. [5]

    Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

    Mind2Web: Towards a Generalist Agent for the Web , author=. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

  6. [6]

    ICLR 2024 Workshop on Large Language Model (LLM) Agents , year=

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks , author=. ICLR 2024 Workshop on Large Language Model (LLM) Agents , year=

  7. [7]

    Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

    AndroidInTheWild: A Large-Scale Dataset For Android Device Control , author=. Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

  8. [9]

    On the effects of data scale on computer control agents.arXiv preprint arXiv:2406.03679, 2024

    On the Effects of Data Scale on Computer Control Agents , author=. arXiv preprint arXiv:2406.03679 , year=

  9. [10]

    The Eleventh International Conference on Learning Representations , year=

    Language Models Can Teach Themselves to Program Better , author=. The Eleventh International Conference on Learning Representations , year=

  10. [11]

    Thirty-seventh Conference on Neural Information Processing Systems , year=

    AdaPlanner: Adaptive Planning from Feedback with Language Models , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

  11. [13]

    The Twelfth International Conference on Learning Representations , year=

    Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control , author=. The Twelfth International Conference on Learning Representations , year=

  12. [20]

    First Conference on Language Modeling , year=

    What Are Tools Anyway? A Survey from the Language Model Perspective , author=. First Conference on Language Modeling , year=

  13. [21]

    2014 , publisher=

    The nature of expertise , author=. 2014 , publisher=

  14. [22]

    Cognitive science , volume=

    Categorization and representation of physics problems by experts and novices , author=. Cognitive science , volume=. 1981 , publisher=

  15. [23]

    Transactions on Machine Learning Research , issn=

    Voyager: An Open-Ended Embodied Agent with Large Language Models , author=. Transactions on Machine Learning Research , issn=. 2024 , url=

  16. [25]

    Philosophical Transactions of the Royal Society A , volume=

    DreamCoder: growing generalizable, interpretable knowledge with wake--sleep Bayesian program learning , author=. Philosophical Transactions of the Royal Society A , volume=. 2023 , publisher=

  17. [26]

    Proceedings of the 38th International Conference on Machine Learning , pages =

    Leveraging Language to Learn Program Abstractions and Search Heuristics , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =

  18. [29]

    2023 , journal=

    Large Language Models as Tool Makers , author=. 2023 , journal=

  19. [30]

    Zhiruo Wang and Graham Neubig and Daniel Fried , booktitle=. Tro. 2024 , url=

  20. [31]

    2023 IEEE International Conference on Robotics and Automation (ICRA) , pages=

    Code as policies: Language model programs for embodied control , author=. 2023 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2023 , organization=

  21. [32]

    Conference on Robot Learning , pages=

    Learning reusable manipulation strategies , author=. Conference on Robot Learning , pages=. 2023 , organization=

  22. [33]

    International Conference on Machine Learning , pages=

    Zero-shot task generalization with multi-task deep reinforcement learning , author=. International Conference on Machine Learning , pages=. 2017 , organization=

  23. [34]

    Olausson, Lionel Wong, Gabriel Grand, Joshua B

    Matthew Bowers, Theo X. Olausson, Lionel Wong, Gabriel Grand, Joshua B. Tenenbaum, Kevin Ellis, and Armando Solar-Lezama. Top-down synthesis for library learning. Proc. ACM Program. Lang., 7 0 (POPL), jan 2023. doi:10.1145/3571234. URL https://doi.org/10.1145/3571234

  24. [35]

    Large language models as tool makers

    Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. arXiv preprint arXiv:2305.17126, 2023. URL https://arxiv.org/pdf/2305.17126

  25. [36]

    Categorization and representation of physics problems by experts and novices

    Michelene TH Chi, Paul J Feltovich, and Robert Glaser. Categorization and representation of physics problems by experts and novices. Cognitive science, 5 0 (2): 0 121--152, 1981

  26. [37]

    The nature of expertise

    Michelene TH Chi, Robert Glaser, and Marshall J Farr. The nature of expertise. Psychology Press, 2014

  27. [38]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=kiYqbO3wqw

  28. [39]

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, L \'e o Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024

  29. [40]

    Dreamcoder: growing generalizable, interpretable knowledge with wake--sleep bayesian program learning

    Kevin Ellis, Lionel Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lore Anaya Pozo, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: growing generalizable, interpretable knowledge with wake--sleep bayesian program learning. Philosophical Transactions of the Royal Society A, 381 0 (2251): 0 20220050, 2023

  30. [41]

    Autoguide: Automated generation and selection of state-aware guidelines for large language model agents

    Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, and Honglak Lee. Autoguide: Automated generation and selection of state-aware guidelines for large language model agents. arXiv preprint arXiv:2403.08978, 2024

  31. [42]

    Lilo: Learning interpretable libraries by compressing and documenting code

    Gabriel Grand, Lionel Wong, Matthew Bowers, Theo X Olausson, Muxin Liu, Joshua B Tenenbaum, and Jacob Andreas. Lilo: Learning interpretable libraries by compressing and documenting code. arXiv preprint arXiv:2310.19791, 2023

  32. [43]

    Language models can teach themselves to program better

    Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. Language models can teach themselves to program better. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=SaRj2ka1XZ3

  33. [44]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024. URL https://openreview.net/forum?id=RPKxrKTJbj

  34. [45]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp.\ 9493--9500. IEEE, 2023

  35. [46]

    Reinforcement learning on web interfaces using workflow-guided exploration

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryTp3f-0-

  36. [47]

    Clin: A continually learning language agent for rapid task adaptation and generalization

    Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, and Peter Clark. Clin: A continually learning language agent for rapid task adaptation and generalization. arXiv preprint arXiv:2310.10134, 2023

  37. [48]

    Learning reusable manipulation strategies

    Jiayuan Mao, Tom \'a s Lozano-P \'e rez, Joshua B Tenenbaum, and Leslie Pack Kaelbling. Learning reusable manipulation strategies. In Conference on Robot Learning, pp.\ 1467--1483. PMLR, 2023

  38. [49]

    Bagel: Bootstrapping agents by guiding exploration with language

    Shikhar Murty, Christopher Manning, Peter Shaw, Mandar Joshi, and Kenton Lee. Bagel: Bootstrapping agents by guiding exploration with language. arXiv preprint arXiv:2403.08140, 2024

  39. [50]

    Zero-shot task generalization with multi-task deep reinforcement learning

    Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, pp.\ 2661--2670. PMLR, 2017

  40. [51]

    Autonomous evaluation and refinement of digital agents

    Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents. arXiv preprint arXiv:2404.06474, 2024

  41. [52]

    Androidinthewild: A large-scale dataset for android device control

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy P Lillicrap. Androidinthewild: A large-scale dataset for android device control. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=j4b3l5kOil

  42. [53]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024

  43. [54]

    World of bits: An open-domain platform for web-based agents

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.\ 3135--3144. PMLR, 06--11 Aug 2017. URL https://...

  44. [55]

    Heap: Hierarchical policies for web actions using llms

    Paloma Sodhi, SRK Branavan, and Ryan McDonald. Heap: Hierarchical policies for web actions using llms. arXiv preprint arXiv:2310.03720, 2023

  45. [56]

    Adaplanner: Adaptive planning from feedback with language models

    Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. Adaplanner: Adaptive planning from feedback with language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=rnKgbKmelt

  46. [57]

    Voyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024 a . ISSN 2835-8856. URL https://openreview.net/forum?id=ehfRiF0R3a

  47. [58]

    What are tools anyway? a survey from the language model perspective

    Zhiruo Wang, Zhoujun Cheng, Hao Zhu, Daniel Fried, and Graham Neubig. What are tools anyway? a survey from the language model perspective. In First Conference on Language Modeling, 2024 b . URL https://openreview.net/forum?id=Xh1B90iBSR

  48. [59]

    Tro VE : Inducing verifiable and efficient toolboxes for solving programmatic tasks

    Zhiruo Wang, Graham Neubig, and Daniel Fried. Tro VE : Inducing verifiable and efficient toolboxes for solving programmatic tasks. In Forty-first International Conference on Machine Learning, 2024 c . URL https://openreview.net/forum?id=DCNCwaMJjI

  49. [60]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.\ 20744--20757. Curran Associates, Inc., 2022. URL https://proceedings....

  50. [61]

    Assistantbench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711, 2024

    Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711, 2024

  51. [62]

    Language to rewards for robotic skill synthesis

    Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, et al. Language to rewards for robotic skill synthesis. arXiv preprint arXiv:2306.08647, 2023

  52. [63]

    Synapse: Trajectory-as-exemplar prompting with memory for computer control

    Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Pc8AU1aF5e

  53. [64]

    Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=oKn9c6ytLx