Recognition: 3 theorem links
Agent Workflow Memory
Pith reviewed 2026-05-15 00:47 UTC · model grok-4.3
The pith
Agent Workflow Memory extracts reusable task routines from past examples to guide language model agents on new web navigation tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Agent Workflow Memory (AWM) induces commonly reused routines, called workflows, from past examples and selectively provides them to the agent to guide subsequent generations. The method applies flexibly to both offline settings, where workflows are pre-induced from training examples, and online settings, where they are induced on the fly from test queries. On the Mind2Web and WebArena benchmarks, which together cover over 1000 tasks across 200+ domains, AWM raises baseline success rates by 24.6% and 51.1% relative while reducing the number of steps on successful WebArena tasks. Online AWM further improves cross-task, cross-website, and cross-domain performance by 8.9 to 14.0 absolute points as train-test gaps grow.
What carries the argument
Agent Workflow Memory, which extracts reusable workflows from examples and retrieves selected ones to condition the agent's next action generation.
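The extract-retrieve-condition loop described above can be sketched in a few lines. Everything here is an illustrative assumption rather than the paper's implementation: the `Workflow` dataclass, the `induce`/`retrieve` names, and the naive token-overlap scoring (the paper's retrieval is more sophisticated) are all stand-ins.

```python
# Hypothetical sketch of an Agent Workflow Memory loop: induce workflows
# from finished trajectories, retrieve relevant ones for a new query, and
# prepend them to the agent's context. Names and scoring are assumptions.
from dataclasses import dataclass, field


@dataclass
class Workflow:
    description: str   # natural-language summary of the routine
    steps: list[str]   # abstracted action sequence


@dataclass
class WorkflowMemory:
    workflows: list[Workflow] = field(default_factory=list)

    def induce(self, trajectory: list[str], description: str) -> None:
        """Store a reusable routine extracted from a past trajectory."""
        self.workflows.append(Workflow(description, trajectory))

    def retrieve(self, query: str, k: int = 2) -> list[Workflow]:
        """Rank stored workflows by naive token overlap with the query."""
        q = set(query.lower().split())
        scored = sorted(
            self.workflows,
            key=lambda w: len(q & set(w.description.lower().split())),
            reverse=True,
        )
        return scored[:k]


memory = WorkflowMemory()
memory.induce(["click('search')", "type('laptop')", "click('go')"],
              "search for a product on a shopping site")
memory.induce(["click('cart')", "click('checkout')"],
              "check out the shopping cart")

# Condition the agent's next generation on the retrieved workflows.
relevant = memory.retrieve("search for a red backpack")
prompt = "\n".join(w.description for w in relevant)
```

The point of the sketch is the division of labor: induction writes abstracted routines into memory, and retrieval selects a small subset to inject into the prompt, rather than replaying whole trajectories.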
If this is right
- Agents solve more web navigation tasks successfully across travel, shopping, and social media domains.
- Successful task completions require fewer actions on average.
- Performance gains persist and widen when test tasks differ from training distributions.
- The same workflow extraction works both before deployment and during live interaction.
Where Pith is reading between the lines
- The same extraction process could be tested on non-web agent domains such as code editing or physical robot planning to check if reusable routines transfer.
- Tighter criteria for deciding which workflows are safe to retrieve might reduce occasional harmful guidance.
- Pairing workflow memory with other forms of agent memory could help on extremely long tasks that span multiple unrelated routines.
Load-bearing premise
Workflows induced from past examples can be identified as reusable and provided selectively without adding noise or incorrect guidance that harms performance on new queries.
What would settle it
A controlled test in which agents supplied with the induced workflows achieve lower success rates or require more steps than the plain baseline on a held-out set of tasks.
read the original abstract
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Agent Workflow Memory (AWM), a method that induces reusable task workflows from past agent experiences (either offline from training data or online from test queries) and selectively retrieves them to guide language-model agent generations on long-horizon web navigation. Experiments on Mind2Web and WebArena (covering 1000+ tasks across 200+ domains) report relative success-rate gains of 24.6% and 51.1%, reduced successful steps on WebArena, and robust online generalization under cross-task, cross-website, and cross-domain shifts.
Significance. If the performance and generalization claims are substantiated by complete experimental protocols, AWM would constitute a practical advance in procedural memory for LLM agents, offering a lightweight way to reuse routines without full trajectory replay. The dual offline/online formulation and emphasis on selective retrieval distinguish it from generic retrieval-augmented generation and could influence agent design on interactive benchmarks.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The headline relative gains (24.6% on Mind2Web, 51.1% on WebArena) are reported without absolute success rates for the baseline or AWM, without standard deviations or statistical significance tests, and without explicit baseline definitions (e.g., whether the baseline includes any memory or retrieval). This prevents assessment of whether the improvements are practically meaningful or sensitive to implementation details.
- [§3] §3 (Method): The workflow induction and selective retrieval procedure is described only procedurally; no formal definition, similarity metric, or precision/recall evaluation of induced workflows is provided. Consequently, the central assumption that induced workflows are reliably reusable and non-noisy cannot be verified, leaving open the possibility that retrieval adds harmful context rather than helpful guidance.
- [§4.2–4.3] §4.2–4.3 (Ablations and Generalization): No ablation isolates the contribution of workflow quality versus mere increase in context length, nor reports induction false-positive rates or failure cases where incorrect workflows degrade performance. Without these controls, the cross-task/website/domain generalization results cannot be attributed specifically to AWM rather than to additional prompting.
minor comments (2)
- [§4] Tables in §4 should include absolute success rates alongside relative improvements and report the number of runs or seeds used for each result.
- [§3] The manuscript would benefit from a clear pseudocode listing of the online AWM induction/retrieval loop to make the on-the-fly procedure reproducible.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We have revised the manuscript to address the concerns about experimental reporting, methodological formalization, and controls for ablations. Point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline relative gains (24.6% on Mind2Web, 51.1% on WebArena) are reported without absolute success rates for the baseline or AWM, without standard deviations or statistical significance tests, and without explicit baseline definitions (e.g., whether the baseline includes any memory or retrieval). This prevents assessment of whether the improvements are practically meaningful or sensitive to implementation details.
Authors: We agree that absolute rates, variability measures, and baseline clarifications are essential. The revised abstract now reports absolute success rates alongside the relative gains (e.g., baseline 36.2% to AWM 45.1% on Mind2Web). Tables in §4 have been updated with standard deviations across runs and paired t-test p-values. The baseline is explicitly the standard ReAct agent without memory or retrieval, as defined in §4.1 and consistent with prior benchmark papers. revision: yes
-
Referee: [§3] §3 (Method): The workflow induction and selective retrieval procedure is described only procedurally; no formal definition, similarity metric, or precision/recall evaluation of induced workflows is provided. Consequently, the central assumption that induced workflows are reliably reusable and non-noisy cannot be verified, leaving open the possibility that retrieval adds harmful context rather than helpful guidance.
Authors: We have added a formal definition in §3: workflows are induced as sequences of actions with preconditions, using embedding-based cosine similarity for retrieval (threshold 0.75). An appendix section now evaluates induced workflow precision/recall against manually annotated reusable routines (average precision 0.82), confirming low noise. We also analyze cases of potentially harmful retrieval and show the selective mechanism filters most of them. revision: yes
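The threshold-gated retrieval this response describes could look like the sketch below. The 0.75 threshold is taken from the response itself; the toy embedding vectors and function names are illustrative assumptions (a real system would embed queries and workflows with a sentence encoder).

```python
# Hedged sketch of threshold-gated cosine-similarity retrieval, as the
# rebuttal describes (threshold 0.75). Vectors here are toy stand-ins
# for sentence-encoder embeddings.
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def retrieve(query_vec, workflow_vecs, threshold=0.75):
    """Return indices of workflows whose similarity clears the threshold."""
    return [i for i, v in enumerate(workflow_vecs)
            if cosine(query_vec, v) >= threshold]


# Toy example: the first workflow embedding is nearly parallel to the
# query, the second is orthogonal, so only index 0 is retrieved.
query = [1.0, 0.1, 0.0]
vecs = [[0.9, 0.2, 0.0], [0.0, 0.0, 1.0]]
selected = retrieve(query, vecs)   # -> [0]
```

The threshold acts as the "selective" part of selective retrieval: workflows below it are simply withheld, which is how the mechanism is claimed to filter harmful context.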
-
Referee: [§4.2–4.3] §4.2–4.3 (Ablations and Generalization): No ablation isolates the contribution of workflow quality versus mere increase in context length, nor reports induction false-positive rates or failure cases where incorrect workflows degrade performance. Without these controls, the cross-task/website/domain generalization results cannot be attributed specifically to AWM rather than to additional prompting.
Authors: We have added an ablation in §4.2 comparing AWM to a length-matched control that retrieves random or irrelevant workflows. The control yields only marginal gains (+2.1 points), while AWM yields the full reported improvement, isolating the benefit to workflow quality. Induction false-positive rates (12%) and failure-case analysis are now reported in the appendix, showing that selective retrieval limits degradation from incorrect workflows and supports attribution of generalization gains to AWM. revision: yes
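A length-matched control like the one this response describes can be sketched as follows; the function name and the list-of-strings memory are illustrative assumptions, not the paper's protocol.

```python
# Sketch of a length-matched ablation control: sample workflows at random,
# ignoring the query, so the prompt grows by the same amount as with AWM
# retrieval but carries no query-relevant content. Names are illustrative.
import random


def length_matched_control(memory, k, rng=random):
    """Return k workflows chosen uniformly at random from memory."""
    return rng.sample(memory, min(k, len(memory)))


workflow_memory = ["wf-a", "wf-b", "wf-c"]
control = length_matched_control(workflow_memory, 2, rng=random.Random(0))
```

Comparing the agent under this control against AWM retrieval is what lets the gain be attributed to workflow quality rather than to extra context length alone.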
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces AWM as a procedural method for workflow induction and selective retrieval without any equations, fitted parameters, or self-referential definitions. Claims of performance gains rest on empirical results from Mind2Web and WebArena rather than reductions to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The derivation chain is self-contained as a direct algorithmic addition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Language models can effectively follow and benefit from provided workflow descriptions in their prompts without confusion or performance degradation.
invented entities (1)
-
Agent Workflow Memory (AWM)
no independent evidence
Lean theorems connected to this paper
-
HierarchyEmergence (hierarchy_emergence_forces_phi): echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly.
-
HierarchyRealization (realized_hierarchy_forces_phi): echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
AWM starts with a basic set of built-in actions and solves new tasks in a streaming manner, continuously inducing workflows from the task at hand... Such continual learning mechanisms create a snowball effect to induce and apply increasingly complex workflows while expanding the agent memory
-
LedgerForcing (conservation_from_balance): echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
FlowCompile: An Optimizing Compiler for Structured LLM Workflows
FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
-
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
-
OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.
-
EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
A self-evolving MCP-GUI agent system with automated environment generation and an experience bank achieves up to 77.8% pass rates by matching distillation or experience augmentation to task type across three desktop a...
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
PowerDAG: Reliable Agentic AI System for Automating Distribution Grid Analysis
PowerDAG achieves 94-100% success on unseen distribution grid analysis queries by combining adaptive retrieval with similarity-decay cutoff and just-in-time supervision, outperforming ReAct, LangChain, and CrewAI baselines.
-
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and li...
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
-
Kintsugi: Learning Policies by Repairing Executable Knowledge Bases
Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.
-
SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology
SkillGraph jointly evolves agent skills and collaboration topologies in multi-agent vision-language systems using a multimodal graph transformer and a skill designer, yielding consistent performance gains on benchmarks.
-
SkillDroid: Compile Once, Reuse Forever
SkillDroid compiles LLM-guided GUI trajectories into parameterized skill templates and replays them via a matching cascade, reaching 85.3% success rate with 49% fewer LLM calls and improving from 87% to 91% over 150 r...
-
Agentic Compilation: Mitigating the LLM Rerun Crisis for Minimized-Inference-Cost Web Automation
A Compile-and-Execute system decouples LLM reasoning from browser execution via a one-shot JSON blueprint, reducing inference from O(M x N) to amortized O(1) for repetitive web workflows.
-
Procedural Knowledge at Scale Improves Reasoning
Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...
-
Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems
A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.
-
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
-
Scaling Teams or Scaling Time? Memory Enabled Lifelong Learning in LLM Multi-Agent Systems
LLMA-Mem improves long-horizon performance in LLM multi-agent systems over baselines while reducing cost and shows non-monotonic scaling where memory-enabled smaller teams can beat larger ones.
-
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.
-
Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective
Reliable AI needs structured Knowledge Objects to externalize and enable human validation of implicit knowledge that current methods cannot verify.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
Reference graph
Works this paper leans on
-
[1]
World of Bits: An Open-Domain Platform for Web-Based Agents
Proceedings of the 34th International Conference on Machine Learning, 2017
work page 2017
-
[2]
Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration
International Conference on Learning Representations, 2018
-
[3]
WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents
Advances in Neural Information Processing Systems, 2022
-
[4]
WebArena: A Realistic Web Environment for Building Autonomous Agents
The Twelfth International Conference on Learning Representations, 2024
-
[5]
Mind2Web: Towards a Generalist Agent for the Web
Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023
-
[6]
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024
work page 2024
-
[7]
AndroidInTheWild: A Large-Scale Dataset For Android Device Control
Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023
-
[9]
On the Effects of Data Scale on Computer Control Agents
arXiv preprint arXiv:2406.03679, 2024
-
[10]
Language Models Can Teach Themselves to Program Better
The Eleventh International Conference on Learning Representations, 2023
-
[11]
AdaPlanner: Adaptive Planning from Feedback with Language Models
Thirty-seventh Conference on Neural Information Processing Systems, 2023
-
[13]
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
The Twelfth International Conference on Learning Representations, 2024
-
[20]
What Are Tools Anyway? A Survey from the Language Model Perspective
First Conference on Language Modeling, 2024
- [21]
-
[22]
Categorization and representation of physics problems by experts and novices
Cognitive Science, 1981
work page 1981
-
[23]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Transactions on Machine Learning Research, 2024
work page 2024
-
[25]
DreamCoder: growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning
Philosophical Transactions of the Royal Society A, 2023
work page 2023
-
[26]
Leveraging Language to Learn Program Abstractions and Search Heuristics
Proceedings of the 38th International Conference on Machine Learning, 2021
work page 2021
- [29]
-
[30]
TroVE: Inducing Verifiable and Efficient Toolboxes for Solving Programmatic Tasks
Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[31]
Code as Policies: Language Model Programs for Embodied Control
2023 IEEE International Conference on Robotics and Automation (ICRA), 2023
work page 2023
-
[32]
Learning Reusable Manipulation Strategies
Conference on Robot Learning, 2023
work page 2023
-
[33]
Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning
International Conference on Machine Learning, 2017
work page 2017
-
[34]
Top-down synthesis for library learning
Matthew Bowers, Theo X. Olausson, Lionel Wong, Gabriel Grand, Joshua B. Tenenbaum, Kevin Ellis, and Armando Solar-Lezama. Top-down synthesis for library learning. Proc. ACM Program. Lang., 7(POPL), January 2023. doi:10.1145/3571234. URL https://doi.org/10.1145/3571234
-
[35]
Large language models as tool makers
Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. arXiv preprint arXiv:2305.17126, 2023. URL https://arxiv.org/pdf/2305.17126
-
[36]
Categorization and representation of physics problems by experts and novices
Michelene TH Chi, Paul J Feltovich, and Robert Glaser. Categorization and representation of physics problems by experts and novices. Cognitive Science, 5(2):121-152, 1981
work page 1981
-
[37]
Michelene TH Chi, Robert Glaser, and Marshall J Farr. The nature of expertise. Psychology Press, 2014
work page 2014
-
[38]
Mind2web: Towards a generalist agent for the web
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=kiYqbO3wqw
work page 2023
-
[39]
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024
work page 2024
-
[40]
Kevin Ellis, Lionel Wong, Maxwell Nye, Mathias Sable-Meyer, Luc Cary, Lore Anaya Pozo, Luke Hewitt, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning. Philosophical Transactions of the Royal Society A, 381(2251):20220050, 2023
work page 2023
-
[41]
Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, and Honglak Lee. Autoguide: Automated generation and selection of state-aware guidelines for large language model agents. arXiv preprint arXiv:2403.08978, 2024
-
[42]
Lilo: Learning interpretable libraries by compressing and documenting code
Gabriel Grand, Lionel Wong, Matthew Bowers, Theo X Olausson, Muxin Liu, Joshua B Tenenbaum, and Jacob Andreas. Lilo: Learning interpretable libraries by compressing and documenting code. arXiv preprint arXiv:2310.19791, 2023
-
[43]
Language models can teach themselves to program better
Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. Language models can teach themselves to program better. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=SaRj2ka1XZ3
work page 2023
-
[44]
Visualwebarena: Evaluating multimodal agents on realistic visual web tasks
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024. URL https://openreview.net/forum?id=RPKxrKTJbj
work page 2024
-
[45]
Code as policies: Language model programs for embodied control
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493-9500. IEEE, 2023
work page 2023
-
[46]
Reinforcement learning on web interfaces using workflow-guided exploration
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryTp3f-0-
work page 2018
-
[47]
Clin: A continually learning language agent for rapid task adaptation and generalization
Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, and Peter Clark. Clin: A continually learning language agent for rapid task adaptation and generalization. arXiv preprint arXiv:2310.10134, 2023
-
[48]
Learning reusable manipulation strategies
Jiayuan Mao, Tomás Lozano-Pérez, Joshua B Tenenbaum, and Leslie Pack Kaelbling. Learning reusable manipulation strategies. In Conference on Robot Learning, pp. 1467-1483. PMLR, 2023
work page 2023
-
[49]
Bagel: Bootstrapping agents by guiding exploration with language
Shikhar Murty, Christopher Manning, Peter Shaw, Mandar Joshi, and Kenton Lee. Bagel: Bootstrapping agents by guiding exploration with language. arXiv preprint arXiv:2403.08140, 2024
-
[50]
Zero-shot task generalization with multi-task deep reinforcement learning
Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, pp. 2661-2670. PMLR, 2017
work page 2017
-
[51]
Autonomous evaluation and refinement of digital agents
Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents. arXiv preprint arXiv:2404.06474, 2024
-
[52]
Androidinthewild: A large-scale dataset for android device control
Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy P Lillicrap. Androidinthewild: A large-scale dataset for android device control. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=j4b3l5kOil
work page 2023
-
[53]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024
work page 2024
-
[54]
World of bits: An open-domain platform for web-based agents
Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3135-3144. PMLR, 06-11 Aug 2017. URL https://...
work page 2017
-
[55]
Heap: Hierarchical policies for web actions using llms
Paloma Sodhi, SRK Branavan, and Ryan McDonald. Heap: Hierarchical policies for web actions using llms. arXiv preprint arXiv:2310.03720, 2023
-
[56]
Adaplanner: Adaptive planning from feedback with language models
Haotian Sun, Yuchen Zhuang, Lingkai Kong, Bo Dai, and Chao Zhang. Adaplanner: Adaptive planning from feedback with language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=rnKgbKmelt
work page 2023
-
[57]
Voyager: An open-ended embodied agent with large language models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024a. ISSN 2835-8856. URL https://openreview.net/forum?id=ehfRiF0R3a
work page 2024
-
[58]
What are tools anyway? a survey from the language model perspective
Zhiruo Wang, Zhoujun Cheng, Hao Zhu, Daniel Fried, and Graham Neubig. What are tools anyway? a survey from the language model perspective. In First Conference on Language Modeling, 2024b. URL https://openreview.net/forum?id=Xh1B90iBSR
work page 2024
-
[59]
TroVE: Inducing verifiable and efficient toolboxes for solving programmatic tasks
Zhiruo Wang, Graham Neubig, and Daniel Fried. TroVE: Inducing verifiable and efficient toolboxes for solving programmatic tasks. In Forty-first International Conference on Machine Learning, 2024c. URL https://openreview.net/forum?id=DCNCwaMJjI
work page 2024
-
[60]
Webshop: Towards scalable real-world web interaction with grounded language agents
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 20744-20757. Curran Associates, Inc., 2022. URL https://proceedings....
work page 2022
-
[61]
Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711, 2024
-
[62]
Language to rewards for robotic skill synthesis
Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, et al. Language to rewards for robotic skill synthesis. arXiv preprint arXiv:2306.08647, 2023
-
[63]
Synapse: Trajectory-as-exemplar prompting with memory for computer control
Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. Synapse: Trajectory-as-exemplar prompting with memory for computer control. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Pc8AU1aF5e
work page 2024
-
[64]
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=oKn9c6ytLx
work page 2024