A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
Pith reviewed 2026-05-17 04:32 UTC · model grok-4.3
The pith
A modular WebAgent with planning, summarization, and code synthesis raises success on real websites by more than 50 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those sub-instructions and snippets. It employs Flan-U-PaLM for grounded code generation and introduces HTML-T5, a new pre-trained model that uses local and global attention and a mixture of long-span denoising objectives, for planning and summarization. This modular recipe improves success on real websites by over 50 percent.
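To make "long-span denoising" concrete, here is a minimal sketch of T5-style span corruption in which the span length is drawn from a small mixture. It is not the authors' pre-training code. The function names, the noise density, and the span lengths (8 and 64) are illustrative assumptions, and the one-span-per-segment sampling is a simplification chosen to keep spans from overlapping.

```python
import random


def corrupt_spans(tokens, mean_span_len, noise_density=0.15, seed=0):
    """Mask non-overlapping spans with sentinel tokens (simplified, T5-style).

    Returns (inputs, targets). Hypothetical helper: one span is placed in each
    equal-sized segment of the sequence, which is an assumption of this sketch.
    """
    if not tokens or mean_span_len < 1 or not 0 < noise_density <= 1:
        raise ValueError("need tokens, mean_span_len >= 1, 0 < noise_density <= 1")
    rng = random.Random(seed)
    n = len(tokens)
    num_noise = max(1, round(n * noise_density))        # tokens to mask in total
    num_spans = max(1, round(num_noise / mean_span_len))
    span_len = max(1, num_noise // num_spans)           # fixed length per span
    seg_len = n // num_spans                            # one span per segment

    inputs, targets, cursor = [], [], 0
    for k in range(num_spans):
        start = k * seg_len + rng.randint(0, seg_len - span_len)
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])              # keep the unmasked prefix
        inputs.append(sentinel)                          # placeholder in the input
        targets.append(sentinel)                         # target repeats the sentinel
        targets.extend(tokens[start:start + span_len])   # followed by the masked span
        cursor = start + span_len
    inputs.extend(tokens[cursor:])
    return inputs, targets


def mixed_denoising_example(tokens, span_lens=(8, 64), seed=0):
    """Pick one span length from the mixture, then corrupt with it."""
    rng = random.Random(seed)
    mean_span_len = rng.choice(span_lens)  # placeholder mixture values
    return mean_span_len, corrupt_spans(tokens, mean_span_len, seed=seed)
```

Longer spans force the model to reconstruct larger contiguous chunks, which is plausibly a better match for HTML subtrees than the short spans used for ordinary text; whether that is the authors' motivation is not stated in the text above.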
What carries the argument
Modular pipeline consisting of instruction decomposition for planning, HTML summarization via local-global attention, and Python program synthesis for execution.
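The three stages can be sketched as a simple control loop. This is a hedged illustration of the modular structure only: `run_episode`, `Step`, and the callback signatures are hypothetical names, and the toy `plan`, `summarize`, and `synthesize` functions in `demo` stand in for the learned models (HTML-T5 and Flan-U-PaLM) rather than reproducing them.

```python
from dataclasses import dataclass


@dataclass
class Step:
    sub_instruction: str  # output of the planning stage
    snippet: str          # output of the summarization stage
    program: str          # output of the program-synthesis stage


def run_episode(instruction, get_html, plan, summarize, synthesize, execute,
                max_steps=10):
    """Plan, summarize, synthesize, and act until the planner returns None."""
    history = []
    for _ in range(max_steps):
        html = get_html()                          # current raw page
        sub = plan(instruction, history, html)     # next sub-instruction
        if sub is None:                            # planner signals completion
            break
        snippet = summarize(sub, html)             # task-relevant HTML only
        program = synthesize(sub, snippet)         # code grounded on the snippet
        execute(program)                           # act on the website
        history.append(Step(sub, snippet, program))
    return history


def demo():
    """Toy stand-ins for the models, for illustration only."""
    page = "<input id='city'>\n<button id='search'>Search</button>\n<footer>ads</footer>"
    todo = ["type city into city", "click search"]
    executed = []

    def plan(instruction, history, html):
        return todo[len(history)] if len(history) < len(todo) else None

    def summarize(sub, html):
        target = sub.split()[-1]  # crude keyword match in place of a model
        return "\n".join(line for line in html.splitlines() if target in line)

    def synthesize(sub, snippet):
        return f"# {sub}\n# grounded on: {snippet}"

    history = run_episode("find homes", lambda: page, plan, summarize,
                          synthesize, executed.append)
    return history, executed
```

Keeping each stage behind its own callable is what makes the pipeline modular: any one component can be swapped or ablated without touching the loop.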
Load-bearing premise
The learned behaviors from self-experience, combined with decomposition and summarization, transfer effectively to new and changing real-world websites.
What would settle it
Running the agent on real websites that have been significantly updated or belong to domains far from the training set and checking if the reported success improvement disappears.
Original abstract
Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WebAgent, an LLM-driven agent for autonomous web automation on real websites. It decomposes natural language instructions into canonical sub-instructions for planning, summarizes long HTML documents into task-relevant snippets using a new pre-trained model HTML-T5 (with local/global attention and mixture of long-span denoising objectives), and executes actions via synthesized Python programs. Using Flan-U-PaLM for code generation, the authors claim that this modular recipe (self-experience learning + decomposition + summarization + synthesis) yields over 50% higher success on real websites, 18.7% higher success than prior methods on the MiniWoB benchmark, and state-of-the-art results on the Mind2Web offline planning task. HTML-T5 is positioned as superior for various HTML understanding tasks.
Significance. If the empirical gains are shown to be robust, this work would meaningfully advance LLM-based web agents by addressing open-domain challenges, context length limits, and lack of HTML inductive bias through a modular, program-synthesis approach. The introduction of HTML-T5 as a specialized long-context HTML model represents a concrete architectural contribution that could transfer to other web-data tasks. The emphasis on self-experience learning and falsifiable benchmark results (MiniWoB, Mind2Web) strengthens the paper's potential impact if evaluation gaps are closed.
major comments (2)
- [Abstract and Evaluation sections] Abstract and real-website evaluation: the central claim of >50% success-rate improvement on real websites is load-bearing yet presented without reported baselines, statistical significance tests, error bars, per-site variance, or a protocol for measuring degradation under layout changes and domain shifts not seen in training data. The cited benchmarks (MiniWoB, Mind2Web) are fixed/simulated environments, leaving generalization untested.
- [Model and Experiments] HTML-T5 model section: while the architecture (local + global attention + mixture of long-span denoising) is described, the paper does not provide ablation results isolating the contribution of each component to the reported HTML-task gains or to the downstream WebAgent success rates, making it difficult to assess whether the new model is the primary driver of the 18.7% MiniWoB lift.
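The error bars and significance tests requested in the first major comment could be computed along the following lines. This is one common choice (a Wilson score interval and a pooled two-proportion z-test), not a method the paper is known to use. The function names are hypothetical, and the approach assumes that episodes are independent, which may not hold when many tasks share one website.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def two_proportion_p_value(s1, n1, s2, n2):
    """Two-sided p-value for the difference between two success rates."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both rates are identically 0 or 1; no evidence of a gap
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
```

Calling `wilson_interval(successes, trials)` once per site would give the per-site breakdown the referee asks for, and `two_proportion_p_value` applied to the agent and baseline counts gives a first-pass check on whether a reported gap could be noise.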
minor comments (2)
- [Method] Notation for task decomposition and HTML summarization steps could be formalized with a short algorithm box or pseudocode to improve reproducibility.
- [Experiments] Ensure all compared baselines (including exact prior method on MiniWoB) are listed with citations and hyperparameter settings in a single table for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and outlining planned revisions to improve the rigor and transparency of the work.
Point-by-point responses
- Referee: [Abstract and Evaluation sections] Abstract and real-website evaluation: the central claim of >50% success-rate improvement on real websites is load-bearing yet presented without reported baselines, statistical significance tests, error bars, per-site variance, or a protocol for measuring degradation under layout changes and domain shifts not seen in training data. The cited benchmarks (MiniWoB, Mind2Web) are fixed/simulated environments, leaving generalization untested.
  Authors: We agree that additional details on the real-website evaluation would strengthen the presentation. The >50% improvement is measured relative to our internal baseline agents (as described in the experiments section), but we acknowledge the need for explicit external baseline comparisons, statistical tests, error bars, and per-site breakdowns. In the revised manuscript, we will add these elements along with a clearer description of the evaluation protocol, including how layout changes and domain shifts are handled. While MiniWoB and Mind2Web are standard simulated benchmarks for web automation, the real-website results are intended to complement them by showing performance on live sites; we will expand the discussion of generalization limitations.
  Revision: yes
- Referee: [Model and Experiments] HTML-T5 model section: while the architecture (local + global attention + mixture of long-span denoising) is described, the paper does not provide ablation results isolating the contribution of each component to the reported HTML-task gains or to the downstream WebAgent success rates, making it difficult to assess whether the new model is the primary driver of the 18.7% MiniWoB lift.
  Authors: We recognize that component ablations would help isolate the contributions of local/global attention and the long-span denoising objectives. The manuscript currently emphasizes end-to-end comparisons of HTML-T5 against prior models on HTML tasks and its role within WebAgent. In the revision, we will incorporate ablation studies demonstrating the impact of each architectural element on HTML understanding performance and, where feasible, on downstream WebAgent success rates. This will clarify the extent to which HTML-T5 drives the observed improvements.
  Revision: yes
Circularity Check
No circularity: empirical gains measured on external benchmarks
full rationale
The paper presents an empirical agent architecture (planning via decomposition, HTML summarization, program synthesis) evaluated on fixed external benchmarks (MiniWoB, Mind2Web) and real-website tests. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. Performance deltas (50% lift, 18.7% gain) are reported against prior methods on independent test sets rather than quantities constructed from the paper's own normalizations or ansatzes. The derivation chain is therefore self-contained against external data.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can be specialized via pretraining objectives and attention mechanisms to handle long structured documents like HTML better than generic models.
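As an illustration of why a local-plus-global attention pattern helps with long documents, the sketch below counts how many keys each query touches. It is loosely modeled on block-summary ("transient global") schemes; whether HTML-T5 uses exactly this layout is not stated in the text here, so the function names, the `radius`, and the `block_size` are all assumptions.

```python
def local_global_pattern(seq_len, radius, block_size):
    """For each query position, list its local keys and global block ids.

    Local: tokens within `radius` positions of the query.
    Global: one summary slot per block of `block_size` tokens, visible to all.
    """
    num_blocks = (seq_len + block_size - 1) // block_size  # ceiling division
    pattern = []
    for i in range(seq_len):
        lo = max(0, i - radius)
        hi = min(seq_len, i + radius + 1)
        pattern.append((list(range(lo, hi)), list(range(num_blocks))))
    return pattern


def attended_entries(seq_len, radius, block_size):
    """Total query-key pairs under the sparse pattern."""
    pattern = local_global_pattern(seq_len, radius, block_size)
    return sum(len(local) + len(glob) for local, glob in pattern)
```

Full self-attention scores `seq_len ** 2` pairs, while this pattern scores roughly `seq_len * (2 * radius + 1 + seq_len / block_size)`, which grows much more slowly when `radius` is fixed and `block_size` is not tiny. That gap is the practical argument for such patterns on multi-thousand-token HTML pages.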
invented entities (1)
- HTML-T5: no independent evidence
Forward citations
Cited by 17 Pith papers
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
  OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
- SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
  SimPersona uses VQ-VAE to induce discrete buyer types from clickstreams, maps them to LLM persona tokens, and fine-tunes agents to achieve 78% conversion-rate alignment with real buyers across 42 storefronts.
- Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
  Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
- Training Computer Use Agents to Assess the Usability of Graphical User Interfaces
  uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.
- ClawBench: Can AI Agents Complete Everyday Online Tasks?
  ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
  Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
- Prompt Injection Attack to Tool Selection in LLM Agents
  ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
- WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
  WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.
- MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning
  MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...
- Recommending Usability Improvements with Multimodal Large Language Models
  Multimodal LLMs can detect usability issues from screen recordings, explain them via Nielsen's heuristics, and rank improvement recommendations, with engineer feedback indicating practical usefulness for teams lacking...
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
  VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
- ReFinE: Streamlining UI Mockup Iteration with Research Findings
  ReFinE is a Figma plugin that synthesizes contextualized design implications from HCI literature to provide actionable visual guidance for iterating on UI mockups.
- GPT-4V(ision) is a Generalist Web Agent, if Grounded
  GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.
- Cognitive Architectures for Language Agents
  CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic de...
- ARMove: Learning to Predict Human Mobility through Agentic Reasoning
  ARMove is a transferable framework for human mobility prediction that combines agentic LLM reasoning, feature management, and large-small model synergy to outperform baselines on several metrics while improving interp...
- Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
  The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
- WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent
  WebUncertainty improves web agent performance on benchmarks by adaptively selecting planning modes based on task uncertainty and using confidence-induced action uncertainty in MCTS to quantify aleatoric and epistemic ...
Reference graph
Works this paper leans on
[1] are proposed to leverage their syntax better. On the other hand, such a domain knowledge conflicts with recent generalist and scaling trends around LLMs (Anil et al., 2023; OpenAI, 2023). Because web agents require the instruction-conditioned HTML understanding, it also would be desirable to reconcile specialist aspects for HTML documents with generalist ...
[2] can you search for a studio bedroom, 1+ bathroom houses in escondido, ca for corporate housing and price less than 12100 on real estate website
[3] can you find me a studio bedroom, 1+ bathroom townhomes in hollywood, ca and price less than 14600 on real estate website
[4] can you search for a studio bedroom, 1+ bathroom condos in inglewood, ca for senior housing and price less than 8700 on real estate website
[5] I would like to search for a studio bedroom, 1+ bathroom houses in compton, ca and price more than 1200 for corporate housing on real estate website
[6] can you search for a studio bedroom, 1+ bathroom apartments in oroville, ca for corporate housing on real estate website
[7] find me a studio bedroom, 1+ bathroom houses in modesto, ca on real estate website
[8] can you search for a studio bedroom, 1+ bathroom condos in redwood city, ca for student and price more than 1900 on real estate website
[9] find me a 1 bedroom condos in santa clara, ca and price between 1600 and 7400 on real estate website
[10] find me a 1 bedroom, 3+ bathroom apartments in martinez, ca with min price 1800 on real estate website
[11] can you find me a 2 bedroom, 2+ bathroom townhomes in concord, ca and price more than 600 on real estate website
[12] can you find me a studio bedroom, 2+ bathroom apartments in san diego, ca and price less than 9300 on real estate website
[13] find me a studio bedroom houses in novato, ca and price between 1500 and 6700 on real estate website
[14] can you find me a studio bedroom, any bathroom townhomes in petaluma, ca and price more than 1000 on real estate website
[15] search for a 1 bedroom apartments in modesto, ca and price more than 1000 on real estate website
[16] find me a 1 bedroom, 2+ bathroom apartments in watts, ca for senior housing less than 6300 on real estate website
[17] can you find me a 1 bedroom houses in victorville, ca that have dog friendly, furnished and price more than 700 on real estate website
[18] I need a 2 bedroom, any bathroom condos in inglewood, ca and price more than 1000 on real estate website
[19] find me a 2 bedroom, 2+ bathroom apartments in livermore, ca on real estate website
[20] can you find me a 2 bedroom apartments in santa clara, ca that has parking and price less than 10300 on real estate website
[21] can you search for a 2 bedroom condos in oakland, ca on real estate website.
social-media
[22] Show me the most hot thread in r/google at social media website
[23] Can you point out the most hot thread in r/learnpython at social media website
[24] Could you check the 1st hot thread in r/artificial at social media website
[25] Can I check the most hot thread in Taiwan on social media website
[26] Show me the first new thread in r/facebook at social media website
[27] Present the most new thread of r/Python filtered by Tutorial flair on social media website
[28] Could you check the 1st new thread in r/facebook at social media website
[29] I want to read the 1st hot thread from r/Python tagged as Daily Thread at social media website
[30] Present the most hot thread of r/google filtered by Info | Mod Post flair on social media website
[31] Show me the most new thread in r/learnmachinelearning filtered by Help flair at social media website
[32] Can you point out the first hot thread in r/deeplearning at social media website
[33] Could you check the 1st hot thread in r/machinelearningnews at social media website
[34] Present the most hot thread of r/artificial filtered by News flair on social media website
[35] Please find me the first hot thread in r/facebook at social media website
[36] Present the most new thread of r/machinelearningnews filtered by Startup News flair on social media website
[37] Show me the most hot thread in r/artificial filtered by AI Art flair at social media website
[38] Could you check the first new thread in r/facebook at social media website
[39] I want to read the most top thread from r/google tagged as Info | Mod Post at social media website
[40] Show me the most new thread in r/startups filtered by Share Your Startup flair at social media website
[41] Could you check the 2nd new thread in r/facebook at social media website.
Published as a conference paper at ICLR 2024
map
[42] Show me the way from San Jose to Mountain View by 2nd Cycling at map website
[43] Please show me the way from The Painted Ladies to San Francisco Zoo with 3rd Best option at map website
[44] Could you tell me the path from California Academy of Sciences to de Young Museum by 1st Transit at map website
[45] Could you tell me the way from Union Square to The Painted Ladies with 2nd Cycling option at map website
[46] Please present the way from Chappell Hayes Observation Tower to San Jose with 2nd Walking option at map website
[47] Please present the path from Jack London Square to Emeryville by 2nd Cycling at map website
[48] I’d like to move The Midway from Children’s Fairyland by 1st Cycling at map website
[49] I’d like to move Chase Center from San Francisco - Oakland Bay Bridge with 2nd Transit option at map website
[50] I want to move Pier 39 from Berkeley by 3rd Cycling at map website
[51] I want to go to Emeryville from Mountain View with 2nd Cycling option at map website
[52] Can you point out the way from San Mateo to Stanford University by 2nd Cycling at map website
[53] Could you point out the way from Palace of Fine Arts to UC Berkeley by 1st Cycling at map website
[54] Point out the way from The Painted Ladies to San Francisco Museum of Modern Art by 2nd Driving at map website
[55] Could you find the path from Union Square to Palo Alto by 1st Cycling at map website
[56] Please check the way from San Jose to San José Mineta International Airport with 1st Walking at map website
[57] Check the path from San Francisco Zoo to Berkeley with 1st Cycling at map website
[58] I’d like to check Parking Lots along the way from Stanford University to The Painted Ladies with Best option at map website
[59] Check Gas stations along the way from de Young Museum to Oakland with Driving option at map website
[60] Please show me Hotels along the way from Palace of Fine Arts to Berkeley by Transit at map website
[61] Check Gas stations along the way from Bay Area Discovery Museum to Santa Cruz with Best option at map website.
https://www.(map website).com/
G Example Episode in Real-World Web Automation
map: Show me the way from San Jose to Mountain View by 2nd Cycling at map website?
# Go to map website
driver.get("https://www.(map website).com/")
#...