pith. machine review for the scientific record.

arxiv: 2307.12856 · v4 · submitted 2023-07-24 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

Authors on Pith no claims yet

Pith reviewed 2026-05-17 04:32 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords web agents · LLM planning · HTML understanding · program synthesis · autonomous web navigation · long context models · task decomposition
0 comments

The pith

A modular WebAgent combining planning, summarization, and code synthesis raises the success rate on real websites by more than 50 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop WebAgent to let large language models complete tasks on actual websites from natural language instructions. They tackle open-ended domains, limited context windows, and the lack of HTML-specific inductive bias by breaking tasks into sub-instructions, condensing the relevant parts of the HTML, and writing Python code to perform the actions. The method relies on self-experience to learn and uses specialized models for different steps. If the approach holds, it would turn web agents into practical tools for everyday online work instead of lab demonstrations. Experiments show clear gains on live sites and top results on standard tests like MiniWoB and Mind2Web.

Core claim

WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those snippets. It employs Flan-U-PaLM for grounded code generation and introduces HTML-T5, a new pre-trained model that combines local and global attention with a mixture of long-span denoising objectives, for planning and summarization. This modular recipe improves the success rate on real websites by over 50 percent.
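The three-stage loop described above can be sketched as follows. The function bodies are placeholders standing in for the HTML-T5 and Flan-U-PaLM calls; all names and signatures here are illustrative, not the paper's API:

```python
# Hypothetical sketch of the WebAgent plan -> summarize -> synthesize loop.
# plan() and summarize() stand in for HTML-T5; synthesize() for Flan-U-PaLM.

def plan(instruction: str, history: list[str]) -> str:
    """HTML-T5 role: emit the next canonical sub-instruction."""
    return f"sub-step {len(history) + 1} of: {instruction}"

def summarize(html: str, sub_instruction: str, max_chars: int = 200) -> str:
    """HTML-T5 role: keep only the task-relevant part of the HTML."""
    return html[:max_chars]  # placeholder for learned snippet extraction

def synthesize(sub_instruction: str, snippet: str) -> str:
    """Flan-U-PaLM role: generate an executable action program."""
    return f"click_or_type({sub_instruction!r})"

def webagent_step(instruction: str, html: str, history: list[str]) -> str:
    sub = plan(instruction, history)
    snippet = summarize(html, sub)
    program = synthesize(sub, snippet)
    history.append(sub)
    return program

history: list[str] = []
program = webagent_step("book a table for two", "<html>...</html>", history)
```

The point of the modular split is that each stage sees only what it needs: the planner sees the instruction and history, the summarizer sees the raw HTML, and the code generator sees just the sub-instruction plus the condensed snippet.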

What carries the argument

Modular pipeline consisting of instruction decomposition for planning, HTML summarization via local-global attention, and Python program synthesis for execution.
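As a rough illustration of the summarization module's input/output contract: the real component is a learned HTML-T5 model, while this keyword filter (and every name in it) is invented purely to show the shape of the task — instruction in, task-relevant HTML snippets out:

```python
# Toy stand-in for HTML summarization: keep only elements whose text
# overlaps the sub-instruction's keywords. Illustrative only.
from html.parser import HTMLParser

class SnippetFilter(HTMLParser):
    def __init__(self, keywords: set[str]):
        super().__init__()
        self.keywords = keywords
        self.snippets: list[str] = []
        self._stack: list[str] = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        words = set(data.lower().split())
        if words & self.keywords and self._stack:
            tag = self._stack[-1]
            self.snippets.append(f"<{tag}>{data.strip()}</{tag}>")

def summarize_html(html: str, sub_instruction: str) -> str:
    f = SnippetFilter(set(sub_instruction.lower().split()))
    f.feed(html)
    return "\n".join(f.snippets)

page = "<div><button>Search</button><p>News of the day</p></div>"
print(summarize_html(page, "press search"))  # -> <button>Search</button>
```

A keyword overlap obviously cannot capture the instruction conditioning the paper claims; it only makes concrete why a learned summarizer is needed when the relevant element shares no surface vocabulary with the instruction.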

Load-bearing premise

The learned behaviors from self-experience, combined with decomposition and summarization, transfer effectively to new and changing real-world websites.

What would settle it

Running the agent on real websites that have been significantly updated or belong to domains far from the training set and checking if the reported success improvement disappears.

read the original abstract

Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WebAgent, an LLM-driven agent for autonomous web automation on real websites. It decomposes natural language instructions into canonical sub-instructions for planning, summarizes long HTML documents into task-relevant snippets using a new pre-trained model HTML-T5 (with local/global attention and mixture of long-span denoising objectives), and executes actions via synthesized Python programs. Using Flan-U-PaLM for code generation, the authors claim that this modular recipe (self-experience learning + decomposition + summarization + synthesis) yields over 50% higher success on real websites, 18.7% higher success than prior methods on the MiniWoB benchmark, and state-of-the-art results on the Mind2Web offline planning task. HTML-T5 is positioned as superior for various HTML understanding tasks.

Significance. If the empirical gains are shown to be robust, this work would meaningfully advance LLM-based web agents by addressing open-domain challenges, context length limits, and lack of HTML inductive bias through a modular, program-synthesis approach. The introduction of HTML-T5 as a specialized long-context HTML model represents a concrete architectural contribution that could transfer to other web-data tasks. The emphasis on self-experience learning and falsifiable benchmark results (MiniWoB, Mind2Web) strengthens the paper's potential impact if evaluation gaps are closed.

major comments (2)
  1. [Abstract and Evaluation sections] Abstract and real-website evaluation: the central claim of >50% success-rate improvement on real websites is load-bearing yet presented without reported baselines, statistical significance tests, error bars, per-site variance, or a protocol for measuring degradation under layout changes and domain shifts not seen in training data. The cited benchmarks (MiniWoB, Mind2Web) are fixed/simulated environments, leaving generalization untested.
  2. [Model and Experiments] HTML-T5 model section: while the architecture (local + global attention + mixture of long-span denoising) is described, the paper does not provide ablation results isolating the contribution of each component to the reported HTML-task gains or to the downstream WebAgent success rates, making it difficult to assess whether the new model is the primary driver of the 18.7% MiniWoB lift.
minor comments (2)
  1. [Method] Notation for task decomposition and HTML summarization steps could be formalized with a short algorithm box or pseudocode to improve reproducibility.
  2. [Experiments] Ensure all compared baselines (including exact prior method on MiniWoB) are listed with citations and hyperparameter settings in a single table for clarity.
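One concrete way to supply the error bars and per-site variance requested in major comment 1 is a percentile bootstrap over per-episode success outcomes. The episode data below is invented for illustration, not taken from the paper:

```python
import random

def bootstrap_ci(outcomes: list[int], n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of 0/1 success outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-episode results on one real website (1 = success):
episodes = [1] * 14 + [0] * 6          # 70% observed success rate
low, high = bootstrap_ci(episodes)
```

Reporting such an interval per site, rather than a single pooled percentage, would directly expose the per-site variance the referee asks about.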

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and outlining planned revisions to improve the rigor and transparency of the work.

read point-by-point responses
  1. Referee: [Abstract and Evaluation sections] Abstract and real-website evaluation: the central claim of >50% success-rate improvement on real websites is load-bearing yet presented without reported baselines, statistical significance tests, error bars, per-site variance, or a protocol for measuring degradation under layout changes and domain shifts not seen in training data. The cited benchmarks (MiniWoB, Mind2Web) are fixed/simulated environments, leaving generalization untested.

    Authors: We agree that additional details on the real-website evaluation would strengthen the presentation. The >50% improvement is measured relative to our internal baseline agents (as described in the experiments section), but we acknowledge the need for explicit external baseline comparisons, statistical tests, error bars, and per-site breakdowns. In the revised manuscript, we will add these elements along with a clearer description of the evaluation protocol, including how layout changes and domain shifts are handled. While MiniWoB and Mind2Web are standard simulated benchmarks for web automation, the real-website results are intended to complement them by showing performance on live sites; we will expand the discussion of generalization limitations. revision: yes

  2. Referee: [Model and Experiments] HTML-T5 model section: while the architecture (local + global attention + mixture of long-span denoising) is described, the paper does not provide ablation results isolating the contribution of each component to the reported HTML-task gains or to the downstream WebAgent success rates, making it difficult to assess whether the new model is the primary driver of the 18.7% MiniWoB lift.

    Authors: We recognize that component ablations would help isolate the contributions of local/global attention and the long-span denoising objectives. The manuscript currently emphasizes end-to-end comparisons of HTML-T5 against prior models on HTML tasks and its role within WebAgent. In the revision, we will incorporate ablation studies demonstrating the impact of each architectural element on HTML understanding performance and, where feasible, on downstream WebAgent success rates. This will clarify the extent to which HTML-T5 drives the observed improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external benchmarks

full rationale

The paper presents an empirical agent architecture (planning via decomposition, HTML summarization, program synthesis) evaluated on fixed external benchmarks (MiniWoB, Mind2Web) and real-website tests. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. Performance deltas (50% lift, 18.7% gain) are reported against prior methods on independent test sets rather than quantities constructed from the paper's own normalizations or ansatzes. The evaluation chain is therefore checked against external data rather than against quantities the paper itself constructs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard assumptions that LLMs can be effectively fine-tuned for code generation and HTML summarization, plus the empirical effectiveness of the proposed modular decomposition and attention mechanisms; no new physical entities or ad-hoc constants are introduced beyond model architecture choices.

axioms (1)
  • domain assumption Large language models can be specialized via pretraining objectives and attention mechanisms to handle long structured documents like HTML better than generic models.
    Invoked in the design and claimed superiority of HTML-T5 for planning and summarization tasks.
invented entities (1)
  • HTML-T5 no independent evidence
    purpose: New pre-trained LLM using local and global attention plus mixture of long-span denoising objectives for HTML understanding.
    Introduced as a core component for planning and summarization; no independent evidence provided beyond the paper's own empirical results.
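The local/global attention pattern credited to HTML-T5 can be illustrated as a boolean attention mask: each position attends within a sliding local window, and a sparse set of "global" block positions attends to, and is attended by, everything. Window and block sizes here are arbitrary, not the paper's settings:

```python
# Illustrative local + global attention mask, not HTML-T5's exact scheme.

def local_global_mask(seq_len: int, window: int, block: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j."""
    globals_ = set(range(0, seq_len, block))  # one global token per block
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            local = abs(i - j) <= window
            mask[i][j] = local or j in globals_ or i in globals_
    return mask

m = local_global_mask(seq_len=8, window=1, block=4)
```

The design intuition is that local windows capture nearby markup (an element and its attributes) while the global tokens relay information between distant parts of a long HTML document at sub-quadratic cost.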

pith-pipeline@v0.9.0 · 5538 in / 1469 out tokens · 36333 ms · 2026-05-17T04:32:29.130963+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  2. SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

    cs.AI 2026-05 conditional novelty 7.0

    SimPersona uses VQ-VAE to induce discrete buyer types from clickstreams, maps them to LLM persona tokens, and fine-tunes agents to achieve 78% conversion-rate alignment with real buyers across 42 storefronts.

  3. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  4. Training Computer Use Agents to Assess the Usability of Graphical User Interfaces

    cs.CL 2026-04 unverdicted novelty 7.0

    uxCUA is a trained computer use agent that assesses GUI usability more accurately than larger models by learning to prioritize and execute important user interactions on labeled interface datasets.

  5. ClawBench: Can AI Agents Complete Everyday Online Tasks?

    cs.CL 2026-04 unverdicted novelty 7.0

    ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.

  6. MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    cs.CV 2026-04 unverdicted novelty 7.0

    Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

  7. Prompt Injection Attack to Tool Selection in LLM Agents

    cs.CR 2025-04 conditional novelty 7.0

    ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.

  8. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    cs.LG 2024-03 unverdicted novelty 7.0

    WorkArena benchmark shows LLM web agents achieve partial success on enterprise tasks but have a substantial gap to full automation and perform worse with open-source models.

  9. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-...

  10. Recommending Usability Improvements with Multimodal Large Language Models

    cs.SE 2026-04 unverdicted novelty 6.0

    Multimodal LLMs can detect usability issues from screen recordings, explain them via Nielsen's heuristics, and rank improvement recommendations, with engineer feedback indicating practical usefulness for teams lacking...

  11. VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

    cs.CL 2026-04 conditional novelty 6.0

    VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

  12. ReFinE: Streamlining UI Mockup Iteration with Research Findings

    cs.HC 2026-04 unverdicted novelty 6.0

    ReFinE is a Figma plugin that synthesizes contextualized design implications from HCI literature to provide actionable visual guidance for iterating on UI mockups.

  13. GPT-4V(ision) is a Generalist Web Agent, if Grounded

    cs.IR 2024-01 conditional novelty 6.0

    GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.

  14. Cognitive Architectures for Language Agents

    cs.AI 2023-09 accept novelty 6.0

    CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic de...

  15. ARMove: Learning to Predict Human Mobility through Agentic Reasoning

    cs.MA 2026-04 unverdicted novelty 5.0

    ARMove is a transferable framework for human mobility prediction that combines agentic LLM reasoning, feature management, and large-small model synergy to outperform baselines on several metrics while improving interp...

  16. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

  17. WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent

    cs.AI 2026-04 unverdicted novelty 4.0

    WebUncertainty improves web agent performance on benchmarks by adaptively selecting planning modes based on task uncertainty and using confidence-induced action uncertainty in MCTS to quantify aleatoric and epistemic ...

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 17 Pith papers
