WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions

Charlene Y. Lee; Cheng Chang; Gang Li; Ignacio Cases; Manpreet Kaur; Peng Qi; Rishu Garg; Sanjari Srivastava; Yanan Xie; Yining Mao

arxiv: 2510.09872 · v2 · pith:ENLSPLV2new · submitted 2025-10-10 · 💻 cs.LG · cs.AI


Sanjari Srivastava, Gang Li, Cheng Chang, Rishu Garg, Manpreet Kaur, Charlene Y. Lee, Yuezhang Li, Yining Mao, Ignacio Cases, Yanan Xie, Peng Qi


Pith reviewed 2026-05-21 20:11 UTC · model grok-4.3


keywords: web agents, GUI subtasks, benchmark, reinforcement learning, supervised fine-tuning, web navigation, multimodal agents, web archives


The pith

WARC-Bench shows that web agents must master short GUI subtasks like date picking and scrolling to achieve robust navigation, and that RLVR training on limited data can lift open-source models past many frontier systems on these tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WARC-Bench as a set of 438 sandboxed tasks drawn from real archived webpages to test whether agents can handle basic UI interactions that arise inside larger web plans. Leading models top out at 64.8 percent success, while supervised fine-tuning alone reaches 48.8 percent. Adding reinforcement learning with verifiable rewards on top of those checkpoints raises the score to 52.8 percent. The work argues that existing evaluations skip these foundational steps and that progress on them is required before agents can reliably plan and act across live websites.

Core claim

WARC-Bench supplies 438 tasks that isolate short-horizon GUI subtasks on dynamic, realistic pages delivered through Web ARChive files, allowing controlled yet authentic interaction without live network access. Frontier computer-use models reach no higher than 64.8 percent success. Supervised fine-tuning produces 48.8 percent, and further RLVR training on the same checkpoints improves results to 52.8 percent, surpassing many closed models. The authors conclude that competence on these subtasks is a prerequisite for dependable web planning and navigation.
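The three headline rates can be compared in two ways: as a difference in percentage points and as a relative change. The short sketch below works through that arithmetic using only the figures quoted above; the function and variable names are illustrative and do not come from the paper.

```python
# Success rates quoted in the summary above (percent of benchmark tasks solved).
FRONTIER_BEST = 64.8   # best frontier computer-use model
SFT = 48.8             # supervised fine-tuning alone
SFT_PLUS_RLVR = 52.8   # RLVR applied on top of the SFT checkpoints


def point_gain(before: float, after: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Change relative to the starting rate, in percent, rounded to one decimal."""
    return round(100.0 * (after - before) / before, 1)


rlvr_points = point_gain(SFT, SFT_PLUS_RLVR)             # 4.0 points
rlvr_relative = relative_gain(SFT, SFT_PLUS_RLVR)        # 8.2 percent
gap_to_best = point_gain(SFT_PLUS_RLVR, FRONTIER_BEST)   # 12.0 points

print(rlvr_points, rlvr_relative, gap_to_best)
```

Read this way, RLVR adds 4.0 points (about 8.2 percent relative to SFT), and the trained model still sits 12.0 points below the best frontier result.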

What carries the argument

WARC-Bench, a collection of 438 subtask tasks executed inside sandboxed Web ARChive environments that preserve dynamic webpage behavior for repeatable multimodal agent evaluation.

If this is right

Agents trained to high accuracy on these subtasks should exhibit fewer low-level failures when executing longer web plans.
RLVR applied after SFT can raise open-source performance on subtask benchmarks even when task-specific data remain scarce.
Benchmarks that omit isolated GUI subtasks will continue to underestimate the gap between current agents and reliable web operation.
Open-source models can be brought closer to frontier capability on web interfaces by targeted subtask training rather than full end-to-end data collection.
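To make "verifiable rewards" concrete, here is a minimal sketch of what a programmatic reward check for a GUI subtask could look like. It is an assumption-laden illustration, not the authors' reward function: `SubtaskSpec`, `verifiable_reward`, and the page-state dictionary are hypothetical names, and the page does not describe how the paper actually scores an episode.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SubtaskSpec:
    """Hypothetical task definition: the page-state fields a checker inspects."""
    description: str
    expected_state: Mapping[str, str]


def verifiable_reward(spec: SubtaskSpec, final_state: Mapping[str, str]) -> float:
    """Return 1.0 only if every expected field matches the final page state."""
    for key, expected in spec.expected_state.items():
        if final_state.get(key) != expected:
            return 0.0
    return 1.0


# Hypothetical date-picker subtask; the field name and value are made up.
date_task = SubtaskSpec(
    description="Pick 14 March 2025 in the departure date picker",
    expected_state={"departure_date": "2025-03-14"},
)

reward_ok = verifiable_reward(
    date_task, {"departure_date": "2025-03-14", "scroll_y": "0"}
)
reward_bad = verifiable_reward(date_task, {"departure_date": "2025-03-15"})
print(reward_ok, reward_bad)
```

The point of the design is that the reward comes from a deterministic check on the final state rather than from a learned judge, which is what makes it usable as a training signal in a sandbox.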

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If subtask mastery transfers across domains, future agent pipelines could first train on archived subtasks before scaling to full planning.
The benchmark may surface concrete failure modes, such as misreading dynamic elements, that are masked in coarser web-agent scores.
Releasing the archive files and task definitions would let independent groups test whether the same training recipe works on entirely new web domains.

Load-bearing premise

Success on the 438 sandboxed archive tasks reflects the same capabilities that drive performance in open, live web navigation.

What would settle it

Models that score high on WARC-Bench but fail to improve on a matched set of live website navigation tasks, or models that succeed on live tasks without mastering the benchmark subtasks, would undermine the central claim.

Original abstract

Training web agents to navigate complex, real-world websites requires them to master $\textit{subtasks}$ - short-horizon interactions on multiple UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel web navigation benchmark featuring 438 tasks designed to evaluate multimodal AI agents on subtasks. WARC-Bench enables sandboxed interactions with dynamic and realistic webpages using Web ARChive files. We show that WARC-Bench is challenging for leading computer-use models, with the highest observed success rate being 64.8%. To improve open-source models on subtasks, we explore two common training techniques: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Experiments show that SFT models obtain a 48.8% success rate on the benchmark. Training with RLVR over SFT checkpoints, even in data-scarce settings, improves the score to 52.8% on WARC-Bench, outperforming many frontier models. Our analysis concludes that mastering these subtasks is essential for robust web planning and navigation, and is a capability not extensively evaluated by existing benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

WARC-Bench gives a clean new way to test short GUI subtasks with real training numbers, but the paper never checks whether those subtasks actually drive success on full web tasks.


The useful part is the benchmark itself. They pulled 438 concrete subtasks out of Web ARChive files and set up a sandbox where agents can click, scroll, and pick dates on realistic pages without hitting live sites. That setup is practical and avoids some of the flakiness of real-web evaluation. They also ran the obvious baselines: frontier models top out at 64.8 percent, SFT reaches 48.8 percent, and RLVR on top of SFT adds another four points to 52.8 percent even with limited data. Those numbers are at least reported plainly.

Referee Report

2 major / 1 minor

Summary. The paper introduces WARC-Bench, a benchmark with 438 tasks for evaluating multimodal agents on short-horizon GUI subtasks (e.g., date pickers, scrolling) using Web ARChive files for sandboxed, dynamic webpage interactions. It reports that leading computer-use models achieve at most 64.8% success, open-source models reach 48.8% after SFT, and RLVR on SFT checkpoints improves this to 52.8% even in data-scarce regimes. The authors conclude that mastering these subtasks is essential for robust web planning and navigation, a capability not extensively covered by existing benchmarks.

Significance. If the benchmark tasks prove representative and the reported gains hold under rigorous evaluation, WARC-Bench could serve as a focused diagnostic for subtask-level weaknesses in web agents, complementing full-horizon benchmarks. The RLVR result in low-data settings provides a concrete, reproducible signal that verifiable-reward training can yield measurable gains over SFT for GUI interaction. The work's value ultimately depends on establishing a link between subtask mastery and end-to-end navigation performance.

major comments (2)

[Abstract] Abstract: Headline success rates (64.8%, 48.8%, 52.8%) are presented without any description of how success is defined or measured, task construction details, inter-annotator agreement, or statistical error bars. This absence prevents verification of the data-to-claim link for the central empirical results.
[Abstract] Abstract, final paragraph: The claim that 'mastering these subtasks is essential for robust web planning and navigation' is not supported by transfer experiments, correlation studies, or ablations on full-horizon tasks (e.g., WebArena, Mind2Web). The representativeness assumption therefore remains untested and is load-bearing for the paper's motivation and conclusions.
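The first comment's request for error bars can be illustrated with a Wilson score interval for a binomial proportion. The sketch below rests on assumptions the page does not confirm: that each reported rate is measured over all 438 tasks, that task outcomes are independent, and that a 95% level (z = 1.96) is the relevant one. The function name is illustrative.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion p_hat observed over n trials."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


N_TASKS = 438  # assumption: every reported rate is computed over all 438 tasks

sft_lo, sft_hi = wilson_interval(0.488, N_TASKS)
rlvr_lo, rlvr_hi = wilson_interval(0.528, N_TASKS)

print(f"SFT:      {sft_lo:.3f} to {sft_hi:.3f}")
print(f"SFT+RLVR: {rlvr_lo:.3f} to {rlvr_hi:.3f}")
```

Under these assumptions each interval spans several percentage points on either side of the point estimate, which is comparable in size to the 4-point gain from RLVR. That is why the referee treats the missing error bars as a major issue.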

minor comments (1)

[Abstract] The abstract states the number of tasks (438) but provides no high-level breakdown of task categories or UI-component coverage, which would help readers assess diversity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their thorough review and constructive suggestions. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we indicate revisions to be made in the updated manuscript.


Referee: [Abstract] Abstract: Headline success rates (64.8%, 48.8%, 52.8%) are presented without any description of how success is defined or measured, task construction details, inter-annotator agreement, or statistical error bars. This absence prevents verification of the data-to-claim link for the central empirical results.

Authors: We concur that the abstract would benefit from greater specificity to support verification of the results. In the revised version, we will expand the abstract to define success as correct completion of the short-horizon GUI subtask within the sandboxed Web ARChive environment, briefly note that tasks were manually curated from real web archives for representativeness, and direct readers to the main text for inter-annotator agreement scores and error bars on the reported success rates. This revision preserves the headline numbers while improving the data-to-claim linkage.
revision: yes

Referee: [Abstract] Abstract, final paragraph: The claim that 'mastering these subtasks is essential for robust web planning and navigation' is not supported by transfer experiments, correlation studies, or ablations on full-horizon tasks (e.g., WebArena, Mind2Web). The representativeness assumption therefore remains untested and is load-bearing for the paper's motivation and conclusions.

Authors: We acknowledge that the manuscript does not contain transfer experiments or direct correlations with full-horizon benchmarks, which would provide stronger empirical support. The current claim is grounded in the design rationale that subtasks such as date selection and scrolling are foundational elements rarely isolated in existing benchmarks, together with the observed performance gaps. To address the concern, we will revise the final paragraph of the abstract to qualify the statement as an analysis-based conclusion rather than an untested assertion, and we will add a short discussion of the need for future transfer studies to benchmarks like WebArena.
revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivation reductions


The paper introduces WARC-Bench as a new benchmark consisting of 438 sandboxed tasks and reports direct empirical success rates (e.g., 48.8% for SFT, 52.8% for RLVR, 64.8% for frontier models). No equations, parameter fits, or mathematical derivations are described that could reduce to self-defined quantities, prior self-citations, or ansatzes. The central claim that subtask mastery is essential rests on the observed scores and the authors' analysis of the benchmark itself, without any load-bearing step that is equivalent to its inputs by construction. This is a standard empirical benchmark paper whose measurements are independent of any circular chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that archived webpages can be turned into faithful interactive environments and that success on the chosen subtasks predicts broader navigation ability; no free parameters or new entities are introduced.

axioms (1)

domain assumption: Web ARChive (WARC) files can be replayed to create sandboxed, interactive environments that faithfully simulate dynamic real-world webpages for agent evaluation.
Invoked when the benchmark is defined as enabling sandboxed interactions with dynamic and realistic webpages.
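For readers unfamiliar with the format, the sketch below shows the layout of a single uncompressed WARC record and how a replay layer could index captured responses by URL. It illustrates the file format only and is not the authors' replay stack. The sample record is hand-built, it omits mandatory fields such as WARC-Record-ID and WARC-Date for brevity, and `parse_warc_record` is a hypothetical helper. Real archives are usually gzip-compressed and are better read with a dedicated library.

```python
def parse_warc_record(raw: bytes) -> tuple[str, dict[str, str], bytes]:
    """Split one uncompressed WARC record into version line, headers, and content block."""
    # A blank line (CRLF CRLF) separates the WARC headers from the content block.
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    headers: dict[str, str] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the size of the content block in bytes.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


# Hand-built sample: a "response" record wrapping a captured HTTP response.
payload = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html>date picker</html>"
)
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(payload)).encode("ascii") + b"\r\n"
    b"\r\n" + payload + b"\r\n\r\n"
)

version, headers, body = parse_warc_record(sample)

# A replay layer could answer browser requests from an index like this one.
replay_index = {headers["WARC-Target-URI"]: body}
print(version, headers["WARC-Type"], len(replay_index["http://example.com/"]))
```

Serving stored responses from such an index is what removes the need for live network access. Whether replay stays faithful for pages that build requests dynamically is exactly what this axiom assumes.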



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce WARC-Bench … 438 tasks … RLVR over SFT checkpoints … 52.8% … mastering these subtasks is essential for robust web planning and navigation.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Training with RLVR … improves the score to 52.8% …

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.