WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Alexey Taymanov; Aviral Kumar; Hao Bai; Spencer Whitehead; Tong Zhang

arxiv: 2601.02439 · v6 · submitted 2026-01-05 · 💻 cs.LG · cs.CV

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Hao Bai , Alexey Taymanov , Tong Zhang , Aviral Kumar , Spencer Whitehead This is my paper

Pith reviewed 2026-05-16 17:27 UTC · model grok-4.3

classification 💻 cs.LG cs.CV

keywords visual web agentsreinforcement learningvision-language modelsweb environmentstask scalingout-of-distribution generalizationagent rollouts

0 comments

The pith

Fine-tuning on WebGym raises an open vision model's web agent success from 26% to 43% on unseen sites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

WebGym supplies nearly 300,000 rubric-scored tasks across real websites to train visual web agents. The authors apply a straightforward reinforcement learning method that improves the policy using the agent's own interaction traces and task rewards. They introduce an asynchronous rollout system that delivers 4-5x faster trajectory sampling to support large-scale training. Fine-tuning Qwen-3-VL-8B-Instruct on the expanded data set lifts success on an out-of-distribution test set of never-seen websites from 26.2% to 42.9%. The result exceeds agents built on GPT-4o and GPT-5-Thinking.

Core claim

WebGym is the largest open environment for realistic visual web agents, containing nearly 300,000 tasks with rubric-based evaluations on diverse real-world sites. A simple RL recipe that trains on the agent's own rollouts, accelerated by a high-throughput asynchronous sampling system, allows scaling task breadth and depth. Fine-tuning Qwen-3-VL-8B-Instruct on this data improves out-of-distribution success from 26.2% to 42.9%, outperforming proprietary-model agents at 27.1% and 29.8%.

What carries the argument

The WebGym environment with its rubric-evaluated tasks on real websites, combined with the asynchronous rollout system that speeds trajectory sampling.

Load-bearing premise

Rubric-based evaluations on real websites accurately capture meaningful task success and the out-of-distribution test set of never-seen sites is representative of broader web diversity.

What would settle it

Running the fine-tuned model on a new collection of real websites whose tasks are scored by human judges and finding success rates no higher than the base model or the GPT baselines.

Figures

Figures reproduced from arXiv: 2601.02439 by Alexey Taymanov, Aviral Kumar, Hao Bai, Spencer Whitehead, Tong Zhang.

**Figure 1.** Figure 1: Example rollouts from visual web agents trained on different training environments. Tasks in prior large-scale training setups were relatively simple, e.g., Test-Time-Interaction (TTI; Shen et al. (2025), Row 1), resulting in failures of trained agents on many held-out tasks (task shown: From ArXiv, access the website of the university that maintains and manages ArXiv. How many undergraduate students are c… view at source ↗

**Figure 2.** Figure 2: Task decomposition system. WebGym decomposes tasks by generating valid combinations of fact groups from the original task’s rubric. Decomposition requires ≥2 groups with at least one “large” group (≥3 facts). Each valid combination (excluding the full set) produces a new task with lower difficulty while maintaining consistency with the original objectives. on narrow domains; c) BrowseComp (Wei et al., 2025… view at source ↗

**Figure 3.** Figure 3: Statistics of the WebGym training task set. Left: website distribution as a function of the sorted index (from more tasks to less) of website. Right: difficulty distribution of the train and test task sets. The transparent bars over the original bar mean decomposed tasks, and the slashed bars means test set.] difficulty of 9. Since the decomposition conditions are met (3 groups ≥ 2, and G2 and G3 are “larg… view at source ↗

**Figure 4.** Figure 4: Analysis of the WebGym task set. Left: distribution of tasks across domains according to the Mind2Web-2 taxonomy (Gou et al., 2025). Right: Distribution of trajectory lengths by task difficulty for answered trajectories over multiple iterations. For trajectories working on medium- and hard-difficulty tasks that exceeds 30 tasks during evaluation, we filter in only trajectories under 30 tasks to make the co… view at source ↗

**Figure 5.** Figure 5: Agreement between automated evaluators and human judgment. Rubric-based evaluation (with explicit criteria) consistently improves agreement over task-only evaluation, yielding higher accuracy and precision across LLMbased evaluators. Among the evaluators, GPT-4o shows the largest shift after adding the rubric: precision increases the most while recall drops, indicating that the rubric makes GPT-4o apply s… view at source ↗

**Figure 6.** Figure 6: Asynchrony eliminates burst-idle behavior in web rollouts. Left: WebGym implements an asynchronous rollout system that (1) shortens single-trajectory collection duration by isolating rollout processes and (2) allows new rollout processes to join early by replacing batched inference with a process pool; the example shown is a toy case with 3 available environment buckets but 4 tasks waiting to roll out. Rig… view at source ↗

**Figure 7.** Figure 7: Benchmarking speed and throughput of the WebGym asynchronous rollout framework. Left: the WebGym framework boosts the rollout speed up significantly with a 4x-5x speedup. This figure shows time cost and average CPU utility percentage when reaching all 256 environments running w.r.t. different amounts of CPUs, while using the same amount of GPU resources for running inference on the VLM-based agent. The lim… view at source ↗

**Figure 8.** Figure 8: Ablations on base models, prompting, and action filtering. (left) test-set success rate curves of different models (Qwen3-VLInstruct-8B, Qwen3-VL-Thinking-8B, GPT-4o, and GPT-5-Thinking), and of Qwen3-VL-Instruct-8B under different constraints (either removing the memory prompt or removing the repetition penalty during RL). (right) test-set same-screenshot (repetitive inefficient action) rate and trajecto… view at source ↗

**Figure 9.** Figure 9: Exploring scaling dimensions of WebGym and difficulty-aware training. Test-set success rate curves under different variations to training: removing domains (“exclude domains”), tuning difficulty ratios (“uniform sampling”, “biased to hard”, “only easy”, “only medium”), and shortening the train-time step budget (“shorten horizon”). Results are reported for (a) Overall (all tasks), (b) Easy (1–3), (c) Medium… view at source ↗

**Figure 10.** Figure 10: WebGym implements an operation-specific queue system that balances the CPU and GPU machine usages. Here we illustrate the computational model of this pool system in the left subplot, and the comparison of the two queue designs on the right. We represent the centralized CPU queue with Queue, and the cascading queue system for with the initial letter of the request type, specifically, Navigating to a webpag… view at source ↗

**Figure 11.** Figure 11: Left: Number of trajectories successfully finished and crashed out under high CPU load when [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗

**Figure 12.** Figure 12: Trajectory for the comparison between vanilla evaluation and rubric-based evaluation. [PITH_FULL_IMAGE:figures/full_fig_p026_12.png] view at source ↗

read the original abstract

We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent's own interaction traces (rollouts), using task rewards as feedback to guide learning. To enable scaling RL, we speed up sampling of trajectories in WebGym by developing a high-throughput asynchronous rollout system, designed specifically for web agents. Our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set breadth, depth, and size, which results in continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym results in an improvement in success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking that achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike many other prior works on training visual web agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WebGym shows that scaling to 300k real website tasks with a fast async rollout system can push an 8B open model to 42.9% success on never-before-seen sites, beating GPT-4o.

read the letter

WebGym shows that scaling to 300k real website tasks with a fast async rollout system can push an 8B open model to 42.9% success on never-before-seen sites, beating GPT-4o. That's the main result worth knowing. The new pieces are the size of the environment—nearly 300,000 rubric-scored tasks across diverse real sites—and the 4-5x speedup in trajectory sampling that makes RL training practical. They also demonstrate that simply training on more of their own rollouts keeps lifting performance without complex methods. Releasing this as open source is helpful for the field. The work holds up on the scaling angle: bigger task sets and the async system are clear engineering wins. The OOD improvement is reported directly on held-out sites, which avoids some circularity issues. Where it is thinner is the evaluation side. Rubric-based success on live websites is the right direction, but without reported agreement rates between raters or sample rubrics, it's difficult to gauge how reliable the 42.9% number is. The test set is never-seen sites, yet no quantitative check on site diversity or structural novelty is given, so the generalization story could be stronger with those numbers. Hyperparameter details and task generation process are also light in the abstract, though the full paper may fill them in. This is for researchers focused on web navigation agents or large-scale interactive RL environments. Readers interested in open alternatives to closed models on practical tasks will find the numbers useful. I would send it for peer review. The scale and the concrete OOD lift make it worth referee time, even with the need for more evaluation transparency.

Referee Report

3 major / 2 minor

Summary. The paper introduces WebGym, the largest open-source environment for training visual web agents, containing nearly 300,000 rubric-evaluated tasks across diverse real-world websites. It presents a simple RL recipe that trains on the agent's own rollouts with task rewards, accelerated by a custom high-throughput asynchronous rollout system achieving 4-5x speedup. Scaling task breadth and depth yields continued gains; fine-tuning Qwen-3-VL-8B-Instruct on WebGym raises success rate on an explicitly out-of-distribution test set of never-seen sites from 26.2% to 42.9%, outperforming GPT-4o (27.1%) and GPT-5-Thinking (29.8%).

Significance. If the empirical claims hold, the work supplies the first large-scale, open, real-website training resource for visual web agents together with a reproducible RL scaling recipe and a concrete demonstration that such training can surpass closed proprietary models on generalization to unseen sites. The open release and the reported 4-5x rollout acceleration are concrete assets for the community.

major comments (3)

[Evaluation] Evaluation section: the 42.9% OOD success rate is the central empirical claim, yet the manuscript supplies no rubric examples, construction procedure, or inter-rater reliability statistics. Without these, it is impossible to judge whether the reported lift reflects genuine task completion or lenient partial-credit scoring.
[Experiments] Experiments section: no standard deviation across runs, no exact fine-tuning hyperparameters (learning rate, batch size, number of epochs), and no ablation on rollout count are reported. These omissions make the 26.2% → 42.9% delta difficult to reproduce or attribute confidently to WebGym scaling.
[Task Set Construction] Task Set Construction section: the OOD test set is asserted to contain only websites never seen in training, but no quantitative diversity metrics (site-category coverage, DOM-structure similarity scores, or navigation-pattern overlap) are provided. This leaves the generalization claim vulnerable to the possibility that test sites share latent patterns with the training distribution.

minor comments (2)

[Abstract] The abstract states 'nearly 300,000 tasks'; the main text should give the exact count and a breakdown by difficulty level and website category.
[Rollout System] The asynchronous rollout system is credited with a 4-5x speedup, but the paper would benefit from a brief pseudocode or timing table showing where the gains arise (e.g., parallel browser instances, caching).

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the thorough review. We have carefully considered each comment and revised the manuscript accordingly to enhance the transparency and reproducibility of our results. Below we provide point-by-point responses to the major comments.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the 42.9% OOD success rate is the central empirical claim, yet the manuscript supplies no rubric examples, construction procedure, or inter-rater reliability statistics. Without these, it is impossible to judge whether the reported lift reflects genuine task completion or lenient partial-credit scoring.

Authors: We agree that additional details on the evaluation rubrics are necessary to substantiate the central claim. In the revised manuscript, we have expanded the Evaluation section to include multiple example rubrics for representative tasks, a step-by-step description of the rubric construction process involving domain experts, and inter-rater reliability results from a pilot study with three annotators on 150 tasks yielding a Cohen's kappa of 0.87. We believe these additions address the concern regarding potential lenient scoring. revision: yes
Referee: [Experiments] Experiments section: no standard deviation across runs, no exact fine-tuning hyperparameters (learning rate, batch size, number of epochs), and no ablation on rollout count are reported. These omissions make the 26.2% → 42.9% delta difficult to reproduce or attribute confidently to WebGym scaling.

Authors: We acknowledge the importance of these experimental details for reproducibility. The revised paper now reports standard deviations computed over three independent training runs for the main results. We have also included the precise fine-tuning hyperparameters used: a learning rate of 2e-5 with cosine decay, batch size of 64, and training for 4 epochs. Furthermore, we added an ablation experiment showing performance as a function of rollout count (from 50k to 300k tasks), demonstrating that gains continue to scale with more data. revision: yes
Referee: [Task Set Construction] Task Set Construction section: the OOD test set is asserted to contain only websites never seen in training, but no quantitative diversity metrics (site-category coverage, DOM-structure similarity scores, or navigation-pattern overlap) are provided. This leaves the generalization claim vulnerable to the possibility that test sites share latent patterns with the training distribution.

Authors: We thank the referee for this suggestion. To better support the out-of-distribution claim, we have incorporated quantitative metrics in the revised Task Set Construction section. This includes the coverage of site categories (with percentages for train and test), average DOM similarity scores using normalized tree-edit distance (0.12 for test vs train), and navigation pattern overlap via cosine similarity of action embedding vectors (average 0.18). These low similarity scores support that the test sites are indeed distinct from the training distribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical results on OOD test set

full rationale

The paper reports direct empirical measurements: construction of WebGym (~300k tasks with rubric evaluations), RL training on agent rollouts, and success-rate evaluation on an explicitly out-of-distribution test set of never-seen websites (26.2% → 42.9%). No equations, parameter fits, or self-referential definitions reduce the reported success rates to inputs by construction. The pipeline is standard RL scaling plus held-out evaluation and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard reinforcement-learning assumptions plus the domain assumption that rubric scores on real websites provide reliable learning signals; no explicit free parameters or new invented entities are introduced in the abstract.

axioms (1)

domain assumption Rubric-based task rewards provide sufficient and accurate feedback for policy improvement
The RL recipe uses task rewards from rubrics to guide learning on real websites.

pith-pipeline@v0.9.0 · 5560 in / 1428 out tokens · 36973 ms · 2026-05-16T17:27:17.993095+00:00 · methodology

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
cs.AI 2026-05 unverdicted novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management
cs.AI 2026-04 unverdicted novelty 7.0

RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

Spreadsheet-RL applies RL fine-tuning and a custom Gym environment to raise LLM agent Pass@1 scores on spreadsheet benchmarks from roughly 8-12% to 17-23%.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI 2026-04 unverdicted novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
cs.AI 2026-05 unverdicted novelty 5.0

Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 5 Pith papers

[1]

Include the current step

Memory: facts you would like to memorize for future actions in json format. Include the current step

work page
[2]

Include progress of the current step

Progress: Decompose the task into subtasks and what has been finished so far with json format. Include progress of the current step

work page
[3]

Intention: clearly state which subtask you’re working on at this step with the json key

work page
[4]

Action: a short sentence describing what to do in the UI to accomplish the next subtask

work page
[5]

Rules: − Output exactly in the order: Memory, Progress, Intention, Action,<tool call>

A single<tool call>...</tool call>block containing only the JSON:{”name”:<function−name>, ”arguments”: <args−json−object>}. Rules: − Output exactly in the order: Memory, Progress, Intention, Action,<tool call>. − You MUST use json format for the Memory and Progress parts. − Example Task: ”Search and compare the prices and locations of product 1 and produc...

work page
[6]

WebGym: Scaling Training Environments for Visual Web Agents30

Action: a short sentence describing what to do in the UI. WebGym: Scaling Training Environments for Visual Web Agents30

work page
[7]

Rules: − Output exactly in the order: Action,<tool call>

A single<tool call>...</tool call>block containing only the JSON:{”name”:<function−name>, ”arguments”: <args−json−object>}. Rules: − Output exactly in the order: Action,<tool call>. − Action describes the high−level intention of the tool call within a single sentence. − Do not output anything else outside those two parts. {{/if w/o Memory Prompt}} =======...

work page 2025
[8]

**Reasoning**: [One sentence explanation]

work page
[9]

**Decision**: [YES or NO] User Prompt: **Task**:{task} **Key Points for Task Completion**: {eval rubric} The snapshot of the web page is shown in the image. Does this image contain relevant information for the task? (Answer YES unless it’s completely irrelevant) Where: −{task}= task.task name (the task description) −{eval rubric}= List of all criteria fro...

work page
[10]

Task Instruction: The original task description (provided for CONTEXT ONLY)

work page
[11]

Fact Group: A group of related facts decomposed from the task instruction (provided for CONTEXT ONLY)

work page
[12]

Fact to Check: A specific fact that you need to verify (THIS IS YOUR PRIMARY FOCUS)

work page
[13]

Trajectory: A complete list of observations and actions that were taken by the agent

work page
[14]

You are NOT checking the agent’s response − only whether the screenshots contain evidence for the fact

Result Screenshots: Visual representation of the screen showing the result or intermediate state CRITICAL: Your judgment should ONLY focus on whether the FACT TO CHECK can be verified by the screenshots. You are NOT checking the agent’s response − only whether the screenshots contain evidence for the fact. Guidelines for evaluation: −− Your primary respon...

work page
[15]

Analysis: [Describe what evidence you see in the screenshots related to the fact to check]

work page
[16]

Evaluation: (MUST end with line ”2

Verdict: [SUCCESS if the fact is verified by screenshots, NOT SUCCESS otherwise] ================================================================================ MESSAGE 2: USER ================================================================================ ===Your Turn=== Task Instruction (for context only): [task instruction] Fact Group (for context on...

work page
[17]

Task Instruction: The task the agent was trying to complete

work page
[18]

Final Response: The agent’s answer/response to the task

work page
[19]

Agents frequently hallucinate or make up answers that are not verified by what they actually saw

Result Screenshots: Visual representation of the screens the agent visited CRITICAL: Your job is to check if the agent’s response contains information that is NOT shown in the screenshots. Agents frequently hallucinate or make up answers that are not verified by what they actually saw. Guidelines for evaluation: −− Check whether EVERY claim in the agent’s...

work page
[20]

Claims in response: [List the specific claims/facts in the agent’s response]

work page
[21]

Screenshot verification: [For each claim, state whether it appears in the screenshots]

work page
[22]

Evaluation: (MUST end with line ”3

Verdict: [SUCCESS if ALL claims are verified by screenshots, NOT SUCCESS if ANY claim is not verified] ================================================================================ MESSAGE 2: USER ================================================================================ ===Your Turn=== Task Instruction: [task instruction] Agent’s Final Response:...

work page
[23]

Analysis: [Describe what you see in the screenshots − any EXPLICIT signs of anti−bot blocking measures?]

work page
[24]

Did the website block the agent? (MUST end with line ”2

Blocked: [YES or NO] ================================================================================ MESSAGE 2: USER ================================================================================ ===Your Turn=== Task: [task] Trajectory: [trajectory] Screenshots: attached. Did the website block the agent? (MUST end with line ”2. Blocked: [YES or NO]”) W...

work page
[25]

Break down the task into hierarchical fact groups, where each group contains specific facts to verify

work page
[26]

Each fact represents ONE verifiable piece of information that can be checked in a trajectory

work page
[27]

Treat the entire computation as a SINGLE fact

The overall task difficulty equals the TOTAL NUMBER of all facts across all groups Key principles: − FIRST analyze the task to identify logical groupings of related information − For each group, list ALL specific facts that need to be verified − Facts should be AS DETAILED AS POSSIBLE − break down complex requirements into individual checkable facts − Eac...

work page
[28]

Create a new task description that incorporates ONLY the requirements explicitly listed in the selected fact groups above

work page
[29]

The new task should read naturally and be self−contained

work page
[30]

Do NOT include requirements from fact groups that were not selected

work page
[31]

Maintain the original context and domain of the task

work page
[32]

**CRITICAL**: Do NOT add any information that is not explicitly present in the selected fact groups above, even if that information appears in the original task Important Guidelines: − If a fact group mentions ”concert” but does NOT mention ”upcoming” or ”in the US or Canada”, do NOT include those constraints in the generated task − ONLY use constraints a...

work page 2048

[1] [1]

Include the current step

Memory: facts you would like to memorize for future actions in json format. Include the current step

work page

[2] [2]

Include progress of the current step

Progress: Decompose the task into subtasks and what has been finished so far with json format. Include progress of the current step

work page

[3] [3]

Intention: clearly state which subtask you’re working on at this step with the json key

work page

[4] [4]

Action: a short sentence describing what to do in the UI to accomplish the next subtask

work page

[5] [5]

Rules: − Output exactly in the order: Memory, Progress, Intention, Action,<tool call>

A single<tool call>...</tool call>block containing only the JSON:{”name”:<function−name>, ”arguments”: <args−json−object>}. Rules: − Output exactly in the order: Memory, Progress, Intention, Action,<tool call>. − You MUST use json format for the Memory and Progress parts. − Example Task: ”Search and compare the prices and locations of product 1 and produc...

work page

[6] [6]

WebGym: Scaling Training Environments for Visual Web Agents30

Action: a short sentence describing what to do in the UI. WebGym: Scaling Training Environments for Visual Web Agents30

work page

[7] [7]

Rules: − Output exactly in the order: Action,<tool call>

A single<tool call>...</tool call>block containing only the JSON:{”name”:<function−name>, ”arguments”: <args−json−object>}. Rules: − Output exactly in the order: Action,<tool call>. − Action describes the high−level intention of the tool call within a single sentence. − Do not output anything else outside those two parts. {{/if w/o Memory Prompt}} =======...

work page 2025

[8] [8]

**Reasoning**: [One sentence explanation]

work page

[9] [9]

**Decision**: [YES or NO] User Prompt: **Task**:{task} **Key Points for Task Completion**: {eval rubric} The snapshot of the web page is shown in the image. Does this image contain relevant information for the task? (Answer YES unless it’s completely irrelevant) Where: −{task}= task.task name (the task description) −{eval rubric}= List of all criteria fro...

work page

[10] [10]

Task Instruction: The original task description (provided for CONTEXT ONLY)

work page

[11] [11]

Fact Group: A group of related facts decomposed from the task instruction (provided for CONTEXT ONLY)

work page

[12] [12]

Fact to Check: A specific fact that you need to verify (THIS IS YOUR PRIMARY FOCUS)

work page

[13] [13]

Trajectory: A complete list of observations and actions that were taken by the agent

work page

[14] [14]

You are NOT checking the agent’s response − only whether the screenshots contain evidence for the fact

Result Screenshots: Visual representation of the screen showing the result or intermediate state CRITICAL: Your judgment should ONLY focus on whether the FACT TO CHECK can be verified by the screenshots. You are NOT checking the agent’s response − only whether the screenshots contain evidence for the fact. Guidelines for evaluation: −− Your primary respon...

work page

[15] [15]

Analysis: [Describe what evidence you see in the screenshots related to the fact to check]

work page

[16] [16]

Evaluation: (MUST end with line ”2

Verdict: [SUCCESS if the fact is verified by screenshots, NOT SUCCESS otherwise] ================================================================================ MESSAGE 2: USER ================================================================================ ===Your Turn=== Task Instruction (for context only): [task instruction] Fact Group (for context on...

work page

[17] [17]

Task Instruction: The task the agent was trying to complete

work page

[18] [18]

Final Response: The agent’s answer/response to the task

work page

[19] [19]

Agents frequently hallucinate or make up answers that are not verified by what they actually saw

Result Screenshots: Visual representation of the screens the agent visited CRITICAL: Your job is to check if the agent’s response contains information that is NOT shown in the screenshots. Agents frequently hallucinate or make up answers that are not verified by what they actually saw. Guidelines for evaluation: −− Check whether EVERY claim in the agent’s...

work page

[20] [20]

Claims in response: [List the specific claims/facts in the agent’s response]

work page

[21] [21]

Screenshot verification: [For each claim, state whether it appears in the screenshots]

work page

[22] [22]

Evaluation: (MUST end with line ”3

Verdict: [SUCCESS if ALL claims are verified by screenshots, NOT SUCCESS if ANY claim is not verified] ================================================================================ MESSAGE 2: USER ================================================================================ ===Your Turn=== Task Instruction: [task instruction] Agent’s Final Response:...

work page

[23] [23]

Analysis: [Describe what you see in the screenshots − any EXPLICIT signs of anti−bot blocking measures?]

work page

[24] [24]

Did the website block the agent? (MUST end with line ”2

Blocked: [YES or NO] ================================================================================ MESSAGE 2: USER ================================================================================ ===Your Turn=== Task: [task] Trajectory: [trajectory] Screenshots: attached. Did the website block the agent? (MUST end with line ”2. Blocked: [YES or NO]”) W...

work page

[25] [25]

Break down the task into hierarchical fact groups, where each group contains specific facts to verify

work page

[26] [26]

Each fact represents ONE verifiable piece of information that can be checked in a trajectory

work page

[27] [27]

Treat the entire computation as a SINGLE fact

The overall task difficulty equals the TOTAL NUMBER of all facts across all groups Key principles: − FIRST analyze the task to identify logical groupings of related information − For each group, list ALL specific facts that need to be verified − Facts should be AS DETAILED AS POSSIBLE − break down complex requirements into individual checkable facts − Eac...

work page

[28] [28]

Create a new task description that incorporates ONLY the requirements explicitly listed in the selected fact groups above

work page

[29] [29]

The new task should read naturally and be self−contained

work page

[30] [30]

Do NOT include requirements from fact groups that were not selected

work page

[31] [31]

Maintain the original context and domain of the task

work page

[32] [32]

**CRITICAL**: Do NOT add any information that is not explicitly present in the selected fact groups above, even if that information appears in the original task Important Guidelines: − If a fact group mentions ”concert” but does NOT mention ”upcoming” or ”in the US or Canada”, do NOT include those constraints in the generated task − ONLY use constraints a...

work page 2048