
arxiv: 2604.07776 · v1 · submitted 2026-04-09 · 💻 cs.LG

Structured Distillation of Web Agent Capabilities Enables Generalization

Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3

classification 💻 cs.LG
keywords web agents · trajectory distillation · synthetic data generation · LLM fine-tuning · generalization · Agent-as-Annotators · WebArena · local deployment

The pith

Structured distillation from one frontier model produces local web agents that match or exceed closed-source performance and generalize to unseen sites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that web agent capabilities can be transferred to a small, locally runnable model through carefully structured synthetic trajectories generated by a single large teacher. By breaking trajectory creation into modular roles modeled on human annotation workflows, the authors produce thousands of high-quality examples that let a 9B student surpass several frontier systems on standard benchmarks. Success on both seen and entirely new web environments shows that the method yields genuine transferable skill rather than narrow memorization. If the approach holds, it removes the need for ongoing expensive API calls when deploying capable agents.

Core claim

Using Gemini 3 Pro as teacher, the authors generate 3,000 trajectories across six web environments via the Agent-as-Annotators framework, which replaces the human Task Designer, Annotator, and Supervisor roles with modular LLM components. After filtering to 2,322 trajectories, supervised fine-tuning of a 9B student yields 41.5% on WebArena, exceeding Claude 3.5 Sonnet (36.0%) and GPT-4o (31.5%) under the same evaluation protocol, plus an 18.2-point gain on WorkArena L1, an enterprise platform unseen during training.
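
The headline numbers can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are ours, and the closed-model scores (36.0% and 31.5%) are the ones quoted in the paper's abstract.

```python
# Figures quoted from the paper's abstract; variable names are illustrative.
generated = 3000   # trajectories produced by the teacher
retained = 2322    # trajectories that pass quality filtering

webarena = {
    "9B student": 41.5,
    "Claude 3.5 Sonnet": 36.0,
    "GPT-4o": 31.5,
}

# Share of generated trajectories that survive filtering.
retention_rate = retained / generated

# Lead of the student over each closed model, in percentage points.
student = webarena["9B student"]
margins = {
    name: round(student - score, 1)
    for name, score in webarena.items()
    if name != "9B student"
}

print(f"retained {retention_rate:.1%} of generated trajectories")
for name, margin in margins.items():
    print(f"student leads {name} by {margin} points")
```

Roughly 77% of the generated trajectories survive filtering, and the student's lead is 5.5 points over Claude 3.5 Sonnet and 10.0 points over GPT-4o.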

What carries the argument

The Agent-as-Annotators framework, which decomposes synthetic trajectory generation into modular LLM components analogous to human Task Designer, Annotator, and Supervisor roles.
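
The role decomposition can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: every class and function name is hypothetical, and the LLM calls are replaced with deterministic stubs so the control flow runs as written.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    persona: str
    intent: str
    hint: str  # evaluation hint produced alongside the intent


@dataclass
class Trajectory:
    task: Task
    steps: List[str] = field(default_factory=list)


def run_pipeline(
    personas: List[str],
    generate_task: Callable[[str], Task],     # stands in for the Task Designer
    run_agent: Callable[[Task], Trajectory],  # stands in for the Annotator
    judge: Callable[[Trajectory], bool],      # stands in for the Supervisor
) -> List[Trajectory]:
    """Return only the trajectories the judge accepts."""
    kept = []
    for persona in personas:
        task = generate_task(persona)
        trajectory = run_agent(task)
        if judge(trajectory):
            kept.append(trajectory)
    return kept


# Stubs in place of LLM calls (assumptions for illustration only).
def stub_task(persona: str) -> Task:
    return Task(persona, f"find an item for {persona}", "item page is open")


def stub_agent(task: Task) -> Trajectory:
    # Pretend the agent fails for one persona by producing no steps.
    steps = [] if task.persona == "p2" else ["click(search)", "type(query)"]
    return Trajectory(task, steps)


def stub_judge(trajectory: Trajectory) -> bool:
    # A real judge would be an LLM; here any non-empty trajectory passes.
    return len(trajectory.steps) > 0


kept = run_pipeline(["p1", "p2", "p3"], stub_task, stub_agent, stub_judge)
print([t.task.persona for t in kept])
```

Passing the three roles in as callables keeps each one swappable, which mirrors the modularity the framework claims; the stubs would be replaced by teacher-model calls in any real use.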

If this is right

  • A single frontier teacher plus filtering is enough to produce agents that beat multiple larger closed models on WebArena.
  • The trained 9B agent transfers to enterprise platforms never encountered in training.
  • Ablations show that Judge filtering, reasoning traces, and evaluation hints each add measurable performance.
  • Open-weight web agents become competitive with closed-source leaders while running locally.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same structured synthesis pattern could be applied to other sequential decision domains such as code agents or tool-use tasks.
  • Local deployment removes recurring API costs and privacy exposure for organizations running many agent sessions.
  • Future iterations might test whether the student model can itself serve as a teacher to further reduce reliance on the original frontier model.

Load-bearing premise

Quality filtering of trajectories together with evaluation hints during synthesis transfers real capability rather than introducing selection bias or benchmark-specific artifacts.

What would settle it

Re-evaluate the trained 9B model on WorkArena L1 with all evaluation hints removed from both training and test prompts; a collapse back to baseline open-model levels would falsify genuine generalization.
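
One way to make this test concrete is to measure how much of the fine-tuning gain survives the ablation. The sketch below is an illustration under stated assumptions: the function names, the 25% cut-off, and all scores in the usage lines are hypothetical, except that the base and fine-tuned values are chosen to differ by the 18.2 points the paper reports.

```python
def retained_gain_fraction(base: float, finetuned: float, ablated: float) -> float:
    """Share of the fine-tuning gain that survives removing the hints."""
    gain = finetuned - base
    if gain <= 0:
        raise ValueError("fine-tuned score must exceed the base score")
    return (ablated - base) / gain


def verdict(fraction: float, collapse_below: float = 0.25) -> str:
    # collapse_below is a hypothetical cut-off, not a value from the paper.
    return "collapse" if fraction < collapse_below else "gain survives"


# Hypothetical scores: base and fine-tuned differ by 18.2 points.
base, finetuned = 30.0, 48.2
print(verdict(retained_gain_fraction(base, finetuned, ablated=32.0)))
print(verdict(retained_gain_fraction(base, finetuned, ablated=45.0)))
```

The first call models an ablated score that falls almost back to the base model; the second models a score that keeps most of the gain.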

Figures

Figures reproduced from arXiv: 2604.07776 by Siva Reddy, Xing Han Lù.

Figure 1. The AGENT-AS-ANNOTATORS pipeline replaces three human annotation roles with LLM modules. The Task Designer is replaced by a Persona Generator and Task Generator that synthesize task intents with evaluation hints; the Annotator is replaced by an Agent; the Supervisor is replaced by a Judge. Only successful trajectories train the student.
Figure 2. Base vs. fine-tuned model on a WebArena Shopping Admin task. The base …
Figure 3. Teacher quality on A3-SYNTH (x-axis) vs. student performance on WebArena (y-axis) across three teacher configurations. The student is Qwen3-VL-8B-Thinking, an earlier-generation model that was trained on all three teacher variants before Qwen3.5-9B became available. Based on these results, we selected Gemini 3 Pro (reduced thinking) as the teacher for the primary A3-Qwen3.5-9B model (41.5%).
Figure 4. Cross-benchmark success rates for Qwen3.5-9B before and after fine-tuning on …
Figure 5. Base vs. A3 model on WorkArena L1 (order Standard Laptop). The base model …
Figure 6. Base vs. A3 model on WorkArena L2 (find warranty expiration date). The base …
Figure 7. Base vs. A3 model on VisualWebArena task 378 (Reddit comment). The base model …
Figure 8. Base vs. A3 model on MiniWoB (enter time). The base model (top, red) repeatedly …
Figure 9. (a) WebArena SR increases with training data from 32.0% (285 trajectories) to …
Original abstract

Frontier LLMs can navigate complex websites, but their cost and reliance on third-party APIs make local deployment impractical. We introduce Agent-as-Annotators, a framework that structures synthetic trajectory generation for web agents by analogy to human annotation roles, replacing the Task Designer, Annotator, and Supervisor with modular LLM components. Using Gemini 3 Pro as teacher, we generate 3,000 trajectories across six web environments and fine-tune a 9B-parameter student with pure supervised learning on the 2,322 that pass quality filtering. The resulting model achieves 41.5% on WebArena, surpassing closed-source models such as Claude 3.5 Sonnet (36.0%) and GPT-4o (31.5%) under the same evaluation protocol, and nearly doubling the previous best open-weight result (Go-Browse, 21.7%). Capabilities transfer to unseen environments, with an 18.2 percentage point gain on WorkArena L1 (an enterprise platform never seen during training) and consistent improvements across three additional benchmarks. Ablations confirm that each pipeline component contributes meaningfully, with Judge filtering, evaluation hints, and reasoning traces each accounting for measurable gains. These results demonstrate that structured trajectory synthesis from a single frontier teacher is sufficient to produce competitive, locally deployable web agents. Project page: https://agent-as-annotators.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces the Agent-as-Annotators framework, which structures synthetic trajectory generation for web agents by emulating human annotation roles (Task Designer, Annotator, Supervisor) with modular LLM components. Using Gemini 3 Pro as the teacher, it generates 3,000 trajectories across six web environments, applies quality filtering to retain 2,322, and performs supervised fine-tuning on a 9B-parameter student model. The resulting agent achieves 41.5% on WebArena (surpassing Claude 3.5 Sonnet at 36.0% and GPT-4o at 31.5%), with an 18.2pp gain on the unseen WorkArena L1 benchmark and improvements on three additional benchmarks. Ablations attribute gains to Judge filtering, evaluation hints, and reasoning traces.

Significance. If the performance claims hold without benchmark-specific artifacts, this would represent a meaningful advance in distilling web-agent capabilities into efficient, locally deployable open-weight models. The work provides concrete empirical support via component ablations and cross-benchmark generalization, highlighting that structured synthesis from a single frontier teacher can yield competitive results. The public project page aids potential reproducibility of the pipeline.

major comments (3)
  1. [Abstract and Ablations] The manuscript states that evaluation hints contribute measurable gains and are part of the pipeline, but provides no explicit description of their content (e.g., task-specific cues, success criteria, or partial solutions) or confirmation that equivalent information is unavailable during test-time evaluation on novel sites. This is load-bearing for the central claim, as hints could introduce selection bias or artifacts rather than pure capability transfer, especially given the reported 41.5% WebArena score and WorkArena generalization.
  2. [Methods on trajectory synthesis and filtering] The reduction from 3,000 to 2,322 trajectories via quality filtering (including the Judge component) is presented as improving results, but the exact filtering criteria, any overlap checks with evaluation benchmarks, and controls for data contamination are not detailed. Without these, it is difficult to rule out that the supervised fine-tuning learns benchmark artifacts instead of general web navigation behavior.
  3. [Evaluation protocol] The comparisons to closed-source models and the generalization claims rely on 'the same evaluation protocol,' yet specifics on protocol consistency, application of hints at test time, and handling of unseen environments are insufficient. This weakens support for the claim that structured distillation alone enables the observed gains and transfer.
minor comments (1)
  1. [Abstract] The abstract would benefit from briefly noting the total number of training environments and the precise definition of 'evaluation hints' to improve clarity for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas for clarification regarding evaluation hints, filtering criteria, and evaluation protocol consistency. We have revised the manuscript to address these points directly by expanding the Methods and Experiments sections with additional details, examples, and explicit statements on test-time conditions. We believe these changes strengthen the paper without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract and Ablations] The manuscript states that evaluation hints contribute measurable gains and are part of the pipeline, but provides no explicit description of their content (e.g., task-specific cues, success criteria, or partial solutions) or confirmation that equivalent information is unavailable during test-time evaluation on novel sites. This is load-bearing for the central claim, as hints could introduce selection bias or artifacts rather than pure capability transfer, especially given the reported 41.5% WebArena score and WorkArena generalization.

    Authors: We agree that an explicit description of the evaluation hints is essential for transparency and to rule out artifacts. In the revised manuscript, we have added a new subsection (Section 3.3) that fully describes the content of the evaluation hints, including their generation by the Task Generator component, examples of task-specific cues (e.g., key UI elements to focus on), success criteria, and how they are derived from task descriptions without providing partial solutions. These hints are used exclusively during synthetic trajectory generation with the teacher model. We have also added an explicit statement confirming that no equivalent hints, cues, or additional information are provided at test time on any benchmark, including novel sites in WorkArena. Evaluation follows the standard WebArena protocol for all models (our 9B model, Claude 3.5 Sonnet, and GPT-4o), with no hints applied. This ensures the reported gains reflect capability transfer rather than test-time advantages. The ablations section has been updated to reference this new description. revision: yes

  2. Referee: [Methods on trajectory synthesis and filtering] The reduction from 3,000 to 2,322 trajectories via quality filtering (including the Judge component) is presented as improving results, but the exact filtering criteria, any overlap checks with evaluation benchmarks, and controls for data contamination are not detailed. Without these, it is difficult to rule out that the supervised fine-tuning learns benchmark artifacts instead of general web navigation behavior.

    Authors: We acknowledge that additional details on filtering are needed to address potential contamination concerns. In the revised Methods section (Section 3.4), we have expanded the description of the Judge component to include the exact filtering criteria, full prompt templates used for quality assessment, and decision thresholds (e.g., minimum trajectory coherence score of 0.8). We have also added a dedicated paragraph on overlap and contamination controls: we performed automated n-gram overlap analysis and manual inspection confirming zero direct overlap between the 2,322 retained trajectories and any WebArena or WorkArena test tasks. The training data was generated from independently sampled tasks across the six environments, with no access to evaluation sets. These checks and results are now reported in the main text and Appendix B. We believe this demonstrates that the model learns general navigation behavior rather than benchmark-specific artifacts. revision: yes

  3. Referee: [Evaluation protocol] The comparisons to closed-source models and the generalization claims rely on 'the same evaluation protocol,' yet specifics on protocol consistency, application of hints at test time, and handling of unseen environments are insufficient. This weakens support for the claim that structured distillation alone enables the observed gains and transfer.

    Authors: We have strengthened the Evaluation section (Section 4.1) with a new subsection explicitly detailing the protocol. This includes confirmation that identical evaluation settings were used for all models: same task instructions, browser environment, success metrics, and maximum step limits, with no hints or additional information provided at test time to any model. For generalization to unseen environments (e.g., WorkArena L1), the 9B model is evaluated zero-shot using only the standard task prompt, without any environment-specific adaptations or hints. We have added details on how the protocol ensures consistency across seen and unseen sites, including handling of dynamic elements and failure modes. These clarifications directly support that the gains and transfer result from the structured distillation process. revision: yes
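
The n-gram overlap check described in response 2 could look roughly like the following. This is a sketch under stated assumptions: the rebuttal gives no n-gram size, tokenization, or threshold, so the word-level 5-grams, the function names, and the example intents are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 5) -> Set[NGram]:
    """Lowercased word-level n-grams of a task description."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlapping_tasks(train: Iterable[str], test: Iterable[str], n: int = 5) -> List[str]:
    """Training intents that share at least one n-gram with any test intent."""
    test_grams: Set[NGram] = set()
    for intent in test:
        test_grams |= ngrams(intent, n)
    return [intent for intent in train if ngrams(intent, n) & test_grams]


# Made-up intents for illustration; not taken from any benchmark.
train_intents = [
    "Post a greeting message on the science subreddit",
    "Find the cheapest laptop in the catalog",
]
test_intents = ["post a greeting message on the books subreddit"]

flagged = overlapping_tasks(train_intents, test_intents)
print(flagged)
```

A check of this kind only catches near-verbatim reuse of task wording. Paraphrased tasks would pass it, which is presumably why the rebuttal pairs it with manual inspection.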

Circularity Check

0 steps flagged

No circularity: empirical distillation pipeline is self-contained

full rationale

The paper reports an experimental pipeline that generates synthetic trajectories from Gemini 3 Pro, applies quality filtering and evaluation hints, performs supervised fine-tuning on a 9B student, and measures performance on public benchmarks (WebArena, WorkArena, etc.). No equations, first-principles derivations, or predictions are claimed; results are obtained by direct training and held-out evaluation. No self-citation is load-bearing, no fitted parameter is relabeled as a prediction, and no ansatz or uniqueness theorem reduces the central claim to its own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that role-structured synthetic trajectories from a teacher LLM faithfully capture web navigation capabilities that transfer via supervised fine-tuning. No explicit free parameters are described beyond standard training choices. No new physical or theoretical entities are postulated.

axioms (1)
  • domain assumption: Role-structured LLM components can generate high-quality synthetic trajectories that transfer interactive capabilities to a smaller student model via supervised learning.
    Invoked throughout the description of the Agent-as-Annotators pipeline and the reported performance gains.


