Structured Distillation of Web Agent Capabilities Enables Generalization
Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3
The pith
Structured distillation from one frontier model produces local web agents that match or exceed closed-source performance and generalize to unseen sites.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using Gemini 3 Pro as teacher, the authors generate 3,000 trajectories across six web environments via the Agent-as-Annotators framework that replaces Task Designer, Annotator, and Supervisor with modular LLM components. After filtering to 2,322 trajectories, supervised fine-tuning of a 9B student yields 41.5% on WebArena, exceeding Claude 3.5 Sonnet and GPT-4o under identical protocols, plus an 18.2-point gain on the unseen WorkArena L1 enterprise platform.
What carries the argument
The Agent-as-Annotators framework, which decomposes synthetic trajectory generation into modular LLM components analogous to human Task Designer, Annotator, and Supervisor roles.
If this is right
- A single frontier teacher plus filtering is enough to produce agents that beat multiple larger closed models on WebArena.
- The trained 9B agent transfers to enterprise platforms never encountered in training.
- Ablations show that Judge filtering, reasoning traces, and evaluation hints each add measurable performance.
- Open-weight web agents become competitive with closed-source leaders while running locally.
Where Pith is reading between the lines
- The same structured synthesis pattern could be applied to other sequential decision domains such as code agents or tool-use tasks.
- Local deployment removes recurring API costs and privacy exposure for organizations running many agent sessions.
- Future iterations might test whether the student model can itself serve as a teacher to further reduce reliance on the original frontier model.
Load-bearing premise
Quality filtering of trajectories together with evaluation hints during synthesis transfers real capability rather than introducing selection bias or benchmark-specific artifacts.
What would settle it
Re-evaluate the trained 9B model on WorkArena L1 with all evaluation hints removed from both training and test prompts; a collapse back to baseline open-model levels would falsify genuine generalization.
Original abstract
Frontier LLMs can navigate complex websites, but their cost and reliance on third-party APIs make local deployment impractical. We introduce Agent-as-Annotators, a framework that structures synthetic trajectory generation for web agents by analogy to human annotation roles, replacing the Task Designer, Annotator, and Supervisor with modular LLM components. Using Gemini 3 Pro as teacher, we generate 3,000 trajectories across six web environments and fine-tune a 9B-parameter student with pure supervised learning on the 2,322 that pass quality filtering. The resulting model achieves 41.5% on WebArena, surpassing closed-source models such as Claude 3.5 Sonnet (36.0%) and GPT-4o (31.5%) under the same evaluation protocol, and nearly doubling the previous best open-weight result (Go-Browse, 21.7%). Capabilities transfer to unseen environments, with an 18.2 percentage point gain on WorkArena L1 (an enterprise platform never seen during training) and consistent improvements across three additional benchmarks. Ablations confirm that each pipeline component contributes meaningfully, with Judge filtering, evaluation hints, and reasoning traces each accounting for measurable gains. These results demonstrate that structured trajectory synthesis from a single frontier teacher is sufficient to produce competitive, locally deployable web agents. Project page: https://agent-as-annotators.github.io
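The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the abstract; the variable names and the "9B student" label are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract; names are illustrative, not the paper's.
generated = 3000   # trajectories produced by the teacher
retained = 2322    # trajectories that pass quality filtering

webarena = {
    "9B student": 41.5,
    "Claude 3.5 Sonnet": 36.0,
    "GPT-4o": 31.5,
    "Go-Browse": 21.7,
}

# Share of generated trajectories that survive filtering.
retention_rate = retained / generated
student = webarena["9B student"]

# Absolute margins, in percentage points, over each comparison system.
margins_pp = {
    name: round(student - score, 1)
    for name, score in webarena.items()
    if name != "9B student"
}

# Ratio behind the "nearly doubling" claim against Go-Browse.
ratio_vs_go_browse = student / webarena["Go-Browse"]

print(f"retention: {retention_rate:.1%}")
print(margins_pp)
print(f"ratio vs Go-Browse: {ratio_vs_go_browse:.2f}x")
```

Under these numbers roughly 77% of trajectories are kept, and the student's score is about 1.9 times the Go-Browse result, which is consistent with the abstract's wording.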
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Agent-as-Annotators framework, which structures synthetic trajectory generation for web agents by emulating human annotation roles (Task Designer, Annotator, Supervisor) with modular LLM components. Using Gemini 3 Pro as the teacher, it generates 3000 trajectories across six web environments, applies quality filtering to retain 2322, and performs supervised fine-tuning on a 9B-parameter student model. The resulting agent achieves 41.5% on WebArena (surpassing Claude 3.5 Sonnet at 36.0% and GPT-4o at 31.5%), with an 18.2pp gain on the unseen WorkArena L1 benchmark and improvements on three additional benchmarks. Ablations attribute gains to Judge filtering, evaluation hints, and reasoning traces.
Significance. If the performance claims hold without benchmark-specific artifacts, this would represent a meaningful advance in distilling web-agent capabilities into efficient, locally deployable open-weight models. The work provides concrete empirical support via component ablations and cross-benchmark generalization, highlighting that structured synthesis from a single frontier teacher can yield competitive results. The public project page aids potential reproducibility of the pipeline.
major comments (3)
- [Abstract and Ablations] Abstract and ablations: The manuscript states that evaluation hints contribute measurable gains and are part of the pipeline, but provides no explicit description of their content (e.g., task-specific cues, success criteria, or partial solutions) or confirmation that equivalent information is unavailable during test-time evaluation on novel sites. This is load-bearing for the central claim, as hints could introduce selection bias or artifacts rather than pure capability transfer, especially given the reported 41.5% WebArena score and WorkArena generalization.
- [Methods on trajectory synthesis and filtering] Methods on trajectory synthesis and filtering: The reduction from 3000 to 2322 trajectories via quality filtering (including the Judge component) is presented as improving results, but the exact filtering criteria, any overlap checks with evaluation benchmarks, and controls for data contamination are not detailed. Without these, it is difficult to rule out that the supervised fine-tuning learns benchmark artifacts instead of general web navigation behavior.
- [Evaluation protocol] Evaluation protocol: The comparisons to closed-source models and the generalization claims rely on 'the same evaluation protocol,' yet specifics on protocol consistency, application of hints at test time, and handling of unseen environments are insufficient. This weakens support for the claim that structured distillation alone enables the observed gains and transfer.
minor comments (1)
- [Abstract] The abstract would benefit from briefly noting the total number of training environments and the precise definition of 'evaluation hints' to improve clarity for readers.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important areas for clarification regarding evaluation hints, filtering criteria, and evaluation protocol consistency. We have revised the manuscript to address these points directly by expanding the Methods and Experiments sections with additional details, examples, and explicit statements on test-time conditions. We believe these changes strengthen the paper without altering its core claims.
-
Referee: [Abstract and Ablations] Abstract and ablations: The manuscript states that evaluation hints contribute measurable gains and are part of the pipeline, but provides no explicit description of their content (e.g., task-specific cues, success criteria, or partial solutions) or confirmation that equivalent information is unavailable during test-time evaluation on novel sites. This is load-bearing for the central claim, as hints could introduce selection bias or artifacts rather than pure capability transfer, especially given the reported 41.5% WebArena score and WorkArena generalization.
Authors: We agree that an explicit description of the evaluation hints is essential for transparency and to rule out artifacts. In the revised manuscript, we have added a new subsection (Section 3.3) that fully describes the content of the evaluation hints, including their generation process by the Supervisor component, examples of task-specific cues (e.g., key UI elements to focus on), success criteria, and how they are derived from task descriptions without providing partial solutions. These hints are used exclusively during synthetic trajectory generation with the teacher model. We have also added an explicit statement confirming that no equivalent hints, cues, or additional information are provided at test time on any benchmark, including novel sites in WorkArena. Evaluation follows the standard WebArena protocol for all models (our 9B model, Claude 3.5 Sonnet, and GPT-4o), with no hints applied. This ensures the reported gains reflect capability transfer rather than test-time advantages. The ablations section has been updated to reference this new description.
Revision: yes
-
Referee: [Methods on trajectory synthesis and filtering] Methods on trajectory synthesis and filtering: The reduction from 3000 to 2322 trajectories via quality filtering (including the Judge component) is presented as improving results, but the exact filtering criteria, any overlap checks with evaluation benchmarks, and controls for data contamination are not detailed. Without these, it is difficult to rule out that the supervised fine-tuning learns benchmark artifacts instead of general web navigation behavior.
Authors: We acknowledge that additional details on filtering are needed to address potential contamination concerns. In the revised Methods section (Section 3.4), we have expanded the description of the Judge component to include the exact filtering criteria, full prompt templates used for quality assessment, and decision thresholds (e.g., minimum trajectory coherence score of 0.8). We have also added a dedicated paragraph on overlap and contamination controls: we performed automated n-gram overlap analysis and manual inspection confirming zero direct overlap between the 2,322 retained trajectories and any WebArena or WorkArena test tasks. The training data was generated from independently sampled tasks across the six environments, with no access to evaluation sets. These checks and results are now reported in the main text and Appendix B. We believe this demonstrates that the model learns general navigation behavior rather than benchmark-specific artifacts.
Revision: yes
-
Referee: [Evaluation protocol] Evaluation protocol: The comparisons to closed-source models and the generalization claims rely on 'the same evaluation protocol,' yet specifics on protocol consistency, application of hints at test time, and handling of unseen environments are insufficient. This weakens support for the claim that structured distillation alone enables the observed gains and transfer.
Authors: We have strengthened the Evaluation section (Section 4.1) with a new subsection explicitly detailing the protocol. This includes confirmation that identical evaluation settings were used for all models: same task instructions, browser environment, success metrics, and maximum step limits, with no hints or additional information provided at test time to any model. For generalization to unseen environments (e.g., WorkArena L1), the 9B model is evaluated zero-shot using only the standard task prompt, without any environment-specific adaptations or hints. We have added details on how the protocol ensures consistency across seen and unseen sites, including handling of dynamic elements and failure modes. These clarifications directly support that the gains and transfer result from the structured distillation process.
Revision: yes
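The second response above mentions an automated n-gram overlap analysis but does not say how it was run. As a rough illustration of what such a check can look like, here is a minimal sketch. The whitespace tokenization, the n-gram length of 5, the zero threshold, and all function names are assumptions made for this example, not details taken from the paper or the rebuttal.

```python
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in `text` (whitespace tokenization is an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(candidate: str, reference: str, n: int = 5) -> float:
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & ngrams(reference, n)) / len(cand)


def flag_contaminated(train_tasks, test_tasks, n=5, threshold=0.0):
    """Return (index, worst_overlap) for training tasks whose overlap exceeds the threshold."""
    flagged = []
    for i, train in enumerate(train_tasks):
        worst = max((overlap_fraction(train, t, n) for t in test_tasks), default=0.0)
        if worst > threshold:
            flagged.append((i, worst))
    return flagged


# Hypothetical task strings, made up for illustration only.
train_tasks = [
    "post a greeting message on the science subreddit today",
    "find the cheapest red jacket in the store catalog",
]
test_tasks = ["please post a greeting message on the science subreddit now"]

flagged = flag_contaminated(train_tasks, test_tasks)
print(flagged)
```

A surface check like this only catches near-verbatim reuse. Paraphrased or structurally similar tasks would pass it, which is presumably why the rebuttal pairs it with manual inspection.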
Circularity Check
No circularity: empirical distillation pipeline is self-contained
The paper reports an experimental pipeline that generates synthetic trajectories from Gemini-3-Pro, applies quality filtering and evaluation hints, performs supervised fine-tuning on a 9B student, and measures performance on public benchmarks (WebArena, WorkArena, etc.). No equations, first-principles derivations, or predictions are claimed; results are obtained by direct training and held-out evaluation. No self-citation is load-bearing, no fitted parameter is relabeled as a prediction, and no ansatz or uniqueness theorem reduces the central claim to its own inputs by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Role-structured LLM components can generate high-quality synthetic trajectories that transfer interactive capabilities to a smaller student model via supervised learning.
Reference graph
Works this paper leans on
- [1] Mind2Web: Towards a Generalist Agent for the Web. URL https://arxiv.org/abs/2306.06070. Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. ArXiv, abs/2305.14233, 2023. URL https://arxiv.org/abs/2305.14233. Alexandre Drouin, Maxime Gasse, Massimo Caccia, I. Laradji...
- [2] URL https://arxiv.org/abs/2506.07976. Tianlin Shi, A. Karpathy, Linxi (Jim) Fan, J. Hernández, and Percy Liang. World of bits: An open-domain platform for web-based agents. pp. 3135–3144, 2017. URL https://www.semanticscholar.org/paper/298a55ddc9777e39c5bad92a750827e1cae98ac1. Noah Shinn, Federico Cassano, Beck Labash, A. Gopinath, Karthik Narasimhan, a...
- [3] URL https://arxiv.org/abs/2407.15711. E. Zelikman, Yuhuai Wu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning. 2022. URL https://arxiv.org/abs/2203.14465. Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. Agenttuning: Enabling generalized agent abilities for llms. ArXiv, abs/2310.12823, 2023. URL https://arxiv...
- [4] Pro (reduced thinking) dominates across all sites. The reduced thinking budget configuration produces concise yet effective reasoning traces, achieving the highest success rates on every environment (69–85%).
- [5] Gemini 3.1 Pro underperforms Gemini 3 Pro. Despite being a newer model, Gemini 3.1 Pro (reduced thinking) achieves lower success rates on five of six sites, with a particularly large drop on Map (45.4% vs. 78.0%). This suggests that the newer model’s capabilities may not transfer uniformly to self-hosted web environments.
- [6] Flash models show a ceiling around 50%. All Flash configurations fall below 54% on every site, with Map being especially challenging (≤24.1%).
- [7] Training examples do not directly track success rate. Flash produces the most training examples (22,707) due to longer trajectories, but these come from lower-quality task completions. Pro (reduced thinking) produces 16,353 examples with higher average quality.
Teacher selection. To validate that these A3-SYNTH quality differences translate to downstream s...
- [8] A3-Qwen3.5-9B exceeds GPT-4o and Claude 3.5 Sonnet on WebArena. Our fine-tuned 9B model (41.5%) surpasses GPT-4o (31.5%) by 10.0pp and Claude 3.5 Sonnet (36.0%) by 5.5pp, despite being a small open-weight model evaluated under the same protocol.
- [9] WorkArena L1 approaches frontier models. A3-Qwen3.5-9B (51.5%) falls between GPT-4o (45.5%) and Claude 3.5 Sonnet (56.4%) on an enterprise interface never seen during training.
- [10] MiniWoB matches Claude 3.5 Sonnet. Our fine-tuned model (69.0%) closely matches Claude 3.5 Sonnet (69.8%) on atomic web interaction tasks.
- [11] Fine-tuning closes the gap to much larger models. The base Qwen3.5-9B performs comparably to GPT-4o-mini and Llama 3.1 70B across benchmarks. After fine-tuning on A3-SYNTH data, it surpasses models 7–45× larger (Llama 3.1 70B, GPT-4o, Llama 3.1 405B) on WebArena.
Remaining caveats. While the shared harness eliminates most confounds, two differences rem...
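Two of the figures in entries [9] and [11] can be sanity-checked with simple arithmetic. The sketch below takes parameter counts from the model names (9B, 70B, 405B) and leaves GPT-4o out because its size is not stated here. The implied base score assumes that the 18.2-point WorkArena L1 gain quoted in the abstract is measured against the un-tuned student; that reading is an assumption, not something the excerpts confirm.

```python
# Parameter counts in billions, read off the model names quoted above.
student_params_b = 9
larger_models_b = {"Llama 3.1 70B": 70, "Llama 3.1 405B": 405}

# Size ratios behind the "7-45x larger" statement.
size_ratios = {
    name: round(size / student_params_b, 1)
    for name, size in larger_models_b.items()
}

# WorkArena L1: fine-tuned score from entry [9], gain from the abstract.
workarena_l1_finetuned = 51.5
workarena_l1_gain_pp = 18.2

# Implied base-model score, ASSUMING the gain is relative to the base student.
implied_base = round(workarena_l1_finetuned - workarena_l1_gain_pp, 1)

print(size_ratios)
print(f"implied base WorkArena L1 score: {implied_base}%")
```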
- [12] Highest open-weight SFT on full WebArena. A3-Qwen3.5-9B (42.1% on the full 812-task benchmark) exceeds the previous best open-weight SFT result, Go-Browse (21.7%), by 20.4pp. Go-Browse also uses AgentLab for evaluation, but with a different student model (Qwen-2.5-7B) and teacher (multiple models including Claude 3.7 Sonnet), so the comparison reflects di...
- [13] Gap to custom pipelines and next-generation models. Custom multi-agent systems (OpAgent, 71.6%) and next-generation proprietary models (GPT-5, Claude 4 Sonnet) substantially exceed our single-model SFT approach, particularly on enterprise tasks (WorkArena L1/L2). These represent upper bounds for what more complex architectures or stronger base models can achieve.
- [14] Few open-weight sub-10B baselines beyond WebArena. While Go-Browse and NNetNav report on WebArena and ViGoRL on VWA, no sub-10B open-weight model reports results on WorkArena or MiniWoB, making A3-Qwen3.5-9B one of the first small open models evaluated across this full five-benchmark suite.
Evaluation setup caveats. OpAgent uses a multi-agent architecture...
- [15] Alice Chen (Data Scientist): Skills in Python Programming, Data Analysis, Machine Learning. Interests: Robotics, Hiking, Sci-Fi Novels. Alice specializes in transforming raw data into actionable intelligence; she spends weekends building custom robotics projects.
- [16] Liam O’Connor (Senior Graphic Designer): Skills in Graphic Design, Adobe Creative Suite, Typography. Interests: Analog Photography, Indie Music, Street Art. Liam merges digital precision with physical-world textures, shooting exclusively on film and collecting vinyl records.
- [17] Dr. Fatima Al-Rashidi (Biomedical Researcher): Skills in Bioinformatics, Statistical Analysis, Laboratory Techniques. Interests: Scientific Illustration, Mountaineering, Calligraphy. Fatima bridges computational biology with hands-on lab work, frequently presenting at international conferences.
E.2 Task Generation: Annotator Instructions
The Task Generator...
- [18] Abstract and high-level. The intent should require multiple actions to complete, not merely one or two steps. For example, instead of “click the science subreddit,” annotators are encouraged to produce intents like “post a greeting message on science subreddit,” which requires navigation, form-filling, and submission.
- [19] Creative. Common tasks such as account creation are discouraged. Instead, annotators should add constraints (e.g., “create a Reddit account identical to my GitLab one”) to produce unique intents.
- [20] Template-based with variables. Intents should be formulated as templates with replaceable elements marked as variables (e.g., {{section name}}, {{topic}}). Each template is instantiated with multiple variable assignments, producing diverse concrete tasks from a single template. For example, “Browse the {{section name}} section to find a post containing {{topic}}...
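The template mechanism in entry [20] can be illustrated with a short sketch: one template plus several variable assignments yields several concrete intents. The template string is the one quoted above; the `instantiate` helper, the regular expression, and the example values are hypothetical and are not taken from the paper's pipeline.

```python
import re
from itertools import product

# Matches {{variable name}} placeholders, tolerating inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(.+?)\s*\}\}")


def instantiate(template: str, assignments: dict[str, list[str]]) -> list[str]:
    """Expand a template into one intent per combination of variable values."""
    # Unique variable names in order of first appearance.
    names = list(dict.fromkeys(PLACEHOLDER.findall(template)))
    missing = [name for name in names if name not in assignments]
    if missing:
        raise KeyError(f"no values for variables: {missing}")

    intents = []
    for combo in product(*(assignments[name] for name in names)):
        values = dict(zip(names, combo))
        intents.append(PLACEHOLDER.sub(lambda m: values[m.group(1)], template))
    return intents


template = "Browse the {{section name}} section to find a post containing {{topic}}"
intents = instantiate(
    template,
    {
        # Example values are made up for illustration.
        "section name": ["science", "books"],
        "topic": ["climate", "robotics"],
    },
)
for intent in intents:
    print(intent)
```

Taking the full Cartesian product is one possible design; a pipeline could equally sample a fixed number of assignments per template to keep the task count bounded.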
- [21] Action looping (<loop>): Did the agent loop through actions without making progress? (Yes/No)
- [22] Side effects (<side>): Did the agent perform unnecessary actions with unintended side effects? (Yes/No)
- [23] Optimality (<optimal>): Was the task performed optimally? (4-point scale: Complete Failure, Suboptimal, Somewhat Optimal, Completely Optimal)
4. Success (<success>): Was the task successfully completed? (Successful/Unsuccessful)
Crucially, the success question is asked last (after side effects, looping, and optimality), following an “inverted” ordering tha...
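To make the rubric above concrete, here is a minimal sketch of how a judge response carrying these four tags could be parsed and turned into a keep-or-drop decision. The closing-tag format (`<loop>No</loop>`), the `Verdict` class, and the rule "keep only trajectories judged Successful" are all assumptions for illustration; the excerpt does not specify the response format or the exact retention rule.

```python
import re
from dataclasses import dataclass

# Tags in the order the rubric asks them, with success last.
TAGS = ("loop", "side", "optimal", "success")


@dataclass
class Verdict:
    loop: str
    side: str
    optimal: str
    success: str

    @property
    def keep(self) -> bool:
        # ASSUMED retention rule: keep only trajectories judged successful.
        return self.success.lower() == "successful"


def parse_verdict(response: str) -> Verdict:
    """Extract the four rubric answers from an XML-like judge response."""
    fields = {}
    for tag in TAGS:
        match = re.search(
            rf"<{tag}>\s*(.*?)\s*</{tag}>", response, re.DOTALL | re.IGNORECASE
        )
        if match is None:
            raise ValueError(f"judge response is missing <{tag}>")
        fields[tag] = match.group(1)
    return Verdict(**fields)


# Hypothetical judge output, made up for illustration.
sample = """
<loop>No</loop>
<side>No</side>
<optimal>Somewhat Optimal</optimal>
<success>Successful</success>
"""
verdict = parse_verdict(sample)
print(verdict, verdict.keep)
```

A stricter filter could also require `loop` and `side` to be "No"; whether the paper does so is not stated in the text above.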
- [24] (FSDP). The primary model (A3-Qwen3.5-9B) was trained on 8 GPUs; ablation variants on 4 GPUs. We use the HuggingFace Transformers library with the TRL (Transformer Reinforcement Learning) framework for SFT.
Loss function. We use the standard causal language modeling loss (cross-entropy) computed only on assistant tokens. System and user tokens are masked...
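The loss described above, cross-entropy computed only on assistant tokens, can be sketched in plain Python without any training framework. This is an illustration of the masking idea, not the authors' training code: the helper names, the toy token ids, and the per-position log-probability dictionaries are hypothetical. The sentinel value -100 mirrors the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
import math

# Label value meaning "do not score this position" (PyTorch's default ignore_index).
IGNORE_INDEX = -100


def build_labels(token_ids, roles):
    """Keep the target id for assistant tokens; mask system and user tokens."""
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]


def masked_cross_entropy(log_probs, labels):
    """Mean negative log-likelihood over unmasked positions.

    log_probs[i] maps token id -> log-probability the model assigns to that
    token at position i (the usual one-step causal shift is assumed to have
    been applied already).
    """
    losses = [
        -log_probs[i][label]
        for i, label in enumerate(labels)
        if label != IGNORE_INDEX
    ]
    if not losses:
        raise ValueError("no assistant tokens to score")
    return sum(losses) / len(losses)


# Toy sequence: one system token, one user token, two assistant tokens.
token_ids = [11, 12, 13, 14]
roles = ["system", "user", "assistant", "assistant"]
log_probs = [
    {11: math.log(0.5)},
    {12: math.log(0.5)},
    {13: math.log(0.5)},
    {14: math.log(0.25)},
]

labels = build_labels(token_ids, roles)
loss = masked_cross_entropy(log_probs, labels)
print(labels, round(loss, 4))
```

Only the last two positions contribute to the loss, so the prompt side of each trajectory shapes the context without being trained on directly.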
- [25] Evaluation method. WebArena uses programmatic evaluators (URL matching, string matching, HTML element checking) that require precise, hand-crafted evaluation functions per task. A3-SYNTH uses LLM-based judge evaluation with hints, enabling scalable evaluation without per-task programming.
- [26] Task complexity. WebArena tasks are carefully curated by human experts to span specific difficulty levels and interaction patterns. A3-SYNTH tasks are generated by the Task Generator LLM and may have different complexity distributions; some are simpler than WebArena tasks, while others attempt more creative interactions.
- [27] Scale. WebArena has 812 tasks (431 train + 381 test). A3-SYNTH produces 3,000 tasks per generation round, and the pipeline can be re-run to produce additional tasks by varying personas or regenerating explorations.
- [28] Evaluation reliability. Programmatic evaluators are deterministic but limited (they cannot evaluate open-ended outcomes). LLM judges are flexible but may introduce noise through false positives or false negatives. Hints mitigate this by providing the judge with structured evaluation criteria.
F.2 Evaluation Benchmark Summary
Table 18 summarizes the key cha...