Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection
Pith reviewed 2026-05-21 08:04 UTC · model grok-4.3
The pith
Selecting important and diverse trajectories lets web agents generalize out of domain while cutting training costs by an order of magnitude.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that greedily optimizing an objective that combines unary importance scores with pairwise diversity measures across states, websites, and interaction patterns identifies a compact set of trajectories. Fine-tuning on this set yields better out-of-domain performance on web agent benchmarks than fine-tuning on the entire dataset, with training speedups of approximately 9.7 to 12.5 times.
What carries the argument
The importance-diversity objective solved greedily to select trajectory steps, combined with target-centered AXTree pruning and model-generated rationales.
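The greedy selection step can be illustrated with a minimal sketch. It assumes the objective has the form "sum of importance scores plus a weighted sum of pairwise distances among selected items"; this page does not give the paper's exact formula, so the function name `greedy_select`, the weight `lam`, and the caller-supplied `distance` function are all hypothetical.

```python
from typing import Callable, List, Sequence


def greedy_select(
    importance: Sequence[float],
    distance: Callable[[int, int], float],
    budget: int,
    lam: float = 1.0,
) -> List[int]:
    """Pick up to `budget` indices, one at a time.

    Assumed objective (not confirmed by the source):
        sum_i importance[i] + lam * sum_{i<j} distance(i, j)
    over the selected set. Each step adds the item with the largest
    marginal gain under that objective.
    """
    n = len(importance)
    selected: List[int] = []
    # div_gain[i] is the total distance from candidate i to all selected items.
    div_gain = [0.0] * n
    remaining = set(range(n))
    while remaining and len(selected) < budget:
        # Ties are broken toward the lower index so the result is deterministic.
        best = max(
            remaining,
            key=lambda i: (importance[i] + lam * div_gain[i], -i),
        )
        selected.append(best)
        remaining.remove(best)
        for i in remaining:
            div_gain[i] += distance(best, i)
    return selected
```

Keeping a running `div_gain` per candidate means each step costs one distance call per remaining item, rather than recomputing all pairwise terms from scratch. Setting `lam` to zero reduces the procedure to picking the top items by importance alone.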
If this is right
- Out-of-domain success rates increase on WebArena, WorkArena, and MiniWob when training with the selected data.
- Training time is reduced by factors of 9.7 to 12.5 across Qwen2.5-7B, Gemma3-4B, and Qwen3-8B models.
- The method applies to both AgentTrek and NNetNav training datasets.
- Style-consistent rationales help reasoning-native models adapt better.
Where Pith is reading between the lines
- Similar selection criteria could improve efficiency in training agents for other environments like mobile apps or games.
- The focus on diversity over interaction patterns may help address long-tail behaviors in agent tasks.
- Reducing data volume this way might lower the barrier to iterating on web agent designs.
Load-bearing premise
A greedy solution to balancing importance and diversity will reliably choose trajectories from which the model learns generalizable behaviors for unseen websites and tasks.
What would settle it
The claim would be refuted if experiments on held-out websites showed that models fine-tuned on Weasel-selected trajectories achieve lower task success rates than models trained on the full dataset or on randomly sampled trajectories of equal size.
Original abstract
Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, which it solves efficiently with a greedy algorithm. We further improve efficiency with target-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across AgentTrek and NNetNav training datasets, evaluations in WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5$\times$ training speedups over standard fine-tuning. We make the code available at https://github.com/fatemehpesaran310/weasel.
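Target-centered AXTree pruning, as described in the abstract, keeps only content around the ground-truth action target. A minimal sketch of that idea follows. It assumes the AXTree has been serialized to one line per node with the element id in square brackets at the start of the line, and it uses a fixed line window around the target; both the serialization format and the `window` parameter are assumptions for illustration, not details taken from the paper.

```python
from typing import List


def prune_axtree(lines: List[str], target_id: str, window: int = 2) -> List[str]:
    """Keep the target node's line plus `window` lines on each side.

    Assumes each serialized node line starts with "[<id>]" (hypothetical
    format). If the target is not found, the state is returned unpruned.
    """
    marker = f"[{target_id}]"
    hits = [k for k, line in enumerate(lines) if line.lstrip().startswith(marker)]
    if not hits:
        return lines
    center = hits[0]
    lo = max(0, center - window)
    hi = min(len(lines), center + window + 1)
    return lines[lo:hi]
```

A line window is the simplest possible notion of "around the target". A tree-aware variant that keeps ancestors and siblings of the target node would be a natural alternative, but which one the authors use is not stated here.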
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Weasel, a trajectory selection method for offline training of web agents. It selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solved via a greedy algorithm. Additional components include target-centered AXTree pruning and replacement of expert traces with model-generated rationales for style consistency. Experiments on AgentTrek and NNetNav training data, evaluated on WebArena, WorkArena, and MiniWob with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B models, report improved out-of-domain performance together with 9.7-12.5× training speedups relative to standard fine-tuning. Code is released at the cited GitHub repository.
Significance. If the reported gains prove robust, the approach could meaningfully advance efficient offline training of generalizable web agents by addressing redundancy and noise in trajectory data. The public code release supports reproducibility and is a clear strength.
major comments (2)
- [Abstract] Abstract: the reported OOD gains and speedups are presented without error bars, exact baseline implementation details, or an ablation isolating the diversity term from AXTree pruning and rationale replacement; these omissions are load-bearing because they prevent determining whether the central selection procedure, rather than the auxiliary efficiency steps, drives the claimed improvements.
- [Method (objective and greedy algorithm)] Method section describing the objective and greedy algorithm: the claim that optimizing unary importance plus pairwise diversity over states/websites/patterns produces trajectories whose induced policies transfer to unseen websites and tasks rests on the untested assumption that the diversity term captures cross-domain interaction patterns rather than merely reducing in-domain redundancy. Without targeted ablations (e.g., diversity term removed, random selection baseline, or correlation analysis between marginal gains and OOD robustness), the observed gains on WebArena/WorkArena/MiniWob could be explained by the other modifications instead.
minor comments (1)
- The abstract states a selection budget but does not report its concrete value or a sensitivity analysis; adding both would improve clarity without altering the central claim.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions that will be incorporated to strengthen the manuscript.
- Referee: [Abstract] Abstract: the reported OOD gains and speedups are presented without error bars, exact baseline implementation details, or an ablation isolating the diversity term from AXTree pruning and rationale replacement; these omissions are load-bearing because they prevent determining whether the central selection procedure, rather than the auxiliary efficiency steps, drives the claimed improvements.
  Authors: We agree that error bars, precise baseline details, and an isolating ablation are necessary to attribute gains clearly. In the revised manuscript we will add error bars to all reported OOD and speedup results, expand the experimental section with exact baseline implementation details (including training hyperparameters, data preprocessing, and model versions), and insert a dedicated ablation that holds AXTree pruning and rationale replacement fixed while varying only the selection objective (full importance-diversity vs. importance-only vs. random). These changes will isolate the contribution of the core selection procedure.
  Revision: yes
- Referee: [Method (objective and greedy algorithm)] Method section describing the objective and greedy algorithm: the claim that optimizing unary importance plus pairwise diversity over states/websites/patterns produces trajectories whose induced policies transfer to unseen websites and tasks rests on the untested assumption that the diversity term captures cross-domain interaction patterns rather than merely reducing in-domain redundancy. Without targeted ablations (e.g., diversity term removed, random selection baseline, or correlation analysis between marginal gains and OOD robustness), the observed gains on WebArena/WorkArena/MiniWob could be explained by the other modifications instead.
  Authors: We acknowledge that the current manuscript does not contain an explicit ablation removing the diversity term or a correlation analysis linking diversity metrics to OOD gains. To address this directly, the revision will add (i) an ablation that removes the pairwise diversity component while retaining importance scoring, AXTree pruning, and rationale replacement, (ii) a random-selection baseline matched for budget, and (iii) a supplementary analysis correlating per-trajectory diversity scores with observed OOD performance deltas across the three evaluation suites. While we continue to hold that the multi-aspect diversity objective (states, websites, interaction patterns) is motivated by the goal of broader coverage, the requested ablations will provide the empirical evidence needed to substantiate its role in OOD transfer.
  Revision: yes
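Two of the requested additions, a budget-matched random-selection baseline and error bars over repeated runs, are simple to set up. The sketch below is one possible way to do it and is not taken from the paper or its repository; the helper names `random_baseline`, `mean_and_stderr`, and `summarize` are hypothetical, and the per-run success rates are expected to come from the reader's own evaluation runs.

```python
import math
import random
import statistics
from typing import Dict, List, Tuple


def random_baseline(n_items: int, budget: int, seed: int) -> List[int]:
    """Sample `budget` distinct indices uniformly, matching the selection budget."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_items), budget))


def mean_and_stderr(values: List[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean across runs (e.g. seeds)."""
    m = statistics.mean(values)
    if len(values) < 2:
        # A single run gives no spread estimate.
        return m, float("nan")
    return m, statistics.stdev(values) / math.sqrt(len(values))


def summarize(runs: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Map each ablation arm (e.g. 'full', 'importance_only', 'random')
    to its (mean, standard error) over the supplied per-run scores."""
    return {arm: mean_and_stderr(scores) for arm, scores in runs.items()}
```

Seeding the random baseline separately for each run keeps the comparison reproducible while still averaging over different random subsets of the same size.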
Circularity Check
Empirical selection procedure with no definitional circularity
The paper describes Weasel as a practical trajectory selection algorithm that optimizes a unary-importance-plus-pairwise-diversity objective via a stated greedy procedure, followed by AXTree pruning and rationale replacement. Reported OOD gains and 9.7-12.5× speedups are obtained from direct experimental comparisons on AgentTrek, NNetNav, WebArena, WorkArena, and MiniWob with multiple base models; these outcomes are not algebraically forced by any fitted parameter, self-referential normalization, or uniqueness theorem internal to the paper. The method is self-contained against external benchmarks and contains no load-bearing self-citations or ansatzes that reduce the central claim to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- selection budget
axioms (1)
- domain assumption: the greedy algorithm yields a good approximation to the joint importance-diversity objective
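One way to probe the greedy-approximation assumption on small instances is to compare the greedy solution against exhaustive search. The sketch below does this under an assumed objective of the form "importance sum plus `lam` times the sum of pairwise distances"; the paper's exact objective is not given on this page, and the names `objective`, `greedy`, and `exhaustive` are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def objective(
    subset: Sequence[int],
    importance: Sequence[float],
    dist: Sequence[Sequence[float]],
    lam: float = 1.0,
) -> float:
    """Assumed objective: unary importance plus weighted pairwise distance."""
    unary = sum(importance[i] for i in subset)
    pairwise = sum(dist[i][j] for i, j in combinations(sorted(subset), 2))
    return unary + lam * pairwise


def greedy(importance, dist, budget: int, lam: float = 1.0) -> List[int]:
    """Add, at each step, the item that most increases the objective."""
    chosen: List[int] = []
    remaining = set(range(len(importance)))
    while remaining and len(chosen) < budget:
        best = max(
            remaining,
            key=lambda i: (objective(chosen + [i], importance, dist, lam), -i),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


def exhaustive(importance, dist, budget: int, lam: float = 1.0) -> Tuple[int, ...]:
    """Best subset of exactly `budget` items by brute force (small n only)."""
    return max(
        combinations(range(len(importance)), budget),
        key=lambda s: objective(s, importance, dist, lam),
    )
```

Comparing `objective(greedy(...))` with `objective(exhaustive(...))` on many small random instances gives an empirical approximation ratio. Exhaustive search is exponential in the number of items, so this check only scales to toy problems.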
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We formulate a fixed-budget subset selection problem with a quadratic objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solving efficiently with a greedy algorithm.
- IndisputableMonolith/Foundation/BranchSelection.lean, RCLCombiner_isCoupling_iff (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $D(i, j) = \max(\delta(s_i, s_j), \delta(y_i, y_j))$ with $\delta = 1 - \mathrm{BERTScore}$
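The quoted distance takes the larger of two dissimilarities, each defined as one minus a similarity score. A minimal sketch of that structure follows. Reading `s` as state text and `y` as output text is an assumption, and a token-level Jaccard similarity stands in for BERTScore so the example runs without model downloads; `jaccard_similarity` and `pair_distance` are hypothetical names.

```python
from typing import Callable


def jaccard_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the quoted passage uses BERTScore instead."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pair_distance(
    s_i: str,
    s_j: str,
    y_i: str,
    y_j: str,
    similarity: Callable[[str, str], float] = jaccard_similarity,
) -> float:
    """D(i, j) = max(delta(s_i, s_j), delta(y_i, y_j)), with delta = 1 - similarity."""
    delta_s = 1.0 - similarity(s_i, s_j)
    delta_y = 1.0 - similarity(y_i, y_j)
    return max(delta_s, delta_y)
```

Because of the `max`, two steps count as far apart if either their states or their outputs differ, so a pair is treated as redundant only when both components are similar. Any similarity function with values in [0, 1] can be passed in place of the stand-in.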
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.