arxiv: 2605.20291 · v1 · pith:AKPME5T6new · submitted 2026-05-19 · 💻 cs.LG

Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

Pith reviewed 2026-05-21 08:04 UTC · model grok-4.3

classification 💻 cs.LG
keywords web agents · out-of-domain generalization · trajectory selection · data efficiency · importance and diversity · greedy algorithm · AXTree pruning · LLM agents

The pith

Selecting important and diverse trajectories lets web agents generalize out of domain while cutting training costs by an order of magnitude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper sets out to demonstrate that web agents generalize better to new websites when they are trained offline on a carefully chosen subset of trajectory data rather than on the full dataset. The core idea is to pick trajectory steps that are both important on their own and diverse from each other in the states, websites, and interaction patterns they involve, using a greedy algorithm to solve this selection problem under a fixed budget. Two additional steps, pruning accessibility trees (AXTrees) to the content around each action's target and generating rationales in the model's own style, further improve efficiency and reduce style mismatch. A sympathetic reader would care because current approaches waste compute on redundant or noisy data and still fail when the agent encounters unfamiliar sites or tasks.

Core claim

The central discovery is that greedily optimizing an objective that combines unary importance scores with pairwise diversity measures across states, websites, and interaction patterns identifies a compact set of trajectory steps that, when used for fine-tuning, yields better out-of-domain performance on web agent benchmarks than the entire dataset, with training speedups of roughly 9.7× to 12.5×.
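
A minimal sketch of a greedy selection of this kind, assuming precomputed importance scores and a pairwise distance function. The names `greedy_select`, `importance`, `distance`, `budget`, and the trade-off weight `lam` are illustrative assumptions, not taken from the paper or its code release, and the plain marginal-gain rule used here may differ from the exact rule Weasel applies.

```python
from typing import Callable, List, Sequence


def greedy_select(
    importance: Sequence[float],
    distance: Callable[[int, int], float],
    budget: int,
    lam: float = 1.0,
) -> List[int]:
    """Greedily pick up to `budget` indices.

    Each round adds the item with the largest marginal gain:
    its own importance plus `lam` times the sum of its distances
    to the items already selected (hypothetical scoring rule).
    """
    n = len(importance)
    selected: List[int] = []
    # div_gain[i] accumulates the distance from i to every selected item.
    div_gain = [0.0] * n
    remaining = set(range(n))
    while remaining and len(selected) < budget:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(
            sorted(remaining),
            key=lambda i: importance[i] + lam * div_gain[i],
        )
        selected.append(best)
        remaining.remove(best)
        for i in remaining:
            div_gain[i] += distance(i, best)
    return selected
```

With toy inputs `importance = [0.9, 0.8, 0.1]` and 1-D positions `[0.0, 0.1, 5.0]` (absolute difference as the distance), a budget of 2 selects indices `[0, 2]` when `lam = 1.0` and `[0, 1]` when `lam = 0.0`: the diversity term swaps the near-duplicate second item for the distant third one.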

What carries the argument

The importance-diversity objective solved greedily to select trajectory steps, combined with target-centered AXTree pruning and model-generated rationales.
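
The pruning component can be illustrated with a short sketch. It assumes the AXTree has already been linearized into a list of node strings and that the index of the gold target node is known; `prune_axtree` and `half_window` are hypothetical names, and treating the window as a fixed number of nodes on each side of the target is an assumption about how the window is measured.

```python
from typing import List


def prune_axtree(nodes: List[str], target_idx: int, half_window: int) -> List[str]:
    """Keep only the nodes within `half_window` positions of the target.

    `nodes` is a linearized AXTree; `target_idx` is the position of the
    node the gold action acts on. The window is clipped at both ends.
    """
    if not 0 <= target_idx < len(nodes):
        raise IndexError("target_idx is outside the linearized AXTree")
    if half_window < 0:
        raise ValueError("half_window must be non-negative")
    lo = max(0, target_idx - half_window)
    hi = min(len(nodes), target_idx + half_window + 1)
    return nodes[lo:hi]
```

For example, `prune_axtree(list("abcdefg"), 3, 1)` returns `['c', 'd', 'e']`, and a target at the start of the tree simply yields a shorter window.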

If this is right

  • Out-of-domain success rates increase on WebArena, WorkArena, and MiniWob when training with the selected data.
  • Training time is reduced by factors of 9.7 to 12.5 across Qwen2.5-7B, Gemma3-4B, and Qwen3-8B models.
  • The method applies to both AgentTrek and NNetNav training datasets.
  • Style-consistent rationales help reasoning-native models adapt better.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar selection criteria could improve efficiency in training agents for other environments like mobile apps or games.
  • The focus on diversity over interaction patterns may help address long-tail behaviors in agent tasks.
  • Reducing data volume this way might lower the barrier to iterating on web agent designs.

Load-bearing premise

A greedy solution to balancing importance and diversity will reliably choose trajectories from which the model learns generalizable behaviors for unseen websites and tasks.
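
One way to write the selection problem behind this premise, following the form the paper's text describes as a quality term plus a sum of pairwise distances under a cardinality constraint. The symbols are assumptions introduced here for illustration: $V$ is the pool of candidate steps, $B$ the budget, $u(i)$ the unary importance of step $i$, $d(i,j)$ a pairwise distance, and $\lambda$ a trade-off weight that the paper may define or set differently.

```latex
\max_{S \subseteq V,\; |S| = B} \quad
  \sum_{i \in S} u(i) \;+\; \lambda \sum_{\{i,j\} \subseteq S} d(i,j)

% A plain marginal-gain greedy step, given the current selection S:
i^{\star} \;=\; \arg\max_{i \in V \setminus S}
  \Big( u(i) + \lambda \sum_{j \in S} d(i,j) \Big),
\qquad S \leftarrow S \cup \{ i^{\star} \}
```

The premise is that repeating this step $B$ times yields a set whose objective value, and more importantly whose downstream out-of-domain performance, is close to what the best possible subset would give. The exact greedy rule and approximation factor used in the paper are not visible in this view.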

What would settle it

The claim would be refuted if experiments on held-out websites showed that models fine-tuned on Weasel-selected trajectories achieve lower task success rates than models trained on the full dataset or on randomly sampled trajectories of equal size.

Figures

Figures reproduced from arXiv: 2605.20291 by Fatemeh Pesaran zadeh, Gunhee Kim, Seyeon Choi, Siva Reddy, Xing Han Lù.

Figure 1: Overview of WEASEL. Conventional trained web agents show a sharp performance drop under out-of-domain shifts to unseen websites and interaction patterns. WEASEL tackles this challenge via novel trajectory selection: it scores offline demonstration steps for goal relevance and diversity, then applies greedy subset selection under a fixed budget. Agents trained with WEASEL generalize better to unseen test…
Figure 2: (Left): An example of a curated trajectory after applying WEASEL. Although the original collected data contain noisy steps (t = 4), and erroneous actions (t = 0), WEASEL selects a compact subset that retains only the most informative steps (in red) for the goal. (Right): Overview of WEASEL. We first perform element-wise score calculation using unary importance and pairwise diversity. WEASEL then applies a …
Figure 3: Token distribution of 10K subsamples of AgentTrek (Xu et al., 2024) before pruning (green) and after target-centered pruning (blue). Pruning substantially reduces long-tail states, making the resulting sequences more manageable for training.
… quality term plus a sum of pairwise distances under a cardinality constraint (Borodin et al., 2017). For metric distances, a greedy algorithm achieves a constant-fa…
Figure 4: An illustration of Target-centered Pruning. Given a state $s_t$ in the form of AXTree and gold action $a_t$, we retain only the AXTree elements within a fixed window of size $w$ centered at the target index $k_t^*$, producing the pruned state $\tilde{s}_t$. The $k$-th node in the linearized AXTree at step $t$ is denoted $v_{t,k}$ (e.g., $v_{t,1}$, $v_{t,2}$), and $v_{t,k_t^*}$ is the gold target node.
2.4. Target-centered Pruning Web states can be p…
Figure 5: Success rate decreases as the pruning offset increases. Results are reported on WebArena-Lite.
Original abstract

Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solving efficiently with a greedy algorithm. We further improve efficiency with target-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across AgentTrek and NNetNav training datasets, evaluations in WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5$\times$ training speedups over standard fine-tuning. We make the code available at https://github.com/fatemehpesaran310/weasel.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Weasel, a trajectory selection method for offline training of web agents. It selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, solved via a greedy algorithm. Additional components include target-centered AXTree pruning and replacement of expert traces with model-generated rationales for style consistency. Experiments on AgentTrek and NNetNav training data, evaluated on WebArena, WorkArena, and MiniWob with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B models, report improved out-of-domain performance together with 9.7-12.5× training speedups relative to standard fine-tuning. Code is released at the cited GitHub repository.

Significance. If the reported gains prove robust, the approach could meaningfully advance efficient offline training of generalizable web agents by addressing redundancy and noise in trajectory data. The public code release supports reproducibility and is a clear strength.

major comments (2)
  1. [Abstract] Abstract: the reported OOD gains and speedups are presented without error bars, exact baseline implementation details, or an ablation isolating the diversity term from AXTree pruning and rationale replacement; these omissions are load-bearing because they prevent determining whether the central selection procedure, rather than the auxiliary efficiency steps, drives the claimed improvements.
  2. [Method (objective and greedy algorithm)] Method section describing the objective and greedy algorithm: the claim that optimizing unary importance plus pairwise diversity over states/websites/patterns produces trajectories whose induced policies transfer to unseen websites and tasks rests on the untested assumption that the diversity term captures cross-domain interaction patterns rather than merely reducing in-domain redundancy. Without targeted ablations (e.g., diversity term removed, random selection baseline, or correlation analysis between marginal gains and OOD robustness), the observed gains on WebArena/WorkArena/MiniWob could be explained by the other modifications instead.
minor comments (1)
  1. The abstract states a selection budget but does not report its concrete value or sensitivity analysis; adding this would improve clarity without altering the central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions that will be incorporated to strengthen the manuscript.

  1. Referee: [Abstract] Abstract: the reported OOD gains and speedups are presented without error bars, exact baseline implementation details, or an ablation isolating the diversity term from AXTree pruning and rationale replacement; these omissions are load-bearing because they prevent determining whether the central selection procedure, rather than the auxiliary efficiency steps, drives the claimed improvements.

    Authors: We agree that error bars, precise baseline details, and an isolating ablation are necessary to attribute gains clearly. In the revised manuscript we will add error bars to all reported OOD and speedup results, expand the experimental section with exact baseline implementation details (including training hyperparameters, data preprocessing, and model versions), and insert a dedicated ablation that holds AXTree pruning and rationale replacement fixed while varying only the selection objective (full importance-diversity vs. importance-only vs. random). These changes will isolate the contribution of the core selection procedure. revision: yes

  2. Referee: [Method (objective and greedy algorithm)] Method section describing the objective and greedy algorithm: the claim that optimizing unary importance plus pairwise diversity over states/websites/patterns produces trajectories whose induced policies transfer to unseen websites and tasks rests on the untested assumption that the diversity term captures cross-domain interaction patterns rather than merely reducing in-domain redundancy. Without targeted ablations (e.g., diversity term removed, random selection baseline, or correlation analysis between marginal gains and OOD robustness), the observed gains on WebArena/WorkArena/MiniWob could be explained by the other modifications instead.

    Authors: We acknowledge that the current manuscript does not contain an explicit ablation removing the diversity term or a correlation analysis linking diversity metrics to OOD gains. To address this directly, the revision will add (i) an ablation that removes the pairwise diversity component while retaining importance scoring, AXTree pruning, and rationale replacement, (ii) a random-selection baseline matched for budget, and (iii) a supplementary analysis correlating per-trajectory diversity scores with observed OOD performance deltas across the three evaluation suites. While we continue to hold that the multi-aspect diversity objective (states, websites, interaction patterns) is motivated by the goal of broader coverage, the requested ablations will provide the empirical evidence needed to substantiate its role in OOD transfer. revision: yes
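
The ablation design described in the two responses can be sketched as a small configuration grid. This is an illustrative layout only: `AblationArm`, `build_arms`, and the field names are hypothetical and do not come from the paper or its repository. The point it shows is that AXTree pruning and rationale replacement stay fixed while only the selection rule varies.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AblationArm:
    """One training run in the proposed ablation (hypothetical schema)."""
    name: str
    use_importance: bool
    use_diversity: bool
    axtree_pruning: bool = True         # held fixed across all arms
    rationale_replacement: bool = True  # held fixed across all arms


def build_arms() -> List[AblationArm]:
    """Return the three budget-matched arms named in the rebuttal."""
    return [
        AblationArm("importance+diversity", use_importance=True, use_diversity=True),
        AblationArm("importance-only", use_importance=True, use_diversity=False),
        AblationArm("random", use_importance=False, use_diversity=False),
    ]
```

Because the two auxiliary steps are identical in every arm, any difference in out-of-domain success between arms can be attributed to the selection rule, provided all arms use the same selection budget.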

Circularity Check

0 steps flagged

Empirical selection procedure with no definitional circularity

The paper describes Weasel as a practical trajectory selection algorithm that optimizes a unary-importance-plus-pairwise-diversity objective via a stated greedy procedure, followed by AXTree pruning and rationale replacement. Reported OOD gains and 9.7-12.5× speedups are obtained from direct experimental comparisons on AgentTrek, NNetNav, WebArena, WorkArena, and MiniWob with multiple base models; these outcomes are not algebraically forced by any fitted parameter, self-referential normalization, or uniqueness theorem internal to the paper. The method is self-contained against external benchmarks and contains no load-bearing self-citations or ansatzes that reduce the central claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only view limits visibility; inferred elements are the fixed selection budget (hyperparameter) and the claim that the greedy algorithm sufficiently approximates the combinatorial objective. No new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • selection budget
    Fixed number of trajectory steps retained; chosen to control training cost.
axioms (1)
  • domain assumption: Greedy algorithm yields a good approximation to the joint importance-diversity objective
    Invoked to make selection tractable for large trajectory pools.

pith-pipeline@v0.9.0 · 5792 in / 1342 out tokens · 66143 ms · 2026-05-21T08:04:22.068509+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 3 internal anchors

  1. [1] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining. 2023.
  2. [2] Learning Transferable Visual Models From Natural Language Supervision. 2021.
  3. [3] SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. 2025.
  4. [4] Android in the Wild: A Large-Scale Dataset for Android Device Control. 2023.
  5. [5] Alexander Bukharin, Shiyang Li, Zhengyang Wang, Jingfeng Yang, Bing Yin, Xian Li, Chao Zhang, Tuo Zhao, Haoming Jiang. Data Diversity Matters for Robust Instruction Tuning. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.195
  6. [6] Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176.
  7. [7] Scaling Instruction-Finetuned Language Models. 2022. doi:10.48550/ARXIV.2210.11416
  8. [8] VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents. 2024.
  9. [9] LoRA: Low-Rank Adaptation of Large Language Models. 2021.
  10. [10] LineRetriever: Planning-Aware Observation Reduction for Web Agents. 2025.
  11. [11] FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents. 2025.
  12. [12] Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents. 2025.
  13. [13] Less is More: Improving LLM Alignment via Preference Data Selection. 2025.
  14. [14] GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning. 2021.
  15. [15] Coresets for Data-efficient Training of Machine Learning Models. 2020.
  16. [16] Active Learning for Convolutional Neural Networks: A Core-Set Approach. 2018.
  17. [17] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR. 2019. arXiv:1907.11692
  18. [18] Retrieval-augmented GUI Agents with Generative Guidelines. 2025.
  19. [19] Just-in-time Episodic Feedback Hinter: Leveraging Offline Knowledge to Improve LLM Agents Adaptation. 2025.
  20. [20] World of Bits: An Open-Domain Platform for Web-Based Agents. Proceedings of the 34th International Conference on Machine Learning. 2017.
  21. [21] Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration. 2018.
  22. [22] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, Alexandre Lacoste. 2024.
  23. [23] The BrowserGym Ecosystem for Web Agent Research. Transactions on Machine Learning Research. 2025.
  24. [24] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. 2023.
  25. [25] Gemma 3 Technical Report. 2025.
  26. [26] Max-sum diversification, monotone submodular functions, and dynamic updates. ACM Transactions on Algorithms (TALG). 2017.
  27. [27] BERTScore: Evaluating Text Generation with BERT. 2020.
  28. [28] Lucas-Andrei Thil, Mirela Popa, Gerasimos Spanakis. Navigating WebAI: Training Agents to Complete Web Tasks with Large Language Models and Reinforcement Learning. Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing.
  29. [29] Wepo: Web element preference optimization for llm-based web navigation. Proceedings of the AAAI Conference on Artificial Intelligence.
  30. [30] Htmlrag: Html is better than plain text for modeling retrieved knowledge in rag systems. Proceedings of the ACM on Web Conference 2025.
  31. [31] Computers and intractability. 2002.
  32. [32] Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah Goodman. STaR: Bootstrapping Reasoning With Reasoning.
  33. [33] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025.
  34. [34] Qwen3 Technical Report. 2025.
  35. [35] Qwen2.5 Technical Report. 2025.
  36. [36] Mind2Web: Towards a Generalist Agent for the Web. 2023.
  37. [37] Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents. 2024.
  38. [38] AutoWebGLM: A Large Language Model-based Web Navigating Agent. 2024.
  39. [39] Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction. 2025.
  40. [40] WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning. 2025.
  41. [41] WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning. 2025.
  42. [42] Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale. 2024.
  43. [43] AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials. 2025.
  44. [44] NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild. 2025.
  45. [45] AgentBench: Evaluating LLMs as Agents. 2023.
  46. [46] WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? 2024.
  47. [47] WebArena: A Realistic Web Environment for Building Autonomous Agents. 2024.
  48. [48] Webrl: Training llm web agents via self-evolving online curriculum reinforcement learning. arXiv preprint arXiv:2411.02337. 2025.
  49. [49] Voyager: An Open-Ended Embodied Agent with Large Language Models. 2023.
  50. [50] P. Langley. Proceedings of the 17th International Conference on Machine Learning (ICML 2000). 2000.
  51. [51] T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980.
  52. [52] M. J. Kearns.
  53. [53] Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983.
  54. [54] R. O. Duda, P. E. Hart, D. G. Stork. Pattern Classification. 2000.
  55. [55] Suppressed for Anonymity.
  56. [56] A. Newell, P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981.
  57. [57] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959.
  58. [58] Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials. arXiv preprint arXiv:2412.09605.
  59. [59] Weblinx: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930.