pith. sign in

arxiv: 2606.04391 · v1 · pith:H5SNHGXUnew · submitted 2026-06-03 · 💻 cs.AI

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

Pith reviewed 2026-06-28 06:41 UTC · model grok-4.3

classification 💻 cs.AI
keywords web agentsonline skill learningdynamic retrievalstate groundingweb automationtrajectory extractionsub-procedure reuse
0
0 comments X

The pith

Web agents reuse skills more effectively by retrieving them dynamically to match both the task goal and the current webpage state instead of fixing a set at the outset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing online skill learning for web agents retrieves a fixed set of skills based only on the initial task instruction and holds it constant during execution. This approach breaks down because webpage states change during multi-step tasks, often reaching situations the initial skills do not cover. The paper introduces State-Grounded Dynamic Retrieval that extracts sub-procedures from completed trajectories using sliding windows, stores them in dual text-code form, and retrieves them at each step by matching both the goal and the present page state. Experiments across five WebArena domains show consistent gains, with success rates rising to 37.5 percent using GPT-4.1 and 24.3 percent using Qwen3-4B. A sympathetic reader would care because this alignment between retrieval and execution state directly addresses why static reuse underperforms on realistic web automation.

Core claim

State-Grounded Dynamic Retrieval extracts reusable sub-procedures from trajectories via sliding-window segmentation, represents each in paired text and code, and performs dynamic retrieval at every step by jointly matching the task instruction and the current webpage state, thereby enabling stepwise skill reuse that static task-level methods cannot achieve.

What carries the argument

The state-grounded dynamic retrieval mechanism that selects sub-procedures by matching both task goal and current webpage state through dual text-code representations.

If this is right

  • Skill reuse becomes possible at arbitrary intermediate points rather than only at task start.
  • Agents can cover execution branches that diverge from the initial trajectory.
  • Online learning no longer requires the initial skill set to anticipate every future state.
  • Performance gains appear across both large and smaller language models on the same benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sliding-window extraction plus state matching could be tested on non-web sequential domains such as robotic manipulation where states also evolve unpredictably.
  • If the dual text-code representation proves essential, removing the code component should measurably degrade retrieval accuracy on state-matching tasks.
  • Extending the method to retain only high-success sub-procedures after each episode could further reduce retrieval noise over long sessions.

Load-bearing premise

Sub-procedures cut from completed trajectories will correctly match and execute when retrieved at intermediate states in new tasks.

What would settle it

Running SGDR and the strongest static baseline on the same WebArena tasks and finding that SGDR produces equal or lower average success rates.

Figures

Figures reproduced from arXiv: 2606.04391 by Jiaxi Li, Jingyuan Huang, Jin Lu, Ke Deng, Ninghao Liu, Qiaoyu Tan, Yucheng Shi, Yun Wang.

Figure 1
Figure 1. Figure 1: Comparison between traditional skill methods [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The online skill learning setting. The agent [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of our method SGDR. Completed trajectories are segmented with sliding windows to induce reusable text-code skills. During future task execution, SGDR retrieves state-grounded skills, reranks them with Maximal Marginal Relevance (MMR), and injects the selected skills for the action next step. where yˆi ∈ {0, 1} denotes the evaluator’s binary correctness judgment for task gi , with yˆi = 1 in￾dicati… view at source ↗
Figure 4
Figure 4. Figure 4: Cumulative success rates over the online task stream with backbone model [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
read the original abstract

Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task-level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at https://github.com/plusnli/skill-dynamic-retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes State-Grounded Dynamic Retrieval (SGDR) for online skill learning in web agents. It extracts reusable sub-procedures from completed trajectories via sliding-window, encodes them with a dual text-code representation, and retrieves them dynamically during execution by matching both the task goal and the current webpage state. Experiments on WebArena across five domains report that SGDR achieves average success rates of 37.5% (GPT-4.1) and 24.3% (Qwen3-4B), corresponding to relative gains of 10.6% and 10.0% over the strongest baseline; code is released.

Significance. If the empirical results hold under rigorous controls, SGDR would address a genuine limitation of static task-level skill reuse in web agents by enabling state-dependent retrieval. The code release and concrete reported gains on a standard benchmark are strengths that would support adoption and follow-up work.

major comments (2)
  1. [Abstract] Abstract: the reported success rates and relative gains are presented without any information on baseline definitions, number of runs, statistical significance, variance, or failure modes. This information is load-bearing for the central empirical claim that the dynamic mechanism, rather than prompting differences or task ordering, drives the 10.6%/10.0% improvements.
  2. [Method / Experiments] Method and Experiments sections: the claim that sliding-window sub-procedures extracted from completed trajectories will reliably match and execute correctly at intermediate states in unseen tasks rests on the untested assumption that cosine/embedding similarity on the dual representation selects skills whose preconditions hold and whose actions advance the trajectory. No retrieval-precision analysis, precondition checks, or ablation isolating the state-grounded component at arbitrary webpage states is described; if retrieval precision is low, the headline gains could be artifacts.
minor comments (2)
  1. [Abstract] Abstract: consider briefly defining the five WebArena domains and the strongest baseline for immediate context.
  2. [Method] Notation: the dual text-code representation is introduced without an explicit equation or pseudocode showing how the two modalities are combined for retrieval scoring.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the clarity and rigor of our empirical claims. We address each major point below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported success rates and relative gains are presented without any information on baseline definitions, number of runs, statistical significance, variance, or failure modes. This information is load-bearing for the central empirical claim that the dynamic mechanism, rather than prompting differences or task ordering, drives the 10.6%/10.0% improvements.

    Authors: We agree that the abstract should be self-contained with respect to the key experimental controls. The full manuscript already details the baselines (Section 4.1), evaluation protocol (5 runs per task with reported means), and variance in the results tables, but these were omitted from the abstract for brevity. In the revision we will expand the abstract to explicitly name the strongest baseline, state the number of runs, and note that gains are consistent across domains with standard deviation reported in the main text. revision: yes

  2. Referee: [Method / Experiments] Method and Experiments sections: the claim that sliding-window sub-procedures extracted from completed trajectories will reliably match and execute correctly at intermediate states in unseen tasks rests on the untested assumption that cosine/embedding similarity on the dual representation selects skills whose preconditions hold and whose actions advance the trajectory. No retrieval-precision analysis, precondition checks, or ablation isolating the state-grounded component at arbitrary webpage states is described; if retrieval precision is low, the headline gains could be artifacts.

    Authors: The referee correctly notes the absence of a dedicated retrieval-precision study. While end-to-end success rates on WebArena provide indirect evidence that retrieved skills are useful, we did not quantify precision@K or precondition satisfaction at arbitrary states. We will add (1) a new ablation that disables the state-grounded component while keeping the dual representation and sliding-window extraction fixed, and (2) a retrieval analysis reporting precision and precondition-match rates on held-out intermediate states. These additions will appear in a new subsection of the experiments. revision: yes

Circularity Check

0 steps flagged

Empirical method with no derivation chain

full rationale

The paper describes an algorithmic procedure (sliding-window extraction, dual text-code representation, state-grounded retrieval) and reports experimental success rates on WebArena. No equations, first-principles derivations, or predictions are presented that reduce by construction to quantities defined from the method's own fitted parameters or self-citations. The contribution is framed as an empirical engineering result with released code, consistent with the reader's assessment of score 2.0; the central claims rest on external benchmark comparisons rather than internal self-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; the method description implies hyperparameters for window size and retrieval scoring plus the domain assumption that webpage states can be matched to skill descriptions, but no explicit free parameters or invented entities are stated.

axioms (1)
  • domain assumption Webpage states encountered during execution can be effectively represented and matched to previously extracted sub-procedures via text and code embeddings.
    This matching step is required for the dynamic retrieval component to function as described.

pith-pipeline@v0.9.1-grok · 5825 in / 1311 out tokens · 38309 ms · 2026-06-28T06:41:01.080338+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 3 linked inside Pith

  1. [1]

    Advances in Neural Information Processing Systems, 36:28091–28114

    Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114. Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. 2024. WorkArena: How capable are web agents at solving common knowledge work tasks? In Pr...

  2. [2]

    Advances in Neural Information Processing Systems, 38:111259–111284

    Mitigating hallucination through theory- consistent symmetric multimodal preference op- timization. Advances in Neural Information Processing Systems, 38:111259–111284. Yitao Liu, Chenglei Si, Karthik R Narasimhan, and Shunyu Yao. 2025. Contextual experience re- play for self-improvement of language agents. In Proceedings of the 63rd Annual Meeting of the...

  3. [3]

    In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 33007–33056

    WebLINX: Real-world website navigation with multi-turn dialogue. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 33007–33056. PMLR. Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, and Chen Dai. 2026. Contractskill: Repairable contract-based skills for multimodal ...

  4. [4]

    Advances in Neural Information Processing Systems, 37:130109– 130135

    Online adaptation of language models with 10 a memory of amortized contexts. Advances in Neural Information Processing Systems, 37:130109– 130135. Qitao Tan, Xiaoying Song, Arman Akbari, Arash Ak- bari, Yanzhi Wang, Xiaoming Zhai, Lingzi Hong, Zhen Xiang, Jin Lu, and Geng Yuan. 2026a. Palette: A modular, controllable, and efficient framework for on-demand...

  5. [5]

    success" or

    Polyskill: Learning generalizable skills through polymorphic abstraction. arXiv preprint arXiv:2510.15863. Tao Yu, Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, and Wenhu Chen. 2026. Browseragent: Building web agents with human- inspired web browsing actions. Transactions on Machine Lear...

  6. [6]

    The bot's response must contain the information the user wants, or explicitly state that the information is not available

    Information seeking: The user wants to obtain certain information from the webpage, such as the information of a product, reviews, map info, comparison of map routes, etc. The bot's response must contain the information the user wants, or explicitly state that the information is not available. Otherwise, e.g. the bot encounters an exception and respond wi...

  7. [7]

    Carefully examine the bot's action history and the final state of the webpage to determine whether the bot successfully completes the task

    Site navigation: The user wants to navigate to a specific page. Carefully examine the bot's action history and the final state of the webpage to determine whether the bot successfully completes the task. No need to consider the bot's response

  8. [8]

    Status:

    Content modification: The user wants to modify the content of a webpage or configuration. Carefully examine the bot's action history and the final state of the webpage to determine whether the bot successfully completes the task. No need to consider the bot's response. *IMPORTANT* Format your response into two lines as shown below: Thoughts: <your thought...

  9. [9]

    fill title + fill body + click submit

    Is the window a *reusable* sub-routine? A reusable window: - Performs a recognizable web operation that could occur on other tasks (e.g. searching a product, applying a price filter, posting a comment, opening a user profile). - Is general enough to apply with different inputs: variable parts (search queries, usernames, element ids that obviously vary acr...

  10. [10]

    submit a forum post

    If reusable, produce: - description: a single sentence that MUST contain both (a) a precise action verb + object (e.g. " submit a forum post", "apply a price filter ", "open a forum-selection combobox", "fill in the title and body"); and (b) the typical page context where this routine runs (e.g. "on a forum submission form", "on a product listing page", "...

  11. [11]

    Names the kind of page in operational terms ( e.g.'forum submission form','product listing page','opened forum-selection combobox','post-detail page with comment section')

  12. [12]

    Check if the social security admin- istration in pittsburgh can be reached in one hour by car from CMU

    Lists the action verbs this page ENABLES right now - i.e. what sub-routines could plausibly run on this exact state. Use verb + object phrasing (e.g.'submit a post',' select a forum','fill in the title and body ','open the sort menu','apply a filter'). Do NOT enumerate every visible element, do NOT describe pure visuals (colors, layout), and do NOT mentio...