Skim: Speculative Execution for Fast and Efficient Web Agents

Kevin Hsieh; Mike Wong; Ravi Netravali; Suman Nath

arxiv: 2605.16565 · v2 · pith:RL6AN4H5new · submitted 2026-05-15 · 💻 cs.AI · cs.OS

Pith reviewed 2026-05-20 18:23 UTC · model grok-4.3

keywords web agents · speculative execution · cost reduction · latency optimization · browser agents · AI agents · task automation

The pith

Skim lets web agents skip most expensive steps on predictable sites by matching queries to pre-captured patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current web agents apply full frontier-model inference, browser rendering, and planning to every step even when purpose-built sites follow stable structures. Skim profiles these patterns once offline per site and then uses a lightweight match to synthesize the target URL and pull the answer with a small model. A verifier accepts the fast output or triggers the original agent only on rare misses, starting it from the already-reached URL. This approach matters because it directly lowers the per-task expense and time without changing final accuracy on existing benchmarks and agent backbones.

Core claim

Skim is a speculative execution framework for web agents that exploits the predictable structure of purpose-built websites. An offline profiler captures stable URL patterns, answer formats, and task-to-trajectory mappings once per site. At runtime Skim matches each incoming query to a template, builds the destination URL, and extracts the answer with a small model. A lightweight verifier checks the result against the query and schema; misspeculations fall back to the full agent warm-started from the fast-path URL so upstream work is not lost.
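The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Template` class, the regex-based query matching, the `shop.example` URL, and the callables `fast_extract`, `verify`, and `full_agent` are all hypothetical names introduced here. The cold-start branch for queries that match no template is also an assumption, since the summary only describes the warm-started fallback.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import quote_plus


@dataclass
class Template:
    """Hypothetical site-profile entry: a query pattern plus a URL pattern."""
    query_pattern: re.Pattern  # recognizes a query type and captures its slots
    url_pattern: str           # destination URL with {slot} placeholders


def synthesize_url(query: str, templates: list[Template]) -> Optional[str]:
    """Return the destination URL for the first matching template, else None."""
    for t in templates:
        m = t.query_pattern.fullmatch(query.strip())
        if m:
            # URL-encode each captured slot before filling the pattern.
            slots = {k: quote_plus(v) for k, v in m.groupdict().items()}
            return t.url_pattern.format(**slots)
    return None


def run_task(query: str, templates: list[Template],
             fast_extract: Callable[[str, str], str],
             verify: Callable[[str, str], bool],
             full_agent: Callable[[str, str], str],
             home_url: str) -> tuple[str, str]:
    """Try the fast path; on any miss, cascade to the full agent."""
    url = synthesize_url(query, templates)
    if url is None:
        # Assumption: with no matching template, start cold from the homepage.
        return full_agent(query, home_url), "full-cold"
    answer = fast_extract(query, url)
    if verify(query, answer):
        return answer, "fast"
    # Misspeculation: warm-start the full agent from the fast-path URL.
    return full_agent(query, url), "full-warm"


# Hypothetical profile with a single template.
templates = [
    Template(re.compile(r"price of (?P<item>.+)", re.IGNORECASE),
             "https://shop.example/search?q={item}"),
]
```

Returning the path label alongside the answer makes it easy to count how often each branch is taken, which is the quantity the referee report later asks for.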

What carries the argument

Offline profiler that records stable URL patterns and task mappings so runtime template matching plus small-model extraction can replace full agent steps, with a verifier to handle rare fallbacks.
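To make the offline step concrete, here is one way a profiler might generalize observed URLs into a reusable template. The summary does not say how Skim's profiler works internally, so this is only a sketch under stated assumptions: `induce_url_template` is a hypothetical helper, slot values are assumed to appear URL-encoded in the destination URL, and "stable" is approximated as "every example yields the same template string".

```python
from typing import Optional
from urllib.parse import quote_plus


def induce_url_template(examples: list[tuple[dict[str, str], str]]) -> Optional[str]:
    """Generalize observed (slots, url) pairs into one URL template.

    Each slot value is replaced in its URL by a {name} placeholder. The
    template is accepted only if every example produces the same string.
    """
    candidates = set()
    for slots, url in examples:
        templ = url
        # Replace longer values first so a short value cannot clobber a longer one.
        for name, value in sorted(slots.items(), key=lambda kv: -len(kv[1])):
            encoded = quote_plus(value)
            if encoded not in templ:
                return None  # slot value is not visible in the URL
            templ = templ.replace(encoded, "{" + name + "}")
        candidates.add(templ)
    # Exactly one candidate means the pattern held across all examples.
    return candidates.pop() if len(candidates) == 1 else None


# Hypothetical observations from two profiling runs on the same site.
stable = [
    ({"q": "usb cable"}, "https://shop.example/search?q=usb+cable"),
    ({"q": "hdmi"}, "https://shop.example/search?q=hdmi"),
]
unstable = [
    ({"q": "a1"}, "https://x.example/s?q=a1"),
    ({"q": "b2"}, "https://x.example/find/b2"),
]
```

A site whose examples disagree returns `None`, which would leave those queries on the full-agent path rather than risk a wrong speculation.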

If this is right

Median per-task cost falls by 1.9x and latency by 33.4 percent across standard web-agent benchmarks.
Accuracy stays identical when Skim is paired with backbones such as WebVoyager, AgentOccam, or BrowserUse.
Most queries avoid frontier-model inference and full ReAct planning by using the fast path.
Misspeculations still preserve trajectory progress because the full agent starts from the fast-path URL.
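The two headline numbers use different conventions: cost is given as a factor (1.9x) and latency as a percentage (33.4 percent). A small conversion puts them on the same footing. The function names below are introduced here for illustration only.

```python
def factor_to_percent_reduction(factor: float) -> float:
    """A k-fold reduction leaves 1/k of the original, so it removes 1 - 1/k."""
    return (1.0 - 1.0 / factor) * 100.0


def percent_reduction_to_factor(percent: float) -> float:
    """Removing p percent leaves (1 - p/100), a 1 / (1 - p/100) fold reduction."""
    return 1.0 / (1.0 - percent / 100.0)


cost_cut_percent = factor_to_percent_reduction(1.9)    # roughly 47.4 percent
latency_factor = percent_reduction_to_factor(33.4)     # roughly 1.50x
```

Read this way, the median cost drops by a little under half, while the median latency improvement corresponds to about a 1.5x speedup.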

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same offline-profile idea could be tried on other structured interfaces such as REST APIs or internal tools that also follow repeatable patterns.
Periodic re-profiling would be needed as sites evolve, turning the one-time capture into an ongoing maintenance task.
Widespread adoption might push websites toward more consistent designs that further improve agent efficiency.
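The re-profiling point could be made operational by watching the verifier's rejection rate and flagging a site profile as stale when it climbs. This is an editorial sketch, not something the paper describes: the `DriftMonitor` class, the window size, and the threshold are all hypothetical.

```python
from collections import deque


class DriftMonitor:
    """Flags a site profile as stale when recent verifier rejections pile up."""

    def __init__(self, window: int = 50, max_reject_rate: float = 0.3):
        self.outcomes = deque(maxlen=window)  # True means the verifier accepted
        self.max_reject_rate = max_reject_rate

    def record(self, accepted: bool) -> None:
        """Store one verifier outcome; the oldest one drops out when full."""
        self.outcomes.append(accepted)

    def needs_reprofile(self) -> bool:
        # Wait for a full window so a few early misses do not trigger re-profiling.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        rejects = sum(1 for ok in self.outcomes if not ok)
        return rejects / len(self.outcomes) > self.max_reject_rate
```

A rolling window keeps the signal local in time, so a site redesign shows up within roughly one window of queries instead of being averaged away by older successes.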

Load-bearing premise

Purpose-built websites keep stable URL patterns, answer formats, and task-to-trajectory mappings across similar queries so that an offline profiler can capture them once and reuse them reliably.

What would settle it

Run the same benchmarks on sites whose layouts or URL schemes change frequently between profiler runs and check whether the 1.9x cost and 33.4 percent latency gains disappear or accuracy drops.

Figures

Figures reproduced from arXiv: 2605.16565 by Kevin Hsieh, Mike Wong, Ravi Netravali, Suman Nath.

**Figure 2.** Distribution of the number of ReAct steps needed for agents to converge to an answer.

**Figure 4.** Distribution of the percentage of steps per task that are navigational, i.e., moving the agent between pages rather than directly satisfying a task requirement by extracting an answer, comparing values, or modifying page state.

**Figure 6.** Latencies of handcrafted optimized programs.

**Figure 8.** Fraction of tasks per site solvable via HTTP-only execution. Many sites support a large percentage of tasks through direct retrieval without browser interaction.

**Figure 10.** CDF of navigational elements (links, buttons, form inputs) per page across ReAct trajectories. The median page presents 212+ navigational elements, each leading to a different page.

**Figure 12.** CDF over tasks of the percentage of steps requiring each compute tier as the weakest tier sufficient to match the full ReAct agent. Tiers combine page-fetch (HTTP, browser, or browser w/ screenshot) with a model (Qwen2.5-14B or GPT-4o).

**Figure 13.** Skim end-to-end. Offline (left): a profiling pipeline encodes per-site URL templates, search behavior, and extraction schemas into a reusable site profile. Online (right): tasks attempt direct addressing through template-driven URL synthesis, plain HTTP fetch, and lightweight extraction. Output is verified before commitment; on rejection, the system cascades to a heavier execution tier resuming at the URL…

**Figure 14.** Distribution of the number of unique action-type prefixes across tasks on a site, for the first 1, 2, and 3 actions of each task. Prefixes consider action types (e.g., click, type), ignoring parameters like element identifiers or extracted content.

**Figure 15.** End-to-end task latencies for WebVoyager (left), AgentOccam (middle), and BrowserUse (right) across all tasks.

**Figure 17.** Warm start savings for WebVoyager. (CDF of latency in seconds; series: warm starts, cold starts.)

**Figure 21.** Offline profiling latency per site.

Original abstract

Skim is a speculative execution framework for web agents that exploits the predictable structure of purpose-built websites. Today's web-agent expense is not intrinsic to the tasks but a property of how agents are composed: frontier-model inference, browser rendering, and ReAct-style planning are applied to every step of every task regardless of complexity. Skim's key observation is that websites enforce stable URL patterns, answer formats, and task-to-trajectory mappings across queries of the same type, so most queries can bypass these heavyweight components entirely. An offline profiler captures these patterns once per site. At runtime, Skim matches each query to a template, synthesizes the destination URL, and extracts the answer with a small model. A lightweight verifier gates each fast-path output against the query and schema; rare misspeculations cascade to the full agent, warm-started by the fast path's final URL to preserve upstream trajectory progress. Across standard web-agent benchmarks paired with three backbone agents (WebVoyager, AgentOccam, BrowserUse), Skim reduces median per-task cost by 1.9x and latency by 33.4% with no accuracy loss.
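The abstract says the verifier gates each output "against the query and schema" without spelling out the checks. One plausible reading, sketched below, combines a format check on the answer with two grounding checks against the fetched page. Everything here is an assumption made for illustration: the `AnswerSchema` class, the regex-based schema, the `verify_fast_path` name, and the choice of checks.

```python
import re
from dataclasses import dataclass


@dataclass
class AnswerSchema:
    """Hypothetical extraction schema: the expected answer format for a query type."""
    pattern: str  # regular expression the whole answer must match


def verify_fast_path(query_terms: list[str], page_text: str,
                     answer: str, schema: AnswerSchema) -> bool:
    """Accept only if the answer fits the schema and the page fits the query."""
    candidate = answer.strip()
    # Schema check: the answer has the expected shape (for example, a price).
    if re.fullmatch(schema.pattern, candidate) is None:
        return False
    # Grounding check: the answer text must actually appear on the fetched page.
    if candidate not in page_text:
        return False
    # Query check: every key term of the query must appear on the page.
    lowered = page_text.lower()
    return all(term.lower() in lowered for term in query_terms)


# Hypothetical schema and page content.
price_schema = AnswerSchema(r"\$\d+(\.\d{2})?")
page = "USB Cable 2m, braided. Price: $7.99. In stock."
```

All three checks are string operations, so the gate stays far cheaper than the frontier-model call it is meant to avoid. A rejection simply hands the task to the full agent.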

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Skim shows how offline profiling can speed up web agents on structured sites, but gains depend on how well the templates match real queries.

The key takeaway from this paper is that Skim uses offline profiling of websites to enable a fast speculative path for many web agent tasks, delivering 1.9 times lower cost and 33 percent less latency without hurting accuracy on the tested setups. What is new here is the specific framework: an offline profiler that extracts templates for URLs and answers, a lightweight verifier to gate the output, and a cascade to the full agent that reuses the fast path's final URL to keep trajectory progress. This is more than a simple extension of ReAct-style agents.

The work does well in its evaluation. It pairs the method with three backbone agents on standard benchmarks and reports consistent median gains. The zero-accuracy-loss claim is notable if the measurement is robust.

Where it is softer is in the load-bearing assumption about stable patterns on purpose-built sites. If queries of the same type have more variation than the profiler accounts for, the verifier will reject more often. That would push more work to the full agent and reduce the reported speedups. The paper would be stronger with numbers on how many queries match the templates and what query diversity was used in testing.

This is for researchers and engineers building web agents who want to cut down on expensive model calls and browser renders. A reader focused on practical deployment and scaling would find value in the optimization approach. The paper shows clear thinking on the problem and enough experimental grounding to deserve serious referee time. I recommend accepting it for peer review, with the expectation that reviewers will probe the robustness of the offline profiling step.

Referee Report

2 major / 0 minor

Summary. The paper introduces Skim, a speculative execution framework for web agents that exploits stable URL patterns, answer formats, and task-to-trajectory mappings on purpose-built websites. An offline profiler extracts reusable templates per site; at runtime a query is matched to a template, a destination URL is synthesized, and the answer is extracted with a small model. A lightweight verifier gates the fast-path output; misses cascade to the full agent (warm-started from the fast-path URL). Across standard web-agent benchmarks with three backbone agents (WebVoyager, AgentOccam, BrowserUse), the method is reported to reduce median per-task cost by 1.9× and latency by 33.4% with no accuracy loss.

Significance. If the empirical claims hold under rigorous controls, Skim would demonstrate a practical way to amortize the dominant costs of frontier-model inference and browser rendering in web agents by exploiting site-specific predictability. The combination of offline profiling, lightweight verification, and warm-started fallback is a concrete instance of speculative execution applied to agent trajectories and could influence efficiency techniques in other long-horizon agent settings.

major comments (2)

Abstract: the central claim of 'no accuracy loss' is load-bearing for the performance results, yet the manuscript supplies no description of the accuracy metric, how it was computed across the three agents, the number of tasks or query variants per site, or any statistical significance testing; without these controls the zero-loss statement cannot be evaluated.
The load-bearing assumption that purpose-built sites exhibit stable URL patterns and task-to-trajectory mappings reusable across query variations is stated in the abstract but is not accompanied by quantitative evidence (template coverage, match rate, or number of distinct query variants used to build each profile); if match rates are low the reported median gains would be eroded by frequent fallbacks.
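The second comment's concern, that low match rates erode the gains, can be illustrated with a simple expected-cost model. This is a back-of-the-envelope sketch, not an analysis from the paper: the function, its parameters, and the example numbers are hypothetical, and it models the mean cost, whereas the paper reports medians.

```python
def expected_cost(match_rate: float, accept_rate: float,
                  fast_cost: float, full_cost: float, warm_cost: float) -> float:
    """Mean per-task cost of a speculate-then-fallback pipeline.

    match_rate:  fraction of queries that match some template
    accept_rate: fraction of matched queries whose fast-path output is accepted
    fast_cost:   cost of one fast-path attempt (paid by every matched query)
    full_cost:   cost of the full agent from a cold start
    warm_cost:   cost of the full agent warm-started at the fast-path URL
    """
    unmatched = (1.0 - match_rate) * full_cost
    accepted = match_rate * accept_rate * fast_cost
    # Rejected speculations pay for the wasted fast path plus the warm fallback.
    rejected = match_rate * (1.0 - accept_rate) * (fast_cost + warm_cost)
    return unmatched + accepted + rejected


# Illustrative numbers only, in units where the cold full agent costs 1.0.
high_match = expected_cost(0.9, 0.9, fast_cost=0.05, full_cost=1.0, warm_cost=0.6)
low_match = expected_cost(0.3, 0.9, fast_cost=0.05, full_cost=1.0, warm_cost=0.6)
```

With these made-up inputs the mean cost is about 0.20 at a 90 percent match rate but about 0.73 at 30 percent. That gap is why the referee asks for template coverage and match rates to be reported alongside the aggregate gains.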

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of our work's significance and for the constructive major comments. We address each point below and will revise the manuscript to provide the requested details and evidence.

Referee: Abstract: the central claim of 'no accuracy loss' is load-bearing for the performance results, yet the manuscript supplies no description of the accuracy metric, how it was computed across the three agents, the number of tasks or query variants per site, or any statistical significance testing; without these controls the zero-loss statement cannot be evaluated.

Authors: We agree that a clearer description of the accuracy evaluation is needed to support the central claim. Accuracy is measured as task success rate against benchmark ground truth (exact match on the final answer or equivalent). This metric was applied uniformly to all three backbone agents on the standard web-agent benchmarks. In the revised manuscript we will add an explicit subsection in Experiments describing the metric definition, computation procedure, the number of tasks and query variants per site, and statistical significance testing (e.g., bootstrap confidence intervals or paired tests confirming no significant difference). These additions will allow rigorous evaluation of the reported zero accuracy loss. revision: yes
Referee: The load-bearing assumption that purpose-built sites exhibit stable URL patterns and task-to-trajectory mappings reusable across query variations is stated in the abstract but is not accompanied by quantitative evidence (template coverage, match rate, or number of distinct query variants used to build each profile); if match rates are low the reported median gains would be eroded by frequent fallbacks.

Authors: We acknowledge that quantitative evidence for the stability and coverage assumptions should be presented explicitly. The offline profiler constructs templates from multiple query variants per site, and our runtime results indicate high match rates. In the revision we will add a new table and accompanying text reporting template coverage across the benchmark sites, observed runtime match rates (fraction of queries routed to the fast path), and the number of distinct query variants used to build each profile. We will also quantify the impact of fallbacks on the aggregate cost and latency figures to confirm that the reported 1.9× median cost reduction and 33.4% latency improvement are not eroded. revision: yes
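The first response mentions bootstrap confidence intervals as one way to back the no-accuracy-loss claim. A minimal paired percentile bootstrap might look like the sketch below. The function name, defaults, and the 0/1 outcome encoding are assumptions made here, and the rebuttal does not say which procedure the authors would actually use.

```python
import random


def paired_bootstrap_ci(baseline: list[int], treated: list[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean per-task success difference.

    baseline[i] and treated[i] are 0/1 outcomes for the same task i, so
    tasks are resampled as pairs to keep the pairing intact.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two equal-length, non-empty outcome lists")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    # Each resample draws n per-task differences with replacement.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

If the interval for (Skim minus baseline) contains zero and is narrow, that supports "no significant difference". A wide interval would mean the benchmark is too small to rule out a real accuracy drop.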

Circularity Check

0 steps flagged

No significant circularity; results are empirical measurements

The paper describes a speculative execution system whose performance claims (1.9x cost reduction, 33.4% latency reduction, zero accuracy loss) are presented as outcomes of benchmark experiments on WebVoyager, AgentOccam, and BrowserUse rather than quantities derived from internal fitted parameters or self-referential definitions. The method relies on an offline profiler capturing site-specific patterns and a runtime verifier with fallback, but these are engineering choices whose effectiveness is evaluated externally on standard benchmarks; no equations, uniqueness theorems, or ansatzes are shown to reduce to the inputs by construction. The load-bearing assumption about URL and trajectory stability is an empirical hypothesis tested by the experiments, not a tautology that forces the reported medians.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that website structures are stable enough to profile once and match reliably; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Websites enforce stable URL patterns, answer formats, and task-to-trajectory mappings across queries of the same type.
This stability is invoked to justify the offline profiler and fast-path synthesis.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

Skim matches each query to a template, synthesizes the destination URL, and extracts the answer with a small model. A lightweight verifier gates each fast-path output

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 13 internal anchors

[1] Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, and Shivaram Venkataraman. What limits agentic systems efficiency? In SEA @ NeurIPS 2025 Workshop, 2025.

[2] Islem Bouzenia and Michael Pradel. You name it, i run it: An llm agent to execute tests of arbitrary projects. ISSTA 2025, 2024.

[3] browser-use. browser-use. https://github.com/browser-use/browser-use, 2026. Open-source browser agent framework. Accessed 2026-05-14.

[4] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.

[5] Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.

[6] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[7] Masafumi Enomoto, Ryoma Obara, Haochen Zhang, and Masafumi Oyamada. Read more, think more: Revisiting observation reduction for web agents. arXiv preprint arXiv:2604.01535, 2026.

[8] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents. In International Conference on Learning Representations (ICLR), 2025.

[9] Yilin Guan, Qingfeng Lan, Sun Fei, Dujian Ding, Devang Acharya, Chi Wang, William Yang Wang, and Wenyue Hua. Dynamic speculative agent planning. arXiv preprint arXiv:2509.01920, 2025.

[10] Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, and Ranjay Krishna. Molmoweb: Open visual web agent and open data for the open web. arXiv preprint arXiv:2604.08516, 2026.

[11] Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. In International Conference on Learning Representations (ICLR), 2024.

[12] Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding html with large language models. arXiv preprint arXiv:2210.03945, 2022.

[13] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024.

[14] Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Odysseys: Benchmarking web agents on realistic long horizon tasks. arXiv preprint arXiv:2604.24964, 2026.

[15] Sheng Jia, Jamie Kiros, and Jimmy Ba. Dom-q-net: Grounded rl on structured language. In International Conference on Learning Representations (ICLR), 2019.

[16] Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15–68, 2000.

[17] Donguk Kwon and Dongha Lee. Region4web: Rethinking observation space granularity for web agents. arXiv preprint arXiv:2605.07134, 2026.

[18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, page 611–626, New York, NY, USA, 2023. Association for Computing Machinery.

[19] J. Liang. Caesar: Deep agentic web exploration for creative answer synthesis. arXiv, 2026.

[20] Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. arXiv preprint arXiv:2411.17465, 2024.

[21] Xing Han Lù, Zdeněk Kasner, and Siva Reddy. Weblinx: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024.

[22] Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, Lianyong Qi, and Shi Jin. Contractskill: Repairable contract-based skills for multimodal web agents. arXiv preprint arXiv:2603.20340, 2026.

[23] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023.

[24] Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. Swt-bench: Testing and validating real-world bug-fixes with code agents. NeurIPS, 2024.

[25] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. In arXiv preprint ar...

[26] Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. From pixels to ui actions: Learning to follow instructions via graphical user interfaces. arXiv preprint arXiv:2306.00245, 2023.

[27] Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao, and Huaxiu Yao. Webxskill: Skill learning for autonomous web agents. arXiv preprint arXiv:2604.13318, 2026.

[28] Yong Wu, Yanzhao Zheng, Tianze Xu, ZhenTao Zhang, YuanQiang Yu, JiHuai Zhu, Chao Ma, BinBin Lin, Baohua Dong, Hangcheng Zhu, Ruohui Huang, and Gang Yu. Contextbudget: Budget-aware context management for long-horizon search agents. arXiv preprint arXiv:2604.01664, 2026.

[29] Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik Chaudhari, George Karypis, and Huzefa Rangwala. Agentoccam: A simple yet strong baseline for llm-based web agents. arXiv preprint arXiv:2410.13825, 2024.

[30] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[31] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.

[32] Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, and Tianyi Peng. Speculative actions: A lossless framework for faster agentic systems. arXiv preprint arXiv:2510.04371, 2025.

[33] Jiayuan Zhang, Kaiquan Chen, Zhihao Lu, Enshen Zhou, Qian Yu, and Jing Zhang. Prune4web: Dom tree pruning programming for web agent. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026.

[34] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.

[35] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.

[1] [1]

What limits agentic systems efficiency? In SEA @ NeurIPS 2025 Workshop, 2025

Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, and Shivaram Venkataraman. What limits agentic systems efficiency? In SEA @ NeurIPS 2025 Workshop, 2025

work page 2025

[2] [2]

You name it, i run it: An llm agent to execute tests of arbitrary projects.ISSTA 2025, 2024

Islem Bouzenia and Michael Pradel. You name it, i run it: An llm agent to execute tests of arbitrary projects.ISSTA 2025, 2024

work page 2025

[3] [3]

browser-use.https://github.com/browser-use/browser- use, 2026

browser-use. browser-use.https://github.com/browser-use/browser- use, 2026. Open-source browser agent framework. Accessed 2026-05- 14

work page 2026

[4] [4]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large lan- guage model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023

[7] [7]

Read more, think more: Revisiting observation reduction for web agents.arXiv preprint arXiv:2604.01535, 2026

Masafumi Enomoto, Ryoma Obara, Haochen Zhang, and Masafumi Oyamada. Read more, think more: Revisiting observation reduction for web agents.arXiv preprint arXiv:2604.01535, 2026

work page arXiv 2026

[8] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In International Conference on Learning Representations (ICLR), 2025.

[9] Yilin Guan, Qingfeng Lan, Sun Fei, Dujian Ding, Devang Acharya, Chi Wang, William Yang Wang, and Wenyue Hua. Dynamic speculative agent planning. arXiv preprint arXiv:2509.01920, 2025.

[10] Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, and Ranjay Krishna. MolmoWeb: Open visual web agent and open data for the open web. arXiv preprint arXiv:2604.08516, 2026.

[11] Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world WebAgent with planning, long context understanding, and program synthesis. In International Conference on Learning Representations (ICLR), 2024.

[12] Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding HTML with large language models. arXiv preprint arXiv:2210.03945, 2022.

[13] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. WebVoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024.

[14] Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Odysseys: Benchmarking web agents on realistic long horizon tasks. arXiv preprint arXiv:2604.24964, 2026.

[15] Sheng Jia, Jamie Kiros, and Jimmy Ba. DOM-Q-NET: Grounded RL on structured language. In International Conference on Learning Representations (ICLR), 2019.

[16] Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15–68, 2000.

[17] Donguk Kwon and Dongha Lee. Region4Web: Rethinking observation space granularity for web agents. arXiv preprint arXiv:2605.07134, 2026.

[18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY, USA, 2023. Association for Computing Machinery.

[19] J. Liang. Caesar: Deep agentic web exploration for creative answer synthesis. arXiv, 2026.

[20] Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One vision-language-action model for GUI visual agent. arXiv preprint arXiv:2411.17465, 2024.

[21] Xing Han Lù, Zdeněk Kasner, and Siva Reddy. WebLINX: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024.

[22] Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, Lianyong Qi, and Shi Jin. ContractSkill: Repairable contract-based skills for multimodal web agents. arXiv preprint arXiv:2603.20340, 2026.

[23] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. arXiv preprint arXiv:2311.12983, 2023.

[24] Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. SWT-Bench: Testing and validating real-world bug-fixes with code agents. NeurIPS, 2024.

[25] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback. In arXiv preprint ar...

[26] Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. From pixels to UI actions: Learning to follow instructions via graphical user interfaces. arXiv preprint arXiv:2306.00245, 2023.

[27] Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao, and Huaxiu Yao. WebXSkill: Skill learning for autonomous web agents. arXiv preprint arXiv:2604.13318, 2026.

[28] Yong Wu, Yanzhao Zheng, Tianze Xu, ZhenTao Zhang, YuanQiang Yu, JiHuai Zhu, Chao Ma, BinBin Lin, Baohua Dong, Hangcheng Zhu, Ruohui Huang, and Gang Yu. Contextbudget: Budget-aware context management for long-horizon search agents. arXiv preprint arXiv:2604.01664, 2026.

[29] Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik Chaudhari, George Karypis, and Huzefa Rangwala. AgentOccam: A simple yet strong baseline for LLM-based web agents. arXiv preprint arXiv:2410.13825, 2024.

[30] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[31] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.

[32] Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, and Tianyi Peng. Speculative actions: A lossless framework for faster agentic systems. arXiv preprint arXiv:2510.04371, 2025.

[33] Jiayuan Zhang, Kaiquan Chen, Zhihao Lu, Enshen Zhou, Qian Yu, and Jing Zhang. Prune4Web: DOM tree pruning programming for web agent. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026.

[34] Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. GPT-4V(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.

[35] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.