pith. sign in

arxiv: 2605.16565 · v2 · pith:RL6AN4H5new · submitted 2026-05-15 · 💻 cs.AI · cs.OS

Skim: Speculative Execution for Fast and Efficient Web Agents

Pith reviewed 2026-05-20 18:23 UTC · model grok-4.3

classification 💻 cs.AI cs.OS
keywords web agentsspeculative executioncost reductionlatency optimizationbrowser agentsAI agentstask automation
0
0 comments X

The pith

Skim lets web agents skip most expensive steps on predictable sites by matching queries to pre-captured patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current web agents apply full frontier-model inference, browser rendering, and planning to every step even when purpose-built sites follow stable structures. Skim profiles these patterns once offline per site and then uses a lightweight match to synthesize the target URL and pull the answer with a small model. A verifier accepts the fast output or triggers the original agent only on rare misses, starting it from the already-reached URL. This approach matters because it directly lowers the per-task expense and time without changing final accuracy on existing benchmarks and agent backbones.

Core claim

Skim is a speculative execution framework for web agents that exploits the predictable structure of purpose-built websites. An offline profiler captures stable URL patterns, answer formats, and task-to-trajectory mappings once per site. At runtime Skim matches each incoming query to a template, builds the destination URL, and extracts the answer with a small model. A lightweight verifier checks the result against the query and schema; misspeculations fall back to the full agent warm-started from the fast-path URL so upstream work is not lost.

What carries the argument

Offline profiler that records stable URL patterns and task mappings so runtime template matching plus small-model extraction can replace full agent steps, with a verifier to handle rare fallbacks.

If this is right

  • Median per-task cost falls by 1.9x and latency by 33.4 percent across standard web-agent benchmarks.
  • Accuracy stays identical when Skim is paired with backbones such as WebVoyager, AgentOccam, or BrowserUse.
  • Most queries avoid frontier-model inference and full ReAct planning by using the fast path.
  • Misspeculations still preserve trajectory progress because the full agent starts from the fast-path URL.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same offline-profile idea could be tried on other structured interfaces such as REST APIs or internal tools that also follow repeatable patterns.
  • Periodic re-profiling would be needed as sites evolve, turning the one-time capture into an ongoing maintenance task.
  • Widespread adoption might push websites toward more consistent designs that further improve agent efficiency.

Load-bearing premise

Purpose-built websites keep stable URL patterns, answer formats, and task-to-trajectory mappings across similar queries so that an offline profiler can capture them once and reuse them reliably.

What would settle it

Run the same benchmarks on sites whose layouts or URL schemes change frequently between profiler runs and check whether the 1.9x cost and 33.4 percent latency gains disappear or accuracy drops.

Figures

Figures reproduced from arXiv: 2605.16565 by Kevin Hsieh, Mike Wong, Ravi Netravali, Suman Nath.

Figure 2
Figure 2. Figure 2: Distribution of number of ReAct steps needed for agents to converge to an answer. 0 1 10 100 Latency (s) 0.0 0.2 0.4 0.6 0.8 1.0 CDF Actions Inference [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of the percentage of steps per task that are navigational, i.e., moving the agent between pages rather than directly satisfying a task requirement by extracting an answer, comparing values, modifying page state. that synthesize answers from search-engine results without visiting the underlying pages [19]. The defining feature of a web agent is that it operates against the live web, and must re… view at source ↗
Figure 6
Figure 6. Figure 6: Latencies of hand￾crafted optimized programs. Allrecipes Amazon Apple ArXiv BBC News Booking Coursera GitHub WebShop 0 Cost ($) Optimized ReAct agent [PITH_FULL_IMAGE:figures/full_fig_p003_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Fraction of tasks per site solvable via HTTP-only execution. Many sites support a large percentage of tasks through direct retrieval without browser interaction. pages rather than to directly satisfy a task requirement – rather than load-bearing reasoning [PITH_FULL_IMAGE:figures/full_fig_p003_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: CDF of navigational elements (links, buttons, form inputs) per page across ReAct trajectories. The median page presents 212+ navigational elements, each leading to a different page. 2 4 6 8 Unique URLs per task 0.0 0.2 0.4 0.6 0.8 1.0 CDF [PITH_FULL_IMAGE:figures/full_fig_p004_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: CDF over tasks of the percentage of steps re￾quiring each compute tier as the weakest tier sufficient to match the full ReAct agent. Tiers combine page-fetch (HTTP, browser, or browser w/ screenshot) with a model (Qwen2.5- 14B or GPT-4o). C1: Determining what can be specialized. Before exe￾cution begins, the agent must determine which parts of an incoming task support specialization, i.e., which trajector… view at source ↗
Figure 13
Figure 13. Figure 13: Skim end-to-end. Offline (left): a profiling pipeline encodes per-site URL templates, search behavior, and extraction schemas into a reusable site profile. Online (right): tasks attempt direct addressing through template-driven URL synthesis, plain HTTP fetch, and lightweight extraction. Output is verified before commitment; on rejection, the system cascades to a heavier execution tier resuming at the URL… view at source ↗
Figure 14
Figure 14. Figure 14: Distribution of the number of unique action-type prefixes across tasks on a site, for the first 1, 2, and 3 actions of each task. Prefixes consider action types (e.g., click, type), ignoring parameters like element identifiers or extracted content. vs direct lookup, with which filters and sort orders applied) and extracting parameter values to slot in (query keywords, identifiers, filter values). On Amazo… view at source ↗
Figure 15
Figure 15. Figure 15: End-to-end task latencies for WebVoyager (left), AgentOccam (middle), and BrowserUse (right) across all tasks. 0.010 0.100 Cost per task ($) 0.0 0.2 0.4 0.6 0.8 1.0 CDF WebVoyager Skim 0.01 0.10 1.00 Cost per task ($) 0.0 0.2 0.4 0.6 0.8 1.0 CDF AgentOccam Skim 0.010 0.100 Cost per task ($) 0.0 0.2 0.4 0.6 0.8 1.0 CDF BrowserUse Skim [PITH_FULL_IMAGE:figures/full_fig_p010_15.png] view at source ↗
Figure 17
Figure 17. Figure 17: Warm start sav￾ings for WebVoyager. 1 10 100 1000 Latency (s) 0.0 0.2 0.4 0.6 0.8 1.0 CDF Warm Starts Cold Starts [PITH_FULL_IMAGE:figures/full_fig_p011_17.png] view at source ↗
Figure 21
Figure 21. Figure 21: Offline profiling latency per site. useful navigational progress through warm-start URLs, al￾lowing the downstream agent to resume near the relevant destination page rather than restarting from the homepage. Per-task cost follows the same asymmetry: fast-path tasks are dominated by inexpensive local-model inference and HTTP execution, while cascaded tasks inherit the frontier-model cost profile of full br… view at source ↗
read the original abstract

Skim is a speculative execution framework for web agents that exploits the predictable structure of purpose-built websites. Today's web-agent expense is not intrinsic to the tasks but a property of how agents are composed: frontier-model inference, browser rendering, and ReAct-style planning are applied to every step of every task regardless of complexity. Skim's key observation is that websites enforce stable URL patterns, answer formats, and task-to-trajectory mappings across queries of the same type, so most queries can bypass these heavyweight components entirely. An offline profiler captures these patterns once per site. At runtime, Skim matches each query to a template, synthesizes the destination URL, and extracts the answer with a small model. A lightweight verifier gates each fast-path output against the query and schema; rare misspeculations cascade to the full agent, warm-started by the fast path's final URL to preserve upstream trajectory progress. Across standard web-agent benchmarks paired with three backboneagents (WebVoyager, AgentOccam, BrowserUse), Skim reduces median per-task cost by 1.9x and latency by 33.4% with no accuracy loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Skim, a speculative execution framework for web agents that exploits stable URL patterns, answer formats, and task-to-trajectory mappings on purpose-built websites. An offline profiler extracts reusable templates per site; at runtime a query is matched to a template, a destination URL is synthesized, and the answer is extracted with a small model. A lightweight verifier gates the fast-path output; misses cascade to the full agent (warm-started from the fast-path URL). Across standard web-agent benchmarks with three backbone agents (WebVoyager, AgentOccam, BrowserUse), the method is reported to reduce median per-task cost by 1.9× and latency by 33.4% with no accuracy loss.

Significance. If the empirical claims hold under rigorous controls, Skim would demonstrate a practical way to amortize the dominant costs of frontier-model inference and browser rendering in web agents by exploiting site-specific predictability. The combination of offline profiling, lightweight verification, and warm-started fallback is a concrete instance of speculative execution applied to agent trajectories and could influence efficiency techniques in other long-horizon agent settings.

major comments (2)
  1. Abstract: the central claim of 'no accuracy loss' is load-bearing for the performance results, yet the manuscript supplies no description of the accuracy metric, how it was computed across the three agents, the number of tasks or query variants per site, or any statistical significance testing; without these controls the zero-loss statement cannot be evaluated.
  2. The load-bearing assumption that purpose-built sites exhibit stable URL patterns and task-to-trajectory mappings reusable across query variations is stated in the abstract but is not accompanied by quantitative evidence (template coverage, match rate, or number of distinct query variants used to build each profile); if match rates are low the reported median gains would be eroded by frequent fallbacks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation of our work's significance and for the constructive major comments. We address each point below and will revise the manuscript to provide the requested details and evidence.

read point-by-point responses
  1. Referee: Abstract: the central claim of 'no accuracy loss' is load-bearing for the performance results, yet the manuscript supplies no description of the accuracy metric, how it was computed across the three agents, the number of tasks or query variants per site, or any statistical significance testing; without these controls the zero-loss statement cannot be evaluated.

    Authors: We agree that a clearer description of the accuracy evaluation is needed to support the central claim. Accuracy is measured as task success rate against benchmark ground truth (exact match on the final answer or equivalent). This metric was applied uniformly to all three backbone agents on the standard web-agent benchmarks. In the revised manuscript we will add an explicit subsection in Experiments describing the metric definition, computation procedure, the number of tasks and query variants per site, and statistical significance testing (e.g., bootstrap confidence intervals or paired tests confirming no significant difference). These additions will allow rigorous evaluation of the reported zero accuracy loss. revision: yes

  2. Referee: The load-bearing assumption that purpose-built sites exhibit stable URL patterns and task-to-trajectory mappings reusable across query variations is stated in the abstract but is not accompanied by quantitative evidence (template coverage, match rate, or number of distinct query variants used to build each profile); if match rates are low the reported median gains would be eroded by frequent fallbacks.

    Authors: We acknowledge that quantitative evidence for the stability and coverage assumptions should be presented explicitly. The offline profiler constructs templates from multiple query variants per site, and our runtime results indicate high match rates. In the revision we will add a new table and accompanying text reporting template coverage across the benchmark sites, observed runtime match rates (fraction of queries routed to the fast path), and the number of distinct query variants used to build each profile. We will also quantify the impact of fallbacks on the aggregate cost and latency figures to confirm that the reported 1.9× median cost reduction and 33.4% latency improvement are not eroded. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results are empirical measurements

full rationale

The paper describes a speculative execution system whose performance claims (1.9x cost reduction, 33.4% latency reduction, zero accuracy loss) are presented as outcomes of benchmark experiments on WebVoyager, AgentOccam, and BrowserUse rather than quantities derived from internal fitted parameters or self-referential definitions. The method relies on an offline profiler capturing site-specific patterns and a runtime verifier with fallback, but these are engineering choices whose effectiveness is evaluated externally on standard benchmarks; no equations, uniqueness theorems, or ansatzes are shown to reduce to the inputs by construction. The load-bearing assumption about URL and trajectory stability is an empirical hypothesis tested by the experiments, not a tautology that forces the reported medians.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the domain assumption that website structures are stable enough to profile once and match reliably; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Websites enforce stable URL patterns, answer formats, and task-to-trajectory mappings across queries of the same type.
    This stability is invoked to justify the offline profiler and fast-path synthesis.

pith-pipeline@v0.9.0 · 5731 in / 1221 out tokens · 104262 ms · 2026-05-20T18:23:30.669456+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 13 internal anchors

  1. [1]

    What limits agentic systems efficiency? In SEA @ NeurIPS 2025 Workshop, 2025

    Song Bian, Minghao Yan, Anand Jayarajan, Gennady Pekhimenko, and Shivaram Venkataraman. What limits agentic systems efficiency? In SEA @ NeurIPS 2025 Workshop, 2025

  2. [2]

    You name it, i run it: An llm agent to execute tests of arbitrary projects.ISSTA 2025, 2024

    Islem Bouzenia and Michael Pradel. You name it, i run it: An llm agent to execute tests of arbitrary projects.ISSTA 2025, 2024

  3. [3]

    browser-use.https://github.com/browser-use/browser- use, 2026

    browser-use. browser-use.https://github.com/browser-use/browser- use, 2026. Open-source browser agent framework. Accessed 2026-05- 14

  4. [4]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large lan- guage model decoding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

  5. [5]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023

  6. [6]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  7. [7]

    Read more, think more: Revisiting observation reduction for web agents.arXiv preprint arXiv:2604.01535, 2026

    Masafumi Enomoto, Ryoma Obara, Haochen Zhang, and Masafumi Oyamada. Read more, think more: Revisiting observation reduction for web agents.arXiv preprint arXiv:2604.01535, 2026

  8. [8]

    Navigating the digital world as humans do: Universal visual grounding for gui agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents. InInternational Conference on Learning Representations (ICLR), 2025

  9. [9]

    Dynamic speculative agent planning.arXiv preprint arXiv:2509.01920, 2025

    Yilin Guan, Qingfeng Lan, Sun Fei, Dujian Ding, Devang Acharya, Chi Wang, William Yang Wang, and Wenyue Hua. Dynamic speculative agent planning.arXiv preprint arXiv:2509.01920, 2025

  10. [10]

    MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, and Ranjay Krishna. Molmoweb: Open visual web agent and open data for the open web.arXiv preprint arXiv:2604.08516, 2026

  11. [11]

    A real-world webagent with planning, long context understanding, and program synthesis

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. In International Conference on Learning Representations (ICLR), 2024

  12. [12]

    Understanding html with large language models

    Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding html with large language models. arXiv preprint arXiv:2210.03945, 2022

  13. [13]

    WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hong- ming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. InarXiv preprint arXiv:2401.13919, 2024

  14. [14]

    Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

    Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, and Ruslan Salakhut- dinov. Odysseys: Benchmarking web agents on realistic long horizon tasks.arXiv preprint arXiv:2604.24964, 2026

  15. [15]

    Dom-q-net: Grounded rl on structured language

    Sheng Jia, Jamie Kiros, and Jimmy Ba. Dom-q-net: Grounded rl on structured language. InInternational Conference on Learning Represen- tations (ICLR), 2019

  16. [16]

    Wrapper induction: Efficiency and expressive- ness.Artificial Intelligence, 118(1-2):15–68, 2000

    Nicholas Kushmerick. Wrapper induction: Efficiency and expressive- ness.Artificial Intelligence, 118(1-2):15–68, 2000

  17. [17]

    Region4Web: Rethinking Observation Space Granularity for Web Agents

    Donguk Kwon and Dongha Lee. Region4web: Rethinking observation space granularity for web agents.arXiv preprint arXiv:2605.07134, 2026

  18. [18]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 611–626, New York, NY, USA, 2023. Association for Computing Machinery

  19. [19]

    J. Liang. Caesar: Deep agentic web exploration for creative answer synthesis.arXiv, 2026

  20. [20]

    Showui: One vision-language-action model for gui visual agent.arXiv preprint arXiv:2411.17465, 2024

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent.arXiv preprint arXiv:2411.17465, 2024

  21. [21]

    Weblinx: Real-world website navigation with multi-turn dialogue,

    Xing Han Lù, Zdeněk Kasner, and Siva Reddy. Weblinx: Real- world website navigation with multi-turn dialogue.arXiv preprint arXiv:2402.05930, 2024

  22. [22]

    ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents

    Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, Lianyong Qi, and Shi Jin. Contractskill: Repairable contract-based skills for multimodal web agents.arXiv preprint arXiv:2603.20340, 2026

  23. [23]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assis- tants.arXiv preprint arXiv:2311.12983, 2023

  24. [24]

    Swt-bench: Testing and validating real-world bug-fixes with code agents.NeurIPS, 2024

    Niels Mündler, Mark Niklas Müller, Jingxuan He, and Martin Vechev. Swt-bench: Testing and validating real-world bug-fixes with code agents.NeurIPS, 2024

  25. [25]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. InarXiv preprint ar...

  26. [26]

    From pixels to ui actions: Learning to follow instructions via graphical user interfaces.arXiv preprint arXiv:2306.00245, 2023

    Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. From pixels to ui actions: Learning to follow instructions via graphical user interfaces.arXiv preprint arXiv:2306.00245, 2023

  27. [27]

    WebXSkill: Skill Learning for Autonomous Web Agents

    Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wen- lin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao, and Huaxiu Yao. Webxskill: Skill learning for autonomous web agents. arXiv preprint arXiv:2604.13318, 2026

  28. [28]

    Contextbudget: Budget-aware context management for long-horizon search agents.arXiv preprint Conference’17, July 2017, Washington, DC, USA M

    Yong Wu, Yanzhao Zheng, Tianze Xu, ZhenTao Zhang, YuanQiang Yu, JiHuai Zhu, Chao Ma, BinBin Lin, Baohua Dong, Hangcheng Zhu, Ruohui Huang, and Gang Yu. Contextbudget: Budget-aware context management for long-horizon search agents.arXiv preprint Conference’17, July 2017, Washington, DC, USA M. Wong et al. arXiv:2604.01664, 2026

  29. [29]

    Agentoccam: A sim- ple yet strong baseline for llm-based web agents.arXiv preprint arXiv:2410.13825, 2024

    Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik Chaud- hari, George Karypis, and Huzefa Rangwala. Agentoccam: A sim- ple yet strong baseline for llm-based web agents.arXiv preprint arXiv:2410.13825, 2024

  30. [30]

    Web- shop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Web- shop: Towards scalable real-world web interaction with grounded language agents. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  31. [31]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Represen- tations (ICLR), 2023

  32. [32]

    Speculative Actions: A Lossless Framework for Faster Agentic Systems

    Naimeng Ye, Arnav Ahuja, Georgios Liargkovas, Yunan Lu, Kostis Kaffes, and Tianyi Peng. Speculative actions: A lossless framework for faster agentic systems.arXiv preprint arXiv:2510.04371, 2025

  33. [33]

    Prune4web: Dom tree pruning programming for web agent

    Jiayuan Zhang, Kaiquan Chen, Zhihao Lu, Enshen Zhou, Qian Yu, and Jing Zhang. Prune4web: Dom tree pruning programming for web agent. InProceedings of the AAAI Conference on Artificial Intelligence, 2026

  34. [34]

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt- 4v(ision) is a generalist web agent, if grounded.arXiv preprint arXiv:2401.01614, 2024

  35. [35]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. InarXiv preprint arXiv:2307.13854, 2023. Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009