arxiv: 2605.22219 · v1 · pith:PWIZRP5Enew · submitted 2026-05-21 · 💻 cs.AI

SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

Pith reviewed 2026-05-22 05:38 UTC · model grok-4.3

classification 💻 cs.AI
keywords: state-gated retrieval, web agents, LLM agents, information retrieval, agent benchmarks, specialized websites, structured data retrieval

The pith

Search agents reach relevant sites but set the wrong filters and scopes on specialized data websites

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines state-gated retrieval as the need to configure site-specific states like filters, views or hierarchies before answer evidence becomes accessible on specialized websites. It presents SGR-Bench, a set of 100 expert-curated tasks across six source families, and tests eight CLI-based LLM agents plus three commercial products. The strongest system reaches only 66.18 percent item-level F1, with row-level F1 markedly lower. Audit of failed trajectories shows agents typically locate a relevant source yet establish an incorrect retrieval state, with scope drift and criterion mismatch as the leading error types. The benchmark supplies both constraint-guided and goal-oriented versions of each task to isolate the effect of guidance on state configuration.

Core claim

SGR-Bench shows that current agentic LLM systems reach only 66.18 percent item-level F1 on tasks that require discovering a website and then correctly configuring its site-specific retrieval state; manual analysis of 156 failed trajectories finds that retrieval-scope drift accounts for 37.2 percent and criterion mismatch for 27.6 percent of errors, while final answer composition accounts for just 10.3 percent.
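
As a rough arithmetic check on these shares, the sketch below converts the reported percentages into approximate trajectory counts out of the 156 audited failures and computes the share left for all other error types. The counts are back-computed here and are not reported on this page, and the variable names are illustrative.

```python
# Shares of the 156 audited failed trajectories, as reported in the abstract.
TOTAL_AUDITED = 156
reported_share_pct = {
    "retrieval_scope_drift": 37.2,
    "criterion_mismatch": 27.6,
    "final_answer_composition": 10.3,
}

# Back-computed counts: an assumption, since the source gives percentages only.
approx_counts = {
    name: round(TOTAL_AUDITED * pct / 100)
    for name, pct in reported_share_pct.items()
}

# Share left over for every error type not named in the summary.
other_share_pct = round(100 - sum(reported_share_pct.values()), 1)

for name, count in approx_counts.items():
    print(f"{name}: ~{count} of {TOTAL_AUDITED}")
print(f"other error types: {other_share_pct}%")
```

Under this reading, the two state-related categories together cover roughly two thirds of the audited failures, and about a quarter falls into categories the summary does not name.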

What carries the argument

State-gated retrieval (SGR): the requirement to establish site-specific retrieval states through filters, views, hierarchies or scopes before structured evidence can be retrieved from specialized data websites.
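
A toy illustration of this definition may help. The sketch below is not taken from the paper: the rows, the `RetrievalState` class and the `visible_rows` function are all hypothetical, and they only show how the same site can expose different evidence depending on the state that has been set.

```python
from dataclasses import dataclass

# Hypothetical rows on a toy "specialized data site"; all values are invented.
ROWS = [
    {"species": "A", "region": "north", "year": 2020, "count": 12},
    {"species": "A", "region": "south", "year": 2020, "count": 7},
    {"species": "B", "region": "north", "year": 2021, "count": 3},
]


@dataclass(frozen=True)
class RetrievalState:
    """Site-specific state that gates which rows are visible (hypothetical)."""
    region: str
    year: int


def visible_rows(state: RetrievalState) -> list:
    """Return only the rows the toy site would show under the given state."""
    return [
        row for row in ROWS
        if row["region"] == state.region and row["year"] == state.year
    ]


correct = RetrievalState(region="north", year=2020)
drifted = RetrievalState(region="south", year=2020)  # right site, wrong scope

print(visible_rows(correct))
print(visible_rows(drifted))
```

Both calls hit the same source, but only the first state exposes the row the task needs, which is the situation the failure audit describes.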

If this is right

  • Explicit constraint guidance in task prompts can be directly compared against implicit goal-oriented prompts for its effect on successful state establishment.
  • Row-level F1 being substantially lower than item-level F1 indicates that agents which get the retrieval state partly right often still fail to produce complete structured rows.
  • The dominant error categories of scope drift and criterion mismatch point to the need for agents that can monitor and correct their current retrieval state during navigation.
  • Performance on SGR-Bench exposes limitations in current tool-using LLMs that standard open-web search benchmarks do not capture.
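
The gap between the two scores can be illustrated with a small sketch. The paper's exact metric definitions are not reproduced on this page, so the code assumes that item-level F1 scores each cell separately while row-level F1 counts a row only when every cell matches. The data and the helper names (`f1`, `items`) are invented for illustration.

```python
def f1(predicted: set, gold: set) -> float:
    """Set-based F1: harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented rows of the form (name, year, value).
gold_rows = [("alpha", 2020, 10), ("beta", 2021, 20)]
pred_rows = [("alpha", 2020, 10), ("beta", 2021, 25)]  # one wrong cell


def items(rows):
    # Assumed item-level view: each cell is scored on its own,
    # keyed by the row's first column and the cell's column index.
    return {(row[0], col, cell) for row in rows for col, cell in enumerate(row)}


item_f1 = f1(items(pred_rows), items(gold_rows))  # 5 of 6 cells match
row_f1 = f1(set(pred_rows), set(gold_rows))       # 1 of 2 rows match

print(f"item-level F1: {item_f1:.3f}")
print(f"row-level F1:  {row_f1:.3f}")
```

Under these assumed definitions a single wrong cell costs one item out of six but one whole row out of two, so row-level F1 falls much faster than item-level F1.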

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future agents could incorporate an explicit state-tracking module that logs active filters and scopes separately from general planning steps.
  • The benchmark tasks could serve as training data for fine-tuning models to recognize and correct retrieval-state mismatches on the fly.
  • Similar state-establishment problems are likely to appear in non-web settings such as database query interfaces or multi-parameter API calls that require sequential configuration.
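
The state-tracking idea in the first bullet could be sketched as follows. This is a hypothetical design, not something the paper proposes or evaluates: the `StateTracker` class, its fields and the example controls are assumptions, and it presumes the required state is known in advance, which may only hold for constraint-guided tasks.

```python
from dataclasses import dataclass, field


@dataclass
class StateTracker:
    """Hypothetical log of the retrieval state an agent has set on a site."""
    required: dict  # state the task calls for, e.g. {"year": 2020}
    active: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def set_control(self, name: str, value) -> None:
        """Record a filter, view or scope change, separate from planning steps."""
        self.active[name] = value
        self.history.append((name, value))

    def mismatches(self) -> dict:
        """Map each mismatched control to (active value, required value)."""
        return {
            name: (self.active.get(name), wanted)
            for name, wanted in self.required.items()
            if self.active.get(name) != wanted
        }


tracker = StateTracker(required={"region": "north", "year": 2020})
tracker.set_control("region", "north")
tracker.set_control("year", 2021)  # drift: wrong year selected
print(tracker.mismatches())        # {'year': (2021, 2020)}
```

Checking `mismatches()` before extracting rows would give an agent one concrete place to catch scope drift or a criterion mismatch before composing its answer.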

Load-bearing premise

The 100 expert-curated tasks and the item-level plus row-level F1 metrics are representative of the core difficulties of state-gated retrieval on real specialized websites.

What would settle it

A new agent system that achieves above 85 percent row-level F1 across the full SGR-Bench suite or on a fresh set of equivalent tasks drawn from the same website families would indicate that state configuration is not the primary remaining bottleneck.

Figures

Figures reproduced from arXiv: 2605.22219 by Haiyang Shen, Mugeng Liu, Ningyuan Li, Sixiong Xie, Yudong Han, Yun Ma, Zhuofan Shi.

Figure 1: Overview of the SGR-BENCH four-stage data curation pipeline. Candidate websites are drawn from Wikipedia external links, prioritized with an LLM, and retained after dual review. Task candidates are drafted from site structure and retrieval controls under a six-requirement design protocol, then filtered through preliminary screening and three rounds of expert validation for answer identifiability, state-gat…
Figure 2: (a) Item-F1 vs. Row-F1 for all systems. All points sit above the diagonal: agents recover …
Figure 3: Item-F1 (%) by model and source family on the 100-task benchmark. Finding 3: Hard sites require keeping several web controls aligned …
Figure 4: Share of each error type within each model’s failures. Models are sorted by retrieval-scope …
Figure 5: Error type distribution per source family. Scholarly and environmental tasks are drift …
Figure 6: Item-F1 and Row-F1 distributions grouped by error type. Criterion mismatch shows the …
Figure 7: Mean Item-F1 and Row-F1 by output cardinality bin across all models. Tasks requiring …
Original abstract

Recent advances in large language models and tool-using agents have expanded the range of benchmarked web tasks. Yet an important class of specialized retrieval tasks remains undercharacterized. On many specialized data-retrieval websites, answer-bearing evidence becomes accessible only after establishing the correct site-specific retrieval state through filters, views, hierarchies, or scopes. We term this capability state-gated retrieval (SGR). We introduce SGR-Bench, a benchmark for this setting containing 100 expert-curated tasks spanning six source families and 12 public data ecosystems. Each task requires discovering the appropriate website and configuring its site-specific retrieval state to produce a structured answer. SGR-Bench pairs constraint-guided and goal-oriented formulations of the same underlying problems, enabling controlled comparisons between explicit and implicit guidance for state-gated retrieval. We evaluate eight CLI-based agentic LLM systems and three commercial search-agent products. On SGR-Bench, the strongest system reaches only 66.18% item-level F1, while row-level F1 remains much lower. A manual audit of 156 analyzable failed CLI trajectories shows why: agents often reach a relevant web source, but establish the wrong site-specific retrieval state. Retrieval-scope drift (37.2%) and criterion mismatch (27.6%) dominate, whereas final answer composition accounts for only 10.3%. The dataset and single-case evaluation instructions are available at https://huggingface.co/datasets/PKUAIWeb/SGR-BENCH.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SGR-Bench, a benchmark for state-gated retrieval (SGR) on specialized websites where evidence is accessible only after configuring site-specific states via filters, views, hierarchies or scopes. It contains 100 expert-curated tasks across six source families and 12 ecosystems, with paired constraint-guided and goal-oriented formulations. Eight CLI-based LLM agents and three commercial products are evaluated; the strongest reaches 66.18% item-level F1 while row-level F1 is substantially lower. A manual audit of 156 failed trajectories attributes most errors to retrieval-scope drift (37.2%) and criterion mismatch (27.6%), with answer composition at only 10.3%. The dataset is released publicly.

Significance. If the tasks and metrics validly isolate state configuration, the work identifies a concrete, previously under-characterized limitation in web agents and supplies reproducible evidence plus failure-mode statistics that can guide targeted improvements. Public dataset release is a clear strength for the field.

major comments (2)
  1. [Task Curation and Evaluation Setup] The central empirical claim—that current agents struggle with state-gated retrieval rather than benchmark-specific artifacts—depends on the 100 tasks adequately sampling state mechanisms across the six families and 12 ecosystems and on the F1 metrics isolating state-setting success. The manuscript provides no coverage statistics, selection criteria, or validation that answers are inaccessible under incorrect states (see abstract and §3).
  2. [Results] Table or figure reporting per-ecosystem or per-state-type performance is absent; without it, the aggregate 66.18% item-level F1 cannot be assessed for whether failures concentrate on particular state types or are uniformly distributed.
minor comments (2)
  1. [Evaluation Metrics] Clarify the exact definition of 'item-level F1' versus 'row-level F1' with a short example in the metrics subsection; the distinction is central to interpreting the gap between the two scores.
  2. [Failure Analysis] The audit sample of 156 trajectories should state the total number of failures and the selection procedure to allow readers to judge representativeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify how to strengthen the presentation of SGR-Bench. We respond to each major comment below and commit to the indicated revisions.

Point-by-point responses
  1. Referee: [Task Curation and Evaluation Setup] The central empirical claim—that current agents struggle with state-gated retrieval rather than benchmark-specific artifacts—depends on the 100 tasks adequately sampling state mechanisms across the six families and 12 ecosystems and on the F1 metrics isolating state-setting success. The manuscript provides no coverage statistics, selection criteria, or validation that answers are inaccessible under incorrect states (see abstract and §3).

    Authors: We agree that explicit documentation of curation details would improve transparency and help readers assess sampling adequacy. In the revised manuscript we will expand §3 with (i) a table or paragraph reporting the distribution of the 100 tasks across the six source families and 12 ecosystems and (ii) a concise description of the expert curation protocol and selection criteria. We will also add a short validation statement confirming that, for each task, the expert curators verified that the target evidence is inaccessible under incorrect state configurations; this directly supports the claim that the reported F1 scores isolate state-setting success rather than benchmark artifacts.
    revision: yes

  2. Referee: [Results] Table or figure reporting per-ecosystem or per-state-type performance is absent; without it, the aggregate 66.18% item-level F1 cannot be assessed for whether failures concentrate on particular state types or are uniformly distributed.

    Authors: We acknowledge that aggregate metrics alone limit interpretability. We will add a new table (or supplementary figure) in the results section that reports item-level F1 broken down by ecosystem and by state mechanism type (filters, views, hierarchies, scopes). This disaggregation will allow readers to determine whether the observed performance is uniform or driven by particular state types.
    revision: yes

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivations or fitted predictions

This is a benchmark introduction and evaluation paper. It defines state-gated retrieval, curates 100 tasks across source families, evaluates agent systems on item/row-level F1, and audits failure modes. No equations, no parameter fitting, no predictions derived from inputs, and no self-citation chains that support the central claims. Performance numbers are measured outcomes on held-out tasks rather than being defined in terms of the evaluation itself, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

No mathematical derivations or fitted constants appear. The central contribution rests on the assumption that the curated tasks capture a meaningful and previously unbenchmarked capability.

axioms (1)
  • domain assumption: Expert-curated tasks across six source families adequately represent state-gated retrieval in public data ecosystems.
    Stated in the abstract description of benchmark construction.
invented entities (1)
  • state-gated retrieval (SGR): no independent evidence
    purpose: To name and isolate the capability of configuring site-specific retrieval states before evidence becomes accessible.
    Introduced as a new term for an undercharacterized class of tasks.

pith-pipeline@v0.9.0 · 5811 in / 1295 out tokens · 32941 ms · 2026-05-22T05:38:29.351080+00:00 · methodology
