SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

Haiyang Shen; Mugeng Liu; Ningyuan Li; Sixiong Xie; Yudong Han; Yun Ma; Zhuofan Shi

arxiv: 2605.22219 · v1 · pith:PWIZRP5Enew · submitted 2026-05-21 · 💻 cs.AI

Ningyuan Li, Haiyang Shen, Mugeng Liu, Yudong Han, Zhuofan Shi, Sixiong Xie, Yun Ma

Pith reviewed 2026-05-22 05:38 UTC · model grok-4.3

classification 💻 cs.AI

keywords: state-gated retrieval, web agents, LLM agents, information retrieval, agent benchmarks, specialized websites, structured data retrieval

The pith

Search agents reach relevant sites but set the wrong filters and scopes on specialized data websites

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines state-gated retrieval as the need to configure site-specific states like filters, views or hierarchies before answer evidence becomes accessible on specialized websites. It presents SGR-Bench, a set of 100 expert-curated tasks across six source families, and tests eight CLI-based LLM agents plus three commercial products. The strongest system reaches only 66.18 percent item-level F1, with row-level F1 markedly lower. Audit of failed trajectories shows agents typically locate a relevant source yet establish an incorrect retrieval state, with scope drift and criterion mismatch as the leading error types. The benchmark supplies both constraint-guided and goal-oriented versions of each task to isolate the effect of guidance on state configuration.

Core claim

SGR-Bench shows that even the strongest evaluated system reaches only 66.18 percent item-level F1 on tasks that require discovering a website and then correctly configuring its site-specific retrieval state. A manual analysis of 156 failed trajectories attributes 37.2 percent of the failures to retrieval-scope drift and 27.6 percent to criterion mismatch, while final answer composition accounts for just 10.3 percent.
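The reported shares can be turned back into approximate trajectory counts. The short sketch below does that arithmetic under one assumption that is not stated on this page: each percentage is a share of the same 156 audited trajectories. The variable names are illustrative only.

```python
# Implied trajectory counts behind the reported error shares.
# Assumption: every percentage is a share of the 156 audited failed trajectories.
AUDITED = 156

reported_shares = {
    "retrieval-scope drift": 37.2,
    "criterion mismatch": 27.6,
    "final answer composition": 10.3,
}


def implied_count(share_percent: float, total: int = AUDITED) -> int:
    """Nearest whole number of trajectories matching a reported percentage."""
    return round(total * share_percent / 100)


implied = {name: implied_count(pct) for name, pct in reported_shares.items()}
remainder = AUDITED - sum(implied.values())

for name, count in implied.items():
    print(f"{name}: about {count} of {AUDITED}")
print(f"error types not itemized here: about {remainder} of {AUDITED}")
```

Under that assumption the three named categories correspond to roughly 58, 43 and 16 trajectories, which leaves about 39 for error types this summary does not break out.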

What carries the argument

State-gated retrieval (SGR): the requirement to establish site-specific retrieval states through filters, views, hierarchies or scopes before structured evidence can be retrieved from specialized data websites.
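A toy model may make the definition concrete. The sketch below is not taken from the paper or its dataset: `ToySite`, `Row`, and the filter names are hypothetical, and the "site" is just an in-memory list. It only illustrates how the same source can show different evidence depending on the state of its controls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    region: str
    year: int
    value: float


class ToySite:
    """Hypothetical data site whose visible rows depend on its filter state."""

    def __init__(self, rows, default_state):
        self._rows = list(rows)
        self.state = dict(default_state)  # the view a visitor lands on

    def set_filter(self, name, value):
        """Change one control, e.g. a year selector or a region dropdown."""
        self.state[name] = value

    def visible_rows(self):
        """Return only the rows that satisfy every active filter."""
        return [
            row for row in self._rows
            if all(getattr(row, name) == value for name, value in self.state.items())
        ]


rows = [Row("North", 2020, 1.5), Row("North", 2023, 2.5), Row("South", 2020, 3.5)]
site = ToySite(rows, default_state={"year": 2023})

# Landing view: the right site, but not the state that exposes the target row.
default_view = site.visible_rows()

# Target state: region North, year 2020.
site.set_filter("region", "North")
site.set_filter("year", 2020)
target_view = site.visible_rows()

print(default_view)
print(target_view)
```

An agent that stops at the landing view has found a relevant source yet reads the 2023 row instead of the 2020 one, which is the kind of failure the benchmark is designed to expose.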

If this is right

Explicit constraint guidance in task prompts can be directly compared against implicit goal-oriented prompts for its effect on successful state establishment.
Row-level F1 being substantially lower than item-level F1 indicates that even partially correct state configuration often fails to produce complete structured outputs.
The dominant error categories of scope drift and criterion mismatch point to the need for agents that can monitor and correct their current retrieval state during navigation.
Performance on SGR-Bench exposes limitations in current tool-using LLMs that standard open-web search benchmarks do not capture.
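This page does not reproduce the paper's metric definitions, so the gap between item-level and row-level F1 is easiest to see with a toy example. The sketch below assumes one plausible reading, which may differ from the paper's: an "item" is a single (row key, column, value) cell, and a "row" counts as correct only if every cell in it matches. All function names are illustrative.

```python
def f1(predicted: set, gold: set) -> float:
    """Set-based F1 between predicted and gold elements."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def as_items(table: dict) -> set:
    """Flatten a table into (row key, column, value) cells."""
    return {(key, col, val) for key, cells in table.items() for col, val in cells.items()}


def as_rows(table: dict) -> set:
    """Represent each row as one hashable unit; any wrong cell breaks the row."""
    return {(key, tuple(sorted(cells.items()))) for key, cells in table.items()}


# Hypothetical gold answer and prediction: one cell of row "B" is wrong.
gold = {"A": {"year": 2020, "count": 5}, "B": {"year": 2020, "count": 7}}
pred = {"A": {"year": 2020, "count": 5}, "B": {"year": 2020, "count": 9}}

item_f1 = f1(as_items(pred), as_items(gold))
row_f1 = f1(as_rows(pred), as_rows(gold))

print(f"item-level F1: {item_f1:.2f}")
print(f"row-level F1:  {row_f1:.2f}")
```

With these toy tables, three of four cells match but only one of two rows is fully correct, so item-level F1 is 0.75 and row-level F1 is 0.50. Under this reading a single wrong cell costs a whole row, which is one way row-level scores could sit well below item-level scores.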

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future agents could incorporate an explicit state-tracking module that logs active filters and scopes separately from general planning steps.
The benchmark tasks could serve as training data for fine-tuning models to recognize and correct retrieval-state mismatches on the fly.
Similar state-establishment problems are likely to appear in non-web settings such as database query interfaces or multi-parameter API calls that require sequential configuration.
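The first extension above, a state-tracking module kept separate from planning, could look roughly like the following. This is a minimal sketch under stated assumptions: `RetrievalStateTracker` and its fields are hypothetical names, the required state is assumed to be known up front (as in a constraint-guided task), and nothing here reflects how any evaluated agent actually works.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalStateTracker:
    """Hypothetical log of active filters and scopes, kept apart from planning."""

    required: dict                      # controls the task calls for
    active: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def record(self, control: str, value) -> None:
        """Note that the agent set a control on the site."""
        self.active[control] = value
        self.log.append((control, value))

    def mismatches(self) -> dict:
        """Map each control that is off target to (wanted, currently active)."""
        return {
            control: (wanted, self.active.get(control))
            for control, wanted in self.required.items()
            if self.active.get(control) != wanted
        }

    def ready(self) -> bool:
        """True only when every required control holds its required value."""
        return not self.mismatches()


tracker = RetrievalStateTracker(required={"region": "North", "year": 2020})
tracker.record("region", "North")
tracker.record("year", 2023)          # scope has drifted to the wrong year

drift = tracker.mismatches()
ready_before_fix = tracker.ready()

tracker.record("year", 2020)          # correct the state before extracting rows
ready_after_fix = tracker.ready()

print(drift, ready_before_fix, ready_after_fix)
```

Checking `ready()` before reading any rows gives the agent a cheap point at which to notice drift. For goal-oriented tasks the required state would have to be inferred first, which this sketch does not attempt.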

Load-bearing premise

The 100 expert-curated tasks and the item-level plus row-level F1 metrics are representative of the core difficulties of state-gated retrieval on real specialized websites.

What would settle it

A new agent system that achieves above 85 percent row-level F1 across the full SGR-Bench suite or on a fresh set of equivalent tasks drawn from the same website families would indicate that state configuration is not the primary remaining bottleneck.

Figures

Figures reproduced from arXiv: 2605.22219 by Haiyang Shen, Mugeng Liu, Ningyuan Li, Sixiong Xie, Yudong Han, Yun Ma, Zhuofan Shi.

**Figure 1.** Overview of the SGR-BENCH four-stage data curation pipeline. Candidate websites are drawn from Wikipedia external links, prioritized with an LLM, and retained after dual review. Task candidates are drafted from site structure and retrieval controls under a six-requirement design protocol, then filtered through preliminary screening and three rounds of expert validation for answer identifiability, state-gat…

**Figure 2.** (a) Item-F1 vs. Row-F1 for all systems. All points sit above the diagonal: agents recover …

**Figure 3.** Item-F1 (%) by model and source family on the 100-task benchmark. Finding 3: Hard sites require keeping several web controls aligned …

**Figure 4.** Share of each error type within each model's failures. Models are sorted by retrieval-scope …

**Figure 5.** Error type distribution per source family. Scholarly and environmental tasks are drift …

**Figure 6.** Item-F1 and Row-F1 distributions grouped by error type. Criterion mismatch shows the …

**Figure 7.** Mean Item-F1 and Row-F1 by output cardinality bin across all models. Tasks requiring …

Original abstract

Recent advances in large language models and tool-using agents have expanded the range of benchmarked web tasks. Yet an important class of specialized retrieval tasks remains undercharacterized. On many specialized data-retrieval websites, answer-bearing evidence becomes accessible only after establishing the correct site-specific retrieval state through filters, views, hierarchies, or scopes. We term this capability state-gated retrieval (SGR). We introduce SGR-Bench, a benchmark for this setting containing 100 expert-curated tasks spanning six source families and 12 public data ecosystems. Each task requires discovering the appropriate website and configuring its site-specific retrieval state to produce a structured answer. SGR-Bench pairs constraint-guided and goal-oriented formulations of the same underlying problems, enabling controlled comparisons between explicit and implicit guidance for state-gated retrieval. We evaluate eight CLI-based agentic LLM systems and three commercial search-agent products. On SGR-Bench, the strongest system reaches only 66.18% item-level F1, while row-level F1 remains much lower. A manual audit of 156 analyzable failed CLI trajectories shows why: agents often reach a relevant web source, but establish the wrong site-specific retrieval state. Retrieval-scope drift (37.2%) and criterion mismatch (27.6%) dominate, whereas final answer composition accounts for only 10.3%. The dataset and single-case evaluation instructions are available at https://huggingface.co/datasets/PKUAIWeb/SGR-BENCH.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SGR-Bench, a benchmark for state-gated retrieval (SGR) on specialized websites where evidence is accessible only after configuring site-specific states via filters, views, hierarchies or scopes. It contains 100 expert-curated tasks across six source families and 12 ecosystems, with paired constraint-guided and goal-oriented formulations. Eight CLI-based LLM agents and three commercial products are evaluated; the strongest reaches 66.18% item-level F1 while row-level F1 is substantially lower. A manual audit of 156 failed trajectories attributes most errors to retrieval-scope drift (37.2%) and criterion mismatch (27.6%), with answer composition at only 10.3%. The dataset is released publicly.

Significance. If the tasks and metrics validly isolate state configuration, the work identifies a concrete, previously under-characterized limitation in web agents and supplies reproducible evidence plus failure-mode statistics that can guide targeted improvements. Public dataset release is a clear strength for the field.

major comments (2)

[Task Curation and Evaluation Setup] The central empirical claim—that current agents struggle with state-gated retrieval rather than benchmark-specific artifacts—depends on the 100 tasks adequately sampling state mechanisms across the six families and 12 ecosystems and on the F1 metrics isolating state-setting success. The manuscript provides no coverage statistics, selection criteria, or validation that answers are inaccessible under incorrect states (see abstract and §3).
[Results] Table or figure reporting per-ecosystem or per-state-type performance is absent; without it, the aggregate 66.18% item-level F1 cannot be assessed for whether failures concentrate on particular state types or are uniformly distributed.

minor comments (2)

[Evaluation Metrics] Clarify the exact definition of 'item-level F1' versus 'row-level F1' with a short example in the metrics subsection; the distinction is central to interpreting the gap between the two scores.
[Failure Analysis] The audit sample of 156 trajectories should state the total number of failures and the selection procedure to allow readers to judge representativeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify how to strengthen the presentation of SGR-Bench. We respond to each major comment below and commit to the indicated revisions.

Referee: [Task Curation and Evaluation Setup] The central empirical claim—that current agents struggle with state-gated retrieval rather than benchmark-specific artifacts—depends on the 100 tasks adequately sampling state mechanisms across the six families and 12 ecosystems and on the F1 metrics isolating state-setting success. The manuscript provides no coverage statistics, selection criteria, or validation that answers are inaccessible under incorrect states (see abstract and §3).

Authors: We agree that explicit documentation of curation details would improve transparency and help readers assess sampling adequacy. In the revised manuscript we will expand §3 with (i) a table or paragraph reporting the distribution of the 100 tasks across the six source families and 12 ecosystems and (ii) a concise description of the expert curation protocol and selection criteria. We will also add a short validation statement confirming that, for each task, the expert curators verified that the target evidence is inaccessible under incorrect state configurations; this directly supports the claim that the reported F1 scores isolate state-setting success rather than benchmark artifacts.

Revision: yes

Referee: [Results] Table or figure reporting per-ecosystem or per-state-type performance is absent; without it, the aggregate 66.18% item-level F1 cannot be assessed for whether failures concentrate on particular state types or are uniformly distributed.

Authors: We acknowledge that aggregate metrics alone limit interpretability. We will add a new table (or supplementary figure) in the results section that reports item-level F1 broken down by ecosystem and by state mechanism type (filters, views, hierarchies, scopes). This disaggregation will allow readers to determine whether the observed performance is uniform or driven by particular state types.

Revision: yes

Circularity Check

0 steps flagged

Empirical benchmark paper with no derivations or fitted predictions

This is a benchmark introduction and evaluation paper. It defines state-gated retrieval, curates 100 tasks across source families, evaluates agent systems on item/row-level F1, and audits failure modes. No equations, no parameter fitting, no predictions derived from inputs, and no self-citation chains that support the central claims. Performance numbers are measured outcomes on held-out tasks rather than being defined in terms of the evaluation itself, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

No mathematical derivations or fitted constants appear. The central contribution rests on the assumption that the curated tasks capture a meaningful and previously unbenchmarked capability.

axioms (1)

Domain assumption: Expert-curated tasks across six source families adequately represent state-gated retrieval in public data ecosystems.
Source: stated in the abstract's description of benchmark construction.

invented entities (1)

state-gated retrieval (SGR) · no independent evidence
Purpose: to name and isolate the capability of configuring site-specific retrieval states before evidence becomes accessible.
Introduced as a new term for an undercharacterized class of tasks.

pith-pipeline@v0.9.0 · 5811 in / 1295 out tokens · 32941 ms · 2026-05-22T05:38:29.351080+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We term this capability state-gated retrieval (SGR). We introduce SGR-Bench, a benchmark for this setting containing 100 expert-curated tasks..."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Retrieval-scope drift (37.2%) and criterion mismatch (27.6%) dominate"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 1 internal anchor

[1]

Claude Code Overview

Anthropic. Claude Code Overview. https://code.claude.com/docs/en/overview, 2026. Accessed: 2026-05-06

work page 2026
[2]

System Card: Claude Opus 4.7

Anthropic. System Card: Claude Opus 4.7. https://www.anthropic.com/ claude-opus-4-7-system-card, apr 2026. Accessed: 2026-05-06

work page 2026
[3]

Workarena++: Towards compositional planning and reasoning-based common knowledge work tasks.Advances in Neural Information Processing Systems, 37:5996–6051, 2024

Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier de Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, and Alexandre Drouin. Workarena++: Towards compositional planning and reasoning-based common knowledge work tasks.Advances in Neural Information Processing Systems, 37:5996–6051, 2024

work page 2024
[4]

Seed2.0 Model Card: Towards Intelligence Frontier for Real- World Complexity

ByteDance Seed. Seed2.0 Model Card: Towards Intelligence Frontier for Real- World Complexity. https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model%20Card.pdf, feb 2026. Accessed: 2026-05-06

work page 2026
[5]

Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste

Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. The browsergym ecosyste...

work page 2024
[6]

DeepSeek V4 Technical Documentation

DeepSeek-AI. DeepSeek V4 Technical Documentation. https://fe-static.deepseek.com/ chat/transparency/deepseek-V4-model-card-EN.pdf, apr 2026. Accessed: 2026-05-06

work page 2026
[7]

Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

work page 2023
[8]

Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks? 2024

work page 2024
[9]

Deepresearch bench: A comprehensive benchmark for deep research agents, 2025

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents, 2025

work page 2025
[10]

REAL: Benchmarking autonomous agents on deterministic simulations of real websites

Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, and Sumeet Motwani. REAL: Benchmarking autonomous agents on deterministic simulations of real w...

work page 2025
[11]

Glm-5: from vibe coding to agentic engineering, 2026

GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunx- iang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhon...

work page 2026
[12]

Expanding AI Overviews and Introducing AI Mode

Google. Expanding AI Overviews and Introducing AI Mode. https://blog.google/ products-and-platforms/products/search/ai-mode-search/ , mar 2025. Accessed: 2026-05-06

work page 2025
[13]

Gemini Deep Research

Google. Gemini Deep Research. https://gemini.google/overview/deep-research/?hl= en-US, 2026. Accessed: 2026-05-06

work page 2026
[14]

Model Evaluation – Approach, Methodology & Results: Gemini 3.1 Pro

Google DeepMind. Model Evaluation – Approach, Methodology & Results: Gemini 3.1 Pro. https://storage.googleapis.com/deepmind-media/gemini/gemini_3-1_pro_ model_evaluation.pdf, feb 2026. Accessed: 2026-05-06

work page 2026
[15]

DeepSearchQA: Bridging the comprehensiveness gap for deep research agents.arXiv preprint arXiv:2601.20975, 2026

Nikita Gupta, Riju Chatterjee, Lukas Haas, Connie Tao, Andrew Wang, Chang Liu, Hidekazu Oiwa, E. Gribovskaya, Jan Ackermann, John Blitzer, S. Goldshtein, and Dipanjan Das. Deepsearchqa: Bridging the comprehensiveness gap for deep research agents.arXiv preprint arXiv:2601.20975, 2026

work page arXiv 2026
[16]

Webvoyager: Building an end-to-end web agent with large multimodal models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864–6890, Bangkok, Thailand, 2024. Association for Computa...

work page 2024
[17]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017

work page 2017
[18]

A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

Maurice G Kendall. A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

work page 1938
[19]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

work page 2024
[20]

Kumar, E

P. Kumar, E. Lau, S. Vijayakumar, T. Trinh, E. T. Chang, V . Robinson, S. Zhou, and M. Fredrik- son. Aligned LLMs are not aligned browser agents. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[21]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research.Transact...

work page 2019
[22]

Deepwidesearch: Benchmarking depth and width in agentic information seeking.arXiv preprint arXiv:2510.20168, 2025

Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, and Kaifu Zhang. Deepwidesearch: Benchmarking depth and width in agentic information seeking.arXiv preprint arXiv:2510.20168, 2025

work page arXiv 2025
[23]

Deepresearch bench ii: Diagnosing deep research agents via rubrics from expert report, 2026

Ruizhe Li, Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench ii: Diagnosing deep research agents via rubrics from expert report, 2026. 11

work page 2026
[24]

Gaia: a benchmark for general ai assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[25]

Webgpt: Browser-assisted question-answering with human feedback, 2022

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022

work page 2022
[26]

Deep Research System Card

OpenAI. Deep Research System Card. https://cdn.openai.com/ deep-research-system-card.pdf, feb 2025. Accessed: 2026-05-06

work page 2025
[27]

CLI – Codex

OpenAI. CLI – Codex. https://developers.openai.com/codex/cli, 2026. Accessed: 2026-05-06

work page 2026
[28]

OpenRouter Models

OpenRouter. OpenRouter Models. https://openrouter.ai/docs/guides/overview/ models, 2026. Accessed: 2026-05-06

work page 2026
[29]

Kilt: a benchmark for knowledge intensive language tasks

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. Kilt: a benchmark for knowledge intensive language tasks. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: H...

work page 2021
[30]

Weblinx: Real-world website navigation with multi-turn dialogue

Siva Reddy, Xing Lu, and Zden ˇek Kasner. Weblinx: Real-world website navigation with multi-turn dialogue. InInstitute of Formal and Applied Linguistics (ÚFAL), 2024

work page 2024
[31]

Toolformer: Language models can teach themselves to use tools, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettle- moyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023

work page 2023
[32]

World of bits: An open-domain platform for web-based agents

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. InInternational Conference on Machine Learning, pages 3135–3144. PMLR, 2017

work page 2017
[33]

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

work page 2026
[34]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y . Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che...

work page 2026
[35]

Safearena: Evaluating the safety of autonomous web agents

Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejandra Zambrano, Arkil Patel, Esin DUR- MUS, Spandana Gella, Karolina Stanczak, and Siva Reddy. Safearena: Evaluating the safety of autonomous web agents. InForty-second International Conference on Machine Learning, 2025

work page 2025
[36]

Wei, Jason Wei, Chris Tar, Yun- Hsuan Sung, Denny Zhou, Quoc V

Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry W. Wei, Jason Wei, Chris Tar, Yun- Hsuan Sung, Denny Zhou, Quoc V . Le, and Thang Luong. Freshllms: Refreshing large language models with search engine augmentation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13697–13720, 2024

work page 2024
[37]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Widesearch: Benchmarking agentic broad info-seeking.arXiv preprint arXiv:2508.07999, 2025

Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, and Ke Wang. Widesearch: Benchmarking agentic broad info-seeking.arXiv preprint arXiv:2508.07999, 2025

work page arXiv 2025
[39]

Webwalker: Benchmarking llms in web 14 traversal

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. Webwalker: Benchmarking llms in web 14 traversal. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10290–10305, 2025

work page 2025
[40]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

work page 2025
[41]

Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

work page 2022
[42]

React: Synergizing reasoning and acting in language models, 2023

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023

work page 2023
[43]

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8938–8968, 2024

work page 2024
[44]

Draco: a cross-domain benchmark for deep research accuracy, completeness, and objectivity, 2026

Joey Zhong, Hao Zhang, Clare Southern, Jeremy Yang, Thomas Wang, Kate Jung, Shu Zhang, Denis Yarats, Johnny Ho, and Jerry Ma. Draco: a cross-domain benchmark for deep research accuracy, completeness, and objectivity, 2026

work page 2026
[45]

Boulenger 1890 India snakes

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. InThe Twelfth International Conference on Learning Representations, 2023. A Metric Definitions All metrics are computed after rev...

work page arXiv 2023

[1] [1]

Claude Code Overview

Anthropic. Claude Code Overview. https://code.claude.com/docs/en/overview, 2026. Accessed: 2026-05-06

work page 2026

[2] [2]

System Card: Claude Opus 4.7

Anthropic. System Card: Claude Opus 4.7. https://www.anthropic.com/ claude-opus-4-7-system-card, apr 2026. Accessed: 2026-05-06

work page 2026

[3] [3]

Workarena++: Towards compositional planning and reasoning-based common knowledge work tasks.Advances in Neural Information Processing Systems, 37:5996–6051, 2024

Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier de Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, and Alexandre Drouin. Workarena++: Towards compositional planning and reasoning-based common knowledge work tasks.Advances in Neural Information Processing Systems, 37:5996–6051, 2024

work page 2024

[4] [4]

Seed2.0 Model Card: Towards Intelligence Frontier for Real- World Complexity

ByteDance Seed. Seed2.0 Model Card: Towards Intelligence Frontier for Real- World Complexity. https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model%20Card.pdf, feb 2026. Accessed: 2026-05-06

work page 2026

[5] [5]

Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste

Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. The browsergym ecosyste...

work page 2024

[6] [6]

DeepSeek V4 Technical Documentation

DeepSeek-AI. DeepSeek V4 Technical Documentation. https://fe-static.deepseek.com/ chat/transparency/deepseek-V4-model-card-EN.pdf, apr 2026. Accessed: 2026-05-06

work page 2026

[7] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023

[8] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks? 2024

[9] Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents, 2025

[10] REAL: Benchmarking autonomous agents on deterministic simulations of real websites
Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, and Sumeet Motwani. REAL: Benchmarking autonomous agents on deterministic simulations of real w...

[11] Glm-5: from vibe coding to agentic engineering, 2026
GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhon...

[12] Google. Expanding AI Overviews and Introducing AI Mode. https://blog.google/products-and-platforms/products/search/ai-mode-search/, Mar. 2025. Accessed: 2026-05-06

[13] Google. Gemini Deep Research. https://gemini.google/overview/deep-research/?hl=en-US, 2026. Accessed: 2026-05-06

[14] Google DeepMind. Model Evaluation – Approach, Methodology & Results: Gemini 3.1 Pro. https://storage.googleapis.com/deepmind-media/gemini/gemini_3-1_pro_model_evaluation.pdf, Feb. 2026. Accessed: 2026-05-06

[15] Nikita Gupta, Riju Chatterjee, Lukas Haas, Connie Tao, Andrew Wang, Chang Liu, Hidekazu Oiwa, E. Gribovskaya, Jan Ackermann, John Blitzer, S. Goldshtein, and Dipanjan Das. Deepsearchqa: Bridging the comprehensiveness gap for deep research agents. arXiv preprint arXiv:2601.20975, 2026

[16] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864–6890, Bangkok, Thailand, 2024. Association for Computa...

[17] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017

[18] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1-2):81–93, 1938

[19] Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

[20] P. Kumar, E. Lau, S. Vijayakumar, T. Trinh, E. T. Chang, V. Robinson, S. Zhou, and M. Fredrikson. Aligned LLMs are not aligned browser agents. In The Thirteenth International Conference on Learning Representations, 2025

[21] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transact...

[22] Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, and Kaifu Zhang. Deepwidesearch: Benchmarking depth and width in agentic information seeking. arXiv preprint arXiv:2510.20168, 2025

[23] Ruizhe Li, Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench ii: Diagnosing deep research agents via rubrics from expert report, 2026

[24] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2024

[25] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022

[26] OpenAI. Deep Research System Card. https://cdn.openai.com/deep-research-system-card.pdf, Feb. 2025. Accessed: 2026-05-06

[27] OpenAI. CLI – Codex. https://developers.openai.com/codex/cli, 2026. Accessed: 2026-05-06

[28] OpenRouter. OpenRouter Models. https://openrouter.ai/docs/guides/overview/models, 2026. Accessed: 2026-05-06

[29] Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. Kilt: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: H...

[30] Siva Reddy, Xing Lu, and Zdeněk Kasner. Weblinx: Real-world website navigation with multi-turn dialogue. In Institute of Formal and Applied Linguistics (ÚFAL), 2024

[31] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools, 2023

[32] Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017

[33] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Ale...

[34] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Che...

[35] Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejandra Zambrano, Arkil Patel, Esin DURMUS, Spandana Gella, Karolina Stanczak, and Siva Reddy. Safearena: Evaluating the safety of autonomous web agents. In Forty-second International Conference on Machine Learning, 2025

[36] Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry W. Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc V. Le, and Thang Luong. Freshllms: Refreshing large language models with search engine augmentation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13697–13720, 2024

[37] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025

[38] Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, and Ke Wang. Widesearch: Benchmarking agentic broad info-seeking. arXiv preprint arXiv:2508.07999, 2025

[39] Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, and Fei Huang. Webwalker: Benchmarking llms in web traversal. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10290–10305, 2025

[40] Qwen3 technical report, 2025
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

[41] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

[42] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023

[43] Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, and Jonathan Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8938–8968, 2024

[44] Joey Zhong, Hao Zhang, Clare Southern, Jeremy Yang, Thomas Wang, Kate Jung, Shu Zhang, Denis Yarats, Johnny Ho, and Jerry Ma. Draco: a cross-domain benchmark for deep research accuracy, completeness, and objectivity, 2026

[45] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2023.

A Metric Definitions

All metrics are computed after rev...