Recognition: no theorem link
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
Pith reviewed 2026-05-12 05:02 UTC · model grok-4.3
The pith
An evidence reporting layer added to existing agent benchmarks produces score bounds that quantify uncertainty from unverifiable outcomes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that adding an outcome evidence reporting layer to existing interactive agent benchmarks, without modifying tasks, agents, or evaluators, yields evidence-supported score bounds that explicitly quantify the uncertainty arising from runs whose stored artifacts do not confirm the claimed outcome.
What carries the argument
The outcome evidence reporting layer. It pre-specifies the stored artifacts required to verify each claimed outcome, applies a locked checklist that assigns an Evidence Pass, Evidence Fail, or Unknown label to each run, and derives score bounds from the resulting labels.
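The bound arithmetic this implies can be sketched in a few lines. This is a minimal sketch, not the paper's code: it assumes the common partial-identification convention that Unknown runs count as failures for the lower bound and as successes for the upper bound; the paper's exact convention may differ.

```python
from collections import Counter

def evidence_bounds(labels):
    """Evidence-supported score bounds from per-run labels.

    labels: list of "pass", "fail", or "unknown" (one per run).
    Lower bound: every Unknown run scored as a failure.
    Upper bound: every Unknown run scored as a success.
    """
    counts = Counter(labels)  # missing keys count as zero
    n = len(labels)
    lower = counts["pass"] / n
    upper = (counts["pass"] + counts["unknown"]) / n
    return lower, upper

# 6 verified successes, 2 verified failures, 2 unverifiable runs:
labels = ["pass"] * 6 + ["fail"] * 2 + ["unknown"] * 2
print(evidence_bounds(labels))  # (0.6, 0.8)
```

Under this convention the width of the interval is exactly the Unknown rate, which is what keeps unverifiable cases visible instead of folding them into a single aggregate score.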
If this is right
- Reported success rates can be presented as ranges that make the contribution of unverifiable cases explicit.
- Different types of outcome verification failures become distinguishable through the three evidence labels.
- The same layer can be applied to any existing benchmark without redesigning its tasks or evaluation code.
- Unknown cases remain visible rather than being discarded or silently counted as successes.
Where Pith is reading between the lines
- Agent developers may need to improve environment logging so that more outcomes become verifiable under the checklist.
- Benchmark maintainers could use the layer to identify which tasks have the weakest outcome checks and prioritize fixes.
- If bounds remain wide on many benchmarks, direct comparisons of agent performance across different suites may require caution.
Load-bearing premise
The stored artifacts from benchmark runs are sufficient for a locked checklist to assign reliable Evidence Pass, Fail, or Unknown labels without introducing new biases or requiring changes to the original tasks.
What would settle it
Applying the evidence layer to one of the tested benchmarks and finding that the checklist cannot be completed consistently or that nearly all runs receive Unknown labels would show that the reported bounds do not meaningfully reduce uncertainty.
Figures
read the original abstract
Interactive agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface level signals or fail to capture the agent's actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice's shipping address was changed, while the outcome check only verifies that the agent clicked "Save." This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design, but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions. First, before scoring, it specifies which stored artifacts are required to verify the claimed outcome for each case. Second, it applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. Third, it reports evidence supported score bounds that quantify uncertainty arising from Unknown cases. Rather than silently counting, discarding, or hiding uncertain cases inside a single aggregate success rate, the framework keeps them explicitly visible. We evaluate the outcome evidence layer on five public benchmarks: ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3 bench retail, and MINIWOB. The resulting reports separate several empirically distinct failure modes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that interactive agent benchmarks often produce misleading scores because outcome checks rely on surface-level signals (e.g., button clicks) that fail to confirm the intended state change. It introduces a non-intrusive 'outcome evidence reporting layer' that (1) pre-specifies the minimal stored artifacts needed to verify each claimed outcome, (2) applies a locked checklist to label each run Evidence Pass / Evidence Fail / Unknown, and (3) reports score bounds that explicitly quantify uncertainty from Unknown cases. The layer is evaluated on five public benchmarks (ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3-bench retail, MINIWOB) and is said to separate empirically distinct failure modes.
Significance. If the layer can be realized with a priori, locked, domain-independent checklists that reliably assign the three evidence labels, the work would materially improve the trustworthiness of agent evaluation by replacing opaque aggregate success rates with explicit evidence-supported bounds. The non-modifying character of the proposal and its application to multiple existing benchmarks are practical strengths that could encourage adoption.
major comments (2)
- [Abstract] Abstract: the central claim that the reported bounds are 'evidence-supported' rather than 'checklist-supported' rests on the ability to specify, before any runs, a minimal set of stored artifacts sufficient to verify the outcome and then apply a fixed checklist that assigns Evidence Pass/Fail/Unknown without introducing new bias. The manuscript provides no concrete examples of such artifact lists or checklist items for any of the five benchmarks, leaving open the possibility that checklist granularity can be tuned to control the Unknown rate (e.g., for state-change verification in APPWORLD or ANDROIDWORLD).
- [Abstract] Abstract (evaluation paragraph): the statement that 'the resulting reports separate several empirically distinct failure modes' is load-bearing for the practical utility claim, yet the abstract supplies neither the quantitative bounds, the per-benchmark Unknown rates, nor any table or figure that would allow a reader to verify the separation. Without these data the empirical contribution cannot be assessed.
minor comments (1)
- [Abstract] The benchmark name 'tau3 bench retail' should be written consistently (e.g., 'tau3-bench retail') throughout.
Simulated Author's Rebuttal
Thank you for your constructive review and recommendation of major revision. We address each major comment below and will revise the manuscript to incorporate the requested details and clarifications.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the reported bounds are 'evidence-supported' rather than 'checklist-supported' rests on the ability to specify, before any runs, a minimal set of stored artifacts sufficient to verify the outcome and then apply a fixed checklist that assigns Evidence Pass/Fail/Unknown without introducing new bias. The manuscript provides no concrete examples of such artifact lists or checklist items for any of the five benchmarks, leaving open the possibility that checklist granularity can be tuned to control the Unknown rate (e.g., for state-change verification in APPWORLD or ANDROIDWORLD).
Authors: We agree that concrete examples are necessary to demonstrate that the bounds are evidence-supported. The layer requires artifact lists to be specified a priori from each benchmark's task definitions and available stored data (e.g., database states, logs, or file outputs), with checklists locked before runs begin. In the revision we will add an appendix with explicit minimal artifact lists and locked checklist items for all five benchmarks. For APPWORLD, this includes pre- and post-task database snapshots plus transaction logs, with checklist items that verify exact field changes (e.g., shipping address update on the correct record) rather than UI clicks. Similar minimal, non-tunable specifications will be provided for ANDROIDWORLD, AGENTDOJO, tau3-bench retail, and MINIWOB. Because the artifacts are the minimal set required for verification and the checklists are fixed, granularity cannot be adjusted post-hoc to control Unknown rates; any Unknown label arises only when even the minimal artifacts are absent or inconclusive. revision: yes
-
Referee: [Abstract] Abstract (evaluation paragraph): the statement that 'the resulting reports separate several empirically distinct failure modes' is load-bearing for the practical utility claim, yet the abstract supplies neither the quantitative bounds, the per-benchmark Unknown rates, nor any table or figure that would allow a reader to verify the separation. Without these data the empirical contribution cannot be assessed.
Authors: We acknowledge that the abstract's evaluation paragraph is insufficiently self-contained. In the revised manuscript we will expand this paragraph to report the per-benchmark evidence-supported score bounds (lower and upper) and the breakdown of Evidence Pass / Evidence Fail / Unknown rates for the five benchmarks. We will also add a reference to a summary table (to be included or highlighted in the main text) that shows how the three-label reports separate distinct failure modes, such as surface-signal successes that receive Evidence Fail versus runs with complete artifact confirmation. These quantitative results are already computed in our evaluation (Section 4) and will be extracted into the abstract to allow immediate verification of the separation claim. revision: yes
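The first response's APPWORLD example, verifying an exact field change on the correct record rather than a UI click, can be sketched as one hypothetical checklist item. The snapshot format, field names, and record keys below are illustrative assumptions, not the paper's actual schema.

```python
def check_address_update(pre, post, user_id, expected_address):
    """One locked checklist item: return 'pass', 'fail', or 'unknown'.

    pre/post: dicts mapping record id -> record dict, snapshotted
    before and after the run (hypothetical artifact format).
    """
    if pre is None or post is None:
        return "unknown"  # required artifact missing: cannot decide
    before, after = pre.get(user_id), post.get(user_id)
    if before is None or after is None:
        return "unknown"  # intended record absent from a snapshot
    # The intended record must carry the new address, and no other
    # record may have been modified (wrong-record edits are failures).
    others_untouched = all(
        post.get(uid) == rec for uid, rec in pre.items() if uid != user_id
    )
    if after.get("shipping_address") == expected_address and others_untouched:
        return "pass"
    return "fail"

pre = {"alice": {"shipping_address": "1 Old Rd"},
       "bob": {"shipping_address": "2 Elm St"}}
post = {"alice": {"shipping_address": "9 New Ave"},
        "bob": {"shipping_address": "2 Elm St"}}
print(check_address_update(pre, post, "alice", "9 New Ave"))  # pass
```

Note how the three outcomes map onto the labels: a confirmed state change passes, a click without the intended change (or a change to the wrong record) fails, and a missing snapshot yields Unknown rather than a silent success.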
Circularity Check
No circularity: methodological reporting layer with independent checklist and bounds
full rationale
The paper introduces a post-hoc evidence layer that (1) pre-specifies required artifacts per task, (2) applies a fixed checklist to produce Evidence Pass/Fail/Unknown labels, and (3) computes explicit score bounds from the Unknown count. No equations, fitted parameters, or predictions appear; the bounds are a direct arithmetic consequence of the three-way labeling rather than a reduction to prior results. No self-citations are invoked to justify uniqueness or ansatz choices, and evaluation uses public benchmarks without internal fitting. The proposal is therefore self-contained as a reporting convention.
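The "direct arithmetic consequence" can be stated in partial-identification form (cf. Manski, reference [14] below), assuming Unknown runs are scored as failures at the lower bound and as successes at the upper; with $n_{\mathrm{pass}}$, $n_{\mathrm{fail}}$, $n_{\mathrm{unknown}}$ labels out of $N$ runs:

```latex
\underline{s} \;=\; \frac{n_{\mathrm{pass}}}{N},
\qquad
\overline{s} \;=\; \frac{n_{\mathrm{pass}} + n_{\mathrm{unknown}}}{N},
\qquad
\overline{s} - \underline{s} \;=\; \frac{n_{\mathrm{unknown}}}{N}.
```

Since no quantity here is fitted or predicted, the bounds cannot feed back into the labeling, which is the substance of the no-circularity verdict.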
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Outcome checks in benchmarks can be augmented with evidence requirements without altering the original tasks.
invented entities (2)
-
Outcome evidence reporting layer
no independent evidence
-
Evidence Pass / Fail / Unknown labels
no independent evidence
Reference graph
Works this paper leans on
-
[1]
The BrowserGym ecosystem for web agent research. arXiv preprint arXiv:2412.05467
Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. The browsergym ecosyste...
-
[2]
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents, 2024. URL https://arxiv.org/abs/2406.13352
work page internal anchor Pith review arXiv 2024
-
[3]
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023
work page 2023
-
[4]
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks?, 2024. URL https://arxiv.org/abs/2403.07718
work page internal anchor Pith review arXiv 2024
-
[5]
Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021. doi: 10.1145/3458723. URL https://cacm.acm.org/research/datasheets-for-datasets/
-
[6]
Is your benchmark (still) useful? dynamic benchmarking for code language models, 2025
Batu Guan, Xiao Wu, Yuanyuan Yuan, and Shaohua Li. Is your benchmark (still) useful? Dynamic benchmarking for code language models, 2025. URL https://arxiv.org/abs/2503.06643
-
[7]
Webarena verified: Reliable evaluation for web agents
Amine El Hattami, Megh Thakkar, Nicolas Chapados, and Christopher Pal. WebArena verified: Reliable evaluation for web agents. SEA @ NeurIPS 2025 poster, 2025. URL https://openreview.net/forum?id=94tlGxmqkN. OpenReview non-archival submission
work page 2025
-
[8]
Dynabench: Rethinking benchmarking in NLP
Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 202...
work page 2021
-
[9]
VisualWebArena: Evaluating multimodal agents on realistic visual web tasks
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024
work page 2024
-
[10]
Reinforcement learning on web interfaces using workflow-guided exploration
Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryTp3f-0-
work page 2018
-
[11]
Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, et al. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1160–1183, 2025
work page 2025
-
[12]
Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J Pal, and Siva Reddy. AgentRewardBench: Evaluating automatic evaluations of web agent trajectories. arXiv preprint arXiv:2504.08942, 2025
-
[13]
SPHERE: An evaluation card for human-ai systems, 2025
Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, and Tongshuang Wu. SPHERE: An evaluation card for human-AI systems, 2025. URL https://arxiv.org/abs/2504.07971
-
[14]
Partial Identification of Probability Distributions
Charles F. Manski. Partial Identification of Probability Distributions. Springer, 2003
work page 2003
-
[15]
Model cards for model reporting
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229. ACM, 2019. doi: 10.1145/3287560.3287596. URL https://arxiv.org/abs/1810.03993
-
[16]
Introducing SWE-bench Verified
OpenAI. Introducing SWE-bench Verified. Blog post, 2024. URL https://openai.com/index/introducing-swe-bench-verified/. Accessed: 2026-05-01
work page 2024
-
[17]
Why SWE-bench verified no longer measures frontier coding capabilities
OpenAI. Why SWE-bench verified no longer measures frontier coding capabilities. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/, 2026. Published February 23, 2026
work page 2026
-
[18]
Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d'Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research: A report from the NeurIPS 2019 reproducibility program. Journal of Machine Learning Research, 22(164):1–20, 2021. URL https://www.jmlr.org/papers/v22/20-303.html
work page 2019
-
[19]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023. URL https://arxiv.org/abs/2307.16789
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[20]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents, 2025. URL https://arxiv.org/abs/2405.14573
work page internal anchor Pith review arXiv 2025
-
[21]
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in LLM-as-a-judge, 2024. URL https://arxiv.org/abs/2406.07791
-
[22]
τ3-Bench: Advancing Agent Benchmarking to Knowledge and Voice
Sierra Research. τ3-Bench: Advancing Agent Benchmarking to Knowledge and Voice. Research release, 2026. URL https://sierra.ai/resources/research/tau-3-bench. Accessed: 2026-04-30
work page 2026
-
[23]
AppWorld: A controllable world of apps and people for benchmarking interactive coding agents
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents, 2024. URL https://arxiv.org/abs/2407.18901
-
[24]
You Wang, Michael Pradel, and Zhongxin Liu. Are “solved issues” in SWE-bench really solved correctly? An empirical study. arXiv preprint arXiv:2503.15223, 2025
-
[25]
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URL https://arxiv.org/abs/2...
work page internal anchor Pith review arXiv 2024
-
[26]
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Frank F Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, et al. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks. arXiv preprint arXiv:2412.14161, 2024
work page internal anchor Pith review arXiv 2024
-
[27]
Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, and Ion Stoica. Rethinking benchmark and contamination for language models with rephrased samples, 2023. URL https://arxiv.org/abs/2311.04850
-
[28]
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022
work page 2022
-
[29]
τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023. URL https://arxiv.org/abs/2306.05685
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents, 2023. URL https://arxiv.org/abs/2307.13854
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
Establishing best practices for building rigorous agentic benchmarks
Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, and D...
-
[33]
URL https://arxiv.org/abs/2507.02825.
Appendix A, Unknown-Reason Taxonomy: each Unknown record receives one primary blocking reason, the earliest missing artifact that prevents deciding the benchmark's own claim. The labels answer a repair question, not a universal ontology. Table 4 summarizes the completed audits. ANDROIDWORLD has 41 Unknown records in the cost-limi...