pith. machine review for the scientific record.

arxiv: 2605.10448 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: no theorem link

Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

Authors on Pith no claims yet

Pith reviewed 2026-05-12 05:02 UTC · model grok-4.3

classification 💻 cs.AI
keywords agent benchmarks · outcome verification · evidence reporting · interactive agents · benchmark evaluation · score uncertainty · failure mode analysis

The pith

An evidence reporting layer added to existing agent benchmarks produces score bounds that quantify uncertainty from unverifiable outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Interactive agent benchmarks determine success through outcome checks that often rely on surface signals, such as verifying a button click rather than confirming the intended state change actually happened. This paper introduces an outcome evidence reporting layer that works on top of any existing benchmark without altering its tasks, agents, or evaluators. The layer first specifies the stored artifacts required to verify each claimed outcome, then applies a fixed checklist to label each run as Evidence Pass, Evidence Fail, or Unknown, and finally computes bounds on the overall score that reflect the uncertainty contributed by Unknown cases. The approach keeps uncertain runs visible instead of folding them into a single aggregate number. Evaluation on five public benchmarks shows that this separation reveals distinct failure modes that standard scores had concealed.
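A minimal sketch of the arithmetic this implies, assuming the bounds are the simple worst-case/best-case resolution of Unknown runs; the label names and the score_bounds helper are illustrative, not the paper's code or API:

```python
# Illustrative sketch, not the authors' implementation. The only ingredients
# taken from the paper are the three evidence labels and the idea that
# Unknown runs widen the reported score into an interval.
from enum import Enum


class EvidenceLabel(Enum):
    PASS = "evidence_pass"      # stored artifacts confirm the claimed outcome
    FAIL = "evidence_fail"      # stored artifacts contradict the claimed outcome
    UNKNOWN = "unknown"         # required artifacts are missing or inconclusive


def score_bounds(labels):
    """Worst-case / best-case success rate over one benchmark's runs."""
    n = len(labels)
    n_pass = sum(1 for label in labels if label is EvidenceLabel.PASS)
    n_unknown = sum(1 for label in labels if label is EvidenceLabel.UNKNOWN)
    lower = n_pass / n                 # every Unknown resolves to failure
    upper = (n_pass + n_unknown) / n   # every Unknown resolves to success
    return lower, upper


# Example: 70 verified successes, 10 verified failures, 20 unverifiable runs.
runs = ([EvidenceLabel.PASS] * 70 + [EvidenceLabel.FAIL] * 10
        + [EvidenceLabel.UNKNOWN] * 20)
print(score_bounds(runs))  # (0.7, 0.9)
```

On this hypothetical benchmark the report would be the interval [0.70, 0.90]; a conventional aggregate that silently counts the Unknown runs as successes would report 0.90.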

Core claim

The paper claims that adding an outcome evidence reporting layer to existing interactive agent benchmarks, without modifying tasks, agents, or evaluators, enables the production of evidence-supported score bounds that explicitly quantify uncertainty arising from cases where stored artifacts do not confirm the claimed outcome.

What carries the argument

The outcome evidence reporting layer, which pre-specifies required stored artifacts for outcome verification, applies a locked checklist to assign Evidence Pass/Fail/Unknown labels to each run, and derives score bounds from the resulting labels.

If this is right

  • Reported success rates can be presented as ranges that make the contribution of unverifiable cases explicit.
  • Different types of outcome verification failures become distinguishable through the three evidence labels.
  • The same layer can be applied to any existing benchmark without redesigning its tasks or evaluation code.
  • Unknown cases remain visible rather than being discarded or silently counted as successes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Agent developers may need to improve environment logging so that more outcomes become verifiable under the checklist.
  • Benchmark maintainers could use the layer to identify which tasks have the weakest outcome checks and prioritize fixes.
  • If bounds remain wide on many benchmarks, direct comparisons of agent performance across different suites may require caution.

Load-bearing premise

The stored artifacts from benchmark runs are sufficient for a locked checklist to assign reliable Evidence Pass, Fail, or Unknown labels without introducing new biases or requiring changes to the original tasks.

What would settle it

Applying the evidence layer to one of the tested benchmarks and finding that the checklist cannot be completed consistently or that nearly all runs receive Unknown labels would show that the reported bounds do not meaningfully reduce uncertainty.

Figures

Figures reproduced from arXiv: 2605.10448 by Liyi Zhou, Shanshan Gao.

Figure 1. Overview of the outcome-evidence gap and our reporting layer.
Figure 2. The target set is empty while retained non-target rows mention …
read the original abstract

Interactive agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface level signals or fail to capture the agent's actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice's shipping address was changed, while the outcome check only verifies that the agent clicked "Save." This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design, but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions. First, before scoring, it specifies which stored artifacts are required to verify the claimed outcome for each case. Second, it applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. Third, it reports evidence supported score bounds that quantify uncertainty arising from Unknown cases. Rather than silently counting, discarding, or hiding uncertain cases inside a single aggregate success rate, the framework keeps them explicitly visible. We evaluate the outcome evidence layer on five public benchmarks: ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3 bench retail, and MINIWOB. The resulting reports separate several empirically distinct failure modes.
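To make the abstract's Alice example concrete, a minimal sketch of how a surface-signal check and a state-level check can disagree on the same run; the trace format, database layout, and function names are hypothetical, not drawn from any of the five benchmarks:

```python
# Hypothetical illustration of the outcome-evidence gap described above.

def surface_check(trace) -> bool:
    """Outcome check that only looks for a surface signal (a 'Save' click)."""
    return any(step["action"] == "click" and step["target"] == "Save"
               for step in trace)


def state_check(db_before, db_after, expected_address) -> bool:
    """Outcome check that inspects the stored state the task actually targets."""
    before = db_before["users"]["alice"]["shipping_address"]
    after = db_after["users"]["alice"]["shipping_address"]
    return after == expected_address and before != expected_address


# The agent clicked "Save" but edited Bob's record instead of Alice's.
trace = [{"action": "click", "target": "Save"}]
db_before = {"users": {"alice": {"shipping_address": "12 Old Rd"},
                       "bob": {"shipping_address": "12 Old Rd"}}}
db_after = {"users": {"alice": {"shipping_address": "12 Old Rd"},
                      "bob": {"shipping_address": "9 New St"}}}

print(surface_check(trace))                          # True  -> counted as success
print(state_check(db_before, db_after, "9 New St"))  # False -> the evidence disagrees
```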

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that interactive agent benchmarks often produce misleading scores because outcome checks rely on surface-level signals (e.g., button clicks) that fail to confirm the intended state change. It introduces a non-intrusive 'outcome evidence reporting layer' that (1) pre-specifies the minimal stored artifacts needed to verify each claimed outcome, (2) applies a locked checklist to label each run Evidence Pass / Evidence Fail / Unknown, and (3) reports score bounds that explicitly quantify uncertainty from Unknown cases. The layer is evaluated on five public benchmarks (ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3-bench retail, MINIWOB) and is said to separate empirically distinct failure modes.

Significance. If the layer can be realized with a priori, locked, domain-independent checklists that reliably assign the three evidence labels, the work would materially improve the trustworthiness of agent evaluation by replacing opaque aggregate success rates with explicit evidence-supported bounds. The non-modifying character of the proposal and its application to multiple existing benchmarks are practical strengths that could encourage adoption.

major comments (2)
  1. [Abstract] Abstract: the central claim that the reported bounds are 'evidence-supported' rather than 'checklist-supported' rests on the ability to specify, before any runs, a minimal set of stored artifacts sufficient to verify the outcome and then apply a fixed checklist that assigns Evidence Pass/Fail/Unknown without introducing new bias. The manuscript provides no concrete examples of such artifact lists or checklist items for any of the five benchmarks, leaving open the possibility that checklist granularity can be tuned to control the Unknown rate (e.g., for state-change verification in APPWORLD or ANDROIDWORLD).
  2. [Abstract] Abstract (evaluation paragraph): the statement that 'the resulting reports separate several empirically distinct failure modes' is load-bearing for the practical utility claim, yet the abstract supplies neither the quantitative bounds, the per-benchmark Unknown rates, nor any table or figure that would allow a reader to verify the separation. Without these data the empirical contribution cannot be assessed.
minor comments (1)
  1. [Abstract] The benchmark name 'tau3 bench retail' should be written consistently (e.g., 'tau3-bench retail') throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your constructive review and recommendation of major revision. We address each major comment below and will revise the manuscript to incorporate the requested details and clarifications.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the reported bounds are 'evidence-supported' rather than 'checklist-supported' rests on the ability to specify, before any runs, a minimal set of stored artifacts sufficient to verify the outcome and then apply a fixed checklist that assigns Evidence Pass/Fail/Unknown without introducing new bias. The manuscript provides no concrete examples of such artifact lists or checklist items for any of the five benchmarks, leaving open the possibility that checklist granularity can be tuned to control the Unknown rate (e.g., for state-change verification in APPWORLD or ANDROIDWORLD).

    Authors: We agree that concrete examples are necessary to demonstrate that the bounds are evidence-supported. The layer requires artifact lists to be specified a priori from each benchmark's task definitions and available stored data (e.g., database states, logs, or file outputs), with checklists locked before runs begin. In the revision we will add an appendix with explicit minimal artifact lists and locked checklist items for all five benchmarks. For APPWORLD, this includes pre- and post-task database snapshots plus transaction logs, with checklist items that verify exact field changes (e.g., shipping address update on the correct record) rather than UI clicks. Similar minimal, non-tunable specifications will be provided for ANDROIDWORLD, AGENTDOJO, tau3-bench retail, and MINIWOB. Because the artifacts are the minimal set required for verification and the checklists are fixed, granularity cannot be adjusted post-hoc to control Unknown rates; any Unknown label arises only when even the minimal artifacts are absent or inconclusive. revision: yes

  2. Referee: [Abstract] Abstract (evaluation paragraph): the statement that 'the resulting reports separate several empirically distinct failure modes' is load-bearing for the practical utility claim, yet the abstract supplies neither the quantitative bounds, the per-benchmark Unknown rates, nor any table or figure that would allow a reader to verify the separation. Without these data the empirical contribution cannot be assessed.

    Authors: We acknowledge that the abstract's evaluation paragraph is insufficiently self-contained. In the revised manuscript we will expand this paragraph to report the per-benchmark evidence-supported score bounds (lower and upper) and the breakdown of Evidence Pass / Evidence Fail / Unknown rates for the five benchmarks. We will also add a reference to a summary table (to be included or highlighted in the main text) that shows how the three-label reports separate distinct failure modes, such as surface-signal successes that receive Evidence Fail versus runs with complete artifact confirmation. These quantitative results are already computed in our evaluation (Section 4) and will be extracted into the abstract to allow immediate verification of the separation claim. revision: yes
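As an illustration of the locked checklist items the responses above commit to, a minimal sketch of a single such item: it consults only pre-specified artifacts and degrades to Unknown when they are absent or inconclusive. The function, artifact keys, and example values are hypothetical, not the authors' checklist:

```python
# Hypothetical single checklist item of the kind the rebuttal describes:
# verify that one field of one record changed to the expected value, using
# only pre-specified stored artifacts (pre/post snapshots).

def check_field_change(artifacts, record_id, field, expected):
    """Return 'evidence_pass', 'evidence_fail', or 'unknown' for one run."""
    before = artifacts.get("db_before")
    after = artifacts.get("db_after")
    if before is None or after is None:
        return "unknown"        # a required artifact was never stored
    try:
        old_value = before[record_id][field]
        new_value = after[record_id][field]
    except KeyError:
        return "unknown"        # artifacts exist but do not cover this record
    if new_value == expected and old_value != expected:
        return "evidence_pass"
    return "evidence_fail"


artifacts = {"db_before": {"order_17": {"status": "pending"}},
             "db_after":  {"order_17": {"status": "shipped"}}}
print(check_field_change(artifacts, "order_17", "status", "shipped"))  # evidence_pass
print(check_field_change({}, "order_17", "status", "shipped"))         # unknown
```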

Circularity Check

0 steps flagged

No circularity: methodological reporting layer with independent checklist and bounds

full rationale

The paper introduces a post-hoc evidence layer that (1) pre-specifies required artifacts per task, (2) applies a fixed checklist to produce Evidence Pass/Fail/Unknown labels, and (3) computes explicit score bounds from the Unknown count. No equations, fitted parameters, or predictions appear; the bounds are a direct arithmetic consequence of the three-way labeling rather than a reduction to prior results. No self-citations are invoked to justify uniqueness or ansatz choices, and evaluation uses public benchmarks without internal fitting. The proposal is therefore self-contained as a reporting convention.
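One way to spell out that arithmetic, assuming the bounds take the standard partial-identification form (compare Manski, reference 14 in the graph below) in which each Unknown run may resolve to either success or failure; the paper's exact definition may differ:

```latex
% Assumed form of the evidence-supported bounds, not a formula quoted from the paper.
\[
  \frac{n_{\mathrm{pass}}}{n}
  \;\le\; \text{true success rate} \;\le\;
  \frac{n_{\mathrm{pass}} + n_{\mathrm{unknown}}}{n},
  \qquad
  n = n_{\mathrm{pass}} + n_{\mathrm{fail}} + n_{\mathrm{unknown}}.
\]
```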

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that evidence can be specified and checked reliably, and that Unknown cases can be bounded meaningfully. The framework has no free parameters and no mathematical derivations.

axioms (1)
  • domain assumption Outcome checks in benchmarks can be augmented with evidence requirements without altering the original tasks.
    The paper assumes this is feasible for the five benchmarks evaluated.
invented entities (2)
  • Outcome evidence reporting layer no independent evidence
    purpose: To add evidence checks and report bounds on scores
    New framework introduced in the paper.
  • Evidence Pass / Fail / Unknown labels no independent evidence
    purpose: To categorize runs based on verification artifacts
    Invented classification for uncertainty.

pith-pipeline@v0.9.0 · 5561 in / 1220 out tokens · 49307 ms · 2026-05-12T05:02:00.917514+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 9 internal anchors

  1. [1]

The BrowserGym ecosystem for web agent research. arXiv preprint arXiv:2412.05467

    Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. The browsergym ecosyste...

  2. [2]

    AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents, 2024. URL https://arxiv.org/abs/2406.13352

  3. [3]

Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  4. [4]

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks?, 2024. URL https://arxiv.org/abs/2403.07718

  5. [5]

Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021. doi: 10.1145/3458723. URL https://cacm.acm.org/research/datasheets-for-datasets/

  6. [6]

Is your benchmark (still) useful? Dynamic benchmarking for code language models, 2025

Batu Guan, Xiao Wu, Yuanyuan Yuan, and Shaohua Li. Is your benchmark (still) useful? Dynamic benchmarking for code language models, 2025. URL https://arxiv.org/abs/2503.06643

  7. [7]

    Webarena verified: Reliable evaluation for web agents

Amine El Hattami, Megh Thakkar, Nicolas Chapados, and Christopher Pal. Webarena verified: Reliable evaluation for web agents. SEA @ NeurIPS 2025 poster, 2025. URL https://openreview.net/forum?id=94tlGxmqkN. OpenReview non-archival submission

  8. [8]

    Dynabench: Rethinking benchmarking in NLP

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 202...

  9. [9]

    VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

  10. [10]

    Reinforcement learning on web interfaces using workflow-guided exploration

Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryTp3f-0-

  11. [11]

    ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, et al. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 1160–1183, 2025

  12. [12]

AgentRewardBench: Evaluating automatic evaluations of web agent trajectories. arXiv preprint arXiv:2504.08942, 2025

Xing Han Lù, Amirhossein Kazemnejad, Nicholas Meade, Arkil Patel, Dongchan Shin, Alejandra Zambrano, Karolina Stańczak, Peter Shaw, Christopher J Pal, and Siva Reddy. AgentRewardBench: Evaluating automatic evaluations of web agent trajectories. arXiv preprint arXiv:2504.08942, 2025

  13. [13]

    SPHERE: An evaluation card for human-ai systems, 2025

Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, and Tongshuang Wu. SPHERE: An evaluation card for human-ai systems, 2025. URL https://arxiv.org/abs/2504.07971

  14. [14]

Partial Identification of Probability Distributions

    Charles F. Manski.Partial Identification of Probability Distributions. Springer, 2003

  15. [15]

    Model cards for model reporting

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229. ACM, 2019. doi: 10.1145/3287560.3287596. URL https://arxiv.org/abs/1810.03993

  16. [16]

    Introducing SWE-bench Verified

OpenAI. Introducing SWE-bench Verified. Blog post, 2024. URL https://openai.com/index/introducing-swe-bench-verified/. Accessed: 2026-05-01

  17. [17]

    Why SWE-bench verified no longer measures frontier coding capabilities

OpenAI. Why SWE-bench verified no longer measures frontier coding capabilities. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/, 2026. Published February 23, 2026

  18. [18]

Improving reproducibility in machine learning research: A report from the NeurIPS 2019 reproducibility program. Journal of Machine Learning Research, 22(164):1–20, 2021

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché-Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research: A report from the NeurIPS 2019 reproducibility program. Journal of Machine Learning Research, 22(164):1–20, 2021. URL https://www.jmlr.org/papers/v22/20-303.html

  19. [19]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023. URL https://arxiv.org/abs/2307.16789

  20. [20]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents, 2025. URL https://arxiv.org/abs/2405.14573

  21. [21]

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge

Lin Shi, Chiyu Ma, Wenhua Liang, Xingjian Diao, Weicheng Ma, and Soroush Vosoughi. Judging the judges: A systematic study of position bias in LLM-as-a-judge, 2024. URL https://arxiv.org/abs/2406.07791

  22. [22]

τ3-Bench: Advancing Agent Benchmarking to Knowledge and Voice

Sierra Research. τ3-Bench: Advancing Agent Benchmarking to Knowledge and Voice. Research release, 2026. URL https://sierra.ai/resources/research/tau-3-bench. Accessed: 2026-04-30

  23. [23]

AppWorld: A controllable world of apps and people for benchmarking interactive coding agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents, 2024. URL https://arxiv.org/abs/2407.18901

  24. [24]

Are “solved issues” in SWE-bench really solved correctly? An empirical study

You Wang, Michael Pradel, and Zhongxin Liu. Are “solved issues” in SWE-bench really solved correctly? An empirical study. arXiv preprint arXiv:2503.15223, 2025

  25. [25]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024. URL https://arxiv.org/abs/2...

  26. [26]

    TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Frank F Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, et al. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks. arXiv preprint arXiv:2412.14161, 2024

  27. [27]

Rethinking benchmark and contamination for language models with rephrased samples

    Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, and Ion Stoica. Rethinking benchmark and contamination for language models with rephrased samples, 2023. URL https://arxiv.org/abs/2311.04850

  28. [28]

WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. WebShop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

  29. [29]

τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

  30. [30]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023. URL https://arxiv.org/abs/2306.05685

  31. [31]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents, 2023. URL https://arxiv.org/abs/2307.13854

  32. [32]

Establishing best practices for building rigorous agentic benchmarks

    Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, and D...

  33. [33]

arXiv preprint arXiv:2507.02825

URL https://arxiv.org/abs/2507.02825