What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema
Pith reviewed 2026-05-21 05:16 UTC · model grok-4.3
The pith
An audit of twelve LLM benchmark papers finds that the eight agent-oriented ones score a mean of 0.38 out of 1.0 on disclosure of key evaluation details, against 0.66 for the four classical static ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying a custom five-field audit schema to twelve LLM benchmark papers, the authors find that agent-oriented benchmark papers disclose markedly less about their evaluation procedures than classical static benchmark papers. The mean score for the eight agent papers is 0.38 (out of 1.0), compared with 0.66 for the four static ones. The largest gaps are in cost reporting, where none of the agent papers disclose inference cost in any form, and in harness specification, where none fully disclose a content-addressed container image of the evaluation environment. The audit scores disclosure only, not the validity of results, and the schema is released as a JSON Schema file alongside a Markdown codebook and a CSV scoring sheet.
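The harness gap turns on whether an evaluation image is pinned by content rather than by a mutable tag. The following is a minimal sketch of that distinction, assuming the common Docker/OCI reference form `name@sha256:<64 hex digits>`. The function name and the example references are hypothetical, and the paper's codebook may draw the line differently.

```python
import re

# Assumption: only sha256 digests are treated as content-addressed here;
# other digest algorithms are out of scope for this sketch.
_DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_content_addressed(image_ref: str) -> bool:
    """Return True if the image reference is pinned by a sha256 digest."""
    return _DIGEST_REF.match(image_ref) is not None


# Hypothetical references, for illustration only.
pinned = "ghcr.io/example/harness@sha256:" + "a" * 64
tagged = "ghcr.io/example/harness:latest"
print(is_content_addressed(pinned), is_content_addressed(tagged))
```

A tag such as `latest` can be re-pointed at a different image after publication, which is why a tag alone would not identify the environment a result was produced in.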
What carries the argument
The five-field disclosure audit schema consisting of benchmark identity, harness specification, inference settings, cost reporting, and failure breakdown, with explicit scoring rules defined in an accompanying codebook.
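As a rough illustration of how one audit record could be represented, the sketch below models the five fields as scores and averages them. The [0, 1] scale per field and the unweighted mean are assumptions made here for illustration; the released JSON Schema and codebook define the actual structure and scoring rules, and the example values are placeholders rather than scores from the paper.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class DisclosureAudit:
    """One paper's audit record; each field is a score in [0, 1] (assumed scale)."""
    benchmark_identity: float
    harness_specification: float
    inference_settings: float
    cost_reporting: float
    failure_breakdown: float

    def __post_init__(self):
        # Reject scores outside the assumed [0, 1] range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def overall(self) -> float:
        # Assumption: the overall score is the unweighted mean of the five fields.
        return mean(getattr(self, f.name) for f in fields(self))


# Placeholder values, not taken from the paper's scoring sheet.
example = DisclosureAudit(1.0, 0.5, 0.5, 0.0, 0.0)
print(example.overall())
```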
If this is right
- Reproducibility of LLM agent results remains limited until disclosure practices improve on cost and environment details.
- Classical static benchmarks serve as a higher standard for documentation that agent papers could emulate.
- Releasing the scoring schema allows other researchers to apply consistent audits to new papers.
- Single-pass auditing by one rater provides a baseline that multi-rater studies can build upon.
Where Pith is reading between the lines
- Poor disclosure may explain many conflicting results reported on the same benchmarks.
- Implementing content-addressed harnesses could become a standard practice if this schema gains adoption.
- The gap between agent and static benchmarks indicates that dynamic agent evaluations require more detailed documentation protocols.
- This work could extend to auditing benchmarks in other AI domains beyond LLMs.
Load-bearing premise
The five-field audit schema and its boundary cases in the codebook are sufficient to capture the key dimensions of evaluation disclosure quality.
What would settle it
Re-running the audit with three or more independent scorers and measuring inter-rater agreement on the same twelve papers would reveal whether the reported scores are stable or sensitive to individual interpretation.
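One way such agreement could be measured is Fleiss' kappa, which handles three or more raters assigning categorical labels. The sketch below is a plain-Python version under the assumption that each field score is treated as a categorical level (for example 0, 0.5, 1); the paper does not say which agreement statistic a follow-up would use, and the demo ratings are hypothetical.

```python
import math
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return math.nan  # only one category was ever used; kappa is undefined
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores for one field from three raters on four papers.
demo = [[0, 0, 0], [0, 0, 0.5], [0.5, 0.5, 0.5], [1, 1, 0.5]]
print(round(fleiss_kappa(demo), 3))
```

Fleiss' kappa ignores the ordering of the levels, so an ordinal-aware statistic may be a better fit if the codebook's levels are graded.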
Original abstract
We read twelve well-known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers will report results on the same benchmark with the same model name and disagree, and you cannot tell why -- the scaffold, the sampling settings, the subset, or the evaluator version. In many cases the published artifact does not let you answer. This paper is an implementation report on the attempt. We designed a small audit schema (five fields: benchmark identity, harness specification, inference settings, cost reporting, failure breakdown), wrote a scoring codebook with the boundary cases we hit during pilot scoring, applied it to twelve canonical papers (eight agent, four classical static), and recorded what we saw. We score the disclosure of an agent run, not its correctness, and make no claim that disclosure implies a trustworthy result. The mean audit score across the eight agent-benchmark papers is 0.38 (out of 1.0), and across the four classical static benchmarks 0.66; the largest gap is on cost (none of the eight agent benchmark papers disclose inference cost in any form) and on harness specification (none fully disclose a content-addressed container image of the evaluation environment). We release the schema as a JSON Schema file, the codebook as a Markdown document, and the raw scoring sheet as a CSV. The scoring was performed by a single auditor in one pass; a multi-rater audit is the natural next step, and we discuss what we think it would change.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a pilot audit of disclosure practices in twelve LLM benchmark papers, consisting of eight agent benchmarks and four classical static benchmarks. Using a five-field schema (benchmark identity, harness specification, inference settings, cost reporting, failure breakdown) with a documented codebook, it finds mean scores of 0.38 for agent papers and 0.66 for classical ones. Key gaps identified include no disclosure of inference cost in any form for agent papers and no full disclosure of content-addressed container images for the evaluation harness. The authors release the JSON schema, Markdown codebook, and CSV raw scores, framing the work as descriptive of disclosure rather than an assessment of benchmark correctness, while acknowledging the single-auditor limitation.
Significance. This pilot provides a practical, open tool for improving evaluation transparency in LLM agent research, where irreproducibility is a noted issue. The quantitative gaps, especially in cost and harness details, offer actionable insights, and the released artifacts enable extension by the community. Credit is due for the explicit release of the scoring schema, codebook, and data, as well as for the clear distinction between measuring disclosure and claiming benchmark validity.
major comments (1)
- The quantitative claims, including the mean audit scores of 0.38 versus 0.66 and the ranking of gaps on cost and harness specification, are based on a single auditor's application of the codebook after iterative refinement on the same papers. While the paper positions this as a pilot and calls for multi-rater follow-up, the boundary judgments (e.g., what constitutes 'fully disclose' for harness or 'any form' for cost) could vary, affecting the specific results reported in the abstract and results sections.
minor comments (2)
- The abstract opens with 'twelve well-known LLM agent benchmark papers', but the split into eight agent and four classical static papers appears only several sentences later; stating the split in the opening sentence would improve immediate clarity.
- A brief table or figure showing the per-field average scores across papers would help readers visualize the contributions to the overall means beyond the textual description.
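The per-field table suggested above could be derived directly from a scoring sheet. The sketch below is one possible approach and assumes a CSV with a `group` column and one numeric column per schema field; those column names and the sample rows are hypothetical placeholders, not the layout or contents of the released CSV.

```python
import csv
import io
from collections import defaultdict

FIELDS = [
    "benchmark_identity",
    "harness_specification",
    "inference_settings",
    "cost_reporting",
    "failure_breakdown",
]


def per_field_means(csv_text):
    """Return {group: {field: mean score}} from CSV text with the assumed columns."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        group = row["group"]
        counts[group] += 1
        for field in FIELDS:
            sums[group][field] += float(row[field])
    return {g: {f: sums[g][f] / counts[g] for f in FIELDS} for g in counts}


# Placeholder rows for illustration only; these are not the paper's scores.
SAMPLE = """paper,group,benchmark_identity,harness_specification,inference_settings,cost_reporting,failure_breakdown
paper_a,agent,1.0,0.5,0.5,0.0,0.5
paper_b,agent,1.0,0.0,0.5,0.0,0.0
paper_c,static,1.0,0.5,1.0,0.5,0.5
"""

for group, means in per_field_means(SAMPLE).items():
    print(group, means)
```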
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work as a practical contribution to evaluation transparency and for recommending minor revision. We address the major comment point by point below.
-
Referee: The quantitative claims, including the mean audit scores of 0.38 versus 0.66 and the ranking of gaps on cost and harness specification, are based on a single auditor's application of the codebook after iterative refinement on the same papers. While the paper positions this as a pilot and calls for multi-rater follow-up, the boundary judgments (e.g., what constitutes 'fully disclose' for harness or 'any form' for cost) could vary, affecting the specific results reported in the abstract and results sections.
Authors: We agree that single-auditor application introduces the possibility of variation in boundary judgments, as the referee notes. The manuscript already frames the work explicitly as a pilot, documents the iterative codebook development, and calls for multi-rater follow-up while discussing what such an audit might change. The released codebook and CSV make every scoring decision inspectable. The two largest reported gaps (no cost disclosure of any kind in the eight agent papers, and no full disclosure of a content-addressed harness container) are zero-disclosure cases with limited boundary ambiguity. To further address the concern, we will add a clarifying sentence in the abstract and results sections stating that the reported means and gap rankings reflect the documented single-auditor process and are subject to potential inter-rater variation. Revision: partial.
Circularity Check
Descriptive empirical audit with direct tallies; no derivations or self-referential reductions
This paper applies a five-field audit schema to twelve external benchmark papers and records disclosure levels as simple tallies. The central claims are mean scores (0.38 vs 0.66) and gap identification, obtained by reading source texts against an explicitly released codebook. No equations, fitted parameters, predictions, or uniqueness theorems appear. The single-auditor limitation is stated openly rather than hidden, and the schema itself is offered as an open artifact for future multi-rater use. The derivation chain is therefore self-contained against external source documents and does not reduce to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The five fields (benchmark identity, harness specification, inference settings, cost reporting, failure breakdown) capture the essential aspects of evaluation disclosure.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "The mean audit score across the eight agent-benchmark papers is 0.38"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.