What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

Faezeh Ghaderi (University of Texas at Arlington); Mahdi Naser Moghadasi (BrightMind AI; Texas Tech University)

arxiv: 2605.21404 · v1 · pith:W4JUOTEBnew · submitted 2026-05-20 · 💻 cs.LG


Pith reviewed 2026-05-21 05:16 UTC · model grok-4.3


keywords: LLM agent benchmarks · evaluation disclosure · reproducibility audit · benchmark papers · inference cost reporting · harness specification · audit schema · pilot study


The pith

An audit of twelve LLM benchmark papers finds that the eight agent-focused papers score a mean of only 0.38 out of 1.0 on disclosure of key evaluation details, against 0.66 for the four classical static benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper audits how much information twelve well-known LLM benchmark papers actually provide about their evaluation setups. It introduces a five-field scoring schema covering benchmark identity, harness specification, inference settings, cost reporting, and failure breakdown. The authors score eight agent-focused papers at a mean of 0.38 out of 1.0 and four classical static benchmarks at 0.66. The gaps highlight missing details on costs and exact evaluation environments, which prevent reproducing or explaining differing results across papers. By releasing the schema, codebook, and scores, the work aims to encourage more transparent reporting in future benchmarks.

Core claim

By applying a custom five-field audit schema to twelve LLM benchmark papers, the authors find that agent-oriented benchmark papers disclose substantially less about their evaluation procedures than classical static benchmark papers. The mean score for the eight agent papers is 0.38 out of 1.0, compared with 0.66 for the four static ones. The largest deficiencies appear in cost reporting, where none of the agent papers disclose inference cost in any form, and in harness specification, where none fully disclose a content-addressed container image of the evaluation environment. The audit scores disclosure only, not the validity of results, and the schema is released as a JSON Schema file.
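To make the "content-addressed container image" criterion concrete, here is a minimal sketch of one way to check whether an image reference is pinned by digest rather than by a mutable tag. It relies on the standard Docker/OCI `name@sha256:<64 hex characters>` digest form; the function name and the example references are hypothetical, and the paper's codebook may define the criterion differently.

```python
import re

# Docker/OCI digest references have the form "name@sha256:<64 hex chars>".
_DIGEST_RE = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_content_addressed(image_ref: str) -> bool:
    """Return True if the image reference is pinned by a sha256 digest."""
    return bool(_DIGEST_RE.match(image_ref))


# Hypothetical references, for illustration only.
pinned = "example.org/agent-harness@sha256:" + "ab" * 32
tagged = "example.org/agent-harness:latest"

print(is_content_addressed(pinned))
print(is_content_addressed(tagged))
```

A tag such as `latest` can be re-pointed to a different image at any time, whereas a digest identifies exactly one image, which is why a digest-pinned reference is the stronger form of disclosure.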

What carries the argument

The five-field disclosure audit schema consisting of benchmark identity, harness specification, inference settings, cost reporting, and failure breakdown, with explicit scoring rules defined in an accompanying codebook.
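As a rough illustration of how a five-field record could be turned into a single score, the sketch below stores one score per field and averages them. The equal weighting, the 0 to 1 range per field, the class name, and the example values are all assumptions made for illustration; the actual scoring rules live in the paper's codebook.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass
class AuditRecord:
    """One audited paper, with a score in [0, 1] for each of the five fields."""
    paper: str
    benchmark_identity: float
    harness_specification: float
    inference_settings: float
    cost_reporting: float
    failure_breakdown: float

    def overall(self) -> float:
        """Unweighted mean of the five field scores (equal weights are an assumption)."""
        scores = [getattr(self, f.name) for f in fields(self) if f.name != "paper"]
        return mean(scores)


# Placeholder values, not taken from the paper's scoring sheet.
record = AuditRecord("hypothetical-agent-paper", 1.0, 0.5, 0.5, 0.0, 0.0)
print(record.overall())
```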

If this is right

Reproducibility of LLM agent results remains limited until disclosure practices improve on cost and environment details.
Classical static benchmarks serve as a higher standard for documentation that agent papers could emulate.
Releasing the scoring schema allows other researchers to apply consistent audits to new papers.
Single-pass auditing by one rater provides a baseline that multi-rater studies can build upon.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Poor disclosure may explain many conflicting results reported on the same benchmarks.
Implementing content-addressed harnesses could become a standard practice if this schema gains adoption.
The gap between agent and static benchmarks indicates that dynamic agent evaluations require more detailed documentation protocols.
This work could extend to auditing benchmarks in other AI domains beyond LLMs.

Load-bearing premise

The five-field audit schema and its boundary cases in the codebook are sufficient to capture the key dimensions of evaluation disclosure quality.

What would settle it

Re-running the audit with three or more independent scorers and measuring inter-rater agreement on the same twelve papers would reveal whether the reported scores are stable or sensitive to individual interpretation.
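One common way to measure agreement among three or more raters is Fleiss' kappa. The sketch below is a plain-Python version that treats each field score as a categorical label; that treatment, the function name, and the choice of Fleiss' kappa itself are assumptions, since the paper does not say which agreement statistic a multi-rater follow-up would use.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for items that were each rated by the same number of raters.

    ratings[i] holds the categorical labels the raters assigned to item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(r) for r in ratings]
    categories = {label for cnt in counts for label in cnt}

    # Observed agreement: mean over items of the per-item agreement P_i.
    p_items = [
        (sum(v * v for v in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_e = sum((sum(cnt[c] for cnt in counts) / total) ** 2 for c in categories)

    if p_e == 1.0:
        # Every rating falls in one category; kappa is undefined, and this
        # sketch returns 1.0 by convention.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical labels from three raters on two items.
print(fleiss_kappa([["full", "full", "full"], ["none", "none", "none"]]))
```

Values near 1 indicate that raters apply the codebook consistently; values near 0 or below indicate agreement no better than chance, which would suggest the reported means are sensitive to individual interpretation.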

Figures

Figures reproduced from arXiv: 2605.21404 by Faezeh Ghaderi (University of Texas at Arlington), Mahdi Naser Moghadasi (BrightMind AI, Texas Tech University).

Original abstract

We read twelve well-known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers will report results on the same benchmark with the same model name and disagree, and you cannot tell why -- the scaffold, the sampling settings, the subset, or the evaluator version. In many cases the published artifact does not let you answer. This paper is an implementation report on the attempt. We designed a small audit schema (five fields: benchmark identity, harness specification, inference settings, cost reporting, failure breakdown), wrote a scoring codebook with the boundary cases we hit during pilot scoring, applied it to twelve canonical papers (eight agent, four classical static), and recorded what we saw. We score the disclosure of an agent run, not its correctness, and make no claim that disclosure implies a trustworthy result. The mean audit score across the eight agent-benchmark papers is 0.38 (out of 1.0), and across the four classical static benchmarks 0.66; the largest gap is on cost (none of the eight agent benchmark papers disclose inference cost in any form) and on harness specification (none fully disclose a content-addressed container image of the evaluation environment). We release the schema as a JSON Schema file, the codebook as a Markdown document, and the raw scoring sheet as a CSV. The scoring was performed by a single auditor in one pass; a multi-rater audit is the natural next step, and we discuss what we think it would change.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Agent benchmark papers disclose far less than classical ones, especially on cost and harness details, and the released open schema is the part worth using.


This paper shows that LLM agent benchmark papers disclose far less about their runs than classical static benchmark papers do. The mean scores come out to 0.38 for the eight agent papers and 0.66 for the four classical ones, with cost and harness specification as the clearest gaps. None of the agent papers report inference cost in any form, and none fully specify a content-addressed container for the evaluation environment.

The authors built a five-field audit schema covering benchmark identity, harness specification, inference settings, cost reporting, and failure breakdown. They documented the codebook with the boundary cases they encountered, scored the twelve papers in one pass, and released the schema as JSON, the codebook as Markdown, and the scores as CSV. This turns a recurring complaint about non-comparable results into something that can be measured and addressed. They are explicit that the scores reflect disclosure only, not the quality of the underlying benchmarks.

The single-auditor limitation is the main soft spot. Judgments on terms like full harness disclosure or any form of cost reporting could shift with another reader, which might adjust the exact averages or which gap appears largest. The broad pattern of lower disclosure in agent work would probably stay the same. The schema is intentionally small, so it leaves room for later expansion, but that is reasonable for a pilot.

Researchers who evaluate LLM agents or who review papers in this area will get the most from this. Anyone tired of trying to replicate results without enough setup details can use the released schema right away. The work is clear and honest about its scope, so it deserves a serious referee. I would send it to peer review. The findings are useful as a pilot, and the open tooling adds real value that revisions can build on.

Referee Report

1 major / 2 minor

Summary. The paper conducts a pilot audit of disclosure practices in twelve LLM benchmark papers, consisting of eight agent benchmarks and four classical static benchmarks. Using a five-field schema (benchmark identity, harness specification, inference settings, cost reporting, failure breakdown) with a documented codebook, it finds mean scores of 0.38 for agent papers and 0.66 for classical ones. Key gaps identified include no disclosure of inference cost in any form for agent papers and no full disclosure of content-addressed container images for the evaluation harness. The authors release the JSON schema, Markdown codebook, and CSV raw scores, framing the work as descriptive of disclosure rather than an assessment of benchmark correctness, while acknowledging the single-auditor limitation.

Significance. This pilot provides a practical, open tool for improving evaluation transparency in LLM agent research, where irreproducibility is a noted issue. The quantitative gaps, especially in cost and harness details, offer actionable insights, and the released artifacts enable extension by the community. Credit is due for the explicit release of the scoring schema, codebook, and data, as well as for the clear distinction between measuring disclosure and claiming benchmark validity.

major comments (1)

The quantitative claims, including the mean audit scores of 0.38 versus 0.66 and the ranking of gaps on cost and harness specification, are based on a single auditor's application of the codebook after iterative refinement on the same papers. While the paper positions this as a pilot and calls for multi-rater follow-up, the boundary judgments (e.g., what constitutes 'fully disclose' for harness or 'any form' for cost) could vary, affecting the specific results reported in the abstract and results sections.

minor comments (2)

The abstract refers to 'twelve well-known LLM agent benchmark papers' but the breakdown into eight agent and four classical is only clarified later; including this split in the abstract would improve immediate clarity.
A brief table or figure showing the per-field average scores across papers would help readers visualize the contributions to the overall means beyond the textual description.
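A per-field table of the kind requested here could be produced directly from a scoring sheet in CSV form. The following sketch shows one way to do so; the column names, the `group` column, and the sample rows are hypothetical placeholders and do not reproduce the released CSV or its values.

```python
import csv
import io
from collections import defaultdict

FIELDS = [
    "benchmark_identity",
    "harness_specification",
    "inference_settings",
    "cost_reporting",
    "failure_breakdown",
]


def per_field_means(csv_text: str) -> dict[str, dict[str, float]]:
    """Average each field score within each group (e.g. agent vs. static)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        group = row["group"]
        counts[group] += 1
        for name in FIELDS:
            sums[group][name] += float(row[name])
    return {g: {name: sums[g][name] / counts[g] for name in FIELDS} for g in counts}


# Placeholder rows for illustration; these are not the paper's scores.
SAMPLE = """paper,group,benchmark_identity,harness_specification,inference_settings,cost_reporting,failure_breakdown
p1,agent,1,0.5,0.5,0,0
p2,agent,1,0.5,1,0,0.5
p3,static,1,0.5,1,0.5,1
"""

table = per_field_means(SAMPLE)
for group, means in table.items():
    print(group, {name: round(value, 2) for name, value in means.items()})
```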

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work as a practical contribution to evaluation transparency and for recommending minor revision. We address the major comment point by point below.


Referee: The quantitative claims, including the mean audit scores of 0.38 versus 0.66 and the ranking of gaps on cost and harness specification, are based on a single auditor's application of the codebook after iterative refinement on the same papers. While the paper positions this as a pilot and calls for multi-rater follow-up, the boundary judgments (e.g., what constitutes 'fully disclose' for harness or 'any form' for cost) could vary, affecting the specific results reported in the abstract and results sections.

Authors: We agree that single-auditor application introduces the possibility of variation in boundary judgments, as the referee notes. The manuscript already frames the work explicitly as a pilot, documents the iterative codebook development, and calls for multi-rater follow-up while discussing what such an audit might change. The released codebook and CSV make every scoring decision inspectable. The largest reported gaps (the complete absence of any cost disclosure in the eight agent papers and the lack of fully disclosed content-addressed harness containers) are zero-disclosure cases with limited boundary ambiguity. To further address the concern, we will add a clarifying sentence in the abstract and results sections stating that the reported means and gap rankings reflect the documented single-auditor process and are subject to potential inter-rater variation. revision: partial

Circularity Check

0 steps flagged

Descriptive empirical audit with direct tallies; no derivations or self-referential reductions

full rationale

This paper applies a five-field audit schema to twelve external benchmark papers and records disclosure levels as simple tallies. The central claims are mean scores (0.38 vs 0.66) and gap identification, obtained by reading source texts against an explicitly released codebook. No equations, fitted parameters, predictions, or uniqueness theorems appear. The single-auditor limitation is stated openly rather than hidden, and the schema itself is offered as an open artifact for future multi-rater use. The derivation chain is therefore self-contained against external source documents and does not reduce to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that the chosen five fields adequately represent disclosure quality for the purpose of this pilot; no free parameters are fitted and no new entities are postulated.

axioms (1)

domain assumption: The five fields (benchmark identity, harness specification, inference settings, cost reporting, failure breakdown) capture the essential aspects of evaluation disclosure. Introduced in the schema design section as the basis for scoring.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: “The mean audit score across the eight agent-benchmark papers is 0.38”

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 3 internal anchors

[1] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan, “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” International Conference on Learning Representations (ICLR), 2024.
[2] OpenAI, “Introducing SWE-bench Verified,” OpenAI Technical Report, 2024.
[3] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig, “WebArena: A Realistic Web Environment for Building Autonomous Agents,” International Conference on Learning Representations (ICLR), 2024.
[4] J. Y. Koh, R. Lo, L. Jang, V. Duvvur, M. C. Lim, P.-Y. Huang, G. Neubig, S. Zhou, R. Salakhutdinov, and D. Fried, “VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks,” Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
[5] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su, “Mind2Web: Towards a Generalist Agent for the Web,” Advances in Neural Information Processing Systems (NeurIPS), 2023.
[6] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu, “OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments,” Advances in Neural Information Processing Systems (NeurIPS), 2024.
[7] G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom, “GAIA: A Benchmark for General AI Assistants,” International Conference on Learning Representations (ICLR), 2024.
[8] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang, “AgentBench: Evaluating LLMs as Agents,” International Conference on Learning Representations (ICLR), 2024.
[9] C. Ma, J. Zhang, Z. Zhu, C. Yang, Y. Yang, Y. Jin, Z. Lan, L. Kong, and J. He, “AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agents,” Advances in Neural Information Processing Systems (NeurIPS), 2024.
[10] J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, L. Weng, and A. Madry, “MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering,” OpenAI Technical Report, 2024.
[11] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, et al., “Evaluating Large Language Models Trained on Code,” arXiv preprint arXiv:2107.03374, 2021.
[12] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton, “Program Synthesis with Large Language Models,” arXiv preprint arXiv:2108.07732, 2021.
[13] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training Verifiers to Solve Math Word Problems,” arXiv preprint arXiv:2110.14168, 2021.
[14] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring Massive Multitask Language Understanding,” International Conference on Learning Representations (ICLR), 2021.
[15] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep Reinforcement Learning that Matters,” AAAI Conference on Artificial Intelligence, 2018.
[16] J. Pineau, P. Vincent-Lamarre, K. Sinha, V. Larivière, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and H. Larochelle, “Improving Reproducibility in Machine Learning Research,” Journal of Machine Learning Research, vol. 22, no. 164, pp. 1–20, 2021.
[17] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru, “Model Cards for Model Reporting,” ACM Conference on Fairness, Accountability, and Transparency (FAT*), 2019.
[18] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford, “Datasheets for Datasets,” Communications of the ACM, vol. 64, no. 12, pp. 86–92, 2021.
[19] M. Sclar, Y. Choi, Y. Tsvetkov, and A. Suhr, “Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design,” International Conference on Learning Representations (ICLR), 2024.
[20] M. Mizrahi, G. Kaplan, D. Malkin, R. Dror, D. Shahaf, and G. Stanovsky, “State of What Art? A Call for Multi-Prompt LLM Evaluation,” Transactions of the Association for Computational Linguistics (TACL), 2024.
[21] A. Reuel, A. Hardy, C. Smith, M. Lamparth, M. Hardy, and M. J. Kochenderfer, “BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices,” Advances in Neural Information Processing Systems (NeurIPS), 2024.
[22] I. Magar and R. Schwartz, “Data Contamination: From Memorization to Exploitation,” Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
[23] Y. Oren, N. Meister, N. Chatterji, F. Ladhak, and T. B. Hashimoto, “Proving Test Set Contamination in Black Box Language Models,” International Conference on Learning Representations (ICLR), 2024.
