Pilot audit of twelve LLM benchmark papers finds mean disclosure score of 0.38/1.0 for agent benchmarks versus 0.66 for classical ones, with zero papers disclosing inference costs or full harness specs, and releases an open JSON schema plus scoring CSV.
Proving Test Set Contamination in Black Box Language Models,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
ACCEPT 1representative citing papers
citing papers explorer
-
What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema
Pilot audit of twelve LLM benchmark papers finds mean disclosure score of 0.38/1.0 for agent benchmarks versus 0.66 for classical ones, with zero papers disclosing inference costs or full harness specs, and releases an open JSON schema plus scoring CSV.