The BenchMarker toolkit audits 12 multiple-choice question answering (MCQA) benchmarks for contamination, shortcuts, and writing errors using LLM judges, finding widespread flaws that inflate or deflate accuracy and alter model rankings.
In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Singapore
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.CL (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks