In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Singapore

NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark · 2023 · arXiv 2503.10533

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

cs.CL · 2026-02-05 · unverdicted · novelty 7.0

BenchMarker toolkit audits 12 MCQA benchmarks for contamination, shortcuts, and writing errors using LLM judges, finding widespread flaws that inflate or deflate accuracy and alter rankings.

citing papers explorer


BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks · cs.CL · 2026-02-05 · unverdicted · none · ref 8
