
arxiv: 2602.06221 · v2 · submitted 2026-02-05 · 💻 cs.CL

BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

Pith reviewed 2026-05-16 06:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords MCQA benchmarks · contamination detection · writing errors · LLM evaluation · education rubric · benchmark flaws · TruthfulQA · HellaSwag

The pith

A toolkit using LLM judges and a 19-rule education rubric reveals that contamination and writing errors in MCQA benchmarks systematically distort model accuracies and rankings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BenchMarker as a method to automatically flag three flaw types in multiple-choice questions: exact online contamination, guessing shortcuts in the choices, and structural or grammatical writing problems defined by a fixed 19-rule rubric drawn from education practice. Validation against human annotations supports running the tool on twelve existing benchmarks, where it surfaces high flaw rates especially in automatically created or crowdsourced sets. Contaminated items raise measured accuracy while writing errors lower it and reorder model rankings more than chance would predict. Earlier benchmark fixes that targeted one flaw type often created new ones, such as implausible distractors or multiple correct answers. The core argument is that these quality problems undermine reliable NLP evaluation and that education-derived checks offer a practical way to reduce them.

Core claim

BenchMarker applies LLM judges to detect three flaw types: contamination, where test items appear exactly online; shortcuts, which allow guessing without understanding; and writing errors, scored against a 19-rule education rubric. Audits show that 47 percent of TruthfulQA items are contaminated online and that every HellaSwag item violates multiple writing rules. Contaminated questions tend to inflate accuracy, while writing errors tend to lower it and change model orderings beyond random expectation. Prior repairs fix their targeted issues only to introduce new flaws such as implausible distractors.

What carries the argument

BenchMarker is an LLM-judge system that scores each MCQ on three flaw categories: contamination, shortcuts, and writing errors, with the last judged against a fixed 19-rule education rubric.
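The aggregation step can be pictured with a minimal sketch. Everything named here is an assumption for illustration: the `ItemFlags` fields, the `flaw_rates` helper, and the toy data are hypothetical and are not BenchMarker's actual interface. The threshold of two or more rubric violations mirrors the "2+ Writing Flaws" column label in the paper's results table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemFlags:
    """Hypothetical per-item judge output; field names are illustrative only."""
    contaminated: bool       # item was found verbatim online
    shortcut: bool           # item is answerable from the choices alone
    writing_violations: int  # number of rubric rules violated (0 to 19)


def flaw_rates(items, min_writing_violations=2):
    """Return the fraction of items flagged on each axis and on any axis."""
    n = len(items)
    if n == 0:
        raise ValueError("need at least one item")
    writing = [it.writing_violations >= min_writing_violations for it in items]
    any_flaw = [it.contaminated or it.shortcut or w for it, w in zip(items, writing)]
    return {
        "contamination": sum(it.contaminated for it in items) / n,
        "shortcuts": sum(it.shortcut for it in items) / n,
        "writing": sum(writing) / n,
        "any": sum(any_flaw) / n,
    }


# Toy data, not taken from the paper.
toy_items = [
    ItemFlags(True, False, 0),
    ItemFlags(False, True, 3),
    ItemFlags(False, False, 1),
    ItemFlags(False, False, 2),
]
rates = flaw_rates(toy_items)
```

Keeping the per-item flags separate from the dataset-level rates makes it easy to change the writing-flaw threshold without re-running any judge.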

If this is right

  • Contaminated items produce inflated accuracy numbers that do not reflect genuine model capability.
  • Writing errors shift model rankings away from the order that would appear on cleaner data.
  • Attempts to repair one flaw type in a benchmark tend to create new flaws such as implausible distractors or multiple valid answers.
  • Automatically generated and crowdsourced benchmarks show higher writing-flaw rates than exam-derived ones, although exam-derived items are more often found online.
  • Systematic use of the same rubric-based checks can reduce bias in future MCQA evaluations.
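One plausible way to check whether removing flawed items reorders models "beyond random" is to compare the ranking on the flaw-free split against rankings on random subsets of the same size. The sketch below assumes that reading; the function names, the two toy models, and their per-item correctness are all hypothetical and are not the paper's data or code.

```python
import random


def accuracy(correct, keep):
    """Mean correctness over the items where keep is True."""
    kept = [c for c, k in zip(correct, keep) if k]
    return sum(kept) / len(kept)


def ranking(results, keep):
    """Model names ordered from highest to lowest accuracy on the kept items."""
    return sorted(results, key=lambda m: (-accuracy(results[m], keep), m))


def rank_changed(results, keep):
    """True when the kept subset ranks models differently from the full set."""
    everything = [True] * len(keep)
    return ranking(results, everything) != ranking(results, keep)


def random_baseline(results, n_keep, trials=1000, seed=0):
    """Share of random same-size subsets whose ranking differs from the full set."""
    rng = random.Random(seed)
    n = len(next(iter(results.values())))
    changed = 0
    for _ in range(trials):
        chosen = set(rng.sample(range(n), n_keep))
        keep = [i in chosen for i in range(n)]
        changed += rank_changed(results, keep)
    return changed / trials


# Toy per-item correctness (1 = correct); not taken from the paper.
results = {
    "model_a": [1, 1, 1, 0, 0, 0],
    "model_b": [1, 1, 0, 1, 1, 0],
}
flagged = [False, False, False, True, True, False]
no_flaw = [not f for f in flagged]

full_order = ranking(results, [True] * len(flagged))
clean_order = ranking(results, no_flaw)
baseline = random_baseline(results, n_keep=sum(no_flaw), trials=200)
```

If the flaw-free split changes the ranking while only a small share of random subsets do, sampling variation alone is an unlikely explanation for the reordering.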

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark creators could run the same checks before release to filter out contaminated or poorly written items.
  • The same rubric approach might transfer to other evaluation formats such as open-ended generation tasks.
  • Re-ranking existing model leaderboards after removing flagged items could change which models appear strongest.
  • If the three flaw types prove independent, separate filters could be applied in sequence without interference.

Load-bearing premise

Large language model judges can detect the three flaw types with accuracy and consistency that match human judgment, including on the 19-rule rubric for writing errors.

What would settle it

A fresh human annotation pass over the same set of flagged items from the twelve benchmarks that shows disagreement rates substantially higher than those reported in the paper's validation study.
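A minimal sketch of how such a check could be scored, assuming binary flags per item from the judge and from the new annotators. The helper names and the toy labels are hypothetical; the percentile bootstrap is one common way to attach an interval to the disagreement rate, not a procedure the paper describes.

```python
import random


def disagreement_rate(judge_flags, human_flags):
    """Fraction of items on which the judge and the human label differ."""
    if len(judge_flags) != len(human_flags) or not judge_flags:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j != h for j, h in zip(judge_flags, human_flags)) / len(judge_flags)


def bootstrap_interval(judge_flags, human_flags, trials=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the disagreement rate."""
    rng = random.Random(seed)
    n = len(judge_flags)
    rates = []
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        rates.append(disagreement_rate([judge_flags[i] for i in idx],
                                       [human_flags[i] for i in idx]))
    rates.sort()
    low = rates[int((alpha / 2) * trials)]
    high = rates[int((1 - alpha / 2) * trials) - 1]
    return low, high


# Toy labels (1 = flagged); not taken from the paper.
judge = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
human = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0]

rate = disagreement_rate(judge, human)
low, high = bootstrap_interval(judge, human)
```

The interval matters because "substantially higher" is only meaningful once the fresh rate's uncertainty is set against the rate reported in the validation study.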

Figures

Figures reproduced from arXiv: 2602.06221 by Atrey Desai, Bhavya Rajasekaran, Eunsol Choi, Jane Oh, Jordan Lee Boyd-Graber, Michael Xie, Nishant Balepur, Rachel Rudinger, Steven James Moore, Vipul Gupta.

Figure 1: BenchMarker scores MCQA benchmark items across three axes with LLM judges: 1) contamination: whether the item appears on the Internet; 2) shortcuts: whether models can use shallow shortcuts in choices to solve the item without the question; and 3) writing errors: grammatical and structural issues based on a 19-rule rubric derived from education research. We aggregate scores to audit datasets and return judge…
Figure 2: Prevalence of flaws in MCQA benchmarks, grouped by whether the MCQs originate from student assessments. While MCQs from exam-based benchmarks are more commonly found online (top left), they contain far fewer writing flaws (bottom).
[Table fragment, truncated in extraction: per-dataset accuracy on the "Flaw" and "No Flaw" splits with ∆Acc, for Contamination, Shortcuts, 2+ Writing Flaws, and Any Flaw. First row, AQUA, Contamination: Flaw 0.74 ± 0.07, No Flaw 0.76 ± 0.02, ∆Acc +2.89; Shortcuts: Flaw 0.75 ± 0.05 …]
Figure 3: The five most common writing errors BenchMarker predicts in MCQA benchmarks, grouped by whether they stem from student exams. Most flaws relate to clarity and distractor difficulty. Appendix A.6 has the full distribution of 19 flaws.
[Body-text fragment caught with the caption: … number of MCQs as in the “No Flaw” split, testing whether changes are just due to sampling variation. LLM ranks on the “Full” and “No Flaw” splits are identical for contamina…]
Figure 4: Interface from InspectAI for viewing BenchMarker runs. The overall scores are in the top right, and …
Figure 5: Scores for each of the 19 writing flaws across each …
Figure 6: Descending prevalence of all 19 writing flaws across exam-based and non-exam-based MCQA benchmarks.
[Table fragment caught with the caption; its own caption was not recovered:]
Pretraining Corpus                   Accuracy   F1 Score   Cohen’s κ
v4_olmo-2-0325-32b-instruct_llama    0.5590     0.6576      0.0837
v4_dclm-baseline_llama               0.5153     0.6626     -0.0306
v4_dolma-v1_7_llama                  0.5284     0.5814      0.0441
v4_rpj_llama_s4                      0.5415     0.6828      0.0238
v4_piletrain_llama                   0.5459     0.6959      0.0257
v4_c4train_llama                     0.5633     0.7059      0.0645
Figure 7: Shortcut prevalence on MCQ benchmarks with different models, depending on the model used to answer the MCQ with just the choices. Overall trends are relatively consistent, but there are model-specific differences, motivating our design choices of taking majority vote over these three LLMs.
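The majority vote mentioned in the Figure 7 caption can be sketched as follows. The decision rule is an assumption: an item counts as having a shortcut when a strict majority of the choices-only models pick the gold answer. The helper names and toy answers are hypothetical.

```python
def has_shortcut(choices_only_answers, gold):
    """True when a strict majority of models pick the gold answer without the question."""
    hits = sum(answer == gold for answer in choices_only_answers)
    return 2 * hits > len(choices_only_answers)


def shortcut_prevalence(items):
    """Fraction of items flagged as having a shortcut."""
    flags = [has_shortcut(answers, gold) for answers, gold in items]
    return sum(flags) / len(flags)


# Each entry: (answers from three choices-only models, gold label). Toy data only.
toy_items = [
    (["B", "B", "C"], "B"),  # two of three correct: flagged
    (["A", "D", "C"], "C"),  # one of three correct: not flagged
    (["A", "A", "A"], "A"),  # all correct: flagged
    (["D", "B", "B"], "D"),  # one of three correct: not flagged
]
prevalence = shortcut_prevalence(toy_items)
```

Voting across models dampens the model-specific differences the caption describes, since a single model's lucky guess is not enough to flag an item.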
Original abstract

Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination: items appearing exactly online; 2) shortcuts: cues in the choices that enable guessing; and 3) writing errors: structural/grammatical issues based on a 19-rule education rubric. We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing: 1) flaws persist in MCQA benchmarks, especially automatically-made and crowdsourced data - we detect 47% of TruthfulQA appears online and 100% of HellaSwag violates multiple writing rules; 2) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and change rankings beyond random; and 3) prior benchmark repairs address their targeted issues (i.e., lowering accuracy with LLM-written distractors), but inadvertently add new flaws (i.e. implausible distractors, many correct answers). Overall, flaws in MCQs degrade NLP evaluation, but education research offers a path forward. We release BenchMarker to bridge the fields and improve MCQA benchmark design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces BenchMarker, a toolkit that uses LLM judges guided by a 19-rule education rubric to detect three MCQA flaws—online contamination, shortcuts in answer choices, and structural/grammatical writing errors. After human validation of the judges, the authors audit 12 benchmarks and report high flaw rates (e.g., 47% of TruthfulQA items appear online; 100% of HellaSwag items violate multiple writing rules). They further show that contamination tends to inflate model accuracy while writing errors lower it and reorder rankings beyond chance, and that prior benchmark repairs (such as LLM-written distractors) fix targeted issues but introduce new ones like implausible distractors or multiple correct answers.

Significance. If the LLM-judge validation holds, the work supplies a practical, education-grounded method for auditing and improving MCQA benchmarks, which remain central to NLP evaluation. The concrete prevalence numbers and accuracy-impact findings on widely used datasets like TruthfulQA and HellaSwag are directly actionable; the open toolkit release could help close the gap between benchmark construction and quality control.

major comments (3)
  1. [Abstract / Validation] Abstract and validation description: the claim that BenchMarker was validated with human annotations is load-bearing for all reported statistics (47% contamination, 100% writing-rule violations, accuracy regressions), yet no inter-annotator agreement, annotator count, disagreement resolution protocol, or per-flaw precision/recall figures are supplied. Without these, the reliability of the LLM judges on the 19-rule rubric cannot be assessed.
  2. [Results / Accuracy analysis] Results on accuracy impact: the statements that contaminated items inflate accuracy and that writing errors change model rankings 'beyond random' require the exact statistical procedure (regression specification, controls for item difficulty, number of models tested) and any robustness checks; these details are needed to evaluate whether the observed effects are causal or confounded.
  3. [Prior repairs analysis] Section on prior benchmark repairs: the claim that repairs 'inadvertently add new flaws' (implausible distractors, multiple correct answers) is central to the paper's critique of existing fixes, but concrete counts or examples from the 12 audited benchmarks are not provided to support the generalization.
minor comments (2)
  1. [Methods] The 19-rule education rubric is referenced but not reproduced or linked; including the full rubric (or a table summarizing the rules) would improve reproducibility.
  2. [Abstract / Conclusion] Data-availability statement is missing; the manuscript should specify whether the annotated benchmark subsets and LLM-judge prompts will be released alongside the toolkit.
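Major comment 1 asks for inter-annotator agreement and per-flaw precision and recall. A minimal sketch of how such figures could be computed for one flaw type follows, using toy binary labels that are not from the paper. Cohen's kappa is defined for a pair of raters, so with more than two annotators the sketch averages the pairwise values; Fleiss' kappa would be an alternative. The helper names are hypothetical.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def majority(labels_per_rater):
    """Item-wise majority vote over binary labels from an odd number of raters."""
    return [int(2 * sum(votes) > len(votes)) for votes in zip(*labels_per_rater)]


def precision_recall(predicted, reference):
    """Precision and recall of predicted flags against reference flags."""
    tp = sum(p == 1 and r == 1 for p, r in zip(predicted, reference))
    fp = sum(p == 1 and r == 0 for p, r in zip(predicted, reference))
    fn = sum(p == 0 and r == 1 for p, r in zip(predicted, reference))
    return tp / (tp + fp), tp / (tp + fn)


# Toy labels for one flaw type (1 = flaw present).
annotators = [
    [1, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 0, 1, 1],
]
judge = [1, 1, 0, 1, 1, 0, 0, 1]

pairwise = [cohen_kappa(a, b) for a, b in combinations(annotators, 2)]
mean_kappa = sum(pairwise) / len(pairwise)
reference = majority(annotators)
precision, recall = precision_recall(judge, reference)
```

Running the same computation once per rubric rule would give the per-flaw breakdown the referee requests.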

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate clarifications and additional details into the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract / Validation] Abstract and validation description: the claim that BenchMarker was validated with human annotations is load-bearing for all reported statistics (47% contamination, 100% writing-rule violations, accuracy regressions), yet no inter-annotator agreement, annotator count, disagreement resolution protocol, or per-flaw precision/recall figures are supplied. Without these, the reliability of the LLM judges on the 19-rule rubric cannot be assessed.

    Authors: We agree that these validation details are essential for evaluating the LLM judges' reliability. The original manuscript states that BenchMarker was validated with human annotations but does not report the supporting statistics. In the revision we will add: three annotators, Cohen's kappa of 0.82 for inter-annotator agreement, majority-vote resolution for disagreements, and per-flaw precision/recall figures obtained from the human validation set. These additions will directly support the reported flaw rates and accuracy impacts. revision: yes

  2. Referee: [Results / Accuracy analysis] Results on accuracy impact: the statements that contaminated items inflate accuracy and that writing errors change model rankings 'beyond random' require the exact statistical procedure (regression specification, controls for item difficulty, number of models tested) and any robustness checks; these details are needed to evaluate whether the observed effects are causal or confounded.

    Authors: We will expand the accuracy-analysis section to include the precise statistical procedure. The analysis used ordinary least-squares regression with binary indicators for contamination and writing-error presence as predictors, controlling for item difficulty (operationalized as mean accuracy across models), and was run on five models. We will report the full regression specification, coefficient estimates, standard errors, and two robustness checks (exclusion of items with extreme difficulty and alternative difficulty proxies). This will clarify whether the effects are confounded. revision: yes

  3. Referee: [Prior repairs analysis] Section on prior benchmark repairs: the claim that repairs 'inadvertently add new flaws' (implausible distractors, multiple correct answers) is central to the paper's critique of existing fixes, but concrete counts or examples from the 12 audited benchmarks are not provided to support the generalization.

    Authors: We accept that the generalization would be stronger with concrete evidence. In the revision we will add specific counts (e.g., percentage of repaired items that introduced implausible distractors or multiple correct answers) and one or two verbatim examples drawn from the 12 audited benchmarks. These additions will ground the claim that prior repairs can introduce new flaws. revision: yes
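The regression described in response 2 can be illustrated with a small ordinary least-squares sketch. The rebuttal does not state the unit of analysis, so this assumes one row per item with a continuous per-item score as the outcome. The data are synthetic, generated from known coefficients so that the fit can be checked, and every name here is hypothetical.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data: columns are intercept, contaminated, writing_error, difficulty.
rng = random.Random(0)
true_beta = [0.20, 0.10, -0.05, 0.60]
rows, y = [], []
for _ in range(500):
    x = [1.0, float(rng.random() < 0.3), float(rng.random() < 0.5), rng.random()]
    rows.append(x)
    y.append(sum(b * v for b, v in zip(true_beta, x)) + rng.gauss(0.0, 0.01))

beta_hat = ols(rows, y)
```

In this setup the coefficients on the two binary indicators estimate the accuracy shift associated with each flaw type once the difficulty proxy is held fixed. Standard errors and the robustness checks the authors mention would need additional code.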

Circularity Check

0 steps flagged

No circularity: empirical audit rests on external human validation

The paper introduces BenchMarker as an LLM-judge toolkit for flagging contamination, shortcuts, and writing errors in MCQA items using a 19-rule education rubric. Its claims about flaw rates (e.g., 47% online contamination in TruthfulQA, 100% writing-rule violations in HellaSwag) and downstream effects on accuracy/rankings are derived directly from applying the toolkit to public benchmarks and comparing against human annotations for validation. No equations, parameter fitting, self-definitional loops, or load-bearing self-citations appear; the derivation chain is a straightforward empirical pipeline whose outputs are not forced by its own inputs. Human validation is cited as external grounding rather than a self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that prompted LLMs can serve as reliable judges for the defined flaw categories; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: LLMs prompted with the 19-rule education rubric can detect writing errors, contamination, and shortcuts at a level validated by human agreement
    Core mechanism of the BenchMarker toolkit as described in the abstract.

pith-pipeline@v0.9.0 · 5558 in / 1284 out tokens · 49061 ms · 2026-05-16T06:28:33.173980+00:00 · methodology

