Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval
Pith reviewed 2026-06-27 13:11 UTC · model grok-4.3
The pith
Standard log-likelihood MCQA metrics report false performance gaps of over 2 points between models with identical knowledge, driven by answer phrasing alone; ParaEval reduces these gaps to below 1 point by scoring each option on its most favorable paraphrase.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Standard MCQA evaluation conflates a model's familiarity with specific answer phrasings with its actual capability; ParaEval corrects for this by querying multiple paraphrases per option and assigning the score of the most favorable one.
What carries the argument
ParaEval, an evaluation framework that generates multiple paraphrases for each answer option and scores the model on the highest-likelihood variant among them.
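The scoring rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `loglik` callable, the function names, and the toy log-likelihood values are hypothetical, and whether ParaEval applies length normalization is not stated in the abstract, so none is applied here.

```python
from typing import Callable, Sequence

# Hypothetical interface: (question, candidate answer text) -> log-likelihood.
LogLikelihood = Callable[[str, str], float]


def option_score(question: str, paraphrases: Sequence[str], loglik: LogLikelihood) -> float:
    """Score one answer option by its most favorable (highest log-likelihood) phrasing."""
    if not paraphrases:
        raise ValueError("each option needs at least one phrasing")
    return max(loglik(question, p) for p in paraphrases)


def paraeval_predict(question: str, options: Sequence[Sequence[str]], loglik: LogLikelihood) -> int:
    """Return the index of the option whose best phrasing scores highest."""
    scores = [option_score(question, phrasings, loglik) for phrasings in options]
    return max(range(len(scores)), key=scores.__getitem__)


def standard_predict(question: str, options: Sequence[Sequence[str]], loglik: LogLikelihood) -> int:
    """Baseline: score only the first (original) phrasing of each option."""
    return paraeval_predict(question, [phrasings[:1] for phrasings in options], loglik)


# Toy log-likelihoods, invented for illustration only.
toy_table = {
    "Paris": -5.0,
    "the city of Paris": -1.0,
    "Lyon": -2.0,
    "the city of Lyon": -3.0,
}


def toy_loglik(question: str, answer: str) -> float:
    return toy_table[answer]


question = "What is the capital of France?"
options = [["Paris", "the city of Paris"], ["Lyon", "the city of Lyon"]]

standard_choice = standard_predict(question, options, toy_loglik)
paraeval_choice = paraeval_predict(question, options, toy_loglik)
```

In this toy case the baseline picks the wrong option because the model happens to dislike the original phrasing of the correct answer, while taking the maximum over paraphrases recovers it. The same mechanism is what the load-bearing premise questions: the maximum rewards whichever surface form in the paraphrase set the model prefers.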
Load-bearing premise
That selecting the most favorable phrasing among generated paraphrases measures a model's true underlying capability rather than simply its ability to match some surface form in the paraphrase set.
What would settle it
If models trained on identical knowledge continue to show performance gaps larger than 1 point when evaluated with ParaEval, the claim that it isolates true capability would be falsified.
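The falsification test above amounts to comparing an accuracy gap against a 1-point threshold. A minimal sketch follows, assuming that "points" means percentage points of accuracy and that per-item correctness is available for each model; the function names and the toy data are hypothetical.

```python
from typing import Sequence


def accuracy_points(correct: Sequence[bool]) -> float:
    """Accuracy in percentage points over per-item correctness flags."""
    if not correct:
        raise ValueError("need at least one evaluated item")
    return 100.0 * sum(correct) / len(correct)


def gap_points(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Absolute accuracy gap between two models, in percentage points."""
    return abs(accuracy_points(correct_a) - accuracy_points(correct_b))


def exceeds_threshold(correct_a: Sequence[bool], correct_b: Sequence[bool], threshold: float = 1.0) -> bool:
    """True if the gap between two same-knowledge models is above the threshold."""
    return gap_points(correct_a, correct_b) > threshold


# Invented per-item results for two models on a 100-item benchmark.
model_a = [True] * 70 + [False] * 30
model_b = [True] * 67 + [False] * 33
toy_gap = gap_points(model_a, model_b)
```

A gap that stays above the threshold under ParaEval for models known to share the same knowledge would count against the claim; a real check would also need to account for sampling noise on the benchmark, which this sketch ignores.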
Original abstract
Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact phrasing (surface form) of the answers, conflating a model's familiarity with a specific phrase with its actual capability. We demonstrate this flaw using a controlled testbed of 1B-8B models trained on the same knowledge. Despite having identical knowledge, standard metrics falsely report a performance gap of over 2 points. To solve this, we propose ParaEval, an evaluation framework that queries models using multiple paraphrases per answer option. By scoring each model based on its most favorable phrasing, ParaEval successfully reduces the false performance gap to below 1 point. We confirm that these evaluation artifacts, and the improvements from ParaEval, persist in frontier 70B and 120B open-source models. Ultimately, ParaEval provides a robust and efficient way to evaluate true underlying capability rather than surface-form familiarity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that standard log-likelihood scoring in MCQA benchmarks is highly sensitive to answer phrasing, conflating surface-form familiarity with capability. Using a controlled testbed of 1B-8B models trained on identical knowledge, standard metrics report false gaps exceeding 2 points; ParaEval mitigates this by querying multiple paraphrases per option and scoring each model on its most favorable phrasing, reducing the gap below 1 point. The artifacts and ParaEval improvements are shown to persist in 70B and 120B open-source models.
Significance. If the paraphrase generation process is shown to be independent of the evaluated models' training distributions, ParaEval could meaningfully improve the robustness of MCQA evaluation by isolating knowledge from phrasing effects. The controlled testbed is a positive feature for isolating the variable of interest.
major comments (2)
- [Abstract] Abstract: The paraphrase generation method, diversity controls, and source model are not described. This is load-bearing for the central claim, because the reduction from >2-point to <1-point gaps via argmax over paraphrases can only demonstrate mitigation of phrasing sensitivity (rather than substitution of one surface-form bias) if the paraphrase distribution is neutral with respect to the models' pretraining data; without these details the result is not verifiable.
- [Abstract] Abstract: The controlled testbed is described only as 'models trained on the same knowledge.' To support the claim that identical knowledge produces >2-point gaps under standard scoring, the manuscript must specify the training corpus, any differences in model architecture or optimization, and how equivalence of knowledge was verified; absent this, unaccounted confounds cannot be ruled out.
Simulated Author's Rebuttal
We thank the referee for the constructive comments emphasizing verifiability. We address each major comment below and will revise the manuscript to incorporate the requested details.
- Referee: [Abstract] Abstract: The paraphrase generation method, diversity controls, and source model are not described. This is load-bearing for the central claim, because the reduction from >2-point to <1-point gaps via argmax over paraphrases can only demonstrate mitigation of phrasing sensitivity (rather than substitution of one surface-form bias) if the paraphrase distribution is neutral with respect to the models' pretraining data; without these details the result is not verifiable.
Authors: We agree that these details are essential to substantiate the central claim and rule out substitution of one bias for another. The revised manuscript will expand both the abstract and the Methods section to fully describe the paraphrase generation process, including the source model, diversity controls (e.g., similarity thresholds and variation metrics), and evidence or arguments establishing neutrality relative to the evaluated models' pretraining distributions.
Revision: yes
- Referee: [Abstract] Abstract: The controlled testbed is described only as 'models trained on the same knowledge.' To support the claim that identical knowledge produces >2-point gaps under standard scoring, the manuscript must specify the training corpus, any differences in model architecture or optimization, and how equivalence of knowledge was verified; absent this, unaccounted confounds cannot be ruled out.
Authors: We agree that the abstract's brevity leaves the testbed underspecified and that explicit details are required to support the claim of identical knowledge. The revised version will specify the training corpus, confirm that models share architecture and optimization (differing only in scale), and describe the verification procedure for knowledge equivalence.
Revision: yes
Circularity Check
No circularity: purely empirical comparison with no derivations or self-referential reductions
The paper contains no equations, derivations, or first-principles claims. Its central results rest on controlled empirical comparisons (identical-knowledge models showing >2-point gaps under standard scoring that shrink under ParaEval). No fitted parameters are renamed as predictions, no self-citations bear load on uniqueness theorems, and no ansatzes are smuggled. The evaluation framework is self-contained against the reported testbed; any concerns about paraphrase neutrality fall under validity rather than circularity by construction.