arxiv: 2606.10657 · v1 · pith:LCNZZCXLnew · submitted 2026-06-09 · 💻 cs.CL

Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

Pith reviewed 2026-06-27 13:11 UTC · model grok-4.3

classification 💻 cs.CL
keywords MCQA evaluation · paraphrase robustness · large language models · benchmark sensitivity · log-likelihood scoring · surface form artifacts

The pith

Standard MCQA metrics report spurious performance gaps of more than 2 points from phrasing alone; ParaEval shrinks them to under 1 point by scoring each model on its most favorable paraphrase of every option.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that log-likelihood scoring on multiple-choice benchmarks is unreliable because it is highly sensitive to the exact surface form of answer options. In a controlled testbed of models trained on identical knowledge, standard metrics falsely indicate gaps exceeding 2 points. ParaEval mitigates the issue by generating multiple paraphrases for each option and scoring each model on its highest-likelihood variant, shrinking the artifactual gap to under 1 point. The same artifacts and the same improvement appear in models up to 120 billion parameters. The goal is to measure underlying capability rather than familiarity with a particular phrasing.

Core claim

Standard MCQA evaluation conflates a model's familiarity with specific answer phrasings with its actual capability; ParaEval corrects for this by querying multiple paraphrases per option and assigning the score of the most favorable one.
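
The contrast can be written compactly. The notation below is an assumption made for illustration and is not taken from the paper: q is the question, a_1, ..., a_K are the answer options, \mathcal{P}(a_k) is the set of generated paraphrases of option k (assumed here to include the original phrasing), and p_\theta is the model being evaluated. Whether the paper length-normalizes the log-likelihood is not stated in the abstract, so the raw form is shown.

```latex
\begin{align}
\hat{k}_{\text{std}}  &= \arg\max_{k \in \{1,\dots,K\}} \log p_\theta(a_k \mid q), \\
\hat{k}_{\text{para}} &= \arg\max_{k \in \{1,\dots,K\}} \; \max_{a' \in \mathcal{P}(a_k)} \log p_\theta(a' \mid q).
\end{align}
```

Under these assumptions the two rules differ only in the inner maximum: standard scoring ties each option to a single surface form, while the second rule lets each model pick the surface form it finds most likely.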

What carries the argument

ParaEval, an evaluation framework that generates multiple paraphrases for each answer option and scores the model on the highest-likelihood variant among them.
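
A minimal sketch of that scoring rule, assuming a hypothetical `loglik(question, answer)` callable that returns a model's log-likelihood of an answer string. The function names, the toy scores, and the example question are all illustrative and do not come from the paper; how the paper generates paraphrases or normalizes scores is not described in the abstract.

```python
from typing import Callable, Sequence

# Hypothetical interface: returns log p(answer | question) under one model.
LogLik = Callable[[str, str], float]


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def standard_choice(question: str, options: Sequence[str], loglik: LogLik) -> int:
    """Standard scoring: compare the original phrasing of each option."""
    return argmax([loglik(question, opt) for opt in options])


def paraeval_choice(question: str, variants: Sequence[Sequence[str]], loglik: LogLik) -> int:
    """Best-paraphrase scoring: each option is represented by its
    highest-likelihood variant, then options are compared as usual."""
    return argmax([max(loglik(question, v) for v in vs) for vs in variants])


# Toy stand-in for a model: made-up log-likelihoods, not real measurements.
TOY_LOGLIK = {
    "The city of Paris": -9.0,
    "Paris": -2.0,
    "Lyon": -4.0,
    "The city of Lyon": -8.0,
}


def toy_loglik(question: str, answer: str) -> float:
    return TOY_LOGLIK[answer]


QUESTION = "What is the capital of France?"
OPTIONS = ["The city of Paris", "Lyon"]  # index 0 is the correct option
VARIANTS = [["The city of Paris", "Paris"], ["Lyon", "The city of Lyon"]]

print(standard_choice(QUESTION, OPTIONS, toy_loglik))   # 1 (wrong: penalized phrasing)
print(paraeval_choice(QUESTION, VARIANTS, toy_loglik))  # 0 (correct)
```

In this toy case the model "knows" the answer but assigns low likelihood to the benchmark's particular wording, which is the kind of artifact the framework is meant to remove.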

Load-bearing premise

That selecting the most favorable phrasing among generated paraphrases measures a model's true underlying capability rather than simply its ability to match some surface form in the paraphrase set.

What would settle it

If models trained on identical knowledge continue to show performance gaps larger than 1 point when evaluated with ParaEval, the claim that it isolates true capability would be falsified.
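
The check described above is simple arithmetic on two accuracy figures. The sketch below assumes that "points" means percentage points of accuracy, and every prediction list in it is made up for illustration; only the 1-point threshold comes from the summary.

```python
from typing import Sequence


def accuracy_points(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Accuracy as a percentage (0 to 100)."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equally long")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


def gap_points(preds_a: Sequence[int], preds_b: Sequence[int], gold: Sequence[int]) -> float:
    """Absolute accuracy difference between two models, in points."""
    return abs(accuracy_points(preds_a, gold) - accuracy_points(preds_b, gold))


THRESHOLD = 1.0  # the 1-point bound quoted in the summary

# Made-up predictions for two hypothetical identical-knowledge models.
GOLD = [0, 1, 2, 3, 0, 1, 2, 3]
STD_A = [0, 1, 2, 3, 0, 1, 0, 0]   # 6 of 8 correct under standard scoring
STD_B = [0, 1, 2, 3, 1, 0, 0, 0]   # 4 of 8 correct under standard scoring
PARA_A = [0, 1, 2, 3, 0, 1, 0, 0]  # 6 of 8 correct under best-paraphrase scoring
PARA_B = [0, 1, 2, 3, 0, 1, 3, 2]  # 6 of 8 correct under best-paraphrase scoring

std_gap = gap_points(STD_A, STD_B, GOLD)
para_gap = gap_points(PARA_A, PARA_B, GOLD)
print(std_gap, para_gap)     # 25.0 0.0
print(para_gap > THRESHOLD)  # False: the claim survives on this toy data
```

A gap above the threshold under best-paraphrase scoring, on models known to share the same knowledge, is the outcome that would count against the claim.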

Original abstract

Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact phrasing (surface form) of the answers, conflating a model's familiarity with a specific phrase with its actual capability. We demonstrate this flaw using a controlled testbed of 1B-8B models trained on the same knowledge. Despite having identical knowledge, standard metrics falsely report a performance gap of over 2 points. To solve this, we propose ParaEval, an evaluation framework that queries models using multiple paraphrases per answer option. By scoring each model based on its most favorable phrasing, ParaEval successfully reduces the false performance gap to below 1 point. We confirm that these evaluation artifacts, and the improvements from ParaEval, persist in frontier 70B and 120B open-source models. Ultimately, ParaEval provides a robust and efficient way to evaluate true underlying capability rather than surface-form familiarity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that standard log-likelihood scoring in MCQA benchmarks is highly sensitive to answer phrasing, conflating surface-form familiarity with capability. Using a controlled testbed of 1B-8B models trained on identical knowledge, standard metrics report false gaps exceeding 2 points; ParaEval mitigates this by querying multiple paraphrases per option and scoring each model on its most favorable phrasing, reducing the gap below 1 point. The artifacts and ParaEval improvements are shown to persist in 70B and 120B open-source models.

Significance. If the paraphrase generation process is shown to be independent of the evaluated models' training distributions, ParaEval could meaningfully improve the robustness of MCQA evaluation by isolating knowledge from phrasing effects. The controlled testbed is a positive feature for isolating the variable of interest.

major comments (2)
  1. [Abstract] The paraphrase generation method, diversity controls, and source model are not described. This is load-bearing for the central claim, because the reduction from >2-point to <1-point gaps via argmax over paraphrases can only demonstrate mitigation of phrasing sensitivity (rather than substitution of one surface-form bias for another) if the paraphrase distribution is neutral with respect to the models' pretraining data; without these details the result is not verifiable.
  2. [Abstract] The controlled testbed is described only as 'models trained on the same knowledge.' To support the claim that identical knowledge produces >2-point gaps under standard scoring, the manuscript must specify the training corpus, any differences in model architecture or optimization, and how equivalence of knowledge was verified; absent this, unaccounted confounds cannot be ruled out.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing verifiability. We address each major comment below and will revise the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: [Abstract] The paraphrase generation method, diversity controls, and source model are not described. This is load-bearing for the central claim, because the reduction from >2-point to <1-point gaps via argmax over paraphrases can only demonstrate mitigation of phrasing sensitivity (rather than substitution of one surface-form bias for another) if the paraphrase distribution is neutral with respect to the models' pretraining data; without these details the result is not verifiable.

    Authors: We agree that these details are essential to substantiate the central claim and rule out substitution of one bias for another. The revised manuscript will expand both the abstract and the Methods section to fully describe the paraphrase generation process, including the source model, diversity controls (e.g., similarity thresholds and variation metrics), and evidence or arguments establishing neutrality relative to the evaluated models' pretraining distributions.
    Revision: yes

  2. Referee: [Abstract] The controlled testbed is described only as 'models trained on the same knowledge.' To support the claim that identical knowledge produces >2-point gaps under standard scoring, the manuscript must specify the training corpus, any differences in model architecture or optimization, and how equivalence of knowledge was verified; absent this, unaccounted confounds cannot be ruled out.

    Authors: We agree that the abstract's brevity leaves the testbed underspecified and that explicit details are required to support the claim of identical knowledge. The revised version will specify the training corpus, confirm that models share architecture and optimization (differing only in scale), and describe the verification procedure for knowledge equivalence.
    Revision: yes
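
The first response mentions similarity thresholds as one possible diversity control. As a purely hypothetical illustration of what such a control could look like, the sketch below keeps a candidate paraphrase only if its token-level Jaccard similarity to every already-kept paraphrase stays at or below a cutoff. The metric, the 0.8 cutoff, and the function names are assumptions; nothing here is described in the abstract or claimed to be the authors' method.

```python
from typing import List, Sequence


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between lowercased whitespace-token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def filter_diverse(candidates: Sequence[str], max_similarity: float = 0.8) -> List[str]:
    """Greedy filter: keep a candidate only if it is not too similar
    to any paraphrase that has already been kept."""
    kept: List[str] = []
    for cand in candidates:
        if all(token_jaccard(cand, k) <= max_similarity for k in kept):
            kept.append(cand)
    return kept


CANDIDATES = [
    "the capital is Paris",
    "The capital is Paris",    # differs only in case, so it is dropped
    "France's capital Paris",
    "Paris",
]

print(filter_diverse(CANDIDATES))
# ['the capital is Paris', "France's capital Paris", 'Paris']
```

A filter like this addresses diversity only. It says nothing about whether the surviving paraphrases are neutral with respect to any model's pretraining data, which is the referee's deeper concern.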

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential reductions

Full rationale

The paper contains no equations, derivations, or first-principles claims. Its central results rest on controlled empirical comparisons (identical-knowledge models showing >2-point gaps under standard scoring that shrink under ParaEval). No fitted parameters are renamed as predictions, no self-citations bear load on uniqueness theorems, and no ansatzes are smuggled in. The evaluation framework is self-contained against the reported testbed; any concerns about paraphrase neutrality are questions of validity, not of circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are mentioned in the abstract; the method implicitly depends on the quality and coverage of generated paraphrases, but these are not formalized.


