ParaEval reduces false performance gaps in MCQA benchmarks from over 2 points to below 1 point by scoring models on multiple paraphrases per answer option instead of single surface forms.
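A minimal sketch of what scoring on multiple paraphrases per answer option could look like. The summary above does not say how ParaEval combines scores across paraphrases, so the averaging step here is an assumption, as are the function names and the use of per-option log-likelihood scores; none of them come from the paper.

```python
from statistics import mean

def predict_single_form(option_scores):
    """Baseline: pick the option using only its first surface form.

    option_scores maps an option label to a list of model scores
    (hypothetically log-likelihoods), one per paraphrase of that option.
    """
    first_scores = {label: scores[0] for label, scores in option_scores.items()}
    return max(first_scores, key=first_scores.get)

def predict_paraphrase_aware(option_scores):
    """Pick the option whose paraphrases score best on average.

    Averaging is an assumed aggregation rule, chosen only for illustration.
    """
    averaged = {label: mean(scores) for label, scores in option_scores.items()}
    return max(averaged, key=averaged.get)

# Made-up scores: option A is penalized by one unlucky phrasing.
scores = {
    "A": [-2.0, -1.0, -1.2],
    "B": [-1.5, -2.5, -2.8],
}
single_choice = predict_single_form(scores)
paraphrase_choice = predict_paraphrase_aware(scores)
```

In this toy case the single-form baseline and the paraphrase-averaged prediction disagree, which is the kind of phrasing-driven difference the summary attributes to single surface forms.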
Roparq: Paraphrase-aware alignment of large language models towards robustness to paraphrased questions. arXiv preprint arXiv:2511.21568, 2024. https://arxiv.org/abs/2511.21568
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval