How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

Christoph Jansen; Frauke Kreuter; Georg Schollmeyer; Polina Gordienko

arxiv: 2605.23628 · v1 · pith:7BGL4DYGnew · submitted 2026-05-22 · 💻 cs.LG

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

Polina Gordienko , Georg Schollmeyer , Frauke Kreuter , Christoph Jansen This is my paper

Pith reviewed 2026-05-25 04:34 UTC · model grok-4.3

classification 💻 cs.LG

keywords benchmark manipulationshift briberyleaderboard robustnesscomputational social choiceNP-hardnessmean win rateMMLUBIG-Bench Hard

0 comments

The pith

Selecting benchmark datasets to train on to top a leaderboard corresponds to shift bribery and is NP-hard under Borda count and mean win rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper maps the strategic choice of which benchmark datasets to include in training, so that a target model reaches first place on the leaderboard, onto the shift bribery problem from social choice. It uses this mapping to prove that the minimal selection problem is NP-hard when rankings are computed via Borda count or mean win rate. The authors also define instance-level robustness as the smallest number of datasets that must be included to guarantee top rank and supply closed-form expressions for this quantity under four common aggregation rules. Evaluation on MMLU and BIG-Bench Hard shows that mean win rate consistently demands the largest share of datasets to be manipulated.

Core claim

Treating datasets as voters and models as candidates, the problem of choosing datasets to train on so that a target model becomes top-ranked corresponds to shift bribery. This identification shows that the benchmark-specific training problem is NP-hard under Borda count and mean win rate. The paper also defines instance-level robustness as the minimum number of datasets a developer must include and derives expressions for it under arithmetic mean, median, mean win rate, and pairwise majority.

What carries the argument

The direct correspondence between benchmark-specific training and shift bribery, which imports computational hardness results and enables derivation of exact robustness formulas from social choice.

If this is right

Mean win rate requires a larger fraction of datasets to be gamed than arithmetic mean, median, or pairwise majority on both evaluated suites.
On the 24-task BBH leaderboard the median robustness under mean win rate reaches 22 tasks while arithmetic mean reaches only 13.
Median and pairwise majority each require 12 tasks on BBH, roughly half the total.
The closed-form expressions allow direct computation of robustness without exhaustive search for any fixed number of tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmark designers could adopt mean win rate aggregation to increase the number of tasks that must be gamed.
For leaderboards with hundreds of tasks the NP-hardness result implies that exact minimal manipulation sets cannot be computed efficiently, making the robustness formulas practically useful.
If training gains on one dataset are correlated with gains on others, the actual number of datasets needed could be lower than the calculated robustness values.

Load-bearing premise

Including a dataset in training allows the developer to move their model to the top rank on that dataset independently of its performance on the remaining datasets.

What would settle it

A polynomial-time algorithm that finds the smallest set of datasets needed to top any ordinal benchmark under Borda count would falsify the NP-hardness claim.

Figures

Figures reproduced from arXiv: 2605.23628 by Christoph Jansen, Frauke Kreuter, Georg Schollmeyer, Polina Gordienko.

**Figure 2.** Figure 2: BBH sensitivity analysis for related uploads from the same Hugging Face namespace. [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗

read the original abstract

Multi-task benchmarks have become a central pillar of machine learning research, yet their growing influence has incentivised benchmark gaming -- strategic actions taken to improve the leaderboard rank of a specific model. Treating datasets as voters and models as candidates, we consider benchmark-specific training -- the inclusion of benchmark data in training -- as a form of election manipulation. For any ordinal benchmark, the problem of choosing datasets to train on so that a target model becomes top-ranked corresponds to shift bribery, a class of manipulation problems from computational social choice. Leveraging this identification, we show that the benchmark-specific training problem is NP-hard under Borda count and mean win rate. Complementing this worst-case perspective, we introduce the instance-level robustness, the minimum number of datasets a model developer must include in training to top a given leaderboard, and derive expressions for it under arithmetic mean, median, mean win rate and pairwise majority. We evaluate these expressions on MMLU under HELM and on BIG-Bench Hard (BBH) under the Open LLM Leaderboard. Across both suites, mean win rate is hardest to manipulate: this gap is clear on BBH (24 tasks, 4507 models), where its median robustness is 22 tasks (92%), compared with 13 (54%) under arithmetic mean and 12 (50%) under median and pairwise majority.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reduces benchmark gaming via dataset selection to shift bribery, with NP-hardness for Borda and mean win rate plus closed-form robustness for four rules, but the independent arbitrary-shift assumption needs checking.

read the letter

The main thing to know is that this work treats datasets as voters and models as candidates, maps benchmark-specific training directly to shift bribery, proves NP-hardness under Borda count and mean win rate, and supplies exact expressions for the minimum datasets needed to reach top rank under arithmetic mean, median, mean win rate, and pairwise majority. On BBH the numbers show mean win rate requiring far more datasets (median 22 out of 24) than the other three rules.

Referee Report

2 major / 2 minor

Summary. The paper models benchmark-specific training (including benchmark datasets during model training to improve a target model's leaderboard position) as shift bribery from computational social choice. It establishes that selecting datasets to make a target model top-ranked is NP-hard under Borda count and mean win rate. It derives closed-form expressions for instance-level robustness (the minimum number of datasets that must be included) under arithmetic mean, median, mean win rate, and pairwise majority, then evaluates these on MMLU (HELM) and BBH (Open LLM Leaderboard), concluding that mean win rate is hardest to manipulate.

Significance. If the central identification holds, the work supplies a formal social-choice framework for leaderboard robustness with explicit, parameter-free expressions and concrete empirical numbers (e.g., median robustness of 22 tasks / 92% for mean win rate on the 24-task BBH suite). The empirical comparison across aggregation rules on real leaderboards with thousands of models is a clear strength and yields actionable distinctions between rules.

major comments (2)

[Abstract] Abstract (and the correspondence section): the claimed exact reduction of benchmark-specific training to shift bribery for any ordinal benchmark requires that training on any chosen dataset permits an arbitrary, independent shift of the target model to first place on that dataset's ranking. No argument is supplied for why training effects are dataset-isolated or controllable to the top rank; cross-task transfer, negative interference, or partial gains would break the mapping and thereby the NP-hardness claims under Borda and mean win rate.
[Empirical evaluation] Empirical evaluation on BBH (24 tasks, 4507 models): the reported median robustness figures (22 tasks for mean win rate, 13 for arithmetic mean, 12 for median and pairwise majority) are presented as direct outputs of the derived expressions, yet the manuscript supplies no verification that the underlying per-dataset rankings satisfy the ordinal assumption or how ties and missing scores are handled; these details are load-bearing for the comparative claim that mean win rate is hardest to manipulate.

minor comments (2)

The abstract states the robustness expressions are derived for four rules but does not preview the exact formulas; including one illustrative derivation or table of the expressions would improve readability.
The paper evaluates only two benchmark suites; a short paragraph on the conditions under which the robustness ordering would generalize to other ordinal leaderboards would strengthen the conclusions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's insightful comments and address each major point below. We will revise the manuscript to improve clarity on modeling assumptions and empirical details.

read point-by-point responses

Referee: [Abstract] Abstract (and the correspondence section): the claimed exact reduction of benchmark-specific training to shift bribery for any ordinal benchmark requires that training on any chosen dataset permits an arbitrary, independent shift of the target model to first place on that dataset's ranking. No argument is supplied for why training effects are dataset-isolated or controllable to the top rank; cross-task transfer, negative interference, or partial gains would break the mapping and thereby the NP-hardness claims under Borda and mean win rate.

Authors: We thank the referee for this observation. The correspondence to shift bribery is a deliberate modeling abstraction that formalizes the incentive to strategically include datasets for leaderboard improvement, assuming worst-case controllability to first place on included tasks. We agree that real-world training involves cross-task effects that may prevent independent top-rank shifts, and the NP-hardness holds strictly within this modeled setting rather than claiming literal equivalence to all training dynamics. We will revise the abstract and add an explicit discussion of this scope and its limitations in the introduction. revision: yes
Referee: [Empirical evaluation] Empirical evaluation on BBH (24 tasks, 4507 models): the reported median robustness figures (22 tasks for mean win rate, 13 for arithmetic mean, 12 for median and pairwise majority) are presented as direct outputs of the derived expressions, yet the manuscript supplies no verification that the underlying per-dataset rankings satisfy the ordinal assumption or how ties and missing scores are handled; these details are load-bearing for the comparative claim that mean win rate is hardest to manipulate.

Authors: We agree that explicit verification of the input data processing is required. The per-task rankings are obtained by sorting models on reported scores for each task, with ties resolved via average rank assignment and models with missing scores on a task excluded from that task's ranking. We will add a dedicated subsection to the empirical evaluation describing these steps, confirming that the processed data satisfies the ordinal input requirements for the robustness expressions, and reporting any sensitivity checks on tie handling. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper establishes a correspondence between benchmark-specific training and shift bribery via a modeling choice (datasets as voters), then invokes the known NP-hardness of shift bribery under Borda and mean win rate from external social choice literature. Robustness expressions are derived directly from the definitions of the aggregation rules (arithmetic mean, median, etc.) without fitted parameters, self-referential predictions, or load-bearing self-citations. The central claims rest on this external reduction and definitional derivations rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that benchmark datasets function as independent voters whose rankings can be shifted independently via targeted training.

axioms (1)

domain assumption Including a dataset in training permits arbitrary improvement of the target model's ranking on that dataset without affecting other datasets.
This assumption enables the direct correspondence to shift bribery and the derivation of the minimum-bribery expressions.

pith-pipeline@v0.9.0 · 5781 in / 1195 out tokens · 21304 ms · 2026-05-25T04:34:17.515455+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

[1]

10 Isaac Chung, Imene Kerboua, Márton Kardos, Roman Solomatin, and Kenneth C

URLhttps://arxiv.org/abs/2603.08371. 10 Isaac Chung, Imene Kerboua, Márton Kardos, Roman Solomatin, and Kenneth C. Enevoldsen. Maintaining MTEB: Towards long term usability and reproducibility of embedding benchmarks. arXiv preprint, abs/2506.21182,

work page arXiv
[2]

Pierre Colombo, Nathan Noiry, Ekhine Irurozki, and Stephan Clemencon

URLhttps://arxiv.org/abs/2506.21182. Pierre Colombo, Nathan Noiry, Ekhine Irurozki, and Stephan Clemencon. What are the best systems? New perspectives on NLP benchmarking.Advances in Neural Information Processing Systems,

work page arXiv
[4]

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan

URL https://arxiv.org/abs/2107.07002. Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguis...

work page arXiv 2024
[5]

BenchBrowser: Retrieving Evidence for Evaluating Benchmark Validity

URL https://arxiv.org/abs/ 2603.18019. Ricardo Dominguez-Olmedo, Florian E. Dorner, and Moritz Hardt. Training on the test task confounds evaluation and emergence. InThe Thirteenth International Conference on Learning Representations (ICLR 2025),

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Generalization or memorization: Data contamination and trustworthy evaluation for large language models

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. Generalization or memorization: Data contamination and trustworthy evaluation for large language models. In Lun- Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 12039–12050. Association for Computation...

work page 2024
[7]

Utility is in the eye of the user: A critique of NLP leaderboards

Kawin Ethayarajh and Dan Jurafsky. Utility is in the eye of the user: A critique of NLP leaderboards. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4846–4853, 01

work page 2020
[8]

URLhttps://arxiv.org/abs/2510.23191. Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, ...

work page arXiv
[9]

Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai

URLhttps://zenodo.org/records/10256836. Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai. JuStRank: Benchmarking LLM judges for system ranking. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (...

work page arXiv
[10]

org/abs/2602.07593

URL https://arxiv. org/abs/2602.07593. Moritz Hardt. The emerging science of machine learning benchmarks. Online at https: //mlbenchmarks.org,

work page arXiv
[11]

Strategic classification

Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. InProceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ITCS ’16, page 111–122. Association for Computing Machinery,

work page 2016
[12]

Towards more robust NLP system evaluation: Handling missing scores in benchmarks

Anas Himmi, Ekhine Irurozki, Nathan Noiry, Stephan Clémençon, and Pierre Colombo. Towards more robust NLP system evaluation: Handling missing scores in benchmarks. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 11759–11785, 01

work page 2024
[13]

Sayash Kapoor and Arvind Narayanan

URL https: //arxiv.org/abs/2401.06059. Sayash Kapoor and Arvind Narayanan. Leakage and the reproducibility crisis in ML-based science,

work page arXiv
[14]

Richard M

URLhttps://arxiv.org/abs/2207.07048. Richard M. Karp.Reducibility among Combinatorial Problems, pages 85–103. Springer US, Boston, MA,

work page arXiv
[15]

Dynabench: Rethinking benchmarking in NLP

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP. InProceedings of the 202...

work page 2021
[17]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, and et al

URLhttps://arxiv.org/abs/2312.03121. Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, and et al. Holistic evaluation of language models.Transactions on Machine Learning Research,

work page arXiv
[18]

Social Choice Theory

Christian List. Social Choice Theory. In Edward N. Zalta and Uri Nodelman, editors,The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Winter 2022 edition,

work page 2022
[19]

A survey on large language model benchmarks, 2025a

Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, and Min Yang. A survey on large language model benchmarks, 2025a. URLhttps://arxiv.org/abs/2508.15361. Shiwen Ni, Xiangtao Kong, Chengming Li, Xiping Hu, Ruifeng Xu, Jia Zhu, and Min Yang. Training ...

work page arXiv
[20]

13 Mark Rofin, Vladislav Mikhailov, Mikhail Florinsky, Andrey Kravchenko, Tatiana Shavrina, Elena Tutubalina, Daniel Karabekyan, and Ekaterina Artemova

URLhttps://arxiv.org/abs/2310.10628. 13 Mark Rofin, Vladislav Mikhailov, Mikhail Florinsky, Andrey Kravchenko, Tatiana Shavrina, Elena Tutubalina, Daniel Karabekyan, and Ekaterina Artemova. V ote’n’rank: Revision of benchmarking with social choice theory. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Lin...

work page arXiv
[21]

NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark

Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787. Association for Co...

work page 2023
[22]

On elections with robust winners

Dmitry Shiryaev, Lan Yu, and Edith Elkind. On elections with robust winners. InProceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS ’13, page 415–422. International Foundation for Autonomous Agents and Multiagent Systems,

work page 2013
[23]

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R

URLhttps://arxiv.org/abs/2504.20879. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, and et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on Machine Learning Research (TMLR),

work page arXiv
[24]

Le, Ed H

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG- Bench tasks and whether Chain-of-Thought can solve them. InFindings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051,

work page 2023
[25]

GLUE: A multi-task benchmark and analysis platform for natural language understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 01

work page 2018
[26]

Guanhua Zhang and Moritz Hardt

URL https: //arxiv.org/abs/2311.04850. Guanhua Zhang and Moritz Hardt. Inherent trade-offs between diversity and stability in multi-task benchmarks. InProceedings of the 41st International Conference on Machine Learning, volume 235, pages 58984–59002,

work page arXiv
[27]

URLhttps://arxiv.org/abs/2311.01964. 14 A Appendix A.1 Proofs Proposition A.1.For any monotone ordinal benchmark operator ˜B, the BST problem is exactly the shift bribery problem with all-or-nothing prices under the identification of datasets D with voters V , models A with candidates C, the model A1 with the preferred candidate p and the operator ˜B with...

work page arXiv 2016
[28]

The Open LLM Leaderboard results are public Hugging Face Hub datasets, so we use them under the Hugging Face Hub Terms of Service and use only aggregate public leaderboard scores

and its public BIG-Bench-Hard repository is distributed under MIT; the BBH leaderboard scores are credited to the Hugging Face Open LLM Leaderboard and were produced with the EleutherAI Evaluation Harness [Gao et al., 2023], which is distributed under MIT. The Open LLM Leaderboard results are public Hugging Face Hub datasets, so we use them under the Hugg...

work page 2023

[1] [1]

10 Isaac Chung, Imene Kerboua, Márton Kardos, Roman Solomatin, and Kenneth C

URLhttps://arxiv.org/abs/2603.08371. 10 Isaac Chung, Imene Kerboua, Márton Kardos, Roman Solomatin, and Kenneth C. Enevoldsen. Maintaining MTEB: Towards long term usability and reproducibility of embedding benchmarks. arXiv preprint, abs/2506.21182,

work page arXiv

[2] [2]

Pierre Colombo, Nathan Noiry, Ekhine Irurozki, and Stephan Clemencon

URLhttps://arxiv.org/abs/2506.21182. Pierre Colombo, Nathan Noiry, Ekhine Irurozki, and Stephan Clemencon. What are the best systems? New perspectives on NLP benchmarking.Advances in Neural Information Processing Systems,

work page arXiv

[3] [4]

Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan

URL https://arxiv.org/abs/2107.07002. Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguis...

work page arXiv 2024

[4] [5]

BenchBrowser: Retrieving Evidence for Evaluating Benchmark Validity

URL https://arxiv.org/abs/ 2603.18019. Ricardo Dominguez-Olmedo, Florian E. Dorner, and Moritz Hardt. Training on the test task confounds evaluation and emergence. InThe Thirteenth International Conference on Learning Representations (ICLR 2025),

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [6]

Generalization or memorization: Data contamination and trustworthy evaluation for large language models

Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. Generalization or memorization: Data contamination and trustworthy evaluation for large language models. In Lun- Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 12039–12050. Association for Computation...

work page 2024

[6] [7]

Utility is in the eye of the user: A critique of NLP leaderboards

Kawin Ethayarajh and Dan Jurafsky. Utility is in the eye of the user: A critique of NLP leaderboards. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4846–4853, 01

work page 2020

[7] [8]

URLhttps://arxiv.org/abs/2510.23191. Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, ...

work page arXiv

[8] [9]

Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai

URLhttps://zenodo.org/records/10256836. Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai. JuStRank: Benchmarking LLM judges for system ranking. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (...

work page arXiv

[9] [10]

org/abs/2602.07593

URL https://arxiv. org/abs/2602.07593. Moritz Hardt. The emerging science of machine learning benchmarks. Online at https: //mlbenchmarks.org,

work page arXiv

[10] [11]

Strategic classification

Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. InProceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ITCS ’16, page 111–122. Association for Computing Machinery,

work page 2016

[11] [12]

Towards more robust NLP system evaluation: Handling missing scores in benchmarks

Anas Himmi, Ekhine Irurozki, Nathan Noiry, Stephan Clémençon, and Pierre Colombo. Towards more robust NLP system evaluation: Handling missing scores in benchmarks. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 11759–11785, 01

work page 2024

[12] [13]

Sayash Kapoor and Arvind Narayanan

URL https: //arxiv.org/abs/2401.06059. Sayash Kapoor and Arvind Narayanan. Leakage and the reproducibility crisis in ML-based science,

work page arXiv

[13] [14]

Richard M

URLhttps://arxiv.org/abs/2207.07048. Richard M. Karp.Reducibility among Combinatorial Problems, pages 85–103. Springer US, Boston, MA,

work page arXiv

[14] [15]

Dynabench: Rethinking benchmarking in NLP

Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP. InProceedings of the 202...

work page 2021

[15] [17]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, and et al

URLhttps://arxiv.org/abs/2312.03121. Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, and et al. Holistic evaluation of language models.Transactions on Machine Learning Research,

work page arXiv

[16] [18]

Social Choice Theory

Christian List. Social Choice Theory. In Edward N. Zalta and Uri Nodelman, editors,The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Winter 2022 edition,

work page 2022

[17] [19]

A survey on large language model benchmarks, 2025a

Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, and Min Yang. A survey on large language model benchmarks, 2025a. URLhttps://arxiv.org/abs/2508.15361. Shiwen Ni, Xiangtao Kong, Chengming Li, Xiping Hu, Ruifeng Xu, Jia Zhu, and Min Yang. Training ...

work page arXiv

[18] [20]

13 Mark Rofin, Vladislav Mikhailov, Mikhail Florinsky, Andrey Kravchenko, Tatiana Shavrina, Elena Tutubalina, Daniel Karabekyan, and Ekaterina Artemova

URLhttps://arxiv.org/abs/2310.10628. 13 Mark Rofin, Vladislav Mikhailov, Mikhail Florinsky, Andrey Kravchenko, Tatiana Shavrina, Elena Tutubalina, Daniel Karabekyan, and Ekaterina Artemova. V ote’n’rank: Revision of benchmarking with social choice theory. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Lin...

work page arXiv

[19] [21]

NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark

Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787. Association for Co...

work page 2023

[20] [22]

On elections with robust winners

Dmitry Shiryaev, Lan Yu, and Edith Elkind. On elections with robust winners. InProceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS ’13, page 415–422. International Foundation for Autonomous Agents and Multiagent Systems,

work page 2013

[21] [23]

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R

URLhttps://arxiv.org/abs/2504.20879. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, and et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on Machine Learning Research (TMLR),

work page arXiv

[22] [24]

Le, Ed H

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG- Bench tasks and whether Chain-of-Thought can solve them. InFindings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051,

work page 2023

[23] [25]

GLUE: A multi-task benchmark and analysis platform for natural language understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 01

work page 2018

[24] [26]

Guanhua Zhang and Moritz Hardt

URL https: //arxiv.org/abs/2311.04850. Guanhua Zhang and Moritz Hardt. Inherent trade-offs between diversity and stability in multi-task benchmarks. InProceedings of the 41st International Conference on Machine Learning, volume 235, pages 58984–59002,

work page arXiv

[25] [27]

URLhttps://arxiv.org/abs/2311.01964. 14 A Appendix A.1 Proofs Proposition A.1.For any monotone ordinal benchmark operator ˜B, the BST problem is exactly the shift bribery problem with all-or-nothing prices under the identification of datasets D with voters V , models A with candidates C, the model A1 with the preferred candidate p and the operator ˜B with...

work page arXiv 2016

[26] [28]

The Open LLM Leaderboard results are public Hugging Face Hub datasets, so we use them under the Hugging Face Hub Terms of Service and use only aggregate public leaderboard scores

and its public BIG-Bench-Hard repository is distributed under MIT; the BBH leaderboard scores are credited to the Hugging Face Open LLM Leaderboard and were produced with the EleutherAI Evaluation Harness [Gao et al., 2023], which is distributed under MIT. The Open LLM Leaderboard results are public Hugging Face Hub datasets, so we use them under the Hugg...

work page 2023