How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness
Pith reviewed 2026-05-25 04:34 UTC · model grok-4.3
The pith
Selecting benchmark datasets to train on to top a leaderboard corresponds to shift bribery and is NP-hard under Borda count and mean win rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Treating datasets as voters and models as candidates, the problem of choosing datasets to train on so that a target model becomes top-ranked corresponds to shift bribery. This identification shows that the benchmark-specific training problem is NP-hard under Borda count and mean win rate. The paper also defines instance-level robustness as the minimum number of datasets a developer must include and derives expressions for it under arithmetic mean, median, mean win rate, and pairwise majority.
What carries the argument
The direct correspondence between benchmark-specific training and shift bribery, which imports computational hardness results and enables derivation of exact robustness formulas from social choice.
If this is right
- Mean win rate requires a larger fraction of datasets to be gamed than arithmetic mean, median, or pairwise majority on both evaluated suites.
- On the 24-task BBH leaderboard the median robustness under mean win rate reaches 22 tasks while arithmetic mean reaches only 13.
- Median and pairwise majority each require 12 tasks on BBH, roughly half the total.
- The closed-form expressions allow direct computation of robustness without exhaustive search for any fixed number of tasks.
Where Pith is reading between the lines
- Benchmark designers could adopt mean win rate aggregation to increase the number of tasks that must be gamed.
- For leaderboards with hundreds of tasks the NP-hardness result implies that exact minimal manipulation sets cannot be computed efficiently, making the robustness formulas practically useful.
- If training gains on one dataset are correlated with gains on others, the actual number of datasets needed could be lower than the calculated robustness values.
Load-bearing premise
Including a dataset in training allows the developer to move their model to the top rank on that dataset independently of its performance on the remaining datasets.
What would settle it
A polynomial-time algorithm that finds the smallest set of datasets needed to top any ordinal benchmark under Borda count would falsify the NP-hardness claim.
Figures
read the original abstract
Multi-task benchmarks have become a central pillar of machine learning research, yet their growing influence has incentivised benchmark gaming -- strategic actions taken to improve the leaderboard rank of a specific model. Treating datasets as voters and models as candidates, we consider benchmark-specific training -- the inclusion of benchmark data in training -- as a form of election manipulation. For any ordinal benchmark, the problem of choosing datasets to train on so that a target model becomes top-ranked corresponds to shift bribery, a class of manipulation problems from computational social choice. Leveraging this identification, we show that the benchmark-specific training problem is NP-hard under Borda count and mean win rate. Complementing this worst-case perspective, we introduce the instance-level robustness, the minimum number of datasets a model developer must include in training to top a given leaderboard, and derive expressions for it under arithmetic mean, median, mean win rate and pairwise majority. We evaluate these expressions on MMLU under HELM and on BIG-Bench Hard (BBH) under the Open LLM Leaderboard. Across both suites, mean win rate is hardest to manipulate: this gap is clear on BBH (24 tasks, 4507 models), where its median robustness is 22 tasks (92%), compared with 13 (54%) under arithmetic mean and 12 (50%) under median and pairwise majority.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper models benchmark-specific training (including benchmark datasets during model training to improve a target model's leaderboard position) as shift bribery from computational social choice. It establishes that selecting datasets to make a target model top-ranked is NP-hard under Borda count and mean win rate. It derives closed-form expressions for instance-level robustness (the minimum number of datasets that must be included) under arithmetic mean, median, mean win rate, and pairwise majority, then evaluates these on MMLU (HELM) and BBH (Open LLM Leaderboard), concluding that mean win rate is hardest to manipulate.
Significance. If the central identification holds, the work supplies a formal social-choice framework for leaderboard robustness with explicit, parameter-free expressions and concrete empirical numbers (e.g., median robustness of 22 tasks / 92% for mean win rate on the 24-task BBH suite). The empirical comparison across aggregation rules on real leaderboards with thousands of models is a clear strength and yields actionable distinctions between rules.
major comments (2)
- [Abstract] Abstract (and the correspondence section): the claimed exact reduction of benchmark-specific training to shift bribery for any ordinal benchmark requires that training on any chosen dataset permits an arbitrary, independent shift of the target model to first place on that dataset's ranking. No argument is supplied for why training effects are dataset-isolated or controllable to the top rank; cross-task transfer, negative interference, or partial gains would break the mapping and thereby the NP-hardness claims under Borda and mean win rate.
- [Empirical evaluation] Empirical evaluation on BBH (24 tasks, 4507 models): the reported median robustness figures (22 tasks for mean win rate, 13 for arithmetic mean, 12 for median and pairwise majority) are presented as direct outputs of the derived expressions, yet the manuscript supplies no verification that the underlying per-dataset rankings satisfy the ordinal assumption or how ties and missing scores are handled; these details are load-bearing for the comparative claim that mean win rate is hardest to manipulate.
minor comments (2)
- The abstract states the robustness expressions are derived for four rules but does not preview the exact formulas; including one illustrative derivation or table of the expressions would improve readability.
- The paper evaluates only two benchmark suites; a short paragraph on the conditions under which the robustness ordering would generalize to other ordinal leaderboards would strengthen the conclusions.
Simulated Author's Rebuttal
We appreciate the referee's insightful comments and address each major point below. We will revise the manuscript to improve clarity on modeling assumptions and empirical details.
read point-by-point responses
-
Referee: [Abstract] Abstract (and the correspondence section): the claimed exact reduction of benchmark-specific training to shift bribery for any ordinal benchmark requires that training on any chosen dataset permits an arbitrary, independent shift of the target model to first place on that dataset's ranking. No argument is supplied for why training effects are dataset-isolated or controllable to the top rank; cross-task transfer, negative interference, or partial gains would break the mapping and thereby the NP-hardness claims under Borda and mean win rate.
Authors: We thank the referee for this observation. The correspondence to shift bribery is a deliberate modeling abstraction that formalizes the incentive to strategically include datasets for leaderboard improvement, assuming worst-case controllability to first place on included tasks. We agree that real-world training involves cross-task effects that may prevent independent top-rank shifts, and the NP-hardness holds strictly within this modeled setting rather than claiming literal equivalence to all training dynamics. We will revise the abstract and add an explicit discussion of this scope and its limitations in the introduction. revision: yes
-
Referee: [Empirical evaluation] Empirical evaluation on BBH (24 tasks, 4507 models): the reported median robustness figures (22 tasks for mean win rate, 13 for arithmetic mean, 12 for median and pairwise majority) are presented as direct outputs of the derived expressions, yet the manuscript supplies no verification that the underlying per-dataset rankings satisfy the ordinal assumption or how ties and missing scores are handled; these details are load-bearing for the comparative claim that mean win rate is hardest to manipulate.
Authors: We agree that explicit verification of the input data processing is required. The per-task rankings are obtained by sorting models on reported scores for each task, with ties resolved via average rank assignment and models with missing scores on a task excluded from that task's ranking. We will add a dedicated subsection to the empirical evaluation describing these steps, confirming that the processed data satisfies the ordinal input requirements for the robustness expressions, and reporting any sensitivity checks on tie handling. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper establishes a correspondence between benchmark-specific training and shift bribery via a modeling choice (datasets as voters), then invokes the known NP-hardness of shift bribery under Borda and mean win rate from external social choice literature. Robustness expressions are derived directly from the definitions of the aggregation rules (arithmetic mean, median, etc.) without fitted parameters, self-referential predictions, or load-bearing self-citations. The central claims rest on this external reduction and definitional derivations rather than reducing to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Including a dataset in training permits arbitrary improvement of the target model's ranking on that dataset without affecting other datasets.
Reference graph
Works this paper leans on
-
[1]
10 Isaac Chung, Imene Kerboua, Márton Kardos, Roman Solomatin, and Kenneth C
URLhttps://arxiv.org/abs/2603.08371. 10 Isaac Chung, Imene Kerboua, Márton Kardos, Roman Solomatin, and Kenneth C. Enevoldsen. Maintaining MTEB: Towards long term usability and reproducibility of embedding benchmarks. arXiv preprint, abs/2506.21182,
-
[2]
Pierre Colombo, Nathan Noiry, Ekhine Irurozki, and Stephan Clemencon
URLhttps://arxiv.org/abs/2506.21182. Pierre Colombo, Nathan Noiry, Ekhine Irurozki, and Stephan Clemencon. What are the best systems? New perspectives on NLP benchmarking.Advances in Neural Information Processing Systems,
-
[4]
Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan
URL https://arxiv.org/abs/2107.07002. Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, and Arman Cohan. Investigating data contamination in modern benchmarks for large language models. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguis...
-
[5]
BenchBrowser: Retrieving Evidence for Evaluating Benchmark Validity
URL https://arxiv.org/abs/ 2603.18019. Ricardo Dominguez-Olmedo, Florian E. Dorner, and Moritz Hardt. Training on the test task confounds evaluation and emergence. InThe Thirteenth International Conference on Learning Representations (ICLR 2025),
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, Bin Gu, Mengfei Yang, and Ge Li. Generalization or memorization: Data contamination and trustworthy evaluation for large language models. In Lun- Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 12039–12050. Association for Computation...
work page 2024
-
[7]
Utility is in the eye of the user: A critique of NLP leaderboards
Kawin Ethayarajh and Dan Jurafsky. Utility is in the eye of the user: A critique of NLP leaderboards. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4846–4853, 01
work page 2020
-
[8]
URLhttps://arxiv.org/abs/2510.23191. Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, ...
-
[9]
Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai
URLhttps://zenodo.org/records/10256836. Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, and Asaf Yehudai. JuStRank: Benchmarking LLM judges for system ranking. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (...
-
[10]
URL https://arxiv. org/abs/2602.07593. Moritz Hardt. The emerging science of machine learning benchmarks. Online at https: //mlbenchmarks.org,
-
[11]
Moritz Hardt, Nimrod Megiddo, Christos Papadimitriou, and Mary Wootters. Strategic classification. InProceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ITCS ’16, page 111–122. Association for Computing Machinery,
work page 2016
-
[12]
Towards more robust NLP system evaluation: Handling missing scores in benchmarks
Anas Himmi, Ekhine Irurozki, Nathan Noiry, Stephan Clémençon, and Pierre Colombo. Towards more robust NLP system evaluation: Handling missing scores in benchmarks. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 11759–11785, 01
work page 2024
-
[13]
Sayash Kapoor and Arvind Narayanan
URL https: //arxiv.org/abs/2401.06059. Sayash Kapoor and Arvind Narayanan. Leakage and the reproducibility crisis in ML-based science,
- [14]
-
[15]
Dynabench: Rethinking benchmarking in NLP
Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, Zhiyi Ma, Tristan Thrush, Sebastian Riedel, Zeerak Waseem, Pontus Stenetorp, Robin Jia, Mohit Bansal, Christopher Potts, and Adina Williams. Dynabench: Rethinking benchmarking in NLP. InProceedings of the 202...
work page 2021
-
[17]
URLhttps://arxiv.org/abs/2312.03121. Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, and et al. Holistic evaluation of language models.Transactions on Machine Learning Research,
-
[18]
Christian List. Social Choice Theory. In Edward N. Zalta and Uri Nodelman, editors,The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Winter 2022 edition,
work page 2022
-
[19]
A survey on large language model benchmarks, 2025a
Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, and Min Yang. A survey on large language model benchmarks, 2025a. URLhttps://arxiv.org/abs/2508.15361. Shiwen Ni, Xiangtao Kong, Chengming Li, Xiping Hu, Ruifeng Xu, Jia Zhu, and Min Yang. Training ...
-
[20]
URLhttps://arxiv.org/abs/2310.10628. 13 Mark Rofin, Vladislav Mikhailov, Mikhail Florinsky, Andrey Kravchenko, Tatiana Shavrina, Elena Tutubalina, Daniel Karabekyan, and Ekaterina Artemova. V ote’n’rank: Revision of benchmarking with social choice theory. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Lin...
-
[21]
NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark
Oscar Sainz, Jon Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, and Eneko Agirre. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787. Association for Co...
work page 2023
-
[22]
On elections with robust winners
Dmitry Shiryaev, Lan Yu, and Edith Elkind. On elections with robust winners. InProceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, AAMAS ’13, page 415–422. International Foundation for Autonomous Agents and Multiagent Systems,
work page 2013
-
[23]
URLhttps://arxiv.org/abs/2504.20879. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, and et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on Machine Learning Research (TMLR),
-
[24]
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG- Bench tasks and whether Chain-of-Thought can solve them. InFindings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051,
work page 2023
-
[25]
GLUE: A multi-task benchmark and analysis platform for natural language understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. InProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, 01
work page 2018
-
[26]
Guanhua Zhang and Moritz Hardt
URL https: //arxiv.org/abs/2311.04850. Guanhua Zhang and Moritz Hardt. Inherent trade-offs between diversity and stability in multi-task benchmarks. InProceedings of the 41st International Conference on Machine Learning, volume 235, pages 58984–59002,
-
[27]
URLhttps://arxiv.org/abs/2311.01964. 14 A Appendix A.1 Proofs Proposition A.1.For any monotone ordinal benchmark operator ˜B, the BST problem is exactly the shift bribery problem with all-or-nothing prices under the identification of datasets D with voters V , models A with candidates C, the model A1 with the preferred candidate p and the operator ˜B with...
-
[28]
and its public BIG-Bench-Hard repository is distributed under MIT; the BBH leaderboard scores are credited to the Hugging Face Open LLM Leaderboard and were produced with the EleutherAI Evaluation Harness [Gao et al., 2023], which is distributed under MIT. The Open LLM Leaderboard results are public Hugging Face Hub datasets, so we use them under the Hugg...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.