Benchmark-specific training maps to shift bribery and is NP-hard under Borda and mean win rate; mean win rate has the highest instance-level robustness (median 22 tasks on BBH) among tested aggregation rules.
The Open LLM Leaderboard results are public Hugging Face Hub datasets; we use them under the Hugging Face Hub Terms of Service and rely only on aggregate public leaderboard scores.
1 Pith paper cites this work.
How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness