The Open LLM Leaderboard results are public Hugging Face Hub datasets, so we use them under the Hugging Face Hub Terms of Service and use only aggregate public leaderboard scores.

the BBH leaderboard scores are credited to the Hugging Face Open LLM Leaderboard and were produced with the EleutherAI Evaluation Harness [Gao et al., 2023

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Benchmark-specific training maps to shift bribery and is NP-hard under Borda and mean win rate; mean win rate has the highest instance-level robustness (median 22 tasks on BBH) among tested aggregation rules.

citing papers explorer

Showing 1 of 1 citing paper.

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

cs.LG · 2026-05-22 · unverdicted · none · ref 28
