Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

Bitya Neuhof; Yuval Benjamini

arxiv: 2606.08679 · v1 · pith:IIAPMSPHnew · submitted 2026-06-07 · 📊 stat.ML · cs.CL · cs.LG · stat.ME

Pith reviewed 2026-06-27 17:53 UTC · model grok-4.3

keywords: rank intervals · leaderboards · conformal prediction · model evaluation · uncertainty quantification · hierarchical framework · multi-task benchmarks · statistical guarantees

The pith

A hierarchical framework yields statistically valid rank intervals for models on multi-task leaderboards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to build task-level rank confidence intervals from pairwise model comparisons and then aggregates them into leaderboard-level rank prediction intervals via conformal prediction. This tackles the problem that standard leaderboard rankings ignore task-to-task performance variation and lack uncertainty measures. A reader would care because point rankings on benchmarks like MMLU can mislead when models vary across tasks or when evaluating suitability for new tasks. The framework supplies coverage guarantees for both seen tasks and unseen ones. If the claims hold, practitioners gain tools to rank models while accounting for variability rather than relying on single scores.

Core claim

The paper establishes a hierarchical framework that first derives task-level rank confidence intervals through pairwise comparisons of model performances, then applies a conformal prediction procedure to these intervals to obtain leaderboard-level rank prediction intervals that maintain statistical validity for both observed tasks and new tasks.
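The first stage can be illustrated with a minimal sketch. The runtime notes reproduced further down this page mention paired t-tests between all model pairs and step-down FWER control over the M−1 comparisons per model; everything else below is an assumption made for illustration: the function names `holm_reject` and `task_rank_cis`, the choice of Holm's procedure as the step-down rule, and the rule that a model's lower rank bound is one plus the number of models found significantly better than it. The paper's own Algorithm 1 may differ.

```python
import numpy as np
from scipy import stats


def holm_reject(pvals, alpha):
    """Holm step-down procedure: boolean mask of rejected hypotheses."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    reject = np.zeros(m, dtype=bool)
    for step, idx in enumerate(np.argsort(pvals)):
        if pvals[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # step-down stops at the first non-rejection
    return reject


def task_rank_cis(scores, alpha=0.05):
    """Hypothetical helper: marginal rank CIs for one task.

    scores: (n_examples, M) array of per-example scores, higher is better.
    Returns an (M, 2) integer array of [lower, upper] rank bounds (rank 1 = best).
    """
    _, n_models = scores.shape
    cis = np.empty((n_models, 2), dtype=int)
    for j in range(n_models):
        others = [k for k in range(n_models) if k != j]
        # Paired t-test of every other model against model j.
        pvals = [stats.ttest_rel(scores[:, k], scores[:, j]).pvalue for k in others]
        signs = np.array([np.sign((scores[:, k] - scores[:, j]).mean()) for k in others])
        reject = holm_reject(pvals, alpha)
        better = int(np.sum(reject & (signs > 0)))  # significantly above model j
        worse = int(np.sum(reject & (signs < 0)))   # significantly below model j
        cis[j] = (1 + better, n_models - worse)
    return cis
```

With well-separated models every interval collapses to a single rank; with noisy or tied models the intervals widen toward `[1, M]`, which is the behaviour a rank CI is meant to expose.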

What carries the argument

The hierarchical framework that combines pairwise-comparison task-level rank intervals with conformal prediction for leaderboard-level intervals.

If this is right

Task-level intervals quantify rank uncertainty for each observed task with statistical guarantees.
Leaderboard intervals provide coverage for aggregate ranks while incorporating cross-task variability.
The same procedure supplies prediction intervals usable on new tasks not seen during evaluation.
Experiments confirm validity and informativeness on simulated data plus TabArena and PromptEval benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Leaderboards could display intervals rather than single ranks, altering how model superiority is communicated.
The method could extend to other multi-task evaluation settings where variability across domains matters.
Interval width might serve as a signal for selecting which additional tasks to evaluate next.

Load-bearing premise

The conformal prediction step applied to aggregated task-level rank intervals produces valid prediction intervals for leaderboard ranks on both observed and new tasks.

What would settle it

An experiment on held-out tasks where the constructed leaderboard rank intervals contain the true ranks at a rate below the nominal coverage level, such as below 90 percent for nominal 90 percent intervals.
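Such a check is easy to script. The sketch below assumes the held-out true ranks and the interval endpoints are already available as arrays; the function names `empirical_coverage` and `falls_short`, and the one-sided normal-approximation margin, are illustrative choices and not taken from the paper.

```python
import numpy as np


def empirical_coverage(lower, upper, true_ranks):
    """Fraction of held-out cases whose true rank lies inside [lower, upper]."""
    lower, upper, true_ranks = map(np.asarray, (lower, upper, true_ranks))
    covered = (lower <= true_ranks) & (true_ranks <= upper)
    return float(covered.mean())


def falls_short(coverage, nominal, n_trials, z=1.645):
    """True if coverage is below nominal by more than sampling noise.

    Uses a one-sided normal approximation to the binomial (an assumption;
    an exact binomial test would also work).
    """
    se = np.sqrt(nominal * (1 - nominal) / n_trials)
    return bool(coverage < nominal - z * se)


cov = empirical_coverage([1, 2, 3, 1], [2, 4, 5, 3], [2, 5, 3, 1])
print(cov, falls_short(cov, nominal=0.9, n_trials=4))
```

With only four held-out cases a coverage of 0.75 is not evidence against a nominal 90 percent level; the same shortfall over a thousand cases would be.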

Figures

Figures reproduced from arXiv: 2606.08679 by Bitya Neuhof, Yuval Benjamini.

**Figure 2.** Elo scores with 95% CIs (left), and leaderboard-level PIs (right) for all models. Narrow ...

**Figure 3.** Dataset-level rank CIs for two models that share identical leaderboard-level PIs. While ...

**Figure 4.** Examples of subject-level accuracy distribution and rank CIs. The intervals for high school ...

**Figure 5.** Leaderboard-level PIs for 47 subjects and rank CIs of unseen subjects. For models other ...

**Figure 6.** Runtime of the quantile method for leaderboard-level rank PIs increases linearly with ...

**Figure 7.** Leaderboard-level rank prediction intervals for 18 tree-based models on 13 regression ...

Original abstract

Pretrained models are often evaluated on multi-task leaderboards to measure their applicability in diverse contexts. However, current methods for aggregating performance across tasks into leaderboard-level rankings do not address the uncertainty and variability at the task level. While recent works have proposed interval-based model rankings, the principled aggregation of uncertainty from individual tasks to leaderboard-level rankings remains unaddressed, and variation in models' performance across tasks is frequently obscured. In this work, we introduce a hierarchical framework that constructs model rank intervals with statistical guarantees at both levels: task-level rank confidence intervals from pairwise comparisons, and leaderboard-level rank prediction intervals using a conformal approach. This enables reliable quantification of model rank for each observed task and for new potential tasks. Experiments on simulated data and the TabArena and PromptEval (MMLU) benchmarks show that our method yields statistically valid and informative intervals, enabling reliable, uncertainty-aware model ranking on leaderboards.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The hierarchical conformal rank intervals are a practical extension for leaderboard uncertainty, but the validity claim for new tasks depends on an exchangeability assumption the abstract does not justify.

The paper's main contribution is a two-layer method: pairwise comparisons give rank intervals per task, then conformal prediction wraps those into leaderboard-level intervals that are meant to cover both seen and unseen tasks. That layered structure is the actual novelty over earlier interval-ranking papers.

It does a clean job of stating the goal and running experiments on simulated data plus TabArena and MMLU-style benchmarks. The reported coverage numbers look reasonable on those sets, and the approach avoids obvious circularity.

The soft spot is exactly the one the stress-test flags. Conformal validity needs exchangeability between the calibration tasks and any new task. The benchmarks are quite different in domain and difficulty, yet the abstract gives no argument, sensitivity check, or robustness result for when that assumption fails. Without seeing the full derivations or extra experiments, it is hard to know how much the guarantees actually travel.

This is aimed at people who maintain or interpret large multi-task leaderboards and want defensible uncertainty bands rather than point ranks. A reader already working on conformal methods or evaluation protocols would get the most out of it.

I would send it to peer review. The idea is concrete enough and the experiments are on real benchmarks, so referees can check whether the exchangeability issue is handled in the details or whether it needs more work.

Referee Report

2 major / 2 minor

Summary. The paper introduces a hierarchical framework for model evaluation on multi-task leaderboards. Task-level rank confidence intervals are constructed from pairwise comparisons with statistical guarantees, while leaderboard-level rank prediction intervals are obtained via a conformal prediction procedure applied to the aggregated task-level results. This is intended to enable uncertainty-aware rankings for both observed tasks and new tasks. Experiments on simulated data, TabArena, and PromptEval (MMLU) are reported to demonstrate validity and informativeness of the intervals.

Significance. If the coverage guarantees hold, the work would provide a principled way to quantify uncertainty in leaderboard rankings, addressing the lack of task-level variability in current aggregation methods. The hierarchical use of conformal prediction for rank intervals could be a useful addition to the model evaluation toolkit, particularly if the approach yields non-vacuous intervals on real benchmarks.

major comments (2)

[Conformal prediction procedure for leaderboard ranks (likely §4 or §5)] The central claim for new-task leaderboard ranks requires that the conformal procedure, applied after aggregating task-level intervals, produces valid prediction intervals. This rests on exchangeability (or i.i.d.) between the observed tasks used for calibration and any new tasks. The manuscript does not establish this for heterogeneous benchmarks (e.g., MMLU vs. TabArena) nor provide a robustness argument or sensitivity analysis when task difficulty or domain distributions differ; without this, the guarantees for unseen tasks are not supported.
[Hierarchical framework description and conformal aggregation step] It is unclear how the conformalization step preserves coverage after the hierarchical aggregation of pairwise rank intervals. If the task-level intervals are themselves random and the aggregation introduces dependence, the standard conformal validity argument may not apply directly to the final leaderboard ranks.

minor comments (2)

[Abstract and §1] The abstract and introduction should more explicitly distinguish confidence intervals (task level) from prediction intervals (leaderboard level) to avoid reader confusion.
[Method sections] Include a clear statement of all assumptions (including exchangeability) in the main text rather than only in supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We respond point by point to the major concerns, indicating planned revisions where appropriate.

Referee: The central claim for new-task leaderboard ranks requires that the conformal procedure, applied after aggregating task-level intervals, produces valid prediction intervals. This rests on exchangeability (or i.i.d.) between the observed tasks used for calibration and any new tasks. The manuscript does not establish this for heterogeneous benchmarks (e.g., MMLU vs. TabArena) nor provide a robustness argument or sensitivity analysis when task difficulty or domain distributions differ; without this, the guarantees for unseen tasks are not supported.

Authors: We agree the exchangeability assumption is central and not explicitly justified for heterogeneous benchmarks. The framework assumes new tasks are exchangeable with calibration tasks drawn from the same benchmark distribution; cross-benchmark use (MMLU to TabArena) would require separate calibration sets. In revision we will add an explicit statement of this assumption together with a sensitivity analysis demonstrating empirical coverage under controlled shifts in task difficulty and domain.

revision: yes

Referee: It is unclear how the conformalization step preserves coverage after the hierarchical aggregation of pairwise rank intervals. If the task-level intervals are themselves random and the aggregation introduces dependence, the standard conformal validity argument may not apply directly to the final leaderboard ranks.

Authors: The conformal scores are computed on the final leaderboard-level rank statistics obtained after aggregation; the standard conformal guarantee applies to the exchangeable sequence of these aggregated scores. Dependence induced by the task-level intervals is internal to each score and does not violate the marginal coverage result provided the calibration and test instances remain exchangeable. We will revise §4 to include a concise proof sketch clarifying this point.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation applies standard external methods

The paper constructs task-level rank intervals via pairwise comparisons and then applies a conformal prediction procedure for leaderboard-level intervals. No quoted step reduces a claimed prediction or guarantee to a fitted parameter or self-citation by construction. Conformal prediction is invoked as a known technique whose validity properties are external to the paper; the hierarchical aggregation does not rename or smuggle in its own inputs. The framework is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method is described at a high level using standard statistical tools.

pith-pipeline@v0.9.1-grok · 5691 in / 1084 out tokens · 20033 ms · 2026-06-27T17:53:03.606133+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Forecasting Leaderboards to Deployment Decisions: A Fail-Closed Certification Protocol
cs.LG 2026-06 unverdicted novelty 6.0

Presents a fail-closed certification protocol for determining when forecasting leaderboard winners are deployment-actionable, using a traffic dataset to show friction-induced reversals and an audit to prevent overclaiming.

Reference graph

Works this paper leans on

47 extracted references · 1 canonical work pages · cited by 1 Pith paper

[1] Samuel Ackerman, Eitan Farchi, Orna Raz, and Assaf Toledo. Statistical multi-metric evaluation and visualization of LLM system predictive performance. arXiv preprint arXiv:2501.18243, 2025.
[2] Diaa Al Mohamad, Erik van Zwet, Aldo Solari, and Jelle Goeman. Simultaneous confidence intervals for ranks using the partitioning principle. Electronic Journal of Statistics, 15(1):3109–3134, 2021. doi: 10.1214/21-EJS1847.
[3] Diaa Al Mohamad, Jelle J Goeman, and Erik W van Zwet. Simultaneous confidence intervals for ranks with application to ranking institutions. Biometrics, 78(1):238–247, 2022.
[4] Siavash Ameli, Siyuan Zhuang, Ion Stoica, and Michael W Mahoney. A statistical framework for ranking LLM-based chatbots. arXiv preprint arXiv:2412.18407, 2024.
[5] Michael L Anderson and Jeremy Magruder. Split-sample strategies for avoiding false discoveries. Technical report, National Bureau of Economic Research, 2017.
[6] Angel Rodrigo Avelar Menendez, Yufeng Liu, and Xiaowu Dai. Prompt-dependent ranking of large language models with uncertainty quantification. arXiv e-prints, pages arXiv–2603, 2026.
[7] Yoav Benjamini and Daniel Yekutieli. False discovery rate–adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469):71–81, 2005.
[8] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[9] Vicente Ivan Sanchez Carmona, Shanshan Jiang, and Bin Dong. Towards robust comparisons of NLP models: A case study. In Proceedings of the 31st International Conference on Computational Linguistics, pages 4973–4979, 2025.
[10] Onrina Chandra and Min-ge Xie. Finite-sample valid rank confidence sets for a broad class of statistical and machine learning models. arXiv preprint arXiv:2512.00316, 2025.
[11] Denis Chetverikov, Magne Mogstad, Pawel Morgen, Joseph Romano, Azeem Shaikh, and Daniel Wilhelm. csranks: an R package for estimation and inference involving ranks. arXiv preprint arXiv:2401.15205, 2024.
[12] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating LLMs by human preference. In Forty-first International Conference on Machine Learning, 2024.
[13] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan):1–30, 2006.
[14] Rotem Dror, Gili Baumer, Marina Bogomolov, and Roi Reichart. Replicability analysis for natural language processing: Testing significance with multiple datasets. Transactions of the Association for Computational Linguistics, 5:471–486, 2017. URL https://aclanthology.org/Q17-1034.
[15] Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, Frank Hutter, et al. TabArena: A living benchmark for machine learning on tabular data. arXiv preprint arXiv:2506.16791, 2025.
[16] Adrien Foucart, Arthur Elskens, and Christine Decaestecker. Ranking the scores of algorithms with confidence. In ESANN 2025 proceedings, pages 431–436, 2025.
[17] Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open LLM leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024.
[18] Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, and Ion Stoica. Prompt-to-leaderboard: Prompt-adaptive LLM evaluations. In Forty-second International Conference on Machine Learning, 2025.
[19] Matteo Gasparin and Aaditya Ramdas. Merging uncertainty sets via majority vote. arXiv preprint arXiv:2401.09379, 2024.
[20] Anton Geburek, Neeratyoy Mallik, Danny Stoll, Xavier Bouthillier, and Frank Hutter. LMEMs for post-hoc analysis of HPO benchmarking. arXiv preprint arXiv:2408.02533, 2024.
[21] David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A Smith, Hannaneh Hajishirzi, Kyle Lo, and Jesse Dodge. Signal and noise: A framework for reducing uncertainty in language model evaluation. arXiv preprint arXiv:2508.13144, 2025.
[22] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[23] Sture Holm. Confidence intervals for ranks. Technical report, Uppsala University, 2013.
[24] Minji Jung, Minjae Lee, Yejin Kim, Sarang Choi, and Minsuk Kahng. Who defines "best"? Towards interactive, user-defined evaluation of LLM leaderboards. arXiv preprint arXiv:2604.21769, 2026.
[25] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
[26] Weiguang Liu, Áureo de Paula, and Elie Tamer. Prediction sets and conformal inference with interval outcomes. arXiv preprint arXiv:2501.10117, 2025. URL https://arxiv.org/abs/2501.10117.
[27] Rachel Longjohn, Giri Gopalan, and Emily Casleton. Statistical uncertainty quantification for aggregate performance metrics in machine learning benchmarks. arXiv preprint arXiv:2501.04234, 2025.
[28] Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640, 2024.
[29] Swaroop Mishra and Anjana Arunkumar. How robust are model rankings: A leaderboard customization approach for equitable evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13561–13569, 2021.
[30] Felipe M Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson F de Oliveira, Yuekai Sun, and Mikhail Yurochkin. Efficient multi-prompt evaluation of LLMs. Advances in Neural Information Processing Systems, 37:22483–22512, 2024.
[31] Harry O Posten. The robustness of the one-sample t-test over the Pearson system. Journal of Statistical Computation and Simulation, 9(2):133–149, 1979.
[32] Justin Rising. Uncertainty in ranking. arXiv preprint arXiv:2107.03459, 2021.
[33] Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, and Kuntae Kim. Arena-lite: Efficient and reliable large language model evaluation via tournament-based direct comparisons. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7068–7086, 2025.
[34] Miloš Stojanovic and Milena Nikolic. Evaluating passenger segment sensitivity and reliability in airline satisfaction AI systems. In 2026 XXV International Symposium INFOTEH-JAHORINA (INFOTEH), Jahorina, Bosnia and Herzegovina, March 2026. URL https://infoteh.etf.ues.rs.ba/zbornik/2026/radovi/343.pdf.
[35] Filipa Valdeira and Cláudia Soares. Ranking with confidence for large scale comparison data. In Proceedings of the 2025 SIAM International Conference on Data Mining (SDM), pages 223–232. SIAM, 2025.
[36] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer, 2005.
[37] Zebin Wang, Yi Han, Ethan X Fang, Lan Wang, and Junwei Lu. Confidence diagram of nonparametric ranking for uncertainty assessment in large language models evaluation. arXiv preprint arXiv:2412.05506, 2024.
[38] Rand R Wilcox. Introduction to robust estimation and hypothesis testing. Academic Press, 2011.
[39] Rand R Wilcox. Comparing two dependent groups. In A Guide to Robust Statistical Methods, pages 83–96. Springer, 2023.

A Ranking implementation details

A.1 Task-level rank CIs

Here we provide more details regarding our method for constructing task-level rank CIs, as well as a discussion of alternative choices.

Statistical tests. Throughout this paper, we u...
[40] P-value computation: Computing pairwise paired t-tests for all model combinations on a single task takes $O(nM^2)$. Across all tasks, this requires $O(NnM^2)$.
[41] Task-level rank CIs (Algorithm 1): For marginal rank CIs, applying the step-down FWER control requires sorting $M-1$ p-values for each of the $M$ models, yielding $O(M^2 \log M)$ per task. Across all tasks, this step requires $O(NM^2 \log M)$.
[42] Leaderboard-level rank PIs (Algorithm 2): A straightforward approach would require sorting the per-task bounds, with a runtime of $O(MN \log N)$. However, our implementation leverages optimized selection algorithms (via NumPy and Pandas) for quantile identification, avoiding full sorting and reducing the average runtime to $O(MN)$.

Total average runtime: O(N n...
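The selection-based quantile step described in the runtime notes can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 2: the function name `leaderboard_rank_pi`, the equal split of the miscoverage level between the two endpoints, and the `(N + 1)`-adjusted order statistics are assumptions borrowed from the standard split-conformal recipe. `np.partition` places the requested order statistic without fully sorting each column, which is the kind of selection routine the notes refer to.

```python
import numpy as np


def leaderboard_rank_pi(lower, upper, alpha=0.1):
    """Hypothetical helper: leaderboard-level rank prediction intervals.

    lower, upper: (N_tasks, M) arrays of task-level rank CI bounds.
    Returns an (M, 2) integer array of [lower, upper] prediction bounds.
    """
    lower = np.asarray(lower)
    upper = np.asarray(upper)
    n_tasks, n_models = lower.shape
    # Conformal order statistics, with alpha split evenly (an assumption).
    k_lo = int(np.floor((n_tasks + 1) * alpha / 2))
    k_hi = int(np.ceil((n_tasks + 1) * (1 - alpha / 2)))
    if k_lo >= 1:
        # k_lo-th smallest lower bound per model, found by selection.
        lo = np.partition(lower, k_lo - 1, axis=0)[k_lo - 1]
    else:
        lo = np.full(n_models, 1)  # too few tasks: fall back to the best rank
    if k_hi <= n_tasks:
        hi = np.partition(upper, k_hi - 1, axis=0)[k_hi - 1]
    else:
        hi = np.full(n_models, n_models)  # too few tasks: fall back to the worst rank
    return np.stack([lo, hi], axis=1)
```

When the number of tasks is small relative to `1 / alpha`, both fallbacks trigger and the interval is the trivial `[1, M]`; the intervals only become informative once enough calibration tasks are available.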
[43] It covers the true rank $r_j^*$ with high probability: $P\left(r_j^* \subseteq [L_j^*, U_j^*]\right) \ge 1 - \alpha_{\mathrm{tsk}}$.
[44] It is exchangeable with the task-level rank CIs $[L_j^1, U_j^1], \ldots, [L_j^N, U_j^N]$ because all intervals are derived from the same estimation method applied to independent and identically distributed tasks. Due to property (1), it suffices to show that $P\left([L_j^*, U_j^*] \subseteq [L_j, U_j]\right) \ge 1 - \alpha_{\mathrm{ldb}}$. By applying the union bound, we then obtain the desired cove...
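The excerpt is cut off before the final statement, so the level the paper actually reports is not visible here. As a hedged illustration only, this is what the two displayed bounds imply when combined by a union bound, using the same symbols as above:

```latex
\begin{align*}
P\left(r_j^* \not\subseteq [L_j, U_j]\right)
  &\le P\left(r_j^* \not\subseteq [L_j^*, U_j^*]\right)
     + P\left([L_j^*, U_j^*] \not\subseteq [L_j, U_j]\right) \\
  &\le \alpha_{\mathrm{tsk}} + \alpha_{\mathrm{ldb}},
\end{align*}
so that
\[
P\left(r_j^* \subseteq [L_j, U_j]\right) \ge 1 - \alpha_{\mathrm{tsk}} - \alpha_{\mathrm{ldb}}.
\]
```

The first inequality holds because the true rank can only fall outside the leaderboard-level interval if it falls outside the new task's CI or if that CI is not contained in the leaderboard-level interval.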
[45] Selecting a subset of tasks based on domain knowledge, which can be done at any stage without affecting the validity of the task-level rank CIs.
[46] Filtering tasks whose rank intervals exceed a predefined width threshold.
[47] Grouping similar tasks, such as clustering them by their top-k models, and then computing task-level CIs for each cluster. In Section 4.2, Figure 3, we show that some models are better on regression tasks, while others outperform on multiclass classification. While comparing all models across all datasets is occasionally of interest, practitioners often fo...

[1] [1]

Statistical multi-metric evaluation and visualization of llm system predictive performance.arXiv preprint arXiv:2501.18243, 2025

Samuel Ackerman, Eitan Farchi, Orna Raz, and Assaf Toledo. Statistical multi-metric evaluation and visualization of llm system predictive performance.arXiv preprint arXiv:2501.18243, 2025

arXiv 2025

[2] [2]

Simultaneous confidence intervals for ranks using the partitioning principle.Electronic Journal of Statistics, 15(1): 3109–3134, 2021

Diaa Al Mohamad, Erik van Zwet, Aldo Solari, and Jelle Goeman. Simultaneous confidence intervals for ranks using the partitioning principle.Electronic Journal of Statistics, 15(1): 3109–3134, 2021. doi: 10.1214/21-EJS1847

work page doi:10.1214/21-ejs1847 2021

[3] [3]

Simultaneous confidence intervals for ranks with application to ranking institutions.Biometrics, 78(1):238–247, 2022

Diaa Al Mohamad, Jelle J Goeman, and Erik W van Zwet. Simultaneous confidence intervals for ranks with application to ranking institutions.Biometrics, 78(1):238–247, 2022

2022

[4] [4]

A statistical framework for ranking llm-based chatbots.arXiv preprint arXiv:2412.18407, 2024

Siavash Ameli, Siyuan Zhuang, Ion Stoica, and Michael W Mahoney. A statistical framework for ranking llm-based chatbots.arXiv preprint arXiv:2412.18407, 2024

arXiv 2024

[5] [5]

Split-sample strategies for avoiding false discoveries

Michael L Anderson and Jeremy Magruder. Split-sample strategies for avoiding false discoveries. Technical report, National Bureau of Economic Research, 2017

2017

[6] [6]

Prompt-dependent ranking of large language models with uncertainty quantification.arXiv e-prints, pages arXiv–2603, 2026

Angel Rodrigo Avelar Menendez, Yufeng Liu, and Xiaowu Dai. Prompt-dependent ranking of large language models with uncertainty quantification.arXiv e-prints, pages arXiv–2603, 2026

2026

[7] [7]

False discovery rate–adjusted multiple confidence intervals for selected parameters.Journal of the American Statistical Association, 100(469): 71–81, 2005

Yoav Benjamini and Daniel Yekutieli. False discovery rate–adjusted multiple confidence intervals for selected parameters.Journal of the American Statistical Association, 100(469): 71–81, 2005

2005

[8] [8]

Rank analysis of incomplete block designs: I

Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

1952

[9] [9]

Towards robust comparisons of nlp models: A case study

Vicente Ivan Sanchez Carmona, Shanshan Jiang, and Bin Dong. Towards robust comparisons of nlp models: A case study. InProceedings of the 31st International Conference on Computational Linguistics, pages 4973–4979, 2025

2025

[10] [10]

Finite-sample valid rank confidence sets for a broad class of statistical and machine learning models.arXiv preprint arXiv:2512.00316, 2025

Onrina Chandra and Min-ge Xie. Finite-sample valid rank confidence sets for a broad class of statistical and machine learning models.arXiv preprint arXiv:2512.00316, 2025

arXiv 2025

[11] [11]

csranks: an r package for estimation and inference involving ranks.arXiv preprint arXiv:2401.15205, 2024

Denis Chetverikov, Magne Mogstad, Pawel Morgen, Joseph Romano, Azeem Shaikh, and Daniel Wilhelm. csranks: an r package for estimation and inference involving ranks.arXiv preprint arXiv:2401.15205, 2024

arXiv 2024

[12] [12]

Chatbot arena: An open platform for evaluating llms by human preference

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. InForty-first International Conference on Machine Learning, 2024

2024

[13] [13]

Statistical comparisons of classifiers over multiple data sets.Journal of Machine learning research, 7(Jan):1–30, 2006

Janez Demšar. Statistical comparisons of classifiers over multiple data sets.Journal of Machine learning research, 7(Jan):1–30, 2006

2006

[14] [14]

Replicability analysis for natural language processing: Testing significance with multiple datasets

Rotem Dror, Gili Baumer, Marina Bogomolov, and Roi Reichart. Replicability analysis for natural language processing: Testing significance with multiple datasets. Transactions of the Association for Computational Linguistics, 5:471–486, 2017. URL https://aclanthology.org/Q17-1034

2017

[15] [15]

Tabarena: A living benchmark for machine learning on tabular data

Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, Frank Hutter, et al. Tabarena: A living benchmark for machine learning on tabular data. arXiv preprint arXiv:2506.16791, 2025

Pith/arXiv arXiv 2025

[16] [16]

Ranking the scores of algorithms with confidence

Adrien Foucart, Arthur Elskens, and Christine Decaestecker. Ranking the scores of algorithms with confidence. In ESANN 2025 proceedings, pages 431–436, 2025

2025

[17] [17]

Open llm leaderboard v2

Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024

2024

[18] [18]

Prompt-to-leaderboard: Prompt-adaptive llm evaluations

Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, and Ion Stoica. Prompt-to-leaderboard: Prompt-adaptive llm evaluations. In Forty-second International Conference on Machine Learning, 2025

2025

[19] [19]

Merging uncertainty sets via majority vote

Matteo Gasparin and Aaditya Ramdas. Merging uncertainty sets via majority vote. arXiv preprint arXiv:2401.09379, 2024

arXiv 2024

[20] [20]

Lmems for post-hoc analysis of hpo benchmarking

Anton Geburek, Neeratyoy Mallik, Danny Stoll, Xavier Bouthillier, and Frank Hutter. Lmems for post-hoc analysis of hpo benchmarking. arXiv preprint arXiv:2408.02533, 2024

arXiv 2024

[21] [21]

Signal and noise: A framework for reducing uncertainty in language model evaluation

David Heineman, Valentin Hofmann, Ian Magnusson, Yuling Gu, Noah A Smith, Hannaneh Hajishirzi, Kyle Lo, and Jesse Dodge. Signal and noise: A framework for reducing uncertainty in language model evaluation. arXiv preprint arXiv:2508.13144, 2025

arXiv 2025

[22] [22]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

2021

[23] [23]

Confidence intervals for ranks

Sture Holm. Confidence intervals for ranks. Technical report, Uppsala University, 2013

2013

[24] [24]

Who defines "best"? Towards interactive, user-defined evaluation of llm leaderboards

Minji Jung, Minjae Lee, Yejin Kim, Sarang Choi, and Minsuk Kahng. Who defines "best"? Towards interactive, user-defined evaluation of llm leaderboards. arXiv preprint arXiv:2604.21769, 2026

Pith/arXiv arXiv 2026

[25] [25]

Distribution-free predictive inference for regression

Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018

2018

[26] [26]

Prediction sets and conformal inference with interval outcomes

Weiguang Liu, Áureo de Paula, and Elie Tamer. Prediction sets and conformal inference with interval outcomes. arXiv preprint arXiv:2501.10117, 2025. URL https://arxiv.org/abs/2501.10117

arXiv 2025

[27] [27]

Statistical uncertainty quantification for aggregate performance metrics in machine learning benchmarks

Rachel Longjohn, Giri Gopalan, and Emily Casleton. Statistical uncertainty quantification for aggregate performance metrics in machine learning benchmarks. arXiv preprint arXiv:2501.04234, 2025

arXiv 2025

[28] [28]

Adding error bars to evals: A statistical approach to language model evaluations

Evan Miller. Adding error bars to evals: A statistical approach to language model evaluations. arXiv preprint arXiv:2411.00640, 2024

arXiv 2024

[29] [29]

How robust are model rankings: A leaderboard customization approach for equitable evaluation

Swaroop Mishra and Anjana Arunkumar. How robust are model rankings: A leaderboard customization approach for equitable evaluation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13561–13569, 2021

2021

[30] [30]

Efficient multi-prompt evaluation of llms

Felipe M Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson F de Oliveira, Yuekai Sun, and Mikhail Yurochkin. Efficient multi-prompt evaluation of llms. Advances in Neural Information Processing Systems, 37:22483–22512, 2024

2024

[31] [31]

The robustness of the one-sample t-test over the pearson system

Harry O Posten. The robustness of the one-sample t-test over the pearson system. Journal of Statistical Computation and Simulation, 9(2):133–149, 1979

1979

[32] [32]

Uncertainty in ranking

Justin Rising. Uncertainty in ranking. arXiv preprint arXiv:2107.03459, 2021

arXiv 2021

[33] [33]

Arena-lite: Efficient and reliable large language model evaluation via tournament-based direct comparisons

Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, and Kuntae Kim. Arena-lite: Efficient and reliable large language model evaluation via tournament-based direct comparisons. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7068–7086, 2025

2025

[34] [34]

Evaluating passenger segment sensitivity and reliability in airline satisfaction ai systems

Miloš Stojanovic and Milena Nikolic. Evaluating passenger segment sensitivity and reliability in airline satisfaction ai systems. In 2026 XXV International Symposium INFOTEH-JAHORINA (INFOTEH), Jahorina, Bosnia and Herzegovina, March 2026. URL https://infoteh.etf.ues.rs.ba/zbornik/2026/radovi/343.pdf

2026

[35] [35]

Ranking with confidence for large scale comparison data

Filipa Valdeira and Cláudia Soares. Ranking with confidence for large scale comparison data. In Proceedings of the 2025 SIAM International Conference on Data Mining (SDM), pages 223–232. SIAM, 2025

2025

[36] [36]

Algorithmic learning in a random world

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer, 2005

2005

[37] [37]

Confidence diagram of nonparametric ranking for uncertainty assessment in large language models evaluation

Zebin Wang, Yi Han, Ethan X Fang, Lan Wang, and Junwei Lu. Confidence diagram of nonparametric ranking for uncertainty assessment in large language models evaluation. arXiv preprint arXiv:2412.05506, 2024

arXiv 2024

[38] [38]

Introduction to robust estimation and hypothesis testing

Rand R Wilcox. Introduction to robust estimation and hypothesis testing. Academic Press, 2011

2011

[39] [39]

Comparing two dependent groups

Rand R Wilcox. Comparing two dependent groups. In A Guide to Robust Statistical Methods, pages 83–96. Springer, 2023

2023

A Ranking Implementation Details

A.1 Task-level rank CIs

Here we provide more details regarding our method for constructing task-level rank CIs, as well as a discussion of alternative choices.

Statistical tests. Throughout this paper, we u...

[40] [40]

P-value computation: Computing pairwise paired t-tests for all model combinations on a single task takes $O(n M^2)$. Across all tasks, this requires $O(N n M^2)$.
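As a rough illustration of where the per-task cost comes from, the sketch below computes the paired t statistic for every model pair on one task, using only the Python standard library. The function name `paired_t_statistics` and the dict-of-lists input layout are assumptions made for this example, not the authors' implementation. Turning each statistic into a p-value would additionally need a Student-t CDF with n − 1 degrees of freedom (for instance from SciPy), which is omitted here.

```python
from itertools import combinations
from math import copysign, inf, sqrt
from statistics import mean, stdev


def paired_t_statistics(scores):
    """Paired t statistics for all model pairs on a single task.

    scores: dict mapping a model name to its list of n per-example scores
            (hypothetical layout; all lists must cover the same examples).
    Returns {(a, b): t}, where t tests mean(score_a - score_b) == 0.
    """
    t_stats = {}
    # M(M-1)/2 unordered pairs, each costing O(n): O(n M^2) per task.
    for a, b in combinations(sorted(scores), 2):
        diffs = [x - y for x, y in zip(scores[a], scores[b])]
        n = len(diffs)
        d_bar, sd = mean(diffs), stdev(diffs)
        if sd == 0:
            # Degenerate case: all paired differences are identical.
            t = 0.0 if d_bar == 0 else copysign(inf, d_bar)
        else:
            t = d_bar / (sd / sqrt(n))
        t_stats[(a, b)] = t
    return t_stats


# Toy scores for three hypothetical models on four examples.
task_scores = {
    "model_a": [2.0, 3.0, 2.0, 4.0],
    "model_b": [1.0, 2.0, 1.5, 3.0],
    "model_c": [1.0, 2.5, 2.0, 3.5],
}
for pair, t in paired_t_statistics(task_scores).items():
    print(pair, round(t, 3))
```

Repeating this loop over N tasks gives the overall cost quoted above.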

[41] [41]

Task-level rank CIs (Algorithm 1): For marginal rank CIs, applying the step-down FWER control requires sorting $M-1$ p-values for each of the $M$ models, yielding $O(M^2 \log M)$ per task. Across all tasks, this step requires $O(N M^2 \log M)$.
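One way to picture this step is a Holm-style step-down procedure applied to the M − 1 comparisons that involve a given model. The sketch below is an assumption about the general shape of such a procedure, not a transcription of Algorithm 1: the helper names, the nested-list inputs `pvals[j][k]` and `mean_diff[j][k]`, and the rule "lower bound = 1 + number of significantly better models, upper bound = M − number of significantly worse models" are all illustrative choices.

```python
def holm_rejections(pvals, alpha):
    """Holm step-down: list of booleans, True where the null is rejected."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # O(m log m) sort
    reject = [False] * m
    for step, i in enumerate(order):
        if pvals[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break  # step-down stops at the first non-rejection
    return reject


def marginal_rank_ci(j, pvals, mean_diff, alpha=0.05):
    """Rank CI for model j on one task (rank 1 = best, higher score = better).

    pvals[j][k]: p-value for comparing model j with model k (assumed given).
    mean_diff[j][k]: mean score of model j minus mean score of model k.
    """
    M = len(pvals)
    others = [k for k in range(M) if k != j]
    reject = holm_rejections([pvals[j][k] for k in others], alpha)
    better = sum(r and mean_diff[j][k] < 0 for k, r in zip(others, reject))
    worse = sum(r and mean_diff[j][k] > 0 for k, r in zip(others, reject))
    return 1 + better, M - worse


# Toy inputs for three hypothetical models; diagonal entries are unused.
pvals = [[None, 0.001, 0.002], [0.001, None, 0.4], [0.002, 0.4, None]]
mean_diff = [[0.0, 1.0, 2.0], [-1.0, 0.0, 0.2], [-2.0, -0.2, 0.0]]
for j in range(3):
    print(j, marginal_rank_ci(j, pvals, mean_diff))
```

The sort inside `holm_rejections` is the source of the log factor: one sort of M − 1 values per model, for M models per task.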

[42] [42]

rank spread

Leaderboard-level rank PIs (Algorithm 2): A straightforward approach would require sorting the per-task bounds, with a runtime of $O(M N \log N)$. However, our implementation leverages optimized selection algorithms (via NumPy and Pandas) for quantile identification, avoiding full sorting and reducing the average runtime to $O(M N)$.

Total average runtime: O(N n...
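The difference between sorting and selection can be sketched with a plain quickselect, which finds one order statistic in average linear time. Everything below is illustrative: `kth_smallest` and `rank_prediction_interval` are hypothetical names, and the order-statistic indices follow the usual split-conformal recipe, which is an assumption here and may differ from Algorithm 2. NumPy offers comparable selection through `numpy.partition`.

```python
import math
import random


def kth_smallest(values, k):
    """k-th smallest element (1-indexed) via quickselect, average O(len(values))."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    items, target = list(values), k - 1
    while True:
        pivot = random.choice(items)
        lows = [v for v in items if v < pivot]
        n_equal = sum(1 for v in items if v == pivot)
        if target < len(lows):
            items = lows
        elif target < len(lows) + n_equal:
            return pivot
        else:
            target -= len(lows) + n_equal
            items = [v for v in items if v > pivot]


def rank_prediction_interval(lowers, uppers, n_models, alpha=0.1):
    """Prediction interval for one model from its N per-task rank CI bounds.

    Assumed recipe: take the floor((N+1)*alpha/2)-th smallest lower bound and
    the ceil((N+1)*(1-alpha/2))-th smallest upper bound, falling back to the
    trivial bounds 1 and n_models when those indices are out of range.
    """
    n = len(lowers)
    k_lo = math.floor((n + 1) * alpha / 2)
    k_hi = math.ceil((n + 1) * (1 - alpha / 2))
    lo = kth_smallest(lowers, k_lo) if k_lo >= 1 else 1
    hi = kth_smallest(uppers, k_hi) if k_hi <= n else n_models
    return lo, hi


# Toy per-task bounds for one hypothetical model across 10 tasks.
lowers = [2, 1, 3, 2, 2, 1, 2, 3, 2, 2]
uppers = [4, 3, 5, 4, 4, 3, 4, 6, 4, 5]
print(rank_prediction_interval(lowers, uppers, n_models=8, alpha=0.2))
```

Running the selection once per bound and per model touches each of the N per-task values a constant number of times on average, which matches the linear-in-N average cost described above.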

2025

[43] [43]

It covers the true rank $r^*_j$ with high probability: $P\left(r^*_j \subseteq [L^*_j, U^*_j]\right) \ge 1 - \alpha_{\mathrm{tsk}}$.

[44] [44]

It is exchangeable with the task-level rank CIs $[L^1_j, U^1_j], \ldots, [L^N_j, U^N_j]$ because all intervals are derived from the same estimation method applied to independent and identically distributed tasks. Due to property (1), it suffices to show that $P\left([L^*_j, U^*_j] \subseteq [L_j, U_j]\right) \ge 1 - \alpha_{\mathrm{ldb}}$. By applying the union bound, we then obtain the desired cove...
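The union-bound step can be written out as follows. This is a sketch of the standard argument in the notation above, with $[L_j, U_j]$ denoting the leaderboard-level interval; the resulting level $1 - \alpha_{\mathrm{tsk}} - \alpha_{\mathrm{ldb}}$ is what the two stated properties imply, and the paper's exact statement may be phrased differently.

```latex
\begin{align*}
P\left(r^*_j \not\subseteq [L_j, U_j]\right)
  &\le P\left(r^*_j \not\subseteq [L^*_j, U^*_j]\right)
     + P\left([L^*_j, U^*_j] \not\subseteq [L_j, U_j]\right) \\
  &\le \alpha_{\mathrm{tsk}} + \alpha_{\mathrm{ldb}},
\end{align*}
% hence
\[
P\left(r^*_j \subseteq [L_j, U_j]\right) \ge 1 - \alpha_{\mathrm{tsk}} - \alpha_{\mathrm{ldb}}.
\]
```

The first inequality holds because $r^*_j \subseteq [L^*_j, U^*_j]$ together with $[L^*_j, U^*_j] \subseteq [L_j, U_j]$ implies $r^*_j \subseteq [L_j, U_j]$, so a miss requires at least one of the two inclusions to fail.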

[45] [45]

Selecting a subset of tasks based on domain knowledge, which can be done at any stage without affecting the validity of the task-level rank CIs

[46] [46]

Filtering tasks whose rank intervals exceed a predefined width threshold

[47] [47]

Grouping similar tasks, such as clustering them by their top-k models, and then computing task-level CIs for each cluster. In Section 4.2, Figure 3, we show that some models are better on regression tasks, while others perform better on multiclass classification. While comparing all models across all datasets is occasionally of interest, practitioners often fo...
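The filtering and grouping options listed above can be sketched in a few lines. Both helpers are hypothetical: the nested-dict layouts, the choice of the mean interval width as the filtering criterion, and grouping by an identical top-k set (a crude stand-in for clustering) are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def filter_by_width(task_cis, max_mean_width):
    """Keep tasks whose mean rank-CI width is at most max_mean_width.

    task_cis: dict task -> dict model -> (lower, upper)  (assumed layout).
    """
    kept = {}
    for task, cis in task_cis.items():
        widths = [upper - lower for lower, upper in cis.values()]
        if mean(widths) <= max_mean_width:
            kept[task] = cis
    return kept


def group_by_top_k(task_scores, k):
    """Group tasks that share the same set of top-k models (higher = better).

    task_scores: dict task -> dict model -> mean score  (assumed layout).
    """
    groups = defaultdict(list)
    for task, scores in task_scores.items():
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        groups[frozenset(top)].append(task)
    return dict(groups)


# Toy inputs with made-up task and model names.
task_cis = {
    "t1": {"a": (1, 1), "b": (2, 3), "c": (2, 3)},
    "t2": {"a": (1, 3), "b": (1, 3), "c": (1, 3)},
}
task_scores = {
    "reg1": {"a": 0.90, "b": 0.80, "c": 0.50},
    "reg2": {"a": 0.85, "b": 0.90, "c": 0.40},
    "clf1": {"a": 0.60, "b": 0.70, "c": 0.95},
}
print(sorted(filter_by_width(task_cis, max_mean_width=1.0)))
print(group_by_top_k(task_scores, k=2))
```

Task-level CIs would then be recomputed within each group, so that models are only compared on tasks where a similar set of models leads.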