pith. machine review for the scientific record.

arxiv: 2603.10960 · v1 · submitted 2026-03-11 · 💻 cs.LG · math.ST · stat.TH

Recognition: no theorem link

Ranking Reasoning LLMs under Test-Time Scaling

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 13:29 UTC · model grok-4.3

classification 💻 cs.LG · math.ST · stat.TH

keywords test-time scaling · LLM ranking · reasoning models · statistical ranking · Bayesian evaluation · Kendall tau · math benchmarks · model comparison

The pith

Statistical ranking methods match Bayesian gold standard for reasoning LLMs under test-time scaling

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes the problem of ranking reasoning LLMs when each prompt receives multiple sampled outputs at test time and tests whether common statistical techniques produce stable orderings. It introduces Scorio, a library that applies paired-comparison models, item-response theory, voting rules, and graph- and spectral-based methods to this setting. On twenty models evaluated on four Olympiad-style math benchmarks with up to eighty trials per prompt, the large majority of these methods produce rankings that align closely with a Bayesian reference ranking, frequently recovering the identical order. Even with a single trial the strongest methods still reach substantial agreement, and feeding greedy-decoding outputs as a prior can shrink variance but may shift the ordering when greedy and random samples differ.

Core claim

Across full trial sets, most statistical ranking approaches yield orderings in close agreement with the Bayesian gold standard Bayes_U at eighty samples, with mean Kendall's tau_b between 0.93 and 0.95 and with nineteen to thirty-four methods recovering the identical ordering. In the single-trial case the best methods attain tau_b of roughly 0.86. Incorporating greedy decoding as an empirical prior in the Bayesian model reduces variance at one sample by sixteen to fifty-two percent, though this can bias results when greedy and stochastic outputs diverge.
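Concretely, the agreement statistic reported throughout can be computed in a few lines. The sketch below is illustrative only (not Scorio's API, and the five-model rankings are made up): it evaluates Kendall's tau_b for two orderings that differ by one adjacent swap.

```python
# Illustrative sketch, not Scorio's implementation: Kendall's tau_b between
# two model orderings, the agreement statistic the paper reports against
# the Bayes_U@80 reference ranking.
from itertools import combinations

def kendall_tau_b(ranks_a, ranks_b):
    """Kendall's tau_b over all index pairs, with the standard tie correction."""
    concordant = discordant = ties_a = ties_b = 0
    for i, j in combinations(range(len(ranks_a)), 2):
        da = ranks_a[i] - ranks_a[j]
        db = ranks_b[i] - ranks_b[j]
        if da == 0:
            ties_a += 1
        if db == 0:
            ties_b += 1
        if da != 0 and db != 0:
            if da * db > 0:
                concordant += 1
            else:
                discordant += 1
    n0 = len(ranks_a) * (len(ranks_a) - 1) // 2
    denom = ((n0 - ties_a) * (n0 - ties_b)) ** 0.5
    return (concordant - discordant) / denom

# Hypothetical ranks for five models: one adjacent swap out of 10 pairs.
gold = [1, 2, 3, 4, 5]     # stand-in for the Bayes_U@80 reference order
method = [1, 2, 4, 3, 5]   # a method that swaps two adjacent models
print(kendall_tau_b(gold, method))  # -> 0.8
```

A single adjacent swap leaves 9 of 10 pairs concordant, so tau_b = (9 − 1)/10 = 0.8; the paper's 0.93–0.95 range corresponds to near-total pairwise agreement over twenty models.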

What carries the argument

The Scorio library of paired-comparison, item-response theory, voting, graph, and spectral ranking methods, evaluated for agreement with the Bayesian gold standard Bayes_U@80 on dense trial data.
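As one concrete instance of the paired-comparison family, a minimal Bradley-Terry fit can be sketched as follows. The Zermelo/MM iteration and the toy three-model win matrix are my own illustration under standard BT assumptions, not Scorio's implementation.

```python
# A minimal sketch (assumed, not Scorio's code) of one method family in the
# toolkit: Bradley-Terry strengths fit to a pairwise win-count matrix with
# the classic Zermelo/MM iteration, then read off as a ranking.
import numpy as np

def bradley_terry(wins, iters=200):
    """wins[i, j] = number of prompts on which model i beat model j."""
    n = wins.shape[0]
    p = np.ones(n)                       # initial strengths
    pair_counts = wins + wins.T          # total comparisons per pair
    total_wins = wins.sum(axis=1)
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j (n_ij / (p_i + p_j))
        denom = np.array([
            sum(pair_counts[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        p = total_wins / denom
        p /= p.sum()                     # normalize for identifiability
    return p

# Hypothetical win counts for three models (row beats column).
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
strengths = bradley_terry(wins)
ranking = np.argsort(-strengths)         # strongest first
print(ranking)  # -> [0 1 2]
```

The same win matrix could equally be fed to spectral or voting-rule rankers, which is the comparison the paper runs at scale across its four benchmarks.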

If this is right

  • Most full-trial rankings agree closely with the Bayesian gold standard Bayes_U@80.
  • Nineteen to thirty-four methods recover exactly the same model ordering.
  • The strongest single-trial methods reach Kendall's tau_b of approximately 0.86.
  • Using greedy decoding as an empirical prior reduces variance at N=1 by 16 to 52 percent.
  • Greedy priors can introduce bias when they disagree with stochastic sampling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same statistical toolkit could be applied directly to rank models on non-mathematical reasoning tasks.
  • In severely compute-limited settings, single-trial rankings from the top methods may already give usable approximate orderings.
  • Hybrid priors that combine greedy and stochastic information could be tested to retain variance reduction without the observed bias.
  • The stability of these rankings might change if models are further fine-tuned on the benchmark problems themselves.

Load-bearing premise

The Bayesian gold standard Bayes_U@80 constitutes the true underlying model ranking and standard independence assumptions hold for the generated reasoning traces.

What would settle it

A new collection of prompts on which the full-trial Bayesian ordering diverges sharply from the orders produced by the top-performing statistical methods would show that the claimed agreement does not hold in general.

Figures

Figures reproduced from arXiv: 2603.10960 by Jing Ma, Michael Hinczewski, Mohsen Hariri, Vipin Chaudhary.

Figure 1. Agreement between each method's full-trial ranking and the gold standard (Kendall's τ_b).
Figure 2. Gold-standard agreement of Bayes_U@N (blue) and Bayes_R0@N (red) as a function of N across benchmarks. Shaded regions show ±1 standard deviation over 50 resampled datasets.
Figure 3. Model-level ranks under greedy decoding …
Figure 4. Gold-standard agreement vs. self-consistency.
Figure 5. Overview of model accuracies across all four benchmarks; each panel shows each model's mean accuracy.
Figure 6. Across our four benchmarks, the prior …
Figure 7. Bootstrap distributions of Kendall's τ_b at N = 1 (50 samples). Violin plots show the full distribution; the greedy prior (red) yields narrower distributions but can shift the mean negatively (HMMT'25) or positively (BrUMO'25).
read the original abstract

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $\tau_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $\tau_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper formalizes dense benchmark ranking of reasoning LLMs under test-time scaling and introduces the open-source Scorio library implementing paired-comparison models, IRT models, voting rules, and graph/spectral methods. Across 20 models on four Olympiad math benchmarks (AIME'24/25, HMMT'25, BrUMO'25) with up to N=80 trials per prompt, it reports that most full-trial ranking methods agree closely with the Bayesian gold standard Bayes_U@80 (mean Kendall's τ_b = 0.93--0.95), that 19--34 methods recover identical orderings, and that the best single-trial methods reach τ_b ≈ 0.86. It further shows that a greedy-decoding prior (Bayes_R0@N) reduces variance at N=1 by 16--52% while noting potential bias when greedy and stochastic outputs disagree.

Significance. If the empirical results hold, the work supplies a practical, reproducible toolkit and concrete guidance for choosing reliable ranking procedures in both high- and low-budget test-time scaling regimes. The reported Kendall-τ values, exact-ordering counts, and variance-reduction figures constitute direct, falsifiable evidence rather than derived tautologies; the release of Scorio further strengthens the contribution by enabling independent verification and extension.

minor comments (2)
  1. [Methods] Full details of the Bayesian computation and the construction of error bars are not visible from the abstract alone; the methods section should explicitly describe the prior specification, the MCMC (or other posterior) settings, and how uncertainty is propagated into the reported τ_b values.
  2. [Introduction] Notation for Bayes_U@80 and Bayes_R0@N is introduced without an immediate equation reference; adding a short definitional equation or table entry would improve readability for readers unfamiliar with the Bayesian ranking formulation.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment and recommendation to accept. We are pleased that the empirical results on Kendall tau agreement with the Bayesian gold standard, the exact-ordering counts, the variance-reduction findings, and the release of Scorio are recognized as providing direct, reproducible guidance for ranking reasoning LLMs under test-time scaling.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper evaluates multiple ranking methods (paired-comparison, IRT, voting, graph/spectral) by measuring their agreement with an independent Bayesian gold standard Bayes_U@80 computed on the full N=80 trial data. Kendall's tau_b values are empirical correlations between these independently derived rankings and the gold standard; no equations reduce the reported agreements to tautological fits, self-definitions, or renamings of inputs. The Bayesian reference is constructed from the same raw traces but via a distinct probabilistic model, providing an external benchmark rather than a circular derivation. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central empirical claims rest on standard statistical assumptions for ranking models plus the validity of the chosen Bayesian reference; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: LLM reasoning traces across independent trials are exchangeable and can be modeled by standard paired-comparison and IRT assumptions. Invoked when applying the listed ranking methods to LLM outputs.

pith-pipeline@v0.9.0 · 5556 in / 1189 out tokens · 53294 ms · 2026-05-15T13:29:25.467714+00:00 · methodology

