BayesU@N (and avg@N) depend only on marginal correctness and do not impose a parametric pairwise-choice model

Minimal modeling assumptions
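To make "depends only on marginal correctness" concrete, here is a minimal sketch for avg@N only. It assumes avg@N is the per-question fraction of correct trials, averaged over questions; the function name `avg_at_n` and the toy outcome matrices are hypothetical and not taken from the cited work. BayesU@N is not implemented here because this page does not define it. The example reorders the trials within each question, which would change any trial-by-trial pairing against another model, and the score stays the same.

```python
from typing import Sequence


def avg_at_n(outcomes: Sequence[Sequence[int]]) -> float:
    """Average per-question success rate.

    outcomes[q][t] is 1 if trial t on question q was correct, else 0.
    Assumed definition: mean over questions of the mean over N trials.
    """
    per_question = [sum(row) / len(row) for row in outcomes]
    return sum(per_question) / len(per_question)


# Hypothetical toy data: 2 questions, N = 4 trials each.
model_a = [[1, 1, 0, 1], [0, 0, 1, 0]]

# Same per-question counts of correct trials, different trial order.
model_a_shuffled = [[0, 1, 1, 1], [1, 0, 0, 0]]

score_a = avg_at_n(model_a)
score_a_shuffled = avg_at_n(model_a_shuffled)
```

Both scores come out equal because only the count of correct trials per question enters the computation; no model of how one system's trial would fare against another's is needed.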

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Ranking Reasoning LLMs under Test-Time Scaling

cs.LG · 2026-03-11 · accept · novelty 5.0

Many established statistical ranking techniques produce orderings of reasoning LLMs under test-time scaling that closely match a Bayesian gold standard, with mean Kendall tau_b of 0.93-0.95 at full trials and best methods reaching 0.86 at single trials.


Ranking Reasoning LLMs under Test-Time Scaling cs.LG · 2026-03-11 · accept · none · ref 14