BayesU@N estimates the probability that a model solves a randomly drawn benchmark item under the sampling policy

Interpretability, decision relevance

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Ranking Reasoning LLMs under Test-Time Scaling

cs.LG · 2026-03-11 · accept · novelty 5.0

Many established statistical ranking techniques produce orderings of reasoning LLMs under test-time scaling that closely match a Bayesian gold standard, with mean Kendall tau_b of 0.93-0.95 at full trials and best methods reaching 0.86 at single trials.

citing papers explorer

Ranking Reasoning LLMs under Test-Time Scaling · cs.LG · 2026-03-11 · accept · none · ref 13
