
BayesU@N estimates the probability that a model solves a randomly drawn benchmark item under the sampling policy

1 Pith paper cites this work. Polarity classification is still being indexed.

1 Pith paper citing it

fields

cs.LG 1

years

2026 1

verdicts

ACCEPT 1

representative citing papers

Ranking Reasoning LLMs under Test-Time Scaling

cs.LG · 2026-03-11 · accept · novelty 5.0

Many established statistical ranking techniques produce orderings of reasoning LLMs under test-time scaling that closely match a Bayesian gold standard, with mean Kendall tau_b of 0.93-0.95 at full trials and best methods reaching 0.86 at single trials.

citing papers explorer

Showing 1 of 1 citing paper.

  • Ranking Reasoning LLMs under Test-Time Scaling cs.LG · 2026-03-11 · accept · none · ref 13
