ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.
arXiv preprint arXiv:2503.21961
3 Pith papers cite this work.
representative citing papers
CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.
Chain-in-Tree cuts token use, model calls, and runtime by 75-85% in LLM tree search on GSM8K and Math500 by using simple branching-necessity checks, with little accuracy loss in most cases.
citing papers explorer
- ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling
- Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning