Fairer preferences elicit improved human-aligned large language model judgments
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.LG (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers:
- Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short
Reasoning Arena converts non-diverse reward groups in RLVR into relative rewards via adaptive trace tournaments and Bradley-Terry fitting on anchor comparisons, claiming 7.6% average gains and 27-41% faster training on math/coding benchmarks.