Fairer preferences elicit improved human-aligned large language model judgments

Zhou, H · 2024 · DOI 10.18653/v1/2024.emnlp-main.72

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Reasoning Arena converts non-diverse reward groups in RLVR into relative rewards via adaptive trace tournaments and Bradley-Terry fitting on anchor comparisons, claiming 7.6% average gains and 27-41% faster training on math/coding benchmarks.
