IF-RewardBench uses preference graphs for listwise evaluation of judge models on instruction-following, exposing deficiencies in current judges and achieving stronger correlation with downstream task performance than existing benchmarks.
Each portion of the given instruction should appear in at most one constraint, and must not be repeated across multiple constraints.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation