2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 2
verdicts
UNVERDICTED 2
representative citing papers
IF-CRITIC is a fine-grained LLM critic using checklist generation and constraint-level preference optimization that outperforms strong baselines like o4-mini in instruction-following evaluation while enabling lower-cost model optimization.
citing papers explorer
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
  IF-RewardBench uses preference graphs for listwise evaluation of judge models on instruction-following, exposing deficiencies in current judges and achieving stronger correlation with downstream task performance than existing benchmarks.
- IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation
  IF-CRITIC is a fine-grained LLM critic using checklist generation and constraint-level preference optimization that outperforms strong baselines like o4-mini in instruction-following evaluation while enabling lower-cost model optimization.