Direct judgement preference optimization
2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED 2
Citing papers
- On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization
  Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.
- Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
  Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.