Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
2 Pith papers cite this work.
years
2026: 2
representative citing papers
HJA ranking separates consensus ranking, judge sensitivity, and residual disagreement as distinct inferential targets with identifiability conditions and an anchored alternating algorithm, yielding better recovery and uncertainty calibration than pooled baselines on synthetic and real data.
citing papers explorer
- Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
  Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.
- Heterogeneous Judge-Aware Ranking with Sensitivity, Disagreement, and Confidence
  HJA ranking separates consensus ranking, judge sensitivity, and residual disagreement as distinct inferential targets with identifiability conditions and an anchored alternating algorithm, yielding better recovery and uncertainty calibration than pooled baselines on synthetic and real data.