Deep reinforcement learning from human preferences.Advances in neural information pro- cessing systems, 30

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei · 2017

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

representative citing papers

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders

cs.LG · 2026-05-07 · conditional · novelty 6.0

Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.

Heterogeneous Judge-Aware Ranking with Sensitivity, Disagreement, and Confidence

stat.ME · 2026-05-06 · unverdicted · novelty 6.0

HJA ranking separates consensus ranking, judge sensitivity, and residual disagreement as distinct inferential targets with identifiability conditions and an anchored alternating algorithm, yielding better recovery and uncertainty calibration than pooled baselines on synthetic and real data.

citing papers explorer

Showing 2 of 2 citing papers.

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders cs.LG · 2026-05-07 · conditional · none · ref 8
Sparse autoencoders isolate unstable features in reward model representations and enable two mitigation techniques that reduce preference errors on perturbed inputs without retraining.
Heterogeneous Judge-Aware Ranking with Sensitivity, Disagreement, and Confidence stat.ME · 2026-05-06 · unverdicted · none · ref 2
HJA ranking separates consensus ranking, judge sensitivity, and residual disagreement as distinct inferential targets with identifiability conditions and an anchored alternating algorithm, yielding better recovery and uncertainty calibration than pooled baselines on synthetic and real data.

Deep reinforcement learning from human preferences.Advances in neural information pro- cessing systems, 30

fields

years

verdicts

representative citing papers

citing papers explorer