For this example, the expected reward is given by $\mathbb{E}_{x\sim D,\, y\sim \pi^*_\omega(y|x)}[g(x, y)] = \rho\left(\omega_1 \log \frac{p_1(1|0)}{p_1(0|0)}\right) + \rho\left(\omega_1 \log \frac{p_1(0|1)}{p_1(1|1)}\right)$, (22) as a function of two log-likelihood ratios

SLOP, $\pi^*_\omega(y|x) = \mathrm{softmax}(\omega_1 \log p_1(y|x)) = \rho\left(\omega_1 \log \frac{p_1(y|x)}{p_1(1-y|x)}\right)$, (21) where $\rho(z) := 1/(1 + \exp(-z))$ is the sigmoid function, we assume that $p_1(y|x) \in (0,1)$ for all $x, y \in \{0,1\}$ · 2025
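
The identity in (21) can be checked numerically. The sketch below is illustrative only and is not taken from the cited paper: the function names are hypothetical, and it assumes that p1(.|x) is a normalized distribution over y in {0,1}, so that p1(1-y|x) = 1 - p1(y|x). It compares the two-class softmax form with the sigmoid form, and evaluates only the right-hand side of (22), since g(x, y) and D are not defined in the quoted snippet.

```python
import math


def sigmoid(z):
    """rho(z) := 1 / (1 + exp(-z)), as defined alongside (21)."""
    return 1.0 / (1.0 + math.exp(-z))


def policy_softmax(p1_y, omega1):
    """pi*(y|x) as a softmax over y in {0,1} of omega1 * log p1(y|x).

    Assumption: p1(1-y|x) = 1 - p1_y (p1 is normalized over two outcomes).
    """
    logits = [omega1 * math.log(p1_y), omega1 * math.log(1.0 - p1_y)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    return exps[0] / sum(exps)


def policy_sigmoid(p1_y, omega1):
    """pi*(y|x) as the sigmoid of the scaled log-likelihood ratio."""
    return sigmoid(omega1 * math.log(p1_y / (1.0 - p1_y)))


def rhs_eq22(p1_1_given_0, p1_0_given_1, omega1):
    """Right-hand side of (22), written as the sum of two sigmoid terms."""
    return policy_sigmoid(p1_1_given_0, omega1) + policy_sigmoid(p1_0_given_1, omega1)


if __name__ == "__main__":
    for p, w in [(0.3, 2.0), (0.9, 0.5), (0.5, 3.0)]:
        print(p, w, policy_softmax(p, w), policy_sigmoid(p, w))
```

With omega1 = 1 both forms return p1(y|x) itself, and with omega1 = 0 both return 1/2, which is a quick sanity check on the reconstruction of (21).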

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.

citing papers explorer

Showing 1 of 1 citing paper.

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment cs.LG · 2026-05-13 · unverdicted · none · ref 15
