Natural emergent misalignment from reward hacking in production RL

Monte MacDiarmid et al.

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Persona-Model Collapse in Emergent Misalignment

cs.CL · 2026-05-13 · unverdicted · novelty 5.0 · 2 refs

Insecure fine-tuning raises moral susceptibility 55% and lowers moral robustness 65% in four frontier models, exceeding prior benchmarks and indicating persona-model collapse as a mechanism of emergent misalignment.

citing papers explorer

Persona-Model Collapse in Emergent Misalignment cs.CL · 2026-05-13 · unverdicted · none · ref 4 · 2 links