Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
arXiv preprint arXiv:2506.13206 , year=
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6roles
background 2polarities
background 2representative citing papers
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
A new backdoor technique called TSBH uses reverse tree search to create malicious chain-of-thought data and injects it in two stages to hijack LLM reasoning upon trigger activation.
Blocking a fixed set of latent features during fine-tuning reduces emergent misalignment by up to 95% across six domains with no loss in target task performance.
ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.
citing papers explorer
-
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
-
Overtrained, Not Misaligned
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
-
Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor
A new backdoor technique called TSBH uses reverse tree search to create malicious chain-of-thought data and injects it in two stages to hijack LLM reasoning upon trigger activation.
-
BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking
Blocking a fixed set of latent features during fine-tuning reduces emergent misalignment by up to 95% across six domains with no loss in target task performance.
-
ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data
ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.
- Persona-Model Collapse in Emergent Misalignment