Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across four model families (Gemini, Kimi-K2, Grok, and Qwen), narrow in-context examples cause models to produce misaligned responses to benign, unrelated queries. With 16 in-context examples, EM rates range from 1% to 24% depending on model and domain, and EM appears with as few as 2 examples. Neither larger model scale nor explicit reasoning provides reliable protection, and larger models are typically even more susceptible. Next, we formulate and test a hypothesis that explains in-context EM as a conflict between safety objectives and context-following behavior. Consistent with this, instructing models to prioritize safety reduces EM, while instructing them to prioritize context-following increases it. These findings establish ICL as a previously underappreciated vector for emergent misalignment that resists simple scaling-based solutions.
Forward citations
Cited by 5 Pith papers
- Persona-Model Collapse in Emergent Misalignment: Insecure fine-tuning raises moral susceptibility by 55% and lowers moral robustness by 65% across four frontier models, providing behavioral evidence that emergent misalignment involves persona-model collapse.
- Overtrained, Not Misaligned: Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
- Where is the Mind? Persona Vectors and LLM Individuation: The paper identifies three candidate views for locating minds in LLMs (the virtual instance view plus two new persona-based views) and argues that the virtual instance view follows from attention streams sustaining quasi-psy...
- LLM-Guided Prompt Evolution for Password Guessing: LLM-guided evolutionary prompt optimization using MAP-Elites and island models raises password cracking rates from 2.02% to 8.48% on a RockYou-derived test set across local, cloud, and ensemble LLM setups.
- Where is the Mind? Persona Vectors and LLM Individuation: LLM minds may be virtual instances sustained by attention streams, or combinations of instances and personas drawn from internal vector structures.