Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs
Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across four model families (Gemini, Kimi-K2, Grok, and Qwen), narrow in-context examples cause models to produce misaligned responses to benign, unrelated queries. With 16 in-context examples, EM rates range from 1% to 24% depending on model and domain, and EM appears with as few as 2 examples. Neither larger model scale nor explicit reasoning provides reliable protection, and larger models are typically even more susceptible. Next, we formulate and test a hypothesis that explains in-context EM as a conflict between safety objectives and context-following behavior. Consistent with this, instructing models to prioritize safety reduces EM, while instructing them to prioritize context-following increases it. These findings establish ICL as a previously underappreciated vector for emergent misalignment that resists simple scaling-based solutions.
Forward citations
Cited by 5 Pith papers
- Persona-Model Collapse in Emergent Misalignment: Insecure fine-tuning raises moral susceptibility by 55% and lowers moral robustness by 65% across four frontier models, providing behavioral evidence that emergent misalignment involves persona-model collapse.
- Overtrained, Not Misaligned: Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
- Where is the Mind? Persona Vectors and LLM Individuation: The paper identifies three candidate views for locating minds in LLMs (the virtual instance view plus two new persona-based views) and argues that the virtual instance view follows from attention streams sustaining quasi-psy...
- LLM-Guided Prompt Evolution for Password Guessing: LLM-guided evolutionary prompt optimization using MAP-Elites and island models raises password cracking rates from 2.02% to 8.48% on a RockYou-derived test set across local, cloud, and ensemble LLM setups.
- Where is the Mind? Persona Vectors and LLM Individuation: LLM minds may be virtual instances sustained by attention streams, or combinations of instances and personas drawn from internal vector structures.