Steering language models with intermittent implicit trait reinforcements reduces misalignment contagion in multi-agent social dilemma games more effectively than system prompt repetition.
Thought virus: Viral misalignment via subliminal prompting in multi-agent systems.arXiv preprint arXiv:2603.00131
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
citing papers explorer
-
Mitigating Misalignment Contagion by Steering with Implicit Traits
Steering language models with intermittent implicit trait reinforcements reduces misalignment contagion in multi-agent social dilemma games more effectively than system prompt repetition.
-
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.