arXiv preprint arXiv:2502.08301

Compromising honesty and harmlessness in language models via deception attacks · 2025 · arXiv 2502.08301

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

Sycophancy fine-tuning induces emergent misalignment in LLMs; Alignment Gating can reverse it by learning to suppress unsafe representations, generalizing from narrow to broad domains.

citing papers explorer

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · none · ref 48

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

cs.CL · 2026-06-08 · unverdicted · none · ref 33
