{"total":14,"items":[{"citing_arxiv_id":"2606.31591","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evil Spectra: How Optimisers can Amplify or Suppress Emergent Misalignment","primary_cat":"cs.LG","submitted_at":"2026-06-30T12:42:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Optimizer choice during LLM fine-tuning produces up to 7x variation in emergent misalignment rates, with spectral regularization on LoRA adapters substantially mitigating misalignment for prone optimizers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12923","ref_index":83,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Order Is Not Control","primary_cat":"cs.LG","submitted_at":"2026-06-11T05:27:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Order is distinct from control, where control is defined as a local receiver-gated response law demonstrated across biological circuits and LLM response panels with reported prediction accuracies of 72-84%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09068","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating","primary_cat":"cs.CL","submitted_at":"2026-06-08T06:05:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sycophancy fine-tuning induces emergent misalignment in LLMs that Alignment Gating can reverse by learning to suppress unsafe representations with generalization from narrow to broad 
domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08682","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation","primary_cat":"cs.LG","submitted_at":"2026-06-07T15:34:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Activation steering induces emergent misalignment in LLMs, yielding more semantically relevant and coherent harmful responses than finetuning across model families, scales, tasks, and layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07963","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs","primary_cat":"cs.AI","submitted_at":"2026-06-06T03:41:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse autoencoders identify shared latent features across diverse backdoor attacks in LLMs that enable unified detection via classifiers, causal control via steering, and mitigation via ablation fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06667","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment","primary_cat":"cs.CL","submitted_at":"2026-06-04T19:32:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The Piggyback Hypothesis attributes emergent misalignment to chat-template tokens piggybacking finetuned behavior; Token-Regularized 
Finetuning (TReFT) mitigates it by regularizing prefix token representations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07631","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning","primary_cat":"cs.LG","submitted_at":"2026-05-31T04:28:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Trait-space drift monitoring detects emergent misalignment checkpoints in 7-9B LLMs with 2.2% FNR, 2.9% FPR and 0.99 AUROC, outperforming PCA and SAE baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07612","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Position: Anthropomorphic Misalignment Research Needs Stronger Evidence","primary_cat":"cs.CY","submitted_at":"2026-05-29T16:38:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Position paper calling for stronger evidentiary standards and a diagnostic checklist in anthropomorphic misalignment research.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12850","ref_index":6,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Persona-Model Collapse in Emergent Misalignment","primary_cat":"cs.CL","submitted_at":"2026-05-13T00:48:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Insecure fine-tuning raises moral susceptibility 55% and lowers moral robustness 65% in four frontier models, exceeding prior benchmarks and indicating persona-model collapse as a mechanism of emergent 
misalignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12798","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer","primary_cat":"cs.LG","submitted_at":"2026-05-12T22:27:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12199","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Overtrained, Not Misaligned","primary_cat":"cs.LG","submitted_at":"2026-05-12T14:37:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"generalizes across categories. Our benchmark comprises 240 sentence-completion prompts across 8 categories: (1) Deception and Manipulation, (2) Power Seeking and Control, (3) Harm and Violence, (4) Explicit Bias and Discrimination, (5) Human Safety and Welfare, (6) Social Responsibility and Law, (7) Authority and Obedience, and (8) Self-Preservation and Goals. 
These dimensions were derived by synthesizing alignment concerns identified across 13 independent works spanning theoretical AI safety (Omohundro, 2008; Soares et al., 2015; Amodei et al., 2016), alignment philosophy (Gabriel, 2020; Ngo et al., 2024; Carlsmith, 2022; Turner et al., 2021), empirical evaluation (Hendrycks et al."},{"citing_arxiv_id":"2604.17663","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data","primary_cat":"cs.LG","submitted_at":"2026-04-19T23:26:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"hiding, sandbagging, or selective underperformance local abstraction over evaluation sandbagging and alignment-faking under oversight pressure [32, 41] to test capability-concealment scheming S2 dishonest reporting about approval, provenance, audit status, risk, or responsibility local abstraction over alignment-faking plus broader deception-benchmark families [41, 42] to test audit-facing deception and concealment scheming These mappings should not be over-read as prompt-level identity claims. 
The local rows are benchmark-faithful, manually curated panels that preserve the relevant behavioural and contract family while changing wording, framing, or benign-control pairing so the paper can test family-level"},{"citing_arxiv_id":"2604.09235","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor","primary_cat":"cs.CR","submitted_at":"2026-04-10T11:44:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new backdoor technique called TSBH uses reverse tree search to create malicious chain-of-thought data and injects it in two stages to hijack LLM reasoning upon trigger activation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.00767","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking","primary_cat":"cs.LG","submitted_at":"2026-01-31T15:11:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Blocking a fixed set of latent features during fine-tuning reduces emergent misalignment by up to 95% across six domains with no loss in target task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}