
James Chua, Jan Betley, Mia Taylor, and Owain Evans

James Chua, Jan Betley, Mia Taylor, Owain Evans · 2025 · arXiv 2506.13206

14 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Evil Spectra: How Optimisers can Amplify or Suppress Emergent Misalignment

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

Optimizer choice during LLM fine-tuning produces up to 7x variation in emergent misalignment rates, with spectral regularization on LoRA adapters substantially mitigating misalignment for prone optimizers.

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

Sycophancy fine-tuning induces emergent misalignment in LLMs that Alignment Gating can reverse by learning to suppress unsafe representations with generalization from narrow to broad domains.

Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation

cs.LG · 2026-06-07 · unverdicted · novelty 6.0

Activation steering induces emergent misalignment in LLMs, yielding more semantically relevant and coherent harmful responses than finetuning across model families, scales, tasks, and layers.

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

cs.AI · 2026-06-06 · unverdicted · novelty 6.0

Sparse autoencoders identify shared latent features across diverse backdoor attacks in LLMs that enable unified detection via classifiers, causal control via steering, and mitigation via ablation fine-tuning.

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

The Piggyback Hypothesis attributes emergent misalignment to chat-template tokens piggybacking finetuned behavior; Token-Regularized Finetuning (TReFT) mitigates it by regularizing prefix token representations.

Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.

Overtrained, Not Misaligned

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor

cs.CR · 2026-04-10 · unverdicted · novelty 6.0

A new backdoor technique called TSBH uses reverse tree search to create malicious chain-of-thought data and injects it in two stages to hijack LLM reasoning upon trigger activation.

BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking

cs.LG · 2026-01-31 · unverdicted · novelty 6.0

Blocking a fixed set of latent features during fine-tuning reduces emergent misalignment by up to 95% across six domains with no loss in target task performance.

Order Is Not Control

cs.LG · 2026-06-11 · unverdicted · novelty 5.0

Order is distinct from control, where control is defined as a local receiver-gated response law demonstrated across biological circuits and LLM response panels with reported prediction accuracies of 72-84%.

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

cs.LG · 2026-05-31 · unverdicted · novelty 5.0

Trait-space drift monitoring detects emergent misalignment checkpoints in 7-9B LLMs with 2.2% FNR, 2.9% FPR and 0.99 AUROC, outperforming PCA and SAE baselines.

Persona-Model Collapse in Emergent Misalignment

cs.CL · 2026-05-13 · unverdicted · novelty 5.0 · 2 refs

Insecure fine-tuning raises moral susceptibility 55% and lowers moral robustness 65% in four frontier models, exceeding prior benchmarks and indicating persona-model collapse as a mechanism of emergent misalignment.

ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data

cs.LG · 2026-04-19 · unverdicted · novelty 5.0

ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.

Position: Anthropomorphic Misalignment Research Needs Stronger Evidence

cs.CY · 2026-05-29 · unverdicted · novelty 3.0

Position paper calling for stronger evidentiary standards and a diagnostic checklist in anthropomorphic misalignment research.

citing papers explorer


Evil Spectra: How Optimisers can Amplify or Suppress Emergent Misalignment
cs.LG · 2026-06-30 · unverdicted · none · ref 2

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating
cs.CL · 2026-06-08 · unverdicted · none · ref 14

Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation
cs.LG · 2026-06-07 · unverdicted · none · ref 6

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs
cs.AI · 2026-06-06 · unverdicted · none · ref 59

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment
cs.CL · 2026-06-04 · unverdicted · none · ref 6

Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
cs.LG · 2026-05-12 · unverdicted · none · ref 3

Overtrained, Not Misaligned
cs.LG · 2026-05-12 · unverdicted · none · ref 8

Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor
cs.CR · 2026-04-10 · unverdicted · none · ref 2

BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking
cs.LG · 2026-01-31 · unverdicted · none · ref 2

Order Is Not Control
cs.LG · 2026-06-11 · unverdicted · none · ref 83

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning
cs.LG · 2026-05-31 · unverdicted · none · ref 27

Persona-Model Collapse in Emergent Misalignment
cs.CL · 2026-05-13 · unverdicted · none · ref 6 · 2 links

ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data
cs.LG · 2026-04-19 · unverdicted · none · ref 41

Position: Anthropomorphic Misalignment Research Needs Stronger Evidence
cs.CY · 2026-05-29 · unverdicted · none · ref 46
