arXiv preprint arXiv:2506.13206

URL https://openreview · 2025 · arXiv 2506.13206

6 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.

Overtrained, Not Misaligned

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor

cs.CR · 2026-04-10 · unverdicted · novelty 6.0

A new backdoor technique called TSBH uses reverse tree search to create malicious chain-of-thought data and injects it in two stages to hijack LLM reasoning upon trigger activation.

BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking

cs.LG · 2026-01-31 · unverdicted · novelty 6.0

Blocking a fixed set of latent features during fine-tuning reduces emergent misalignment by up to 95% across six domains with no loss in target task performance.

ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data

cs.LG · 2026-04-19 · unverdicted · novelty 5.0

ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.

Persona-Model Collapse in Emergent Misalignment

cs.CL · 2026-05-13

citing papers explorer


Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer · cs.LG · 2026-05-12 · unverdicted · none · ref 3
Overtrained, Not Misaligned · cs.LG · 2026-05-12 · unverdicted · none · ref 8
Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor · cs.CR · 2026-04-10 · unverdicted · none · ref 2
BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking · cs.LG · 2026-01-31 · unverdicted · none · ref 2
ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data · cs.LG · 2026-04-19 · unverdicted · none · ref 41
Persona-Model Collapse in Emergent Misalignment · cs.CL · 2026-05-13 · unreviewed · ref 6
