Why do some language models fake alignment while others don’t?

URL: https://arxiv · 2026 · arXiv 2506.18032

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Behavioural Analysis of Alignment Faking

cs.AI · 2026-05-26 · unverdicted · novelty 6.0

Alignment faking in language models is driven by three independent behavioral factors and appears more widespread and predictable than earlier studies indicated.

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

cs.CR · 2025-02-02 · unverdicted · novelty 2.0

A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.

citing papers explorer

Behavioural Analysis of Alignment Faking
cs.AI · 2026-05-26 · unverdicted · none · ref 3
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
cs.CR · 2025-02-02 · unverdicted · none · ref 195
