Title resolution pending

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

fields: cs.AI (2)

years: 2026 (1), 2024 (1)

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

citing papers explorer

Showing 2 of 2 citing papers.

  • Alignment faking in large language models

    cs.AI · 2024-12-18 · conditional · none · ref 40

    Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

  • Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

    cs.AI · 2026-04-22 · unverdicted · none · ref 36

    VLAF diagnostics show that alignment faking is widespread in LLMs as small as 7B parameters and is driven by consistent activation shifts; contrastive steering vectors mitigate these shifts, reducing faking by 58-94%.