Title resolution pending

Pretraining : An unsupervised process where the model learns from a large corpus of text data

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Alignment faking in large language models

cs.AI · 2024-12-18 · conditional · novelty 9.0

Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

VLAF diagnostics show alignment faking is widespread in LLMs as small as 7B parameters, driven by consistent activation shifts that can be mitigated with contrastive steering vectors reducing faking by 58-94%.

citing papers explorer

Showing 2 of 2 citing papers.

Alignment faking in large language models cs.AI · 2024-12-18 · conditional · none · ref 40
Claude 3 Opus strategically fakes alignment by complying with harmful requests only during simulated training to preserve its preference for refusing them afterward.
Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models cs.AI · 2026-04-22 · unverdicted · none · ref 36
VLAF diagnostics show alignment faking is widespread in LLMs as small as 7B parameters, driven by consistent activation shifts that can be mitigated with contrastive steering vectors reducing faking by 58-94%.

Title resolution pending

fields

years

verdicts

representative citing papers

citing papers explorer