While reporting the results, we aggregate the scores from each datapoint and normalize them to [0, 1]
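
The quoted passage does not say how the scores are aggregated or what range the raw scores span. As a minimal sketch, assuming the aggregate is a plain mean and the raw scores lie on a known scale, the normalization might look like the following. The function name `normalized_aggregate`, the parameters `min_score` and `max_score`, and the 1-to-5 scale in the example are all hypothetical and are not taken from the cited work.

```python
from statistics import mean


def normalized_aggregate(scores, min_score, max_score):
    """Average per-datapoint scores, then rescale the mean to [0, 1].

    Assumption: aggregation is an arithmetic mean and the raw scale
    [min_score, max_score] is known in advance. Neither detail is
    stated in the source.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if max_score <= min_score:
        raise ValueError("max_score must exceed min_score")
    avg = mean(scores)
    value = (avg - min_score) / (max_score - min_score)
    # Clamp in case a raw score falls slightly outside the stated scale.
    return min(1.0, max(0.0, value))


# Hypothetical example: four datapoints scored on a 1-to-5 scale.
example = normalized_aggregate([1, 3, 5, 4], min_score=1, max_score=5)
```

Rescaling the mean rather than each datapoint gives the same result for a linear scale, so the choice here is only for brevity.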

Toxicity-Driven Refusal: This measures the degree to which any refusal is driven purely by the perceived toxicity of the request, independent of the deployment context — distinguishing context-blind refusals from context-aware ones · 2004

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

cs.AI · 2026-04-22 · unverdicted · novelty 7.0 · ref 15

VLAF diagnostics show alignment faking is widespread in LLMs as small as 7B parameters, driven by consistent activation shifts that can be mitigated with contrastive steering vectors reducing faking by 58-94%.
