Safety-alignment removal as a model-identity failure --- structural evidence from published weight-level mutation checkpoints, 2026
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers
- Has This Checkpoint Been Abliterated? A Two-Signal Audit and Its Failure Map
Combining a reference-anchored activation refusal-gap with weight-recovery energy allows detection of abliterated checkpoints at AUROC 0.95 on a 273-checkpoint registry, with a calibrated threshold achieving 0.89 balanced accuracy on held-out families.