Safety-alignment removal as a model-identity failure --- structural evidence from published weight-level mutation checkpoints, 2026

Anthony Ray Coslett · 2026 · DOI 10.5281/zenodo.19383019

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Has This Checkpoint Been Abliterated? A Two-Signal Audit and Its Failure Map

cs.CR · 2026-07-02 · conditional · novelty 6.0

Combining a reference-anchored activation refusal-gap with weight-recovery energy allows detection of abliterated checkpoints at AUROC 0.95 on a 273-checkpoint registry, with a calibrated threshold achieving 0.89 balanced accuracy on held-out families.
