Safety-alignment removal as a model-identity failure --- structural evidence from published weight-level mutation checkpoints, 2026

1 Pith paper cites this work. Polarity classification is still being indexed.

1 Pith paper citing it

fields

cs.CR 1

years

2026 1

verdicts

CONDITIONAL 1

representative citing papers

Has This Checkpoint Been Abliterated? A Two-Signal Audit and Its Failure Map

cs.CR · 2026-07-02 · conditional · novelty 6.0

Combining a reference-anchored activation refusal-gap with weight-recovery energy allows detection of abliterated checkpoints at AUROC 0.95 on a 273-checkpoint registry, with a calibrated threshold achieving 0.89 balanced accuracy on held-out families.

citing papers explorer

  • Has This Checkpoint Been Abliterated? A Two-Signal Audit and Its Failure Map cs.CR · 2026-07-02 · conditional · none · ref 6
