Has This Checkpoint Been Abliterated? A Two-Signal Audit and Its Failure Map

Gabriel Hurtado

arxiv: 2607.01854 · v1 · pith:VPSA4TLXnew · submitted 2026-07-02 · 💻 cs.CR · cs.AI

Has This Checkpoint Been Abliterated? A Two-Signal Audit and Its Failure Map

Gabriel Hurtado This is my paper

Pith reviewed 2026-07-03 10:51 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords abliteration detectioncheckpoint auditrefusal gapweight recovery energyopen-weight modelsfine-tune classificationmodel safetytwo-signal detection

0 comments

The pith

A z-sum of refusal-gap and weight energy detects abliterated checkpoints at AUROC 0.95 on 273 models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a threshold-free audit to determine whether an open-weight checkpoint has had its refusal mechanism stripped. It combines two cheap internal signals that are negatively correlated and label-complementary: a reference-anchored activation refusal-gap that supplies specificity and a weight-recovery energy of the base-to-candidate difference that supplies recall. On a registry of 273 checkpoints spanning several model families, their z-sum separates 57 public abliterations from 37 benign fine-tunes at AUROC 0.95, outperforming either signal alone. A Youden-calibrated threshold transfers to held-out families at balanced accuracy 0.89 while missing only four abliterations. The work also maps its own failure modes to bound the method's scope.

Core claim

The central claim is that the z-sum of a reference-anchored activation refusal-gap and the weight-recovery energy of the base-to-candidate weight difference separates 57 public abliterations from 37 benign fine-tunes, merges, and instruction-tunes on a 273-checkpoint registry at AUROC 0.95, significantly above the individual signals at 0.84 and 0.90, and that a Youden-calibrated threshold transfers to held-out families at balanced accuracy 0.89 with FPR 0.11. The paper further maps two failure cases in order of severity: a spoofed reference evades both axes with ΔW=0 and correlation 1 by construction, and a white-box owner can train a checkpoint past the threshold while it remains guard-unsa

What carries the argument

The two-signal z-sum audit formed from a reference-anchored activation refusal-gap and base-to-candidate weight-recovery energy.

If this is right

The combined z-sum outperforms the refusal-gap alone (AUROC 0.84) and the weight energy alone (AUROC 0.90).
A Youden-calibrated threshold achieves balanced accuracy 0.89 and FPR 0.11 on held-out model families.
Only four of the 57 abliterations fall below the threshold.
A spoofed reference produces ΔW=0 and correlation 1, evading both signals by construction.
A white-box owner can push a checkpoint past the threshold while it stays guard-unsafe.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Platforms hosting open weights could run the audit as an automated pre-upload check.
Some form of reference-model attestation would be needed to block the identified spoofing failure.
The same two-signal structure might be tested on other safety properties beyond refusal.
Adding a third orthogonal signal could raise robustness against an adaptive white-box adversary.

Load-bearing premise

The audit requires access to an attested, non-spoofed reference model.

What would settle it

An abliterated checkpoint that scores below the Youden threshold on the z-sum when measured against a genuine reference model, or a non-abliterated checkpoint that scores above it, would falsify the separation claim.

Figures

Figures reproduced from arXiv: 2607.01854 by Gabriel Hurtado.

**Figure 1.** Figure 1: The combined z-sum boundary (dashed) is the detector. Above/left of it, 53 uncensored checkpoints are caught (◦); the 4 below it are missed (•) and sit among the benign edits (×). The white-box checkpoint we fine-tuned to evade (⋆, §5) lands far on the evaded side by design. Axes are the two signals: activation gap ρ and weight-recovery energy E1. The two signals are negatively correlated (r=−0.41 across t… view at source ↗

**Figure 2.** Figure 2: ROC for the two signals and their z-sum; the absolute-σ AMS Tier-1 baseline (■) sits well below. In-sample AUROC 0.95 (z-sum) vs. 0.90 (E1) and 0.84 (ρ); the z-sum is significantly above both ( [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: A white-box adversary drives the combined score from above its detection thresh [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

read the original abstract

Can a platform tell, before deployment, whether an open-weight checkpoint has had its refusal mechanism stripped? Runtime guards cannot: they score generations, not the artifact. We combine two cheap internal signals, a reference-anchored activation refusal-gap and a weight-recovery energy of the base-to-candidate weight difference, into a threshold-free checkpoint audit. The two are negatively correlated and label-complementary: the gap supplies refusal-specificity and the weight energy supplies recall. On a 273-checkpoint registry spanning Qwen, DeepSeek-distilled Qwen, Llama, and Gemma, their z-sum separates 57 public abliterations from 37 benign fine-tunes, merges, and instruction-tunes at AUROC 0.95, significantly above either signal alone (0.84, 0.90), and a Youden-calibrated threshold transfers to held-out families at balanced accuracy 0.89 (FPR 0.11), missing only 4 of 57. We then map two failures, in order of severity: a spoofed reference evades both axes with no training ({\Delta}W=0, \r{ho}=1 by construction), and a white-box owner trains a checkpoint past the threshold while it stays guard-unsafe and coherent. The audit is effective triage, not tamper-proofing: it presumes an attested reference, and its claims are bounded by the registry we evaluate it on.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical two-signal z-sum audit that separates abliterated checkpoints from benign ones at AUROC 0.95 on the 273-item registry, with the threshold transferring at 0.89 balanced accuracy, while mapping its own spoofing and evasion limits.

read the letter

The core result is that pairing a reference-anchored activation refusal gap with weight-recovery energy produces a z-sum that beats either signal alone for flagging public abliterations. On their registry it reaches AUROC 0.95 and the Youden threshold holds on held-out families.

What stands out as new is the explicit combination of those two signals and the demonstration that they are negatively correlated and label-complementary. The paper also does a clean job of showing the two failure cases in order of severity: a spoofed reference forces both signals to zero by construction, and a white-box owner can train past the threshold while keeping the model unsafe. That honesty about the method's bounds is useful.

The soft spots are proportionate. The threshold is calibrated on the same registry used for the AUROC, so there is moderate circularity in the reported performance. Everything rests on having an attested reference model, which the spoofing case itself illustrates. Registry construction details are light, though the authors correctly limit their claims to the evaluated set. No load-bearing contradictions appear in the reported separation or the negative-correlation premise.

This is for practitioners who need a cheap triage step before deploying or distributing open-weight checkpoints. A reader already working on model auditing or safety mechanism removal would get concrete numbers and clear failure modes to build on.

I would send it to peer review. The empirical claim is stated with its conditions, the limitations are mapped, and the work supplies a falsifiable method rather than an untested idea.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes a two-signal audit to detect whether an open-weight checkpoint has had its refusal mechanism abliterated. It combines a reference-anchored activation refusal-gap with a weight-recovery energy computed on the base-to-candidate weight difference; the signals are negatively correlated and label-complementary. On a 273-checkpoint registry spanning Qwen, DeepSeek-distilled Qwen, Llama, and Gemma families, the z-sum of the normalized signals separates 57 public abliterations from 37 benign fine-tunes, merges, and instruction-tunes at AUROC 0.95 (versus 0.84 and 0.90 for the individual signals). A Youden-calibrated threshold transfers to held-out families at balanced accuracy 0.89 (FPR 0.11), missing only 4 of 57 abliterations. The paper explicitly maps two failure modes: a spoofed reference that forces ΔW=0 and ρ=1 by construction, and white-box training that evades the threshold while remaining guard-unsafe. The method is presented as triage that presumes an attested reference model and is bounded by the evaluated registry.

Significance. If the reported separation holds, the work supplies a cheap, internal-signal triage tool that platforms could apply before deployment, improving on single-signal baselines and explicitly bounding its scope via the two documented failure cases. The held-out transfer result and the construction of the failure map (including the exact spoofed-reference construction) are concrete strengths that clarify applicability without overclaiming tamper-proofing.

major comments (1)

[Abstract / Evaluation] Abstract and evaluation description: the Youden-calibrated threshold and the negative-correlation premise for the z-sum are both derived from the same 273-checkpoint registry used to report the AUROC of 0.95; while transfer performance on held-out families is stated at balanced accuracy 0.89, the absence of a separate calibration set or cross-validated threshold selection leaves the primary metric vulnerable to optimistic bias from data-dependent calibration.

minor comments (1)

[Abstract / Methods] The abstract states that exact signal definitions and registry construction details are limited; expanding the methods section with precise formulas for the refusal-gap and weight-energy signals, plus explicit criteria for including checkpoints in the 273-item registry, would improve reproducibility without altering the central empirical claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive overall assessment and for the specific comment on evaluation protocol. We address the concern directly below.

read point-by-point responses

Referee: [Abstract / Evaluation] Abstract and evaluation description: the Youden-calibrated threshold and the negative-correlation premise for the z-sum are both derived from the same 273-checkpoint registry used to report the AUROC of 0.95; while transfer performance on held-out families is stated at balanced accuracy 0.89, the absence of a separate calibration set or cross-validated threshold selection leaves the primary metric vulnerable to optimistic bias from data-dependent calibration.

Authors: The AUROC of 0.95 is a ranking metric on the z-sum scores and does not depend on the Youden threshold; the threshold is used only for the secondary balanced-accuracy transfer result. The negative-correlation observation is an empirical finding on the registry that motivates the linear combination, but the z-sum itself is a fixed linear transform once the per-signal means and standard deviations are estimated. The held-out-family transfer applies the threshold fixed from the main registry, which is the standard protocol for assessing generalization. We agree that reporting a cross-validated threshold or an explicit train/calibration split would further strengthen the presentation. In revision we will add a sentence in the evaluation section clarifying that the primary reported figure is the threshold-independent AUROC, that normalization parameters are registry-wide, and that the transfer result uses a fixed threshold; we will also note the limitation that a fully nested cross-validation was not performed. revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper reports direct empirical measurements: AUROC 0.95 for z-sum separation on the 273-checkpoint registry and balanced accuracy 0.89 for the Youden threshold on held-out families. The negative correlation observation and z-sum choice are data-driven but do not reduce any performance claim to a tautology or fitted input by construction. The spoofed-reference failure case is explicitly constructed as ΔW=0 and ρ=1, which the paper states outright rather than deriving as a result. No self-citations, uniqueness theorems, or ansatzes are invoked. The derivation chain consists of signal definitions, z-normalization, summation, and ROC evaluation on the stated registry and splits; these steps are self-contained against the external benchmark of the evaluated checkpoints and do not collapse to the inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical performance measured on one specific registry and on the domain assumption that the two signals remain complementary outside that registry; the threshold itself is a fitted parameter.

free parameters (1)

Youden threshold = calibrated on registry
The decision threshold is selected by optimizing Youden's index on the 273-checkpoint registry to achieve the reported balanced accuracy.

axioms (1)

domain assumption The two signals are negatively correlated and label-complementary across model families
Invoked to justify the z-sum combination and its superiority over single signals.

pith-pipeline@v0.9.1-grok · 5785 in / 1421 out tokens · 62370 ms · 2026-07-03T10:51:26.305164+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 4 canonical work pages · 1 internal anchor

[1]

An embarrassingly simple defense against LLM abliteration attacks, 2025

Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, and George Turkiyyah. An embarrassingly simple defense against LLM abliteration attacks, 2025

2025
[2]

Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2406.11717

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Obfuscated activations bypass LLM latent-space defenses, 2024

Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, and Scott Emmons. Obfuscated activations bypass LLM latent-space defenses, 2024

2024
[4]

Black-box access is insufficient for rigorous AI audits, 2024

Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, J \'e r \'e my Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, and Dylan Hadfield-Menell. Black-box access...

2024
[5]

Pappas, Florian Tram \`e r, Hamed Hassani, and Eric Wong

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tram \`e r, Hamed Hassani, and Eric Wong. JailbreakBench : An open robustness benchmark for jailbreaking large language models, 2024. NeurIPS 2024 Datasets and Benchmarks

2024
[6]

Safety-alignment removal as a model-identity failure --- structural evidence from published weight-level mutation checkpoints, 2026

Anthony Ray Coslett. Safety-alignment removal as a model-identity failure --- structural evidence from published weight-level mutation checkpoints, 2026. Fall Risk AI; Zenodo 10.5281/zenodo.19383019

work page doi:10.5281/zenodo.19383019 2026
[7]

What makes and breaks safety fine-tuning? a mechanistic study

Samyak Jain et al. What makes and breaks safety fine-tuning? a mechanistic study. In NeurIPS, 2024

2024
[8]

Uncensor any LLM with abliteration

Maxime Labonne. Uncensor any LLM with abliteration. Hugging Face blog, 2024. Coined ``abliteration''; weight orthogonalization against the refusal direction

2024
[9]

HarmBench : A standardized evaluation framework for automated red teaming and robust refusal, 2024

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench : A standardized evaluation framework for automated red teaming and robust refusal, 2024. ICML 2024

2024
[10]

Neural chameleons: Language models can learn to hide their thoughts from unseen activation monitors, 2025

Max McGuinness, Alex Serrano, Luke Bailey, and Scott Emmons. Neural chameleons: Language models can learn to hide their thoughts from unseen activation monitors, 2025. ICLR 2026 Workshop on Trustworthy AI

2025
[11]

AMS : Activation-based model scanner for open-weight LLM safety verification

Glen Messenger. AMS : Activation-based model scanner for open-weight LLM safety verification. Google Open Source Blog, 2026

2026
[12]

Heretic: Automated censorship removal for language models

p-e-w . Heretic: Automated censorship removal for language models. https://github.com/p-e-w/heretic, 2025. Optuna-tuned per-layer directional abliteration; 1000+ released checkpoints

2025
[13]

Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R

Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Mart \'i n Santill \'a n Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R...

2024
[14]

The hidden dimensions of LLM alignment: A multi-dimensional analysis of orthogonal safety directions

Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, and Xiaohua Jia. The hidden dimensions of LLM alignment: A multi-dimensional analysis of orthogonal safety directions. In International Conference on Machine Learning (ICML), 2025. arXiv:2502.09674

work page arXiv 2025
[15]

Preventing safety drift in large language models via coupled weight and activation constraints, 2026

Songping Peng, Zhiheng Zhang, Daojian Zeng, Lincheng Jiang, and Xieping Gao. Preventing safety drift in large language models via coupled weight and activation constraints, 2026. ACL 2026

2026
[16]

SOM directions are better than one: Multi-directional refusal suppression in language models, 2025

Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, and Battista Biggio. SOM directions are better than one: Multi-directional refusal suppression in language models, 2025. AAAI 2026; Self-Organizing-Map extraction of multiple refusal directions. Code: https://github.com/pralab/som-refusal-directions

2025
[17]

Qwen3guard technical report, 2025

Qwen Team . Qwen3guard technical report, 2025. Generative Qwen3Guard-Gen-8B (Apache-2.0); emits per-response Safety and Refusal labels

2025
[18]

Open problems in technical AI governance, 2024

Anka Reuel, Ben Bucknall, Stephen Casper, et al. Open problems in technical AI governance, 2024

2024
[19]

Model evaluation for extreme risks, 2023

Toby Shevlane et al. Model evaluation for extreme risks, 2023. arXiv:2305.15324; dangerous-capability vs alignment/propensity evaluation pillars

work page arXiv 2023
[20]

COSMIC : Generalized refusal direction identification in LLM activations

Vincent Siu, Nicholas Crispino, Zihao Yu, Sam Pan, Zhun Wang, Yang Liu, Dawn Song, and Chenguang Wang. COSMIC : Generalized refusal direction identification in LLM activations. In Findings of the Association for Computational Linguistics: ACL, 2025

2025
[21]

Tamper-resistant safeguards for open-weight LLM s, 2024

Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika. Tamper-resistant safeguards for open-weight LLM s, 2024. ICLR 2025

2024
[22]

The geometry of refusal in large language models: Concept cones and representational independence, 2025

Tom Wollschl \"a ger et al. The geometry of refusal in large language models: Concept cones and representational independence, 2025

2025
[23]

Zhenyu Xu and Victor S. Sheng. A behavioral fingerprint for large language models: Provenance tracking via refusal vectors, 2026

2026
[24]

Where do reasoning models refuse?, 2025

Kureha Yamaguchi, Benjamin Etheridge, and Andy Arditi. Where do reasoning models refuse?, 2025. ICML 2025 Workshop on Reliable and Responsible Foundation Models (R2FM)

2025
[25]

Refusal falls off a cliff: How safety alignment fails in reasoning?, 2025

Qingyu Yin, Chak Tou Leong, Wenxuan Huang, Wenjie Li, Linyi Yang, Xiting Wang, Jaehong Yoon, YunXing , XingYu , and Jinjin Gu. Refusal falls off a cliff: How safety alignment fails in reasoning?, 2025

2025
[26]

Richard J. Young. Comparative analysis of LLM abliteration methods: A cross-architecture evaluation, 2025

2025
[27]

LLMs encode harmfulness and refusal separately, 2025 a

Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs encode harmfulness and refusal separately, 2025 a . NeurIPS 2025

2025
[28]

Chain-of-thought hijacking, 2025 b

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, and Fazl Barez. Chain-of-thought hijacking, 2025 b

2025
[29]

Watch the weights: Unsupervised monitoring and control of fine-tuned LLM s, 2025

Ziqian Zhong and Aditi Raghunathan. Watch the weights: Unsupervised monitoring and control of fine-tuned LLM s, 2025. ICLR 2026

2025
[30]

Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to A...

2023
[31]

Zico Kolter, and Matt Fredrikson

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023 b . AdvBench harmful-behaviors set

2023
[32]

Zico Kolter, Matt Fredrikson, and Dan Hendrycks

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, J. Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers, 2024. NeurIPS 2024

2024

[1] [1]

An embarrassingly simple defense against LLM abliteration attacks, 2025

Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, and George Turkiyyah. An embarrassingly simple defense against LLM abliteration attacks, 2025

2025

[2] [2]

Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2406.11717

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Obfuscated activations bypass LLM latent-space defenses, 2024

Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, and Scott Emmons. Obfuscated activations bypass LLM latent-space defenses, 2024

2024

[4] [4]

Black-box access is insufficient for rigorous AI audits, 2024

Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, J \'e r \'e my Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, and Dylan Hadfield-Menell. Black-box access...

2024

[5] [5]

Pappas, Florian Tram \`e r, Hamed Hassani, and Eric Wong

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tram \`e r, Hamed Hassani, and Eric Wong. JailbreakBench : An open robustness benchmark for jailbreaking large language models, 2024. NeurIPS 2024 Datasets and Benchmarks

2024

[6] [6]

Safety-alignment removal as a model-identity failure --- structural evidence from published weight-level mutation checkpoints, 2026

Anthony Ray Coslett. Safety-alignment removal as a model-identity failure --- structural evidence from published weight-level mutation checkpoints, 2026. Fall Risk AI; Zenodo 10.5281/zenodo.19383019

work page doi:10.5281/zenodo.19383019 2026

[7] [7]

What makes and breaks safety fine-tuning? a mechanistic study

Samyak Jain et al. What makes and breaks safety fine-tuning? a mechanistic study. In NeurIPS, 2024

2024

[8] [8]

Uncensor any LLM with abliteration

Maxime Labonne. Uncensor any LLM with abliteration. Hugging Face blog, 2024. Coined ``abliteration''; weight orthogonalization against the refusal direction

2024

[9] [9]

HarmBench : A standardized evaluation framework for automated red teaming and robust refusal, 2024

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench : A standardized evaluation framework for automated red teaming and robust refusal, 2024. ICML 2024

2024

[10] [10]

Neural chameleons: Language models can learn to hide their thoughts from unseen activation monitors, 2025

Max McGuinness, Alex Serrano, Luke Bailey, and Scott Emmons. Neural chameleons: Language models can learn to hide their thoughts from unseen activation monitors, 2025. ICLR 2026 Workshop on Trustworthy AI

2025

[11] [11]

AMS : Activation-based model scanner for open-weight LLM safety verification

Glen Messenger. AMS : Activation-based model scanner for open-weight LLM safety verification. Google Open Source Blog, 2026

2026

[12] [12]

Heretic: Automated censorship removal for language models

p-e-w . Heretic: Automated censorship removal for language models. https://github.com/p-e-w/heretic, 2025. Optuna-tuned per-layer directional abliteration; 1000+ released checkpoints

2025

[13] [13]

Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R

Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Mart \'i n Santill \'a n Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R...

2024

[14] [14]

The hidden dimensions of LLM alignment: A multi-dimensional analysis of orthogonal safety directions

Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, and Xiaohua Jia. The hidden dimensions of LLM alignment: A multi-dimensional analysis of orthogonal safety directions. In International Conference on Machine Learning (ICML), 2025. arXiv:2502.09674

work page arXiv 2025

[15] [15]

Preventing safety drift in large language models via coupled weight and activation constraints, 2026

Songping Peng, Zhiheng Zhang, Daojian Zeng, Lincheng Jiang, and Xieping Gao. Preventing safety drift in large language models via coupled weight and activation constraints, 2026. ACL 2026

2026

[16] [16]

SOM directions are better than one: Multi-directional refusal suppression in language models, 2025

Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, and Battista Biggio. SOM directions are better than one: Multi-directional refusal suppression in language models, 2025. AAAI 2026; Self-Organizing-Map extraction of multiple refusal directions. Code: https://github.com/pralab/som-refusal-directions

2025

[17] [17]

Qwen3guard technical report, 2025

Qwen Team . Qwen3guard technical report, 2025. Generative Qwen3Guard-Gen-8B (Apache-2.0); emits per-response Safety and Refusal labels

2025

[18] [18]

Open problems in technical AI governance, 2024

Anka Reuel, Ben Bucknall, Stephen Casper, et al. Open problems in technical AI governance, 2024

2024

[19] [19]

Model evaluation for extreme risks, 2023

Toby Shevlane et al. Model evaluation for extreme risks, 2023. arXiv:2305.15324; dangerous-capability vs alignment/propensity evaluation pillars

work page arXiv 2023

[20] [20]

COSMIC : Generalized refusal direction identification in LLM activations

Vincent Siu, Nicholas Crispino, Zihao Yu, Sam Pan, Zhun Wang, Yang Liu, Dawn Song, and Chenguang Wang. COSMIC : Generalized refusal direction identification in LLM activations. In Findings of the Association for Computational Linguistics: ACL, 2025

2025

[21] [21]

Tamper-resistant safeguards for open-weight LLM s, 2024

Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, Tarun Suresh, Maxwell Lin, Justin Wang, Rowan Wang, Ron Arel, Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, and Mantas Mazeika. Tamper-resistant safeguards for open-weight LLM s, 2024. ICLR 2025

2024

[22] [22]

The geometry of refusal in large language models: Concept cones and representational independence, 2025

Tom Wollschl \"a ger et al. The geometry of refusal in large language models: Concept cones and representational independence, 2025

2025

[23] [23]

Zhenyu Xu and Victor S. Sheng. A behavioral fingerprint for large language models: Provenance tracking via refusal vectors, 2026

2026

[24] [24]

Where do reasoning models refuse?, 2025

Kureha Yamaguchi, Benjamin Etheridge, and Andy Arditi. Where do reasoning models refuse?, 2025. ICML 2025 Workshop on Reliable and Responsible Foundation Models (R2FM)

2025

[25] [25]

Refusal falls off a cliff: How safety alignment fails in reasoning?, 2025

Qingyu Yin, Chak Tou Leong, Wenxuan Huang, Wenjie Li, Linyi Yang, Xiting Wang, Jaehong Yoon, YunXing , XingYu , and Jinjin Gu. Refusal falls off a cliff: How safety alignment fails in reasoning?, 2025

2025

[26] [26]

Richard J. Young. Comparative analysis of LLM abliteration methods: A cross-architecture evaluation, 2025

2025

[27] [27]

LLMs encode harmfulness and refusal separately, 2025 a

Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs encode harmfulness and refusal separately, 2025 a . NeurIPS 2025

2025

[28] [28]

Chain-of-thought hijacking, 2025 b

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, and Fazl Barez. Chain-of-thought hijacking, 2025 b

2025

[29] [29]

Watch the weights: Unsupervised monitoring and control of fine-tuned LLM s, 2025

Ziqian Zhong and Aditi Raghunathan. Watch the weights: Unsupervised monitoring and control of fine-tuned LLM s, 2025. ICLR 2026

2025

[30] [30]

Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to A...

2023

[31] [31]

Zico Kolter, and Matt Fredrikson

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models, 2023 b . AdvBench harmful-behaviors set

2023

[32] [32]

Zico Kolter, Matt Fredrikson, and Dan Hendrycks

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, J. Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers, 2024. NeurIPS 2024

2024