arxiv: 2512.06655 · v3 · pith:R6UZVWFAnew · submitted 2025-12-07 · 💻 cs.LG · cs.AI

Graph-Regularized Sparse Autoencoders for LLM Safety Steering

Pith reviewed 2026-05-21 18:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords sparse autoencoders · LLM safety · safety steering · graph regularization · jailbreak benchmarks · refusal mechanisms · activation space · distributed features

The pith

Graph-regularized sparse autoencoders improve selective refusal in LLMs by smoothing safety directions over neuron co-activation graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard sparse autoencoders treat latent features as independent, a prior that is poorly matched to the distributed activation patterns underlying high-level safety behaviors such as refusal. By adding graph regularization that smooths decoder vectors over a neuron co-activation graph, the method learns steering directions that increase refusal on harmful jailbreak prompts while keeping refusal on benign prompts low. The learned directions are applied at inference time through a two-gate controller. On Llama-3-8B, swapping the standard SAE for the graph-regularized variant in an otherwise identical pipeline raises the safety metric Δ_s by 20.1 points on JailbreakBench and 16.8 points on HarmBench. The paper reports that the gains hold across several model families and under both black-box and gray-box attacks, without degrading benign-task performance.

Core claim

Graph-Regularized Sparse Autoencoders (GSAE) augment standard SAE dictionary learning with a term that smooths decoder vectors over a neuron co-activation graph. The resulting direction bank is applied via a two-gate runtime controller. This produces steering vectors that raise harmful-request refusal rates while holding benign-prompt refusal rates low. On Llama-3-8B the method improves the safety metric Δ_s by 20.1 points on JailbreakBench and 16.8 points on HarmBench relative to a standard SAE baseline in the same pipeline.

What carries the argument

Smoothing of SAE decoder vectors over a neuron co-activation graph to capture distributed structure for safety-steering directions.
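
As a rough illustration of what "smoothing over a graph" can mean, the sketch below computes a Dirichlet energy for a single decoder direction. It assumes an undirected graph with an unnormalized Laplacian L = D − A, for which the energy vᵀLv equals the edge-weighted sum of squared differences. The function name `dirichlet_energy`, the edge-dictionary layout, and the toy graph are illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]

def dirichlet_energy(vector: List[float], edges: Dict[Edge, float]) -> float:
    """Return sum over edges of w_ij * (v_i - v_j)^2, which equals v^T L v
    for the unnormalized Laplacian L = D - A of an undirected graph."""
    return sum(w * (vector[i] - vector[j]) ** 2 for (i, j), w in edges.items())

# Toy graph (assumption): 4 neurons on a path 0-1-2-3 with unit edge weights.
edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}

smooth_direction = [1.0, 1.0, 1.0, 1.0]    # constant across connected neurons
rough_direction = [1.0, -1.0, 1.0, -1.0]   # flips sign across every edge

smooth_energy = dirichlet_energy(smooth_direction, edges)
rough_energy = dirichlet_energy(rough_direction, edges)
print(smooth_energy, rough_energy)
```

On this toy graph the constant direction has zero energy, while the alternating direction pays a penalty on each of the three edges. A regularizer of this kind would therefore favor directions that vary little between strongly co-activating neurons.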

If this is right

  • GSAE raises harmful-request refusal while keeping benign-prompt refusal low across JailbreakBench, HarmBench, and XSTest.
  • The method outperforms standard activation-steering baselines and black-box guardrails.
  • Benign-task performance remains intact after the change.
  • The improvement generalizes to Llama-3, Mistral, Qwen 2.5, and Phi-4.
  • GSAE retains effectiveness under black-box and gray-box jailbreak attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Safety behaviors may be better represented as graph-structured patterns than as fully independent sparse features.
  • The same regularization approach could be tested on other high-level model behaviors such as instruction following or reasoning.
  • Alternative ways of constructing the co-activation graph might produce steering directions with different coverage or robustness.
  • The runtime two-gate controller may be separable from the training regularization, allowing the graph-smoothed directions to be used with other inference methods.

Load-bearing premise

High-level safety behaviors such as refusal depend on distributed structure in activation space that is well captured by smoothing decoder vectors over a neuron co-activation graph.

What would settle it

Running the identical steering pipeline on Llama-3-8B and finding that GSAE produces no gain or a loss in Δ_s on JailbreakBench and HarmBench relative to a standard SAE would falsify the central claim.
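
To make the test concrete, here is a minimal sketch of the comparison. This page does not give the formula for Δ_s, so the code assumes Δ_s = HRR − SRR (harmful refusal rate minus safe refusal rate, the two quantities named in the Figure 3 caption), measured in percentage points. That definition, the helper names, and all outcome lists are assumptions for illustration only.

```python
from typing import Sequence

def refusal_rate(refused: Sequence[bool]) -> float:
    """Percentage of prompts that were refused."""
    if not refused:
        raise ValueError("need at least one prompt")
    return 100.0 * sum(refused) / len(refused)

def delta_s(harmful_refused: Sequence[bool], benign_refused: Sequence[bool]) -> float:
    """ASSUMED definition: HRR - SRR, in percentage points."""
    return refusal_rate(harmful_refused) - refusal_rate(benign_refused)

# Made-up outcomes for illustration only (True = the model refused).
sae_harmful = [True] * 6 + [False] * 4     # HRR = 60
sae_benign = [True] * 2 + [False] * 8      # SRR = 20
gsae_harmful = [True] * 8 + [False] * 2    # HRR = 80
gsae_benign = [True] * 1 + [False] * 9     # SRR = 10

gain = delta_s(gsae_harmful, gsae_benign) - delta_s(sae_harmful, sae_benign)

# Under the falsification criterion above, a gain <= 0 in the identical
# pipeline would count against the central claim.
print(f"Δ_s gain of GSAE over SAE: {gain:.1f} points")
```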

Figures

Figures reproduced from arXiv: 2512.06655 by Federico Cinus, Jehyeok Yeon, Luca Luceri, Yifan Wu.

Figure 1: Overview of the GSAE steering framework. A user query is encoded into hidden states, ...
Figure 2: Safety performance across models, reported as ...
Figure 3: Refusal trade-off plots: harmful refusal rate (HRR, y-axis) vs. safe refusal rate (SRR, ...
Figure 4: Distribution of safe vs. unsafe prompt activations projected onto the low-frequency eigen...
Figure 5: Distribution of per-feature Dirichlet energy for SAE vs. GSAE at an intermediate model ...
Figure 6: Distribution of GSAE-based harm risk scores on the OOD test set. Safe (blue) and harmful ...
Original abstract

Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal across JailbreakBench, HarmBench, and XSTest, increasing harmful-request refusal while keeping benign-prompt refusals low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves $\Delta_s$ by $20.1$ points on JailbreakBench and $16.8$ points on HarmBench. GSAE outperforms activation-steering baselines and black-box guardrails, preserves benign-task performance, generalizes across Llama-3, Mistral, Qwen 2.5, and Phi-4, and remains strong under black-box and gray-box jailbreak attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Graph-Regularized Sparse Autoencoders (GSAE) that augment standard SAE dictionary learning with smoothing of decoder vectors over a neuron co-activation graph, combined with a two-gate runtime controller for safety steering. It reports empirical gains in selective refusal (increased harmful-request refusal with low benign refusals) on JailbreakBench, HarmBench, and XSTest, including specific improvements of +20.1 Δ_s on JailbreakBench and +16.8 Δ_s on HarmBench for Llama-3-8B, plus generalization across Llama-3, Mistral, Qwen 2.5, and Phi-4 models and resilience under jailbreak attacks.

Significance. If the central empirical claim holds after addressing controls, the work would demonstrate that graph-based regularization can better capture distributed activation structure for high-level safety behaviors than independent sparsity priors, providing a practical inference-time steering method that outperforms activation-steering baselines and black-box guardrails while preserving benign-task performance.

major comments (3)
  1. [Results] Results section: the central claim of +20.1 and +16.8 point gains in Δ_s on JailbreakBench and HarmBench for Llama-3-8B is presented without error bars, p-values, or details on the number of runs and random seeds; this is load-bearing because the abstract and strongest claim rest on these numerical improvements being reliable and replicable.
  2. [Methods] Methods section on graph construction: the neuron co-activation graph is built from a broad activation dataset without explicit safety-specific filtering or edge-weighting details; if edges primarily reflect low-level feature co-occurrence rather than refusal-related dependencies, the regularization reduces to a generic smoothness prior whose benefit over standard SAE sparsity cannot be isolated from the two-gate controller.
  3. [Experiments] Experiments: no ablation is reported that varies graph regularization strength independently of the two-gate controller or compares against alternative smoothness priors (e.g., Laplacian regularization without the co-activation graph); this is required to substantiate that the reported gains are attributable to the proposed structural assumption rather than increased effective capacity.
minor comments (2)
  1. [Preliminaries] The definition and exact formula for the Δ_s metric should be stated explicitly in the main text (not only in supplementary material) to allow readers to interpret the magnitude of the reported improvements.
  2. [Figures] Figure captions for benchmark results should include the exact number of prompts per category and any data-exclusion rules applied.
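
The statistical reporting requested in major comment 1 could look roughly like the sketch below: per-seed Δ_s values summarized as mean ± sample standard deviation, plus an exact paired sign-flip permutation test. The choice of test, the function name, and every number are assumptions for illustration; none of the values come from the paper.

```python
import itertools
import statistics
from typing import Sequence

def paired_sign_flip_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired differences a_i - b_i."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Enumerate every assignment of +/- signs to the paired differences.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            count += 1
    return count / total

# Hypothetical per-seed Δ_s values (NOT results from the paper).
gsae = [71.0, 69.5, 72.0, 70.5, 71.5]
sae = [51.0, 50.0, 52.5, 49.5, 51.5]

print(f"GSAE: {statistics.mean(gsae):.1f} ± {statistics.stdev(gsae):.1f}")
print(f"SAE:  {statistics.mean(sae):.1f} ± {statistics.stdev(sae):.1f}")
p_value = paired_sign_flip_p_value(gsae, sae)
print(f"paired sign-flip p-value: {p_value}")
```

One design point worth noting: with five paired seeds the exact two-sided test cannot return a p-value below 2/32 = 0.0625, even when GSAE wins on every seed. A p < 0.05 claim under this test therefore needs at least six seeds.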

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Results] Results section: the central claim of +20.1 and +16.8 point gains in Δ_s on JailbreakBench and HarmBench for Llama-3-8B is presented without error bars, p-values, or details on the number of runs and random seeds; this is load-bearing because the abstract and strongest claim rest on these numerical improvements being reliable and replicable.

    Authors: We agree that statistical details are necessary to establish reliability. The reported figures reflect single-run evaluations. In the revised manuscript we will add results from multiple independent runs using different random seeds, report means with standard-deviation error bars, and include p-values for the primary comparisons on JailbreakBench and HarmBench.

    Revision: yes

  2. Referee: [Methods] Methods section on graph construction: the neuron co-activation graph is built from a broad activation dataset without explicit safety-specific filtering or edge-weighting details; if edges primarily reflect low-level feature co-occurrence rather than refusal-related dependencies, the regularization reduces to a generic smoothness prior whose benefit over standard SAE sparsity cannot be isolated from the two-gate controller.

    Authors: The co-activation graph is deliberately built on a broad, diverse activation corpus to capture general neuron dependencies that we hypothesize underlie distributed safety behaviors. We will expand the methods section with explicit details on the dataset, edge-weight computation (co-activation frequency with thresholding), and construction steps. Because the two-gate controller is identical in the SAE and GSAE pipelines, the observed gains are attributable to the graph-regularized decoder directions rather than the controller.

    Revision: partial

  3. Referee: [Experiments] Experiments: no ablation is reported that varies graph regularization strength independently of the two-gate controller or compares against alternative smoothness priors (e.g., Laplacian regularization without the co-activation graph); this is required to substantiate that the reported gains are attributable to the proposed structural assumption rather than increased effective capacity.

    Authors: We acknowledge that targeted ablations would more cleanly isolate the contribution of the co-activation graph. In the revised version we will add an ablation that sweeps the graph-regularization coefficient while holding the two-gate controller fixed, and we will compare against a baseline that applies standard Laplacian smoothing to the decoder vectors without using the co-activation graph. These additions should clarify that the gains arise from the specific structural prior.

    Revision: yes
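
The second response describes edge weights as "co-activation frequency with thresholding" without giving details. One possible reading is sketched below, under stated assumptions: a neuron counts as active when its activation exceeds `active_above`, an edge weight is the fraction of samples on which both neurons are active, and edges below `min_weight` are dropped. The function name, both parameters, and the toy data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

def coactivation_graph(
    activations: List[List[float]],
    active_above: float = 0.0,
    min_weight: float = 0.5,
) -> Dict[Tuple[int, int], float]:
    """Build weighted undirected edges from co-activation frequency.

    activations[s][n] is the activation of neuron n on sample s.
    Edge weight = fraction of samples on which both neurons are active;
    edges below min_weight are dropped (the thresholding step).
    """
    n_samples = len(activations)
    n_neurons = len(activations[0])
    active = [[a > active_above for a in row] for row in activations]
    edges: Dict[Tuple[int, int], float] = {}
    for i, j in combinations(range(n_neurons), 2):
        both = sum(1 for row in active if row[i] and row[j])
        weight = both / n_samples
        if weight >= min_weight:
            edges[(i, j)] = weight
    return edges

# Toy activations (assumption): 4 samples x 3 neurons.
acts = [
    [0.9, 0.8, 0.0],
    [0.7, 0.6, 0.0],
    [0.5, 0.0, 0.3],
    [0.0, 0.4, 0.0],
]
graph = coactivation_graph(acts)
print(graph)
```

Here only neurons 0 and 1 fire together on at least half of the samples, so the graph keeps a single edge. The referee's concern maps onto the choice of corpus: the same code run on a broad corpus and on a safety-specific one would generally produce different edge sets.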

Circularity Check

0 steps flagged

Empirical method with independent benchmark evaluation shows no circularity

full rationale

The paper introduces GSAE as a dictionary-learning method that adds graph regularization to standard SAE training by smoothing decoder vectors over a co-activation graph and deploys it via a two-gate controller. All reported gains (e.g., +20.1 Δ_s on JailbreakBench for Llama-3-8B) are obtained by direct comparison against standard SAE and other baselines on fixed external test sets. No equations define a target quantity in terms of the fitted parameters themselves, no predictions are generated from the same data used to tune the graph or sparsity, and no load-bearing uniqueness theorems or self-citations are invoked. The derivation chain is therefore self-contained: the method is specified, trained, and evaluated on separate benchmarks without reducing the claimed improvement to a renaming or re-use of its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on one domain assumption about distributed safety representations and at least one tunable regularization strength; no new entities are postulated, and the abstract reports no numerical value for any free parameter.

free parameters (1)
  • graph regularization strength
    The weight of the smoothing term over the co-activation graph must be chosen or tuned; the abstract does not report its value.
axioms (1)
  • domain assumption: High-level safety behaviors depend on distributed structure in activation space that can be captured by a neuron co-activation graph
    Stated explicitly in the abstract as the motivation for moving beyond independent-feature sparsity.
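
To show where the ledger's single free parameter would enter, one plausible form of a graph-regularized SAE objective is written out below. This page does not reproduce the paper's loss, so every symbol here is an assumption: $f(x)$ is the encoder output, $D$ the decoder matrix with columns $d_k$, $L$ the Laplacian of the co-activation graph with edge weights $w_{ij}$, $\alpha$ the sparsity weight, and $\lambda$ the graph regularization strength.

```latex
\mathcal{L}(x)
  = \underbrace{\lVert x - D\,f(x) \rVert_2^2}_{\text{reconstruction}}
  + \alpha \underbrace{\lVert f(x) \rVert_1}_{\text{sparsity}}
  + \lambda \underbrace{\operatorname{tr}\!\left(D^\top L D\right)}_{\text{graph smoothness}},
\qquad
\operatorname{tr}\!\left(D^\top L D\right)
  = \sum_k d_k^\top L\, d_k
  = \sum_k \sum_{\{i,j\}} w_{ij}\,\bigl(d_{ik} - d_{jk}\bigr)^2 .
```

In this form, setting $\lambda = 0$ recovers a standard SAE objective, which is why the strength of the smoothing term is the quantity an ablation would need to sweep.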

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

    cs.LG · 2026-05 · unverdicted · novelty 8.0

    Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...
