Graph-Regularized Sparse Autoencoders for LLM Safety Steering
Pith reviewed 2026-05-21 18:33 UTC · model grok-4.3
The pith
Graph-regularized sparse autoencoders improve selective refusal in LLMs by smoothing safety directions over neuron co-activation graphs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Graph-Regularized Sparse Autoencoders (GSAE) augment standard SAE training with a regularizer that smooths decoder vectors over a neuron co-activation graph. The resulting direction bank is applied through a two-gate runtime controller, yielding steering vectors that raise harmful-request refusal rates while keeping benign-prompt refusal rates low. On Llama-3-8B, swapping the standard SAE for GSAE in an otherwise identical pipeline improves the safety metric Δ_s by 20.1 points on JailbreakBench and 16.8 points on HarmBench.
What carries the argument
Smoothing of SAE decoder vectors over a neuron co-activation graph to capture distributed structure for safety-steering directions.
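As a rough illustration of what this smoothing penalizes, the sketch below builds the combinatorial Laplacian L = D - W of a small weighted graph and sums the Laplacian energy v^T L v over a bank of decoder vectors. The toy graph, the helper names, and the choice of the unnormalized Laplacian are assumptions made for illustration; the paper's exact regularizer may differ.

```python
# Illustrative sketch only: the toy graph and helper names are hypothetical,
# not taken from the paper.

def laplacian(weights):
    """Combinatorial Laplacian L = D - W for a symmetric weight matrix."""
    n = len(weights)
    lap = [[-weights[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] = sum(weights[i][j] for j in range(n) if j != i)
    return lap


def laplacian_energy(vec, lap):
    """Quadratic form v^T L v; small when v varies little across strong edges."""
    n = len(vec)
    return sum(vec[i] * lap[i][j] * vec[j] for i in range(n) for j in range(n))


def smoothness_penalty(decoder_vectors, lap):
    """Sum of Laplacian energies over all decoder vectors (one per latent)."""
    return sum(laplacian_energy(v, lap) for v in decoder_vectors)


# Toy co-activation graph over 4 neurons: pairs 0-1 and 2-3 co-activate
# strongly, while 1-2 is only weakly linked.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
L = laplacian(W)

smooth = [1.0, 1.0, -1.0, -1.0]  # constant within each strongly linked pair
rough = [1.0, -1.0, 1.0, -1.0]   # flips sign across every edge

print("smooth direction energy:", laplacian_energy(smooth, L))
print("rough direction energy: ", laplacian_energy(rough, L))
print("bank penalty:           ", smoothness_penalty([smooth, rough], L))
```

A direction that respects the strong edges pays only for the weak 1-2 link, while one that alternates sign pays on every edge, so a penalty of this shape pushes decoder vectors toward the graph's cluster structure.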
If this is right
- GSAE raises harmful-request refusal while keeping benign-prompt refusal low across JailbreakBench, HarmBench, and XSTest.
- The method outperforms standard activation-steering baselines and black-box guardrails.
- Benign-task performance remains intact after the change.
- The improvement generalizes to Llama-3, Mistral, Qwen 2.5, and Phi-4.
- GSAE retains effectiveness under black-box and gray-box jailbreak attacks.
Where Pith is reading between the lines
- Safety behaviors may be better represented as graph-structured patterns than as fully independent sparse features.
- The same regularization approach could be tested on other high-level model behaviors such as instruction following or reasoning.
- Alternative ways of constructing the co-activation graph might produce steering directions with different coverage or robustness.
- The runtime two-gate controller may be separable from the training regularization, allowing the graph-smoothed directions to be used with other inference methods.
Load-bearing premise
High-level safety behaviors such as refusal depend on distributed structure in activation space that is well captured by smoothing decoder vectors over a neuron co-activation graph.
What would settle it
Running the identical steering pipeline on Llama-3-8B and finding that GSAE produces no gain or a loss in Δ_s on JailbreakBench and HarmBench relative to a standard SAE would falsify the central claim.
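Deciding between "a gain" and "no gain" needs a comparison across repeated runs. The sketch below shows one simple way this could be scored: a paired, exact sign-flip test on per-seed differences in Δ_s. The scores are placeholders invented for the example, not results from the paper, and the choice of test is an assumption rather than the authors' procedure.

```python
# Illustrative sketch: the per-seed scores below are placeholders,
# NOT results from the paper.
from itertools import product
from statistics import mean, stdev


def sign_flip_pvalue(diffs):
    """Exact one-sided paired sign-flip test of mean(diffs) > 0.

    Under the null of no gain, each per-seed difference is equally likely
    to be positive or negative, so all sign assignments are enumerated.
    """
    observed = mean(diffs)
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = mean([s * d for s, d in zip(signs, diffs)])
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical Delta_s per seed for each pipeline (same seeds, same prompts).
gsae_scores = [10.0, 12.0, 11.0, 13.0, 12.0]
sae_scores = [9.0, 10.0, 10.0, 11.0, 11.0]

diffs = [g - s for g, s in zip(gsae_scores, sae_scores)]
p_value = sign_flip_pvalue(diffs)

print(f"mean gain = {mean(diffs):.2f} +/- {stdev(diffs):.2f} (std over seeds)")
print(f"one-sided sign-flip p-value = {p_value:.4f}")
```

With five seeds the smallest attainable p-value under this test is 1/32, so a stronger claim would need more seeds.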
Original abstract
Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal across JailbreakBench, HarmBench, and XSTest, increasing harmful-request refusal while keeping benign-prompt refusals low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves $\Delta_s$ by $20.1$ points on JailbreakBench and $16.8$ points on HarmBench. GSAE outperforms activation-steering baselines and black-box guardrails, preserves benign-task performance, generalizes across Llama-3, Mistral, Qwen 2.5, and Phi-4, and remains strong under black-box and gray-box jailbreak attacks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Graph-Regularized Sparse Autoencoders (GSAE) that augment standard SAE dictionary learning with smoothing of decoder vectors over a neuron co-activation graph, combined with a two-gate runtime controller for safety steering. It reports empirical gains in selective refusal (increased harmful-request refusal with low benign refusals) on JailbreakBench, HarmBench, and XSTest, including specific improvements of +20.1 Δ_s on JailbreakBench and +16.8 Δ_s on HarmBench for Llama-3-8B, plus generalization across Llama-3, Mistral, Qwen 2.5, and Phi-4 models and resilience under jailbreak attacks.
Significance. If the central empirical claim holds after addressing controls, the work would demonstrate that graph-based regularization can better capture distributed activation structure for high-level safety behaviors than independent sparsity priors, providing a practical inference-time steering method that outperforms activation-steering baselines and black-box guardrails while preserving benign-task performance.
major comments (3)
- [Results] Results section: the central claim of +20.1 and +16.8 point gains in Δ_s on JailbreakBench and HarmBench for Llama-3-8B is presented without error bars, p-values, or details on the number of runs and random seeds; this is load-bearing because the abstract and strongest claim rest on these numerical improvements being reliable and replicable.
- [Methods] Methods section on graph construction: the neuron co-activation graph is built from a broad activation dataset without explicit safety-specific filtering or edge-weighting details; if edges primarily reflect low-level feature co-occurrence rather than refusal-related dependencies, the regularization reduces to a generic smoothness prior whose benefit over standard SAE sparsity cannot be isolated from the two-gate controller.
- [Experiments] Experiments: no ablation is reported that varies graph regularization strength independently of the two-gate controller or compares against alternative smoothness priors (e.g., Laplacian regularization without the co-activation graph); this is required to substantiate that the reported gains are attributable to the proposed structural assumption rather than increased effective capacity.
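The ablation requested in the last major comment can be pictured with a toy sweep. The sketch below smooths a single direction v0 by minimizing ||v - v0||^2 + lambda * v^T L v for several values of lambda and reports the remaining Laplacian energy. The graph, the starting direction, the objective, and the gradient-descent solver are all assumptions chosen for illustration; they are not the paper's training setup.

```python
# Illustrative sketch: toy graph, toy direction, and solver are assumptions.

def laplacian(weights):
    """Combinatorial Laplacian L = D - W for a symmetric weight matrix."""
    n = len(weights)
    lap = [[-weights[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] = sum(weights[i][j] for j in range(n) if j != i)
    return lap


def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def energy(vec, lap):
    """Laplacian energy v^T L v."""
    return sum(a * b for a, b in zip(vec, matvec(lap, vec)))


def smooth_direction(v0, lap, lam, steps=2000):
    """Minimize ||v - v0||^2 + lam * v^T L v by plain gradient descent."""
    max_degree = max(lap[i][i] for i in range(len(lap)))
    # 2 * max_degree upper-bounds the largest eigenvalue of L (Gershgorin),
    # so this step is at most 1 / (Lipschitz constant of the gradient).
    step = 1.0 / (2.0 * (1.0 + lam * 2.0 * max_degree))
    v = list(v0)
    for _ in range(steps):
        lv = matvec(lap, v)
        grad = [2.0 * (vi - v0i) + 2.0 * lam * lvi
                for vi, v0i, lvi in zip(v, v0, lv)]
        v = [vi - step * g for vi, g in zip(v, grad)]
    return v


# Path graph 0-1-2-3 with unit edge weights.
W = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
L = laplacian(W)
v0 = [1.0, -1.0, 1.0, -1.0]

sweep = {lam: energy(smooth_direction(v0, L, lam), L)
         for lam in (0.0, 0.1, 1.0, 10.0)}
for lam, e in sweep.items():
    print(f"lambda={lam:<5} Laplacian energy={e:.4f}")
```

At lambda = 0 the direction is left untouched, and larger values trade fidelity to v0 for smoothness on the graph. A real ablation would report Δ_s at each setting with the controller held fixed, which this toy does not attempt.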
minor comments (2)
- [Preliminaries] The definition and exact formula for the Δ_s metric should be stated explicitly in the main text (not only in supplementary material) to allow readers to interpret the magnitude of the reported improvements.
- [Figures] Figure captions for benchmark results should include the exact number of prompts per category and any data-exclusion rules applied.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.
Point-by-point responses
- Referee: [Results] Results section: the central claim of +20.1 and +16.8 point gains in Δ_s on JailbreakBench and HarmBench for Llama-3-8B is presented without error bars, p-values, or details on the number of runs and random seeds; this is load-bearing because the abstract and strongest claim rest on these numerical improvements being reliable and replicable.
  Authors: We agree that statistical details are necessary to establish reliability. The reported figures reflect single-run evaluations. In the revised manuscript we will add results from multiple independent runs using different random seeds, report means with standard-deviation error bars, and include p-values for the primary comparisons on JailbreakBench and HarmBench.
  revision: yes
- Referee: [Methods] Methods section on graph construction: the neuron co-activation graph is built from a broad activation dataset without explicit safety-specific filtering or edge-weighting details; if edges primarily reflect low-level feature co-occurrence rather than refusal-related dependencies, the regularization reduces to a generic smoothness prior whose benefit over standard SAE sparsity cannot be isolated from the two-gate controller.
  Authors: The co-activation graph is deliberately built on a broad, diverse activation corpus to capture general neuron dependencies that we hypothesize underlie distributed safety behaviors. We will expand the methods section with explicit details on the dataset, edge-weight computation (co-activation frequency with thresholding), and construction steps. Because the two-gate controller is identical in the SAE and GSAE pipelines, the observed gains are attributable to the graph-regularized decoder directions rather than the controller.
  revision: partial
- Referee: [Experiments] Experiments: no ablation is reported that varies graph regularization strength independently of the two-gate controller or compares against alternative smoothness priors (e.g., Laplacian regularization without the co-activation graph); this is required to substantiate that the reported gains are attributable to the proposed structural assumption rather than increased effective capacity.
  Authors: We acknowledge that targeted ablations would more cleanly isolate the contribution of the co-activation graph. In the revised version we will add an ablation that sweeps the graph-regularization coefficient while holding the two-gate controller fixed, and we will compare against a baseline that applies standard Laplacian smoothing to the decoder vectors without using the co-activation graph. These additions should clarify that the gains arise from the specific structural prior.
  revision: yes
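The edge-weight rule mentioned in the second response (co-activation frequency with thresholding) could look roughly like the sketch below. The activation threshold, the minimum-frequency cutoff, the toy activations, and the function name are all hypothetical; the rebuttal does not state the actual values or procedure.

```python
# Illustrative sketch: thresholds and data are made up for the example.

def coactivation_graph(activations, act_threshold=0.0, min_freq=0.5):
    """Build a symmetric weight matrix from neuron co-activation frequency.

    activations: list of samples, each a list of per-neuron activations.
    A neuron counts as active on a sample if its value exceeds act_threshold.
    The weight of edge (j, k) is the fraction of samples on which both
    neurons are active; edges with frequency below min_freq are dropped.
    """
    n_samples = len(activations)
    n_neurons = len(activations[0])
    active = [[a > act_threshold for a in sample] for sample in activations]

    weights = [[0.0] * n_neurons for _ in range(n_neurons)]
    for j in range(n_neurons):
        for k in range(j + 1, n_neurons):
            both = sum(1 for row in active if row[j] and row[k])
            freq = both / n_samples
            if freq >= min_freq:
                weights[j][k] = weights[k][j] = freq
    return weights


# Four toy samples over three neurons; neurons 0 and 1 usually fire together.
acts = [
    [0.9, 0.8, 0.0],
    [0.7, 0.6, 0.0],
    [0.0, 0.0, 0.5],
    [0.4, 0.3, 0.2],
]
W = coactivation_graph(acts)
for row in W:
    print(row)
```

Because the corpus here is generic, nothing in this construction is safety-specific, which is the referee's concern: the graph encodes whatever co-occurrence structure the activation data happens to contain.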
Circularity Check
Empirical method with independent benchmark evaluation shows no circularity
The paper introduces GSAE as a dictionary-learning method that adds graph regularization to standard SAE training by smoothing decoder vectors over a co-activation graph and deploys it via a two-gate controller. All reported gains (e.g., +20.1 Δ_s on JailbreakBench for Llama-3-8B) are obtained by direct comparison against standard SAE and other baselines on fixed external test sets. No equations define a target quantity in terms of the fitted parameters themselves, no predictions are generated from the same data used to tune the graph or sparsity, and no load-bearing uniqueness theorems or self-citations are invoked. The derivation chain is therefore self-contained: the method is specified, trained, and evaluated on separate benchmarks without reducing the claimed improvement to a renaming or re-use of its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- graph regularization strength
axioms (1)
- domain assumption: High-level safety behaviors depend on distributed structure in activation space that can be captured by a neuron co-activation graph
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear?)
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: GSAE extends standard SAEs by incorporating a graph Laplacian regularizer... penalizes the Laplacian energy of each decoded feature direction
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear?)
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: spectral vector bank... structural coherence (s_lap_i) via normalized Dirichlet energy E_i = (v_i^T L v_i)/||v_i||^2
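For orientation, the quantity in this passage is a Rayleigh quotient of the graph Laplacian. Assuming L is the combinatorial Laplacian L = D - W of a symmetric, non-negative weight matrix W (an assumption; the paper may use a normalized Laplacian instead), and writing v_{i,j} for the j-th coordinate of v_i, it expands as:

```latex
E_i \;=\; \frac{v_i^{\top} L\, v_i}{\lVert v_i \rVert^{2}}
    \;=\; \frac{\tfrac{1}{2}\sum_{j,k} W_{jk}\,\bigl(v_{i,j} - v_{i,k}\bigr)^{2}}
               {\sum_{j} v_{i,j}^{2}},
\qquad 0 \;\le\; E_i \;\le\; \lambda_{\max}(L).
```

A low E_i means the direction changes little across strongly co-activating neurons. With strictly positive edge weights, E_i = 0 only when v_i is constant on each connected component of the graph.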
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
  Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...