Graph-Regularized Sparse Autoencoders for LLM Safety Steering

Federico Cinus; Jehyeok Yeon; Luca Luceri; Yifan Wu

arxiv: 2512.06655 · v3 · pith:R6UZVWFAnew · submitted 2025-12-07 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 18:33 UTC · model grok-4.3

keywords: sparse autoencoders · LLM safety · safety steering · graph regularization · jailbreak benchmarks · refusal mechanisms · activation space · distributed features

The pith

Graph-regularized sparse autoencoders improve selective refusal in LLMs by smoothing safety directions over neuron co-activation graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard sparse autoencoders treat latent features as independent, which mismatches the distributed activation patterns underlying high-level safety behaviors like refusal. By adding graph regularization that smooths decoder vectors according to neuron co-activation, the method learns steering directions that increase refusal on harmful jailbreaks while keeping refusals on benign prompts low. These directions are applied at runtime through a two-gate controller. On Llama-3-8B, swapping the standard SAE for GSAE in an otherwise identical pipeline raises the safety metric Δ_s by 20.1 points on JailbreakBench and 16.8 points on HarmBench. The reported gains hold across several model families and under both black-box and gray-box attacks without degrading normal task performance.

Core claim

Graph-Regularized Sparse Autoencoders (GSAE) augment the standard SAE objective with a regularizer that smooths decoder vectors over a neuron co-activation graph. The resulting direction bank is applied via a two-gate runtime controller. This produces steering vectors that raise harmful-request refusal rates while holding benign-prompt refusal rates low. On Llama-3-8B the method improves the safety metric Δ_s by 20.1 points on JailbreakBench and 16.8 points on HarmBench relative to a standard SAE baseline in the same pipeline.

What carries the argument

Smoothing of SAE decoder vectors over a neuron co-activation graph to capture distributed structure for safety-steering directions.
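A minimal sketch of what such an objective could look like, assuming a ReLU encoder, an L1 sparsity term, and a Laplacian penalty summed over decoder vectors. The function name `gsae_loss`, the coefficient names, and their default values are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def gsae_loss(x, W_enc, b_enc, W_dec, L, l1_coeff=1e-3, graph_coeff=1e-2):
    """Reconstruction + L1 sparsity + Laplacian smoothness on decoder vectors.

    x:     (batch, n_neurons) activations
    W_enc: (n_neurons, n_latents) encoder weights, b_enc: (n_latents,) bias
    W_dec: (n_latents, n_neurons) decoder; row i is decoder vector v_i
    L:     (n_neurons, n_neurons) Laplacian of the co-activation graph
    All shapes and coefficients here are assumptions for illustration.
    """
    z = np.maximum(x @ W_enc + b_enc, 0.0)             # ReLU latent codes
    x_hat = z @ W_dec                                   # linear decoder
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))   # mean squared error per sample
    sparsity = np.mean(np.sum(np.abs(z), axis=1))       # L1 penalty on codes
    # Graph term: sum_i v_i^T L v_i, small when each decoder vector varies
    # little across strongly connected neurons.
    smooth = np.einsum("in,nm,im->", W_dec, L, W_dec)
    return recon + l1_coeff * sparsity + graph_coeff * smooth
```

Setting `graph_coeff=0` recovers a plain sparse-autoencoder loss in this sketch, which is the kind of like-for-like comparison the review describes.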

If this is right

GSAE raises harmful-request refusal while keeping benign-prompt refusal low across JailbreakBench, HarmBench, and XSTest.
The method outperforms standard activation-steering baselines and black-box guardrails.
Benign-task performance remains intact after the change.
The improvement generalizes to Llama-3, Mistral, Qwen 2.5, and Phi-4.
GSAE retains effectiveness under black-box and gray-box jailbreak attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Safety behaviors may be better represented as graph-structured patterns than as fully independent sparse features.
The same regularization approach could be tested on other high-level model behaviors such as instruction following or reasoning.
Alternative ways of constructing the co-activation graph might produce steering directions with different coverage or robustness.
The runtime two-gate controller may be separable from the training regularization, allowing the graph-smoothed directions to be used with other inference methods.

Load-bearing premise

High-level safety behaviors such as refusal depend on distributed structure in activation space that is well captured by smoothing decoder vectors over a neuron co-activation graph.

What would settle it

Running the identical steering pipeline on Llama-3-8B and finding that GSAE produces no gain or a loss in Δ_s on JailbreakBench and HarmBench relative to a standard SAE would falsify the central claim.

Figures

Figures reproduced from arXiv: 2512.06655 by Federico Cinus, Jehyeok Yeon, Luca Luceri, Yifan Wu.

**Figure 2.** Safety performance across models, reported as ...

**Figure 3.** Refusal trade-off plots: harmful refusal rate (HRR, y-axis) vs. safe refusal rate (SRR, ...

**Figure 4.** Distribution of safe vs. unsafe prompt activations projected onto the low-frequency eigen...

**Figure 5.** Distribution of per-feature Dirichlet energy for SAE vs. GSAE at an intermediate model ...

**Figure 6.** Distribution of GSAE-based harm risk scores on the OOD test set. Safe (blue) and harmful ...

Original abstract

Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal across JailbreakBench, HarmBench, and XSTest, increasing harmful-request refusal while keeping benign-prompt refusals low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves $\Delta_s$ by $20.1$ points on JailbreakBench and $16.8$ points on HarmBench. GSAE outperforms activation-steering baselines and black-box guardrails, preserves benign-task performance, generalizes across Llama-3, Mistral, Qwen 2.5, and Phi-4, and remains strong under black-box and gray-box jailbreak attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GSAE reports clear benchmark gains from graph-smoothing SAE decoder vectors for safety steering, but the gains may trace to generic regularization rather than safety-specific structure.

The main takeaway is that this paper adds graph regularization over neuron co-activations to standard SAEs and gets solid lifts in selective refusal on JailbreakBench and HarmBench while holding benign performance steady across four model families. On Llama-3-8B the Δ_s metric rises by roughly 20 and 17 points respectively, and the method beats some steering baselines under both black-box and gray-box attacks. That empirical pattern is the part worth noting first. They keep the rest of the pipeline fixed and swap in the GSAE directions with a two-gate controller, which makes the comparison direct. The work also tests generalization to Mistral, Qwen 2.5, and Phi-4, which adds some breadth. These are the concrete results that stand out from the abstract and the reported numbers.

The soft spot is exactly the one the stress test flags. The graph is built from co-activations on what appears to be a broad dataset, so the edges likely reflect common low-level correlations more than refusal-specific distributed structure. If that is the case, the smoothing acts as an extra capacity or smoothness prior whose benefit could be reproduced without the safety hypothesis. The abstract gives no ablations on graph construction, no error bars, and no statistical tests, so the size and reliability of the 20-point gains are still hard to pin down.

This paper sits in the SAE-for-steering corner of LLM safety work. Readers who track activation engineering or inference-time control will find the numbers useful to compare against their own setups. It is coherent on its own terms and cites the relevant prior SAE and steering literature without obvious circularity.

I would bring it to a reading group to walk through the graph details and any controls that appear in the full text. I would not cite it in my own work in the next year unless the mechanism is isolated more cleanly. A serious editor should send it to peer review so referees can check the graph relevance and the missing statistical details.

Referee Report

3 major / 2 minor

Summary. The paper introduces Graph-Regularized Sparse Autoencoders (GSAE) that augment standard SAE dictionary learning with smoothing of decoder vectors over a neuron co-activation graph, combined with a two-gate runtime controller for safety steering. It reports empirical gains in selective refusal (increased harmful-request refusal with low benign refusals) on JailbreakBench, HarmBench, and XSTest, including specific improvements of +20.1 Δ_s on JailbreakBench and +16.8 Δ_s on HarmBench for Llama-3-8B, plus generalization across Llama-3, Mistral, Qwen 2.5, and Phi-4 models and resilience under jailbreak attacks.

Significance. If the central empirical claim holds after addressing controls, the work would demonstrate that graph-based regularization can better capture distributed activation structure for high-level safety behaviors than independent sparsity priors, providing a practical inference-time steering method that outperforms activation-steering baselines and black-box guardrails while preserving benign-task performance.

major comments (3)

[Results] Results section: the central claim of +20.1 and +16.8 point gains in Δ_s on JailbreakBench and HarmBench for Llama-3-8B is presented without error bars, p-values, or details on the number of runs and random seeds; this is load-bearing because the abstract and strongest claim rest on these numerical improvements being reliable and replicable.
[Methods] Methods section on graph construction: the neuron co-activation graph is built from a broad activation dataset without explicit safety-specific filtering or edge-weighting details; if edges primarily reflect low-level feature co-occurrence rather than refusal-related dependencies, the regularization reduces to a generic smoothness prior whose benefit over standard SAE sparsity cannot be isolated from the two-gate controller.
[Experiments] Experiments: no ablation is reported that varies graph regularization strength independently of the two-gate controller or compares against alternative smoothness priors (e.g., Laplacian regularization without the co-activation graph); this is required to substantiate that the reported gains are attributable to the proposed structural assumption rather than increased effective capacity.
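The first major comment asks for error bars and significance tests across seeds. A minimal sketch of what that reporting could look like, assuming per-seed Δ_s scores are available for both pipelines under matched seeds. The function names are hypothetical, and the exact sign-flip permutation test is one reasonable choice among several, not a procedure taken from the paper.

```python
from itertools import product
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)

def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences.

    a, b: equal-length lists of per-seed scores for two pipelines,
    paired by seed. Enumerates all 2^n sign assignments, so it is
    only practical for a small number of seeds.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        # Small tolerance so the observed assignment itself is counted.
        if abs(mean(flipped)) >= observed - 1e-12:
            count += 1
    return count / total
```

One consequence of this test is worth keeping in mind when choosing the number of runs: with n paired seeds the smallest achievable two-sided p-value is 2 / 2^n, so five seeds cannot go below 0.0625 even if every seed favors the same method.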

minor comments (2)

[Preliminaries] The definition and exact formula for the Δ_s metric should be stated explicitly in the main text (not only in supplementary material) to allow readers to interpret the magnitude of the reported improvements.
[Figures] Figure captions for benchmark results should include the exact number of prompts per category and any data-exclusion rules applied.
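Since the formula for Δ_s is not stated in the material reviewed here, the following is only a sketch under a loudly flagged assumption: that Δ_s is the gap between the harmful refusal rate (HRR) and the safe refusal rate (SRR) named in the Figure 3 caption, expressed in percentage points. The paper's actual definition may differ, and the function names are hypothetical.

```python
def refusal_rate(refused):
    """Fraction of prompts refused; `refused` is a non-empty list of booleans."""
    if not refused:
        raise ValueError("need at least one prompt")
    return sum(refused) / len(refused)

def delta_s(harmful_refused, benign_refused):
    """ASSUMED definition: Delta_s = HRR - SRR, in percentage points.

    harmful_refused: per-prompt refusal flags on harmful requests
    benign_refused:  per-prompt refusal flags on benign requests
    """
    hrr = refusal_rate(harmful_refused)   # higher is better
    srr = refusal_rate(benign_refused)    # lower is better
    return 100.0 * (hrr - srr)
```

Under this reading, a model that refuses everything and a model that refuses nothing both score zero, which is why a gap-style metric rewards selective refusal rather than blanket refusal.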

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Results] Results section: the central claim of +20.1 and +16.8 point gains in Δ_s on JailbreakBench and HarmBench for Llama-3-8B is presented without error bars, p-values, or details on the number of runs and random seeds; this is load-bearing because the abstract and strongest claim rest on these numerical improvements being reliable and replicable.

Authors: We agree that statistical details are necessary to establish reliability. The reported figures reflect single-run evaluations. In the revised manuscript we will add results from multiple independent runs using different random seeds, report means with standard-deviation error bars, and include p-values for the primary comparisons on JailbreakBench and HarmBench.

Revision: yes

Referee: [Methods] Methods section on graph construction: the neuron co-activation graph is built from a broad activation dataset without explicit safety-specific filtering or edge-weighting details; if edges primarily reflect low-level feature co-occurrence rather than refusal-related dependencies, the regularization reduces to a generic smoothness prior whose benefit over standard SAE sparsity cannot be isolated from the two-gate controller.

Authors: The co-activation graph is deliberately built on a broad, diverse activation corpus to capture general neuron dependencies that we hypothesize underlie distributed safety behaviors. We will expand the methods section with explicit details on the dataset, edge-weight computation (co-activation frequency with thresholding), and construction steps. Because the two-gate controller is identical in the SAE and GSAE pipelines, the observed gains are attributable to the graph-regularized decoder directions rather than the controller.

Revision: partial

Referee: [Experiments] Experiments: no ablation is reported that varies graph regularization strength independently of the two-gate controller or compares against alternative smoothness priors (e.g., Laplacian regularization without the co-activation graph); this is required to substantiate that the reported gains are attributable to the proposed structural assumption rather than increased effective capacity.

Authors: We acknowledge that targeted ablations would more cleanly isolate the contribution of the co-activation graph. In the revised version we will add an ablation that sweeps the graph-regularization coefficient while holding the two-gate controller fixed, and we will compare against a baseline that applies standard Laplacian smoothing to the decoder vectors without using the co-activation graph. These additions should clarify that the gains arise from the specific structural prior.

Revision: yes
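
The second response describes edge weights as co-activation frequency with thresholding. A minimal sketch of one way such a graph and its Laplacian could be built from an activation matrix. The binarization rule, the parameter names `act_threshold` and `edge_threshold`, and their defaults are assumptions for illustration, not details from the paper.

```python
import numpy as np

def coactivation_laplacian(acts, act_threshold=0.0, edge_threshold=0.1):
    """Build an unnormalized graph Laplacian from neuron co-activation frequency.

    acts: (n_samples, n_neurons) activation matrix.
    A neuron counts as active on a sample when its activation exceeds
    act_threshold (an assumed rule). Edge weight = fraction of samples
    on which both neurons are active; weights below edge_threshold are dropped.
    """
    acts = np.asarray(acts, dtype=float)
    active = (acts > act_threshold).astype(float)   # binarize activity
    W = active.T @ active / acts.shape[0]            # pairwise co-activation frequency
    np.fill_diagonal(W, 0.0)                         # no self-loops
    W[W < edge_threshold] = 0.0                      # sparsify weak edges
    D = np.diag(W.sum(axis=1))                       # degree matrix
    return D - W                                     # Laplacian L = D - W
```

A control of the kind the referee asks for could reuse the same function on a shuffled activation matrix, which preserves per-neuron firing rates while destroying the co-activation structure.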

Circularity Check

0 steps flagged

Empirical method with independent benchmark evaluation shows no circularity

The paper introduces GSAE as a dictionary-learning method that adds graph regularization to standard SAE training by smoothing decoder vectors over a co-activation graph and deploys it via a two-gate controller. All reported gains (e.g., +20.1 Δ_s on JailbreakBench for Llama-3-8B) are obtained by direct comparison against standard SAE and other baselines on fixed external test sets. No equations define a target quantity in terms of the fitted parameters themselves, no predictions are generated from the same data used to tune the graph or sparsity, and no load-bearing uniqueness theorems or self-citations are invoked. The derivation chain is therefore self-contained: the method is specified, trained, and evaluated on separate benchmarks without reducing the claimed improvement to a renaming or re-use of its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on one domain assumption about distributed safety representations and at least one tunable regularization strength; no new entities are postulated and no free parameters are numerically reported in the abstract.

free parameters (1)

graph regularization strength
The weight of the smoothing term over the co-activation graph must be chosen or tuned; the abstract does not report its value.

axioms (1)

domain assumption: High-level safety behaviors depend on distributed structure in activation space that can be captured by a neuron co-activation graph.
Stated explicitly in the abstract as the motivation for moving beyond independent-feature sparsity.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: GSAE extends standard SAEs by incorporating a graph Laplacian regularizer... penalizes the Laplacian energy of each decoded feature direction

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: spectral vector bank... structural coherence (s_lap_i) via normalized Dirichlet energy E_i = (v_i^T L v_i)/||v_i||^2
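
The normalized Dirichlet energy quoted in that passage, E_i = (v_i^T L v_i)/||v_i||^2, is straightforward to compute for a bank of vectors. A minimal sketch, assuming the vectors are stored as rows of a matrix; the function name and the `eps` guard against zero-norm rows are hypothetical additions.

```python
import numpy as np

def normalized_dirichlet_energy(V, L, eps=1e-12):
    """Return E_i = (v_i^T L v_i) / ||v_i||^2 for each row v_i of V.

    V: (n_vectors, n_neurons) bank of direction vectors
    L: (n_neurons, n_neurons) graph Laplacian
    eps: assumed guard that avoids division by zero for all-zero rows
    """
    V = np.asarray(V, dtype=float)
    quad = np.einsum("in,nm,im->i", V, L, V)   # v_i^T L v_i per row
    norms = np.sum(V ** 2, axis=1)             # ||v_i||^2 per row
    return quad / np.maximum(norms, eps)
```

Low values indicate a vector that changes little across connected neurons. For the combinatorial Laplacian L = D - W, a constant vector on a connected graph has energy zero.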

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
cs.LG · 2026-05 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...
