hub Mixed citations

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

· 2024 · cs.LG · arXiv 2407.14435

Mixed citation behavior. Most common role is background (50%).

25 Pith papers citing it

Background 50% of classified citations

open full Pith review browse 25 citing papers arXiv PDF

abstract

Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs -- where we replace the ReLU with a discontinuous JumpReLU activation function -- and are similarly efficient to train and run. By utilising straight-through-estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 method 4 baseline 1

citation-polarity summary

background 5 use method 4 baseline 1

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.

To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents

cs.LG · 2026-05-16 · conditional · novelty 7.0

LLM agents have an intrinsic over-calling bias diagnosed via SAE activation margins and corrected by adaptive margin-calibrated steering, improving overall decision accuracy.

SwordBench: Evaluating Orthogonality of Steering Image Representations

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

SwordBench benchmarks steering methods for concept removal in vision models and shows that linear SVMs achieve strong separability and orthogonality but incur collateral damage, while sparse autoencoders often perform better and no method reaches perfect steering even in simple cases.

fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

cs.LG · 2026-05-10 · conditional · novelty 7.0

fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent latents than standard crosscoders on GPT2-Small, Pythia, and Gemma2 models.

Improving Sparse Autoencoder with Dynamic Attention

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.

Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

cs.LG · 2025-09-03 · unverdicted · novelty 7.0

SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.

Do Language Models Encode Knowledge of Linguistic Constraint Violations?

cs.CL · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Sparse autoencoder features in language models do not satisfy joint falsification criteria for unified grammatical violation detectors across linguistic phenomena.

Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.

The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection

cs.AI · 2026-05-09 · conditional · novelty 6.0

Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.

Tool Calling is Linearly Readable and Steerable in Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

cs.LG · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.

SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

SoftSAE replaces fixed-K sparsity in autoencoders with a learned, input-dependent number of active features via a soft top-k operator.

From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

cs.AI · 2026-05-07 · conditional · novelty 6.0

Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.

Feature Starvation as Geometric Instability in Sparse Autoencoders

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.

GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models

cs.CV · 2026-05-03 · unverdicted · novelty 6.0

GeoSAE extracts a compact, interpretable feature set from frozen brain MRI foundation models that predicts MCI-to-AD conversion (AUC 0.746) with age-deconfounded annotations and replicates across cohorts.

SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection

cs.CR · 2026-04-21 · unverdicted · novelty 6.0

SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

Improving Robustness In Sparse Autoencoders via Masked Regularization

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

Masked regularization in sparse autoencoders disrupts token co-occurrences to reduce feature absorption, enhance probing, and narrow OOD gaps across architectures and sparsity levels.

Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates

cs.CE · 2026-03-28 · unverdicted · novelty 6.0

Sparse autoencoders enable phase synchronization in frozen graph CFD surrogates through Hilbert-identified oscillatory features and SVD-based time-varying rotations.

Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach

econ.EM · 2025-11-03 · unverdicted · novelty 6.0

A new framework combines AI-derived concept embeddings with high-dimensional selective inference to enable statistically principled, interpretable discovery from unstructured data in empirical economics.

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

cs.LG · 2024-03-28 · unverdicted · novelty 6.0

Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

citing papers explorer

Showing 25 of 25 citing papers.

WriteSAE: Sparse Autoencoders for Recurrent State cs.LG · 2026-05-12 · unverdicted · none · ref 6 · 2 links · internal anchor
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift cs.LG · 2026-05-21 · unverdicted · none · ref 36 · internal anchor
GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.
To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents cs.LG · 2026-05-16 · conditional · none · ref 21 · internal anchor
LLM agents have an intrinsic over-calling bias diagnosed via SAE activation margins and corrected by adaptive margin-calibrated steering, improving overall decision accuracy.
SwordBench: Evaluating Orthogonality of Steering Image Representations cs.CV · 2026-05-10 · unverdicted · none · ref 42 · internal anchor
SwordBench benchmarks steering methods for concept removal in vision models and shows that linear SVMs achieve strong separability and orthogonality but incur collateral damage, while sparse autoencoders often perform better and no method reaches perfect steering even in simple cases.
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery cs.LG · 2026-05-10 · conditional · none · ref 8 · internal anchor
fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent latents than standard crosscoders on GPT2-Small, Pythia, and Gemma2 models.
Improving Sparse Autoencoder with Dynamic Attention cs.LG · 2026-04-16 · unverdicted · none · ref 58 · internal anchor
A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision cs.CV · 2026-04-14 · unverdicted · none · ref 29 · internal anchor
Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.
Mechanistic Interpretability with Sparse Autoencoder Neural Operators cs.LG · 2025-09-03 · unverdicted · none · ref 27 · internal anchor
SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.
Do Language Models Encode Knowledge of Linguistic Constraint Violations? cs.CL · 2026-05-12 · unverdicted · none · ref 10 · 2 links · internal anchor
Sparse autoencoder features in language models do not satisfy joint falsification criteria for unified grammatical violation detectors across linguistic phenomena.
Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure cs.LG · 2026-05-09 · unverdicted · none · ref 20 · internal anchor
Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.
The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection cs.AI · 2026-05-09 · conditional · none · ref 16 · internal anchor
Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.
Tool Calling is Linearly Readable and Steerable in Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 67 · internal anchor
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders cs.LG · 2026-05-08 · unverdicted · none · ref 20 · 2 links · internal anchor
Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark performance.
SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders cs.LG · 2026-05-07 · unverdicted · none · ref 6 · 2 links · internal anchor
SoftSAE replaces fixed-K sparsity in autoencoders with a learned, input-dependent number of active features via a soft top-k operator.
From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features cs.AI · 2026-05-07 · conditional · none · ref 6 · internal anchor
Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.
Feature Starvation as Geometric Instability in Sparse Autoencoders cs.LG · 2026-05-06 · unverdicted · none · ref 33 · internal anchor
Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global feature support under mild assumptions.
GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models cs.CV · 2026-05-03 · unverdicted · none · ref 35 · internal anchor
GeoSAE extracts a compact, interpretable feature set from frozen brain MRI foundation models that predicts MCI-to-AD conversion (AUC 0.746) with age-deconfounded annotations and replicates across cohorts.
SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection cs.CR · 2026-04-21 · unverdicted · none · ref 43 · internal anchor
SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 86 · internal anchor
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
Improving Robustness In Sparse Autoencoders via Masked Regularization cs.LG · 2026-04-07 · unverdicted · none · ref 13 · internal anchor
Masked regularization in sparse autoencoders disrupts token co-occurrences to reduce feature absorption, enhance probing, and narrow OOD gaps across architectures and sparsity levels.
Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates cs.CE · 2026-03-28 · unverdicted · none · ref 11 · internal anchor
Sparse autoencoders enable phase synchronization in frozen graph CFD surrogates through Hilbert-identified oscillatory features and SVD-based time-varying rotations.
Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach econ.EM · 2025-11-03 · unverdicted · none · ref 42 · internal anchor
A new framework combines AI-derived concept embeddings with high-dimensional selective inference to enable statistically principled, interpretable discovery from unstructured data in empirical economics.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models cs.LG · 2024-03-28 · unverdicted · none · ref 64 · internal anchor
Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models cs.CL · 2026-01-20 · unverdicted · none · ref 253 · internal anchor
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework cs.LG · 2025-09-11 · unverdicted · none · ref 28 · internal anchor
Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer