Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

· 2024 · cs.LG · arXiv 2407.14435

Mixed citation behavior. Most common role is background (50%).

40 Pith papers citing it

abstract

Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs -- where we replace the ReLU with a discontinuous JumpReLU activation function -- and are similarly efficient to train and run. By utilising straight-through-estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage.
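
The activation swap described in the abstract can be illustrated with a minimal sketch. It assumes the usual definition JumpReLU_θ(z) = z · H(z − θ), where H is the Heaviside step and θ is a per-feature threshold; the function names, sample pre-activations, and threshold values below are hypothetical, and the STE-based training procedure is not shown.

```python
from typing import List


def relu(pre_acts: List[float]) -> List[float]:
    """Standard ReLU: zero out negative pre-activations."""
    return [max(z, 0.0) for z in pre_acts]


def jump_relu(pre_acts: List[float], theta: List[float]) -> List[float]:
    """JumpReLU (assumed form): pass z through unchanged when z > theta, else output 0.

    Unlike ReLU, the output jumps from 0 to theta at the threshold,
    so the function is discontinuous there.
    """
    return [z if z > t else 0.0 for z, t in zip(pre_acts, theta)]


def l0(acts: List[float]) -> int:
    """L0 'norm': the number of non-zero feature activations."""
    return sum(1 for a in acts if a != 0.0)


# Hypothetical encoder pre-activations and per-feature thresholds.
pre_acts = [-0.5, 0.2, 0.9, 1.5]
theta = [0.5, 0.5, 0.5, 0.5]

relu_out = relu(pre_acts)
jump_out = jump_relu(pre_acts, theta)

print("ReLU    :", relu_out, "L0 =", l0(relu_out))
print("JumpReLU:", jump_out, "L0 =", l0(jump_out))
```

In this toy input the small positive pre-activation 0.2 survives ReLU but falls below the threshold under JumpReLU, so the JumpReLU output has fewer active features, while the features that do fire keep their full magnitude rather than being shrunk.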

citation-role summary

background 5 · method 4 · baseline 1

citation-polarity summary

background 5 · use method 4 · baseline 1

representative citing papers

Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability

cs.LG · 2026-07-02 · conditional · novelty 8.0

Expander SAEs apply left-d-regular expander masks to TopK SAEs, learning only dn decoder parameters instead of mn and tracing a storage-fidelity frontier that reaches 293x compression with 84% retained performance on Qwen2.5-3B.

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Rational Sparse Autoencoder

cs.LG · 2026-06-12 · unverdicted · novelty 7.0

RSAE replaces fixed SAE encoder activations (ReLU, JumpReLU, TopK) with trainable rational functions, initialized from baselines and fine-tuned to improve reconstruction and downstream metrics on language-model residual streams.

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

VFUSE applies sparse autoencoders to diffusion-transformer activations in RoseTTAFold3 and RFDiffusion3 to find monosemantic features that detect hazardous protein designs with AUROC up to 0.84.

Interpreting Brain Responses to Language with Sparse Features from Language Models

cs.CL · 2026-06-05 · unverdicted · novelty 7.0

Sparse autoencoder features from LMs plus surprisal predict fMRI language responses, recovering prior interpretations and revealing a people-tuned voxel population while showing frontal areas are surprisal-driven and general features outperform arbitrary ones.

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

cs.LG · 2026-06-04 · conditional · novelty 7.0

SASA replaces single-vector decoders in SAEs with learned subspaces plus block sparsity and nuclear-norm regularization, proving that a single group becomes the global minimizer once block size meets intrinsic dimension and yielding polynomial rather than exponential sample complexity.

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Auto-interpretation labels for SAE features generalize poorly across languages and scripts, missing the same semantic content up to 4x more often in Serbian than English and more in Cyrillic than Latin despite deterministic transliteration.

Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies

cs.IR · 2026-05-28 · unverdicted · novelty 7.0

Sparse autoencoders applied to frozen dense retrievers extract Zipfian latent vocabularies that support BM25 scoring and match or exceed the base model's performance on some tasks.

Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations

cs.LG · 2026-05-27 · conditional · novelty 7.0

SA-GSAE with Bi-Jump-ReLU enables one latent to encode both polarities of anticorrelated features, Pareto-dominating or matching full-width gated SAEs while reducing dead latents by up to 500x on some LLM hookpoints.

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

ReSAEs improve multi-layer SAE interventions on Pythia-1.4B and Gemma-2-9B by training later-layer dictionaries on residuals after affine mapping, recovering more cross-entropy loss despite lower raw variance reconstruction.

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.

To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents

cs.LG · 2026-05-16 · conditional · novelty 7.0

LLM agents have an intrinsic over-calling bias diagnosed via SAE activation margins and corrected by adaptive margin-calibrated steering, improving overall decision accuracy.

SwordBench: Evaluating Orthogonality of Steering Image Representations

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

SwordBench benchmarks steering methods for concept removal in vision models and shows that linear SVMs achieve strong separability and orthogonality but incur collateral damage, while sparse autoencoders often perform better and no method reaches perfect steering even in simple cases.

fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

cs.LG · 2026-05-10 · conditional · novelty 7.0

fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent latents than standard crosscoders on GPT2-Small, Pythia, and Gemma2 models.

Improving Sparse Autoencoder with Dynamic Attention

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.

Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

cs.LG · 2025-09-03 · unverdicted · novelty 7.0

SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

cs.LG · 2026-06-06 · unverdicted · novelty 6.0

Pre-intervention feature statistics predict SAE steering modularity (stability and collateral spread) better than baselines across multiple models and dictionaries, with model-dependent success in held-out selection.

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

TEVI applies sparse autoencoders and caption-conditioned masking to edit image embeddings, yielding better retrieval on MS COCO, Flickr, IIW, DOCCI, and RoCOCO benchmarks with larger gains on richer captions.

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

Formalizes concept learning in sparse autoencoders as set alignment between human-defined and model-induced concepts, distinguishing detection, separation, and approximation with geometric conditions for neuron representation.

DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

DOME learns sample-specific domain variables from sparse supervision via vision-language models and a sparse domain bank to improve test-time adaptation performance.

Do Language Models Encode Knowledge of Linguistic Constraint Violations?

cs.CL · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Sparse autoencoder features in language models do not satisfy joint falsification criteria for unified grammatical violation detectors across linguistic phenomena.

Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.

The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection

cs.AI · 2026-05-09 · conditional · novelty 6.0

Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.
