Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

URLhttps://arxiv · 2024 · DOI 10.18653/v1/2024.blackboxnlp-1.19

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

open at publisher browse 7 citing papers

citation-role summary

background 1 dataset 1 method 1

citation-polarity summary

background 1 use dataset 1 use method 1

representative citing papers

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

cs.LG · 2026-05-11 · unverdicted · novelty 8.0

Manifold curvature and intrinsic dimension predict layerwise SAE width exponents and asymptotic floors across Gemma models, with cross-model transfer of the geometric regression, establishing a transferable geometric law instead of a universal scaling law.

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.

The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.

Minimizing Collateral Damage in Activation Steering

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.

BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking

cs.LG · 2026-01-31 · unverdicted · novelty 6.0

Blocking a fixed set of latent features during fine-tuning reduces emergent misalignment by up to 95% across six domains with no loss in target task performance.

ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations

cs.LG · 2026-04-08 · unverdicted · novelty 5.0

ConceptTracer supplies an interactive interface and saliency/selectivity metrics to locate concept-responsive neurons in neural representations, shown on TabPFN.

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 4.0

Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

citing papers explorer

Showing 7 of 7 citing papers.

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws cs.LG · 2026-05-11 · unverdicted · none · ref 7
Manifold curvature and intrinsic dimension predict layerwise SAE width exponents and asymptotic floors across Gemma models, with cross-model transfer of the geometric regression, establishing a transferable geometric law instead of a universal scaling law.
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models cs.AI · 2026-04-30 · unverdicted · none · ref 31
LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.
The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans cs.CL · 2026-05-09 · unverdicted · none · ref 42
LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.
Minimizing Collateral Damage in Activation Steering cs.LG · 2026-05-01 · unverdicted · none · ref 49
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking cs.LG · 2026-01-31 · unverdicted · none · ref 5
Blocking a fixed set of latent features during fine-tuning reduces emergent misalignment by up to 95% across six domains with no loss in target task performance.
ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations cs.LG · 2026-04-08 · unverdicted · none · ref 19
ConceptTracer supplies an interactive interface and saliency/selectivity metrics to locate concept-responsive neurons in neural representations, shown on TabPFN.
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 2
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer