Residual Stream Analysis with Multi-Layer SAEs

Lawson, Tim, Farnik, Lucy, Houghton, Conor, Aitchison, Laurence · 2024 · arXiv 2409.04185

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.

PRISM: Recovering Instruction Sets from Language Model Activations

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a new activation-conditioned model that recovers full sets of simultaneous instructions from LLM hidden states via judge-guided GRPO training and outperforms prior activation-to-language methods on security-relevant tasks.

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Auto-interpretation labels for SAE features generalize poorly across languages and scripts, missing the same semantic content up to 4x more often in Serbian than English and more in Cyrillic than Latin despite deterministic transliteration.

Law of Neural Interaction: Depth-Width Shape, Interaction Efficiency, and Generalization

cs.LG · 2026-05-27 · unverdicted · novelty 5.0

Tuning the depth-width ratio positions models in an efficient neural interaction interval that correlates with better generalization under fixed budgets and remains stable with scale.

citing papers explorer


PRISM: Recovering Instruction Sets from Language Model Activations cs.AI · 2026-06-08 · unverdicted · none · ref 27
