Inference-Time Decomposition of Activations (ITDA): A scalable approach to interpreting large language models. arXiv preprint arXiv:2505.17769

Patrick Leask, Neel Nanda, Noura Al Moubayed · arXiv 2505.17769

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

cs.LG · 2026-06-04 · conditional · novelty 7.0

SASA replaces single-vector decoders in SAEs with learned subspaces plus block sparsity and nuclear-norm regularization, proving that a single group becomes the global minimizer once block size meets intrinsic dimension and yielding polynomial rather than exponential sample complexity.

ICA Lens: Interpreting Language Models Without Training Another Dictionary

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

ICALens applies an optimized ICA workflow to LLM activations and recovers compact interpretable directions that match or exceed public SAEs on SAEBench probing and perturbation tasks without per-layer dictionary training.

citing papers explorer


Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

cs.LG · 2026-06-04 · conditional · none · ref 31
