Neuron to graph: Interpreting language model neurons at scale

Foote, Alex; Nanda, Neel; Kran, Esben; Konstas, Ioannis; Cohen, Shay; Barez, Fazl · 2023 · arXiv 2305.19911

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Scaling and evaluating sparse autoencoders

cs.LG · 2024-06-06 · unverdicted · novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

citing papers explorer

Scaling and evaluating sparse autoencoders

cs.LG · 2024-06-06 · unverdicted · none · ref 15

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · none · ref 158
