pith. sign in

super hub Canonical reference

Toy Models of Superposition

Canonical reference. 85% of citing Pith papers cite this work as background.

126 Pith papers citing it
Background 85% of classified citations
abstract

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.

hub tools

citation-role summary

background 19 method 1

citation-polarity summary

claims ledger

  • abstract Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.

authors

co-cited works

clear filters

representative citing papers

When Does LeJEPA Learn a World Model?

stat.ML · 2026-05-25 · unverdicted · novelty 8.0

LeJEPA achieves linear identifiability of latent variables uniquely when the latents are Gaussian in worlds with stationary additive-noise transitions.

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 4 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

Subliminal Learning is a LoRA Artifact

cs.AI · 2026-05-30 · conditional · novelty 7.0

Subliminal learning is a LoRA artifact that disappears with full finetuning, depends on context tokens like system prompts, and localizes to overlapping finetuning-evaluation tokens.

Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

A framework using sparse autoencoders decomposes concept-level forgetting in supervised continual learning into apparent deletion, recoverability, and decodability, showing substantial recoverability under linearity and degrading decodability with added tasks.

citing papers explorer

Showing 50 of 112 citing papers after filters.