Student networks are limited to d_S * g(α) features via superposition, creating a permanent importance-weighted loss floor in distillation that cannot be overcome by training.
B Additional toy model results Sparsity effect.Higher sparsity yields lower floors at every width because g(α) packs more features per dimension (Figure 9)
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory
Student networks are limited to d_S * g(α) features via superposition, creating a permanent importance-weighted loss floor in distillation that cannot be overcome by training.