B Additional toy model results. Sparsity effect. Higher sparsity yields lower floors at every width because g(α) packs more features per dimension (Figure 9).

A Sparsity capacity function (columns: α, 1−α, g(α), Features/dim) 0 · 2048

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory

cs.LG · 2026-04-05 · conditional · novelty 6.0 · ref 6

Student networks are limited to d_S * g(α) features via superposition, creating a permanent importance-weighted loss floor in distillation that cannot be overcome by training.

