High concentration means the neuron responds cleanly to one frequency.
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.LG (1)
years: 2026 (1)
verdicts: CONDITIONAL (1)
representative citing papers
Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers
Sparse MoE FFNs redistribute computation from FFN to attention in small Transformers, driven mainly by architectural sparsity rather than learned expert specialization.