Sparse MoE FFNs redistribute computation from FFN to attention in small Transformers, driven mainly by architectural sparsity rather than learned expert specialization.
22 as a weight-space analogue of the activation-based metric above
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers
Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers