Layerwise importance analysis of feed-forward networks in transformer-based language models. arXiv preprint arXiv:2508.17734
3 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (3)
Citing papers:
- Tapered Language Models
  Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.
- Variable-Width Transformers
  ×-shaped variable-width transformers outperform parameter-matched uniform baselines on language modeling loss with 22% fewer FLOPs and 15% smaller KV cache.
- Discovering Millions of Interpretable Features with Sparse Autoencoders
  Trains and releases SAEs for Qwen3-1.7B/4B/8B models with layer-wise coverage and demonstrates causal steering of refusal via selected features.