Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.
Delight: Deep and light-weight transformer.arXiv preprint arXiv:2008.00623,
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2representative citing papers
×-shaped variable-width transformers outperform parameter-matched uniform baselines on language modeling loss with 22% fewer FLOPs and 15% smaller KV cache.
citing papers explorer
-
Tapered Language Models
Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.
-
Variable-Width Transformers
×-shaped variable-width transformers outperform parameter-matched uniform baselines on language modeling loss with 22% fewer FLOPs and 15% smaller KV cache.