arxiv: 2606.23670 · v1 · pith:ETL2J34Qnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI · cs.CL

Tapered Language Models

Pith reviewed 2026-06-26 09:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords tapered language models · MLP width tapering · capacity allocation · depth asymmetry · cosine schedule · perplexity · transformer variants

The pith

Allocating more MLP capacity to earlier layers improves language model perplexity over uniform baselines under a fixed parameter budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds on prior evidence that language model layers contribute asymmetrically: early layers transform the residual stream, while later layers mostly refine it. Under a fixed parameter budget, decreasing MLP width monotonically from early to late layers via a cosine schedule yields lower perplexity and stronger downstream results than uniform width. The benefit is reported across three scales and four architectures, including standard transformers. A sympathetic reader would care because the change requires no extra parameters, compute, or training time yet consistently outperforms the default uniform design.

Core claim

Under a fixed parameter budget, monotonically tapering MLP widths from wider early layers to narrower late layers via a cosine schedule yields lower perplexity and better downstream performance than uniform-width models, and this benefit is consistent across Transformer, Gated Attention, Hope-attention, and Titans architectures at multiple scales.

What carries the argument

The cosine tapering schedule applied to MLP width, which enforces a smooth monotonic decrease in capacity across depth while keeping total parameters constant.
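
This page does not give the schedule's functional form (the referee report further down flags the same gap), so the following is only a minimal sketch of one cosine taper that satisfies the two stated properties: widths decrease monotonically with depth, and their sum stays fixed. The function name `cosine_taper_widths` and the `taper_ratio` knob are assumptions made for illustration, not the paper's parameterization.

```python
import math


def cosine_taper_widths(num_layers, mean_width, taper_ratio):
    """Real-valued MLP widths that shrink with depth along a half cosine.

    taper_ratio in [0, 1) is a hypothetical knob: the first layer gets
    mean_width * (1 + taper_ratio), the last mean_width * (1 - taper_ratio).
    """
    if num_layers < 2:
        return [float(mean_width)] * num_layers
    widths = []
    for i in range(num_layers):
        # cos runs from +1 at the first layer to -1 at the last layer
        c = math.cos(math.pi * i / (num_layers - 1))
        widths.append(mean_width * (1.0 + taper_ratio * c))
    return widths
```

The cosine terms cancel in pairs around the middle of the stack, so the widths average to `mean_width` and the total matches a uniform model of the same depth. For example, 12 layers with mean width 2048 and ratio 0.5 run from 3072 at the first layer down to 1024 at the last. Rounding these real values to integers is a separate step.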

If this is right

  • Early-heavy capacity allocation outperforms uniform allocation on perplexity.
  • The tapering benefit transfers to multiple distinct LM architectures without other changes.
  • Downstream benchmark scores improve alongside perplexity under the same schedule.
  • No increase in parameter count or training FLOPs is required to obtain the gains.
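
The last bullet comes down to simple accounting. Ignoring biases, an MLP with hidden width w on a model of width d holds a number of weights proportional to d·w, so any reallocation that keeps the sum of widths fixed also keeps the MLP weight count fixed; for dense matrix multiplies the per-token MLP FLOPs scale with that same count. A small sketch, with every dimension below hypothetical rather than taken from the paper:

```python
def mlp_params(d_model, widths, num_matrices=2):
    """Total MLP weights over all layers, ignoring biases.

    num_matrices=2 models a plain up/down MLP; 3 models a gated (GLU-style) MLP.
    """
    return sum(num_matrices * d_model * w for w in widths)


d_model = 512                       # hypothetical model width
uniform = [2048] * 4                # uniform baseline
tapered = [3072, 2560, 1536, 1024]  # hypothetical taper with the same sum
reverse = tapered[::-1]             # narrow early layers, wide late layers

same_budget = (
    mlp_params(d_model, uniform)
    == mlp_params(d_model, tapered)
    == mlp_params(d_model, reverse)
)
print(same_budget)  # True
```

The reversed list has the same total as well, which is what makes a forward-versus-reverse comparison a like-for-like test of direction.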

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same monotonic tapering principle could be tested on attention projection widths or other per-layer parameter groups.
  • The cosine schedule's specific shape might be replaced by other monotonic functions to measure sensitivity.
  • At larger scales the optimal taper ratio may shift, offering a new axis to explore alongside width and depth scaling laws.
  • Reverse tapering, which the paper's controlled experiment reports as harmful, should degrade performance at every scale and architecture if the early-layer emphasis is the true driver.

Load-bearing premise

Layers contribute non-uniformly to the output, so more capacity should go to early layers than to late ones.

What would settle it

A controlled run in which reverse tapering (narrow early layers, wide late layers) matches or beats the forward-tapered model on perplexity would falsify the directional claim.

Original abstract

Modern language models, including transformer, recurrent, and memory-based variants, share a common chassis: a stack of identical layers in which parameters are allocated uniformly across depth. This is a default inherited from the original transformer and largely unchanged since, yet a growing body of evidence suggests that layers contribute non-uniformly to the final output, with later layers refining the residual stream rather than transforming it. We ask whether parameter capacity should reflect this asymmetry. Our controlled experiment shows that, under a fixed budget, allocating more capacity to earlier layers and less to later layers improves perplexity over a uniform-width baseline, while the reverse allocation hurts. Building on this result, we introduce Tapered Language Models (TLMs), an architectural principle in which a parameter-bearing component is monotonically tapered across depth under a fixed total budget. MLPs are the natural site for this instantiation: they dominate parameter count across all modern LM families and expose width as a single, clean axis of variation. Across three model scales and four architectures (Transformer, Gated Attention, Hope-attention, and Titans), tapering MLP width via a smooth cosine schedule consistently improves perplexity and downstream benchmark performance over uniform baselines, at no additional parameter or compute cost. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic axis of language model design, a free lever hidden in plain sight.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that, under a fixed parameter budget, monotonically tapering MLP widths across depth via a cosine schedule (allocating more capacity to earlier layers) yields consistent improvements in perplexity and downstream benchmarks over uniform-width baselines, while reverse tapering hurts performance. This holds across three scales and four architectures (Transformer, Gated Attention, Hope-attention, Titans), establishing depth-aware capacity allocation as an architecture-agnostic design lever at no extra cost.

Significance. If the empirical results hold under fuller reporting, the work identifies a simple, zero-cost axis for LM design that directly tests non-uniform layer contributions via controlled ablations. The fixed-budget comparisons and cross-architecture consistency are strengths; the approach is reproducible in principle via the described schedule and could be adopted broadly if the gains prove robust.

major comments (2)
  1. [Experimental Results] The experimental protocol (methods and results sections) does not report run-to-run variance, number of seeds, or statistical tests for the reported perplexity and benchmark gains; without these, the claim of 'consistent' improvements across scales cannot be fully assessed for reliability.
  2. [Methods] The cosine tapering schedule is described as monotonic and parameter-preserving, but the methods do not specify the exact functional form (e.g., the cosine arguments or discretization to integer widths) or provide pseudocode; this detail is load-bearing for exact reproduction of the reported allocations.
minor comments (2)
  1. [Figures/Tables] Figure captions and table headers could more explicitly state that all comparisons use identical total parameter counts and matched compute.
  2. [Introduction] The abstract and introduction reference prior evidence on non-uniform layer contributions; adding 1-2 key citations would strengthen the motivation without altering the empirical focus.
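
Major comment 1 asks for seed counts, variance, and a statistical test. A minimal sketch of what such a report could compute, assuming each seed is trained once with the uniform baseline and once with the tapered model so the results can be paired. `paired_summary` is a hypothetical helper, and no numbers from the paper are used.

```python
import math
import statistics


def paired_summary(baseline_ppl, tapered_ppl):
    """Mean gain, sample std, and paired t statistic over matched seeds.

    Returns (mean, std, t_stat); t_stat is None when every seed shows
    exactly the same difference, since the statistic is undefined there.
    """
    if len(baseline_ppl) != len(tapered_ppl) or len(baseline_ppl) < 2:
        raise ValueError("need at least two matched seeds")
    # A positive difference means the tapered model has lower perplexity.
    diffs = [b - t for b, t in zip(baseline_ppl, tapered_ppl)]
    mean = statistics.mean(diffs)
    std = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean / (std / math.sqrt(len(diffs))) if std > 0 else None
    return mean, std, t_stat
```

The t statistic would still need to be compared against a Student t distribution with n − 1 degrees of freedom, and with only a handful of seeds the resulting intervals will be wide.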

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below and will incorporate the requested clarifications in the revised manuscript.

  1. Referee: [Experimental Results] The experimental protocol (methods and results sections) does not report run-to-run variance, number of seeds, or statistical tests for the reported perplexity and benchmark gains; without these, the claim of 'consistent' improvements across scales cannot be fully assessed for reliability.

    Authors: We acknowledge the value of reporting run-to-run variance for assessing reliability. All experiments used a single fixed seed per configuration for computational efficiency and reproducibility. The observed gains were replicated consistently across three scales and four distinct architectures, providing supporting evidence for the claims. In the revision we will add an explicit statement in the methods section detailing the seed usage and noting that variance was not computed due to resource constraints. revision: yes

  2. Referee: [Methods] The cosine tapering schedule is described as monotonic and parameter-preserving, but the methods do not specify the exact functional form (e.g., the cosine arguments or discretization to integer widths) or provide pseudocode; this detail is load-bearing for exact reproduction of the reported allocations.

    Authors: We agree that the precise functional form and discretization details are necessary for reproduction. The revised manuscript will include the exact cosine formula (with arguments and normalization), the integer rounding procedure that preserves total parameter count, and pseudocode for the width allocation. revision: yes
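
The rounding procedure the authors promise is not shown on this page. One standard way to discretize real-valued widths without changing their total is the largest-remainder method, sketched below as an assumption rather than as the paper's procedure. The `multiple` argument (for hardware-friendly widths such as multiples of 64) is likewise hypothetical.

```python
import math


def round_preserving_sum(widths, multiple=1):
    """Round real-valued widths to multiples of `multiple`, keeping the total.

    Largest-remainder method: floor every entry, then hand the leftover
    units to the entries with the largest fractional parts.
    """
    units = [w / multiple for w in widths]
    target = round(sum(units))
    floors = [math.floor(u) for u in units]
    leftover = target - sum(floors)
    # Sort by descending fractional part; ties go to earlier layers
    # because Python's sort is stable.
    order = sorted(range(len(units)), key=lambda i: floors[i] - units[i])
    for i in order[:leftover]:
        floors[i] += 1
    return [f * multiple for f in floors]
```

The output total equals the input total rounded to the nearest multiple of `multiple`, so a budget that is already a whole number of units is preserved exactly.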

Circularity Check

0 steps flagged

No significant circularity

The paper's core contribution consists of controlled empirical ablations that directly compare uniform-width MLPs against cosine-tapered and reverse-tapered allocations under identical total parameter budgets, measuring perplexity and downstream metrics across multiple scales and architectures. No derivation, equation, or fitted parameter is presented whose output reduces by construction to the input; the directional performance differences are the measured result rather than a renamed fit. The background premise on non-uniform layer contributions is referenced as prior evidence but is not load-bearing for the claim, which rests on the new experiments themselves.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of non-uniform layer contributions and the design choice of a cosine tapering schedule whose rate parameters are selected rather than derived.

free parameters (1)
  • cosine tapering schedule parameters
    The rate and extent of width reduction are chosen as part of the architectural design to produce the reported gains.
axioms (1)
  • domain assumption: Layers contribute non-uniformly to the final output, with later layers refining rather than transforming the residual stream
    Invoked directly in the abstract as the motivation drawn from prior evidence.

pith-pipeline@v0.9.1-grok · 5769 in / 1216 out tokens · 28603 ms · 2026-06-26T09:09:13.737790+00:00 · methodology

