When Does Sparsity Mitigate the Curse of Depth in LLMs

Dilxat Muhtar; Max Zimmer; Nico Pelleriti; Sebastian Pokutta; Shiwei Liu; Thomas Hofmann; Xinyuan Song

arxiv: 2603.15389 · v2 · pith:7MKHZY73new · submitted 2026-03-16 · 💻 cs.CL

classification 💻 cs.CL

keywords sparsity, depth, LLMs, variance, attention, curse, design, effective

Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we provide evidence that sparsity-like mechanisms can dampen variance propagation and are associated with improved depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long-context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer effectiveness interventions. Across settings, we observe a consistent relationship: mechanisms with reduced effective interaction density tend to exhibit lower output variance and better layer differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6 accuracy improvement on downstream tasks. Our results suggest that sparsity-like design choices are an important and previously underemphasized factor in effective depth scaling for LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.
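
The variance argument in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' code or experimental setup. It assumes a stack of random linear residual blocks with Pre-LN-style normalization, and models weight sparsity as a random mask with no rescaling. The function name `residual_variance` and all parameters are hypothetical. It only shows that, in this simplified setting, the variance of the residual stream accumulates with depth and that a lower weight density slows the accumulation.

```python
import numpy as np


def residual_variance(depth=32, width=256, density=1.0, seed=0):
    """Track residual-stream variance through a toy Pre-LN stack.

    Assumption: each block is a single random linear map (no attention,
    no nonlinearity), and sparsity is a random mask applied to the
    weights without rescaling.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((64, width))  # batch of 64 token vectors
    variances = []
    for _ in range(depth):
        # Pre-LN: normalize the block input, leave the residual stream untouched.
        mean = x.mean(axis=1, keepdims=True)
        std = x.std(axis=1, keepdims=True)
        h = (x - mean) / (std + 1e-6)
        # Random weights scaled so a dense block emits roughly unit variance.
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        # Keep each weight with probability `density`.
        mask = rng.random((width, width)) < density
        # Residual update: the block output is added to the unnormalized stream.
        x = x + h @ (w * mask)
        variances.append(float(x.var()))
    return variances


if __name__ == "__main__":
    dense = residual_variance(density=1.0)
    sparse = residual_variance(density=0.25)
    print("final variance, dense :", round(dense[-1], 2))
    print("final variance, sparse:", round(sparse[-1], 2))
```

In this toy, each block adds a roughly constant amount of variance while the stream keeps growing, so the relative contribution of later blocks shrinks. That is the near-identity behavior the abstract describes. The masked variant adds less per block, so the effect sets in more slowly.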

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Motion-Aware Caching for Efficient Autoregressive Video Generation
cs.CV 2026-05 conditional novelty 6.0

MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.

From Words to Amino Acids: Does the Curse of Depth Persist?
cs.LG 2026-02 unverdicted novelty 6.0

Protein language models exhibit consistent depth inefficiency where most task-relevant computation occurs in a subset of layers, mirroring patterns in large language models.

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
cs.LG 2026-05 unverdicted novelty 5.0

Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

Motion-Aware Caching for Efficient Autoregressive Video Generation
cs.CV 2026-05 unverdicted novelty 5.0

MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.