Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Pith reviewed 2026-05-17 02:13 UTC · model grok-4.3
The pith
Transformer language models can learn to dynamically allocate compute to a selected subset of tokens at each layer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By enforcing a hard cap of k tokens that may participate in the self-attention and MLP computations at any layer and selecting those tokens with a top-k router, the network learns to spend FLOPs non-uniformly across the sequence and across depth while preserving a static computation graph with known tensor sizes and a fixed overall budget.
What carries the argument
Top-k routing mechanism that selects exactly k tokens per layer for full processing, keeping total compute predictable while allowing token-level and layer-level variation.
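To make the mechanism concrete, here is a minimal PyTorch-style sketch of one MoD-style block, written under our own assumptions (the class name MoDBlock, the sigmoid gating, and the placeholder MLP stand in for details the paper specifies differently or not at all): a linear router scores every token, exactly k tokens per sequence go through the heavy computation, and the rest pass through the residual stream unchanged.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Illustrative Mixture-of-Depths-style block (a sketch, not the paper's code).

    A scalar router scores each token; only the top-k tokens per sequence are
    processed by the placeholder sub-block (standing in for attention + MLP),
    and their outputs are written back into the residual stream. Because k is
    fixed a priori, every tensor shape is known ahead of time.
    """

    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.k = k                               # hard per-layer token budget
        self.router = nn.Linear(d_model, 1)      # scalar score per token
        self.block = nn.Sequential(              # stand-in for attention + MLP
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        scores = self.router(x).squeeze(-1)                   # [batch, seq_len]
        top = torch.topk(scores, self.k, dim=-1)              # fixed k -> static shapes
        gather_idx = top.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = torch.gather(x, 1, gather_idx)             # [batch, k, d_model]

        # Scaling by the router score keeps the routing decision differentiable.
        gate = torch.sigmoid(top.values).unsqueeze(-1)        # [batch, k, 1]
        processed = selected + gate * self.block(selected)

        # Unselected tokens skip the block entirely via the residual stream.
        return x.scatter(1, gather_idx, processed)

# Example: at this layer, only 128 of 1024 tokens receive full computation.
layer = MoDBlock(d_model=512, k=128)
out = layer(torch.randn(2, 1024, 512))                        # [2, 1024, 512]
```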
If this is right
- Equivalent final performance is reached with the same training FLOPs and training wall time as uniform baselines.
- Each forward pass expends only a fraction of the FLOPs required by a standard transformer (a rough FLOPs sketch follows this list).
- Post-training sampling can step upwards of 50 percent faster because fewer tokens receive full computation.
- Total compute remains fixed and known in advance even though per-token and per-layer allocation changes with the input.
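To see why the per-forward-pass claim follows from the budget, here is a rough back-of-the-envelope calculation (the constants use standard FLOP approximations, and the 12.5% capacity in the example is an illustrative choice, not a figure from the paper): the MLP cost of a routed block scales roughly linearly in the number of participating tokens, while self-attention among the routed tokens scales roughly quadratically.

```python
def relative_block_flops(seq_len: int, k: int, d_model: int) -> float:
    """Rough per-block FLOP ratio of an MoD layer vs. a dense layer (illustrative).

    Uses common approximations: MLP ~ 16*n*d^2 FLOPs (4x expansion, two matmuls)
    and attention ~ 8*n*d^2 + 4*n^2*d FLOPs (QKVO projections plus score/value
    matmuls) for n participating tokens. The constants are simplifications and
    are not taken from the paper.
    """
    def block_flops(n: int) -> float:
        mlp = 16 * n * d_model**2
        attn = 8 * n * d_model**2 + 4 * n**2 * d_model
        return mlp + attn

    return block_flops(k) / block_flops(seq_len)

# Routing 12.5% of a 2048-token sequence through a d_model=1024 block:
print(f"{relative_block_flops(seq_len=2048, k=256, d_model=1024):.3f}")  # ~0.098
```

Because k is fixed per layer, this ratio, and hence the total compute, is known before the forward pass; only which tokens fill the budget changes with the input.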
Where Pith is reading between the lines
- The method could extend naturally to variable-length or very long sequences by concentrating compute on the most relevant positions.
- Routing statistics might later be inspected to discover which token types or positions consistently receive more compute.
- Layer-wise k schedules or joint routing across attention and MLP could be explored as further refinements.
- The same budgeted top-k idea could be tested in non-transformer sequence models for similar efficiency gains.
Load-bearing premise
A learned top-k router can reliably choose which tokens deserve full processing at each layer without losing model capacity or introducing training instabilities.
What would settle it
Train a Mixture-of-Depths model and a standard transformer to identical total training FLOPs and wall-clock time; if the Mixture-of-Depths version shows lower accuracy on held-out language modeling or generation benchmarks, the claim is falsified.
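One way to operationalize this test, as a toy sketch under our own assumptions (the two-standard-error threshold and the use of multiple seeds are ours, echoing the referee's later request for error bars):

```python
import statistics

def claim_falsified(mod_eval_losses, dense_eval_losses, num_std: float = 2.0) -> bool:
    """Toy decision rule for the matched-budget test above (illustrative, not the paper's).

    Each list holds held-out losses from independent seeds of models trained to the
    same total FLOPs and wall-clock budget. The claim is treated as falsified only if
    the Mixture-of-Depths mean loss is worse than the dense mean by more than
    `num_std` pooled standard errors; the threshold choice is an assumption.
    """
    mod_mean = statistics.mean(mod_eval_losses)
    dense_mean = statistics.mean(dense_eval_losses)
    pooled_se = (statistics.stdev(mod_eval_losses) ** 2 / len(mod_eval_losses)
                 + statistics.stdev(dense_eval_losses) ** 2 / len(dense_eval_losses)) ** 0.5
    return (mod_mean - dense_mean) > num_std * pooled_se
```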
read the original abstract
Transformer-based language models spread FLOPs uniformly across input sequences. In this work we demonstrate that transformers can instead learn to dynamically allocate FLOPs (or compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens ($k$) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-$k$ routing mechanism. Since $k$ is defined a priori, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional computation techniques. Nevertheless, since the identities of the $k$ tokens are fluid, this method can expend FLOPs non-uniformly across the time and model depth dimensions. Thus, compute expenditure is entirely predictable in sum total, but dynamic and context-sensitive at the token-level. Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPS and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mixture-of-Depths (MoD), a transformer variant that dynamically allocates compute by using a learned top-k router to select a fixed number k of tokens for full self-attention and MLP processing at each layer. This enforces a static per-layer compute budget while allowing context-sensitive, non-uniform FLOPs expenditure across sequence positions and model depth. The central claim is that MoD models match the performance of dense baselines at equivalent total FLOPs and wall-clock training time, yet deliver lower per-forward-pass FLOPs and upwards of 50% faster stepping during post-training sampling.
Significance. If the empirical results hold under rigorous controls, the work offers a practical route to more efficient transformer scaling by demonstrating that models can learn to route compute away from less informative tokens without capacity loss. The static computation graph and fixed-k constraint are pragmatic advantages over many conditional-computation alternatives. Credit is due for the reproducible experimental protocol and the demonstration that dynamic allocation emerges from standard training.
major comments (3)
- [§3.2] §3.2 (Routing and Gradient Estimation): The description of the top-k selection and its gradient estimator (straight-through or equivalent) provides no analysis of gradient variance, bias, or stability across training. Because the router must learn input-dependent decisions for the performance-matching claim to be attributable to intelligent allocation rather than uniform capacity reduction, this omission is load-bearing; high-variance updates could prevent reliable context-sensitive routing.
- [§5] §5 (Experimental Results): The reported performance parity lacks error bars, number of independent runs, and exhaustive baseline controls (exact hyperparameter matching, model scale, and data regime). Without these, it is impossible to determine whether observed equivalence lies within statistical noise or reflects genuine preservation of model capacity under reduced per-token compute.
- [§4.3] §4.3 (Ablations on k and Routing): The paper does not include controls that isolate whether the router learns meaningful, input-dependent allocations versus converging to a near-constant selection pattern. Such an ablation is required to substantiate the claim that compute is allocated 'optimising the allocation along the sequence for different layers'.
minor comments (2)
- [Figure 2] Figure 2: The visualization of per-layer token selection would benefit from an additional panel showing the entropy or variance of router decisions across examples to illustrate context sensitivity (a sketch of one such entropy computation follows this list).
- [§2] Notation: The symbol k is used both for the global budget and per-layer selection; a brief clarification in §2 would prevent reader confusion.
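As an illustration of the diagnostic the Figure 2 comment asks for, here is a short sketch under our own assumptions (the array layout and the choice of binary entropy are ours): from a boolean record of which tokens each layer routed, the per-position selection frequency across examples gives an entropy that is near zero bits for a fixed pattern and approaches one bit for highly context-dependent routing.

```python
import numpy as np

def selection_entropy_per_layer(selected: np.ndarray) -> np.ndarray:
    """Per-layer entropy of router decisions across examples (an illustrative metric).

    selected: boolean array of shape [num_examples, num_layers, seq_len], True where
    a token was routed through the full block. For each layer and position we take
    the empirical selection probability across examples, then average the binary
    entropy over positions: ~0 bits means a near-constant pattern, values near
    1 bit mean highly input-dependent selection.
    """
    p = selected.mean(axis=0)                              # [num_layers, seq_len]
    eps = 1e-12                                            # avoids log(0)
    h = -(p * np.log2(p + eps) + (1 - p) * np.log2(1 - p + eps))
    return h.mean(axis=-1)                                 # one value per layer
```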
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive feedback on our work. We appreciate the focus on strengthening the analysis of the routing mechanism, experimental rigor, and ablations. We address each major comment below and will incorporate revisions to improve the manuscript.
read point-by-point responses
- Referee: [§3.2] §3.2 (Routing and Gradient Estimation): The description of the top-k selection and its gradient estimator (straight-through or equivalent) provides no analysis of gradient variance, bias, or stability across training. Because the router must learn input-dependent decisions for the performance-matching claim to be attributable to intelligent allocation rather than uniform capacity reduction, this omission is load-bearing; high-variance updates could prevent reliable context-sensitive routing.
Authors: We agree that a more detailed examination of the gradient estimator would strengthen the paper. The top-k routing employs a straight-through estimator to enable end-to-end training. In the revised manuscript we will add an analysis (including empirical measurements of gradient variance and stability across training steps) in §3.2 or an appendix, along with discussion of why the estimator supports reliable context-sensitive routing in practice. revision: yes
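For readers wondering what such an estimator can look like in practice, the following is one common construction, sketched under our own assumptions (the response above names a straight-through estimator, but this is not necessarily the paper's exact implementation): the forward pass uses the hard top-k mask, while gradients flow through a soft function of the router scores.

```python
import torch

def straight_through_topk_gate(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Hard top-k mask in the forward pass, soft gradients in the backward pass.

    scores: [batch, seq_len] raw router logits. The returned gate is exactly 0/1
    when evaluated, but its gradient is that of sigmoid(scores) -- the usual
    straight-through trick. Illustrative only; not necessarily the paper's estimator.
    """
    soft = torch.sigmoid(scores)
    hard = torch.zeros_like(scores)
    hard.scatter_(-1, torch.topk(scores, k, dim=-1).indices, 1.0)
    return hard + soft - soft.detach()   # value == hard, gradient == d(soft)/d(scores)

# Gradients reach the router even though the forward mask is discrete:
scores = torch.randn(2, 16, requires_grad=True)
gate = straight_through_topk_gate(scores, k=4)   # entries in {0.0, 1.0}
gate.sum().backward()                            # scores.grad is populated via sigmoid
```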
- Referee: [§5] §5 (Experimental Results): The reported performance parity lacks error bars, number of independent runs, and exhaustive baseline controls (exact hyperparameter matching, model scale, and data regime). Without these, it is impossible to determine whether observed equivalence lies within statistical noise or reflects genuine preservation of model capacity under reduced per-token compute.
Authors: We acknowledge the importance of statistical reporting. While the original experiments used consistent hyperparameters, model scales, and data regimes matched to the dense baselines, variance was not reported. In the revision we will add error bars derived from multiple independent runs and expand the description of baseline controls to make the performance parity claims more robust. revision: yes
- Referee: [§4.3] §4.3 (Ablations on k and Routing): The paper does not include controls that isolate whether the router learns meaningful, input-dependent allocations versus converging to a near-constant selection pattern. Such an ablation is required to substantiate the claim that compute is allocated 'optimising the allocation along the sequence for different layers'.
Authors: We agree that isolating the contribution of learned, input-dependent routing is valuable. In the revised version we will expand §4.3 with ablations that compare the learned router against random selection and fixed-pattern baselines (e.g., always selecting the first k tokens). We will also add visualizations of per-layer routing decisions across sequences to demonstrate context-sensitive allocation. revision: yes
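To make the proposed controls concrete, here is a minimal sketch of the two baseline selectors the response mentions (the function names and shapes are our own; either index set would simply replace the learned router's top-k indices in an MoD-style block).

```python
import torch

def first_k_indices(batch_size: int, seq_len: int, k: int) -> torch.Tensor:
    """Fixed-pattern control: always route the first k positions of every sequence."""
    assert k <= seq_len
    return torch.arange(k).expand(batch_size, k)

def random_k_indices(batch_size: int, seq_len: int, k: int) -> torch.Tensor:
    """Random control: route k positions drawn uniformly, independently per example."""
    return torch.stack([torch.randperm(seq_len)[:k] for _ in range(batch_size)])
```

If a model trained with either control matches the learned router, the routing is not meaningfully input-dependent; a gap in favour of the learned router supports the claimed context-sensitive allocation.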
Circularity Check
No circularity: empirical method with independent experimental validation
full rationale
The paper introduces an architectural modification (top-k routing to enforce per-layer compute caps) and validates it through training runs that match baseline performance at equivalent total FLOPs while reducing per-forward-pass compute. No equations, derivations, or self-citations are used to derive the efficiency claims; results are obtained by direct comparison to standard transformer baselines under controlled training budgets. The claims rest on comparisons against external baselines rather than on the method's own constructions, and no claimed outcome reduces to a fitted parameter or a prior result by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- k (tokens processed per layer)
axioms (1)
- domain assumption: a learned router, trained with standard backpropagation, can select useful tokens for full processing
Forward citations
Cited by 18 Pith papers
- Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
  Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
- Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
  Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
- Depth Adaptive Efficient Visual Autoregressive Modeling
  DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
- Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
  Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to reta...
- Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
  Routing topology in sparse Mixture-of-Experts models does not determine asymptotic language modeling perplexity; multiple variants including cosine-similarity routing achieve statistically equivalent performance.
- Path-Constrained Mixture-of-Experts
  PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
- MIDUS: Memory-Infused Depth Up-Scaling
  MIDUS replaces duplicated FFN branches in depth up-scaling with head-wise memory layers using product-key retrieval and HIVE to deliver lightweight, head-conditioned residual capacity.
- N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
  N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.
- Sparse Layers are Critical to Scaling Looped Language Models
  Looped MoE models scale better than standard transformers because different experts activate on each loop pass, recovering expressivity without extra parameters, and support superior early exits.
- Gated Subspace Inference for Transformer Acceleration
  Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
- DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion
  DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with...
- Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts
  Expert upcycling expands MoE models by duplicating experts and continuing pre-training, matching baseline performance while saving 32% GPU hours in 7B-13B experiments.
- Transformer Architecture with Minimal Inference Latency for Multi-Modal Wireless Networks
  A token-routing multi-modal transformer reduces inference latency by 86.2%, GPU memory by 35%, and FLOPs by 80% for beamforming tasks with negligible accuracy loss while enabling proactive handover on a real testbed dataset.
- Parcae: Scaling Laws For Stable Looped Language Models
  Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
- When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models
  AMOR uses output entropy to gate attention in recurrent hybrids, matching full attention performance at roughly 22% attention invocations across 180M-1.5B models.
- Cascade Token Selection for Transformer Attention Acceleration
  Cascade token selection inherits and updates a small set of representative tokens across layers using cross-Gram validation, reducing selection cost from O(T²d) to O(Trd) per layer with observed Gram savings of 22-63%...
- Revisiting Auxiliary Losses for Conditional Depth Routing: An Empirical Study
  Removing utility regression and rank supervision auxiliary losses improves language modeling performance and training efficiency for conditional depth routing gates, and eliminates the advantage of a more complex JEPA...
- Adaptive Computation Depth via Learned Token Routing in Transformers
  TSA adds end-to-end differentiable per-token halting gates to transformers, enabling learned adaptive depth that saves 14-23% token-layer operations with under 0.5% quality loss on language modeling.
Reference graph
Works this paper leans on
- [2]
- [3] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819.
- [5]
- [7] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H. Sung, and Y. Yang. LongT5: Efficient text-to-text transformer for long sequences.
- [8] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366.
- [9] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.
- [10]
- [11] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
- [12] The state and fate of linguistic diversity and inclusion in the NLP world. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-srw.23. URL https://aclanthology.org/2021.acl-srw.23. Y. Tay, M. Dehghani, D. Bahri, and D. Metzler. Efficient transformers: A survey. CoRR, abs/2009.06732.
- [13]
- [14] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, and W. Fedus. ST-MoE: Designing stable and transferable sparse expert models.