Dynamic sparsity in tree-structured feed-forward layers at scale

Anand Subramoney; David Kappel; Reza Sedghi; Robin Schiewer

arxiv: 2604.08565 · v1 · submitted 2026-03-18 · 💻 cs.CL · cs.AI· cs.LG

Dynamic sparsity in tree-structured feed-forward layers at scale

Reza Sedghi , Robin Schiewer , Anand Subramoney , David Kappel This is my paper

Pith reviewed 2026-05-15 10:03 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords dynamic sparsitytree-structured layerstransformerconditional computationfeed-forward networkslanguage modelingauto-pruning

0 comments

The pith

Tree-structured feed-forward layers match dense baselines while activating fewer than 5% of units per token.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that sparse tree-structured feed-forward layers can serve as drop-in replacements for dense MLP blocks in transformers. These layers use hard hierarchical routing to activate only a small fraction of units conditionally for each token, without requiring a separate router network. Despite this extreme sparsity, the models achieve comparable performance to dense counterparts in large-scale autoregressive language modeling and downstream tasks like question answering, including zero- and few-shot settings, and scale beyond 1 billion parameters. The work also identifies an emergent auto-pruning effect during training that can be controlled through architectural choices to maintain balanced activation paths.

Core claim

Tree-structured feed-forward layers provide a scalable mechanism for dynamic sparsity in transformers by organizing units into a hierarchy that supports hard conditional routing based on input tokens. This approach eliminates the need for an auxiliary router while enabling activation of under 5% of units per token. The models match dense baseline performance under controlled training protocols for language modeling and question answering tasks. Training dynamics reveal an emergent auto-pruning where asymmetric nonlinearities progressively deactivate unused paths, partially converting the dynamic routing into static structural sparsity, which can be modulated by simple architectural choices.

What carries the argument

Tree-structured feed-forward layer with hard hierarchical routing, which organizes the feed-forward units into a tree allowing conditional computation by selecting activation paths based on token representations without a dedicated router.

If this is right

Scalable application to models beyond 1B parameters for autoregressive language modeling.
Matching dense model performance on downstream question answering in zero- and few-shot settings.
Emergent auto-pruning progressively deactivates unused paths during training.
Simple architectural choices recover balanced trees without auxiliary losses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Expanding tree depth or width could increase effective capacity within a fixed compute budget.
The partial conversion to static sparsity may produce more interpretable sub-networks.
This routing style could combine with other conditional methods to further cut overhead.

Load-bearing premise

Hard hierarchical routing can be trained stably at scale without a separate router network and without the emergent auto-pruning degrading final performance on downstream tasks.

What would settle it

Training tree-structured and dense models at over 1B parameters under identical protocols and checking whether the sparse models reach equivalent perplexity on a standard language modeling test set or accuracy on zero-shot QA benchmarks.

Figures

Figures reproduced from arXiv: 2604.08565 by Anand Subramoney, David Kappel, Reza Sedghi, Robin Schiewer.

**Figure 1.** Figure 1: MLP layers and sparse replacements. a) The standard MLP feed-forward block. b) A top-k MoE block, featuring a router that recruits experts (linear-activation-linear sub-networks) on a per-token level. Here, k = 2 experts process each input token and their outputs are integrated via a weighted sum. c) An FFF block that contains P = 4 parallel trees of depth D = 2, shown in yellow. Tree output is integrated … view at source ↗

**Figure 2.** Figure 2: Training loss curves for 125M-parameter GPT-style models comparing dense feed-forward (FF), Fast Feedforward (FFF), and sparsity-matched Mixture-of-Experts (MoE) layers. Forward computation. For each input sample x ∈ R din and each tree p ∈ {1, . . . , P}, traversal starts at the root s0 = 0. For ℓ = 0, . . . , D, exactly one node is evaluated: zp,ℓ(x) = [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 4.** Figure 4: Validation accuracy under statistical path pruning. We evaluate different fixed pruning ratios by statistically pruning paths and measuring validation accuracy. Each curve corresponds to a different sparsity (static, permanent) ratio. ing of the surrounding linear layers. In Appendix G we show through a detailed empirical and analytical investigation that this is in fact the case: the combination of hard … view at source ↗

**Figure 3.** Figure 3: Utilization in a d=5 tree. Path utilization as the normalized fraction of samples visiting each path for FFF (left) and FFFpost (right). The x-axis denotes the leaf node index. We measured tree utilization by recording how frequently each path (or subtree) is visited during training and inference. Across different model sizes, tree depths, datasets, and tasks, we observe a consistent pattern of severe im… view at source ↗

**Figure 5.** Figure 5: Relative speedup and FLOPS/s for all configurations, ranging from depth 0 (dense FF) to depth 11. Results are averaged over five runs. d indicate depth of tree. feed-forward baseline under identical conditions. As shown in [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: shows two training regimes. The left panel reports training from scratch for three configurations: a dense feedforward (FF) baseline and two Fast Feedforward (FFF) variants with tree depths 4 and 6. All models are trained using the same pipeline and budget. The curves demonstrate that both FFF configurations closely track the dense baseline throughout training, indicating stable optimization even at high … view at source ↗

**Figure 7.** Figure 7: Training loss curves for OPT-style models with 2.7B and 6.7B parameters trained from scratch on a reduced SlimPajama subset. For each model size, we compare a dense feed-forward baseline with a Forest Router configuration of depth 4 (approximately 84% sparsity). Due to compute constraints, these experiments are conducted on a limited dataset and are intended to evaluate scalability and training stability r… view at source ↗

**Figure 8.** Figure 8: Extension of Fast Feedforward (FFF) layers to object detection with DETR on the CPPE-5 dataset. Right: evaluation performance measured by mAP@50, defined as mean Average Precision at an IoU threshold of 0.5. Left: training loss curves. Results are averaged over multiple runs with different random seeds. FFF-based models achieve performance comparable to dense feed-forward baselines under a fixed training b… view at source ↗

**Figure 9.** Figure 9: Empirical analysis of auto-pruning dynamics in a toy setting. Left: The gradient signal value received by a randomly selected tree node during backpropagation during training, showing persistent negative values after an initial training phase. Right: Histogram of the corresponding node output activations, which follow an approximately zero-mean normal distribution. Both measurements are computed using batc… view at source ↗

**Figure 10.** Figure 10: Exploiting static sparsity in OPT-125M model. G.6. Summary Together, these results show that auto-pruning arises from a specific interaction between hard binary routing and the asymmetric GELU nonlinearity in the pre-activation design. This mechanism naturally transforms dynamic routing into stable structural sparsity during training, without explicit pruning rules or auxiliary losses. H. Implementation a… view at source ↗

**Figure 11.** Figure 11: Binary trees with different branch weights and corresponding induced path distributions [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: Visualization of FFF-induced space partitioning on a 2D checkered-field classification task. Left: routing-induced decision boundaries mapped back to the input space; each split corresponds to a node-level routing decision, and the combined boundaries show how the forest partitions the plane into routed regions. Right: PCA projection of representations grouped by routing outcomes; colors indicate distinct… view at source ↗

read the original abstract

At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied for autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and its scalability beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, yielding partial conversion of dynamic routing into static structural sparsity. We show that simple architectural choices can modulate this behavior and recover balanced trees without auxiliary losses. Overall, our work demonstrates that tree-structured feed-forward layers provide a scalable and controllable mechanism for sparsifying large transformer models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Tree-structured hard-routed FFNs match dense transformer performance at >1B scale with <5% activation per token, and the authors control the auto-pruning effect with simple architectural fixes.

read the letter

The main takeaway is that this paper shows tree-structured feed-forward layers with direct hard hierarchical routing can replace dense MLPs in transformers beyond 1B parameters. They report matching perplexity on autoregressive language modeling and comparable zero- and few-shot QA results while activating under 5% of units per token, all without a separate router network. That combination at this scale is new relative to earlier sparse MLP work. They also document the training dynamics clearly enough to identify the emergent auto-pruning that turns some dynamic paths static through asymmetric nonlinearities, then demonstrate that basic changes like nonlinearity adjustments recover balanced trees without auxiliary losses. That practical control is useful. The comparison to dense baselines under controlled protocols holds up on the reported numbers, and the absence of fitted parameters or circular derivations keeps the claims straightforward. The soft spot is that the headline match depends on the routing staying meaningfully dynamic at the final checkpoints; without published path-usage histograms or entropy curves at 1B+, it is still possible the effective capacity is lower than the dense reference once pruning settles. The stress-test worry about collapse is addressed in the paper but would benefit from those extra diagnostics. This work is for researchers focused on conditional computation and efficient scaling of transformers. It shows honest engagement with the architecture and its side effects, so it deserves a serious referee to verify the scale results and request the routing statistics.

Referee Report

3 major / 2 minor

Summary. The paper proposes replacing dense MLP blocks in transformers with tree-structured feed-forward layers that perform hard hierarchical routing directly from input projections, without a separate router network. This enables dynamic conditional computation, activating fewer than 5% of units per token. The central empirical claim is that such models match dense baselines in autoregressive language modeling and zero/few-shot QA at scales beyond 1B parameters under controlled training protocols. The work also analyzes an emergent auto-pruning effect arising from hard routing and asymmetric nonlinearities, and shows that simple architectural choices can recover balanced dynamic trees without auxiliary losses.

Significance. If the results hold under full verification, the work provides a router-free mechanism for scalable dynamic sparsity in transformers that reduces active compute while preserving performance, addressing a key bottleneck in large-model efficiency. The identification and control of auto-pruning offers a concrete handle on training dynamics of sparse conditional architectures, which could inform future designs of hierarchical sparse networks.

major comments (3)

[Abstract and §4] Abstract and §4 (experiments): The headline claim that models 'match dense baselines' while activating <5% units requires quantitative support including exact perplexity deltas, downstream accuracy numbers, baseline parameter counts, training curves, and statistical significance tests; without these, it is impossible to confirm that effective capacity remains comparable given the documented auto-pruning.
[§3 and §5] §3 (architecture) and §5 (training dynamics): The description of hard hierarchical routing derived directly from input projections (no learned router) must include explicit equations or pseudocode for the routing decision at each tree node; absent this, it is unclear how the mechanism avoids the documented risk of asymmetric nonlinearities driving progressive branch deactivation and collapse to static sparsity.
[§5.1] §5.1 (auto-pruning analysis): The claim that architectural choices recover balanced trees needs supporting evidence such as routing entropy curves, path-usage histograms, or per-layer activation statistics on final 1B+ checkpoints; without these, the central assertion that dynamic behavior is preserved (rather than silently converting to lower-capacity static sparsity) remains unverified.

minor comments (2)

[Figures] Figure captions and axis labels should explicitly state the scale (e.g., 1B+ parameter models) and the exact dense baseline architecture used for comparison.
[Abstract and §2] The abstract states 'for the first time' applicability to autoregressive LM and QA; a brief related-work paragraph contrasting with prior router-based MoE or tree-structured methods would strengthen context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results, architecture, and analysis.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (experiments): The headline claim that models 'match dense baselines' while activating <5% units requires quantitative support including exact perplexity deltas, downstream accuracy numbers, baseline parameter counts, training curves, and statistical significance tests; without these, it is impossible to confirm that effective capacity remains comparable given the documented auto-pruning.

Authors: We agree that additional quantitative detail will improve verifiability. The manuscript already reports aggregate matching performance in Section 4 under controlled protocols, but the revision will add a dedicated table listing exact perplexity values and deltas for language modeling, zero- and few-shot QA accuracies, baseline and sparse model parameter counts, and explicit references to the training curves already present in the appendix. Where multiple random seeds were run we will report standard deviations to address statistical significance. revision: yes
Referee: [§3 and §5] §3 (architecture) and §5 (training dynamics): The description of hard hierarchical routing derived directly from input projections (no learned router) must include explicit equations or pseudocode for the routing decision at each tree node; absent this, it is unclear how the mechanism avoids the documented risk of asymmetric nonlinearities driving progressive branch deactivation and collapse to static sparsity.

Authors: We appreciate the request for greater formal precision. Section 3 describes the router-free routing but the revision will insert explicit equations defining the per-node selection (input projection followed by hard top-k or threshold decision) together with pseudocode for the full forward pass through the tree. This will also cross-reference the Section 5 analysis of how asymmetric nonlinearities interact with the routing to limit collapse. revision: yes
Referee: [§5.1] §5.1 (auto-pruning analysis): The claim that architectural choices recover balanced trees needs supporting evidence such as routing entropy curves, path-usage histograms, or per-layer activation statistics on final 1B+ checkpoints; without these, the central assertion that dynamic behavior is preserved (rather than silently converting to lower-capacity static sparsity) remains unverified.

Authors: We agree that direct visualizations of routing dynamics will strengthen the claim. The revision will augment Section 5.1 with routing-entropy curves over training, path-usage histograms, and per-layer activation statistics computed on the final 1B+ checkpoints, confirming that the chosen architectural modifications maintain dynamic tree balance rather than collapsing to static sparsity. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical architectural demonstration

full rationale

The paper contains no mathematical derivations, fitted parameters presented as predictions, or load-bearing self-citations that reduce claims to inputs by construction. All central results (matching dense baselines at <5% activation, scalability to >1B parameters, and modulation of emergent auto-pruning) are established via controlled training runs and ablation experiments on language modeling and QA tasks. The architecture description and training dynamics analysis are self-contained and externally falsifiable through replication of the reported protocols; no uniqueness theorems, ansatzes, or renamings of known results are invoked to support the headline claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are stated beyond standard transformer assumptions such as the existence of feed-forward blocks and standard optimizers.

pith-pipeline@v0.9.0 · 5500 in / 1110 out tokens · 56760 ms · 2026-05-15T10:03:51.937686+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL https://aclantho logy.org/N19-1300/. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. Dagli, R. and Shaikh, A. M. Cppe-5:...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1300 2018
[2]

S table M o E : Stable routing strategy for mixture of experts

Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.489. URL https://acla nthology.org/2022.acl-long.489/. Dua, D., Wang, Y ., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Burstein, J., Doran, C., and Solorio, T. (eds.),Pro- ceedings of ...

work page doi:10.18653/v1/2022.acl-long.489 2022
[3]

ISBN 9781713871088

Curran Associates Inc. ISBN 9781713871088. 11 Dynamic sparsity in tree-structured feed-forward layers at scale A. Additional Perplexity Results This appendix provides additional evaluation results. Table 6 reports perplexity (PPL) values for various OPT-125M parameter model configurations trained from scratch. Table 6.Per-run scalar PPL aggregated by stra...

work page arXiv 2016
[4]

temperature

Letg≜∂L/∂(MGELU(z))be the upstream gradient at this node. Then ∂L ∂z =g·M·GELU ′(z),(32) and therefore E[µ+ −µ] =−ηE g MGELU ′(z) (m⊤h+ 1) .(33) Splitting over z >0 and z <0 yields two contributions. Since GELU′(z) is close to 1 for z >0 and substantially smaller forz <0, the negative-side term is typically suppressed, giving the approximation E[µ+ −µ]≈ −...

work page 2023
[5]

Show how you would implement this feature

Explain one thing that can be done with text generated by the software. Show how you would implement this feature. Discuss the pros and cons of this feature. How long would it take to implement this feature?

work page
[6]

I went to the bank after work

Write a report/paper using text Model Output 1.3 d=6 Include the URL of the page where you found this information. This assignment may be used in any course in which students are required to write a report or some type of written expression. See the list of course assignments for more details on course requirements. You must obtain permission from your pr...

work page

[1] [1]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL https://aclantho logy.org/N19-1300/. Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. Dagli, R. and Shaikh, A. M. Cppe-5:...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1300 2018

[2] [2]

S table M o E : Stable routing strategy for mixture of experts

Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.489. URL https://acla nthology.org/2022.acl-long.489/. Dua, D., Wang, Y ., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Burstein, J., Doran, C., and Solorio, T. (eds.),Pro- ceedings of ...

work page doi:10.18653/v1/2022.acl-long.489 2022

[3] [3]

ISBN 9781713871088

Curran Associates Inc. ISBN 9781713871088. 11 Dynamic sparsity in tree-structured feed-forward layers at scale A. Additional Perplexity Results This appendix provides additional evaluation results. Table 6 reports perplexity (PPL) values for various OPT-125M parameter model configurations trained from scratch. Table 6.Per-run scalar PPL aggregated by stra...

work page arXiv 2016

[4] [4]

temperature

Letg≜∂L/∂(MGELU(z))be the upstream gradient at this node. Then ∂L ∂z =g·M·GELU ′(z),(32) and therefore E[µ+ −µ] =−ηE g MGELU ′(z) (m⊤h+ 1) .(33) Splitting over z >0 and z <0 yields two contributions. Since GELU′(z) is close to 1 for z >0 and substantially smaller forz <0, the negative-side term is typically suppressed, giving the approximation E[µ+ −µ]≈ −...

work page 2023

[5] [5]

Show how you would implement this feature

Explain one thing that can be done with text generated by the software. Show how you would implement this feature. Discuss the pros and cons of this feature. How long would it take to implement this feature?

work page

[6] [6]

I went to the bank after work

Write a report/paper using text Model Output 1.3 d=6 Include the URL of the page where you found this information. This assignment may be used in any course in which students are required to write a report or some type of written expression. See the list of course assignments for more details on course requirements. You must obtain permission from your pr...

work page