A Dual-Path Architecture for Scaling Compute and Capacity in LLMs

Behzad Shomali; Joachim Koehler; Markus Frey; Mehdi Ali

arxiv: 2605.30202 · v1 · pith:J3M4DZNInew · submitted 2026-05-28 · 💻 cs.CL


Pith reviewed 2026-06-29 07:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords dual-path architecture · looped transformers · per-token gates · parameter efficiency · compute scaling · language modeling · model routing


The pith

A dual-path layer in transformers combines repeated deep computation with one wide step, outperforming iso-FLOP models while using fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a dual-path block for language models that scales both the number of sequential operations and the parameters available per step. It does this by having a deep sublayer looped K times with shared weights and a wide sublayer with a larger feed-forward network used once, combined via per-token gates. This setup lets the model beat standard transformers on language modeling and downstream tasks at the same compute budget, and it does so with fewer total parameters. The gates turn out to be interpretable, routing function words and content to the wide path while sending punctuation and symbols to the deep path. This matters because it addresses the capacity shortfall of looped transformers without losing their parameter efficiency.

Core claim

The dual-path architecture surpasses iso-FLOP matched models on language modeling and downstream evaluations while using fewer parameters than the baseline at matched FLOPs, with learned gates showing systematic per-token allocation where function words and lexical content trend wide and punctuation, symbols, and arithmetic tokens trend deep.

What carries the argument

The dual-path block, which exposes compute and capacity as parallel pathways within a single layer: a deep sublayer re-applied K times with shared parameters and a wide sublayer with an enlarged feed-forward network applied once, combined by independent per-token gates.
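The block described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the deep sublayer is reduced to a residual feed-forward network (a real sublayer would likely include attention), the gates are assumed to be per-token sigmoids applied to each path's update, and every name (`dual_path_block`, `init_params`, `d_deep`, `d_wide`) is hypothetical.

```python
import numpy as np

def ffn(x, w_in, w_out):
    """Two-layer feed-forward network with a ReLU nonlinearity."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(rng, d_model, d_deep, d_wide, scale=0.02):
    """Hypothetical parameter layout: one small FFN, one large FFN, one gate matrix."""
    return {
        "deep_in": rng.normal(0.0, scale, (d_model, d_deep)),
        "deep_out": rng.normal(0.0, scale, (d_deep, d_model)),
        "wide_in": rng.normal(0.0, scale, (d_model, d_wide)),
        "wide_out": rng.normal(0.0, scale, (d_wide, d_model)),
        "gate": rng.normal(0.0, scale, (d_model, 2)),
    }

def dual_path_block(h, params, num_loops):
    """h has shape (tokens, d_model). Returns new hidden states and both gates."""
    # Deep path: the same small FFN is re-applied num_loops times (shared weights).
    deep = h
    for _ in range(num_loops):
        deep = deep + ffn(deep, params["deep_in"], params["deep_out"])
    delta_deep = deep - h

    # Wide path: a single application of an FFN with a larger hidden width.
    delta_wide = ffn(h, params["wide_in"], params["wide_out"])

    # Assumption: independent per-token gates in [0, 1]; being sigmoids,
    # the two gates are not forced to sum to one.
    gates = sigmoid(h @ params["gate"])
    g_deep, g_wide = gates[:, :1], gates[:, 1:]

    return h + g_deep * delta_deep + g_wide * delta_wide, g_deep, g_wide

rng = np.random.default_rng(0)
params = init_params(rng, d_model=8, d_deep=16, d_wide=64)
h = rng.normal(size=(5, 8))
out, g_deep, g_wide = dual_path_block(h, params, num_loops=3)
print(out.shape, g_deep.shape, g_wide.shape)
```

Using sigmoids rather than a softmax keeps the two gates independent, so a token can draw heavily on both paths or on neither; whether the paper normalizes the gates is not stated in the text available here.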

If this is right

Surpasses iso-FLOP matched models on language modeling evaluations.
Surpasses iso-FLOP matched models on downstream evaluations.
Uses fewer parameters than the baseline at matched FLOPs.
The learned gates provide direct interpretability of per-token routing decisions.
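The "fewer parameters at matched FLOPs" consequence follows from weight sharing, and a back-of-the-envelope count makes this concrete. The sketch below is an illustration only: it counts feed-forward weights alone (no attention, gates, or biases), charges two FLOPs per weight per application, and uses made-up widths that are not taken from the paper.

```python
def ffn_params(d_model, d_hidden):
    """Weights in a two-layer FFN (input and output projections, no biases)."""
    return 2 * d_model * d_hidden

def ffn_flops_per_token(d_model, d_hidden, applications=1):
    """Roughly two FLOPs (multiply and add) per weight, per application."""
    return 2 * ffn_params(d_model, d_hidden) * applications

D_MODEL = 1024  # hypothetical model width

# Baseline: one FFN with hidden width 4096, applied once.
baseline_params = ffn_params(D_MODEL, 4096)
baseline_flops = ffn_flops_per_token(D_MODEL, 4096)

# Dual path (hypothetical sizes): a narrow FFN looped K times plus a wider FFN used once.
K, D_DEEP, D_WIDE = 3, 512, 2560
dual_params = ffn_params(D_MODEL, D_DEEP) + ffn_params(D_MODEL, D_WIDE)
dual_flops = (ffn_flops_per_token(D_MODEL, D_DEEP, applications=K)
              + ffn_flops_per_token(D_MODEL, D_WIDE))

print("baseline:", baseline_params, "params,", baseline_flops, "FLOPs/token")
print("dual    :", dual_params, "params,", dual_flops, "FLOPs/token")
```

Under these assumptions the FLOP budgets match whenever K · d_deep + d_wide equals the baseline hidden width, while the parameter count scales with d_deep + d_wide, which is smaller for any K > 1.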

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such routing could be extended to decide compute allocation dynamically during inference for efficiency gains.
Patterns in token routing might generalize across different model scales or training data.
The approach could inspire similar dual axes in other architectures like vision or multimodal models.

Load-bearing premise

That the per-token gates can be trained stably to allocate between the deep and wide paths in a way that delivers net performance gains without the model defaulting to one path or requiring extra regularization that cancels the efficiency benefit.

What would settle it

Training runs where the gates fail to learn meaningful allocations, causing the model to perform no better than or worse than a standard transformer at the same FLOPs, would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.30202 by Behzad Shomali, Joachim Koehler, Markus Frey, Mehdi Ali.

**Figure 2.** Parameter scaling sweep of dual-path configurations compared against PUREWIDE and PURELOOP controls at matched FLOP budgets. The number of loops for PURELOOP (triangles) is K = 2, 3, 4 (from left to right). Note that PUREWIDE is equivalent to PURELOOP with K = 1 but without additional routing overhead.

**Figure 3.** Joint 2D density of routing gates and update contributions, split into layer bands (early, middle, late) and averaged across three Paloma datasets (wikitext_103, triviaqa, gsm8k). Row A plots the joint density of the raw gates selected by the router: deep gate g_d on the y-axis and wide gate g_w on the x-axis in [0, 1]^2. Row B plots the joint density of update contributions on a log-transformed scale: lo…

**Figure 4.** Routing share and update vector alignment across layers. Panel (a) shows the mean routing deep share per layer. Low values indicate a preference for the wide path, high values a preference for the deep path. Panel (b) shows the mean cosine similarity between the update vector from the deep path (∆d) and that from the wide path (∆w) at each layer. Low values indicate that the deep and wide paths process the same input diff…

**Figure 5.** Token-level deep share for GSM8K and TriviaQA. Blue denotes wide-leaning (prefers the capacity path)

**Figure 6.** Step-by-step token-level routing grid and task alignment. Panel (a) shows the layer-by-token heatmap of the deep share for a sequence from GSM8K (mathematical reasoning). Panel (b) shows the heatmap for a sequence from TriviaQA (factual knowledge). Panel (c) shows the aligned difference in preference (gsm8k − triviaqa) around the anchor token “Answer”, averaged over one thousand sequences per dataset. Eval…

**Figure 7.** Parts-of-speech (POS) routing characteristics and commitment. Panels (a, b, c) show heatmaps of the mean deep share per universal POS tag across layers for wikitext_103, triviaqa, and gsm8k, respectively, with tags sorted by the overall mean deep share. Panel (d) plots the average heatmap across the three datasets. Panel (e) shows boxplots of the deep share distribution per POS tag across all layers and da…

**Figure 8.** Layer-wise joint density of raw gates (g_w, g_d) across all layers (L0 to L15) and evaluation sources. The rows correspond to the three Paloma evaluation datasets (wikitext_103, triviaqa, and gsm8k), while the columns correspond to layer indices 0 to 15. The diagonal dashed line in each subplot represents equal routing preference (g_d = g_w). Lighter regions represent higher token concentration. The model has L = 1…
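Several of the captions above report a "deep share" and a cosine similarity between the two paths' update vectors. The captions as excerpted here do not define the deep share, so the sketch below assumes the normalized ratio g_d / (g_d + g_w); the function names and toy inputs are hypothetical and only illustrate how such quantities could be computed from logged gates and updates.

```python
import numpy as np

def deep_share(g_deep, g_wide, eps=1e-8):
    """Assumed definition: fraction of total gate mass assigned to the deep path."""
    return g_deep / (g_deep + g_wide + eps)

def mean_cosine(delta_deep, delta_wide, eps=1e-8):
    """Mean per-token cosine similarity between deep and wide update vectors."""
    dot = np.sum(delta_deep * delta_wide, axis=-1)
    norms = np.linalg.norm(delta_deep, axis=-1) * np.linalg.norm(delta_wide, axis=-1)
    return float(np.mean(dot / (norms + eps)))

# Toy gates for three tokens: deep-leaning, wide-leaning, and balanced.
g_deep = np.array([0.9, 0.2, 0.5])
g_wide = np.array([0.1, 0.8, 0.5])
share = deep_share(g_deep, g_wide)

# Toy updates for two tokens: orthogonal for the first, parallel for the second.
delta_deep = np.array([[1.0, 0.0], [0.0, 2.0]])
delta_wide = np.array([[0.0, 3.0], [0.0, 1.0]])
cos = mean_cosine(delta_deep, delta_wide)

print(share, cos)
```

Averaging `share` over tokens within a layer would give a per-layer curve of the kind Figure 4(a) describes, and grouping it by POS tag would give heatmaps of the kind Figure 7 describes, again assuming this definition matches the paper's.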

Original abstract

Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We propose a novel dual-path block that can flexibly scale compute, the number of sequential operations applied to a hidden state, and capacity, the parameters available at a single step. For this we expose both axes as parallel pathways within a single layer: a deep sublayer re-applied K times with shared parameters, and a wide sublayer with an enlarged feed-forward network applied once. Independent per-token gates combine both axes and allow detailed per-token routing analyses. We show that across two FLOP budgets, our dual-path model surpasses iso-FLOP matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs. The learned gates are directly interpretable and show systematic per-token allocation: function words and lexical content trend wide, while punctuation, symbols, and arithmetic tokens trend deep.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The dual-path block with per-token gates is a clean way to trade sequential compute against capacity inside one layer, but the performance claims rest on details that are not shown.


The main takeaway is that this paper describes a transformer block with two parallel paths: one that loops a shared sublayer K times for extra sequential compute, and one that uses a single enlarged feed-forward network for more parameters at that step, with learned per-token gates deciding the mix. The gates also let them analyze routing patterns, which is presented as a bonus.

What is actually new is the specific pairing of the looped deep path and the wide path inside the same block, plus the independent gates that support the per-token analysis. The abstract does a reasonable job explaining why pure looped models lose capacity at fixed FLOPs and how this dual setup tries to fix that without just adding more parameters overall.

The interpretability angle is a plus if it holds up. The reported trends, with function words and content going wide while punctuation and arithmetic tokens go deep, could be useful for understanding resource allocation.

The soft spots are straightforward. The central claim is that the model beats iso-FLOP baselines on language modeling and downstream tasks while using fewer parameters, yet the abstract supplies no numbers, no baseline details, no statistical tests, and no ablations on gate behavior. Without those, it is impossible to check whether the gates deliver stable net gains or whether they collapse to one path or require extra regularization that cancels the efficiency. The stress-test concern about gate stability is on target because the paper does not address it.

This is for researchers working on efficient LLM scaling and architecture variants. Someone already thinking about compute-capacity tradeoffs might find the block design worth trying, even if the results need verification.

It deserves peer review because the idea is coherent and the routing analysis could be valuable if the experiments are solid, but the current write-up is too light on evidence to judge without seeing the full details and ablations.

Referee Report

1 major / 0 minor

Summary. The paper proposes a dual-path transformer block that exposes compute (via a shared-parameter deep loop applied K times) and capacity (via a single enlarged FFN) as parallel pathways within one layer, combined by independent per-token gates. It claims that across two FLOP budgets the resulting model outperforms iso-FLOP-matched standard transformers on language-modeling and downstream tasks while using fewer parameters than the baseline, and that the learned gates exhibit systematic, interpretable per-token allocation (function words and lexical content routed wide; punctuation, symbols, and arithmetic tokens routed deep).

Significance. If the empirical claims hold, the architecture supplies a concrete mechanism for jointly scaling depth and width on a per-token basis, addressing the capacity deficit of pure looped transformers at fixed FLOPs. The direct interpretability of the gates is a genuine strength, offering a falsifiable window into routing behavior that could inform future routing designs. No machine-checked proofs or parameter-free derivations are present, but the per-token routing analysis constitutes a reproducible empirical contribution if the underlying training runs and gate statistics are released.

major comments (1)

[Abstract / Experiments] The central claim that the dual-path model delivers net gains at matched FLOPs with fewer parameters rests on the assumption that the per-token gates allocate tokens stably between the deep loop and wide FFN without collapse or the need for auxiliary losses whose cost offsets the efficiency benefit. The abstract reports systematic trends but supplies no ablations on gate entropy, path-collapse frequency, or regularization variants; without these, it is impossible to verify that the observed gains are not an artifact of the training procedure defaulting to one path.
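The ablations requested here (gate entropy and path-collapse frequency) could be computed from logged gates roughly as follows. This is a hedged sketch, not a procedure from the paper: it assumes each token's routing is summarized by a deep share in [0, 1], and the 0.05 collapse margin and all function names are arbitrary choices made for illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a two-way split with deep share p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def layer_diagnostics(deep_shares, collapse_margin=0.05):
    """Summarize one layer's routing from its per-token deep shares.

    A layer is flagged as collapsed when its mean deep share sits within
    collapse_margin of 0 or 1, meaning almost all tokens use a single path.
    """
    n = len(deep_shares)
    mean_share = sum(deep_shares) / n
    mean_entropy = sum(binary_entropy(p) for p in deep_shares) / n
    collapsed = mean_share < collapse_margin or mean_share > 1.0 - collapse_margin
    return {"mean_share": mean_share, "mean_entropy": mean_entropy, "collapsed": collapsed}

def collapse_frequency(layers, collapse_margin=0.05):
    """Fraction of layers flagged as collapsed."""
    flags = [layer_diagnostics(s, collapse_margin)["collapsed"] for s in layers]
    return sum(flags) / len(flags)

healthy = layer_diagnostics([0.1, 0.9, 0.5, 0.7])        # tokens split across both paths
degenerate = layer_diagnostics([0.01, 0.02, 0.0, 0.03])  # nearly everything routed wide
freq = collapse_frequency([[0.1, 0.9, 0.5, 0.7], [0.01, 0.02, 0.0, 0.03]])

print(healthy, degenerate, freq)
```

The two statistics answer different questions: a low per-token entropy with a balanced mean share indicates decisive but diverse routing, whereas a mean share near 0 or 1 indicates the single-path defaulting that the comment is concerned about.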

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below.


Referee: [Abstract / Experiments] The central claim that the dual-path model delivers net gains at matched FLOPs with fewer parameters rests on the assumption that the per-token gates allocate tokens stably between the deep loop and wide FFN without collapse or the need for auxiliary losses whose cost offsets the efficiency benefit. The abstract reports systematic trends but supplies no ablations on gate entropy, path-collapse frequency, or regularization variants; without these, it is impossible to verify that the observed gains are not an artifact of the training procedure defaulting to one path.

Authors: The manuscript already supplies direct evidence against path collapse via the per-token routing analysis, which shows consistent, token-type-specific allocation (function words and lexical content routed wide; punctuation, symbols, and arithmetic tokens routed deep). These systematic, interpretable patterns are incompatible with uniform defaulting to a single path. We agree, however, that the paper does not contain explicit ablations on gate entropy, collapse frequency across runs, or regularization variants. We will add these analyses in the revised version to further substantiate training stability.

Revision: yes

Circularity Check

0 steps flagged

No derivation chain; purely empirical claims


The paper reports experimental results on language modeling and downstream tasks for a dual-path architecture, with no equations, first-principles derivations, or predictions that could reduce to inputs by construction. Claims rest on observed performance at matched FLOPs and gate interpretability trends, which are falsifiable via external benchmarks rather than self-referential. No self-citation load-bearing steps or fitted inputs presented as predictions appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or newly postulated entities.



