Recognition: unknown
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
Pith reviewed 2026-05-08 12:17 UTC · model grok-4.3
The pith
HubRouter replaces full quadratic attention with O(nM) hub-mediated routing using a small set of learned hubs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HubRouter implements an encode-decode-score-council pipeline in which M learned hub tokens cross-attend to the full sequence, tokens project to produce routing fingerprints against those hubs, a score head selects a top-k council, and sparse attention occurs only within the council. When inserted into Jamba-style hybrids this yields a nominal 4.2 percent perplexity improvement and up to roughly 90x training throughput at length 1024; graduated replacement of 25 percent of Transformer attention layers produces the best perplexity under matched budgets; and a strictly causal variant achieves 211.5 perplexity after a council-causal fix that removes a bidirectional leak. A sweep across hub sizes (roughly 105 runs over M = 1 to 32) identifies M = 8-14 as the reliably converging band, with orthogonal regularization rescuing M = 6 and seed sensitivity growing for M of 20 and above.
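Read as pseudocode, one layer of this pipeline could look like the minimal PyTorch sketch below. The module name HubRouterSketch, the projection layers, and the defaults M = 12 and k = 64 are illustrative assumptions rather than the paper's implementation, and this version selects one shared council with no causal masking or chunking, so it corresponds to the bidirectional setting rather than Hub-GPT.

```python
# Minimal sketch of an encode-decode-score-council layer (assumed shapes and names;
# hypothetical defaults M=12, k=64; no causality or chunking).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HubRouterSketch(nn.Module):
    def __init__(self, d_model: int, n_hubs: int = 12, k: int = 64):
        super().__init__()
        self.hubs = nn.Parameter(0.02 * torch.randn(n_hubs, d_model))  # M learned hub tokens
        self.hub_q = nn.Linear(d_model, d_model)     # encode: hubs query the sequence
        self.seq_k = nn.Linear(d_model, d_model)
        self.tok_proj = nn.Linear(d_model, d_model)  # decode: tokens project against hub summaries
        self.score = nn.Linear(n_hubs, 1)            # score head over routing fingerprints
        self.council_attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n, d_model)
        b, n, d = x.shape
        # Encode: M hubs cross-attend to all n tokens -> hub summaries, O(n*M)
        hub_q = self.hub_q(self.hubs).expand(b, -1, -1)                        # (b, M, d)
        hub_summary = F.scaled_dot_product_attention(hub_q, self.seq_k(x), x)  # (b, M, d)
        # Decode: per-token routing fingerprints against the hub summaries, O(n*M)
        fingerprints = torch.einsum("bnd,bmd->bnm", self.tok_proj(x), hub_summary)
        # Score: a scalar relevance per token, then a top-k council
        scores = self.score(fingerprints).squeeze(-1)                          # (b, n)
        k = min(self.k, n)
        idx = scores.topk(k, dim=-1).indices                                   # (b, k)
        council = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, d))
        # Council: sparse attention from every token into the selected subset, O(n*k)
        out, _ = self.council_attn(x, council, council)
        return out
```

With M and k held fixed, each stage of the sketch scales linearly in sequence length n, which is where the O(nM) characterization comes from.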
What carries the argument
The encode-decode-score-council pipeline driven by M learned hub tokens that cross-attend to produce compact routing fingerprints and select a sparse top-k council for attention.
If this is right
- Hybrid architectures can be trained at substantially higher throughput with only small or no perplexity penalty.
- Replacing only a fraction of attention layers can outperform both full attention and heavier replacement under the same compute budget.
- Hub counts in the 8-14 range converge reliably across random seeds, with orthogonal regularization able to stabilize smaller counts (a generic form of such a penalty is sketched after this list).
- Once the causal council fix is applied, performance becomes insensitive to chunk size and the routing behaves as intended without leaks.
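A generic form of the orthogonal regularization mentioned in the third bullet is sketched below; the unit normalization, the Frobenius-norm form, and the 1e-3 weight are assumptions for illustration, not the paper's stated loss term.

```python
import torch
import torch.nn.functional as F

def hub_orthogonality_penalty(hubs: torch.Tensor) -> torch.Tensor:
    """Frobenius-norm penalty pushing an (M, d) matrix of hub embeddings toward
    mutually orthogonal directions: || H_hat @ H_hat^T - I ||_F^2."""
    h = F.normalize(hubs, dim=-1)                  # unit-normalize each hub vector
    gram = h @ h.t()                               # (M, M) pairwise cosine similarities
    eye = torch.eye(gram.shape[0], device=gram.device, dtype=gram.dtype)
    return ((gram - eye) ** 2).sum()

# Hypothetical usage during training:
# loss = lm_loss + 1e-3 * hub_orthogonality_penalty(model.hubs)
```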
Where Pith is reading between the lines
- The pluggable design implies that HubRouter could be inserted into other attention-heavy pipelines beyond the two architectures tested, provided the task does not require uniform access to every token.
- If the information-preservation assumption holds at scale, the same hub mechanism could be combined with existing length-extrapolation techniques to push practical context windows further.
- The companion diagnostic task referenced in the paper offers a direct way to measure whether a given hub count is sufficient for a new domain before full training.
Load-bearing premise
That the learned hubs together with top-k council selection can preserve enough of the information that full attention would have captured for the model's predictions.
What would settle it
A controlled experiment on a long-range dependency task at increasing sequence lengths that directly compares next-token accuracy of HubRouter against an otherwise identical full-attention model; a widening gap would falsify the claim that the sparse routing is information-preserving.
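A minimal harness for that comparison could look like the sketch below; make_longrange_batches, hub_model, and full_attn_model are hypothetical stand-ins for the diagnostic task and the two otherwise-identical models.

```python
import torch

@torch.no_grad()
def next_token_accuracy(model, batches) -> float:
    """Mean next-token accuracy over an iterable of (input_ids, target_ids) batches."""
    correct, total = 0, 0
    for input_ids, target_ids in batches:
        logits = model(input_ids)              # (batch, seq, vocab)
        preds = logits.argmax(dim=-1)
        correct += (preds == target_ids).sum().item()
        total += target_ids.numel()
    return correct / max(total, 1)

# for n in (1024, 2048, 4096, 8192):           # increasing sequence lengths
#     batches = make_longrange_batches(seq_len=n)
#     gap = next_token_accuracy(full_attn_model, batches) - next_token_accuracy(hub_model, batches)
#     print(n, gap)  # a gap that widens with n would falsify information preservation
```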
Original abstract
We introduce HubRouter, a pluggable module that replaces O(n^2) attention layers with O(nM) hub-mediated routing, where M << n is a small number of learned hub tokens. We demonstrate it in two from-scratch architectures: a Jamba-style hybrid and a 12-layer Transformer; retrofit into pretrained models is a tested negative case. HubRouter implements an encode-decode-score-council pipeline: M learned hubs cross-attend to all tokens, tokens project against hubs for routing fingerprints, a score head selects top-k tokens, and a sparse council attends only to the selected subset. We validate HubRouter in three settings. (1) Hub-Jamba yields a nominal 4.2% PPL improvement (200.2 vs 209.0, single seed; possibly within seed noise) and up to ~90x training throughput at sequence length 1024 in matched PyTorch-native baselines; an optimised baseline would narrow this to ~10-15x. (2) Graduated replacement of 25% of Transformer attention layers gives the best perplexity in our matched-budget sweep (268.0 vs 282.4 pure Transformer). (3) Hub-GPT provides strictly causal routing, achieving PPL 211.5 +/- 0.4 over 3 seeds (post council-causal fix); approximately 3 PPL worse than Jamba's 208.5 +/- 0.7, a measurable quality cost for avoiding O(n^2) computation. Post-fix, chunk size C has little effect; the pre-fix chunk-size benefit was an artifact of a bidirectional-council leak we found in adversarial review. A multi-seed hub-count sweep (~105 runs across M=1-32) reveals M=8-14 as the reliably-converging sub-band (4-5/5 seeds); M=6 is rescued to 5/5 by orthogonal regularization, while M>=20 shows increasing seed sensitivity. Companion paper arXiv:2603.20997 (Basu, 2026) defines the routing diagnostic task. Code and scripts will be released.
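As a rough sanity check on the scaling claim, the per-layer count of pairwise score computations can be compared directly; the council size k below is an assumed value (the abstract does not state it), and the 90x-versus-10-15x spread above shows that realized throughput depends on baseline kernel quality as much as on this count.

```python
# Back-of-the-envelope score-computation count per layer at n = 1024,
# with M = 12 hubs and an assumed council size k = 64.
n, M, k = 1024, 12, 64
full_attention = n * n           # 1,048,576 query-key scores
hub_routing = n * M + n * k      # hub encode/decode plus attention into the council
print(full_attention / hub_routing)  # ~13.5x fewer scores; wall-clock gains differ
```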
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HubRouter, a pluggable module replacing O(n²) attention layers with O(nM) hub-mediated routing (M << n learned hub tokens) via an encode-decode-score-council pipeline. It evaluates the approach in a Jamba-style hybrid model and a 12-layer Transformer, reporting a nominal 4.2% PPL improvement in the hybrid (single seed), best perplexity with 25% graduated replacement, and up to ~90x training throughput gains; a strictly causal Hub-GPT variant is also tested, along with multi-seed hub-count sweeps and a post-hoc fix for a discovered causal leak in the council mechanism.
Significance. If the central claim holds—that learned hubs with top-k council selection can substitute for full attention while preserving modeling capacity—this would provide a useful architectural primitive for efficient hybrid sequence models. The pluggable design, planned code release, and multi-seed analysis for the M parameter (showing reliable convergence at M=8-14) are strengths. However, the evidence remains preliminary given the single-seed flagship result and implementation sensitivities.
major comments (3)
- [Abstract] The flagship claim of a 4.2% PPL improvement (200.2 vs 209.0) for Hub-Jamba is based on a single seed and explicitly noted as possibly within noise; this directly undercuts the load-bearing assertion that the encode-decode-score-council pipeline substitutes for full attention without degrading information flow.
- [Experiments: Hub-Jamba and causal-fix discussion] The post-hoc discovery of a bidirectional council leak (which altered pre-fix chunk-size conclusions) indicates that reported perplexity can be sensitive to subtle implementation details of the sparse council; while the fix is applied, this raises questions about whether post-fix results isolate true routing quality.
- [Hub-GPT evaluation] The strictly causal variant achieves PPL 211.5 +/- 0.4 (3 seeds), ~3 PPL worse than the Jamba baseline (208.5 +/- 0.7); this measurable quality cost for avoiding O(n²) computation needs explicit analysis against the substitution claim.
minor comments (2)
- [Method] The notation for hub count M and council size k could be more consistently defined across sections and figures to aid reproducibility.
- [Throughput experiments] Baselines for throughput (PyTorch-native vs optimized) are mentioned but lack a clear table comparing exact configurations and hardware.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address the major comments point-by-point below, with planned revisions to strengthen the evidence for our claims.
Point-by-point responses
- Referee: The flagship claim of a 4.2% PPL improvement (200.2 vs 209.0) for Hub-Jamba is based on a single seed and explicitly noted as possibly within noise; this directly undercuts the load-bearing assertion that the encode-decode-score-council pipeline substitutes for full attention without degrading information flow.
  Authors: We recognize the limitation of the single-seed result for the flagship Hub-Jamba experiment. Although we already note in the manuscript that it may be within seed noise, we will perform additional runs with multiple random seeds for this configuration in the revision. This will allow us to report statistics and better substantiate the claim. The substitution assertion is supported by the overall experimental suite, including the graduated-replacement results, where partial substitution yields the best perplexity, and the multi-seed hub-count analysis showing reliable convergence for appropriate M values. revision: yes
- Referee: The post-hoc discovery of a bidirectional council leak (which altered pre-fix chunk-size conclusions) indicates that reported perplexity can be sensitive to subtle implementation details of the sparse council; while the fix is applied, this raises questions about whether post-fix results isolate true routing quality.
  Authors: The identification of the council leak was indeed post-hoc, and we appreciate the referee highlighting the potential sensitivity. We have implemented the causal fix and observed that chunk-size effects disappear post-fix, indicating stability. To demonstrate that the results isolate routing quality, we will add a dedicated subsection in the experiments detailing the leak, the fix, and pre/post-fix comparisons, along with further ablations on the council selection process (an illustrative causal-council mask is sketched after these responses). revision: yes
- Referee: The strictly causal variant achieves PPL 211.5 +/- 0.4 (3 seeds), ~3 PPL worse than the Jamba baseline (208.5 +/- 0.7); this measurable quality cost for avoiding O(n²) computation needs explicit analysis against the substitution claim.
  Authors: We agree that the quality cost in the strictly causal Hub-GPT setting requires more explicit analysis. In the revised manuscript, we will include a discussion comparing this degradation to the computational savings and to similar trade-offs in other efficient attention mechanisms. The substitution claim is contextualized as providing a pluggable alternative for hybrid models, where the hybrid Hub-Jamba shows no degradation (and a nominal improvement), while the pure causal variant incurs a cost that we now analyze more thoroughly. revision: yes
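For the council-leak discussion above, the sketch below shows one illustrative way to enforce strict causality over a selected council: a query may only attend to council members whose original position does not exceed its own. This is a generic mask, not the authors' chunked implementation, and the tensor names are assumptions.

```python
import torch

def causal_council_mask(council_positions: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Boolean (seq_len, k) mask: True where attention from query position i to a
    council member is allowed, i.e. the member's original position is <= i."""
    query_pos = torch.arange(seq_len, device=council_positions.device).unsqueeze(1)  # (seq_len, 1)
    return council_positions.unsqueeze(0) <= query_pos                               # (seq_len, k)

# Hypothetical use, where `scores` holds (seq_len, k) query-to-council logits:
# allowed = causal_council_mask(council_positions, seq_len)
# scores = scores.masked_fill(~allowed, float("-inf"))  # future councillors contribute nothing
```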
Circularity Check
Empirical architecture paper with one non-load-bearing self-citation
full rationale
The manuscript introduces HubRouter as a pluggable architectural module and validates it solely through direct training-run measurements of perplexity and throughput. No first-principles derivations, predictions, or uniqueness theorems are presented that could reduce to fitted parameters or prior self-citations by construction. The sole self-citation (to the companion paper defining the routing diagnostic task) supports an auxiliary diagnostic rather than any central claim. All quantitative results are reported as observed outcomes from matched-budget sweeps and multi-seed runs, not as quantities defined in terms of the routing mechanism itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- M (hub count)
invented entities (1)
- learned hub tokens (no independent evidence)
Reference graph
Works this paper leans on
- [1] A. Basu. When does content-based routing work? Representation requirements for selective attention in hybrid sequence models. arXiv preprint arXiv:2603.20997, 2026.
- [2] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Conference on Language Modeling (COLM), 2024. arXiv:2312.00752.
- [3] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, S. Biderman, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV, X. He, H. Hou, J. Lin, P. Kazienko, J. Kocon, J. Kong, B. Koptyra, H. Lau, K. S. I. Mantri, F. Mom, A. Saito, G. Song, X. Tang, B. Wang, J. S. Wind, S. Wozniak, R. Zhang, Z. Zhang, Q. Zhao, P. Zhou, Q. Zhou, J. Zhu, R.-J. Zhu. RWKV: Reinventing RNNs for the Transformer era, 2023.
- [4] O. Lieber, B. Lenz, H. Bata, G. Cohen, J. Osin, I. Dalmedigos, E. Safahi, S. Meirom, Y. Belinkov, S. Shalev-Shwartz, O. Abend, R. Alon, T. Asida, A. Bergman, R. Glozman, M. Gokhman, A. Manevich, N. Ratner, N. Rozen, E. Shwartz, M. Zusman, Y. Shoham. Jamba: A hybrid Transformer-Mamba language model. arXiv preprint arXiv:2403.19887, 2024.
- [5] S. De, S. L. Smith, A. Fernando, A. Botev, G. Cristian-Muraru, A. Gu, R. Haroun, L. Berrada, Y. Chen, S. Srinivasan, G. Desjardins, A. Doucet, D. Budden, Y. W. Teh, R. Pascanu, N. De Freitas, C. Gulcehre. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
- [6] P. Glorioso, Q. Anthony, Y. Tokpanov, J. Whittington, J. Pilault, A. Ibrahim, B. Millidge. Zamba: A compact 7B SSM hybrid model. arXiv preprint arXiv:2405.16712, 2024.
- [7] Together Research. Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers. Together AI blog, December 8, 2023. https://www.together.ai/blog/stripedhyena-7b
- [8] T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024. arXiv:2405.21060.
- [9] A. Lahoti, K. Y. Li, B. Chen, C. Wang, A. Bick, J. Z. Kolter, T. Dao, A. Gu. Mamba-3: Improved sequence modeling using state space principles. In International Conference on Learning Representations (ICLR), 2026.
- [10] R. Child, S. Gray, A. Radford, I. Sutskever. Generating long sequences with sparse Transformers. arXiv preprint arXiv:1904.10509, 2019.
- [11] I. Beltagy, M. E. Peters, A. Cohan. Longformer: The long-document Transformer. arXiv preprint arXiv:2004.05150, 2020.
- [12] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS), 2020. arXiv:2007.14062.
- [13] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, A. Weller. Rethinking attention with Performers. In International Conference on Learning Representations (ICLR), 2021. arXiv:2009.14794.
- [14] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020. arXiv:2006.16236.
- [15] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2205.14135.
- [16] T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- [17] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, J. Carreira. Perceiver: General perception with iterative attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021. arXiv:2103.03206.
- [18]
- [19] W. Fedus, B. Zoph, N. Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–40, 2022. arXiv:2101.03961.
- [20] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. Le Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, W. El Sayed. Mixtral of Experts. arXiv preprint, 2024.
- [21] S. Wang, B. Z. Li, M. Khabsa, H. Fang, H. Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
- [22]
- [23] N. Kitaev, L. Kaiser, A. Levskaya. Reformer: The efficient Transformer. In International Conference on Learning Representations (ICLR), 2020. arXiv:2001.04451.
- [24] A. Roy, M. Saffar, A. Vaswani, D. Grangier. Efficient content-based sparse attention with routing Transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021. doi:10.1162/tacl_a_00353.
- [25] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus, Y. Bengio, S. Ermon, C. Ré. Hyena hierarchy: Towards larger convolutional language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
- [26] S. Arora, S. Eyuboglu, A. Timalsina, I. Johnson, M. Poli, J. Zou, A. Rudra, C. Ré. Zoology: Measuring and improving recall in efficient language models. arXiv preprint arXiv:2312.04927, 2023.
discussion (0)