pith. machine review for the scientific record.

arxiv: 2604.22442 · v1 · submitted 2026-04-24 · 💻 cs.LG · cs.NE

Recognition: unknown

HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models


Pith reviewed 2026-05-08 12:17 UTC · model grok-4.3

classification 💻 cs.LG cs.NE
keywords HubRouter · sub-quadratic routing · hybrid sequence models · hub tokens · sparse attention · perplexity · training throughput · Jamba-style models

The pith

HubRouter replaces full quadratic attention with O(nM) hub-mediated routing using a small set of learned hubs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HubRouter as a pluggable module that substitutes the O(n²) attention layers common in sequence models with a cheaper O(nM) mechanism, where the number of learned hub tokens M is much smaller than the sequence length n. It does this through an encode-decode-score-council pipeline in which hubs first cross-attend to all tokens, tokens then form routing fingerprints against the hubs, a score head picks the top-k tokens, and a sparse council attends only to that subset. Experiments in a Jamba-style hybrid and a pure Transformer show that this substitution can maintain or slightly improve perplexity while delivering large training speedups, and that partial replacement of attention layers can be optimal under fixed compute budgets. A reader would care because quadratic attention remains a central scaling barrier for longer sequences, and a drop-in primitive that relaxes it without full redesign could extend practical context lengths.
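
To make the complexity claim concrete, a back-of-the-envelope count of attention-score entries per layer shows why O(nM) wins; the hub count M = 12 and council size k = 64 below are illustrative values drawn from the paper's reported ranges, not its exact configuration.

```python
# Illustrative operation counts: full attention vs. hub-mediated routing.
# Constants, attention heads, and the feature dimension are ignored;
# M and k are assumed example values, not the paper's exact settings.

def attn_entries(n: int) -> int:
    """Pairwise score entries for full attention: O(n^2)."""
    return n * n

def hub_entries(n: int, M: int, k: int) -> int:
    """Encode and decode are each O(nM); the council adds O(k^2)."""
    return 2 * n * M + k * k

n, M, k = 1024, 12, 64   # sequence length matches the throughput experiment
ratio = attn_entries(n) / hub_entries(n, M, k)
print(f"{ratio:.1f}x fewer score entries")
```

At these settings the ratio is roughly 37x; actual wall-clock speedups depend on kernels and constants, which is exactly the caveat the paper itself raises about its PyTorch-native baselines.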

Core claim

HubRouter implements an encode-decode-score-council pipeline in which M learned hub tokens cross-attend to the full sequence, tokens project to produce routing fingerprints against those hubs, a score head selects a top-k council, and sparse attention occurs only within the council. When inserted into Jamba-style hybrids this yields a nominal 4.2 percent perplexity improvement and up to roughly 90x training throughput at length 1024; graduated replacement of 25 percent of Transformer attention layers produces the best perplexity under matched budgets; and a strictly causal variant achieves 211.5 perplexity after a council-causal fix that removes a bidirectional leak. A sweep across hub sizes (roughly 105 runs over M=1-32) identifies M=8-14 as the reliably converging band, with orthogonal regularization rescuing M=6 and seed sensitivity increasing at M>=20.

What carries the argument

The encode-decode-score-council pipeline driven by M learned hub tokens that cross-attend to produce compact routing fingerprints and select a sparse top-k council for attention.
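
The four stages can be sketched in NumPy as a reading aid; this is a minimal single-head reconstruction from the description above (the score head, gating, and causal masking are simplified stand-ins, and all shapes are assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hub_route(tokens, hubs, k):
    """One HubRouter step: encode -> decode -> score -> council.
    tokens: (n, d) sequence states; hubs: (M, d) learned hub tokens."""
    # Encode: hubs cross-attend to all tokens, O(nM).
    hub_summary = softmax(hubs @ tokens.T) @ tokens        # (M, d)
    # Decode: tokens project against hubs to form routing fingerprints, O(nM).
    fingerprints = tokens @ hub_summary.T                  # (n, M)
    # Score: a stand-in score head (mean over hub dims) picks the top-k council.
    scores = fingerprints.mean(axis=-1)                    # (n,)
    council = np.argsort(-scores)[:k]                      # selected token indices
    # Council: sparse attention only within the selected subset, O(k^2).
    subset = tokens[council]
    out = softmax(subset @ subset.T) @ subset              # (k, d)
    return council, out
```

A learned gate would then fuse `out` back into the residual stream at the council positions; everything else in the layer stays untouched, which is what makes the module pluggable.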

If this is right

  • Hybrid architectures can be trained at substantially higher throughput with only small or no perplexity penalty.
  • Replacing only a fraction of attention layers can outperform both full attention and heavier replacement under the same compute budget.
  • Hub counts in the 8-14 range converge reliably across random seeds, with orthogonal regularization able to stabilize smaller counts.
  • Once the causal council fix is applied, performance becomes insensitive to chunk size and the routing behaves as intended without leaks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The pluggable design implies that HubRouter could be inserted into other attention-heavy pipelines beyond the two architectures tested, provided the task does not require uniform access to every token.
  • If the information-preservation assumption holds at scale, the same hub mechanism could be combined with existing length-extrapolation techniques to push practical context windows further.
  • The companion diagnostic task referenced in the paper offers a direct way to measure whether a given hub count is sufficient for a new domain before full training.

Load-bearing premise

That the learned hubs together with top-k council selection can preserve enough of the information that full attention would have captured for the model's predictions.

What would settle it

A controlled experiment on a long-range dependency task at increasing sequence lengths that directly compares next-token accuracy of HubRouter against an otherwise identical full-attention model; a widening gap would falsify the claim that the sparse routing is information-preserving.
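
A minimal harness for that experiment might look like the following; the key-value recall task is a hypothetical stand-in for the paper's companion diagnostic, and `predict` is whatever wrapper maps a token sequence to a model's next-token guess:

```python
import random

def make_recall_example(length, vocab=100, seed=None):
    """Build a (key, value) pair sequence plus a query for the OLDEST key,
    so the target sits maximally far from the query: a long-range probe."""
    rng = random.Random(seed)
    keys = rng.sample(range(vocab), length // 2)
    vals = [rng.randrange(vocab) for _ in keys]
    seq = [tok for pair in zip(keys, vals) for tok in pair]
    return seq + [keys[0]], vals[0]   # (input sequence, expected next token)

def accuracy_by_length(predict, lengths, trials=50):
    """Next-token accuracy per sequence length for one model; comparing this
    curve for HubRouter against a full-attention twin is the proposed test."""
    results = {}
    for n in lengths:
        hits = sum(predict(seq) == target
                   for seq, target in (make_recall_example(n, seed=i)
                                       for i in range(trials)))
        results[n] = hits / trials
    return results
```

An oracle that simply looks up the queried key scores 1.0 at every length; a gap between the two models' accuracy curves that widens as `lengths` grows would be the falsifying signal that sparse routing is not information-preserving.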

Figures

Figures reproduced from arXiv: 2604.22442 by Abhinaba Basu.

Figure 1. HubRouter pipeline. Input tokens are compressed through M learned hubs (encode, O(nM)), tokens project against hubs for routing fingerprints (decode, O(nM)), a score head selects top-k tokens, and a council applies sparse attention only to the selected subset (O(k²)). A learned gate fuses council output back into the residual stream. Key insight: routing decisions are made through M-dimensional hub space…
Figure 2. Throughput scaling (matched PyTorch-native baselines). (a) Hub-Jamba maintains constant throughput as sequence length grows (sub-quadratic), while Jamba degrades due to quadratic attention. (b) Speedup increases with sequence length, reaching ∼183× at seq=2048. Caveat (Section 7): the Jamba baseline does not use FlashAttention or production Mamba CUDA kernels.
Figure 3. Hub-GPT chunk size sweep (post council-causal fix). C=1 (zero leakage) achieves 211.5±0.4 over 3 seeds; C=64 achieves 211.9±0.3. Chunk size no longer changes PPL meaningfully once the council is causally masked (pre-fix results showed a 3-4 PPL chunk-size benefit that turns out to have been the bidirectional council leaking future-token information into later tokens). The true gap to Jamba (208.5±0.7) is ≈…
Figure 4. Graduated replacement curve. 25% replacement is the sweet spot (PPL 268.0), beating both pure Transformer (282.4) and Mamba (278.3). Quality degrades gracefully with increasing replacement.
Figure 5. M-sweep heatmap. Routing precision (%) across 5 seeds for each hub count M (rows) by seed (columns). (a) Without ortho: M=8–14 is the most robust sub-band (highest mean, lowest seed variance). (b) With ortho: rescues M=6 failures but introduces new failures at M=20. M=16 is the default from prior work [1] (cited, not re-run here).
Original abstract

We introduce HubRouter, a pluggable module that replaces O(n^2) attention layers with O(nM) hub-mediated routing, where M << n is a small number of learned hub tokens. We demonstrate it in two from-scratch architectures: a Jamba-style hybrid and a 12-layer Transformer; retrofit into pretrained models is a tested negative case. HubRouter implements an encode-decode-score-council pipeline: M learned hubs cross-attend to all tokens, tokens project against hubs for routing fingerprints, a score head selects top-k tokens, and a sparse council attends only to the selected subset. We validate HubRouter in three settings. (1) Hub-Jamba yields a nominal 4.2% PPL improvement (200.2 vs 209.0, single seed; possibly within seed noise) and up to ~90x training throughput at sequence length 1024 in matched PyTorch-native baselines; an optimised baseline would narrow this to ~10-15x. (2) Graduated replacement of 25% of Transformer attention layers gives the best perplexity in our matched-budget sweep (268.0 vs 282.4 pure Transformer). (3) Hub-GPT provides strictly causal routing, achieving PPL 211.5 +/- 0.4 over 3 seeds (post council-causal fix); approximately 3 PPL worse than Jamba's 208.5 +/- 0.7, a measurable quality cost for avoiding O(n^2) computation. Post-fix, chunk size C has little effect; the pre-fix chunk-size benefit was an artifact of a bidirectional-council leak we found in adversarial review. A multi-seed hub-count sweep (~105 runs across M=1-32) reveals M=8-14 as the reliably-converging sub-band (4-5/5 seeds); M=6 is rescued to 5/5 by orthogonal regularization, while M>=20 shows increasing seed sensitivity. Companion paper arXiv:2603.20997 (Basu, 2026) defines the routing diagnostic task. Code and scripts will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces HubRouter, a pluggable module replacing O(n²) attention layers with O(nM) hub-mediated routing (M << n learned hub tokens) via an encode-decode-score-council pipeline. It evaluates the approach in a Jamba-style hybrid model and a 12-layer Transformer, reporting a nominal 4.2% PPL improvement in the hybrid (single seed), best perplexity with 25% graduated replacement, and up to ~90x training throughput gains; a strictly causal Hub-GPT variant is also tested, along with multi-seed hub-count sweeps and a post-hoc fix for a discovered causal leak in the council mechanism.

Significance. If the central claim holds—that learned hubs with top-k council selection can substitute for full attention while preserving modeling capacity—this would provide a useful architectural primitive for efficient hybrid sequence models. The pluggable design, planned code release, and multi-seed analysis for the M parameter (showing reliable convergence at M=8-14) are strengths. However, the evidence remains preliminary given the single-seed flagship result and implementation sensitivities.

major comments (3)
  1. [Abstract] The flagship claim of a 4.2% PPL improvement (200.2 vs 209.0) for Hub-Jamba is based on a single seed and is explicitly noted as possibly within noise; this directly undercuts the load-bearing assertion that the encode-decode-score-council pipeline substitutes for full attention without degrading information flow.
  2. [Experiments: Hub-Jamba and causal-fix discussion] The post-hoc discovery of a bidirectional council leak (which altered pre-fix chunk-size conclusions) indicates that reported perplexity can be sensitive to subtle implementation details of the sparse council; while the fix is applied, this raises questions about whether post-fix results isolate true routing quality.
  3. [Hub-GPT evaluation] The strictly causal variant achieves PPL 211.5 +/- 0.4 (3 seeds), ~3 PPL worse than the Jamba baseline (208.5 +/- 0.7); this measurable quality cost for avoiding O(n²) computation needs explicit analysis against the substitution claim.
minor comments (2)
  1. [Method] The notation for hub count M and council size k could be more consistently defined across sections and figures to aid reproducibility.
  2. [Throughput experiments] Baselines for throughput (PyTorch-native vs optimized) are mentioned but lack a clear table comparing exact configurations and hardware.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address the major comments point-by-point below, with planned revisions to strengthen the evidence for our claims.

Point-by-point responses
  1. Referee: The flagship claim of a 4.2% PPL improvement (200.2 vs 209.0) for Hub-Jamba is based on a single seed and explicitly noted as possibly within noise; this directly undercuts the load-bearing assertion that the encode-decode-score-council pipeline substitutes for full attention without degrading information flow.

    Authors: We recognize the limitation of the single-seed result for the flagship Hub-Jamba experiment. Although we already note in the manuscript that it may be within seed noise, we will perform additional runs with multiple random seeds for this configuration in the revision. This will allow us to report statistics and better substantiate the claim. The substitution assertion is supported by the overall experimental suite, including the graduated replacement results where partial substitution yields the best perplexity, and the multi-seed hub-count analysis showing reliable convergence for appropriate M values. revision: yes

  2. Referee: The post-hoc discovery of a bidirectional council leak (which altered pre-fix chunk-size conclusions) indicates that reported perplexity can be sensitive to subtle implementation details of the sparse council; while the fix is applied, this raises questions about whether post-fix results isolate true routing quality.

    Authors: The identification of the council leak was indeed post-hoc, and we appreciate the referee highlighting the potential sensitivity. We have implemented the causal fix and observed that chunk-size effects disappear post-fix, indicating stability. To demonstrate that the results isolate routing quality, we will add a dedicated subsection in the experiments detailing the leak, the fix, and pre/post-fix comparisons, along with further ablations on the council selection process. revision: yes

  3. Referee: The strictly causal variant achieves PPL 211.5 +/- 0.4 (3 seeds), ~3 PPL worse than the Jamba baseline (208.5 +/- 0.7); this measurable quality cost for avoiding O(n²) computation needs explicit analysis against the substitution claim.

    Authors: We agree that the quality cost in the strictly causal Hub-GPT setting requires more explicit analysis. In the revised manuscript, we will include a discussion comparing this degradation to the computational savings and to similar trade-offs in other efficient attention mechanisms. The substitution claim is contextualized as providing a pluggable alternative for hybrid models, where the hybrid Hub-Jamba shows no degradation (and nominal improvement), while the pure causal variant incurs a cost that we now analyze more thoroughly. revision: yes

Circularity Check

0 steps flagged

Empirical architecture paper with one non-load-bearing self-citation

full rationale

The manuscript introduces HubRouter as a pluggable architectural module and validates it solely through direct training-run measurements of perplexity and throughput. No first-principles derivations, predictions, or uniqueness theorems are presented that could reduce to fitted parameters or prior self-citations by construction. The sole self-citation (to the companion paper defining the routing diagnostic task) supports an auxiliary diagnostic rather than any central claim. All quantitative results are reported as observed outcomes from matched-budget sweeps and multi-seed runs, not as quantities defined in terms of the routing mechanism itself.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

Based on abstract only; M is treated as a tunable hyperparameter with an identified reliable range, while hub tokens are learned during training rather than fixed constants.

free parameters (1)
  • M (hub count)
    Swept from 1-32; M=8-14 identified as reliably converging band, with orthogonal regularization rescuing M=6.
invented entities (1)
  • learned hub tokens (no independent evidence)
    purpose: Mediate cross-attention and provide routing fingerprints for token selection
    Core new component of the module; no external falsifiable prediction provided beyond internal training results.

pith-pipeline@v0.9.0 · 5700 in / 1399 out tokens · 50526 ms · 2026-05-08T12:17:17.339636+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 23 canonical work pages · 16 internal anchors

  1. [1]

    A. Basu. When does content-based routing work? Representation requirements for selective attention in hybrid sequence models. arXiv preprint arXiv:2603.20997, 2026

  2. [2]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Conference on Language Modeling (COLM), 2024. arXiv:2312.00752

  3. [3]

    B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, S. Biderman, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV, X. He, H. Hou, J. Lin, P. Kazienko, J. Kocon, J. Kong, B. Koptyra, H. Lau, K. S. I. Mantri, F. Mom, A. Saito, G. Song, X. Tang, B. Wang, J. S. Wind, S. Wozniak, R. Zhang, Z. Zhang, Q. Zhao, P. Zhou, Q. Zhou, J. Zhu, R.-J. Zhu. RWKV: Re...

  4. [4]

    Jamba: A Hybrid Transformer-Mamba Language Model

    O. Lieber, B. Lenz, H. Bata, G. Cohen, J. Osin, I. Dalmedigos, E. Safahi, S. Meirom, Y. Belinkov, S. Shalev-Shwartz, O. Abend, R. Alon, T. Asida, A. Bergman, R. Glozman, M. Gokhman, A. Manevich, N. Ratner, N. Rozen, E. Shwartz, M. Zusman, Y. Shoham. Jamba: A hybrid Transformer-Mamba language model. arXiv preprint arXiv:2403.19887, 2024

  5. [5]

    S. De, S. L. Smith, A. Fernando, A. Botev, G. Cristian-Muraru, A. Gu, R. Haroun, L. Berrada, Y. Chen, S. Srinivasan, G. Desjardins, A. Doucet, D. Budden, Y. W. Teh, R. Pascanu, N. De Freitas, C. Gulcehre. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024

  6. [6]

    Zamba: A Compact 7B SSM Hybrid Model

    P. Glorioso, Q. Anthony, Y. Tokpanov, J. Whittington, J. Pilault, A. Ibrahim, B. Millidge. Zamba: A compact 7B SSM hybrid model. arXiv preprint arXiv:2405.16712, 2024

  7. [7]

    Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers

    Together Research. Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers. Together AI blog, December 8, 2023.https://www.together.ai/blog/stripedhyena-7b

  8. [8]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024. arXiv:2405.21060

  9. [9]

    Mamba-3: Improved Sequence Modeling Using State Space Principles

    A. Lahoti, K. Y. Li, B. Chen, C. Wang, A. Bick, J. Z. Kolter, T. Dao, A. Gu. Mamba-3: Improved sequence modeling using state space principles. In International Conference on Learning Representations (ICLR), 2026

  10. [10]

    Generating Long Sequences with Sparse Transformers

    R. Child, S. Gray, A. Radford, I. Sutskever. Generating long sequences with sparse Transformers. arXiv preprint arXiv:1904.10509, 2019

  11. [11]

    Longformer: The Long-Document Transformer

    I. Beltagy, M. E. Peters, A. Cohan. Longformer: The long-document Transformer. arXiv preprint arXiv:2004.05150, 2020

  12. [12]

    Big Bird: Transformers for Longer Sequences

    M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS), 2020. arXiv:2007.14062

  13. [13]

    Rethinking Attention with Performers

    K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, A. Weller. Rethinking attention with Performers. In International Conference on Learning Representations (ICLR), 2021. arXiv:2009.14794

  14. [14]

    Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

    A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020. arXiv:2006.16236

  15. [15]

    T. Dao, D. Y. Fu, S. Ermon, A. Rudra, C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2205.14135

  16. [16]

    T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  17. [17]

    Perceiver: General perception with iterative attention

    A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, J. Carreira. Perceiver: General perception with iterative attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021. arXiv:2103.03206

  18. [18]

    J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, Y. W. Teh. Set Transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019. arXiv:1810.00825

  19. [19]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    W. Fedus, B. Zoph, N. Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–40, 2022. arXiv:2101.03961

  20. [20]

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. Le Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, W. El Sayed. Mixtral of Experts. arXiv preprint ar...

  21. [21]

    S. Wang, B. Z. Li, M. Khabsa, H. Fang, H. Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020

  22. [22]

    Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention

    Y. Xiong, Z. Zeng, R. Chakraborty, M. Tan, G. Fung, Y. Li, V. Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021. arXiv:2102.03902

  23. [23]

    Reformer: The Efficient Transformer

    N. Kitaev, L. Kaiser, A. Levskaya. Reformer: The efficient Transformer. In International Conference on Learning Representations (ICLR), 2020. arXiv:2001.04451

  24. [24]

    A. Roy, M. Saffar, A. Vaswani, D. Grangier. Efficient content-based sparse attention with routing Transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021. doi:10.1162/tacl_a_00353

  25. [25]

    M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus, Y. Bengio, S. Ermon, C. Ré. Hyena hierarchy: Towards larger convolutional language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023

  26. [26]

    Zoology: Measuring and improving recall in efficient language models

    S. Arora, S. Eyuboglu, A. Timalsina, I. Johnson, M. Poli, J. Zou, A. Rudra, C. Ré. Zoology: Measuring and improving recall in efficient language models. arXiv preprint arXiv:2312.04927, 2023