Recognition: unknown
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
Pith reviewed 2026-05-08 12:17 UTC · model grok-4.3
The pith
HubRouter replaces full quadratic attention with O(nM) hub-mediated routing using a small set of learned hubs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HubRouter implements an encode-decode-score-council pipeline in which M learned hub tokens cross-attend to the full sequence, tokens project to produce routing fingerprints against those hubs, a score head selects a top-k council, and sparse attention occurs only within the council. When inserted into Jamba-style hybrids this yields a nominal 4.2 percent perplexity improvement and up to roughly 90x training throughput at length 1024; graduated replacement of 25 percent of Transformer attention layers produces the best perplexity under matched budgets; and a strictly causal variant achieves 211.5 perplexity after a council-causal fix that removes a bidirectional leak. A sweep across hub sizes (roughly 105 runs over M = 1 to 32) identifies M = 8-14 as the reliably converging band, with orthogonal regularization rescuing M = 6 and seed sensitivity growing for M of 20 and above.
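Read as pseudocode, one layer of this pipeline could look like the minimal PyTorch sketch below. The module name HubRouterSketch, the projection layers, and the defaults M = 12 and k = 64 are illustrative assumptions rather than the paper's implementation, and this version selects one shared council with no causal masking or chunking, so it corresponds to the bidirectional setting rather than Hub-GPT.

```python
# Minimal sketch of an encode-decode-score-council layer (assumed shapes and names;
# hypothetical defaults M=12, k=64; no causality or chunking).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HubRouterSketch(nn.Module):
    def __init__(self, d_model: int, n_hubs: int = 12, k: int = 64):
        super().__init__()
        self.hubs = nn.Parameter(0.02 * torch.randn(n_hubs, d_model))  # M learned hub tokens
        self.hub_q = nn.Linear(d_model, d_model)     # encode: hubs query the sequence
        self.seq_k = nn.Linear(d_model, d_model)
        self.tok_proj = nn.Linear(d_model, d_model)  # decode: tokens project against hub summaries
        self.score = nn.Linear(n_hubs, 1)            # score head over routing fingerprints
        self.council_attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, n, d_model)
        b, n, d = x.shape
        # Encode: M hubs cross-attend to all n tokens -> hub summaries, O(n*M)
        hub_q = self.hub_q(self.hubs).expand(b, -1, -1)                        # (b, M, d)
        hub_summary = F.scaled_dot_product_attention(hub_q, self.seq_k(x), x)  # (b, M, d)
        # Decode: per-token routing fingerprints against the hub summaries, O(n*M)
        fingerprints = torch.einsum("bnd,bmd->bnm", self.tok_proj(x), hub_summary)
        # Score: a scalar relevance per token, then a top-k council
        scores = self.score(fingerprints).squeeze(-1)                          # (b, n)
        k = min(self.k, n)
        idx = scores.topk(k, dim=-1).indices                                   # (b, k)
        council = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, d))
        # Council: sparse attention from every token into the selected subset, O(n*k)
        out, _ = self.council_attn(x, council, council)
        return out
```

With M and k held fixed, each stage of the sketch scales linearly in sequence length n, which is where the O(nM) characterization comes from.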
What carries the argument
The encode-decode-score-council pipeline driven by M learned hub tokens that cross-attend to produce compact routing fingerprints and select a sparse top-k council for attention.
If this is right
- Hybrid architectures can be trained at substantially higher throughput with only small or no perplexity penalty.
- Replacing only a fraction of attention layers can outperform both full attention and heavier replacement under the same compute budget.
- Hub counts in the 8-14 range converge reliably across random seeds, with orthogonal regularization able to stabilize smaller counts (a generic form of such a penalty is sketched after this list).
- Once the causal council fix is applied, performance becomes insensitive to chunk size and the routing behaves as intended without leaks.
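A generic form of the orthogonal regularization mentioned in the third bullet is sketched below; the unit normalization, the Frobenius-norm form, and the 1e-3 weight are assumptions for illustration, not the paper's stated loss term.

```python
import torch
import torch.nn.functional as F

def hub_orthogonality_penalty(hubs: torch.Tensor) -> torch.Tensor:
    """Frobenius-norm penalty pushing an (M, d) matrix of hub embeddings toward
    mutually orthogonal directions: || H_hat @ H_hat^T - I ||_F^2."""
    h = F.normalize(hubs, dim=-1)                  # unit-normalize each hub vector
    gram = h @ h.t()                               # (M, M) pairwise cosine similarities
    eye = torch.eye(gram.shape[0], device=gram.device, dtype=gram.dtype)
    return ((gram - eye) ** 2).sum()

# Hypothetical usage during training:
# loss = lm_loss + 1e-3 * hub_orthogonality_penalty(model.hubs)
```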
Where Pith is reading between the lines
- The pluggable design implies that HubRouter could be inserted into other attention-heavy pipelines beyond the two architectures tested, provided the task does not require uniform access to every token.
- If the information-preservation assumption holds at scale, the same hub mechanism could be combined with existing length-extrapolation techniques to push practical context windows further.
- The companion diagnostic task referenced in the paper offers a direct way to measure whether a given hub count is sufficient for a new domain before full training.
Load-bearing premise
That the learned hubs together with top-k council selection can preserve enough of the information that full attention would have captured for the model's predictions.
What would settle it
A controlled experiment on a long-range dependency task at increasing sequence lengths that directly compares next-token accuracy of HubRouter against an otherwise identical full-attention model; a widening gap would falsify the claim that the sparse routing is information-preserving.
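A minimal harness for that comparison could look like the sketch below; make_longrange_batches, hub_model, and full_attn_model are hypothetical stand-ins for the diagnostic task and the two otherwise-identical models.

```python
import torch

@torch.no_grad()
def next_token_accuracy(model, batches) -> float:
    """Mean next-token accuracy over an iterable of (input_ids, target_ids) batches."""
    correct, total = 0, 0
    for input_ids, target_ids in batches:
        logits = model(input_ids)              # (batch, seq, vocab)
        preds = logits.argmax(dim=-1)
        correct += (preds == target_ids).sum().item()
        total += target_ids.numel()
    return correct / max(total, 1)

# for n in (1024, 2048, 4096, 8192):           # increasing sequence lengths
#     batches = make_longrange_batches(seq_len=n)
#     gap = next_token_accuracy(full_attn_model, batches) - next_token_accuracy(hub_model, batches)
#     print(n, gap)  # a gap that widens with n would falsify information preservation
```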
Original abstract
We introduce HubRouter, a pluggable module that replaces O(n^2) attention layers with O(nM) hub-mediated routing, where M << n is a small number of learned hub tokens. We demonstrate it in two from-scratch architectures: a Jamba-style hybrid and a 12-layer Transformer; retrofit into pretrained models is a tested negative case. HubRouter implements an encode-decode-score-council pipeline: M learned hubs cross-attend to all tokens, tokens project against hubs for routing fingerprints, a score head selects top-k tokens, and a sparse council attends only to the selected subset. We validate HubRouter in three settings. (1) Hub-Jamba yields a nominal 4.2% PPL improvement (200.2 vs 209.0, single seed; possibly within seed noise) and up to ~90x training throughput at sequence length 1024 in matched PyTorch-native baselines; an optimised baseline would narrow this to ~10-15x. (2) Graduated replacement of 25% of Transformer attention layers gives the best perplexity in our matched-budget sweep (268.0 vs 282.4 pure Transformer). (3) Hub-GPT provides strictly causal routing, achieving PPL 211.5 +/- 0.4 over 3 seeds (post council-causal fix); approximately 3 PPL worse than Jamba's 208.5 +/- 0.7, a measurable quality cost for avoiding O(n^2) computation. Post-fix, chunk size C has little effect; the pre-fix chunk-size benefit was an artifact of a bidirectional-council leak we found in adversarial review. A multi-seed hub-count sweep (~105 runs across M=1-32) reveals M=8-14 as the reliably-converging sub-band (4-5/5 seeds); M=6 is rescued to 5/5 by orthogonal regularization, while M>=20 shows increasing seed sensitivity. Companion paper arXiv:2603.20997 (Basu, 2026) defines the routing diagnostic task. Code and scripts will be released.
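As a rough sanity check on the scaling claim, the per-layer count of pairwise score computations can be compared directly; the council size k below is an assumed value (the abstract does not state it), and the 90x-versus-10-15x spread above shows that realized throughput depends on baseline kernel quality as much as on this count.

```python
# Back-of-the-envelope score-computation count per layer at n = 1024,
# with M = 12 hubs and an assumed council size k = 64.
n, M, k = 1024, 12, 64
full_attention = n * n           # 1,048,576 query-key scores
hub_routing = n * M + n * k      # hub encode/decode plus attention into the council
print(full_attention / hub_routing)  # ~13.5x fewer scores; wall-clock gains differ
```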
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HubRouter, a pluggable module replacing O(n²) attention layers with O(nM) hub-mediated routing (M << n learned hub tokens) via an encode-decode-score-council pipeline. It evaluates the approach in a Jamba-style hybrid model and a 12-layer Transformer, reporting a nominal 4.2% PPL improvement in the hybrid (single seed), best perplexity with 25% graduated replacement, and up to ~90x training throughput gains; a strictly causal Hub-GPT variant is also tested, along with multi-seed hub-count sweeps and a post-hoc fix for a discovered causal leak in the council mechanism.
Significance. If the central claim holds—that learned hubs with top-k council selection can substitute for full attention while preserving modeling capacity—this would provide a useful architectural primitive for efficient hybrid sequence models. The pluggable design, planned code release, and multi-seed analysis for the M parameter (showing reliable convergence at M=8-14) are strengths. However, the evidence remains preliminary given the single-seed flagship result and implementation sensitivities.
major comments (3)
- [Abstract] The flagship claim of a 4.2% PPL improvement (200.2 vs 209.0) for Hub-Jamba is based on a single seed and explicitly noted as possibly within noise; this directly undercuts the load-bearing assertion that the encode-decode-score-council pipeline substitutes for full attention without degrading information flow.
- [Experiments: Hub-Jamba and causal-fix discussion] The post-hoc discovery of a bidirectional council leak (which altered pre-fix chunk-size conclusions) indicates that reported perplexity can be sensitive to subtle implementation details of the sparse council; while the fix is applied, this raises questions about whether post-fix results isolate true routing quality.
- [Hub-GPT evaluation] The strictly causal variant achieves PPL 211.5 +/- 0.4 (3 seeds), ~3 PPL worse than the Jamba baseline (208.5 +/- 0.7); this measurable quality cost for avoiding O(n²) computation needs explicit analysis against the substitution claim.
minor comments (2)
- [Method] The notation for hub count M and council size k could be more consistently defined across sections and figures to aid reproducibility.
- [Throughput experiments] Baselines for throughput (PyTorch-native vs optimized) are mentioned but lack a clear table comparing exact configurations and hardware.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address the major comments point-by-point below, with planned revisions to strengthen the evidence for our claims.
Point-by-point responses
- Referee: The flagship claim of a 4.2% PPL improvement (200.2 vs 209.0) for Hub-Jamba is based on a single seed and explicitly noted as possibly within noise; this directly undercuts the load-bearing assertion that the encode-decode-score-council pipeline substitutes for full attention without degrading information flow.
  Authors: We recognize the limitation of the single-seed result for the flagship Hub-Jamba experiment. Although we already note in the manuscript that it may be within seed noise, we will perform additional runs with multiple random seeds for this configuration in the revision. This will allow us to report statistics and better substantiate the claim. The substitution assertion is supported by the overall experimental suite, including the graduated-replacement results, where partial substitution yields the best perplexity, and the multi-seed hub-count analysis showing reliable convergence for appropriate M values. revision: yes
- Referee: The post-hoc discovery of a bidirectional council leak (which altered pre-fix chunk-size conclusions) indicates that reported perplexity can be sensitive to subtle implementation details of the sparse council; while the fix is applied, this raises questions about whether post-fix results isolate true routing quality.
  Authors: The identification of the council leak was indeed post-hoc, and we appreciate the referee highlighting the potential sensitivity. We have implemented the causal fix and observed that chunk-size effects disappear post-fix, indicating stability. To demonstrate that the results isolate routing quality, we will add a dedicated subsection in the experiments detailing the leak, the fix, and pre/post-fix comparisons, along with further ablations on the council selection process (an illustrative causal-council mask is sketched after these responses). revision: yes
- Referee: The strictly causal variant achieves PPL 211.5 +/- 0.4 (3 seeds), ~3 PPL worse than the Jamba baseline (208.5 +/- 0.7); this measurable quality cost for avoiding O(n²) computation needs explicit analysis against the substitution claim.
  Authors: We agree that the quality cost in the strictly causal Hub-GPT setting requires more explicit analysis. In the revised manuscript, we will include a discussion comparing this degradation to the computational savings and to similar trade-offs in other efficient attention mechanisms. The substitution claim is contextualized as providing a pluggable alternative for hybrid models, where the hybrid Hub-Jamba shows no degradation (and a nominal improvement), while the pure causal variant incurs a cost that we now analyze more thoroughly. revision: yes
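For the council-leak discussion above, the sketch below shows one illustrative way to enforce strict causality over a selected council: a query may only attend to council members whose original position does not exceed its own. This is a generic mask, not the authors' chunked implementation, and the tensor names are assumptions.

```python
import torch

def causal_council_mask(council_positions: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Boolean (seq_len, k) mask: True where attention from query position i to a
    council member is allowed, i.e. the member's original position is <= i."""
    query_pos = torch.arange(seq_len, device=council_positions.device).unsqueeze(1)  # (seq_len, 1)
    return council_positions.unsqueeze(0) <= query_pos                               # (seq_len, k)

# Hypothetical use, where `scores` holds (seq_len, k) query-to-council logits:
# allowed = causal_council_mask(council_positions, seq_len)
# scores = scores.masked_fill(~allowed, float("-inf"))  # future councillors contribute nothing
```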
Circularity Check
Empirical architecture paper with one non-load-bearing self-citation
full rationale
The manuscript introduces HubRouter as a pluggable architectural module and validates it solely through direct training-run measurements of perplexity and throughput. No first-principles derivations, predictions, or uniqueness theorems are presented that could reduce to fitted parameters or prior self-citations by construction. The sole self-citation (to the companion paper defining the routing diagnostic task) supports an auxiliary diagnostic rather than any central claim. All quantitative results are reported as observed outcomes from matched-budget sweeps and multi-seed runs, not as quantities defined in terms of the routing mechanism itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- M (hub count)
invented entities (1)
- learned hub tokens (no independent evidence)
Reference graph
Works this paper leans on
- [1] A. Basu. When does content-based routing work? Representation requirements for selective attention in hybrid sequence models. arXiv preprint arXiv:2603.20997, 2026.
- [2] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In Conference on Language Modeling (COLM), 2024. arXiv:2312.00752.
- [3] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, S. Biderman, H. Cao, X. Cheng, M. Chung, M. Grella, K. K. GV, X. He, H. Hou, J. Lin, P. Kazienko, J. Kocon, J. Kong, B. Koptyra, H. Lau, K. S. I. Mantri, F. Mom, A. Saito, G. Song, X. Tang, B. Wang, J. S. Wind, S. Wozniak, R. Zhang, Z. Zhang, Q. Zhao, P. Zhou, Q. Zhou, J. Zhu, R.-J. Zhu. RWKV: Reinventing RNNs for the Transformer era, 2023.
- [4] O. Lieber, B. Lenz, H. Bata, G. Cohen, J. Osin, I. Dalmedigos, E. Safahi, S. Meirom, Y. Belinkov, S. Shalev-Shwartz, O. Abend, R. Alon, T. Asida, A. Bergman, R. Glozman, M. Gokhman, A. Manevich, N. Ratner, N. Rozen, E. Shwartz, M. Zusman, Y. Shoham. Jamba: A hybrid Transformer-Mamba language model. arXiv preprint arXiv:2403.19887, 2024.
- [5] S. De, S. L. Smith, A. Fernando, A. Botev, G. Cristian-Muraru, A. Gu, R. Haroun, L. Berrada, Y. Chen, S. Srinivasan, G. Desjardins, A. Doucet, D. Budden, Y. W. Teh, R. Pascanu, N. De Freitas, C. Gulcehre. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
- [6] P. Glorioso, Q. Anthony, Y. Tokpanov, J. Whittington, J. Pilault, A. Ibrahim, B. Millidge. Zamba: A compact 7B SSM hybrid model. arXiv preprint arXiv:2405.16712, 2024.
- [7] Together Research. Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers. Together AI blog, December 8, 2023. https://www.together.ai/blog/stripedhyena-7b
- [8] T. Dao and A. Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024. arXiv:2405.21060.
- [9] A. Lahoti, K. Y. Li, B. Chen, C. Wang, A. Bick, J. Z. Kolter, T. Dao, A. Gu. Mamba-3: Improved sequence modeling using state space principles. In International Conference on Learning Representations (ICLR), 2026.
- [10] R. Child, S. Gray, A. Radford, I. Sutskever. Generating long sequences with sparse Transformers. arXiv preprint arXiv:1904.10509, 2019.
- [11] I. Beltagy, M. E. Peters, A. Cohan. Longformer: The long-document Transformer. arXiv preprint arXiv:2004.05150, 2020.
- [12] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS), 2020. arXiv:2007.14062.
- [13] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, A. Weller. Rethinking attention with Performers. In International Conference on Learning Representations (ICLR), 2021. arXiv:2009.14794.
- [14] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020. arXiv:2006.16236.
- [15] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2205.14135.
- [16] T. Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- [17] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, J. Carreira. Perceiver: General perception with iterative attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021. arXiv:2103.03206.
- [18]
- [19] W. Fedus, B. Zoph, N. Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–40, 2022. arXiv:2101.03961.
- [20] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. Le Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, W. El Sayed. Mixtral of Experts. arXiv preprint, 2024.
- [21] S. Wang, B. Z. Li, M. Khabsa, H. Fang, H. Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
- [22]
- [23] N. Kitaev, L. Kaiser, A. Levskaya. Reformer: The efficient Transformer. In International Conference on Learning Representations (ICLR), 2020. arXiv:2001.04451.
- [24] A. Roy, M. Saffar, A. Vaswani, D. Grangier. Efficient content-based sparse attention with routing Transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021. doi:10.1162/tacl_a_00353.
- [25] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, S. Baccus, Y. Bengio, S. Ermon, C. Ré. Hyena hierarchy: Towards larger convolutional language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
- [26] S. Arora, S. Eyuboglu, A. Timalsina, I. Johnson, M. Poli, J. Zou, A. Rudra, C. Ré. Zoology: Measuring and improving recall in efficient language models. arXiv preprint arXiv:2312.04927, 2023.
discussion (0)