Hierarchical Global Attention (HGA)
Pith reviewed 2026-07-01 06:51 UTC · model grok-4.3
The pith
Hierarchical two-level routing approximates dense attention within 0.02 nats at 3% sparsity without changing any pretrained weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HGA performs hierarchical two-level routing: it first retrieves relevant chunks using compact RoPE-aware summaries, then refines the selection by routing only the most relevant groups, and finally runs exact token-level attention over the retrieved tokens. At 3% sparsity, routed attention stays within approximately 0.01--0.02 nats of dense attention for 4K--64K contexts, and all pretrained weights are preserved.
What carries the argument
Hierarchical two-level routing that uses RoPE-aware chunk summaries to select a small set of tokens for exact attention while keeping the full K/V cache off the GPU.
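A minimal sketch of what two-level routing could look like for a single query at decode time. It rests on assumptions the source does not confirm: each summary here is a plain mean of key vectors (the paper's RoPE-aware summary is not specified, and this stand-in ignores RoPE), selection is a fixed top-k at both levels, and `route_and_attend`, `chunk_size`, `group_size`, `top_chunks`, and `top_groups` are hypothetical names.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(vectors):
    # Assumed stand-in for a chunk/group summary: the mean key vector.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def top_k(scored, k):
    # Keep the k items with the highest score from (score, item) pairs.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [item for _, item in ranked[:k]]

def route_and_attend(q, keys, values, chunk_size, group_size,
                     top_chunks, top_groups):
    n = len(keys)

    # Level 1: score every chunk by its summary and keep the best ones.
    scored_chunks = [
        (dot(q, mean_vec(keys[s:s + chunk_size])), s)
        for s in range(0, n, chunk_size)
    ]
    kept_chunks = top_k(scored_chunks, top_chunks)

    # Level 2: score the groups inside the kept chunks only.
    scored_groups = []
    for s in kept_chunks:
        end = min(s + chunk_size, n)
        for g in range(s, end, group_size):
            g_end = min(g + group_size, end)
            scored_groups.append((dot(q, mean_vec(keys[g:g_end])), (g, g_end)))
    kept_groups = top_k(scored_groups, top_groups)
    token_ids = sorted(i for g, g_end in kept_groups for i in range(g, g_end))

    # Exact softmax attention, restricted to the routed tokens.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in token_ids]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    dim = len(values[0])
    out = [
        sum(w * values[i][d] for w, i in zip(weights, token_ids)) / z
        for d in range(dim)
    ]
    return token_ids, out
```

In this sketch only the keys and values at `token_ids` would need to be on the GPU; the summaries are the only per-chunk data consulted before the fetch.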
If this is right
- GPU memory consumption depends on model weights and the routed working set rather than total context length.
- The full historical token K/V can reside in host RAM or NVMe while only a small subset moves to GPU during attention.
- The method applies directly to existing checkpoints such as Qwen3 without any calibration or retraining.
- Routed attention at 3% sparsity stays close enough to dense that the residual gap is attributed mainly to positional encoding.
- The approach enables 64K-token inference on hardware where storing all token-level K/V pairs is impossible.
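The memory claims above can be made concrete with a back-of-the-envelope calculation. The layer count, KV-head count, head dimension, and 2-byte K/V precision below are illustrative assumptions, not values taken from the source, and treating 3% as the fraction of tokens kept in every layer is a simplification.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    # The factor 2 accounts for storing both a key and a value.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def kv_cache_bytes(context_tokens, per_token_bytes, kept_fraction=1.0):
    # Bytes needed for the kept share of the context's K/V entries.
    return int(context_tokens * kept_fraction) * per_token_bytes

# Assumed, illustrative configuration (check the real model config).
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 48, 4, 128, 2

per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
full = kv_cache_bytes(64 * 1024, per_token)
routed = kv_cache_bytes(64 * 1024, per_token, kept_fraction=0.03)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"full 64K cache: {full / 1024**3:.2f} GiB")
print(f"3% working set: {routed / 1024**2:.1f} MiB")
```

Under these assumptions the full cache grows linearly with context length, while the routed working set stays a small fixed fraction of it.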
Where Pith is reading between the lines
- Improving the quality of the initial chunk summaries could further reduce the already small quality gap.
- The same hierarchical selection pattern might be tested on other attention variants or model families to check generality.
- If the routing remains robust, combining it with quantization or other compression could push context lengths even higher on the same hardware.
- The low sparsity level suggests attention distributions in these models contain strong structure that future positional encodings might exploit directly.
Load-bearing premise
The RoPE-aware summaries and subsequent group routing reliably surface the tokens that would have received significant attention weight in the full dense computation.
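One way this premise could be checked at lengths where dense attention is still affordable is to measure how much of the dense softmax mass falls on the routed tokens. The sketch below is an assumed diagnostic, not something the source describes; `attention_mass_recall` and `selected_ids` are hypothetical names.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def attention_mass_recall(q, keys, selected_ids):
    # Dense attention weights for one query over all keys.
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(logits)
    # Share of the dense probability mass that the routed set covers.
    return sum(weights[i] for i in set(selected_ids))
```

A recall near 1.0 at a small selected fraction would support the premise; a low recall on some heads or positions would point to routing misses.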
What would settle it
A long-context evaluation where the routing misses tokens that carry high attention weight in the dense case and the quality gap exceeds 0.02 nats on a standard benchmark.
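A sketch of how such a test could be scored, assuming the gap in nats means the difference in mean per-token negative log-likelihood between the routed and dense models on the same text (the source does not state the exact metric). The function names and the 0.02 default threshold are illustrative.

```python
import math

def mean_nll_nats(target_probs):
    # Mean negative log-likelihood (natural log) of the correct tokens.
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def gap_in_nats(dense_probs, routed_probs):
    # Positive when the routed model assigns less probability to the targets.
    return mean_nll_nats(routed_probs) - mean_nll_nats(dense_probs)

def exceeds_threshold(dense_probs, routed_probs, threshold=0.02):
    return gap_in_nats(dense_probs, routed_probs) > threshold
```

Both probability lists must come from the same token positions, so the comparison is only possible where a dense run fits on the hardware.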
Hierarchical Global Attention (HGA) is a drop-in replacement for dense causal attention in pretrained long-context transformers. HGA preserves the original checkpoint parameters: the pretrained $W_Q$, $W_K$, $W_V$, and $W_O$ projections remain unchanged, no calibration parameters are introduced, and no retraining is required. Applied to Qwen3-30B-A3B-Instruct-2507-FP8 on a single RTX 5090 (32GB), the patched model runs out of the box at a 64K-token context, where token-level K/V storage is not feasible on this hardware. Unlike previous sparse-attention methods, HGA performs hierarchical two-level routing. It first retrieves relevant chunks using compact RoPE-aware summaries and then refines the selection by routing only the most relevant groups before performing exact token-level attention. This hierarchical retrieval significantly reduces the number of fetched tokens while preserving exact attention over the retrieved token set, making RAM- and NVMe-backed storage practical. The full historical token K/V resides in host RAM or NVMe storage, while only a small routed working set is transferred to GPU memory during attention. Consequently, GPU memory consumption depends primarily on model weights and the routed working set rather than on the total context length. Across all tested context lengths (4K--64K tokens), routed attention remains within approximately $0.01$--$0.02$ nats of dense attention while the sparsity used is just about 3%. These results suggest that the approximation introduced by hierarchical routing is small, and that the remaining quality gap is likely dominated by long-context positional encoding rather than by the routing algorithm itself.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Hierarchical Global Attention (HGA) as a drop-in replacement for dense causal attention in pretrained long-context transformers. It performs hierarchical two-level routing: first retrieving relevant chunks via compact RoPE-aware summaries, then refining by routing the most relevant groups before exact token-level attention. All pretrained weights (W_Q, W_K, W_V, W_O) are preserved with no new parameters or retraining. The method offloads full K/V to host RAM/NVMe, transferring only a small routed working set (~3% sparsity) to GPU, enabling 64K context on RTX 5090 hardware where full dense K/V storage is infeasible. It claims routed attention stays within 0.01--0.02 nats of dense attention across 4K--64K contexts.
Significance. If the empirical claims hold, HGA would be significant for practical long-context inference on limited hardware without retraining or quality degradation. The drop-in nature and offloading strategy address memory bottlenecks directly. The hierarchical routing reducing fetched tokens while keeping exact attention on the selected set is a clear technical contribution over prior sparse methods. However, the significance is limited by the absence of direct dense baselines at the longest lengths where the method is most needed.
major comments (2)
- [Abstract] The claim that 'routed attention remains within approximately 0.01--0.02 nats of dense attention' across all tested lengths including 64K is not supported at 64K. The text states that 'token-level K/V storage is not feasible on this hardware' (RTX 5090 32GB), so no dense baseline can be run at that length. This directly undermines the central assertion that the hierarchical routing approximation is small and that any remaining gap is dominated by positional encoding rather than routing errors.
- [Abstract] No equations, pseudocode, or implementation details are provided for the RoPE-aware chunk summaries or the group-level routing thresholds. Without these, it is impossible to verify whether the two-level retrieval reliably surfaces the tokens that would receive significant weight in the full dense computation, which is the load-bearing assumption for the reported quality gap remaining small at 3% sparsity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We agree that the central claims require qualification where direct dense baselines are unavailable and that additional implementation details will strengthen verifiability. We outline point-by-point revisions below.
- Referee: [Abstract] The claim that 'routed attention remains within approximately 0.01--0.02 nats of dense attention' across all tested lengths including 64K is not supported at 64K. The text states that 'token-level K/V storage is not feasible on this hardware' (RTX 5090 32GB), so no dense baseline can be run at that length. This directly undermines the central assertion that the hierarchical routing approximation is small and that any remaining gap is dominated by positional encoding rather than routing errors.
  Authors: We agree that the manuscript cannot claim a measured 0.01--0.02 nats gap at 64K without a dense baseline. We will revise the abstract to state that the reported gap holds for context lengths where dense attention is computationally feasible on the hardware (explicitly listing the tested lengths up to the maximum feasible), and that at 64K the method enables inference while the approximation quality is supported by the hierarchical design validated at shorter lengths. This directly addresses the concern without overstating the evidence. Revision: yes.
- Referee: [Abstract] No equations, pseudocode, or implementation details are provided for the RoPE-aware chunk summaries or the group-level routing thresholds. Without these, it is impossible to verify whether the two-level retrieval reliably surfaces the tokens that would receive significant weight in the full dense computation, which is the load-bearing assumption for the reported quality gap remaining small at 3% sparsity.
  Authors: The current manuscript describes the two-level routing at a high level but does not include explicit equations for the RoPE-aware summaries or pseudocode for the group-level thresholds. We will add these to the Methods section in the revision, including the summary computation formula and the routing decision procedure, to enable independent verification of the retrieval reliability. Revision: yes.
Circularity Check
No circularity: empirical claims rest on external dense-attention benchmarks with no fitted quantities or self-citation chains
The manuscript presents HGA as an algorithmic drop-in replacement whose only load-bearing claims are (a) preservation of pretrained weights with no retraining and (b) an observed 0.01--0.02 nat gap to dense attention at ~3% sparsity. No equations, ansatzes, or fitted parameters appear; the quality-gap statement is a direct empirical comparison rather than a derived prediction. No self-citations are invoked to justify uniqueness or to close any derivation loop. The 64K hardware limitation noted in the text affects the strength of evidence but does not create a definitional or self-referential reduction. The derivation chain is therefore self-contained against the external dense baseline.