Hierarchical Global Attention (HGA)

Fedosov Vladimir; Grinenko Artemiy; Woernle Frank

arxiv: 2606.30709 · v1 · pith:O4LLI4KZnew · submitted 2026-06-29 · 💻 cs.LG · cs.AI

Pith reviewed 2026-07-01 06:51 UTC · model grok-4.3

keywords: hierarchical attention, sparse attention, long context transformers, drop-in replacement, memory efficient inference, RoPE, causal attention, routing

The pith

Hierarchical two-level routing approximates dense attention within 0.02 nats at 3% sparsity without changing any pretrained weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hierarchical Global Attention as a drop-in replacement for dense causal attention that works on existing long-context transformer checkpoints. It first selects relevant chunks through compact summaries that incorporate rotary position information, then routes within those chunks to smaller groups, and finally computes exact attention only on the chosen tokens. The paper reports that this keeps the gap to full attention at roughly 0.01 to 0.02 nats across 4K to 64K contexts while the complete key-value cache is stored in host memory or on NVMe storage. A reader would care because the method requires no retraining, no new parameters, and no modification to the original projection matrices, so it can be applied immediately to models that otherwise cannot fit large contexts on available hardware. The results indicate that the routing step recovers most of the attention signal that dense attention would have computed.

Core claim

HGA performs hierarchical two-level routing that retrieves relevant chunks using compact RoPE-aware summaries then refines by routing only the most relevant groups before exact token-level attention, achieving routed attention within approximately 0.01--0.02 nats of dense attention at 3% sparsity for 4K-64K contexts while preserving all pretrained weights.

What carries the argument

Hierarchical two-level routing that uses RoPE-aware chunk summaries to select a small set of tokens for exact attention while keeping the full K/V cache off the GPU.
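
The shape of such a two-level selection can be sketched in a few lines of NumPy. This is only an illustration under loud assumptions: the abstract gives no formula for the RoPE-aware summaries, so the sketch substitutes a plain mean of the keys for both chunk and group summaries, handles a single query and a single head, and uses hypothetical names throughout (routed_attention, chunk_size, group_size, top_chunks, top_groups). It should not be read as the authors' algorithm.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def exact_attention(q, K, V):
    # Exact single-query attention over the given keys and values.
    return softmax(K @ q / np.sqrt(q.shape[0])) @ V

def routed_attention(q, K, V, chunk_size, group_size, top_chunks, top_groups):
    # K and V hold the history before the query, so every row is causally visible.
    n, d = K.shape
    assert n % chunk_size == 0 and chunk_size % group_size == 0
    groups_per_chunk = chunk_size // group_size

    # Level 1: one summary per chunk (ASSUMPTION: plain mean of its keys,
    # standing in for the paper's unspecified RoPE-aware summaries).
    chunk_summaries = K.reshape(n // chunk_size, chunk_size, d).mean(axis=1)
    chunk_ids = np.argsort(chunk_summaries @ q)[-top_chunks:]

    # Level 2: score only the groups that live inside the selected chunks.
    group_summaries = K.reshape(n // group_size, group_size, d).mean(axis=1)
    candidates = (chunk_ids[:, None] * groups_per_chunk
                  + np.arange(groups_per_chunk)).ravel()
    best = np.argsort(group_summaries[candidates] @ q)[-top_groups:]
    group_ids = candidates[best]

    # Gather the tokens of the surviving groups and attend to them exactly.
    token_ids = np.sort((group_ids[:, None] * group_size
                         + np.arange(group_size)).ravel())
    return exact_attention(q, K[token_ids], V[token_ids]), token_ids

rng = np.random.default_rng(0)
K = rng.standard_normal((256, 16))
V = rng.standard_normal((256, 16))
q = rng.standard_normal(16)
out, ids = routed_attention(q, K, V, chunk_size=32, group_size=8,
                            top_chunks=2, top_groups=3)
print(len(ids), "of", len(K), "tokens fetched")
```

In this toy call only 3 groups of 8 tokens are gathered, so exact attention runs over 24 of the 256 cached tokens. Setting top_chunks and top_groups to cover everything reduces the sketch to dense attention, which is a convenient sanity check.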

If this is right

GPU memory consumption depends on model weights and the routed working set rather than total context length.
The full historical token K/V can reside in host RAM or NVMe while only a small subset moves to GPU during attention.
The method applies directly to existing checkpoints such as Qwen3 without any calibration or retraining.
Routed attention at 3% sparsity stays close enough to dense that the residual gap is attributed mainly to positional encoding.
The approach enables 64K-token inference on hardware where storing all token-level K/V pairs is impossible.
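
The memory consequences above can be made concrete with back-of-the-envelope arithmetic. The configuration values below (48 layers, 4 K/V heads, head dimension 128, 2 bytes per value) are illustrative placeholders and are not taken from the paper; the 3% figure is read here as the fraction of cached tokens moved to the GPU, which is an assumption about what the source means by sparsity.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    # Factor 2: one key vector and one value vector per token, layer, and K/V head.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value

# ASSUMED, illustrative configuration (not taken from the paper).
full = kv_cache_bytes(tokens=64 * 1024, layers=48, kv_heads=4,
                      head_dim=128, bytes_per_value=2)

# ASSUMPTION: the routed working set is about 3% of the cached tokens.
routed = full * 0.03

print(f"full K/V cache:     {full / 2**30:.2f} GiB")
print(f"routed working set: {routed / 2**20:.1f} MiB")
```

With these placeholder values the full cache comes to 6 GiB and the routed working set to roughly 184 MiB. The point is the scaling rather than the exact numbers: the first quantity grows linearly with context length, while the second grows only with the size of the routed set.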

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improving the quality of the initial chunk summaries could further reduce the already small quality gap.
The same hierarchical selection pattern might be tested on other attention variants or model families to check generality.
If the routing remains robust, combining it with quantization or other compression could push context lengths even higher on the same hardware.
The low sparsity level suggests attention distributions in these models contain strong structure that future positional encodings might exploit directly.

Load-bearing premise

The RoPE-aware summaries and subsequent group routing reliably surface the tokens that would have received significant attention weight in the full dense computation.
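
One way to probe this premise, wherever a dense run is affordable, is to measure how much of the dense softmax weight lands on the tokens the router selected. The helper below is a minimal sketch with a hypothetical name (attention_mass_recall); it assumes access to the dense pre-softmax scores for one query and to the router's selected indices, neither of which the abstract says the authors logged.

```python
import math

def attention_mass_recall(scores, selected):
    """Fraction of the dense softmax weight that falls on the selected indices."""
    m = max(scores)                      # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(weights[i] for i in set(selected)) / total

# Toy example: one dominant token and two weak ones.
scores = [math.log(8.0), 0.0, 0.0]
print(attention_mass_recall(scores, selected=[0]))
```

A recall of r means that exact attention over the selected set uses the dense weights on those tokens rescaled by 1/r, so a recall close to 1 implies the routed output stays close to the dense one.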

What would settle it

A long-context evaluation where the routing misses tokens that carry high attention weight in the dense case and the quality gap exceeds 0.02 nats on a standard benchmark.
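
For reference, a gap in nats can be computed as a difference in mean per-token negative log-likelihood between the routed and dense models on the same text. The abstract does not define its metric, so this reading is an assumption, and the function names (mean_nll_nats, nats_gap) are hypothetical.

```python
import math

def mean_nll_nats(token_probs):
    """Mean negative log-likelihood in nats, given the probability each
    model assigned to the true next token at every position."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def nats_gap(routed_probs, dense_probs):
    # Positive values mean the routed model is worse than the dense one.
    return mean_nll_nats(routed_probs) - mean_nll_nats(dense_probs)

# Toy example: routing costs 2% of the probability on every token.
dense = [0.50, 0.25, 0.10]
routed = [p * 0.98 for p in dense]
print(f"gap: {nats_gap(routed, dense):.4f} nats")
```

Under this reading, a gap of g nats corresponds to a perplexity ratio of exp(g), so 0.02 nats is about a 2% increase in perplexity.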

Original abstract

Hierarchical Global Attention (HGA) is a drop-in replacement for dense causal attention in pretrained long-context transformers. HGA preserves the original checkpoint parameters: the pretrained $W_Q$, $W_K$, $W_V$, and $W_O$ projections remain unchanged, no calibration parameters are introduced, and no retraining is required. Applied to Qwen3-30B-A3B-Instruct-2507-FP8 on a single RTX 5090 (32GB), the patched model runs out of the box at a 64K-token context, where token-level K/V storage is not feasible on this hardware. Unlike previous sparse-attention methods, HGA performs hierarchical two-level routing. It first retrieves relevant chunks using compact RoPE-aware summaries and then refines the selection by routing only the most relevant groups before performing exact token-level attention. This hierarchical retrieval significantly reduces the number of fetched tokens while preserving exact attention over the retrieved token set, making RAM- and NVMe-backed storage practical. The full historical token K/V resides in host RAM or NVMe storage, while only a small routed working set is transferred to GPU memory during attention. Consequently, GPU memory consumption depends primarily on model weights and the routed working set rather than on the total context length. Across all tested context lengths (4K--64K tokens), routed attention remains within approximately $0.01$--$0.02$ nats of dense attention while the sparsity used is just about 3%. These results suggest that the approximation introduced by hierarchical routing is small, and that the remaining quality gap is likely dominated by long-context positional encoding rather than by the routing algorithm itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HGA's hierarchical routing looks workable for memory-bound long-context inference, but the 0.01-0.02 nats gap claim at 64K has no direct dense baseline on the hardware used.

The main takeaway is that this paper describes a two-level routing scheme that keeps full KV caches in RAM or NVMe and only pulls a small routed subset to GPU, letting a 30B model handle 64K context on a single 32GB card without touching the pretrained weights. That part is concrete and addresses a real deployment constraint.

What stands out as new is the specific combination of RoPE-aware chunk summaries for initial retrieval, followed by group-level selection before exact token attention. The paper positions this as distinct from earlier sparse methods, and the drop-in nature with zero calibration or retraining is a practical plus.

The approach does well on the memory side: GPU usage stays tied to the model and the working set rather than total length, and the 3% sparsity target is aggressive. If the routing reliably surfaces the right tokens, this could be useful for people who need longer contexts on limited hardware.

The soft spot is the quality comparison. The text states that at 64K token-level KV storage is not feasible on the RTX 5090, yet still reports the routed attention stays within 0.01-0.02 nats of dense across the full 4K-64K range. Without a dense run at the longest length, the claim that routing error is not the dominant factor rests on an assumption rather than measurement. That gap in evidence is load-bearing for the central result.

This paper is aimed at practitioners doing long-context inference on consumer GPUs. Readers who need a working patch for existing checkpoints might get value from the routing design, but anyone expecting rigorous validation of the quality numbers at scale will find the evidence incomplete.

I would send it to peer review with a request that the authors either run the 64K dense baseline on larger hardware or supply a clear proxy that isolates routing error from positional effects.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Hierarchical Global Attention (HGA) as a drop-in replacement for dense causal attention in pretrained long-context transformers. It performs hierarchical two-level routing: first retrieving relevant chunks via compact RoPE-aware summaries, then refining by routing the most relevant groups before exact token-level attention. All pretrained weights (W_Q, W_K, W_V, W_O) are preserved with no new parameters or retraining. The method offloads full K/V to host RAM/NVMe, transferring only a small routed working set (~3% sparsity) to GPU, enabling 64K context on RTX 5090 hardware where full dense K/V storage is infeasible. It claims routed attention stays within 0.01--0.02 nats of dense attention across 4K--64K contexts.

Significance. If the empirical claims hold, HGA would be significant for practical long-context inference on limited hardware without retraining or quality degradation. The drop-in nature and offloading strategy address memory bottlenecks directly. The hierarchical routing reducing fetched tokens while keeping exact attention on the selected set is a clear technical contribution over prior sparse methods. However, the significance is limited by the absence of direct dense baselines at the longest lengths where the method is most needed.

major comments (2)

[Abstract] The claim that 'routed attention remains within approximately 0.01--0.02 nats of dense attention' across all tested lengths including 64K is not supported at 64K. The text states that 'token-level K/V storage is not feasible on this hardware' (RTX 5090 32GB), so no dense baseline can be run at that length. This directly undermines the central assertion that the hierarchical routing approximation is small and that any remaining gap is dominated by positional encoding rather than routing errors.
[Abstract] No equations, pseudocode, or implementation details are provided for the RoPE-aware chunk summaries or the group-level routing thresholds. Without these, it is impossible to verify whether the two-level retrieval reliably surfaces the tokens that would receive significant weight in the full dense computation, which is the load-bearing assumption for the reported quality gap remaining small at 3% sparsity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the central claims require qualification where direct dense baselines are unavailable and that additional implementation details will strengthen verifiability. We outline point-by-point revisions below.

Referee: [Abstract] The claim that 'routed attention remains within approximately 0.01--0.02 nats of dense attention' across all tested lengths including 64K is not supported at 64K. The text states that 'token-level K/V storage is not feasible on this hardware' (RTX 5090 32GB), so no dense baseline can be run at that length. This directly undermines the central assertion that the hierarchical routing approximation is small and that any remaining gap is dominated by positional encoding rather than routing errors.

Authors: We agree that the manuscript cannot claim a measured 0.01--0.02 nats gap at 64K without a dense baseline. We will revise the abstract to state that the reported gap holds for context lengths where dense attention is computationally feasible on the hardware (explicitly listing the tested lengths up to the maximum feasible), and that at 64K the method enables inference while the approximation quality is supported by the hierarchical design validated at shorter lengths. This directly addresses the concern without overstating the evidence.

Revision: yes

Referee: [Abstract] No equations, pseudocode, or implementation details are provided for the RoPE-aware chunk summaries or the group-level routing thresholds. Without these, it is impossible to verify whether the two-level retrieval reliably surfaces the tokens that would receive significant weight in the full dense computation, which is the load-bearing assumption for the reported quality gap remaining small at 3% sparsity.

Authors: The current manuscript describes the two-level routing at a high level but does not include explicit equations for the RoPE-aware summaries or pseudocode for the group-level thresholds. We will add these to the Methods section in the revision, including the summary computation formula and the routing decision procedure, to enable independent verification of the retrieval reliability.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external dense-attention benchmarks with no fitted quantities or self-citation chains

The manuscript presents HGA as an algorithmic drop-in replacement whose only load-bearing claims are (a) preservation of pretrained weights with no retraining and (b) an observed 0.01–0.02 nat gap to dense attention at ~3% sparsity. No equations, ansatzes, or fitted parameters appear; the quality-gap statement is a direct empirical comparison rather than a derived prediction. No self-citations are invoked to justify uniqueness or to close any derivation loop. The 64K hardware limitation noted in the text affects the strength of evidence but does not create a definitional or self-referential reduction. The derivation chain is therefore self-contained against the external dense baseline.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are identifiable from the provided text.

Reference graph

Works this paper leans on

15 extracted references · 7 canonical work pages · 6 internal anchors

[1] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv:2305.13245, 2023.
[2] Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020.
[3] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv:1904.10509, 2019.
[4] Krzysztof Choromanski et al. Rethinking attention with performers. In International Conference on Learning Representations, 2021.
[5] Huiqiang Jiang et al. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. arXiv:2407.02490, 2024.
[6] Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020.
[7] Woosuk Kwon et al. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS Symposium on Operating Systems Principles, 2023.
[8] Guilherme Penedo et al. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv:2406.17557, 2024.
[9] Qwen Team. Qwen3 technical report. arXiv:2505.09388, 2025.
[10] Qwen Team. Qwen3-30B-A3B-Instruct-2507-FP8 model card. Hugging Face, 2025. https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8
[11] Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with Routing Transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
[12] Jianlin Su et al. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[13] Ashish Vaswani et al. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
[14] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. arXiv:2309.00071, 2023.
[15] Manzil Zaheer et al. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, 2020.
