
arxiv: 2606.30709 · v1 · pith:O4LLI4KZnew · submitted 2026-06-29 · 💻 cs.LG · cs.AI

Hierarchical Global Attention (HGA)

Pith reviewed 2026-07-01 06:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords hierarchical attention · sparse attention · long context transformers · drop-in replacement · memory efficient inference · RoPE · causal attention · routing

The pith

Hierarchical two-level routing approximates dense attention within 0.02 nats at 3% sparsity without changing any pretrained weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hierarchical Global Attention as a drop-in replacement for dense causal attention that works on existing long-context transformer checkpoints. It first selects relevant chunks through compact summaries that incorporate rotary position information, then routes within those chunks to smaller groups, and finally computes exact attention only on the chosen tokens. This keeps the quality gap to full attention very small across 4K to 64K contexts while storing the complete key-value cache in host memory or storage. A reader would care because the method requires no retraining, no new parameters, and no modification to the original projection matrices, so it can be applied immediately to models that otherwise cannot fit large contexts on available hardware. The results indicate that the routing step successfully recovers most of the attention signal that would have been computed densely.

Core claim

HGA performs hierarchical two-level routing that retrieves relevant chunks using compact RoPE-aware summaries then refines by routing only the most relevant groups before exact token-level attention, achieving routed attention within approximately 0.01--0.02 nats of dense attention at 3% sparsity for 4K-64K contexts while preserving all pretrained weights.

What carries the argument

Hierarchical two-level routing that uses RoPE-aware chunk summaries to select a small set of tokens for exact attention while keeping the full K/V cache off the GPU.
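The two-level selection can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's procedure: the abstract gives no formula for the summaries, so the sketch uses a plain mean of keys that are assumed to be already position-encoded, and the names `route_and_attend`, `chunk`, `group`, `top_chunks`, and `top_groups` are hypothetical. The query is taken to be the newest token, so every stored key lies in its past and causality holds by construction.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_vec(vectors):
    # Hypothetical summary: plain mean of (already position-encoded) keys.
    # The paper's RoPE-aware summary is not specified in the abstract.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def top_k(starts, scores, k):
    # Return the k start offsets with the highest scores.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [starts[i] for i in order[:k]]


def softmax_attend(q, keys, values, token_ids):
    # Exact scaled dot-product attention restricted to token_ids.
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[t]) * scale for t in token_ids]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
            for d in range(dim)]


def route_and_attend(q, keys, values, chunk=16, group=4, top_chunks=2, top_groups=3):
    assert chunk % group == 0, "sketch assumes groups tile each chunk exactly"
    n = len(keys)
    # Level 1: score every chunk by its summary and keep the best few.
    chunk_starts = list(range(0, n, chunk))
    chunk_scores = [dot(q, mean_vec(keys[s:s + chunk])) for s in chunk_starts]
    kept_chunks = top_k(chunk_starts, chunk_scores, top_chunks)
    # Level 2: inside the kept chunks only, score groups and keep the best few.
    group_starts = [g for s in kept_chunks for g in range(s, min(s + chunk, n), group)]
    group_scores = [dot(q, mean_vec(keys[g:g + group])) for g in group_starts]
    kept_groups = top_k(group_starts, group_scores, top_groups)
    # Exact attention over the tokens of the kept groups.
    selected = sorted(t for g in kept_groups for t in range(g, min(g + group, n)))
    return selected, softmax_attend(q, keys, values, selected)


random.seed(0)
DIM, N = 8, 64
keys = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
values = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
query = [random.gauss(0, 1) for _ in range(DIM)]

selected, routed_out = route_and_attend(query, keys, values)
dense_out = softmax_attend(query, keys, values, list(range(N)))
print(len(selected), "of", N, "tokens fetched")
```

In a real deployment only the summaries would need to stay resident on the GPU, while the keys and values of the kept groups would be fetched from host memory. The sketch keeps everything in one list for clarity.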

If this is right

  • GPU memory consumption depends on model weights and the routed working set rather than total context length.
  • The full historical token K/V can reside in host RAM or NVMe while only a small subset moves to GPU during attention.
  • The method applies directly to existing checkpoints such as Qwen3 without any calibration or retraining.
  • Routed attention at 3% sparsity stays close enough to dense that the residual gap is attributed mainly to positional encoding.
  • The approach enables 64K-token inference on hardware where storing all token-level K/V pairs is impossible.
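The memory claims in this list can be made concrete with back-of-envelope arithmetic. The sketch below is illustrative only: the layer count, KV-head count, head dimension, and element width are placeholder values that should be checked against the actual model configuration, and reading "3% sparsity" as "3% of tokens are fetched" is an assumption rather than something the abstract defines.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # Factor 2 covers one key vector and one value vector per token,
    # per layer, per KV head.
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem


# Placeholder configuration (assumption, not taken from the paper).
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_ELEM = 48, 4, 128, 2

context = 64 * 1024                   # 64K tokens
working_set = int(context * 0.03)     # assumed meaning of "3% sparsity"

full_bytes = kv_cache_bytes(context, LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)
routed_bytes = kv_cache_bytes(working_set, LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)

print(f"full K/V cache:     {full_bytes / 2**30:.2f} GiB")
print(f"routed working set: {routed_bytes / 2**20:.1f} MiB")
```

Under these placeholder numbers the full cache grows linearly with context length, while the routed working set stays a small fraction of it, which is the dependence the first bullet describes.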

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improving the quality of the initial chunk summaries could further reduce the already small quality gap.
  • The same hierarchical selection pattern might be tested on other attention variants or model families to check generality.
  • If the routing remains robust, combining it with quantization or other compression could push context lengths even higher on the same hardware.
  • The low sparsity level suggests attention distributions in these models contain strong structure that future positional encodings might exploit directly.

Load-bearing premise

The RoPE-aware summaries and subsequent group routing reliably surface the tokens that would have received significant attention weight in the full dense computation.
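One way to probe this premise, wherever dense attention can still be computed, is to measure how much of the dense softmax mass falls on the routed tokens. The helper below is a hypothetical diagnostic, not something the paper reports; `attention_mass_recall` and the toy vectors are made up for illustration.

```python
import math


def attention_mass_recall(q, keys, selected):
    # Fraction of the dense softmax attention mass that lands on `selected`.
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    kept = sum(weights[t] for t in set(selected))
    return kept / sum(weights)


# Toy example: tokens 0 and 3 align with the query, tokens 1 and 2 do not.
q = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 1.0], [-4.0, 0.0], [3.0, 1.0]]

good = attention_mass_recall(q, keys, [0, 3])   # routing picked the aligned tokens
bad = attention_mass_recall(q, keys, [1, 2])    # routing missed them
print(f"recall with good routing: {good:.3f}")
print(f"recall with bad routing:  {bad:.3f}")
```

A recall close to 1 across heads and layers would support the premise. A low recall on some inputs would point to the failure mode described in the next section.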

What would settle it

A long-context evaluation where the routing misses tokens that carry high attention weight in the dense case and the quality gap exceeds 0.02 nats on a standard benchmark.
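The check itself is simple once per-token log-probabilities are available from both models. The sketch below assumes that the "nats" gap means the difference in mean per-token negative log-likelihood (natural log) on the same reference tokens; the abstract does not spell out the metric, and the numbers are invented for illustration.

```python
def mean_nll(logprobs):
    # Mean negative log-likelihood in nats; inputs are natural-log probabilities
    # assigned to the reference tokens.
    return -sum(logprobs) / len(logprobs)


def nats_gap(dense_logprobs, routed_logprobs):
    # Positive gap means the routed model is worse than the dense one.
    assert len(dense_logprobs) == len(routed_logprobs)
    return mean_nll(routed_logprobs) - mean_nll(dense_logprobs)


# Invented log-probabilities for three reference tokens.
dense = [-1.0, -2.0, -0.5]
routed = [-1.02, -2.01, -0.515]

gap = nats_gap(dense, routed)
THRESHOLD = 0.02  # upper end of the range quoted in the core claim
print(f"gap = {gap:.4f} nats, exceeds threshold: {gap > THRESHOLD}")
```

Such a comparison is only possible at context lengths where the dense baseline can actually be run.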

Original abstract

Hierarchical Global Attention (HGA) is a drop-in replacement for dense causal attention in pretrained long-context transformers. HGA preserves the original checkpoint parameters: the pretrained $W_Q$, $W_K$, $W_V$, and $W_O$ projections remain unchanged, no calibration parameters are introduced, and no retraining is required. Applied to Qwen3-30B-A3B-Instruct-2507-FP8 on a single RTX 5090 (32GB), the patched model runs out of the box at a 64K-token context, where token-level K/V storage is not feasible on this hardware. Unlike previous sparse-attention methods, HGA performs hierarchical two-level routing. It first retrieves relevant chunks using compact RoPE-aware summaries and then refines the selection by routing only the most relevant groups before performing exact token-level attention. This hierarchical retrieval significantly reduces the number of fetched tokens while preserving exact attention over the retrieved token set, making RAM- and NVMe-backed storage practical. The full historical token K/V resides in host RAM or NVMe storage, while only a small routed working set is transferred to GPU memory during attention. Consequently, GPU memory consumption depends primarily on model weights and the routed working set rather than on the total context length. Across all tested context lengths (4K--64K tokens), routed attention remains within approximately $0.01$--$0.02$ nats of dense attention while the sparsity used is just about 3%. These results suggest that the approximation introduced by hierarchical routing is small, and that the remaining quality gap is likely dominated by long-context positional encoding rather than by the routing algorithm itself.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Hierarchical Global Attention (HGA) as a drop-in replacement for dense causal attention in pretrained long-context transformers. It performs hierarchical two-level routing: first retrieving relevant chunks via compact RoPE-aware summaries, then refining by routing the most relevant groups before exact token-level attention. All pretrained weights (W_Q, W_K, W_V, W_O) are preserved with no new parameters or retraining. The method offloads full K/V to host RAM/NVMe, transferring only a small routed working set (~3% sparsity) to GPU, enabling 64K context on RTX 5090 hardware where full dense K/V storage is infeasible. It claims routed attention stays within 0.01--0.02 nats of dense attention across 4K--64K contexts.

Significance. If the empirical claims hold, HGA would be significant for practical long-context inference on limited hardware without retraining or quality degradation. The drop-in nature and offloading strategy address memory bottlenecks directly. The hierarchical routing reducing fetched tokens while keeping exact attention on the selected set is a clear technical contribution over prior sparse methods. However, the significance is limited by the absence of direct dense baselines at the longest lengths where the method is most needed.

major comments (2)
  1. [Abstract] Abstract: The claim that 'routed attention remains within approximately 0.01--0.02 nats of dense attention' across all tested lengths including 64K is not supported at 64K. The text states that 'token-level K/V storage is not feasible on this hardware' (RTX 5090 32GB), so no dense baseline can be run at that length. This directly undermines the central assertion that the hierarchical routing approximation is small and that any remaining gap is dominated by positional encoding rather than routing errors.
  2. [Abstract] Abstract: No equations, pseudocode, or implementation details are provided for the RoPE-aware chunk summaries or the group-level routing thresholds. Without these, it is impossible to verify whether the two-level retrieval reliably surfaces the tokens that would receive significant weight in the full dense computation, which is the load-bearing assumption for the reported quality gap remaining small at 3% sparsity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the central claims require qualification where direct dense baselines are unavailable and that additional implementation details will strengthen verifiability. We outline point-by-point revisions below.

  1. Referee: [Abstract] Abstract: The claim that 'routed attention remains within approximately 0.01--0.02 nats of dense attention' across all tested lengths including 64K is not supported at 64K. The text states that 'token-level K/V storage is not feasible on this hardware' (RTX 5090 32GB), so no dense baseline can be run at that length. This directly undermines the central assertion that the hierarchical routing approximation is small and that any remaining gap is dominated by positional encoding rather than routing errors.

    Authors: We agree that the manuscript cannot claim a measured 0.01--0.02 nats gap at 64K without a dense baseline. We will revise the abstract to state that the reported gap holds for context lengths where dense attention is computationally feasible on the hardware (explicitly listing the tested lengths up to the maximum feasible), and that at 64K the method enables inference while the approximation quality is supported by the hierarchical design validated at shorter lengths. This directly addresses the concern without overstating the evidence. revision: yes

  2. Referee: [Abstract] Abstract: No equations, pseudocode, or implementation details are provided for the RoPE-aware chunk summaries or the group-level routing thresholds. Without these, it is impossible to verify whether the two-level retrieval reliably surfaces the tokens that would receive significant weight in the full dense computation, which is the load-bearing assumption for the reported quality gap remaining small at 3% sparsity.

    Authors: The current manuscript describes the two-level routing at a high level but does not include explicit equations for the RoPE-aware summaries or pseudocode for the group-level thresholds. We will add these to the Methods section in the revision, including the summary computation formula and the routing decision procedure, to enable independent verification of the retrieval reliability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external dense-attention benchmarks with no fitted quantities or self-citation chains


The manuscript presents HGA as an algorithmic drop-in replacement whose only load-bearing claims are (a) preservation of pretrained weights with no retraining and (b) an observed 0.01–0.02 nat gap to dense attention at ~3% sparsity. No equations, ansatzes, or fitted parameters appear; the quality-gap statement is a direct empirical comparison rather than a derived prediction. No self-citations are invoked to justify uniqueness or to close any derivation loop. The 64K hardware limitation noted in the text affects the strength of evidence but does not create a definitional or self-referential reduction. The derivation chain is therefore self-contained against the external dense baseline.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5840 in / 1093 out tokens · 34005 ms · 2026-07-01T06:51:12.020017+00:00 · methodology

