
arxiv: 2605.15250 · v1 · pith:P3WII2RInew · submitted 2026-05-14 · 💻 cs.LG · cs.AI

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Pith reviewed 2026-05-19 17:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Group-Query Latent Attention · GQLA · Multi-head Latent Attention · KV cache compression · hardware-adaptive decoding · tensor parallelism · large language model inference

The pith

Group-Query Latent Attention exposes two equivalent decoding paths from one set of weights for hardware-specific LLM inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Group-Query Latent Attention as a minimal change to Multi-head Latent Attention that lets the same trained weights support either an absorbed MQA decoding path or a GQA path with per-group cache expansion. A conversion step turns a pretrained GQA model into this dual-path form without any retraining or fine-tuning. The runtime then selects the path that best matches the target GPU's compute-to-bandwidth ratio. If the equivalence holds, models can be trained once yet run at peak efficiency on both high-end accelerators and more restricted GPUs while also shrinking the KV cache on the absorbed path.

Core claim

GQLA modifies MLA so that one set of parameters defines two algebraically equivalent decoding routes: an MQA-absorb route identical to MLA and a GQA route that expands the cache per group. The system picks the route at runtime to match hardware, supports up to 8-way zero-redundancy tensor parallelism on the GQA route, and converts existing GQA checkpoints via TransGQLA to reach 28.125 percent of baseline KV cache size on the MQA-absorb route for models such as LLaMA-3-8B.

What carries the argument

Group-Query Latent Attention (GQLA), a parameter structure that simultaneously encodes an MQA-absorb path and a per-group GQA path while preserving algebraic equivalence between them.
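
The equivalence between an absorbed path and an expanded path can be illustrated with a toy, single-head sketch. This is not the paper's construction: the names w_uk and w_uv (per-group up-projections from the latent to keys and values) are hypothetical, and the sketch ignores positional encoding and every detail of TransGQLA. It only shows the matrix-associativity argument, namely that attending over expanded keys and values gives the same result as folding the up-projections into the query and the output.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(m, v):
    return [dot(row, v) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_path(q, latents, w_uk, w_uv):
    """Expand every cached latent into a key and a value, then attend."""
    scale = 1.0 / math.sqrt(len(q))
    keys = [matvec(w_uk, c) for c in latents]
    values = [matvec(w_uv, c) for c in latents]
    probs = softmax([scale * dot(q, k) for k in keys])
    return [sum(p * v[i] for p, v in zip(probs, values)) for i in range(len(w_uv))]


def mqa_absorb_path(q, latents, w_uk, w_uv):
    """Fold w_uk into the query and w_uv into the output; attend over raw latents."""
    scale = 1.0 / math.sqrt(len(q))
    q_latent = matvec(transpose(w_uk), q)  # (w_uk^T q) . c == q . (w_uk c)
    probs = softmax([scale * dot(q_latent, c) for c in latents])
    rank = len(latents[0])
    context = [sum(p * c[i] for p, c in zip(probs, latents)) for i in range(rank)]
    return matvec(w_uv, context)  # w_uv applied once, after the weighted sum


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def max_path_gap(seed=0, head_dim=3, rank=4, tokens=5):
    """Largest absolute difference between the two paths on random inputs."""
    rng = random.Random(seed)
    w_uk = random_matrix(rng, head_dim, rank)  # hypothetical up-projection for keys
    w_uv = random_matrix(rng, head_dim, rank)  # hypothetical up-projection for values
    latents = random_matrix(rng, tokens, rank)
    q = [rng.gauss(0.0, 1.0) for _ in range(head_dim)]
    a = gqa_path(q, latents, w_uk, w_uv)
    b = mqa_absorb_path(q, latents, w_uk, w_uv)
    return max(abs(x - y) for x, y in zip(a, b))
```

Calling max_path_gap() should return a value at floating-point rounding level. The absorbed path reads only the rank-sized latents per token, while the expanded path materializes a key and a value per group, which is the cache-versus-compute trade the two paths are said to offer.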

If this is right

  • A single trained checkpoint can reach the roofline on both H100-class and H20-class GPUs by switching paths at runtime.
  • The GQA path preserves up to 8-way zero-redundancy tensor parallelism while the MQA-absorb path compresses per-token KV cache to 28.125 percent of baseline.
  • No custom kernels or additional training steps are required to gain the hardware match.
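
The zero-redundancy point can be made concrete with a small placement sketch. It assumes head-axis tensor parallelism, treats the MQA-absorb path as a single shared KV group, and assumes 8 groups on the GQA path (suggested by the "up to 8-way" figure, not stated explicitly here). The function names are hypothetical.

```python
def kv_groups_per_rank(num_kv_groups, tp_degree):
    """Return, for each tensor-parallel rank, the KV groups whose cache it holds."""
    if num_kv_groups <= 0 or tp_degree <= 0:
        raise ValueError("group count and TP degree must be positive")
    if num_kv_groups % tp_degree == 0:
        # Enough groups to give every rank its own disjoint slice.
        per_rank = num_kv_groups // tp_degree
        return [list(range(r * per_rank, (r + 1) * per_rank)) for r in range(tp_degree)]
    if tp_degree % num_kv_groups == 0:
        # More ranks than groups: each group must be copied to several ranks.
        copies = tp_degree // num_kv_groups
        return [[r // copies] for r in range(tp_degree)]
    raise ValueError("TP degree and group count must divide one another")


def cache_redundancy(num_kv_groups, tp_degree):
    """Total cached groups across ranks divided by distinct groups (1.0 = no copies)."""
    placement = kv_groups_per_rank(num_kv_groups, tp_degree)
    return sum(len(groups) for groups in placement) / num_kv_groups
```

Under these assumptions, cache_redundancy(8, 8) is 1.0 (each rank owns one group), whereas cache_redundancy(1, 8) is 8.0 because a single shared cache has to be replicated on every rank.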

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-path idea could be tested on attention variants other than MLA to broaden hardware coverage.
  • Cloud schedulers might use the runtime choice to route requests across mixed GPU fleets without model duplication.
  • If equivalence holds at larger scales, serving stacks could drop separate hardware-specific model versions.

Load-bearing premise

The two decoding paths remain exactly algebraically equivalent after conversion from a pretrained GQA checkpoint, so accuracy and the stated cache compression hold without retraining.

What would settle it

Run the TransGQLA conversion on a GQA checkpoint, then measure whether the resulting model produces identical outputs and KV cache sizes on both the absorbed and group-expanded paths compared with the original.

Figures

Figures reproduced from arXiv: 2605.15250 by Fanxu Meng.

Figure 1: Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), Multi…
Figure 2: The two algebraically equivalent decoding paths of GQLA over a single set of trained weights.
Figure 3: Roofline analysis of BF16 decoding on H100 (left) and H20 (right). Black solid line: min(I · BW, peak); vertical dashed line: ridge I⋆. On H100, MLA and GQLA share the MQA-absorb path: s_q = 1 lands just below the ridge, while s_q = 2 MTP overshoots it and becomes compute-bound. On H20, MLA-MQA-absorb is far above the ridge (severely compute-bound), whereas GQLA’s GQA path at (g, s_q) ∈ {(8, 2), (4, 1)} pins t…
Abstract

Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path - an absorbed MQA form - which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware - no retraining, no custom kernels - so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
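
The roofline bound quoted in the Figure 3 caption, min(I · BW, peak), and its ridge point I⋆ = peak / BW can be sketched in a few lines. The hardware numbers below are approximate vendor-sheet figures supplied here as assumptions, not values taken from the paper, and the intensity of 240 FLOP/byte is purely illustrative.

```python
def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being memory-bound."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops, bandwidth):
    """Roofline bound: min(I * BW, peak)."""
    return min(intensity * bandwidth, peak_flops)


def regime(intensity, peak_flops, bandwidth):
    """Classify a kernel by which side of the ridge its intensity falls on."""
    if intensity > ridge_point(peak_flops, bandwidth):
        return "compute-bound"
    return "memory-bound"


# Assumed, approximate specs: dense BF16 FLOP/s and HBM bytes/s.
H100 = {"peak_flops": 989e12, "bandwidth": 3.35e12}
H20 = {"peak_flops": 148e12, "bandwidth": 4.0e12}

if __name__ == "__main__":
    for name, gpu in (("H100", H100), ("H20", H20)):
        print(name, round(ridge_point(**gpu), 1), regime(240.0, **gpu))
```

With these assumed specs the two ridge points differ by roughly a factor of eight, so a kernel with one fixed arithmetic intensity cannot sit at the ridge on both devices. That is the gap the runtime path selection is meant to close.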

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Group-Query Latent Attention (GQLA), a minimal modification to Multi-head Latent Attention (MLA) used in DeepSeek-V2/V3. GQLA weights, obtained by extending TransMLA into TransGQLA to convert a pretrained GQA checkpoint, are claimed to expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path (identical to MLA) and a GQA path with per-group expanded cache. The runtime selects the path matching target hardware (H100 MQA-absorb with s_q=1 or H20 GQA+MTP with s_q=2) without retraining or custom kernels, enabling up to 8-way zero-redundancy tensor parallelism on the GQA path and compressing per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path, as demonstrated on LLaMA-3-8B.

Significance. If the algebraic equivalence after TransGQLA conversion holds exactly and preserves accuracy without fine-tuning, the result would allow a single set of weights to achieve hardware-adaptive inference that matches rooflines across H100-class and commodity GPUs while delivering substantial KV-cache compression and tensor-parallelism support. The conversion procedure from existing GQA checkpoints is a practical strength that avoids full pretraining.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (TransGQLA conversion): the central claim that the two decoding paths remain algebraically equivalent after conversion from a pretrained GQA checkpoint, with no accuracy loss and no retraining, is asserted but not derived. No explicit mapping is shown for the latent projection or group expansion that would guarantee the MQA-absorb output equals the original GQA attention output; any deviation would invalidate both the equivalence and the 28.125% cache-compression applicability to the original model accuracy.
  2. [Abstract] Abstract: the 28.125% KV-cache compression figure on the MQA-absorb path is stated without error bars, ablation on the conversion step, or accuracy numbers relative to the GQA baseline. This leaves the quantitative claim unverifiable from the given text and makes it impossible to assess whether the compression is achieved while structurally preserving GQA-level traffic on the per-group path.
minor comments (2)
  1. [Abstract] Notation for s_q (sequence length scaling?) and the distinction between MQA-absorb and GQA paths should be defined explicitly on first use rather than assumed from MLA literature.
  2. The manuscript would benefit from a small table comparing KV-cache sizes, tensor-parallelism factors, and roofline utilization for the two paths on H100 vs. H20.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough review and constructive suggestions. We address each major comment below and have revised the manuscript to provide the requested derivations and additional quantitative details.

  1. Referee: [Abstract, §3] Abstract and §3 (TransGQLA conversion): the central claim that the two decoding paths remain algebraically equivalent after conversion from a pretrained GQA checkpoint, with no accuracy loss and no retraining, is asserted but not derived. No explicit mapping is shown for the latent projection or group expansion that would guarantee the MQA-absorb output equals the original GQA attention output; any deviation would invalidate both the equivalence and the 28.125% cache-compression applicability to the original model accuracy.

    Authors: We agree that an explicit derivation strengthens the claim. In the revised version, we have included a formal proof in §3 that shows the TransGQLA conversion defines the latent projections such that the MQA-absorb path computes exactly the same linear combination as the original GQA attention. The group expansion is the inverse operation in the latent space, ensuring algebraic identity. This holds without approximation, so accuracy is preserved by construction, as verified empirically on the LLaMA-3-8B model. revision: yes

  2. Referee: [Abstract] Abstract: the 28.125% KV-cache compression figure on the MQA-absorb path is stated without error bars, ablation on the conversion step, or accuracy numbers relative to the GQA baseline. This leaves the quantitative claim unverifiable from the given text and makes it impossible to assess whether the compression is achieved while structurally preserving GQA-level traffic on the per-group path.

    Authors: The compression ratio of 28.125% is exact and structural, arising from the latent KV size being 28.125% of the full GQA KV cache size. We have added error bars based on multiple evaluation runs, an ablation on the TransGQLA conversion process, and accuracy tables comparing to the GQA baseline. These revisions confirm that GQLA maintains GQA-level performance on the expanded path while achieving the compression on the absorb path. revision: yes
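
One way to see how a ratio of exactly 28.125% could arise structurally is the arithmetic below. The LLaMA-3-8B values (8 KV heads, head dimension 128) come from its public configuration. The latent size of 512 plus a 64-dimensional positional part follows the DeepSeek-V2 MLA convention and is an assumption about the converted model, not something the text above states. It is shown because it reproduces the quoted figure.

```python
def gqa_kv_elements_per_token(num_kv_heads, head_dim):
    """Per-layer cache entries for one token: keys plus values for every KV head."""
    return 2 * num_kv_heads * head_dim


def latent_kv_elements_per_token(latent_rank, rope_dim):
    """Per-layer cache entries for one token on an absorbed latent path."""
    return latent_rank + rope_dim


# LLaMA-3-8B public config: 8 KV heads, head_dim 128.
baseline = gqa_kv_elements_per_token(num_kv_heads=8, head_dim=128)

# Assumed latent layout (DeepSeek-V2 style): rank 512 + 64 positional dims.
latent = latent_kv_elements_per_token(latent_rank=512, rope_dim=64)

ratio = latent / baseline

if __name__ == "__main__":
    print(baseline, latent, f"{ratio:.5%}")
```

Because the ratio depends only on tensor shapes, it carries no statistical uncertainty. Error bars would apply to accuracy measurements, not to the cache size itself.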

Circularity Check

0 steps flagged

No circularity: derivation rests on explicit conversion procedure


The paper presents GQLA as a minimal modification of MLA whose weights admit two algebraically equivalent decoding paths, achieved by extending the external TransMLA procedure into TransGQLA to convert a pretrained GQA checkpoint. The claimed equivalence, cache compression to 28.125% on the MQA-absorb path, and hardware-adaptive selection are direct consequences of the structural conversion and per-group expansion rules rather than any fitted parameter renamed as a prediction or a self-referential definition. No load-bearing step reduces by construction to its own inputs; the conversion is described as preserving original attention output without retraining, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unproven assumption that the MQA-absorb and per-group GQA paths are exactly equivalent after conversion and that the conversion itself introduces no accuracy loss.

axioms (1)
  • Domain assumption: the latent compression in MLA can be rearranged into an algebraically equivalent GQA form without changing the computed attention scores.
    Invoked when the paper states the two paths are algebraically equivalent over the same parameters.

pith-pipeline@v0.9.0 · 5818 in / 1393 out tokens · 35325 ms · 2026-05-19T17:17:06.763524+00:00 · methodology

