GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
Pith reviewed 2026-05-19 17:17 UTC · model grok-4.3
The pith
Group-Query Latent Attention exposes two equivalent decoding paths from one set of weights for hardware-specific LLM inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GQLA modifies MLA so that one set of parameters defines two algebraically equivalent decoding routes: an MQA-absorb route identical to MLA and a GQA route that expands the cache per group. The system picks the route at runtime to match hardware, supports up to 8-way zero-redundancy tensor parallelism on the GQA route, and converts existing GQA checkpoints via TransGQLA to reach 28.125 percent of baseline KV cache size on the MQA-absorb route for models such as LLaMA-3-8B.
What carries the argument
Group-Query Latent Attention (GQLA), a parameter structure that simultaneously encodes an MQA-absorb path and a per-group GQA path while preserving algebraic equivalence between them.
If this is right
- A single trained checkpoint can reach the roofline on both H100-class and H20-class GPUs by switching paths at runtime.
- The GQA path preserves up to 8-way zero-redundancy tensor parallelism while the MQA-absorb path compresses per-token KV cache to 28.125 percent of baseline.
- Matching either hardware class requires no custom kernels and no additional training.
Where Pith is reading between the lines
- The same dual-path idea could be tested on attention variants other than MLA to broaden hardware coverage.
- Cloud schedulers might use the runtime choice to route requests across mixed GPU fleets without model duplication.
- If equivalence holds at larger scales, serving stacks could drop separate hardware-specific model versions.
Load-bearing premise
The two decoding paths remain exactly algebraically equivalent after conversion from a pretrained GQA checkpoint, so accuracy and the stated cache compression hold without retraining.
What would settle it
Run the TransGQLA conversion on a GQA checkpoint, then measure whether both the absorbed and the group-expanded paths reproduce the original model's outputs, and whether each path's KV cache matches its stated size.
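The check described above can be sketched on a toy single-head example. This is a minimal illustration of the generic absorb identity for low-rank latent attention, not the paper's TransGQLA procedure: the function names, the tiny dimensions, and the random weights are all hypothetical, and RoPE and the per-group structure are left out.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def vecmat(x, M):
    """x @ M, with M stored as a list of rows."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def matvec(M, x):
    """M @ x, with M stored as a list of rows."""
    return [dot(row, x) for row in M]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix(weights, rows):
    """Weighted sum of row vectors."""
    return [sum(w * r[j] for w, r in zip(weights, rows)) for j in range(len(rows[0]))]

def expanded_path(q, latents, W_uk, W_uv):
    # Expand every cached latent into a full key and value, then attend.
    keys = [vecmat(c, W_uk) for c in latents]
    values = [vecmat(c, W_uv) for c in latents]
    probs = softmax([dot(q, k) for k in keys])
    return mix(probs, values)

def absorbed_path(q, latents, W_uk, W_uv):
    # Fold W_uk into the query and attend over the latents directly;
    # apply W_uv once, after the weighted sum.
    q_latent = matvec(W_uk, q)
    probs = softmax([dot(q_latent, c) for c in latents])
    return vecmat(mix(probs, latents), W_uv)

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

# Hypothetical toy sizes: latent rank r, head dim d, T cached tokens.
random.seed(0)
r, d, T = 4, 8, 5
W_uk = [rand_vec(d) for _ in range(r)]  # r x d up-projection for keys
W_uv = [rand_vec(d) for _ in range(r)]  # r x d up-projection for values
latents = [rand_vec(r) for _ in range(T)]
q = rand_vec(d)

out_expanded = expanded_path(q, latents, W_uk, W_uv)
out_absorbed = absorbed_path(q, latents, W_uk, W_uv)
gap = max(abs(a - b) for a, b in zip(out_expanded, out_absorbed))
print(f"max abs difference between paths: {gap:.3e}")
```

If the equivalence is exact, the gap should sit at floating-point rounding level. A real test would apply the same comparison to the logits of the converted checkpoint against the original GQA model.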
Original abstract
Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path - an absorbed MQA form - which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware - no retraining, no custom kernels - so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.
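The roofline argument in the abstract can be made concrete with a back-of-the-envelope sketch. Everything numeric here is an assumption rather than a figure from the paper: the peak-throughput and bandwidth values are approximate public BF16 specifications, the 128-head, 512+64 latent shape is borrowed from DeepSeek-V2, and the 16-heads-per-group GQA shape is purely illustrative. The helper names are hypothetical.

```python
def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """FLOP/byte at which a device moves from memory-bound to compute-bound."""
    return peak_flops / bandwidth_bytes_per_s

def decode_intensity(heads_sharing_entry, s_q, score_dim, value_dim,
                     cached_elems, bytes_per_elem=2):
    """FLOPs done per byte of KV cache read for one cached token.

    Each head sharing the entry does one score dot product (score_dim MACs)
    and one value accumulation (value_dim MACs) per query token.
    """
    flops = 2 * (score_dim + value_dim) * heads_sharing_entry * s_q
    return flops / (cached_elems * bytes_per_elem)

# Assumed, approximate device figures (dense BF16 FLOP/s, HBM bytes/s).
H100 = ridge_point(989e12, 3.35e12)
H20 = ridge_point(148e12, 4.0e12)

# Assumed MQA-absorb shape: 128 heads share one 576-element latent entry.
absorb = decode_intensity(heads_sharing_entry=128, s_q=1,
                          score_dim=576, value_dim=512, cached_elems=576)

# Assumed GQA shape: 16 heads per group, 128-dim K plus 128-dim V per entry,
# with two query tokens per step (s_q=2, as with MTP).
grouped = decode_intensity(heads_sharing_entry=16, s_q=2,
                           score_dim=128, value_dim=128, cached_elems=256)

print(f"H100 ridge {H100:.0f}, absorb path {absorb:.0f} FLOP/byte")
print(f"H20 ridge {H20:.0f}, grouped path {grouped:.0f} FLOP/byte")
```

Under these assumed numbers the absorbed path lands at roughly 242 FLOP/byte against an H100 ridge near 295, while the grouped path with s_q=2 lands at 32 FLOP/byte against an H20 ridge near 37. The absorbed path would already be far past the H20 ridge at s_q=1, which is consistent with the abstract's remark that MTP brings no gain there.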
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Group-Query Latent Attention (GQLA), a minimal modification to Multi-head Latent Attention (MLA) used in DeepSeek-V2/V3. GQLA weights, obtained by extending TransMLA into TransGQLA to convert a pretrained GQA checkpoint, are claimed to expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path (identical to MLA) and a GQA path with per-group expanded cache. The runtime selects the path matching target hardware (H100 MQA-absorb with s_q=1 or H20 GQA+MTP with s_q=2) without retraining or custom kernels, enabling up to 8-way zero-redundancy tensor parallelism on the GQA path and compressing per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path, as demonstrated on LLaMA-3-8B.
Significance. If the algebraic equivalence after TransGQLA conversion holds exactly and preserves accuracy without fine-tuning, the result would allow a single set of weights to achieve hardware-adaptive inference that matches rooflines across H100-class and commodity GPUs while delivering substantial KV-cache compression and tensor-parallelism support. The conversion procedure from existing GQA checkpoints is a practical strength that avoids full pretraining.
major comments (2)
- [Abstract, §3] Abstract and §3 (TransGQLA conversion): the central claim that the two decoding paths remain algebraically equivalent after conversion from a pretrained GQA checkpoint, with no accuracy loss and no retraining, is asserted but not derived. No explicit mapping is shown for the latent projection or group expansion that would guarantee the MQA-absorb output equals the original GQA attention output; any deviation would invalidate both the equivalence and the 28.125% cache-compression applicability to the original model accuracy.
- [Abstract] Abstract: the 28.125% KV-cache compression figure on the MQA-absorb path is stated without error bars, ablation on the conversion step, or accuracy numbers relative to the GQA baseline. This leaves the quantitative claim unverifiable from the given text and makes it impossible to assess whether the compression is achieved while structurally preserving GQA-level traffic on the per-group path.
minor comments (2)
- [Abstract] Notation for s_q (sequence length scaling?) and the distinction between MQA-absorb and GQA paths should be defined explicitly on first use rather than assumed from MLA literature.
- The manuscript would benefit from a small table comparing KV-cache sizes, tensor-parallelism factors, and roofline utilization for the two paths on H100 vs. H20.
Simulated Author's Rebuttal
We appreciate the referee's thorough review and constructive suggestions. We address each major comment below and have revised the manuscript to provide the requested derivations and additional quantitative details.
- Referee: [Abstract, §3] Abstract and §3 (TransGQLA conversion): the central claim that the two decoding paths remain algebraically equivalent after conversion from a pretrained GQA checkpoint, with no accuracy loss and no retraining, is asserted but not derived. No explicit mapping is shown for the latent projection or group expansion that would guarantee the MQA-absorb output equals the original GQA attention output; any deviation would invalidate both the equivalence and the 28.125% cache-compression applicability to the original model accuracy.
  Authors: We agree that an explicit derivation strengthens the claim. In the revised version, we have included a formal proof in §3 that shows the TransGQLA conversion defines the latent projections such that the MQA-absorb path computes exactly the same linear combination as the original GQA attention. The group expansion is the inverse operation in the latent space, ensuring algebraic identity. This holds without approximation, so accuracy is preserved by construction, as verified empirically on the LLaMA-3-8B model.
  Revision: yes
- Referee: [Abstract] Abstract: the 28.125% KV-cache compression figure on the MQA-absorb path is stated without error bars, ablation on the conversion step, or accuracy numbers relative to the GQA baseline. This leaves the quantitative claim unverifiable from the given text and makes it impossible to assess whether the compression is achieved while structurally preserving GQA-level traffic on the per-group path.
  Authors: The compression ratio of 28.125% is exact and structural, arising from the latent KV size being 28.125% of the full GQA KV cache size. We have added error bars based on multiple evaluation runs, an ablation on the TransGQLA conversion process, and accuracy tables comparing to the GQA baseline. These revisions confirm that GQLA maintains GQA-level performance on the expanded path while achieving the compression on the absorb path.
  Revision: yes
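One way the 28.125% figure could arise structurally is sketched below. The LLaMA-3-8B shape (8 KV heads of dimension 128) is the published model configuration; the latent shape of 512 compressed dimensions plus 64 decoupled RoPE dimensions is an assumption borrowed from DeepSeek-V2, since the text here does not state the latent size. The variable names are illustrative.

```python
from fractions import Fraction

# LLaMA-3-8B GQA cache per token per layer: keys and values for 8 KV heads.
num_kv_heads = 8
head_dim = 128
gqa_elems = 2 * num_kv_heads * head_dim

# Assumed latent entry: 512 compressed dims + 64 RoPE dims (DeepSeek-V2 style).
latent_elems = 512 + 64

ratio = Fraction(latent_elems, gqa_elems)
print(f"{latent_elems} / {gqa_elems} = {ratio} = {float(ratio) * 100}%")
```

Because the ratio depends only on tensor shapes, it carries no run-to-run variance. Error bars are meaningful for the accuracy measurements, not for the compression figure itself.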
Circularity Check
No circularity: derivation rests on explicit conversion procedure
The paper presents GQLA as a minimal modification of MLA whose weights admit two algebraically equivalent decoding paths, achieved by extending the external TransMLA procedure into TransGQLA to convert a pretrained GQA checkpoint. The claimed equivalence, cache compression to 28.125% on the MQA-absorb path, and hardware-adaptive selection are direct consequences of the structural conversion and per-group expansion rules rather than any fitted parameter renamed as a prediction or a self-referential definition. No load-bearing step reduces by construction to its own inputs; the conversion is described as preserving original attention output without retraining, making the chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the latent compression in MLA can be rearranged into an algebraically equivalent GQA form without changing the computed attention scores.