pith. sign in

super hub Canonical reference

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Canonical reference. 70% of citing Pith papers cite this work as background.

114 Pith papers citing it
Background 70% of classified citations
abstract

Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.

hub tools

citation-role summary

background 20 method 6 other 1

citation-polarity summary

claims ledger

  • abstract Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show t

authors

co-cited works

clear filters

representative citing papers

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

Training Agents Inside of Scalable World Models

cs.AI · 2025-09-29 · conditional · novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

Hierarchical Global Attention (HGA)

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

HGA uses RoPE-aware chunk summaries for two-level hierarchical routing to approximate dense causal attention at 3% sparsity with 0.01-0.02 nats quality gap, as a drop-in replacement requiring no retraining.

citing papers explorer

Showing 2 of 2 citing papers after filters.

  • Emu3: Next-Token Prediction is All You Need cs.CV · 2024-09-27 · unverdicted · none · ref 1 · internal anchor

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  • Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 1 · internal anchor

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.