super hub Canonical reference

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

James Lee-Thorp, Joshua Ainslie, Michiel de Jong, Sumit Sanghai, Yury Zemlyanskiy · 2023 · cs.CL · arXiv 2305.13245

Canonical reference. 70% of citing Pith papers cite this work as background.

127 Pith papers citing it

Background 70% of classified citations

open full Pith review browse 127 citing papers more from James Lee-Thorp arXiv PDF

abstract

Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 20 method 6 other 1

citation-polarity summary

background 19 use method 6 unclear 2

claims ledger

abstract Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show t

authors

Federico Lebr\'on James Lee-Thorp Joshua Ainslie Michiel de Jong Sumit Sanghai Yury Zemlyanskiy

co-cited works

representative citing papers

Prism Transformer: Progressive Head Schedules for Hierarchical Attention Processing

cs.LG · 2026-06-25 · unverdicted · novelty 7.0

Prism Transformer uses a progressive increase in attention heads with depth to form a local-to-global hierarchy, reporting lower validation loss and gains on zero-shot benchmarks versus uniform-head baselines at 124M-757M scales.

Grouped Query Experts: Mixture-of-Experts on GQA Self-Attention

cs.LG · 2026-06-18 · unverdicted · novelty 7.0

Grouped Query Experts applies per-group MoE routing to query heads in GQA, matching baseline accuracy while activating half the query heads on 250M models trained for 30B tokens.

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

cs.LG · 2026-06-07 · unverdicted · novelty 7.0

STAR-KV applies differentiable soft thresholding for per-head and per-block adaptive low-rank KV cache compression, combined with hybrid decomposition and low-rank-aware quantization, achieving up to 75% compression and 3.1x throughput gains.

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.

GQA-{\mu}P: The maximal parameterization update for grouped query attention

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.

Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.

Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

cs.PF · 2026-04-16 · unverdicted · novelty 7.0

RPA kernel for TPUs achieves 86% MBU in decode and 73% MFU in prefill on Llama 3 8B via tiling for ragged memory, fused pipelines, and specialized compilation for prefill/decode workloads.

A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators

cs.AR · 2026-04-09 · conditional · novelty 7.0

ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.

MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faster convergence.

Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows

cs.DC · 2026-03-12 · unverdicted · novelty 7.0

This work delivers the first measurements of performance-energy trade-offs across four multi-request LLM workflow patterns on A100 GPUs using vLLM and Parrot.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

cs.CL · 2025-10-10 · conditional · novelty 7.0

DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and achieving 1.54x speedup.

Training Agents Inside of Scalable World Models

cs.AI · 2025-09-29 · conditional · novelty 7.0

Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

cs.CV · 2025-06-10 · unverdicted · novelty 7.0

AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.

FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

cs.LG · 2025-02-03 · unverdicted · novelty 7.0

FastKV decouples prefill context reduction via Token-Selective Propagation from independent KV cache selection, delivering up to 1.82x prefill and 2.87x decoding speedups while matching decoding-only accuracy.

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

cs.LG · 2024-07-11 · accept · novelty 7.0

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

cs.LG · 2024-05-31 · unverdicted · novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

cs.CV · 2023-10-09 · unverdicted · novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.

Hierarchical Global Attention (HGA)

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

HGA uses RoPE-aware chunk summaries for two-level hierarchical routing to approximate dense causal attention at 3% sparsity with 0.01-0.02 nats quality gap, as a drop-in replacement requiring no retraining.

citing papers explorer

Showing 5 of 5 citing papers after filters.

Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning cs.AI · 2026-06-10 · unverdicted · none · ref 35 · internal anchor
Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.
ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning cs.AI · 2026-05-21 · unverdicted · none · ref 1 · internal anchor
ArborKV uses search-structure awareness to evict low-reuse KV states in Tree-of-Thoughts inference, delivering up to 4x memory savings with near-full accuracy retention.
ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 147 · internal anchor
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
Voxtral Realtime cs.AI · 2026-02-11 · unverdicted · none · ref 1 · internal anchor
Voxtral Realtime is an end-to-end trained streaming ASR model that achieves Whisper-level transcription quality at 480ms delay after scaling pretraining across 13 languages.
SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference cs.AI · 2026-02-05 · unverdicted · none · ref 1 · internal anchor
SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across multiple LLMs.

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer