GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Pith reviewed 2026-05-11 06:48 UTC · model grok-4.3
The pith
Uptraining multi-head attention checkpoints to grouped-query attention recovers near-original quality using only 5% of the original pre-training compute, with inference speed comparable to multi-query attention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing multi-head attention language model checkpoints can be uptrained into grouped-query attention (GQA) models using only 5% of the original pre-training compute. GQA generalizes multi-query attention by using more than one but fewer than the full number of key-value heads, with multiple query heads grouped to share each key-value head. The uptrained GQA models achieve quality close to the original multi-head attention models while providing inference speeds comparable to multi-query attention.
What carries the argument
Grouped-query attention (GQA) partitions the query heads into groups, and each group shares a single key head and value head. It is the central mechanism for balancing model capacity against inference efficiency in the uptrained models.
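The grouping described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single query position, not the paper's implementation: the function names, the nested-list tensor layout, and the contiguous assignment of query heads to groups are all assumptions made for the example.

```python
import math


def kv_head_index(query_head, num_query_heads, num_kv_heads):
    """Return the key-value head shared by the group containing query_head.

    Assumes contiguous groups: heads 0..g-1 use KV head 0, and so on.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def grouped_query_attention(queries, keys, values):
    """Attention for one query position.

    queries: [num_query_heads][head_dim]
    keys, values: [num_kv_heads][seq_len][head_dim]
    returns: [num_query_heads][head_dim]
    """
    num_q, num_kv = len(queries), len(keys)
    outputs = []
    for h, q in enumerate(queries):
        g = kv_head_index(h, num_q, num_kv)
        scale = 1.0 / math.sqrt(len(q))
        scores = [dot(q, k) * scale for k in keys[g]]
        # Numerically stable softmax over the sequence positions.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        head_dim = len(values[g][0])
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values[g]))
            for d in range(head_dim)
        ])
    return outputs


# Four query heads sharing two KV heads (group size 2).
print([kv_head_index(h, 4, 2) for h in range(4)])
```

With `num_kv_heads` equal to 1 every query head maps to the same key-value head (multi-query attention), and with `num_kv_heads` equal to `num_query_heads` each query head keeps its own (multi-head attention).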
If this is right
- Uptrained GQA models can be deployed for inference at speeds similar to MQA without retraining from scratch.
- Because uptraining needs only 5% of the original pre-training compute, converting large models is practical and cost-effective.
- GQA allows choosing the number of key-value heads as a tunable trade-off parameter between quality and speed.
- Practitioners can leverage existing multi-head checkpoints for faster models instead of training dedicated inference-optimized versions.
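To make the trade-off parameter concrete, the sketch below computes KV-cache size as a function of the number of key-value heads. The model shape (32 layers, head dimension 128, 4096 cached tokens, 2 bytes per value) is a hypothetical example chosen for round numbers, not a configuration taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading factor 2 counts both the key and the value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape, for illustration only.
config = dict(num_layers=32, head_dim=128, seq_len=4096)

for name, kv_heads in [("MHA", 32), ("GQA-8", 8), ("GQA-4", 4), ("MQA", 1)]:
    size = kv_cache_bytes(num_kv_heads=kv_heads, **config)
    print(f"{name:6s} {kv_heads:2d} KV heads: {size / 2**20:8.1f} MiB")
```

The cache size is linear in the number of key-value heads, so under these assumptions GQA-8 needs a quarter of the MHA cache and eight times the MQA cache.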
Where Pith is reading between the lines
- Similar uptraining recipes might extend to other attention modifications or model families beyond the tested transformers.
- The grouping in GQA could be made layer-specific to optimize quality-speed tradeoffs further.
- This method reduces barriers to experimenting with faster attention variants on pre-trained models.
Load-bearing premise
The 5% compute uptraining recipe is enough to restore quality close to the original multi-head model without hidden failures on particular tasks or model sizes.
What would settle it
If an uptrained GQA model shows substantially lower performance than the original multi-head model on standard language modeling benchmarks or downstream tasks, or if inference speed gains are not realized in practice, the central claim would be falsified.
Original abstract
Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.
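The abstract does not spell out how a multi-head checkpoint is converted before uptraining. The original paper describes mean-pooling the key and value projection matrices of the heads in each group, and the sketch below illustrates that step under simplifying assumptions: weights are nested Python lists, heads are grouped contiguously, and the function name and layout are hypothetical.

```python
def mean_pool_kv_heads(head_weights, num_kv_heads):
    """Average per-head projection matrices within each group.

    head_weights: list of num_heads matrices, each [rows][cols]
    returns: list of num_kv_heads matrices with the same shape
    """
    num_heads = len(head_weights)
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    group_size = num_heads // num_kv_heads
    pooled = []
    for g in range(num_kv_heads):
        group = head_weights[g * group_size:(g + 1) * group_size]
        rows, cols = len(group[0]), len(group[0][0])
        pooled.append([
            [sum(w[r][c] for w in group) / group_size for c in range(cols)]
            for r in range(rows)
        ])
    return pooled


# Four tiny 1x2 "projection matrices", one per original head.
key_heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
print(mean_pool_kv_heads(key_heads, 2))  # GQA with two key-value heads
print(mean_pool_kv_heads(key_heads, 1))  # MQA: a single key-value head
```

The same pooling would be applied to the value projections. The pooled model is then pre-trained further for a small fraction of the original compute (5% according to the abstract) so that it adapts to the new structure.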
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces grouped-query attention (GQA) as an intermediate attention mechanism between multi-head attention (MHA) and multi-query attention (MQA), along with a recipe to uptrain existing MHA language model checkpoints into GQA (or MQA) models using only 5% of the original pre-training compute. The central empirical claim is that the resulting uptrained GQA models recover quality close to the original MHA while delivering inference speed comparable to MQA.
Significance. If the empirical claims hold, the work is significant for efficient deployment of large language models: it offers a low-cost way to convert high-quality MHA checkpoints into faster-inference variants without full retraining, and GQA provides a tunable point on the quality-speed tradeoff that was previously missing between MHA and single-head MQA.
Major comments (2)
- [Abstract] The load-bearing claim that 'uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA' is not supported by any quantitative speed or latency numbers, nor by the specific GQA configuration (number of KV heads) used to achieve the reported quality. Because KV-cache size and memory-bandwidth cost scale linearly with the number of KV heads, any GQA variant that closes most of the quality gap to MHA necessarily has a larger cache than single-head MQA and cannot be assumed to deliver comparable speed in the memory-bound regime without explicit measurements.
- [Results] The manuscript must include tables or figures that jointly report quality metrics and inference throughput/latency for the exact GQA configurations (e.g., 4 or 8 KV heads) that are claimed to be 'close' to MHA quality, together with the corresponding MHA and MQA baselines. Without these paired measurements it is impossible to verify whether the speed-quality tradeoff asserted in the abstract is actually realized.
Minor comments (1)
- [Abstract] The abstract and introduction would benefit from an explicit statement of the number of KV heads used in the GQA experiments that support the main claim.
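The referee's memory-bound argument can be illustrated with a rough lower bound on per-token decode time: the bytes that must be read per step (model weights plus KV cache) divided by memory bandwidth. Every number below (14 GB of weights, 1 TB/s bandwidth, the model shape, and the batch size) is a hypothetical assumption, and the estimate ignores compute, kernel efficiency, and sharding, so it is no substitute for the measurements the report asks for.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_value=2):
    """Bytes of cached keys and values (factor 2 = keys plus values)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def decode_step_lower_bound(weight_bytes, cache_bytes, bandwidth_bytes_per_s):
    """Time to stream weights and KV cache once, assuming memory-bound decode."""
    return (weight_bytes + cache_bytes) / bandwidth_bytes_per_s


# Hypothetical hardware and model numbers, for illustration only.
WEIGHT_BYTES = 14e9
BANDWIDTH = 1e12
SHAPE = dict(num_layers=32, head_dim=128, seq_len=8192, batch_size=16)


def step_time(num_kv_heads):
    cache = kv_cache_bytes(num_kv_heads=num_kv_heads, **SHAPE)
    return decode_step_lower_bound(WEIGHT_BYTES, cache, BANDWIDTH)


baseline = step_time(1)  # MQA
for name, kv_heads in [("MHA", 32), ("GQA-8", 8), ("GQA-4", 4), ("MQA", 1)]:
    print(f"{name:6s} {step_time(kv_heads) / baseline:.2f}x the MQA step time")
```

Under this model a GQA variant always sits between MHA and MQA. How close it lands to MQA depends on how large the cache is relative to the weights, which is why the exact KV-head count and paired measurements matter.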
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for clearer quantitative support of the speed-quality claims. We will revise the manuscript to address both points by adding specific details and paired measurements.
Point-by-point responses
- Referee: [Abstract] The load-bearing claim that 'uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA' is not supported by any quantitative speed or latency numbers, nor by the specific GQA configuration (number of KV heads) used to achieve the reported quality. Because KV-cache size and memory-bandwidth cost scale linearly with the number of KV heads, any GQA variant that closes most of the quality gap to MHA necessarily has a larger cache than single-head MQA and cannot be assumed to deliver comparable speed in the memory-bound regime without explicit measurements.
Authors: We agree the abstract would benefit from greater specificity. The body of the paper specifies the GQA configurations (e.g., 8 KV heads for 32-query-head models) and reports quality recovery in the results tables. Inference speed is analyzed via KV-cache size reduction in the memory-bound regime. We will revise the abstract to name the KV-head count used for the quality claims, reference the speed analysis, and clarify that GQA delivers speeds between MHA and MQA (closer to MQA as the number of key-value heads decreases). revision: yes
- Referee: [Results] The manuscript must include tables or figures that jointly report quality metrics and inference throughput/latency for the exact GQA configurations (e.g., 4 or 8 KV heads) that are claimed to be 'close' to MHA quality, together with the corresponding MHA and MQA baselines. Without these paired measurements it is impossible to verify whether the speed-quality tradeoff asserted in the abstract is actually realized.
Authors: We acknowledge the value of paired reporting. The current results present quality metrics for GQA variants with different KV-head counts alongside a separate analysis of inference cost based on KV-cache memory bandwidth. We will add a new table or figure in the revised results section that jointly shows quality metrics and relative inference throughput (estimated from KV-cache size, with measured values where available) for MHA, GQA-8, GQA-4, and MQA baselines. revision: yes
Circularity Check
No circularity: empirical recipe validated by direct experiments
The paper proposes an uptraining procedure to convert multi-head attention checkpoints into grouped-query attention models and reports empirical quality and speed measurements. No derivation chain, first-principles equations, or predictions are present that could reduce to the inputs by construction. All load-bearing claims rest on experimental comparisons (quality metrics and inference throughput) rather than self-definitional quantities, fitted parameters renamed as predictions, or self-citation chains. The central result is therefore self-contained against external benchmarks.
Forward citations
Cited by 60 Pith papers
- Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding
  Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and ...
- LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
  LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accurac...
- GQA-μP: The maximal parameterization update for grouped query attention
  Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
- The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
  Power capping is illusory in LLM decode as memory-bound operation leaves power headroom untouched on 700 W GPUs, while SM clock locking saves up to 32% energy and three DVFS classes appear across attention types.
- Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
  Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
- Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
  3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.
- Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
  RPA kernel for TPUs achieves 86% MBU in decode and 73% MFU in prefill on Llama 3 8B via tiling for ragged memory, fused pipelines, and specialized compilation for prefill/decode workloads.
- A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators
  ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.
- MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
  MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faste...
- Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows
  This work delivers the first measurements of performance-energy trade-offs across four multi-request LLM workflow patterns on A100 GPUs using vLLM and Parrot.
- SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
  SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy ...
- DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
  DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and ach...
- Training Agents Inside of Scalable World Models
  Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
- AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
  AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
- FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration
  FastKV decouples prefill context reduction via Token-Selective Propagation from independent KV cache selection, delivering up to 1.82x prefill and 2.87x decoding speedups while matching decoding-only accuracy.
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
  FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
  Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
  DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
  A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
- ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning
  ArborKV uses search-structure awareness to evict low-reuse KV states in Tree-of-Thoughts inference, delivering up to 4x memory savings with near-full accuracy retention.
- Introspective X Training: Feedback Conditioning Improves Scaling Across all LLM Training Stages
  Introspective Training annotates data with natural-language feedback from a thinking reward model and conditions all LLM training stages on that feedback, bending scaling curves for up to 2.8x compute efficiency gains...
- A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions
  A Weibull diagnostic framework classifies transformer weight matrices into consistent functional classes via the shape parameter k and tracks training progress via the scale parameter lambda across multiple architectures.
- SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
  SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
- Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
  SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
- Search Your Block Floating Point Scales!
  ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
- Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
  Dooly reduces LLM inference profiling GPU-hours by 56.4% across 12 models while keeping simulation MAPE under 5% for TTFT and 8% for TPOT by making profiling configuration-agnostic and redundancy-aware.
- Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
  MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.
- Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
  MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...
- Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
  LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
- Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
  Nitsum dynamically adapts tensor parallelism and GPU splits in LLM serving to raise SLO-compliant goodput by up to 5.3 times over prior systems.
- ZAYA1-8B Technical Report
  ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
- WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
  WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
- QERNEL: a Scalable Large Electron Model
  QERNEL is a single conditioned neural wavefunction that variationally solves families of many-electron Hamiltonians in moiré heterobilayers and identifies the quantum liquid-crystal phase transition.
- SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
  SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...
- Are Large Language Models Economically Viable for Industry Deployment?
  Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
- Graph-Guided Adaptive Channel Elimination for KV Cache Compression
  GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
- Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
  Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.
- Nucleus-Image: Sparse MoE for Image Generation
  A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
- Quantization Dominates Rank Reduction for KV-Cache Compression
  Quantization of the KV cache beats rank reduction for matched storage budgets by 4-364 PPL, because dimension removal can flip attention token selection under softmax while bounded quantization noise usually preserves...
- IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
  IceCache combines semantic token clustering with PagedAttention to keep only 25% of the KV cache tokens while retaining 99% accuracy on LongBench and matching or beating prior offloading methods in latency.
- WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning
  WaveTune introduces a wave-aware bilinear latency predictor and wave-structured sparse sampling to enable fast runtime auto-tuning of GPU kernels, achieving up to 1.83x kernel speedup and 1.33x TTFT reduction with dra...
- Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
  Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
- DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
  DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...
- EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
  EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand f...
- Voxtral Realtime
  Voxtral Realtime is an end-to-end trained streaming ASR model that achieves Whisper-level transcription quality at 480ms delay after scaling pretraining across 13 languages.
- D-Legion: A Scalable Many-Core Architecture for Accelerating Matrix Multiplication in Quantized LLMs
  D-Legion proposes a scalable architecture of Legions containing adaptive-precision systolic array cores that accelerates quantized LLM matrix multiplications, delivering up to 8.2x lower latency and 3.8x higher memory...
- SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference
  SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across ...
- mHC: Manifold-Constrained Hyper-Connections
  mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, enabling stable large-scale training with performance gains over standard HC.
- BlossomRec: Block-level Fused Sparse Attention Mechanism for Sequential Recommendations
  BlossomRec is a sparse attention mechanism that uses two distinct block-level patterns for long-term and short-term interests, fused by a gated output, to reduce computation in sequential recommendation Transformers.
- BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
  BOOST delivers 1.46-2.27x end-to-end speedups for low-rank bottleneck LLMs by redesigning tensor parallelism around the bottleneck structure plus supporting optimizations.
- Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
  Seer improves synchronous LLM RL rollout throughput by up to 2.04x and reduces long-tail latency by 72-94% via divided rollout, context-aware scheduling, and adaptive grouped speculative decoding based on prompt simil...
- Diff4Splat: Controllable 4D Scene Generation with Latent Dynamic Reconstruction Models
  A feed-forward video latent transformer that predicts time-varying 3D Gaussian primitives from one image to produce controllable 4D scenes with appearance, geometry, and motion.
- Emu3.5: Native Multimodal Models are World Learners
  Emu3.5 is a native multimodal world model pre-trained on over 10 trillion vision-language tokens with next-token prediction, post-trained via reinforcement learning, and accelerated by Discrete Diffusion Adaptation fo...
- Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
  A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
- CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure
  CR-Net uses cross-layer low-rank residuals in a dual-path network plus specialized recomputation to outperform prior low-rank methods on 60M-7B model pre-training while using less compute and memory.
- Accelerating Prefilling via Decoding-time Contribution Sparsity
  TriangleMix exploits decoding-time contribution sparsity via a training-free static attention pattern to accelerate LLM prefilling with nearly lossless performance.
- Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
  Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
- PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention
  PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.
- MAGI-1: Autoregressive Video Generation at Scale
  MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Reference graph
Works this paper leans on
-
[1]
James Bradbury and Roy Frostig and Peter Hawkins and Matthew James Johnson and Chris Leary and Dougal Maclaurin and George Necula and Adam Paszke and Jake Vander
-
[2]
Jonathan Heek and Anselm Levskaya and Avital Oliver and Marvin Ritter and Bertrand Rondepierre and Andreas Steiner and Marc van
-
[3]
Roberts, Adam and Chung, Hyung Won and Levskaya, Anselm and Mishra, Gaurav and Bradbury, James and Andor, Daniel and Narang, Sharan and Lester, Brian and Gaffney, Colin and Mohiuddin, Afroz and Hawthorne, Curtis and Lewkowycz, Aitor and Salcianu, Alex and van Zee, Marc and Austin, Jacob and Goodman, Sebastian and Soares, Livio Baldini and Hu, Haitang and ...
-
[4]
Kingma and Jimmy Ba , editor =
Diederik P. Kingma and Jimmy Ba , editor =. Adam:. 3rd International Conference on Learning Representations,. 2015 , url =
work page 2015
-
[5]
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost , booktitle =
Noam Shazeer and Mitchell Stern , editor =. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost , booktitle =. 2018 , url =
work page 2018
-
[8]
and Zettlemoyer, Luke , title =
Joshi, Mandar and Choi, Eunsol and Weld, Daniel S. and Zettlemoyer, Luke , title =. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics , month =. 2017 , address =
work page 2017
-
[11]
Alex Wang and Amanpreet Singh and Julian Michael and Felix Hill and Omer Levy and Samuel R. Bowman , title =. 7th International Conference on Learning Representations,. 2019 , url =
work page 2019
-
[19]
Scaling Laws for Neural Language Models
Jared Kaplan and Sam McCandlish and Tom Henighan and Tom B. Brown and Benjamin Chess and Rewon Child and Scott Gray and Alec Radford and Jeffrey Wu and Dario Amodei , title =. CoRR , volume =. 2020 , url =. 2001.08361 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[21]
Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. J. Mach. Learn. Res. , volume =. 2020 , url =
work page 2020
-
[24]
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng and Xiao Liu and Zhengxiao Du and Zihan Wang and Hanyu Lai and Ming Ding and Zhuoyi Yang and Yifan Xu and Wendi Zheng and Xiao Xia and Weng Lam Tam and Zixuan Ma and Yufei Xue and Jidong Zhai and Wenguang Chen and Peng Zhang and Yuxiao Dong and Jie Tang , title =. CoRR , volume =. 2022 , url =. doi:10.48550/arXiv.2210.02414 , eprinttype =. 2210...
work page internal anchor Pith review doi:10.48550/arxiv.2210.02414 2022
-
[29]
Mahoney and Amir Gholami and Kurt Keutzer , title =
Sehoon Kim and Karttikeya Mangalam and Jitendra Malik and Michael W. Mahoney and Amir Gholami and Kurt Keutzer , title =. CoRR , volume =. 2023 , url =. doi:10.48550/arXiv.2302.07863 , eprinttype =. 2302.07863 , timestamp =
-
[33]
Memory-efficient attention , howpublished =
Markus Rabe , year =. Memory-efficient attention , howpublished =
-
[34]
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander P las, Skye Wanderman- M ilne, and Qiao Zhang. 2018. http://github.com/google/jax JAX : composable transformations of P ython+ N um P y programs
work page 2018
-
[35]
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean - Baptiste Lespiau, Laurent Sifre, and John Jumper. 2023. https://doi.org/10.48550/arXiv.2302.01318 Accelerating large language model decoding with speculative sampling . CoRR, abs/2302.01318
work page internal anchor Pith review doi:10.48550/arxiv.2302.01318 2023
-
[36]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2204.02311 2022
-
[37]
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. https://doi.org/10.18653/v1/N18-2097 A discourse-aware attention model for abstractive summarization of long documents . In Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human...
-
[38]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R \' e . 2022. https://doi.org/10.48550/arXiv.2205.14135 Flashattention: Fast and memory-efficient exact attention with io-awareness . CoRR, abs/2205.14135
work page internal anchor Pith review doi:10.48550/arxiv.2205.14135 2022
- [39]
-
[40]
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. https://doi.org/10.48550/arXiv.2208.07339 Llm.int8(): 8-bit matrix multiplication for transformers at scale . CoRR, abs/2208.07339
work page internal anchor Pith review doi:10.48550/arxiv.2208.07339 2022
-
[41]
Alexander R. Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R. Radev. 2019. https://doi.org/10.18653/v1/p19-1102 Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model . In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019...
-
[42]
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. https://doi.org/10.48550/arXiv.2210.17323 GPTQ: accurate post-training quantization for generative pre-trained transformers . CoRR, abs/2210.17323
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2210.17323 2022
-
[43]
Google. 2020. P rofile your model with cloud tpu tools. https://cloud.google.com/tpu/docs/cloud-tpu-tools. Accessed: 2022-11-11
work page 2020
-
[44]
Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. 2021. https://doi.org/10.1007/s11263-021-01453-z Knowledge distillation: A survey . Int. J. Comput. Vis., 129(6):1789--1819
-
[45]
Jonathan Heek, Anselm Levskaya, Avital Oliver, Marvin Ritter, Bertrand Rondepierre, Andreas Steiner, and Marc van Z ee. 2020. http://github.com/google/flax F lax: A neural network library and ecosystem for JAX
work page 2020
-
[46]
Distilling the Knowledge in a Neural Network
Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. http://arxiv.org/abs/1503.02531 Distilling the knowledge in a neural network . CoRR, abs/1503.02531
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[47]
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada. Association for Computational Linguistics
-
[48]
Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. 2022. https://doi.org/10.48550/ARXIV.2212.05055 Sparse upcycling: Training mixture-of-experts from dense checkpoints
-
[49]
Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2022. https://doi.org/10.48550/arXiv.2211.17192 Fast inference from transformers via speculative decoding. CoRR, abs/2211.17192
-
[50]
Gen Luo, Yiyi Zhou, Xiaoshuai Sun, Yan Wang, Liujuan Cao, Yongjian Wu, Feiyue Huang, and Rongrong Ji. 2022. https://doi.org/10.1109/TIP.2021.3139234 Towards lightweight transformer via group-wise transformation for vision-and-language tasks. IEEE Trans. Image Process., 31:3386--3398
-
[51]
Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. https://doi.org/10.18653/v1/k16-1028 Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016...
-
[52]
Jinjie Ni, Rui Mao, Zonglin Yang, Han Lei, and Erik Cambria. 2023. https://doi.org/10.18653/V1/2023.ACL-LONG.812 Finding the pillars of strength for multi-head attention. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 14526--14540. Asso...
-
[53]
Sungrae Park, Geewook Kim, Junyeop Lee, Junbum Cha, Ji-Hoon Kim, and Hwalsuk Lee. 2020. https://doi.org/10.18653/V1/2020.COLING-MAIN.607 Scale down transformer by grouping features for a lightweight character-level language model. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), D...
-
[54]
-
[55]
Markus Rabe. 2023. Memory-efficient attention. https://github.com/google/flaxformer/blob/main/flaxformer/components/attention/memory_efficient_attention.py. Accessed: 2023-05-23
-
[56]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. http://jmlr.org/papers/v21/20-074.html Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1--140:67
-
[57]
Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150
-
[58]
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. https://doi.org/10.48550/ARXIV.2302.13971 Llama: Open and efficient foundation language models
-
[59]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. https://openreview.net/forum?id=rJ4km2R5t7 GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net
-
[60]
Samuel Williams, Andrew Waterman, and David A. Patterson. 2009. https://doi.org/10.1145/1498765.1498785 Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65--76
-
[61]
Chenguang Zhu, Yang Liu, Jie Mei, and Michael Zeng. 2021. https://doi.org/10.18653/v1/2021.naacl-main.474 MediaSum: A large-scale media interview dataset for dialogue summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, Jun...