pith. machine review for the scientific record.

arxiv: 1911.02150 · v1 · submitted 2019-11-06 · 💻 cs.NE · cs.CL · cs.LG

Recognition: no theorem link

Fast Transformer Decoding: One Write-Head is All You Need

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 23:45 UTC · model grok-4.3

classification 💻 cs.NE · cs.CL · cs.LG

keywords multi-query attention · transformer · decoding · incremental inference · attention mechanism · memory bandwidth · neural sequence models

The pith

Multi-query attention shares keys and values across heads to speed up Transformer decoding with little quality loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers rely on multi-head attention to move information across sequences, but incremental decoding is slowed by the repeated loading of large key and value tensors, one set per head. The paper introduces multi-query attention, a direct modification that reuses one set of keys and values for every head. This shrinks the tensors that must be read from memory at each generation step, cutting bandwidth cost. Experiments show the change produces substantially faster decoding while quality on standard tasks falls only slightly.

Core claim

The central claim is that tying the key and value projections together across all attention heads produces a multi-query attention layer whose key-value cache is far smaller than in standard multi-head attention. Because incremental decoding must repeatedly read this cache, the reduced size lowers memory bandwidth and therefore raises generation speed. The resulting models retain most of their original quality on the tasks and sizes tested.

What carries the argument

Multi-query attention, the variant in which a single key projection and a single value projection serve all query heads instead of separate projections per head.
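As a concrete sketch of this variant (an editorial illustration, not the paper's code; shapes and names are assumed), one incremental decoding step in NumPy looks like:

```python
import numpy as np

def mqa_decode_step(q, k_cache, v_cache, new_k, new_v):
    """One generation step with multi-query attention.

    q:                 (h, d)  one query per head for the current token
    k_cache, v_cache:  (t, d)  ONE shared key/value per past position
    new_k, new_v:      (d,)    projections of the current token
    """
    k_cache = np.vstack([k_cache, new_k])  # (t+1, d) -- cache has no head axis
    v_cache = np.vstack([v_cache, new_v])  # (t+1, d)
    d = q.shape[-1]
    # Every head attends over the SAME keys/values; only the queries differ.
    logits = q @ k_cache.T / np.sqrt(d)    # (h, t+1)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)          # softmax over positions
    return w @ v_cache, k_cache, v_cache   # (h, d) per-head outputs
```

In standard multi-head attention the caches would instead be shaped (h, t, d), so the tensors re-read at every step are h times larger.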

If this is right

  • Decoding speed rises because the key-value cache read at each step is much smaller.
  • The same model size now fits within tighter memory-bandwidth limits.
  • Quality remains close to the multi-head baseline across the tested model scales and tasks.
  • Training stays unchanged because the modification affects only the inference path.
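The first two bullets reduce to back-of-envelope arithmetic; with assumed example sizes (16 heads of dimension 64, fp16 elements), the per-token cache footprint shrinks by the head count:

```python
# Per-token KV-cache footprint, multi-head vs multi-query.
# Assumed example sizes: 16 heads, d_head = 64, 2-byte (fp16) elements.
h, d_head, bytes_per_elem = 16, 64, 2
mha_bytes = 2 * h * d_head * bytes_per_elem  # one K and one V per head
mqa_bytes = 2 * d_head * bytes_per_elem      # one shared K and one shared V
print(mha_bytes, mqa_bytes, mha_bytes // mqa_bytes)  # 4096 256 16
```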

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sharing idea could be applied to any autoregressive sequence model that uses multi-head attention.
  • It opens the possibility of running larger models in real time on existing hardware.
  • Future variants might use a small number of shared key-value groups rather than a single group to trade speed for capacity.
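The last bullet's grouped middle ground (later explored in the literature as grouped-query attention) can be sketched as follows; this is an editorial extrapolation, not something the paper implements, with g = 1 recovering multi-query and g = h recovering multi-head attention:

```python
import numpy as np

def grouped_attention(q, k, v, g):
    """q: (h, d); k, v: (g, t, d); each of g KV groups serves h//g query heads."""
    h, d = q.shape
    heads_per_group = h // g
    outs = []
    for i in range(g):
        qs = q[i * heads_per_group:(i + 1) * heads_per_group]  # (h/g, d)
        logits = qs @ k[i].T / np.sqrt(d)                      # (h/g, t)
        w = np.exp(logits - logits.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                          # softmax
        outs.append(w @ v[i])                                  # (h/g, d)
    return np.vstack(outs)                                     # (h, d)
```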

Load-bearing premise

Sharing one key-value pair across all heads still leaves the model enough capacity to learn the distinct attention patterns it needs.

What would settle it

Measure wall-clock decoding latency and task accuracy on a held-out benchmark after training an otherwise identical Transformer once with full multi-head attention and once with multi-query attention.
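Short of running that experiment, the bandwidth side of the argument can be checked analytically. A sketch with assumed toy numbers, counting the bytes of KV cache re-read over an entire decode:

```python
# Analytic sketch of total KV-cache traffic over a decode of n tokens.
# Assumed toy sizes: 16 heads, d_head = 64, fp16 (2-byte) elements.
def cache_bytes_read(seq_len, kv_heads, d_head, bytes_per_elem=2):
    # At step t the decoder re-reads all t cached K and V vectors.
    return sum(2 * kv_heads * t * d_head * bytes_per_elem
               for t in range(1, seq_len + 1))

h, d_head, n = 16, 64, 1024
mha_traffic = cache_bytes_read(n, h, d_head)   # per-head K/V
mqa_traffic = cache_bytes_read(n, 1, d_head)   # one shared K/V
print(mha_traffic / mqa_traffic)               # 16.0 -- h-fold less traffic
```

This only models cache reads; wall-clock gains depend on how dominant that term is on the target hardware, which is why the measurement above is what would settle it.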

read the original abstract

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes multi-query attention, a variant of the standard multi-head attention used in Transformers. In this design, the key and value projections are shared across all heads (reducing the KV cache size by a factor equal to the number of heads) while queries remain head-specific. The central claim is that this change substantially reduces memory-bandwidth costs during incremental/autoregressive decoding, yielding much faster inference with only minor quality degradation relative to full multi-head attention. The authors support the claim with experiments on WMT translation and language-modeling benchmarks using models up to a few hundred million parameters.

Significance. If the reported speed/quality trade-off holds under broader conditions, the result is practically significant for efficient deployment of large Transformers. The modification is minimal, requires no changes to the training algorithm, and directly attacks the KV-cache bandwidth bottleneck that dominates incremental decoding. By providing concrete measurements of both latency and task metrics on standard benchmarks, the work offers a falsifiable, immediately usable technique that scales with model size and sequence length.

minor comments (3)
  1. [Abstract] The statement that models 'can indeed be much faster to decode' and incur 'only minor quality degradation' is not quantified. Adding one sentence with concrete factors (e.g., '2-4x faster decoding with <0.5 BLEU drop on WMT En-De') would make the high-level claim self-contained.
  2. [§3, Method] The dimension reduction for the shared K and V tensors is described in prose but would benefit from an explicit equation or diagram showing the new shapes relative to standard multi-head attention (e.g., K, V ∈ ℝ^{T×d_head} instead of ℝ^{T×h×d_head}).
  3. [§4, Experiments] While the speed/quality results are reported, the text should explicitly state whether all compared models were trained with identical hyper-parameters and total parameter budgets, and whether statistical significance or multiple random seeds were used for the quality metrics.
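The shape reduction that comment 2 asks to make explicit can be written out directly (tensor names and example sizes assumed; the shared tensors carry the per-head dimension d_head, not d_model):

```python
import numpy as np

T, h, d_head = 128, 8, 64            # assumed example sizes
K_mha = np.zeros((T, h, d_head))     # multi-head: one key per head per position
K_mqa = np.zeros((T, d_head))        # multi-query: one shared key per position
assert K_mha.size == h * K_mqa.size  # the cache shrinks by the head count
```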

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the supportive summary and recommendation of minor revision. The report contains no specific major comments to address point-by-point. We remain available to incorporate any editorial or minor clarifications the editor may request in a revised version.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes multi-query attention as a direct architectural modification to standard multi-head attention by sharing keys and values across heads. This change is introduced explicitly to address memory bandwidth in incremental decoding and is validated through separate experiments on WMT translation and language-modeling tasks. No derivation chain exists that reduces a claimed result to its own inputs via self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on the proposal itself plus external empirical measurements rather than any closed logical loop or ansatz smuggled through prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal builds on the standard Transformer attention equations and assumes that head-specific keys/values are not strictly necessary for performance.

axioms (1)
  • standard math Standard scaled dot-product attention and multi-head concatenation formulas from the original Transformer
    The paper modifies the existing multi-head attention structure without re-deriving its foundations.
invented entities (1)
  • multi-query attention no independent evidence
    purpose: Shared keys and values across heads to reduce KV cache size during decoding
    New architectural choice introduced to address memory bandwidth; no independent evidence outside the paper's experiments.

pith-pipeline@v0.9.0 · 5422 in / 1102 out tokens · 25161 ms · 2026-05-10T23:45:35.531104+00:00 · methodology

discussion (0)


Forward citations

Cited by 59 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Nearly Optimal Attention Coresets

    cs.DS 2026-05 unverdicted novelty 8.0

    ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

  2. Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

    cs.PF 2026-04 unverdicted novelty 7.0

    RPA kernel for TPUs achieves 86% MBU in decode and 73% MFU in prefill on Llama 3 8B via tiling for ragged memory, fused pipelines, and specialized compilation for prefill/decode workloads.

  3. A Hormone-inspired Emotion Layer for Transformer language models (HELT)

    cs.NE 2026-04 unverdicted novelty 7.0

    HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.

  4. Fast Cross-Operator Optimization of Attention Dataflow

    cs.AR 2026-04 unverdicted novelty 7.0

    MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.

  5. Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

    cs.CV 2026-04 conditional novelty 7.0

    SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.

  6. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  7. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    cs.CL 2024-05 unverdicted novelty 7.0

    DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

  8. Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    cs.LG 2024-02 unverdicted novelty 7.0

    Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.

  9. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  10. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  11. Accelerating Large Language Model Decoding with Speculative Sampling

    cs.CL 2023-02 accept novelty 7.0

    Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.

  12. When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

    cs.AI 2026-05 unverdicted novelty 6.0

    Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

  13. Search Your Block Floating Point Scales!

    cs.LG 2026-05 unverdicted novelty 6.0

    ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

  14. Nectar: Neural Estimation of Cached-Token Attention via Regression

    cs.LG 2026-05 unverdicted novelty 6.0

    Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.

  15. Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...

  16. Revisiting Transformer Layer Parameterization Through Causal Energy Minimization

    cs.LG 2026-05 unverdicted novelty 6.0

    CEM recasts Transformer layers as energy minimization steps, enabling constrained parameterizations like weight sharing and low-rank interactions that match standard baselines in 100M-scale language modeling.

  17. Cubit: Token Mixer with Kernel Ridge Regression

    cs.LG 2026-05 unverdicted novelty 6.0

    Cubit replaces Transformer attention with Kernel Ridge Regression token mixing and shows potential gains on longer sequences.

  18. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  19. The Impossibility Triangle of Long-Context Modeling

    cs.CL 2026-05 unverdicted novelty 6.0

    No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

  20. WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.

  21. Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

    cs.LG 2026-04 unverdicted novelty 6.0

    CuTile delivers high performance on select AI workloads and GPUs but varies significantly by architecture and is less portable than Triton across tested platforms.

  22. Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

    cs.LG 2026-04 unverdicted novelty 6.0

    Sub-token routing in LoRA-adapted transformers adds a finer compression axis for KV caches, with query-independent and query-aware designs that improve efficiency under reduced budgets when combined with token-level s...

  23. Graph-Guided Adaptive Channel Elimination for KV Cache Compression

    eess.SP 2026-04 unverdicted novelty 6.0

    GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.

  24. Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

    cs.LG 2026-04 unverdicted novelty 6.0

    Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.

  25. The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.

  26. Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC

    cs.DC 2026-04 unverdicted novelty 6.0

    Blink enables CPU-free LLM inference via SmartNIC offload and persistent GPU kernel, delivering up to 8.47x lower P99 TTFT, 3.4x lower P99 TPOT, 2.1x higher decode throughput, and 48.6% lower energy per token while re...

  27. TRAPTI: Time-Resolved Analysis for SRAM Banking and Power Gating Optimization in Embedded Transformer Inference

    cs.AR 2026-04 unverdicted novelty 6.0

    TRAPTI delivers cycle-accurate memory occupancy traces to guide SRAM banking and power-gating choices, showing a 2.72x lower peak memory footprint for a GQA model versus MHA under identical accelerator settings.

  28. ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache

    cs.DC 2026-04 unverdicted novelty 6.0

    ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.

  29. Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

    cs.CL 2026-04 conditional novelty 6.0

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  30. CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference

    cs.LG 2026-03 unverdicted novelty 6.0

    CSAttention precomputes fixed-size query-centric lookup tables in offline prefill to enable fast table-lookup decoding, delivering near-identical accuracy to full attention and up to 4.6x speedup at 95% sparsity for 3...

  31. EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

    cs.CL 2026-03 unverdicted novelty 6.0

    EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand f...

  32. Structural Sensitivity in Compressed Transformers: Relative Error Propagation and Layer Removal

    cs.LG 2026-03 unverdicted novelty 6.0 partial

    Per-layer error amplification factor rho predicts representation drift in compressed transformers and guides superior pruning and layer-removal decisions compared to prior heuristics.

  33. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  34. StarCoder 2 and The Stack v2: The Next Generation

    cs.SE 2024-02 accept novelty 6.0

    StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

  35. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    cs.CL 2024-02 conditional novelty 6.0

    KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.

  36. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

    cs.LG 2024-01 unverdicted novelty 6.0

    EAGLE resolves feature-level uncertainty in speculative sampling via one-step token advancement, delivering 2.7x-3.5x speedup on LLaMA2-Chat 70B and doubled throughput across multiple model families and tasks.

  37. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    cs.LG 2023-07 accept novelty 6.0

    FlashAttention-2 achieves roughly 2x speedup over FlashAttention by parallelizing attention across thread blocks and distributing work within blocks, reaching 50-73% of theoretical peak FLOPs/s on A100 GPUs.

  38. Retentive Network: A Successor to Transformer for Large Language Models

    cs.CL 2023-07 unverdicted novelty 6.0

    RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.

  39. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    cs.CL 2023-05 unverdicted novelty 6.0

    Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.

  40. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  41. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  42. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    cs.CL 2020-06 unverdicted novelty 6.0

    GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.

  43. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  44. How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment

    cs.LG 2026-05 unverdicted novelty 5.0

    Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.

  45. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  46. Make Your LVLM KV Cache More Lightweight

    cs.CV 2026-05 unverdicted novelty 5.0

    LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.

  47. EdgeFM: Efficient Edge Inference for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...

  48. Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

    cs.AR 2026-04 unverdicted novelty 5.0

    A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.

  49. HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

    cs.DC 2026-04 unverdicted novelty 5.0

    HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

  50. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    cs.CL 2025-12 unverdicted novelty 5.0

    DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.

  51. gpt-oss-120b & gpt-oss-20b Model Card

    cs.CL 2025-08 unverdicted novelty 5.0

    OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.

  52. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

    cs.CL 2024-01 unverdicted novelty 5.0

    DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.

  53. StarCoder: may the source be with you!

    cs.CL 2023-05 accept novelty 5.0

    StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

  54. DARC-CLIP: Dynamic Adaptive Refinement with Cross-Attention for Meme Understanding

    cs.CL 2026-04 unverdicted novelty 4.0

    DARC-CLIP improves CLIP-based meme classification with hierarchical adaptive refinement, delivering +4.18 AUROC and +6.84 F1 gains in hate detection on PrideMM and CrisisHateMM benchmarks.

  55. Gemma: Open Models Based on Gemini Research and Technology

    cs.CL 2024-03 accept novelty 4.0

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  56. Yi: Open Foundation Models by 01.AI

    cs.CL 2024-03 unverdicted novelty 4.0

    Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.

  57. Gemma 2: Improving Open Language Models at a Practical Size

    cs.CL 2024-07 conditional novelty 3.0

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

  58. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    cs.CL 2024-06 unverdicted novelty 3.0

    GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.

  59. A Survey on Efficient Inference for Large Language Models

    cs.CL 2024-04 accept novelty 3.0

    The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 59 Pith papers

  1. [1]

    Neural machine translation by jointly learning to align and translate, 2014

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2014

  2. [2]

    One billion word benchmark for measuring progress in statistical language modeling

    Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, and Phillipp Koehn. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013. URL http://arxiv.org/abs/1312.3005

  3. [3]

    Generating wikipedia by summarizing long sequences

    Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. In Proceedings of the International Conference on Learning Representations, 2018

  4. [4]

    A time-restricted self-attention layer for ASR

    Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur. A time-restricted self-attention layer for ASR. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018

  5. [5]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017

  6. [6]

    Accelerating neural transformer via an average attention network, 2018

    Biao Zhang, Deyi Xiong, and Jinsong Su. Accelerating neural transformer via an average attention network, 2018
