hub Mixed citations

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer · 2019 · cs.NE · arXiv 1911.02150

Mixed citation behavior. Most common role is background (67%).

90 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 90 citing papers arXiv PDF

abstract

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such paralleization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 17 method 6 dataset 1

citation-polarity summary

background 16 use method 5 unclear 2 use dataset 1

claims ledger

abstract Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such paralleization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly

co-cited works

representative citing papers

Nearly Optimal Attention Coresets

cs.DS · 2026-05-07 · unverdicted · novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

Tensor Cache: Eviction-conditioned Associative Memory for Transformers

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.

Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU

cs.PF · 2026-04-16 · unverdicted · novelty 7.0

RPA kernel for TPUs achieves 86% MBU in decode and 73% MFU in prefill on Llama 3 8B via tiling for ragged memory, fused pipelines, and specialized compilation for prefill/decode workloads.

A Hormone-inspired Emotion Layer for Transformer language models (HELT)

cs.NE · 2026-04-13 · unverdicted · novelty 7.0

HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.

Fast Cross-Operator Optimization of Attention Dataflow

cs.AR · 2026-04-03 · unverdicted · novelty 7.0

MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.

Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

cs.CV · 2026-04-03 · conditional · novelty 7.0

SCOPE accelerates autoregressive video diffusion up to 4.73x by using a tri-modal cache-predict-recompute scheduler with Taylor extrapolation and selective active-frame computation while preserving output quality.

Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers

cs.LG · 2025-10-27 · unverdicted · novelty 7.0

One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.

DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning

cs.CL · 2025-10-10 · conditional · novelty 7.0

DELTA partitions layers into full, delta, and sparse groups to select salient tokens via aggregated attention scores, matching full-attention accuracy on AIME and GPQA while cutting attended tokens up to 4.25x and achieving 1.54x speedup.

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

cs.LG · 2024-07-11 · accept · novelty 7.0

FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

cs.LG · 2024-05-31 · unverdicted · novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

cs.LG · 2024-02-29 · unverdicted · novelty 7.0

Griffin hybrid model matches Llama-2 performance while trained on over 6 times fewer tokens and offers lower inference latency with higher throughput.

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

cs.LG · 2024-01-19 · conditional · novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

cs.CV · 2023-10-09 · unverdicted · novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

Accelerating Large Language Model Decoding with Speculative Sampling

cs.CL · 2023-02-02 · accept · novelty 7.0

Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.

Fast Inference from Transformers via Speculative Decoding

cs.LG · 2022-11-30 · accept · novelty 7.0

Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.

BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

cs.LG · 2026-05-28 · unverdicted · novelty 6.0

BlockBatch is a training-free framework that coordinates multiple block-size branches via token merging and synchronization to reduce denoising NFEs by 26.6% and achieve 1.33x speedup in dLLM inference.

ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

ArborKV uses search-structure awareness to evict low-reuse KV states in Tree-of-Thoughts inference, delivering up to 4x memory savings with near-full accuracy retention.

ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

cs.DC · 2026-05-16 · unverdicted · novelty 6.0

ObjectCache enables KV cache storage in object storage via layerwise retrieval and custom scheduling, adding 5.6% latency for 64K contexts over local DRAM on a 100 Gbps RoCE cluster.

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

LRCP prunes visual tokens in LVLMs by scoring projection residuals onto a PCA-estimated low-rank subspace, achieving 88.9% image token reduction with 94.7% performance retention and 87.5% video reduction with 97.8% accuracy retention.

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

Search Your Block Floating Point Scales!

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

citing papers explorer

Showing 40 of 90 citing papers.

StarCoder 2 and The Stack v2: The Next Generation cs.SE · 2024-02-29 · accept · none · ref 273 · internal anchor
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache cs.CL · 2024-02-05 · conditional · none · ref 15 · internal anchor
KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty cs.LG · 2024-01-26 · unverdicted · none · ref 69 · internal anchor
EAGLE resolves feature-level uncertainty in speculative sampling via one-step token advancement, delivering 2.7x-3.5x speedup on LLaMA2-Chat 70B and doubled throughput across multiple model families and tasks.
Gemini: A Family of Highly Capable Multimodal Models cs.CL · 2023-12-19 · conditional · none · ref 93 · internal anchor
Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 94 · internal anchor
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning cs.LG · 2023-07-17 · accept · none · ref 15 · internal anchor
FlashAttention-2 achieves roughly 2x speedup over FlashAttention by parallelizing attention across thread blocks and distributing work within blocks, reaching 50-73% of theoretical peak FLOPs/s on A100 GPUs.
Retentive Network: A Successor to Transformer for Large Language Models cs.CL · 2023-07-17 · unverdicted · none · ref 18 · internal anchor
RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cs.LG · 2023-06-24 · unverdicted · none · ref 12 · internal anchor
H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints cs.CL · 2023-05-22 · unverdicted · none · ref 57 · internal anchor
Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
PaLM: Scaling Language Modeling with Pathways cs.CL · 2022-04-05 · accept · none · ref 137 · internal anchor
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
ST-MoE: Designing Stable and Transferable Sparse Expert Models cs.CL · 2022-02-17 · unverdicted · none · ref 194 · internal anchor
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding cs.CL · 2020-06-30 · unverdicted · none · ref 88 · internal anchor
GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode cs.AR · 2026-05-28 · conditional · none · ref 23 · internal anchor
Batch-1 autoregressive decode is memory-dominated yet launch overhead caps gains from higher-bandwidth GPUs, shown by measurements and CUDA Graphs ablation across four NVIDIA GPUs.
DVSM: Decoder-only View Synthesis Model Done Right cs.CV · 2026-05-28 · unverdicted · none · ref 62 · internal anchor
Decoder-only view synthesis model using KV-cache representation and weight sharing between reconstruction and rendering networks achieves new SOTA on novel view synthesis benchmarks.
Kaczmarz Linear Attention cs.LG · 2026-05-09 · unverdicted · none · ref 32 · internal anchor
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.
How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment cs.LG · 2026-05-07 · unverdicted · none · ref 9 · internal anchor
Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.
Cubit: Token Mixer with Kernel Ridge Regression cs.LG · 2026-05-07 · unverdicted · none · ref 70 · 2 links · internal anchor
Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k cs.LG · 2026-05-04 · accept · none · ref 28 · internal anchor
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
Make Your LVLM KV Cache More Lightweight cs.CV · 2026-05-01 · unverdicted · none · ref 85 · internal anchor
LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference cs.AR · 2026-04-19 · unverdicted · none · ref 9 · internal anchor
A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention cs.DC · 2026-04-18 · unverdicted · none · ref 61 · internal anchor
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models cs.CL · 2025-12-02 · unverdicted · none · ref 12 · internal anchor
DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.
Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining cs.CL · 2025-11-26 · unverdicted · none · ref 20 · internal anchor
Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.
gpt-oss-120b & gpt-oss-20b Model Card cs.CL · 2025-08-08 · unverdicted · none · ref 13 · internal anchor
OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models cs.CL · 2024-01-11 · unverdicted · none · ref 46 · internal anchor
DeepSeekMoE 2B matches GShard 2.9B performance and approaches a dense 2B model; the 16B version matches LLaMA2-7B at 40% compute by using fine-grained expert segmentation plus shared experts.
StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 175 · internal anchor
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
A Simple Plug-in for Improving Eviction-Based KV Cache Compression cs.LG · 2026-05-22 · unverdicted · none · ref 27 · internal anchor
VECTOR augments eviction-based KV cache compression with three-way token routing that combines importance scoring and offline regression-based reconstructability estimation to improve quality at high compression ratios.
DARC-CLIP: Dynamic Adaptive Refinement with Cross-Attention for Meme Understanding cs.CL · 2026-04-25 · unverdicted · none · ref 20 · internal anchor
DARC-CLIP improves CLIP-based meme classification with hierarchical adaptive refinement, delivering +4.18 AUROC and +6.84 F1 gains in hate detection on PrideMM and CrisisHateMM benchmarks.
Gemma: Open Models Based on Gemini Research and Technology cs.CL · 2024-03-13 · accept · none · ref 96 · internal anchor
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
Yi: Open Foundation Models by 01.AI cs.CL · 2024-03-07 · unverdicted · none · ref 69 · internal anchor
Yi models are 6B and 34B open foundation models pretrained on 3.1T curated tokens that achieve strong benchmark results through data quality and targeted extensions like long context and vision alignment.
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems cs.LG · 2026-01-20 · unverdicted · none · ref 138 · internal anchor
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
Gemma 2: Improving Open Language Models at a Practical Size cs.CL · 2024-07-31 · conditional · none · ref 107 · internal anchor
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools cs.CL · 2024-06-18 · unverdicted · none · ref 35 · internal anchor
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 77 · internal anchor
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
A Survey of Hallucination in Large Foundation Models cs.AI · 2023-09-12 · accept · none · ref 41 · internal anchor
A survey classifying hallucination phenomena specific to large foundation models, establishing evaluation criteria, examining mitigation strategies, and discussing future directions.
A Comprehensive Overview of Large Language Models cs.CL · 2023-07-12 · unverdicted · none · ref 143 · internal anchor
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding cs.LG · 2026-05-14 · unreviewed · ref 1 · internal anchor
EdgeFM: Efficient Edge Inference for Vision-Language Models cs.CV · 2026-04-30 · unreviewed · ref 7 · internal anchor
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs cs.LG · 2026-04-25 · unreviewed · ref 29 · internal anchor
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference cs.LG · 2026-02-20 · unreviewed · ref 26 · 2 links · internal anchor

Fast Transformer Decoding: One Write-Head is All You Need

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer