
arxiv: 2407.02490 · v2 · pith:S6C3UL2Enew · submitted 2024-07-02 · 💻 cs.CL · cs.LG

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

classification 💻 cs.CL cs.LG
keywords sparse · attention · pre-filling · inference · llms · long-context · minference · pattern

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention computation, it takes 30 minutes for an 8B LLM to process a prompt of 1M tokens (i.e., the pre-filling stage) on a single A100 GPU. Existing methods for speeding up pre-filling often fail to maintain acceptable accuracy or efficiency when applied to long-context LLMs. To address this gap, we introduce MInference (Milliontokens Inference), a sparse calculation method designed to accelerate the pre-filling stage of long-sequence processing. Specifically, we identify three unique patterns in long-context attention matrices (the A-shape, Vertical-Slash, and Block-Sparse) that can be leveraged for efficient sparse computation on GPUs. We determine the optimal pattern for each attention head offline and dynamically build sparse indices based on the assigned pattern during inference. With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs. Our proposed technique can be directly applied to existing LLMs without any modifications to the pre-training setup or additional fine-tuning. By evaluating on a wide range of downstream tasks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, and models including LLaMA-3-1M, GLM4-1M, Yi-200K, Phi-3-128K, and Qwen2-128K, we demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy. Our code is available at https://aka.ms/MInference.
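The three patterns named in the abstract can be pictured as boolean masks over the causal attention matrix. The following is a minimal illustrative sketch in pure Python, not the authors' implementation: the function names, the parameters (`n_init`, `window`, `vertical_cols`, `slash_offsets`, `kept_blocks`), and the reading of each pattern (A-shape as initial tokens plus a local window, Vertical-Slash as selected key columns plus selected diagonals, Block-Sparse as selected tiles) are assumptions made here for illustration. In MInference the indices are built dynamically per head during inference and consumed by optimized GPU kernels; this sketch simply takes them as inputs.

```python
# Illustrative sketch only: dense boolean masks for three sparse attention
# patterns. All names and parameters are hypothetical, chosen for this example.

def a_shape_mask(n, n_init, window):
    """Keep the first n_init keys plus a local window of the most recent keys."""
    return [[k <= q and (k < n_init or q - k < window) for k in range(n)]
            for q in range(n)]

def vertical_slash_mask(n, vertical_cols, slash_offsets):
    """Keep selected key columns (vertical lines) and selected diagonals
    (slash lines), where a diagonal is identified by the offset q - k."""
    cols, offs = set(vertical_cols), set(slash_offsets)
    return [[k <= q and (k in cols or (q - k) in offs) for k in range(n)]
            for q in range(n)]

def block_sparse_mask(n, block, kept_blocks):
    """Keep selected (query_block, key_block) tiles, clipped to the causal region."""
    kept = set(kept_blocks)
    return [[k <= q and (q // block, k // block) in kept for k in range(n)]
            for q in range(n)]

def density(mask):
    """Fraction of causal (k <= q) entries that the mask keeps."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * (n + 1) // 2)

if __name__ == "__main__":
    n = 8
    masks = {
        "A-shape": a_shape_mask(n, n_init=1, window=2),
        "Vertical-Slash": vertical_slash_mask(n, vertical_cols=[0], slash_offsets=[0]),
        "Block-Sparse": block_sparse_mask(n, block=4, kept_blocks=[(0, 0), (1, 1)]),
    }
    for name, mask in masks.items():
        print(name, f"density={density(mask):.2f}")
        for row in mask:
            print("".join("#" if keep else "." for keep in row))
```

Because each mask keeps only a bounded number of entries per query row (or a bounded number of tiles), the kept fraction of the causal matrix shrinks as the sequence grows, which is where the savings over quadratic dense attention would come from. A real kernel would never materialize these dense masks; it would iterate over the sparse indices directly.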

This paper has not been read by Pith yet.


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation

    cs.CL 2026-06 unverdicted novelty 7.0

    NLL-guided layer selection identifies 1/4 of layers for full attention in hybrid models, matching periodic 1/2-FA baseline accuracy on LongMemEval with Qwen3-4B while halving the full-attention compute budget.

  2. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...

  3. PrefixWall: Mitigating Prefix Caching Side Channels in Shared LLM Systems

    cs.CR 2026-03 unverdicted novelty 7.0

    PrefixWall mitigates APC side channels in multi-tenant LLM systems via selective prefix isolation, delivering up to 70% higher cache reuse and 30% lower latency than full-isolation baselines.

  4. FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

    cs.LG 2025-02 unverdicted novelty 7.0

    FastKV decouples prefill context reduction via Token-Selective Propagation from independent KV cache selection, delivering up to 1.82x prefill and 2.87x decoding speedups while matching decoding-only accuracy.

  5. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

    cs.CL 2024-10 conditional novelty 7.0

    DuoAttention identifies retrieval heads requiring full KV cache and streaming heads using constant-length cache to reduce memory and latency in long-context LLM inference.

  6. Hierarchical Global Attention (HGA)

    cs.LG 2026-06 unverdicted novelty 6.0

    HGA uses RoPE-aware chunk summaries for two-level hierarchical routing to approximate dense causal attention at 3% sparsity with 0.01-0.02 nats quality gap, as a drop-in replacement requiring no retraining.

  7. From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

    cs.AI 2026-06 unverdicted novelty 6.0

    EntropyInfer adaptively allocates inference compute using per-head attention entropy for rigid/dynamic classification during prefilling and compresses KV cache with generated tokens, achieving up to 2.39x speedup on l...

  8. Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

    cs.LG 2026-05 unverdicted novelty 6.0

    Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex s...

  9. DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

    cs.CL 2026-05 unverdicted novelty 6.0

    DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

  10. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

  11. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...

  12. HISA: Efficient Hierarchical Indexing for Fine-Grained Sparse Attention

    cs.LG 2026-03 unverdicted novelty 6.0

    HISA speeds up fine-grained sparse attention indexers via block-then-token hierarchy, delivering substantial speedups at 64K context with no training and quality matching the original DSA on long-context benchmarks.

  13. Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

    cs.LG 2025-10 unverdicted novelty 6.0

    A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.

  14. MoBA: Mixture of Block Attention for Long-Context LLMs

    cs.LG 2025-02 unverdicted novelty 6.0

    MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.

  15. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

    cs.CL 2025-02 unverdicted novelty 6.0

    NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.

  16. RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

    cs.LG 2024-09 conditional novelty 6.0

    RetrievalAttention approximates full attention in long-context LLMs by retrieving relevant KV vectors from CPU-based ANNS indexes with an attention-aware algorithm, achieving near-full accuracy while accessing only 1-...

  17. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    cs.CL 2024-06 conditional novelty 6.0

    PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.

  18. Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM

    cs.CL 2026-06 unverdicted novelty 5.0

    K-VEC is a coverage-aware KV-cache eviction strategy using cross-head and cross-layer modules that improves performance by up to 10.35 points over prior methods on LongBench subsets at fixed memory budget.

  19. SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance

    cs.AI 2026-06 unverdicted novelty 5.0

    SIFT precomputes selective attention indices via local and cross-attention invariance to speed RAG prefill 1.71x while keeping accuracy within 1% of full recompute, storing only bit vectors 24,000x smaller than KV tensors.

  20. How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

    cs.LG 2026-06 unverdicted novelty 5.0

    An oracle shows sparse token support preserves near-dense performance on Qwen retrieval tasks, and a KL-distilled head-collapsed indexer delivers 1.7-1.9x speedups with small quality gaps.

  21. IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 5.0

    IndexMem proposes a learned KV importance predictor paired with a latent memory module to enable bounded KV cache size for long-context inference, reporting gains on RULER, Needle-in-a-Haystack, and LongBench across m...

  22. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  23. ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

    cs.PF 2025-08 unverdicted novelty 5.0

    ShadowNPU presents shadowAttn, a co-designed sparse attention system that uses NPU pilot compute and techniques like graph bucketing and per-head sparsity to minimize CPU/GPU fallback during on-device LLM inference wh...

  24. Token-Operations-Oriented Inference Optimization Techniques for Large Models

    cs.SE 2026-06 unverdicted novelty 3.0

    The paper introduces a four-layer technical architecture for token-operations-oriented inference optimization in large models and reviews key technologies and industry status at each layer.