pith. machine review for the scientific record.

arxiv: 2410.10819 · v1 · submitted 2024-10-14 · 💻 cs.CL

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Pith reviewed 2026-05-18 11:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords attention mechanism · KV cache optimization · long context modeling · efficient inference · retrieval heads · streaming heads · LLM acceleration

The pith

Only retrieval heads need full key-value caches for long-context processing in large language models, while streaming heads can use short fixed caches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that long-context abilities in LLMs depend primarily on a minority of attention heads that maintain full attention over all previous tokens. The remaining heads focus mostly on recent context and attention sinks, so they can operate with limited caches. By classifying heads into these two groups using an optimization procedure on synthetic examples, the DuoAttention method applies full caching only where necessary. This selective caching delivers large reductions in memory footprint and faster inference times for both filling the context and generating output. The approach keeps task performance nearly identical to the original model.

Core claim

The paper claims that attention heads split into retrieval heads, which require complete KV caches for long contexts, and streaming heads, which suffice with constant-length caches, and that exploiting this split enables efficient inference. The identification uses a lightweight optimization-based algorithm with synthetic data. The reported result is memory savings of up to 2.55x for MHA models and 1.67x for GQA models, plus speedups in decoding and pre-filling, all with minimal accuracy loss on long-context tasks.

What carries the argument

The separation of attention heads into retrieval heads that keep full KV caches and streaming heads that use a lightweight constant-length KV cache, with the split determined by an optimization algorithm on synthetic data.
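
To make the two head types concrete, here is a minimal sketch of which earlier positions each kind of head can see. It assumes a streaming head attends to a few initial "sink" tokens plus a sliding window of recent tokens; the function names and the `num_sink` and `num_recent` parameters are illustrative and are not taken from the paper or its released code.

```python
def full_causal_mask(n):
    """Retrieval-head view: token i may attend to every token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def streaming_mask(n, num_sink, num_recent):
    """Streaming-head view (assumed form): token i sees the first `num_sink`
    tokens plus the `num_recent` most recent tokens, itself included."""
    return [
        [j <= i and (j < num_sink or j > i - num_recent) for j in range(n)]
        for i in range(n)
    ]


def visible_keys(mask_row):
    """Indices of the positions a query row is allowed to attend to."""
    return [j for j, allowed in enumerate(mask_row) if allowed]


n = 10
full = full_causal_mask(n)
stream = streaming_mask(n, num_sink=2, num_recent=3)

last_full = visible_keys(full[-1])
last_stream = visible_keys(stream[-1])
print(last_full)    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(last_stream)  # [0, 1, 7, 8, 9]
```

In this toy setting the number of keys a streaming head sees is capped at `num_sink + num_recent` however long the sequence grows, which is the property that would allow a constant-length cache.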

If this is right

  • Long-context inference memory usage drops substantially, up to 2.55x for MHA models and 1.67x for GQA models.
  • Decoding becomes faster by up to 2.18x for MHA and 1.50x for GQA.
  • Pre-filling accelerates by up to 1.73x for MHA and 1.63x for GQA.
  • With quantization, Llama-3-8B can decode with a 3.3-million-token context on a single A100 GPU.
  • Long-context capabilities remain largely intact despite the reduced caching.
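
A rough back-of-the-envelope calculation shows why the split saves memory. The sketch below counts only KV-cache bytes; every number in it (layer count, head count, head dimension, retrieval fraction, sink and window sizes) is a hypothetical placeholder rather than a configuration reported in the paper, so it is not expected to reproduce the measured figures above.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2, retrieval_fraction=1.0,
                   num_sink=0, num_recent=0):
    """KV-cache size when a fraction of heads keep every token and the
    remaining heads keep at most num_sink + num_recent tokens."""
    per_token_per_head = 2 * head_dim * bytes_per_value  # one key + one value
    total_heads = num_layers * num_kv_heads
    retrieval_heads = round(total_heads * retrieval_fraction)
    streaming_heads = total_heads - retrieval_heads
    streaming_tokens = min(num_tokens, num_sink + num_recent)
    return per_token_per_head * (
        retrieval_heads * num_tokens + streaming_heads * streaming_tokens
    )


# Hypothetical model shape and context length, chosen only for illustration.
shape = dict(num_tokens=100_000, num_layers=32, num_kv_heads=32, head_dim=128)

full = kv_cache_bytes(**shape)
duo = kv_cache_bytes(**shape, retrieval_fraction=0.25,
                     num_sink=64, num_recent=256)

print(f"full cache: {full / 1e9:.2f} GB")
print(f"split cache: {duo / 1e9:.2f} GB")
print(f"reduction: {full / duo:.2f}x")
```

As the context grows, the streaming term stays fixed, so the reduction in this simplified model approaches the reciprocal of the retrieval fraction.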

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The head classification might reveal similar structure in other transformer variants beyond the tested models.
  • Integrating this with other compression methods could yield further gains in efficiency.
  • Testing on a wider range of benchmarks would confirm if the synthetic data method generalizes across tasks.
  • If streaming heads prove task-dependent, online reclassification could be explored.

Load-bearing premise

The optimization algorithm using synthetic data correctly identifies which heads are retrieval heads that truly require the full KV cache to maintain long-context performance.
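
This summary does not spell out the optimization itself. One plausible shape for it, assumed here purely for illustration, is a learnable gate per head that blends full-attention and streaming-attention outputs, trained to stay close to the full-attention result while a sparsity penalty pushes gates toward zero. The loss, the `fit_gates` helper, the threshold, and the per-head gap values below are all hypothetical stand-ins, not the authors' procedure or data.

```python
def fit_gates(discrepancies, l1_weight=0.05, lr=0.1, steps=500):
    """Gradient descent on per-head gates alpha, each clamped to [0, 1].

    Assumed per-head loss: (1 - alpha)^2 * d^2 + l1_weight * alpha,
    where d is the gap between a head's full-attention output and its
    streaming-attention output on synthetic long-context examples.
    """
    alphas = [1.0] * len(discrepancies)  # start with every head fully cached
    for _ in range(steps):
        for h, d in enumerate(discrepancies):
            grad = -2.0 * (1.0 - alphas[h]) * d * d + l1_weight
            alphas[h] = min(1.0, max(0.0, alphas[h] - lr * grad))
    return alphas


# Made-up output gaps for four heads (not measurements from the paper).
gaps = [1.2, 0.05, 0.9, 0.1]
alphas = fit_gates(gaps)

# Heads whose gate stays high are treated as retrieval heads.
retrieval_heads = [h for h, a in enumerate(alphas) if a > 0.5]
print([round(a, 3) for a in alphas])
print(retrieval_heads)  # [0, 2]
```

Under this toy loss, a head is kept as a retrieval head only when dropping its long-range context changes its output by more than the sparsity penalty can justify, which is exactly the judgment the premise above depends on.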

What would settle it

Running the method on a new long-context task and observing that accuracy drops significantly when using the constant cache for the designated streaming heads would falsify the claim.

Original abstract

Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a, Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, which primarily focus on recent tokens and attention sinks--referred to as Streaming Heads--do not require full attention. Based on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both LLM's decoding and pre-filling memory and latency without compromising its long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method significantly reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models while speeding up decoding by up to 2.18x and 1.50x and accelerating pre-filling by up to 1.73x and 1.63x for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.3 million context length on a single A100 GPU. Code is provided in https://github.com/mit-han-lab/duo-attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes DuoAttention, which classifies attention heads in LLMs into retrieval heads (requiring full KV cache to preserve long-context capabilities) and streaming heads (approximable with constant-length KV cache focused on recent tokens and attention sinks). A lightweight optimization algorithm run on synthetic data identifies the retrieval heads. The method is claimed to reduce long-context inference memory by up to 2.55x (MHA) and 1.67x (GQA), speed up decoding by up to 2.18x and 1.50x, and accelerate pre-filling by up to 1.73x and 1.63x respectively, while incurring only minimal accuracy loss versus full attention. Combined with quantization, it enables 3.3M-token context on Llama-3-8B using a single A100 GPU. Code is released.

Significance. If the synthetic-data head classification proves robust and generalizes, the work would meaningfully advance practical deployment of long-context LLMs by cutting KV-cache memory and latency without large accuracy penalties. The open-source code is a clear strength that supports reproducibility. The approach builds on existing observations about head specialization and attention sinks but its broader impact depends on whether the identified partition remains necessary and sufficient outside the reported evaluation settings.

major comments (2)
  1. [§3] §3 (Head Identification): The optimization procedure on synthetic data is used to select retrieval heads, yet the manuscript provides no direct ablation demonstrating necessity (e.g., accuracy drop when a selected retrieval head is forced to use constant-length cache) or sufficiency (e.g., that restricting all other heads to constant cache preserves performance on the long-context benchmarks). This partition is load-bearing for the central efficiency claims.
  2. [§4] §4 (Experiments): Performance numbers (memory reduction, speedups, accuracy) are reported only after head classification on synthetic data; there is no cross-task or cross-length hold-out validation showing the selected subset remains adequate for arbitrary long-context tasks, leaving open the possibility that the synthetic objective yields a convenient but incomplete partition.
minor comments (3)
  1. [Abstract] The abstract states 'minimal accuracy loss' without quantifying the exact delta or the specific long-context tasks/metrics used for this assessment.
  2. [Figures] Figure captions and legends for attention-pattern visualizations could be expanded to clarify how retrieval versus streaming heads are highlighted.
  3. [§3] Notation for the constant-length cache size hyperparameter and its relation to attention-sink handling is introduced without a dedicated equation or pseudocode block.
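
For readers who want the missing pseudocode from the last minor comment, the following is one way a constant-length cache with attention-sink handling could be written. It is a sketch under stated assumptions: the class name, the `num_sink` and `num_recent` parameters, and the eviction rule (never evict sinks, drop the oldest non-sink entry) are illustrative choices, not the notation or implementation used in the manuscript.

```python
from collections import deque


class StreamingKVCache:
    """Constant-length cache: a fixed set of sink entries plus a sliding
    window of the most recent entries. All names here are illustrative."""

    def __init__(self, num_sink, num_recent):
        self.num_sink = num_sink
        self.sinks = []                         # first tokens, never evicted
        self.recent = deque(maxlen=num_recent)  # oldest entry drops out automatically

    def append(self, position, key, value):
        entry = (position, key, value)
        if len(self.sinks) < self.num_sink:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def positions(self):
        """Token positions currently held, sinks first."""
        return [p for p, _, _ in self.sinks] + [p for p, _, _ in self.recent]

    def __len__(self):
        return len(self.sinks) + len(self.recent)


cache = StreamingKVCache(num_sink=2, num_recent=3)
for t in range(10):
    cache.append(t, key=f"k{t}", value=f"v{t}")

print(cache.positions())  # [0, 1, 7, 8, 9]
print(len(cache))         # 5
```

The cache length never exceeds `num_sink + num_recent`, so the single hyperparameter the referee asks about would, in this reading, be the sum of the two.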

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the validation needs for the head classification in DuoAttention. We address each major comment below and will revise the manuscript accordingly to include the suggested ablations and cross-validation experiments.

Point-by-point responses
  1. Referee: [§3] §3 (Head Identification): The optimization procedure on synthetic data is used to select retrieval heads, yet the manuscript provides no direct ablation demonstrating necessity (e.g., accuracy drop when a selected retrieval head is forced to use constant-length cache) or sufficiency (e.g., that restricting all other heads to constant cache preserves performance on the long-context benchmarks). This partition is load-bearing for the central efficiency claims.

    Authors: We agree that direct ablations on necessity and sufficiency would provide stronger support for the retrieval head partition. In the revised manuscript, we will add experiments that force selected retrieval heads to use constant-length KV cache and measure the resulting accuracy drop on long-context benchmarks. We will also report results when all streaming heads are restricted to constant-length cache while retrieval heads retain full KV cache, confirming that performance is preserved. These ablations will follow the same synthetic data identification and evaluation protocol as the original results. revision: yes

  2. Referee: [§4] §4 (Experiments): Performance numbers (memory reduction, speedups, accuracy) are reported only after head classification on synthetic data; there is no cross-task or cross-length hold-out validation showing the selected subset remains adequate for arbitrary long-context tasks, leaving open the possibility that the synthetic objective yields a convenient but incomplete partition.

    Authors: We acknowledge the need to demonstrate generalization of the identified heads. In the revision, we will include additional experiments applying the synthetic-data-selected retrieval heads to hold-out long-context tasks and context lengths not used in the optimization. We will report accuracy, memory savings, and latency improvements on these settings to show that the partition remains effective and is not limited to the synthetic objective. revision: yes

Circularity Check

0 steps flagged

No significant circularity in DuoAttention derivation chain

Full rationale

The paper's core derivation identifies retrieval heads via a lightweight optimization procedure run on synthetic data, then applies full KV cache only to that subset while restricting streaming heads to constant-length cache; long-context accuracy and efficiency metrics are measured on separate benchmark tasks after identification. This separation means the reported performance numbers do not reduce to quantities fitted on the evaluation data itself. No equations, self-citations, or uniqueness theorems are invoked that would make the partition or the efficiency gains equivalent to the inputs by construction. The approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The approach introduces a new categorization of attention heads without prior independent evidence outside this work; the identification algorithm uses optimization on synthetic data whose parameters are not detailed here.

axioms (1)
  • domain assumption: Attention heads in transformer LLMs can be partitioned into retrieval heads that require full long-range context and streaming heads that do not.
    This partition is invoked to justify the differentiated KV cache strategy.
invented entities (2)
  • Retrieval Heads (no independent evidence)
    purpose: Attention heads critical for long-context processing that require full KV cache
    Newly postulated category based on observed attention patterns.
  • Streaming Heads (no independent evidence)
    purpose: Attention heads focused on recent tokens and sinks that use reduced constant-length KV cache
    Complementary category introduced to enable memory savings.

pith-pipeline@v0.9.0 · 5855 in / 1445 out tokens · 57765 ms · 2026-05-18T11:44:47.663514+00:00 · methodology

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

    cs.DC 2026-05 conditional novelty 7.0

    KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.

  2. InfiniLoRA: Disaggregated Multi-LoRA Serving for Large Language Models

    cs.DC 2026-04 unverdicted novelty 7.0

    InfiniLoRA decouples LoRA execution from base-model inference and reports 3.05x higher request throughput plus 54% more adapters meeting strict latency SLOs.

  3. AB-Sparse: Sparse Attention with Adaptive Block Size for Accurate and Efficient Long-Context Inference

    cs.DC 2026-05 unverdicted novelty 6.0

    AB-Sparse adaptively allocates per-head block sizes for sparse attention, adds lossless centroid quantization and custom variable-block GPU kernels, and reports up to 5.43% accuracy gain over fixed-block baselines wit...

  4. Compute Where it Counts: Self Optimizing Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...

  5. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 6.0

    LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

  6. The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

    cs.LG 2026-05 unverdicted novelty 6.0

    Attention sinks arise from variance discrepancy in self-attention value aggregation, amplified by super neurons and first-token dimension disparity, and can be mitigated by head-wise RMSNorm to accelerate pre-training...

  7. Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

    cs.AI 2026-05 unverdicted novelty 6.0

    SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...

  8. Training Transformers for KV Cache Compressibility

    cs.LG 2026-05 unverdicted novelty 6.0

    KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...

  9. Training Transformers for KV Cache Compressibility

    cs.LG 2026-05 unverdicted novelty 6.0

    Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.

  10. CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

    cs.DC 2026-04 unverdicted novelty 6.0

    CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...

  11. RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

    cs.LG 2026-02 conditional novelty 6.0

    RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...

  12. BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

    cs.CL 2025-12 unverdicted novelty 6.0

    BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.

  13. Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

    cs.LG 2025-10 unverdicted novelty 6.0

    A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.

  14. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

    cs.CL 2025-02 unverdicted novelty 6.0

    NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.

  15. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

    cs.CL 2024-07 accept novelty 6.0

    Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on l...

  16. TIDE: Every Layer Knows the Token Beneath the Context

    cs.CL 2026-05 unverdicted novelty 5.0

    TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

  17. HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

    cs.DC 2026-04 unverdicted novelty 5.0

    HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

  18. Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

    cs.CL 2026-02 unverdicted novelty 5.0

    Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.

  19. The Pitfalls of KV Cache Compression

    cs.LG 2025-09 conditional novelty 5.0

    KV cache compression causes certain instructions to degrade rapidly and be ignored in multi-instruction prompting, with system prompt leakage worsened by method choice, instruction order, and eviction bias; simple pol...

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · cited by 18 Pith papers · 13 internal anchors

  1. [1]

    Cold compress: A toolkit for benchmarking kv cache compression approaches, 8 2024

    Griffin Adams, Faisal Ladhak, Hailey Schoelkopf, and Raja Biswas. Cold compress: A toolkit for benchmarking kv cache compression approaches, 8 2024. URL https://www.answer.ai/posts/2024-08-01-cold-compress.html

  2. [2]

    SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills, 2023. URL https://arxiv.org/abs/2308.16369

  3. [3]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023

  4. [4]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  5. [5]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

  6. [6]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer, 2020. arXiv:2004.05150

  7. [7]

    GPT-NeoX-20B: An Open-Source Autoregressive Language Model

    Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT - N eo X -20 B : An open-source autoregressive language model, 2022. arXiv: 2204.06745

  8. [8]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ URL https://lmsys.org/blog/2023-03-30-vicuna/

  9. [9]

    Generating long sequences with sparse transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. 2019

  10. [10]

    Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? an analysis of BERT ' s attention. In Tal Linzen, Grzegorz Chrupa a, Yonatan Belinkov, and Dieuwke Hupkes (eds.), Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.\ 276--286, Florence, Italy, August 2019. ...

  11. [11]

    Flash A ttention-2: Faster attention with better parallelism and work partitioning, 2023

    Tri Dao. Flash A ttention-2: Faster attention with better parallelism and work partitioning, 2023

  12. [12]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention : Fast and memory-efficient exact attention with IO -awareness, 2022. arXiv:2205.14135

  13. [13]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany...

  14. [14]

    Model tells you what to discard: Adaptive KV cache compression for LLM s

    Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive KV cache compression for LLM s. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=uNrFpDPMyo

  15. [15]

    Evaluating factuality in generation with dependency-level entailment

    Tanya Goyal and Greg Durrett. Evaluating factuality in generation with dependency-level entailment. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 2020. Association for Computational Linguistics

  16. [16]

    Mamba: Linear-time sequence modeling with selective state spaces, 2023

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces, 2023

  17. [17]

    Block Sparse Attention

    Junxian Guo, Haotian Tang, Shang Yang, Zhekai Zhang, Zhijian Liu, and Song Han. Block Sparse Attention . https://github.com/mit-han-lab/Block-Sparse-Attention, 2024

  18. [18]

    LM - I nfinite: Simple on-the-fly length generalization for large language models, 2023

    Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. LM - I nfinite: Simple on-the-fly length generalization for large language models, 2023

  19. [19]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

  20. [20]

    Flashdecoding++: Faster large language model inference on gpus, 2024

    Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. Flashdecoding++: Faster large language model inference on gpus, 2024

  21. [21]

    Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami

    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization, 2024

  22. [22]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models, 2023. URL https://arxiv.org/abs/2309.14509

  23. [23]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023

  24. [24]

    Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. arXiv preprint arXiv:2407.02490, 2024

  25. [25]

    Llmtest\_needleinahaystack: Doing simple retrieval from llm models at various context lengths to measure accuracy

    Greg Kamradt. Llmtest\_needleinahaystack: Doing simple retrieval from llm models at various context lengths to measure accuracy. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2024. Accessed: 2024-05-23

  26. [26]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings , 2015. URL http://arxiv.org/abs/1412.6980

  27. [27]

    Booksum: A collection of datasets for long-form narrative summarization

    Wojciech Kry \'s ci \'n ski, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. Booksum: A collection of datasets for long-form narrative summarization. 2021

  28. [28]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention, 2023

  29. [29]

    Video-llava: Learning united visual representation by alignment before projection, 2023

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection, 2023

  30. [30]

    Awq: Activation-aware weight quantization for llm compression and acceleration, 2024

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration, 2024

  31. [31]

    Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

    Yujun Lin*, Haotian Tang*, Shang Yang*, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532, 2024

  32. [32]

    Ring attention with blockwise transformers for near-infinite context, 2023 a

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context, 2023 a

  33. [33]

    Visual instruction tuning, 2023 b

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023 b

  34. [34]

    Learning efficient convolutional networks through network slimming

    Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017

  35. [35]

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024

  36. [36]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  37. [37]

    Transformers are multi-state rnns, 2024

    Matanel Oren, Michael Hassid, Yossi Adi, and Roy Schwartz. Transformers are multi-state rnns, 2024

  38. [38]

    Py T orch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Py T orch: An imperative style, high-per...

  39. [39]

    Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré

    Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. Hyena hierarchy: Towards larger convolutional language models, 2023. URL https://arxiv.org/abs/2302.10866

  40. [40]

    Chatgpt: Optimizing language models for dialogue

    John Schulman, Barret Zoph, Christina Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Luke Metz, Michael Pokorny, et al. Chatgpt: Optimizing language models for dialogue. OpenAI blog, 2022

  41. [41]

    Fast transformer decoding: One write-head is all you need, 2019

    Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019

  42. [42]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021

  43. [43]

    Razorattention: Efficient kv cache compression through retrieval heads, 2024a

    Hanlin Tang, Yang Lin, Jing Lin, Qingsen Han, Shikuan Hong, Yiwu Yao, and Gongyi Wang. Razorattention: Efficient kv cache compression through retrieval heads, 2024a. URL https://arxiv.org/abs/2407.15891

  44. [44]

    Quest: Query-aware sparsity for efficient long-context llm inference, 2024b

    Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. Quest: Query-aware sparsity for efficient long-context llm inference, 2024b

  45. [45]

    Stanford alpaca: An instruction-following llama model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  46. [46]

    Regression shrinkage and selection via the lasso

    R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58: 267–288, 1996

  47. [47]

    Llama-2-7b-32k-instruct — and fine-tuning for llama-2 models with together api, June 2023

    Together. Llama-2-7b-32k-instruct — and fine-tuning for llama-2 models with together api, June 2023. URL https://together.ai/blog/llama-2-7b-32k-instruct

  48. [48]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a

  49. [49]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b

  50. [50]

    Retrieval head mechanistically explains long-context factuality, 2024

    Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval head mechanistically explains long-context factuality, 2024

  51. [51]

    SmoothQuant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023a

  52. [52]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv, 2023b

  53. [53]

    Cascade inference: Memory bandwidth efficient shared prefix batch decoding

    Zihao Ye, Ruihang Lai, Roy Lu, Chien-Yu Lin, Size Zheng, Lequn Chen, Tianqi Chen, and Luis Ceze. Cascade inference: Memory bandwidth efficient shared prefix batch decoding, Jan 2024. URL https://flashinfer.ai/2024/01/08/cascade-inference.html. Accessed on 2024-02-01

  54. [54]

    Big Bird: Transformers for longer sequences

    Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for longer sequences. In Proc. of NeurIPS, volume 33, 2020

  55. [55]

    Benchmarking large language models for news summarization, 2023a

    Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. Benchmarking large language models for news summarization, 2023a

  56. [56]

    H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023b

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2O: Heavy-hitter oracle for efficient generative inference of large language models, 2023b

  57. [57]

    Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023