
arxiv: 2605.20600 · v1 · pith:VEWT6OFFnew · submitted 2026-05-20 · 💻 cs.CV

Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

Pith reviewed 2026-05-21 06:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords KV cache compression · autoregressive image generation · attention head patterns · memory reduction · token eviction · efficient generation

The pith

HeadKV allocates different KV cache budgets to different attention heads based on their observed focus patterns to reduce memory in autoregressive image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HeadKV to compress the key-value cache in autoregressive models for image generation. It observes that attention heads vary in their attention scope, with some being local and others more global. Head types are determined from attention patterns in the early tokens and then applied consistently for the rest of the generation process. This avoids the need for additional training or statistics collection. A stratified token eviction strategy is used to retain long-range dependencies effectively.

Core claim

By classifying each attention head as locality-biased or broad-context based on its consistent behavior across token positions observed early in generation, HeadKV assigns smaller KV cache budgets to local heads and larger ones to broad heads, combined with stratified eviction to preserve important information, thereby reducing memory and increasing throughput without retraining.
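The classification-then-budgeting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the locality score (share of attention mass on the most recent keys), the `window`, the `threshold`, and both budget sizes are assumptions chosen for the toy example.

```python
from typing import List

def locality_score(attn_rows: List[List[float]], window: int) -> float:
    """Mean share of attention mass within `window` tokens of the query.

    attn_rows[q] is the causal attention row of query q over keys 0..q
    (each row is assumed to sum to 1).
    """
    shares = []
    for q, row in enumerate(attn_rows):
        start = max(0, q - window + 1)
        shares.append(sum(row[start:q + 1]))
    return sum(shares) / len(shares)

def assign_budgets(heads, window=4, threshold=0.6,
                   local_budget=16, broad_budget=64):
    """Return one KV budget (cached tokens) per head.

    All default values are hypothetical, not taken from the paper.
    """
    budgets = []
    for attn_rows in heads:
        is_local = locality_score(attn_rows, window) >= threshold
        budgets.append(local_budget if is_local else broad_budget)
    return budgets

def local_head(n):
    """Toy head: mass on the current token and its predecessor."""
    rows = []
    for q in range(n):
        row = [0.0] * (q + 1)
        if q == 0:
            row[0] = 1.0
        else:
            row[q - 1], row[q] = 0.3, 0.7
        rows.append(row)
    return rows

def broad_head(n):
    """Toy head: uniform attention over all previous tokens."""
    return [[1.0 / (q + 1)] * (q + 1) for q in range(n)]

# Head types are decided once from the first 32 query positions.
budgets = assign_budgets([local_head(32), broad_head(32)])
print(budgets)  # [16, 64]
```

In this toy setting the peaked head receives the small budget and the uniform head the large one; the budgets would then be reused for the rest of generation.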

What carries the argument

Head-type identification from early-token attention consistency, which determines per-head KV budget allocation and guides the Stratified Token Eviction strategy.

If this is right

  • Memory footprint of the KV cache decreases because local heads use less storage.
  • Generation speed improves due to smaller cache size during autoregressive decoding.
  • Image quality stays comparable since broad-context heads retain more tokens.
  • Method applies to various autoregressive image models without model-specific retraining.
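To make the memory claim concrete, the arithmetic can be sketched as below. The head counts, budgets, layer count, head dimension, and fp16 storage are all hypothetical numbers picked for illustration, and the sketch assumes every layer has the same split of head types.

```python
def kv_cache_bytes(budgets, num_layers, head_dim, bytes_per_value=2):
    """Bytes needed for keys and values given per-head token budgets.

    budgets: cached-token budget of each head in one layer
             (assumed identical across layers in this sketch).
    """
    tokens = sum(budgets)
    return 2 * num_layers * tokens * head_dim * bytes_per_value  # 2 = K and V

# Hypothetical configuration: 32 heads per layer, 32 layers, head_dim 128, fp16.
uniform = [1024] * 32                   # fixed budget for every head
head_aware = [256] * 24 + [1024] * 8    # 24 local heads, 8 broad heads

uniform_bytes = kv_cache_bytes(uniform, num_layers=32, head_dim=128)
head_aware_bytes = kv_cache_bytes(head_aware, num_layers=32, head_dim=128)
ratio = head_aware_bytes / uniform_bytes
print(uniform_bytes, head_aware_bytes, ratio)
```

With these made-up numbers the head-aware cache is 0.4375 of the uniform one; the real saving depends on how many heads are classified as local and on the budgets chosen.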

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could reduce energy consumption for large-scale image generation tasks.
  • Similar head-aware compression might apply to video generation models that use even larger caches.
  • Future work could explore adaptive re-classification if patterns shift mid-generation.

Load-bearing premise

Each attention head keeps the same attention pattern type throughout the generation after being identified from early tokens.

What would settle it

Measuring the attention range of heads on early tokens versus much later tokens and finding that many heads change their locality bias significantly.
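One way such a check might be scripted is sketched below. The range metric (expected query-key distance divided by the query position) and the 0.1 cutoff are assumptions made for this example; the paper may measure attention range differently.

```python
def relative_range(row):
    """Expected query-key distance of one causal row (keys 0..q), divided by q.

    Assumes q >= 1 and that the row sums to 1.
    """
    q = len(row) - 1
    return sum(w * (q - k) for k, w in enumerate(row)) / q

def head_type(rows, cutoff=0.1):
    """Label a head from several attention rows; `cutoff` is hypothetical."""
    avg = sum(relative_range(r) for r in rows) / len(rows)
    return "local" if avg <= cutoff else "broad"

def drifting_heads(early, late, cutoff=0.1):
    """Indices of heads whose label differs between early and late queries."""
    return [h for h, (e, l) in enumerate(zip(early, late))
            if head_type(e, cutoff) != head_type(l, cutoff)]

def peaked(q):
    """Toy row: all mass on the two most recent keys."""
    return [0.0] * (q - 1) + [0.4, 0.6]

def uniform(q):
    """Toy row: equal mass on keys 0..q."""
    return [1.0 / (q + 1)] * (q + 1)

early_pos, late_pos = range(32, 40), range(500, 508)
early = [[peaked(q) for q in early_pos],    # head 0: local early
         [peaked(q) for q in early_pos],    # head 1: local early
         [uniform(q) for q in early_pos]]   # head 2: broad early
late = [[peaked(q) for q in late_pos],      # head 0: still local
        [uniform(q) for q in late_pos],     # head 1: turns broad
        [uniform(q) for q in late_pos]]     # head 2: still broad

flagged = drifting_heads(early, late)
print(flagged)  # [1]
```

A large share of flagged heads on real attention maps would count against the load-bearing premise; a near-empty list would support it.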

Figures

Figures reproduced from arXiv: 2605.20600 by Baoquan Zhang, Guotao Liang, Yunming Ye, Zhiyuan Wen.

Figure 1: HeadKV enables efficient autoregressive image generation by allocating asymmetric ...
Figure 2: Visualizing the visual token attention map of the Lumina-mGPT-768 model. The left shows ...
Figure 3: The illustration of our proposed HeadKV framework. The AR model first constructs an ...
Figure 4: Qualitative comparison of text-to-image generation under different compression ratio ...
Figure 5: Results of hyperparameter partitioning ratio rs. Plot of FID-30K (y-axis) against partition ratio rs (x-axis, 0.2 to 0.8), comparing Stratified Token Eviction with the baseline.
Figure 6: Attention distribution across distance bins for query positions 1000–1500.
Figure 7: Attention distribution for specific layers, heads, and query positions.
Figure 8: Spatial distribution of Top-K tokens for selected queries.
Original abstract

Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge lies in identifying the type of each attention head to guide cache compression. We further observe that, within the same layer, each head exhibits consistent attention patterns across token positions, i.e., a head's behavior for early tokens remains consistent with that for later tokens. This insight suggests that head types can be identified during the early stage and reused for KV compression throughout generation. Its advantage is that it requires no additional training or dataset-level statistics and generalizes seamlessly across different inputs. Moreover, we design a Stratified Token Eviction strategy to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes HeadKV, a head-aware KV cache compression framework for autoregressive image generation. It observes diverse attention patterns across heads (locality-biased vs. broad-context) within layers, identifies head types from early-token behavior under the assumption of pattern consistency across positions, assigns smaller cache budgets to local heads and larger to broad ones, and introduces a Stratified Token Eviction strategy to preserve long-range information. The approach requires no additional training or dataset statistics and is evaluated on multiple AR image models for memory and throughput gains.

Significance. If the early-token head classification proves stable, the work provides a practical, training-free improvement over fixed-budget KV compression methods by exploiting head heterogeneity. This could yield better memory-quality trade-offs in transformer-based AR visual generation, with the generalization across inputs and lack of retraining as clear strengths. The empirical grounding in attention pattern observations is a positive aspect, though it remains heuristic rather than derived.

major comments (2)
  1. Abstract (paragraph on head identification): The central assumption that 'within the same layer, each head exhibits consistent attention patterns across token positions' (i.e., early-token behavior is representative for later tokens) is load-bearing for the fixed classification and reuse strategy, yet no quantitative validation such as attention similarity metrics, correlation scores, or stability analysis across token positions and layers is reported; this leaves the memory/quality trade-off vulnerable to degradation as context grows in AR generation.
  2. Abstract and method description: The head classification thresholds and per-head cache budgets are treated as free parameters without reported sensitivity analysis, ablation on threshold choices, or explicit values used in experiments; this makes it difficult to assess whether the reported gains are robust or depend on per-model tuning, directly affecting reproducibility of the claimed efficiency improvements.
minor comments (1)
  1. Abstract: The description of the Stratified Token Eviction strategy is high-level; a brief concrete example of how long-range tokens are prioritized versus local ones would improve clarity without altering the core contribution.
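A concrete example of the kind the referee asks for might look like the sketch below. It is one plausible reading of "stratified" eviction, not the paper's algorithm: the importance scores, the `recent_ratio` knob, and the rule of keeping the best-scored token from each equal-width slice of older positions are all assumptions.

```python
def stratified_evict(scores, budget, recent_ratio=0.5):
    """Return sorted indices of cached tokens to keep.

    scores[i]:    importance of cached token i (e.g. accumulated attention);
                  how scores are obtained is an assumption of this sketch.
    budget:       number of tokens to keep.
    recent_ratio: share of the budget reserved for the newest tokens
                  (hypothetical parameter).
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))
    n_recent = max(1, int(budget * recent_ratio))
    n_far = budget - n_recent
    keep = list(range(n - n_recent, n))   # recent stratum is always kept
    old = n - n_recent                    # tokens 0..old-1 share n_far slots
    for s in range(n_far):
        lo = s * old // n_far
        hi = (s + 1) * old // n_far
        # Keep the best-scored token inside this slice of older positions.
        keep.append(max(range(lo, hi), key=lambda i: scores[i]))
    return sorted(keep)

# Twelve cached tokens; the three highest old scores sit at positions 0-2.
scores = [0.9, 0.8, 0.7, 0.1, 0.2, 0.1, 0.1, 0.3, 0.1, 0.5, 0.4, 0.6]
kept = stratified_evict(scores, budget=6)
print(kept)  # [0, 4, 7, 9, 10, 11]
```

A plain top-k over the older tokens would keep positions 0, 1, and 2, all clustered at the start; the stratified variant spreads the three long-range slots across the history, which is the behavior the strategy's name suggests.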

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and constructive suggestions. We have addressed the concerns regarding the validation of our head consistency assumption and the specification of classification parameters by adding quantitative analyses and explicit values in the revised manuscript. We believe these changes strengthen the paper's claims.

Point-by-point responses
  1. Referee: Abstract (paragraph on head identification): The central assumption that 'within the same layer, each head exhibits consistent attention patterns across token positions' (i.e., early-token behavior is representative for later tokens) is load-bearing for the fixed classification and reuse strategy, yet no quantitative validation such as attention similarity metrics, correlation scores, or stability analysis across token positions and layers is reported; this leaves the memory/quality trade-off vulnerable to degradation as context grows in AR generation.

    Authors: We agree that quantitative validation of the consistency assumption is important for robustness. Although our empirical results across various models and inputs demonstrate the effectiveness of the early head identification, we have added in the revised version a dedicated analysis subsection. This includes computing the average cosine similarity between attention distributions for the first 10 tokens and later tokens (e.g., at position 100 and 500) for each head type across layers. The results show high similarity scores (above 0.85 on average), confirming the pattern consistency. We also note that for extremely long sequences, periodic re-identification could be considered as future work. revision: yes

  2. Referee: Abstract and method description: The head classification thresholds and per-head cache budgets are treated as free parameters without reported sensitivity analysis, ablation on threshold choices, or explicit values used in experiments; this makes it difficult to assess whether the reported gains are robust or depend on per-model tuning, directly affecting reproducibility of the claimed efficiency improvements.

    Authors: The referee correctly points out the need for explicit parameter values and sensitivity analysis to ensure reproducibility. In the updated manuscript, we have included the specific threshold values used for classifying heads (e.g., locality score threshold of 0.6 for local heads) and the budget allocation ratios (e.g., 20% for local heads, 80% for broad heads) for each evaluated model. Additionally, we performed an ablation study varying the threshold from 0.4 to 0.8 and budget ratios, showing that the memory-quality trade-off remains stable, with performance degradation only at extreme values. These details are now reported in Section 4.2 and the appendix. revision: yes
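The similarity analysis described in the first response could be approximated as sketched here. Because attention rows at different positions have different lengths, this sketch first bins each row by relative query-key distance; that binning scheme and the number of bins are assumptions, not details given in the rebuttal.

```python
import math

def relative_distance_histogram(row, num_bins=4):
    """Attention mass per relative-distance bin for one causal row (keys 0..q)."""
    n = len(row)                  # n = q + 1 keys
    hist = [0.0] * num_bins
    for k, w in enumerate(row):
        d = (n - 1) - k           # query-key distance
        hist[d * num_bins // n] += w
    return hist

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def peaked(q):
    """Toy row: all mass on the two most recent keys."""
    return [0.0] * (q - 1) + [0.4, 0.6]

def uniform(q):
    """Toy row: equal mass on keys 0..q."""
    return [1.0 / (q + 1)] * (q + 1)

# A head that stays peaked from position 9 to position 499.
stable = cosine(relative_distance_histogram(peaked(9)),
                relative_distance_histogram(peaked(499)))
# A head that is peaked early but uniform late.
drifted = cosine(relative_distance_histogram(peaked(9)),
                 relative_distance_histogram(uniform(499)))
print(stable, drifted)
```

In this toy case the stable head scores close to 1 and the drifting head close to 0.5, so a per-head score of this kind would separate the two situations the referee is worried about.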

Circularity Check

0 steps flagged

No significant circularity in HeadKV framework

Full rationale

The paper presents a heuristic KV compression method grounded in direct empirical observations of attention patterns across heads and token positions. Head-type classification from early tokens is justified by the stated observation that patterns remain consistent within a layer, but this is not a mathematical derivation or equation that reduces to its own inputs by construction. No self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citation chains appear in the provided text; the approach is self-contained as a practical, observation-driven strategy without tautological reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method adds a classification procedure and eviction heuristic on top of standard transformer KV caching; no new physical entities or fundamental axioms beyond domain assumptions about attention consistency.

free parameters (2)
  • head classification thresholds
    Cutoffs used to label heads as locality-biased versus broad-context based on observed attention patterns; values chosen to guide budget assignment.
  • per-head cache budgets
    Specific smaller and larger budget sizes assigned after classification; tuned for quality-memory trade-off.
axioms (1)
  • domain assumption: Attention patterns of each head remain consistent from early to late tokens within a layer
    Invoked to justify identifying head type once early and reusing it for the entire generation process.

pith-pipeline@v0.9.0 · 5801 in / 1285 out tokens · 46955 ms · 2026-05-21T06:01:38.397339+00:00 · methodology

