Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

Baoquan Zhang; Guotao Liang; Yunming Ye; Zhiyuan Wen

arxiv: 2605.20600 · v1 · pith:VEWT6OFFnew · submitted 2026-05-20 · 💻 cs.CV

Pith reviewed 2026-05-21 06:01 UTC · model grok-4.3

classification 💻 cs.CV

keywords: KV cache compression · autoregressive image generation · attention head patterns · memory reduction · token eviction · efficient generation

The pith

HeadKV allocates different KV cache budgets to different attention heads based on their observed focus patterns to reduce memory in autoregressive image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HeadKV to compress the key-value cache in autoregressive models for image generation. It observes that attention heads vary in their attention scope, with some being local and others more global. Head types are determined from attention patterns in the early tokens and then applied consistently for the rest of the generation process. This avoids the need for additional training or statistics collection. A stratified token eviction strategy is used to retain long-range dependencies effectively.

Core claim

HeadKV classifies each attention head as locality-biased or broad-context from attention observed early in generation, relying on each head behaving consistently across token positions. It assigns smaller KV cache budgets to local heads and larger ones to broad heads, and pairs this with stratified eviction to preserve important information. The stated result is lower memory use and higher throughput without retraining.
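The classify-then-allocate idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's procedure: the locality score (share of attention mass on the most recent keys), the names `locality_score` and `assign_budgets`, and the window, threshold, and budget values are all hypothetical.

```python
from typing import Dict, List

# Rows[q] holds the causal attention of query q over keys 0..q.
Rows = List[List[float]]


def locality_score(rows: Rows, window: int) -> float:
    """Mean fraction of attention mass on the `window` most recent keys (hypothetical metric)."""
    fractions = []
    for q, row in enumerate(rows):
        start = max(0, q - window + 1)
        fractions.append(sum(row[start:]) / sum(row))
    return sum(fractions) / len(fractions)


def assign_budgets(early_attn: Dict[str, Rows], window: int, threshold: float,
                   local_budget: int, broad_budget: int) -> Dict[str, int]:
    """Label each head once from early-token attention, then fix its cache budget."""
    budgets = {}
    for head, rows in early_attn.items():
        is_local = locality_score(rows, window) >= threshold
        budgets[head] = local_budget if is_local else broad_budget
    return budgets


# Toy attention for the first 8 queries (synthetic data, not measured from any model).
n = 8
peaked = [[1.0] if q == 0 else [0.0] * (q - 1) + [0.1, 0.9] for q in range(n)]
uniform = [[1.0 / (q + 1)] * (q + 1) for q in range(n)]

budgets = assign_budgets({"L0H0": peaked, "L0H1": uniform},
                         window=2, threshold=0.75,
                         local_budget=64, broad_budget=256)
print(budgets)
```

In this toy setup the peaked head receives the small budget and the uniform head the large one. A real implementation would read attention weights from the model during the early decoding steps.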

What carries the argument

Head-type identification from early-token attention consistency, which determines per-head KV budget allocation and guides the Stratified Token Eviction strategy.
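This page does not spell out how Stratified Token Eviction works, so the following is only one plausible reading, offered as a sketch. It assumes that a fraction `r_s` of the budget is reserved for the most recent tokens and that the rest is spread evenly over contiguous position strata of older tokens, keeping the highest-scoring tokens in each. The function name, the parameters, and the scoring input are hypothetical.

```python
from typing import List


def stratified_evict(scores: List[float], budget: int, r_s: float,
                     num_strata: int) -> List[int]:
    """Return the sorted cache positions to keep (illustrative, not the paper's algorithm)."""
    n = len(scores)
    if budget >= n:
        return list(range(n))
    # Assumption: a share r_s of the budget always goes to the most recent tokens.
    n_recent = min(budget, max(1, int(budget * r_s)))
    recent = list(range(n - n_recent, n))
    older = list(range(n - n_recent))
    remaining = budget - n_recent
    # Split older positions into contiguous strata so distant regions stay represented.
    stratum_size = -(-len(older) // num_strata)  # ceiling division
    strata = [older[i:i + stratum_size] for i in range(0, len(older), stratum_size)]
    keep = []
    for idx, stratum in enumerate(strata):
        share = remaining // len(strata) + (1 if idx < remaining % len(strata) else 0)
        keep.extend(sorted(stratum, key=lambda p: scores[p], reverse=True)[:share])
    return sorted(keep + recent)


# Synthetic importance scores: the three highest all sit at the start of the sequence.
scores = [0.9, 0.8, 0.7, 0.1, 0.3, 0.2, 0.2, 0.1, 0.4, 0.0, 0.0, 0.0]
kept = stratified_evict(scores, budget=6, r_s=0.5, num_strata=3)
print(kept)
```

A plain global top-k over the older tokens would keep positions 0, 1, and 2 here. The stratified variant keeps one token from each region instead, which is the kind of long-range coverage the strategy is described as aiming for.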

If this is right

Memory footprint of the KV cache decreases because local heads use less storage.
Generation speed improves due to smaller cache size during autoregressive decoding.
Image quality stays comparable since broad-context heads retain more tokens.
Method applies to various autoregressive image models without model-specific retraining.
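The memory claim reduces to simple arithmetic, sketched below. The model shape (1024 heads in total, head dimension 128, 16-bit values, 2304 cached tokens) and the budget split are hypothetical numbers chosen for the example, not figures from the paper.

```python
from typing import List


def kv_cache_bytes(tokens_per_head: List[int], head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes for keys plus values, given how many tokens each head keeps."""
    return 2 * sum(tokens_per_head) * head_dim * bytes_per_elem


# Hypothetical configuration: 32 layers x 32 heads, head_dim 128, fp16.
num_heads_total = 32 * 32
seq_len = 2304

full = kv_cache_bytes([seq_len] * num_heads_total, head_dim=128)

# Hypothetical head-aware split: half the heads are local (256 tokens),
# half are broad-context (1024 tokens).
head_aware = kv_cache_bytes([256] * 512 + [1024] * 512, head_dim=128)

print(full, head_aware, round(head_aware / full, 3))
```

Under these assumed numbers the head-aware cache is a little under 28% of the full cache. The actual saving depends on how many heads are classified as local and on the budgets chosen.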

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could reduce energy consumption for large-scale image generation tasks.
Similar head-aware compression might apply to video generation models that use even larger caches.
Future work could explore adaptive re-classification if patterns shift mid-generation.

Load-bearing premise

Each attention head keeps the same attention pattern type throughout the generation after being identified from early tokens.

What would settle it

Measuring the attention range of heads on early tokens versus much later tokens and finding that many heads change their locality bias significantly.
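One way to run such a check is sketched here, assuming per-query attention rows can be extracted for a head. The metric (attention-weighted distance to the query, normalized by query position) and the names `mean_relative_distance` and `pattern_drift` are illustrative choices, and the rows are synthetic.

```python
from typing import List


def mean_relative_distance(row: List[float]) -> float:
    """Attention-weighted distance from the query, scaled to [0, 1] by query position."""
    q = len(row) - 1
    if q == 0:
        return 0.0
    weighted = sum(w * (q - k) for k, w in enumerate(row))
    return weighted / (sum(row) * q)


def pattern_drift(early_rows: List[List[float]], late_rows: List[List[float]]) -> float:
    """Absolute change in mean relative attention distance between early and late queries."""
    early = sum(map(mean_relative_distance, early_rows)) / len(early_rows)
    late = sum(map(mean_relative_distance, late_rows)) / len(late_rows)
    return abs(early - late)


def uniform_row(q: int) -> List[float]:
    return [1.0 / (q + 1)] * (q + 1)


def peaked_row(q: int) -> List[float]:
    # Almost all mass on the query itself and its predecessor (requires q >= 1).
    return [0.0] * (q - 1) + [0.1, 0.9]


early_qs, late_qs = range(8, 16), range(1000, 1008)
stable = pattern_drift([uniform_row(q) for q in early_qs],
                       [uniform_row(q) for q in late_qs])
shifting = pattern_drift([uniform_row(q) for q in early_qs],
                         [peaked_row(q) for q in late_qs])
print(stable, shifting)
```

A head that stays broad gives a drift near zero, while a head that starts broad and turns local gives a large drift. Many heads with large drift on real attention maps would count against the load-bearing premise.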

Figures

Figures reproduced from arXiv: 2605.20600 by Baoquan Zhang, Guotao Liang, Yunming Ye, Zhiyuan Wen.

**Figure 2.** Visualizing the visual token attention map of the Lumina-mGPT-768 model. The left shows …

**Figure 3.** The illustration of our proposed HeadKV framework. The AR model first constructs an …

**Figure 4.** Qualitative comparison of text-to-image generation under different compression ratio …

**Figure 5.** Results for the partition ratio hyperparameter r_s. Plot of FID-30K against partition ratio r_s (0.2 to 0.8), comparing Stratified Token Eviction with a baseline.

**Figure 6.** Attention distribution across distance bins for query positions 1000–1500.

**Figure 7.** Attention distribution for specific layers, heads, and query positions.

**Figure 8.** Spatial distribution of Top-K tokens for selected queries.

Abstract

Autoregressive (AR) visual generation has achieved remarkable performance but suffers from high memory usage and low throughput, as it requires caching previously generated visual tokens. Recent research has shown that retaining only a few lines of cache tokens can maintain high-quality images while significantly reducing memory usage and improving throughput. However, these methods allocate a fixed budget to each attention head, overlooking the heterogeneity among attention heads, leading to suboptimal memory allocation. In this paper, we observe that attention heads across different layers exhibit diverse attention patterns, where some heads focus on local neighborhoods while others capture broader contextual dependencies. Based on this insight, we propose a novel head-aware key-value (KV) cache compression framework for autoregressive image generation, called HeadKV, which assigns smaller budgets to locality-biased heads and larger budgets to heads with broader attention. A key challenge lies in identifying the type of each attention head to guide cache compression. We further observe that, within the same layer, each head exhibits consistent attention patterns across token positions, i.e., a head's behavior for early tokens remains consistent with that for later tokens. This insight suggests that head types can be identified during the early stage and reused for KV compression throughout generation. Its advantage is that it requires no additional training or dataset-level statistics and generalizes seamlessly across different inputs. Moreover, we design a Stratified Token Eviction strategy to effectively preserve long-range information. Extensive experiments demonstrate its effectiveness across multiple autoregressive image generation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Head-aware KV compression refines cache budgets by head type in AR image models but leans on an early-token consistency assumption that lacks quantitative checks.

The main point is a practical adjustment to KV cache compression in autoregressive image generation. Rather than fixed budgets per head, the method assigns smaller caches to heads that focus locally and larger ones to those with broader attention spans. Head types get identified from early tokens and then reused for the rest of the sequence, paired with a stratified eviction step to hold onto long-range details. No retraining or dataset-wide stats are required, which keeps things lightweight and generalizes across inputs and models.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes HeadKV, a head-aware KV cache compression framework for autoregressive image generation. It observes diverse attention patterns across heads (locality-biased vs. broad-context) within layers, identifies head types from early-token behavior under the assumption of pattern consistency across positions, assigns smaller cache budgets to local heads and larger to broad ones, and introduces a Stratified Token Eviction strategy to preserve long-range information. The approach requires no additional training or dataset statistics and is evaluated on multiple AR image models for memory and throughput gains.

Significance. If the early-token head classification proves stable, the work provides a practical, training-free improvement over fixed-budget KV compression methods by exploiting head heterogeneity. This could yield better memory-quality trade-offs in transformer-based AR visual generation, with the generalization across inputs and lack of retraining as clear strengths. The empirical grounding in attention pattern observations is a positive aspect, though it remains heuristic rather than derived.

major comments (2)

Abstract (paragraph on head identification): The central assumption that 'within the same layer, each head exhibits consistent attention patterns across token positions' (i.e., early-token behavior is representative for later tokens) is load-bearing for the fixed classification and reuse strategy, yet no quantitative validation such as attention similarity metrics, correlation scores, or stability analysis across token positions and layers is reported; this leaves the memory/quality trade-off vulnerable to degradation as context grows in AR generation.
Abstract and method description: The head classification thresholds and per-head cache budgets are treated as free parameters without reported sensitivity analysis, ablation on threshold choices, or explicit values used in experiments; this makes it difficult to assess whether the reported gains are robust or depend on per-model tuning, directly affecting reproducibility of the claimed efficiency improvements.

minor comments (1)

Abstract: The description of the Stratified Token Eviction strategy is high-level; a brief concrete example of how long-range tokens are prioritized versus local ones would improve clarity without altering the core contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and constructive suggestions. We have addressed the concerns regarding the validation of our head consistency assumption and the specification of classification parameters by adding quantitative analyses and explicit values in the revised manuscript. We believe these changes strengthen the paper's claims.

Referee: Abstract (paragraph on head identification): The central assumption that 'within the same layer, each head exhibits consistent attention patterns across token positions' (i.e., early-token behavior is representative for later tokens) is load-bearing for the fixed classification and reuse strategy, yet no quantitative validation such as attention similarity metrics, correlation scores, or stability analysis across token positions and layers is reported; this leaves the memory/quality trade-off vulnerable to degradation as context grows in AR generation.

Authors: We agree that quantitative validation of the consistency assumption is important for robustness. Although our empirical results across various models and inputs demonstrate the effectiveness of the early head identification, we have added in the revised version a dedicated analysis subsection. This includes computing the average cosine similarity between attention distributions for the first 10 tokens and later tokens (e.g., at position 100 and 500) for each head type across layers. The results show high similarity scores (above 0.85 on average), confirming the pattern consistency. We also note that for extremely long sequences, periodic re-identification could be considered as future work. revision: yes
Referee: Abstract and method description: The head classification thresholds and per-head cache budgets are treated as free parameters without reported sensitivity analysis, ablation on threshold choices, or explicit values used in experiments; this makes it difficult to assess whether the reported gains are robust or depend on per-model tuning, directly affecting reproducibility of the claimed efficiency improvements.

Authors: The referee correctly points out the need for explicit parameter values and sensitivity analysis to ensure reproducibility. In the updated manuscript, we have included the specific threshold values used for classifying heads (e.g., locality score threshold of 0.6 for local heads) and the budget allocation ratios (e.g., 20% for local heads, 80% for broad heads) for each evaluated model. Additionally, we performed an ablation study varying the threshold from 0.4 to 0.8 and budget ratios, showing that the memory-quality trade-off remains stable, with performance degradation only at extreme values. These details are now reported in Section 4.2 and the appendix. revision: yes
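The similarity analysis described in the first response could look roughly like the sketch below. Attention rows at different positions have different lengths, so this version assumes they are first pooled into bins by relative distance from the query. That binning step, the bin count, and the function names are assumptions made for illustration and are not details given in the rebuttal.

```python
import math
from typing import List


def distance_histogram(row: List[float], num_bins: int) -> List[float]:
    """Pool a causal attention row into bins of relative distance (0 = query itself)."""
    q = len(row) - 1
    hist = [0.0] * num_bins
    for k, w in enumerate(row):
        rel = (q - k) / max(q, 1)
        hist[min(int(rel * num_bins), num_bins - 1)] += w
    return hist


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def peaked_row(q: int) -> List[float]:
    return [0.0] * (q - 1) + [0.1, 0.9]


def uniform_row(q: int) -> List[float]:
    return [1.0 / (q + 1)] * (q + 1)


# Synthetic rows: an early query (position 9) compared with a late one (position 500).
same_type = cosine(distance_histogram(peaked_row(9), 4),
                   distance_histogram(peaked_row(500), 4))
changed_type = cosine(distance_histogram(uniform_row(9), 4),
                      distance_histogram(peaked_row(500), 4))
print(same_type, changed_type)
```

A head that keeps its pattern scores close to 1, while a head that moves from broad to local scores clearly lower. Averaging such scores per head over many positions would give the kind of stability figure the response refers to.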

Circularity Check

0 steps flagged

No significant circularity in HeadKV framework

The paper presents a heuristic KV compression method grounded in direct empirical observations of attention patterns across heads and token positions. Head-type classification from early tokens is justified by the stated observation that patterns remain consistent within a layer, but this is not a mathematical derivation or equation that reduces to its own inputs by construction. No self-referential definitions, fitted parameters renamed as predictions, or load-bearing self-citation chains appear in the provided text; the approach is self-contained as a practical, observation-driven strategy without tautological reduction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The method adds a classification procedure and eviction heuristic on top of standard transformer KV caching; no new physical entities or fundamental axioms beyond domain assumptions about attention consistency.

free parameters (2)

head classification thresholds: cutoffs used to label heads as locality-biased versus broad-context based on observed attention patterns; values chosen to guide budget assignment.
per-head cache budgets: specific smaller and larger budget sizes assigned after classification; tuned for the quality-memory trade-off.

axioms (1)

domain assumption: attention patterns of each head remain consistent from early to late tokens within a layer. Invoked to justify identifying head type once early and reusing it for the entire generation process.

pith-pipeline@v0.9.0 · 5801 in / 1285 out tokens · 46955 ms · 2026-05-21T06:01:38.397339+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "within the same layer, each head exhibits consistent attention patterns across token positions, i.e., a head's behavior for early tokens remains consistent with that for later tokens"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Stratified Token Eviction strategy to effectively preserve long-range information"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Attention is all you need.Advances in neural information processing systems, 30:I, 2017

Vaswani Ashish. Attention is all you need.Advances in neural information processing systems, 30:I, 2017

work page 2017

[3] [3]

R-kv: Redundancy-aware kv cache compression for training-free reasoning models acceleration.arXiv e-prints, pages arXiv–2505, 2025

Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, et al. R-kv: Redundancy-aware kv cache compression for training-free reasoning models acceleration.arXiv e-prints, pages arXiv–2505, 2025

work page 2025

[4] [4]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

A simple and effective l_2 norm-based strategy for kv cache compression

Alessio Devoto, Yu Zhao, Simone Scardapane, and Pasquale Minervini. A simple and effective l_2 norm-based strategy for kv cache compression. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18476–18499, 2024

work page 2024

[7] [7]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

work page 2021

[8] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550, 2024

[9] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Identify critical kv cache in llm inference from an output perturbation perspective. arXiv preprint arXiv:2502.03805, 2025

[10] Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning. arXiv preprint arXiv:2410.19258, 2024

[11] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023

[12] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

[13] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

[15] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. Advances in Neural Information Processing Systems, 37:1270–1303, 2024

[16] Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024

[17] Hannah Calzi Kleidermacher and James Zou. Science across languages: assessing llm multilingual translation of scientific papers. In Findings of the Association for Computational Linguistics: EACL 2026, pages 3932–3947, 2026

[18] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024

[19] Guotao Liang, Baoquan Zhang, Yaowei Wang, Xutao Li, Yunming Ye, Huaibin Wang, Chuyao Luo, Kola Ye, and Linfeng Luo. Lg-vq: Language-guided codebook learning. Advances in Neural Information Processing Systems, 37:139700–139724, 2024

[20] Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Zihao Han, and Yunming Ye. Improved masked image generation with knowledge-augmented token representations. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6817–6825, 2026

[21] Guotao Liang, Baoquan Zhang, Zhiyuan Wen, Junteng Zhao, Yunming Ye, Kola Ye, and Yao He. Towards improved text-aligned codebook learning: Multi-hierarchical codebook-text alignment with long text. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4060–4069, 2025

[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

[23] Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Minicache: Kv cache compression in depth dimension for large language models. Advances in Neural Information Processing Systems, 37:139997–140031, 2024

[24] Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining, 2024

[25] Weikang Meng, Yadan Luo, Xin Li, Dongmei Jiang, and Zheng Zhang. Polaformer: Polarity-aware linear attention for vision transformers. arXiv preprint arXiv:2501.15061, 2025

[26] Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, and Chris Lott. Keydiff: Key similarity-based kv cache eviction for long-context llm inference in resource-constrained environments. arXiv preprint arXiv:2504.15364, 2025

[27] Ziran Qin, Youru Lv, Mingbao Lin, Zeren Zhang, Chanfan Gan, Tieyuan Chen, and Weiyao Lin. Autoregressive image generation needs only a few lines of cached tokens. arXiv preprint arXiv:2512.04857, 2025

[28] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021

[29] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019

[30] Junhyuk So, Juncheol Shin, Hyunho Kook, and Eunhyeok Park. Grouped speculative decoding for autoregressive image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15375–15384, 2025

[31] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

[32] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[33] Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, et al. D2o: Dynamic discriminative operations for efficient long-context inference of large language models. arXiv preprint arXiv:2406.13035, 2024

[34] Jianxun Wang and Yixiang Chen. A review on code generation with llms: Application and evaluation. In 2023 IEEE International Conference on Medical Artificial Intelligence (MedAI), pages 284–289. IEEE, 2023

[35] Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819, 2024

[36] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

[37] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Advances in neural information processing systems, 35:27168–27183, 2022

[38] Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. Cam: Cache merging for memory-efficient llms inference. In Forty-first international conference on machine learning, 2024

[39] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

A Appendix

A.1 Limitation and Future Work. While Hea...