pith. machine review for the scientific record.

arxiv: 2310.01801 · v4 · submitted 2023-10-03 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 11:06 UTC · model grok-4.3

classification 💻 cs.CL
keywords KV cache compression · adaptive inference · attention head profiling · LLM memory reduction · generative inference · plug-and-play optimization

The pith

LLMs can cut KV cache memory by profiling attention heads once and evicting tokens selectively per head type.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a lightweight profiling pass on attention modules reveals stable head structures—local, special-token focused, or global—and that these structures can guide an adaptive KV cache. For local heads the method evicts long-range tokens, for special-token heads it drops non-special tokens, and for global heads it keeps the full cache. This construction happens without any model updates or retraining. The result is large GPU memory savings during generation across tasks while output quality stays nearly unchanged. Such compression matters because the KV cache is the dominant memory consumer in long-context inference and currently limits model scale on fixed hardware.
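
To make the mechanism concrete, here is a minimal sketch of the profiling step as the pith describes it, not the paper's released code: take one head's attention map over a profiling prompt, measure how much probability mass falls inside a recent local window versus on special-token positions, and label the head accordingly. The function name, window size, and coverage threshold are illustrative assumptions.

    import torch

    def classify_head(attn: torch.Tensor, special_pos: list[int],
                      local_window: int = 64, coverage: float = 0.95) -> str:
        """Label one attention head from a single profiling pass.

        attn: [query_len, key_len] attention probabilities for this head.
        special_pos: key positions of special tokens (e.g. BOS, separators).
        Returns 'local', 'special', or 'global' (illustrative labels).
        """
        q_len, k_len = attn.shape

        # Mass each query places on keys inside its recent local window.
        keys = torch.arange(k_len).unsqueeze(0)        # [1, key_len]
        queries = torch.arange(q_len).unsqueeze(1)     # [query_len, 1]
        local_mask = keys > (queries - local_window)
        local_mass = (attn * local_mask).sum(dim=-1).mean()

        # Mass placed on special-token positions.
        special_mask = torch.zeros(k_len, dtype=torch.bool)
        special_mask[torch.tensor(special_pos, dtype=torch.long)] = True
        special_mass = attn[:, special_mask].sum(dim=-1).mean()

        if local_mass >= coverage:
            return "local"      # long-range keys are evictable
        if local_mass + special_mass >= coverage:
            return "special"    # non-special tokens are evictable
        return "global"         # keep the standard full cache

A profiling pass of this kind would be applied once to every head of every layer, and the resulting labels stored before generation begins.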

Core claim

By conducting targeted profiling to discern the intrinsic structure of attention modules, the KV cache can be built adaptively: evicting long-range contexts on heads that emphasize local contexts, discarding non-special tokens on heads centered on special tokens, and retaining the standard cache only for heads that attend broadly to all tokens. The lightweight profiling step guides this construction so that the resulting FastGen method deploys without fine-tuning and yields substantial GPU memory reduction with negligible generation quality loss.

What carries the argument

Single-pass attention profiling that classifies heads into local-context, special-token, or global types and applies a matching token-eviction rule to the KV cache for each type.
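
A matching sketch of the eviction side: given a head's label, keep only the positions that head type is assumed to need. The retained window length and the function name are assumptions for illustration, not the paper's exact recipe.

    import torch

    def compress_kv(keys: torch.Tensor, values: torch.Tensor, head_type: str,
                    special_pos: list[int], local_window: int = 64
                    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Apply a type-specific eviction rule to one head's KV cache.

        keys, values: [seq_len, head_dim] cached entries for a single head.
        """
        seq_len = keys.shape[0]
        if head_type == "global":
            keep = torch.arange(seq_len)    # standard full cache, nothing evicted
        elif head_type == "local":
            keep = torch.arange(max(0, seq_len - local_window), seq_len)
        elif head_type == "special":
            keep = torch.tensor(sorted(special_pos), dtype=torch.long)  # drop non-special tokens
        else:
            raise ValueError(f"unknown head type: {head_type}")
        return keys[keep], values[keep]

Global heads keep everything while local and special heads retain only a slice, so the realized memory saving depends on how the labels are distributed across layers and heads.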

Load-bearing premise

The head structures found by one profiling pass stay consistent enough across tasks and inputs to let the eviction rules discard tokens safely without hurting output quality.

What would settle it

Run the adaptive cache on a held-out task or longer prompt and measure whether quality metrics fall more than a few percent below the full-cache baseline.
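
A minimal sketch of that check, assuming one can score the same held-out prompts under both cache configurations; the aggregation, the 3% tolerance, and the names are illustrative.

    def within_tolerance(full_scores: list[float], adaptive_scores: list[float],
                         tolerance: float = 0.03) -> bool:
        """True if the adaptive-cache quality stays within `tolerance`
        (relative) of the full-cache baseline on a held-out task."""
        full = sum(full_scores) / len(full_scores)
        adaptive = sum(adaptive_scores) / len(adaptive_scores)
        relative_drop = (full - adaptive) / max(full, 1e-8)
        return relative_drop <= tolerance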

read the original abstract

In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Different from the conventional KV cache that retains key and value vectors for all context tokens, we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens. Moreover, with the lightweight attention profiling used to guide the construction of the adaptive KV cache, FastGen can be deployed without resource-intensive fine-tuning or re-training. In our experiments across various asks, FastGen demonstrates substantial reduction on GPU memory consumption with negligible generation quality loss. We will release our code and the compatible CUDA kernel for reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FastGen, a plug-and-play adaptive KV cache compression method for LLMs. It performs lightweight profiling to classify attention heads into local-context, special-token, and global types, then applies type-specific eviction rules: discarding long-range contexts for local heads, non-special tokens for special heads, and retaining the full cache only for global heads. The approach requires no fine-tuning or retraining and claims substantial GPU memory reduction with negligible generation quality loss across tasks, with code and CUDA kernel to be released.

Significance. If the quality preservation holds, the work would be significant for efficient LLM inference, enabling longer contexts on limited hardware via a training-free method that exploits intrinsic attention-head specialization. The lightweight profiling and reproducibility commitments (code + kernel) strengthen the practical contribution.

major comments (2)
  1. [Method (profiling and adaptive construction)] The method classifies heads once via a single lightweight profiling pass and then applies fixed eviction rules for the remainder of generation. No analysis or ablation is provided showing that these head types (and thus the eviction policy) remain stable as the context window expands autoregressively or across domain shifts; attention patterns can change with newly generated tokens, directly risking violation of the 'negligible quality loss' guarantee.
  2. [Abstract and Experiments] The abstract and experimental claims assert 'substantial reduction on GPU memory consumption with negligible generation quality loss' yet the manuscript provides no quantitative metrics, specific baselines, task details, or error analysis to support the magnitude of savings or the 'negligible' qualifier; this leaves the central empirical claim unverifiable from the reported evidence.
minor comments (2)
  1. [Abstract] Abstract: 'various asks' appears to be a typo for 'various tasks'.
  2. [Abstract] Abstract: 'substantial reduction on GPU memory' should read 'substantial reduction in GPU memory' or 'of GPU memory consumption'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, indicating where we agree and plan revisions to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Method (profiling and adaptive construction)] The method classifies heads once via a single lightweight profiling pass and then applies fixed eviction rules for the remainder of generation. No analysis or ablation is provided showing that these head types (and thus the eviction policy) remain stable as the context window expands autoregressively or across domain shifts; attention patterns can change with newly generated tokens, directly risking violation of the 'negligible quality loss' guarantee.

    Authors: We appreciate the referee highlighting this aspect of the method. The classification is indeed performed once via lightweight profiling on an initial context segment, after which the eviction rules are applied statically. While our experiments across long-context tasks (including autoregressive generation over thousands of tokens) empirically support negligible quality impact, we acknowledge the absence of a dedicated stability analysis. In the revision, we will add an ablation examining head-type consistency as context grows and across domain shifts to directly address this concern. revision: yes

  2. Referee: [Abstract and Experiments] The abstract and experimental claims assert 'substantial reduction on GPU memory consumption with negligible generation quality loss' yet the manuscript provides no quantitative metrics, specific baselines, task details, or error analysis to support the magnitude of savings or the 'negligible' qualifier; this leaves the central empirical claim unverifiable from the reported evidence.

    Authors: We agree that the abstract would benefit from more explicit quantitative support to make the claims verifiable at a glance. The full manuscript contains experimental results with concrete memory savings, task-specific quality metrics, and comparisons to baselines. We will revise the abstract to incorporate key quantitative findings and ensure the experiments section provides clearer task details, baselines, and supporting analysis or error metrics. revision: yes

Circularity Check

0 steps flagged

No significant circularity; policy derived from direct empirical profiling

full rationale

The paper's core method performs a single lightweight profiling pass over attention heads to classify them into local-context, special-token, or global-attention categories, then applies fixed eviction rules accordingly. This is presented as an empirical observation step that directly informs the KV cache construction without any equations, fitted parameters, or self-referential definitions that would make the output equivalent to the input by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify the classification rules. The approach remains self-contained as a measurement-driven heuristic rather than a closed derivation loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that attention heads exhibit stable, categorizable behaviors that can be profiled once and used for eviction decisions; no free parameters or new invented entities are introduced.

axioms (1)
  • domain assumption Transformer attention heads exhibit distinct and stable patterns that can be reliably categorized as local-context, special-token, or broad-attention.
    Invoked to justify the targeted profiling step and the subsequent adaptive cache construction.
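
One way to probe that assumption, sketched here as a hypothetical check rather than an experiment from the paper, is to profile the same model twice, on different prompts or early versus late in generation, and measure how often each head keeps its label.

    def label_agreement(labels_a: dict[int, str], labels_b: dict[int, str]) -> float:
        """Fraction of heads assigned the same type by two profiling passes,
        e.g. on two different prompts or at two points during generation."""
        shared = labels_a.keys() & labels_b.keys()
        if not shared:
            return 0.0
        unchanged = sum(labels_a[h] == labels_b[h] for h in shared)
        return unchanged / len(shared)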

pith-pipeline@v0.9.0 · 5472 in / 1379 out tokens · 40933 ms · 2026-05-17T11:06:14.885468+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

    cs.PF 2026-04 unverdicted novelty 7.0

    HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.

  2. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    cs.CV 2024-10 accept novelty 7.0

    PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

  3. ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

    cs.CL 2026-05 conditional novelty 6.0

    ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...

  4. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...

  5. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

  6. WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.

  7. DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

    cs.CL 2026-04 unverdicted novelty 6.0

    DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.

  8. SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing

    cs.DB 2026-04 unverdicted novelty 6.0

    SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% toke...

  9. TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

    cs.DC 2026-04 unverdicted novelty 6.0

    TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.

  10. Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

    cs.LG 2026-04 unverdicted novelty 6.0

    Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.

  11. MoBA: Mixture of Block Attention for Long-Context LLMs

    cs.LG 2025-02 unverdicted novelty 6.0

    MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.

  12. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

    cs.CL 2025-02 unverdicted novelty 6.0

    NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.

  13. When Attention Sink Emerges in Language Models: An Empirical View

    cs.CL 2024-10 accept novelty 6.0

    Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.

  14. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

    cs.CL 2024-07 accept novelty 6.0

    Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on l...

  15. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    cs.CL 2024-06 conditional novelty 6.0

    PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.

  16. SnapKV: LLM Knows What You are Looking for Before Generation

    cs.CL 2024-04 conditional novelty 6.0

    SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable pe...

  17. Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

    cs.CV 2026-04 unverdicted novelty 5.0

    A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolL...

  18. Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

    cs.CL 2026-02 unverdicted novelty 5.0

    Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.

  19. From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

    cs.IR 2025-04 unverdicted novelty 5.0

    The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 18 Pith papers · 17 internal anchors
