pith. machine review for the scientific record.

arxiv: 2310.01801 · v4 · submitted 2023-10-03 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 11:06 UTC · model grok-4.3

classification 💻 cs.CL
keywords KV cache compression · adaptive inference · attention head profiling · LLM memory reduction · generative inference · plug-and-play optimization

The pith

LLMs can cut KV cache memory by profiling attention heads once and evicting tokens selectively per head type.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a lightweight profiling pass on attention modules reveals stable head structures—local, special-token focused, or global—and that these structures can guide an adaptive KV cache. For local heads the method evicts long-range tokens, for special-token heads it drops non-special tokens, and for global heads it keeps the full cache. This construction happens without any model updates or retraining. The result is large GPU memory savings during generation across tasks while output quality stays nearly unchanged. Such compression matters because the KV cache is the dominant memory consumer in long-context inference and currently limits model scale on fixed hardware.
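
To make the mechanism concrete, here is a minimal sketch of the profiling step as the pith describes it, not the paper's released code: take one head's attention map over a profiling prompt, measure how much probability mass falls inside a recent local window versus on special-token positions, and label the head accordingly. The function name, window size, and coverage threshold are illustrative assumptions.

    import torch

    def classify_head(attn: torch.Tensor, special_pos: list[int],
                      local_window: int = 64, coverage: float = 0.95) -> str:
        """Label one attention head from a single profiling pass.

        attn: [query_len, key_len] attention probabilities for this head.
        special_pos: key positions of special tokens (e.g. BOS, separators).
        Returns 'local', 'special', or 'global' (illustrative labels).
        """
        q_len, k_len = attn.shape

        # Mass each query places on keys inside its recent local window.
        keys = torch.arange(k_len).unsqueeze(0)        # [1, key_len]
        queries = torch.arange(q_len).unsqueeze(1)     # [query_len, 1]
        local_mask = keys > (queries - local_window)
        local_mass = (attn * local_mask).sum(dim=-1).mean()

        # Mass placed on special-token positions.
        special_mask = torch.zeros(k_len, dtype=torch.bool)
        special_mask[torch.tensor(special_pos, dtype=torch.long)] = True
        special_mass = attn[:, special_mask].sum(dim=-1).mean()

        if local_mass >= coverage:
            return "local"      # long-range keys are evictable
        if local_mass + special_mass >= coverage:
            return "special"    # non-special tokens are evictable
        return "global"         # keep the standard full cache

A profiling pass of this kind would be applied once to every head of every layer, and the resulting labels stored before generation begins.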

Core claim

By conducting targeted profiling to discern the intrinsic structure of attention modules, the KV cache can be built adaptively: evicting long-range contexts on heads that emphasize local contexts, discarding non-special tokens on heads centered on special tokens, and retaining the standard cache only for heads that attend broadly to all tokens. The lightweight profiling step guides this construction so that the resulting FastGen method deploys without fine-tuning and yields substantial GPU memory reduction with negligible generation quality loss.

What carries the argument

Single-pass attention profiling that classifies heads into local-context, special-token, or global types and applies a matching token-eviction rule to the KV cache for each type.
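
A matching sketch of the eviction side: given a head's label, keep only the positions that head type is assumed to need. The retained window length and the function name are assumptions for illustration, not the paper's exact recipe.

    import torch

    def compress_kv(keys: torch.Tensor, values: torch.Tensor, head_type: str,
                    special_pos: list[int], local_window: int = 64
                    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Apply a type-specific eviction rule to one head's KV cache.

        keys, values: [seq_len, head_dim] cached entries for a single head.
        """
        seq_len = keys.shape[0]
        if head_type == "global":
            keep = torch.arange(seq_len)    # standard full cache, nothing evicted
        elif head_type == "local":
            keep = torch.arange(max(0, seq_len - local_window), seq_len)
        elif head_type == "special":
            keep = torch.tensor(sorted(special_pos), dtype=torch.long)  # drop non-special tokens
        else:
            raise ValueError(f"unknown head type: {head_type}")
        return keys[keep], values[keep]

Global heads keep everything while local and special heads retain only a slice, so the realized memory saving depends on how the labels are distributed across layers and heads.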

Load-bearing premise

The head structures found by one profiling pass stay consistent enough across tasks and inputs to let the eviction rules discard tokens safely without hurting output quality.

What would settle it

Run the adaptive cache on a held-out task or longer prompt and measure whether quality metrics fall more than a few percent below the full-cache baseline.
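
A minimal sketch of that check, assuming one can score the same held-out prompts under both cache configurations; the aggregation, the 3% tolerance, and the names are illustrative.

    def within_tolerance(full_scores: list[float], adaptive_scores: list[float],
                         tolerance: float = 0.03) -> bool:
        """True if the adaptive-cache quality stays within `tolerance`
        (relative) of the full-cache baseline on a held-out task."""
        full = sum(full_scores) / len(full_scores)
        adaptive = sum(adaptive_scores) / len(adaptive_scores)
        relative_drop = (full - adaptive) / max(full, 1e-8)
        return relative_drop <= tolerance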

read the original abstract

In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Different from the conventional KV cache that retains key and value vectors for all context tokens, we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens. Moreover, with the lightweight attention profiling used to guide the construction of the adaptive KV cache, FastGen can be deployed without resource-intensive fine-tuning or re-training. In our experiments across various asks, FastGen demonstrates substantial reduction on GPU memory consumption with negligible generation quality loss. We will release our code and the compatible CUDA kernel for reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FastGen, a plug-and-play adaptive KV cache compression method for LLMs. It performs lightweight profiling to classify attention heads into local-context, special-token, and global types, then applies type-specific eviction rules: discarding long-range contexts for local heads, non-special tokens for special heads, and retaining the full cache only for global heads. The approach requires no fine-tuning or retraining and claims substantial GPU memory reduction with negligible generation quality loss across tasks, with code and CUDA kernel to be released.

Significance. If the quality preservation holds, the work would be significant for efficient LLM inference, enabling longer contexts on limited hardware via a training-free method that exploits intrinsic attention-head specialization. The lightweight profiling and reproducibility commitments (code + kernel) strengthen the practical contribution.

major comments (2)
  1. [Method (profiling and adaptive construction)] The method classifies heads once via a single lightweight profiling pass and then applies fixed eviction rules for the remainder of generation. No analysis or ablation is provided showing that these head types (and thus the eviction policy) remain stable as the context window expands autoregressively or across domain shifts; attention patterns can change with newly generated tokens, directly risking violation of the 'negligible quality loss' guarantee.
  2. [Abstract and Experiments] The abstract and experimental claims assert 'substantial reduction on GPU memory consumption with negligible generation quality loss' yet the manuscript provides no quantitative metrics, specific baselines, task details, or error analysis to support the magnitude of savings or the 'negligible' qualifier; this leaves the central empirical claim unverifiable from the reported evidence.
minor comments (2)
  1. [Abstract] Abstract: 'various asks' appears to be a typo for 'various tasks'.
  2. [Abstract] Abstract: 'substantial reduction on GPU memory' should read 'substantial reduction in GPU memory' or 'of GPU memory consumption'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, indicating where we agree and plan revisions to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Method (profiling and adaptive construction)] The method classifies heads once via a single lightweight profiling pass and then applies fixed eviction rules for the remainder of generation. No analysis or ablation is provided showing that these head types (and thus the eviction policy) remain stable as the context window expands autoregressively or across domain shifts; attention patterns can change with newly generated tokens, directly risking violation of the 'negligible quality loss' guarantee.

    Authors: We appreciate the referee highlighting this aspect of the method. The classification is indeed performed once via lightweight profiling on an initial context segment, after which the eviction rules are applied statically. While our experiments across long-context tasks (including autoregressive generation over thousands of tokens) empirically support negligible quality impact, we acknowledge the absence of a dedicated stability analysis. In the revision, we will add an ablation examining head-type consistency as context grows and across domain shifts to directly address this concern. revision: yes

  2. Referee: [Abstract and Experiments] The abstract and experimental claims assert 'substantial reduction on GPU memory consumption with negligible generation quality loss' yet the manuscript provides no quantitative metrics, specific baselines, task details, or error analysis to support the magnitude of savings or the 'negligible' qualifier; this leaves the central empirical claim unverifiable from the reported evidence.

    Authors: We agree that the abstract would benefit from more explicit quantitative support to make the claims verifiable at a glance. The full manuscript contains experimental results with concrete memory savings, task-specific quality metrics, and comparisons to baselines. We will revise the abstract to incorporate key quantitative findings and ensure the experiments section provides clearer task details, baselines, and supporting analysis or error metrics. revision: yes

Circularity Check

0 steps flagged

No significant circularity; policy derived from direct empirical profiling

full rationale

The paper's core method performs a single lightweight profiling pass over attention heads to classify them into local-context, special-token, or global-attention categories, then applies fixed eviction rules accordingly. This is presented as an empirical observation step that directly informs the KV cache construction without any equations, fitted parameters, or self-referential definitions that would make the output equivalent to the input by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify the classification rules. The approach remains self-contained as a measurement-driven heuristic rather than a closed derivation loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that attention heads exhibit stable, categorizable behaviors that can be profiled once and used for eviction decisions; no free parameters or new invented entities are introduced.

axioms (1)
  • domain assumption Transformer attention heads exhibit distinct and stable patterns that can be reliably categorized as local-context, special-token, or broad-attention.
    Invoked to justify the targeted profiling step and the subsequent adaptive cache construction.
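
One way to probe that assumption, sketched here as a hypothetical check rather than an experiment from the paper, is to profile the same model twice, on different prompts or early versus late in generation, and measure how often each head keeps its label.

    def label_agreement(labels_a: dict[int, str], labels_b: dict[int, str]) -> float:
        """Fraction of heads assigned the same type by two profiling passes,
        e.g. on two different prompts or at two points during generation."""
        shared = labels_a.keys() & labels_b.keys()
        if not shared:
            return 0.0
        unchanged = sum(labels_a[h] == labels_b[h] for h in shared)
        return unchanged / len(shared)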

pith-pipeline@v0.9.0 · 5472 in / 1379 out tokens · 40933 ms · 2026-05-17T11:06:14.885468+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

    cs.PF 2026-04 unverdicted novelty 7.0

    HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.

  2. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    cs.CV 2024-10 accept novelty 7.0

    PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

  3. ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

    cs.CL 2026-05 conditional novelty 6.0

    ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...

  4. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...

  5. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

  6. WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.

  7. DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

    cs.CL 2026-04 unverdicted novelty 6.0

    DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.

  8. SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing

    cs.DB 2026-04 unverdicted novelty 6.0

    SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% toke...

  9. TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

    cs.DC 2026-04 unverdicted novelty 6.0

    TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.

  10. Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

    cs.LG 2026-04 unverdicted novelty 6.0

    Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.

  11. MoBA: Mixture of Block Attention for Long-Context LLMs

    cs.LG 2025-02 unverdicted novelty 6.0

    MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.

  12. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

    cs.CL 2025-02 unverdicted novelty 6.0

    NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.

  13. When Attention Sink Emerges in Language Models: An Empirical View

    cs.CL 2024-10 accept novelty 6.0

    Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.

  14. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

    cs.CL 2024-07 accept novelty 6.0

    Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on l...

  15. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    cs.CL 2024-06 conditional novelty 6.0

    PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.

  16. SnapKV: LLM Knows What You are Looking for Before Generation

    cs.CL 2024-04 conditional novelty 6.0

    SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable pe...

  17. Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

    cs.CV 2026-04 unverdicted novelty 5.0

    A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolL...

  18. Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

    cs.CL 2026-02 unverdicted novelty 5.0

    Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.

  19. From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

    cs.IR 2025-04 unverdicted novelty 5.0

    The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 18 Pith papers · 17 internal anchors
