arxiv: 2306.14048 · v3 · pith:MH2XVHQOnew · submitted 2023-06-24 · 💻 cs.LG

H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen

Pith reviewed 2026-05-17 17:53 UTC · model grok-4.3

classification 💻 cs.LG

keywords: KV cache, heavy hitters, LLM inference, attention, eviction policy, generative models, memory optimization, throughput

The pith

Heavy-hitter tokens that dominate attention let LLMs run with a much smaller KV cache and up to 29 times higher throughput.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models keep a KV cache whose size grows with sequence length and batch size and can quickly exhaust GPU memory during long generation. The paper identifies a small subset of tokens, called heavy hitters, that contribute most of the value when attention scores are computed. According to the paper, these tokens emerge naturally, their emergence correlates strongly with frequent token co-occurrence in the text, and removing them significantly degrades output quality. H₂O is an eviction policy that keeps both recent tokens and these heavy hitters; the eviction problem is formulated as a dynamic submodular optimization problem with a guarantee proven under mild assumptions. Accuracy is validated on OPT, LLaMA, and GPT-NeoX, and with 20 percent heavy hitters the implementation reports throughput gains of up to 29 times over existing inference engines.
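The linear growth of the cache can be made concrete with a little arithmetic. The sketch below assumes standard multi-head attention (one key and one value vector of width `hidden_size` per token per layer) and 2-byte values; the function names and the model shape in the usage note are illustrative assumptions, not figures taken from the paper.

```python
def kv_cache_bytes(n_layers, hidden_size, seq_len, batch_size, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token.

    Assumes plain multi-head attention; bytes_per_value=2 corresponds to fp16.
    """
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * n_layers * hidden_size * seq_len * batch_size * bytes_per_value


def retained_bytes(full_bytes, keep_ratio):
    """Approximate cache size if only a fraction of the tokens is kept."""
    return round(full_bytes * keep_ratio)


if __name__ == "__main__":
    # Hypothetical shape: 32 layers, hidden size 4096, 2048 tokens, batch of 16.
    full = kv_cache_bytes(32, 4096, 2048, 16)
    print(f"full cache: {full / 2**30:.1f} GiB")
    print(f"20% kept:   {retained_bytes(full, 0.2) / 2**30:.1f} GiB")
```

Under these assumed numbers the full cache is 16 GiB, which is why the cache, rather than the weights alone, can become the limiting factor at long sequence lengths and large batches.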

Core claim

A modest number of heavy-hitter tokens account for most attention value in transformer generation; retaining a balanced mix of these tokens and recent ones via the H2O policy preserves generation quality while allowing the rest of the KV cache to be evicted.

What carries the argument

Heavy-Hitter Oracle (H₂O) eviction policy that dynamically keeps recent tokens together with heavy hitters identified by their contribution to attention scores.
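A minimal sketch of a policy of this kind, assuming the heavy-hitter score is the attention mass a cached token has accumulated over decoding steps and that, when the cache is over budget, the lowest-scoring token outside the recent window is evicted. The function names, the single-head setting, and the toy attention matrix are illustrative assumptions; this is not the authors' implementation, and it does not renormalize attention over the kept tokens as a real decoder would.

```python
import numpy as np


def simulate_eviction(attn, heavy_budget, recent_budget):
    """Return the cached positions after processing every step.

    attn[t, j] is the attention weight that query t gives to key j (j <= t).
    The cache holds at most heavy_budget + recent_budget positions.
    """
    kept = []   # positions currently cached, in ascending order
    acc = {}    # position -> attention mass accumulated so far
    budget = heavy_budget + recent_budget
    for t in range(attn.shape[0]):
        kept.append(t)
        acc[t] = 0.0
        # Accumulate the attention that step t pays to each cached token.
        for j in kept:
            acc[j] += float(attn[t, j])
        if len(kept) > budget:
            # The most recent `recent_budget` positions are protected.
            candidates = kept[:-recent_budget] if recent_budget > 0 else kept
            victim = min(candidates, key=lambda j: acc[j])
            kept.remove(victim)
            del acc[victim]
    return kept


def toy_attention(n_tokens):
    """Toy causal attention in which position 0 receives most of the mass."""
    attn = np.zeros((n_tokens, n_tokens))
    attn[0, 0] = 1.0
    for t in range(1, n_tokens):
        attn[t, 0] = 0.9
        attn[t, t] = 0.1
    return attn


if __name__ == "__main__":
    print(simulate_eviction(toy_attention(6), heavy_budget=1, recent_budget=2))
    print(simulate_eviction(toy_attention(6), heavy_budget=0, recent_budget=3))
```

With one heavy-hitter slot the toy run keeps position 0 alongside the two newest positions; with no heavy-hitter slots the rule degenerates to a sliding window and position 0 is lost.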

If this is right

Using 20 percent heavy hitters raises throughput by up to 29 times versus DeepSpeed Zero-Inference and Hugging Face Accelerate and 3 times versus FlexGen on OPT-6.7B and OPT-30B.
At the same batch size, latency drops by up to 1.9 times.
The approach works across OPT, LLaMA, and GPT-NeoX on diverse tasks.
The submodular formulation supplies a theoretical guarantee that can guide later eviction methods.
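One plausible route from a smaller cache to higher throughput is batch-size headroom: if each sequence needs less cache memory, more sequences fit beside the weights. The sketch below illustrates only that arithmetic. The function name and every number in the usage example are hypothetical, and the page does not break down where the reported gains come from.

```python
def max_batch_size(gpu_bytes, weight_bytes, kv_bytes_per_sequence):
    """Largest batch whose KV cache fits in the memory left after the weights."""
    free = gpu_bytes - weight_bytes
    if free <= 0 or kv_bytes_per_sequence <= 0:
        return 0
    return free // kv_bytes_per_sequence


if __name__ == "__main__":
    GIB = 1 << 30
    # Hypothetical setup: 24 GiB device, 13 GiB of weights, 1 GiB of cache per sequence.
    full = max_batch_size(24 * GIB, 13 * GIB, 1 * GIB)
    # Same setup with one fifth of the cache retained per sequence.
    reduced = max_batch_size(24 * GIB, 13 * GIB, GIB // 5)
    print(full, reduced)
```

In this made-up configuration the feasible batch grows from 11 to 55 sequences; real gains also depend on compute, offloading, and scheduling, which the sketch ignores.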

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar heavy-hitter patterns may appear in other memory-bound transformer components such as feed-forward layers.
The eviction rule could be combined with quantization or sparsity techniques for further memory savings.
Task-specific or layer-wise tuning of the heavy-hitter ratio might yield additional gains without retraining.

Load-bearing premise

Heavy hitters emerge naturally, their emergence correlates strongly with frequent token co-occurrence, and their removal produces large drops in generation quality.

What would settle it

A controlled run in which heavy hitters are evicted yet output quality stays unchanged would undercut the premise that they must be retained.

Original abstract

Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H$_2$). Through a comprehensive investigation, we find that (i) the emergence of H$_2$ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H$_2$O with 20% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29$\times$, 29$\times$, and 3$\times$ on OPT-6.7B and OPT-30B. With the same batch size, H2O can reduce the latency by up to 1.9$\times$. The code is available at https://github.com/FMInference/H2O.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

H2O shows you can evict most KV cache entries by keeping recent tokens plus attention-heavy ones and still get large throughput gains on OPT models, if the accuracy holds up.

The main thing to know is that this paper proposes keeping only recent tokens plus a small fraction of heavy hitters in the KV cache, identified by their attention contributions, which lets them cut memory and get large throughput gains on OPT models. They observe that heavy hitters emerge naturally and link to frequent token co-occurrences, and that dropping them degrades performance. From there they build an eviction policy framed as a dynamic submodular problem, with a proof under mild assumptions. Tests cover OPT, LLaMA, and GPT-NeoX, and they report up to 29 times better throughput than DeepSpeed Zero-Inference and Hugging Face Accelerate, and up to 3 times better than FlexGen, on the 6.7B and 30B OPT models, plus lower latency at the same batch size. Code is released.

The submodular formulation and guarantee are the freshest part here, even if attention sparsity has been noted before. The practical speedups are the real draw if the quality holds.

Where it could be tighter is the choice of 20 percent retention. It seems tuned to the results rather than fixed in advance, and the abstract lacks error bars or full ablation details. The correlation between heavy hitters and co-occurrence is presented as a key insight, but its strength across tasks and models would need close checking in the experiments to make sure the accuracy doesn't slip more than claimed.

This is useful for anyone building or optimizing inference systems for long-context generation. Engineers looking to stretch existing hardware would find the policy and numbers relevant. Theorists might engage with the proof. It should go to peer review. The work is grounded enough in experiments and has a novel algorithmic angle that merits referee input, even with some polishing needed on the empirical side.

Referee Report

3 major / 2 minor

Summary. The paper claims that a small subset of tokens ('Heavy Hitters' or H₂) dominate attention scores in LLMs, that they emerge naturally and correlate strongly with frequent token co-occurrences, and that their removal causes significant performance degradation. It proposes H₂O, a KV cache eviction policy that dynamically retains a balance of recent tokens and H₂ tokens (implemented at a 20% heavy-hitter ratio), formulates the eviction as a dynamic submodular optimization problem, and proves a theoretical guarantee under mild assumptions. The method is validated on OPT, LLaMA, and GPT-NeoX models across tasks and reports throughput gains of up to 29× over DeepSpeed Zero-Inference and Hugging Face Accelerate and up to 3× over FlexGen on OPT-6.7B/30B, plus up to 1.9× latency reduction at fixed batch size.

Significance. If the accuracy preservation holds at the reported cache sizes, the work addresses a practical bottleneck in long-context LLM deployment by reducing KV cache memory footprint while delivering substantial throughput and latency improvements. The submodular formulation with a theoretical guarantee and the open-sourced implementation are notable strengths that could guide future cache-management research.

major comments (3)

[Abstract] Abstract: The headline throughput claims (up to 29× over DeepSpeed Zero-Inference and Hugging Face Accelerate and 3× over FlexGen on OPT-6.7B and OPT-30B) require that eviction at the 20% heavy-hitter ratio preserves generation quality comparable to the full-cache baseline. The abstract reports validation across models but provides no error bars, specific task metrics, or direct accuracy comparisons with the full KV cache; without these, the speedups risk being a quality–speed trade-off rather than a pure efficiency win.
[Abstract] Abstract / Empirical section: The central premise that 'the emergence of H₂ is natural and strongly correlates with the frequent co-occurrence of tokens' and that 'removing them results in significant performance degradation' is observational. This correlation should be quantified (e.g., via co-occurrence statistics, correlation coefficients, or ablation tables showing degradation magnitude across tasks) because a weaker or task-dependent correlation would undermine the accuracy-preservation claim that supports the reported speedups.
[Theoretical Analysis] Theoretical Analysis: The submodular formulation and guarantee under 'mild assumptions' is a strength, but the assumptions must be explicitly enumerated and their validity verified in the experimental regimes (e.g., for long sequences on OPT-30B). If the assumptions do not hold in the evaluated settings, the guarantee does not automatically protect the accuracy of the 20%-retention policy.
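The quantification requested in the second major comment could take the form of a rank correlation between a per-token co-occurrence count and the attention mass each token accumulates. The sketch below is one hedged way to compute it; the helper names and the toy arrays are hypothetical, ties are broken by position rather than averaged, and nothing here reproduces the paper's measurements.

```python
import numpy as np


def rank_positions(values):
    """Ranks 0..n-1 by ascending value (ties broken by position; fine for a sketch)."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(rank_positions(x), rank_positions(y))[0, 1])


if __name__ == "__main__":
    # Hypothetical per-token statistics for five tokens.
    cooccurrence_counts = [120, 15, 300, 42, 7]
    accumulated_attention = [3.1, 0.4, 9.8, 1.2, 0.1]
    print(spearman(cooccurrence_counts, accumulated_attention))
```

A coefficient reported per task and per model, next to the ablation tables, would show whether the relationship is uniformly strong or task-dependent.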

minor comments (2)

[Abstract] The abstract states validation 'across a wide range of tasks' without naming the tasks or providing summary statistics; adding this information would improve clarity.
Figures and tables reporting throughput and accuracy should include error bars or standard deviations over multiple runs to convey robustness, especially given the post-hoc selection of the 20% ratio.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity in the abstract, strengthening the empirical support for our core observations, and making the theoretical assumptions more explicit. We address each point below and have made revisions to the manuscript where appropriate.

Referee: [Abstract] Abstract: The headline throughput claims (up to 29× over DeepSpeed Zero-Inference and Hugging Face Accelerate and 3× over FlexGen on OPT-6.7B and OPT-30B) require that eviction at the 20% heavy-hitter ratio preserves generation quality comparable to the full-cache baseline. The abstract reports validation across models but provides no error bars, specific task metrics, or direct accuracy comparisons with the full KV cache; without these, the speedups risk being a quality–speed trade-off rather than a pure efficiency win.

Authors: We agree that the abstract should more explicitly convey that the reported speedups are achieved while preserving accuracy. In the revised manuscript, we have updated the abstract to state that H₂O at the 20% heavy-hitter ratio maintains generation quality comparable to the full KV cache, with details and direct comparisons provided in the empirical evaluation section. We have also incorporated error bars and specific task metrics (e.g., perplexity and accuracy on benchmarks) into the relevant figures and tables to facilitate these comparisons.

Revision: yes

Referee: [Abstract] Abstract / Empirical section: The central premise that 'the emergence of H₂ is natural and strongly correlates with the frequent co-occurrence of tokens' and that 'removing them results in significant performance degradation' is observational. This correlation should be quantified (e.g., via co-occurrence statistics, correlation coefficients, or ablation tables showing degradation magnitude across tasks) because a weaker or task-dependent correlation would undermine the accuracy-preservation claim that supports the reported speedups.

Authors: The manuscript already includes ablation studies demonstrating performance degradation when H₂ tokens are removed. To address the request for quantification, we have added co-occurrence statistics and correlation coefficients between heavy-hitter tokens and frequent token co-occurrences in the revised empirical section. We have also expanded the ablation tables to report degradation magnitudes across tasks, providing stronger quantitative backing for the premise and its relation to accuracy preservation.

Revision: yes

Referee: [Theoretical Analysis] Theoretical Analysis: The submodular formulation and guarantee under 'mild assumptions' is a strength, but the assumptions must be explicitly enumerated and their validity verified in the experimental regimes (e.g., for long sequences on OPT-30B). If the assumptions do not hold in the evaluated settings, the guarantee does not automatically protect the accuracy of the 20%-retention policy.

Authors: We thank the referee for this suggestion to strengthen the theoretical presentation. In the revised manuscript, we have explicitly enumerated the mild assumptions in the Theoretical Analysis section. We have also added a discussion verifying their validity within our experimental regimes, including for long sequences on models such as OPT-30B, supported by alignment between our empirical results and the theoretical predictions.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical observation and independent submodular analysis.

The paper defines heavy hitters directly from measured attention scores and reports an observational correlation with token co-occurrence plus degradation upon removal; these are presented as inputs from investigation rather than derived outputs. The H₂O policy is then constructed from those observations, formulated as a dynamic submodular problem, and supplied with a separate theoretical guarantee under assumptions the paper describes as mild. Throughput and accuracy results are measured on OPT, LLaMA, and GPT-NeoX rather than predicted by construction from the same inputs. The 20% retention ratio is an implementation parameter whose effect is validated experimentally, not a fitted value renamed as a prediction. No self-citation chains, self-definitional loops, or reductions of the central claims to tautologies appear in the derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the empirical observations that heavy hitters emerge naturally, in correlation with token co-occurrence, and that their removal degrades performance; the submodular guarantee relies on assumptions the abstract describes as mild.

free parameters (1)

heavy-hitter retention ratio (20%)
Chosen to balance memory savings and accuracy; appears tuned on the reported models and tasks.

axioms (2)

domain assumption: Heavy hitters can be identified dynamically from attention scores without significant overhead
Invoked to justify real-time eviction during generation.
domain assumption: Mild assumptions for submodular guarantee hold for transformer attention
Stated in the abstract as the basis for the theoretical result.
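For orientation, the textbook notion behind the second axiom can be written out. This is the standard static definition and the classical greedy bound for cardinality-constrained maximization, given here as background only; the paper's dynamic formulation and its specific guarantee are not reproduced on this page, and the symbols $f$, $V$, $k$, $S_{\text{greedy}}$, and $S^{*}$ are generic notation rather than the paper's.

```latex
% A set function f : 2^V \to \mathbb{R} is submodular if marginal gains shrink
% as the set grows:
\[
  f(A \cup \{x\}) - f(A) \;\ge\; f(B \cup \{x\}) - f(B)
  \qquad \text{for all } A \subseteq B \subseteq V,\; x \in V \setminus B .
\]
% Classical result for monotone submodular f with f(\emptyset) = 0: greedily
% selecting k elements gives
\[
  f(S_{\text{greedy}}) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr)\, f(S^{*}),
  \qquad S^{*} \in \arg\max_{|S| \le k} f(S).
\]
```

Whether transformer attention satisfies the conditions needed for a bound of this type in the evaluated settings is exactly what the axiom above takes on trust.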

invented entities (1)

Heavy Hitters (H₂) · no independent evidence
purpose: Tokens that contribute most to attention scores and are retained in KV cache
New label for the subset of tokens observed to dominate attention; no independent evidence outside the paper's measurements.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
cs.LG 2026-05 unverdicted novelty 7.0

Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
cs.LG 2026-05 accept novelty 7.0

Apple MPS decoding exhibits non-monotonic latency with spikes up to 21x due to KV cache interactions and execution regimes, unlike monotonic behavior on CPU and CUDA.
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
cs.LG 2026-05 conditional novelty 7.0

MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
Long Context Pre-Training with Lighthouse Attention
cs.CL 2026-05 conditional novelty 7.0

Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers
cs.LG 2026-04 unverdicted novelty 7.0

Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adapti...
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
cs.LG 2026-04 unverdicted novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
cs.AI 2025-11 unverdicted novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy ...
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
cs.AR 2026-05 unverdicted novelty 6.0

KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
cs.LG 2026-04 unverdicted novelty 6.0

Sub-token routing in LoRA-adapted transformers adds a finer compression axis for KV caches, with query-independent and query-aware designs that improve efficiency under reduced budgets when combined with token-level s...
StreamingVLM: Real-Time Understanding for Infinite Video Streams
cs.CV 2025-10 unverdicted novelty 6.0

StreamingVLM enables stable real-time understanding of infinite video streams at up to 8 FPS using a streaming KV cache and aligned SFT on overlapped chunks, with a 66.18% win rate over GPT-4O mini on a new two-hour v...
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
cs.CL 2024-02 conditional novelty 6.0

KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
cs.CL 2023-10 conditional novelty 6.0

FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention hea...
Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
cs.CE 2026-05 unverdicted novelty 5.0

LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
cs.LG 2026-05 accept novelty 5.0

Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
cs.DC 2026-04 unverdicted novelty 5.0

HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
cs.HC 2024-01 unverdicted novelty 3.0

This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

Reference graph

Works this paper leans on

145 extracted references · 145 canonical work pages · cited by 17 Pith papers · 32 internal anchors

[1]

LaMDA: Language Models for Dialog Applications

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng- Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Wordcraft: story writing with large language models

Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. Wordcraft: story writing with large language models. In 27th International Conference on Intelligent User Interfaces, pages 841–852, 2022

work page 2022
[3]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Benchmarking large language models for news summarization

Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B Hashimoto. Benchmarking large language models for news summarization. arXiv preprint arXiv:2301.13848, 2023

work page arXiv 2023
[5]

Raha, A., Mathaikutty, D

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. arXiv preprint arXiv:2211.05102, 2022

work page arXiv 2022
[6]

An anomaly in space-time char- acteristics of certain programs running in a paging machine

Laszlo A Belady, Robert A Nelson, and Gerald S Shedler. An anomaly in space-time char- acteristics of certain programs running in a paging machine. Communications of the ACM, 12(6):349–353, 1969

work page 1969
[7]

Reformer: The Efficient Transformer

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[8]

Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

work page 2022
[9]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[10]

Rethinking Attention with Performers

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[11]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

work page 2020
[12]

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[13]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Learning to compress prompts with gist tokens

Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467, 2023

work page arXiv 2023
[15]

A framework for few-shot language model evaluation

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation. In Zenodo. https://doi.org/10.5281/zenodo.5371628, September 2021

work page doi:10.5281/zenodo.5371628 2021
[16]

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale

Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, et al. Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale. arXiv preprint arXiv:2207.00032, 2022. 11

work page arXiv 2022
[18]

Hugging face accelerate

HuggingFace. Hugging face accelerate. https://huggingface.co/docs/accelerate/ index

work page
[19]

High-throughput generative inference of large language models with a single gpu

Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865, 2023

work page arXiv 2023
[20]

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot,

Elias Frantar and Dan Alistarh. Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774, 2023

work page arXiv 2023
[21]

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity

Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pech- enizkiy, Yi Liang, Zhangyang Wang, and Shiwei Liu. Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity. arXiv preprint arXiv:2310.05175, 2023

work page arXiv 2023
[23]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[24]

Smoothquant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2022

work page arXiv 2022
[25]

Zeroquant: Efficient and affordable post-training quantization for large-scale transformers

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861, 2022

work page arXiv 2022
[26]

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems, 2022

work page 2022
[27]

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm. int8 (): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[28]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Colt5: Faster long-range transformers with conditional computation

Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontañón, Siddhartha Brahma, Yury Zemlyanskiy, David Uthus, Mandy Guo, James Lee-Thorp, Yi Tay, et al. Colt5: Faster long-range transformers with conditional computation. arXiv preprint arXiv:2303.09752, 2023

work page arXiv 2023
[30]

Dynamic context pruning for efficient and interpretable autoregressive transformers

Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, and Thomas Hoffmann. Dynamic context pruning for efficient and interpretable autoregressive transformers. arXiv preprint arXiv:2305.15805, 2023

work page arXiv 2023
[31]

Efficient transformers: A survey

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020

work page arXiv 2009
[32]

Spatten: Efficient sparse attention architecture with cascade token and head pruning

Hanrui Wang, Zhekai Zhang, and Song Han. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High- Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021

[33]

The lru-k page replacement algorithm for database disk buffering

Elizabeth J O'Neil, Patrick E O'Neil, and Gerhard Weikum. The lru-k page replacement algorithm for database disk buffering. ACM SIGMOD Record, 22(2):297–306, 1993

[34]

Lrfu: A spectrum of policies that subsumes the least recently used and least frequently used policies

Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. Lrfu: A spectrum of policies that subsumes the least recently used and least frequently used policies. IEEE transactions on Computers, 50(12):1352–1361, 2001

[35]

On the expressive power of self-attention matrices

Valerii Likhosherstov, Krzysztof Choromanski, and Adrian Weller. On the expressive power of self-attention matrices. arXiv preprint arXiv:2106.03764, 2021

[36]

Inductive biases and variable creation in self-attention mechanisms

Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, pages 5793–5831. PMLR, 2022

[37]

Laszlo A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78–101, 1966.

[39]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

[40]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[41]

GPT-NeoX-20B: An open-source autoregressive language model

Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An open-source autoregressive language model. In Proceedings of the ACL Workshop on...

[42]

Choice of plausible alternatives: An evaluation of commonsense causal reasoning

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pages 90–95, 2011

[43]

MathQA: Towards interpretable math word problem solving with operation-based formalisms

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long ...

[44]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018

[45]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

[46]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

[47]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

[48]

Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745, 2018

[49]

Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023, 2016

[50]

Alpacaeval: An automatic evaluator of instruction-following models

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023

[51]

Judging llm-as-a-judge with mt-bench and chatbot arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

[52]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

[53]

Lm-infinite: Simple on-the-fly length generalization for large language models

Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137, 2023

[54]

Compressive transformers for long-range sequence modelling

Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. In The International Conference on Learning Representations (ICLR), 2020.

[55]

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015

[56]

Quantization and training of neural networks for efficient integer-arithmetic-only inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018

[57]

Data-free quantization through weight equalization and bias correction

Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1325–1334, 2019

[58]

Improving neural network quantization without retraining using outlier channel splitting

Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. Improving neural network quantization without retraining using outlier channel splitting. In International conference on machine learning, pages 7543–7552. PMLR, 2019

[59]

Pruning Convolutional Neural Networks for Resource Efficient Inference

Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016

[60]

Rethinking the Value of Network Pruning

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018

[61]

Filter pruning via geometric median for deep convolutional neural networks acceleration

Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4340–4349, 2019

[62]

Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks

Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res., 22(241):1–124, 2021

[63]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015

[64]

On the efficacy of knowledge distillation

Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4794–4802, 2019

[65]

Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136, 2019

[66]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021

[67]

Attention is all you need

Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017

[68]

Xlnet: Generalized autoregressive pretraining for language understanding

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32, 2019

[69]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

[70]

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018

[71]

Radbert-cl: Factually-aware contrastive learning for radiology report classification

Ajay Jaiswal, Liyan Tang, Meheli Ghosh, Justin Rousseau, Yifan Peng, and Ying Ding. Radbert-cl: Factually-aware contrastive learning for radiology report classification. Proceedings of machine learning research, 158:196–208, 2021

[72]

End-to-end open-domain question answering with bertserini

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. End-to-end open-domain question answering with bertserini. arXiv preprint arXiv:1902.01718, 2019.

[73]

Cognitive Graph for Multi-Hop Reading Comprehension at Scale

Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. Cognitive graph for multi-hop reading comprehension at scale. arXiv preprint arXiv:1905.05460, 2019

[74]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

[75]

Harnessing the power of llms in practice: A survey on chatgpt and beyond

Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712, 2023

[76]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

[77]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020

[78]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

[79]

Language Models are Few-Shot Learners

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020

[80]

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022

[81]

Why Adam beats SGD for attention models

Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, and Suvrit Sra. Why Adam beats SGD for attention models, 2020


Showing first 80 references.