H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Pith reviewed 2026-05-17 17:53 UTC · model grok-4.3
The pith
Heavy-hitter tokens that dominate attention let LLMs run with a much smaller KV cache and up to 29 times higher throughput.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A modest number of heavy-hitter tokens account for most attention value in transformer generation; retaining a balanced mix of these tokens and recent ones via the H₂O policy preserves generation quality while allowing the rest of the KV cache to be evicted.
What carries the argument
Heavy-Hitter Oracle (H₂O) eviction policy that dynamically keeps recent tokens together with heavy hitters identified by their contribution to attention scores.
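The policy described above can be sketched for a single attention head as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `h2o_decode_sketch`, the parameters `heavy_budget` and `recent_budget`, and the choice to evict exactly one token per decoding step are all hypothetical. A token's importance is approximated here by the attention probability it has accumulated so far.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def h2o_decode_sketch(queries, keys, heavy_budget, recent_budget):
    """Simulate a heavy-hitter plus recency eviction rule for one head.

    queries, keys: arrays of shape (T, d). Returns, for every step,
    the list of token positions still held in the cache.
    """
    cache = []      # positions currently kept, always in ascending order
    acc = {}        # position -> accumulated attention probability
    history = []
    for t in range(len(queries)):
        cache.append(t)
        acc[t] = 0.0
        # The current query attends only to tokens that are still cached.
        logits = keys[cache] @ queries[t] / np.sqrt(keys.shape[1])
        for pos, p in zip(cache, softmax(logits)):
            acc[pos] += float(p)
        # Over budget: protect the most recent tokens, then evict the
        # remaining token with the lowest accumulated score.
        if len(cache) > heavy_budget + recent_budget:
            recent = set(cache[-recent_budget:]) if recent_budget > 0 else set()
            candidates = [p for p in cache if p not in recent]
            victim = min(candidates, key=lambda p: acc[p])
            cache.remove(victim)
            del acc[victim]
        history.append(list(cache))
    return history

# Usage with random placeholder data (not real model activations).
rng = np.random.default_rng(0)
q = rng.normal(size=(20, 8))
k = rng.normal(size=(20, 8))
kept = h2o_decode_sketch(q, k, heavy_budget=3, recent_budget=2)
print(kept[-1])
```

Setting `heavy_budget=0` reduces the sketch to a plain sliding window, which makes the contribution of the heavy-hitter half of the budget easy to isolate in an ablation.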
If this is right
- Using 20 percent heavy hitters raises throughput by up to 29 times versus DeepSpeed Zero-Inference and Hugging Face Accelerate and 3 times versus FlexGen on OPT-6.7B and OPT-30B.
- At the same batch size, latency drops by up to 1.9 times.
- The approach works across OPT, LLaMA, and GPT-NeoX on diverse tasks.
- The submodular formulation supplies a theoretical guarantee that can guide later eviction methods.
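As background for the last bullet, the textbook notion of submodularity can be written as below. This is the standard static definition and the classical greedy bound, included on the assumption that the paper's dynamic formulation builds on the same diminishing-returns property. It is not the paper's theorem, whose assumptions and constants are not reproduced here. Reading $V$ as the set of token positions, $f(S)$ as the attention value captured by a cache $S$, and $k$ as the cache budget is an interpretive assumption.

```latex
% Diminishing returns: adding x helps a smaller set at least as much as a larger one.
\[
  f(A \cup \{x\}) - f(A) \;\ge\; f(B \cup \{x\}) - f(B)
  \qquad \text{for all } A \subseteq B \subseteq V,\ x \in V \setminus B .
\]

% Classical greedy guarantee for a monotone, non-negative submodular f
% maximized under the cardinality constraint |S| <= k.
\[
  f(S_{\mathrm{greedy}}) \;\ge\; \left(1 - \tfrac{1}{e}\right) \max_{|S| \le k} f(S) .
\]
```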
Where Pith is reading between the lines
- Similar heavy-hitter patterns may appear in other memory-bound transformer components such as feed-forward layers.
- The eviction rule could be combined with quantization or sparsity techniques for further memory savings.
- Task-specific or layer-wise tuning of the heavy-hitter ratio might yield additional gains without retraining.
Load-bearing premise
Heavy hitters arise naturally from token co-occurrence and their removal produces large drops in generation quality.
What would settle it
A controlled run in which heavy hitters are evicted yet output quality and speed remain unchanged would show the policy is unnecessary.
Original abstract
Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H₂). Through a comprehensive investigation, we find that (i) the emergence of H₂ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H₂O), a KV cache eviction policy that dynamically retains a balance of recent and H₂ tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H₂O with 20% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29×, 29×, and 3× on OPT-6.7B and OPT-30B. With the same batch size, H₂O can reduce the latency by up to 1.9×. The code is available at https://github.com/FMInference/H2O.
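The abstract's statement that the KV cache scales linearly with sequence length and batch size can be made concrete with a short calculation. The sketch below is illustrative only: `kv_cache_bytes` is a hypothetical helper, the layer count and hidden size are assumed values in the style of a 30B-parameter OPT model, 16-bit storage is assumed, and keeping one fifth of the positions is a stand-in for a reduced cache budget rather than the paper's exact configuration.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to store keys and values for every layer and position."""
    # Factor 2: one key tensor and one value tensor per layer, each of
    # shape (batch_size, seq_len, hidden_size).
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_elem

GIB = 2 ** 30

# Assumed settings, chosen for illustration.
full = kv_cache_bytes(num_layers=48, hidden_size=7168, seq_len=2048, batch_size=16)
pruned = kv_cache_bytes(num_layers=48, hidden_size=7168, seq_len=2048 // 5, batch_size=16)

print(f"full cache:            {full / GIB:.1f} GiB")
print(f"one fifth of positions: {pruned / GIB:.1f} GiB")
```

Because every factor enters multiplicatively, halving the number of retained positions halves the footprint, which is why an eviction policy translates directly into room for larger batches.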
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a small subset of tokens ('Heavy Hitters' or H₂) dominate attention scores in LLMs, emerge naturally, and correlate strongly with frequent token co-occurrences such that their removal causes significant performance degradation. It proposes H₂O, a KV cache eviction policy that dynamically retains a balance of recent tokens and H₂ tokens (implemented at a 20% heavy-hitter ratio), formulates the eviction as a dynamic submodular optimization problem, and proves a theoretical guarantee under mild assumptions. The method is validated on OPT, LLaMA, and GPT-NeoX models across tasks and reports up to 29× throughput gains over DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen on OPT-6.7B/30B, plus up to 1.9× latency reduction at fixed batch size.
Significance. If the accuracy preservation holds at the reported cache sizes, the work addresses a practical bottleneck in long-context LLM deployment by reducing KV cache memory footprint while delivering substantial throughput and latency improvements. The submodular formulation with a theoretical guarantee and the open-sourced implementation are notable strengths that could guide future cache-management research.
major comments (3)
- [Abstract] Abstract: The headline throughput claims (up to 29× over DeepSpeed/HF/FlexGen on OPT-6.7B and OPT-30B) require that eviction at the 20% heavy-hitter ratio preserves generation quality comparable to the full-cache baseline. The abstract reports validation across models but provides no error bars, specific task metrics, or direct accuracy comparisons with the full KV cache; without these, the speedups risk being a quality–speed trade-off rather than a pure efficiency win.
- [Abstract] Abstract / Empirical section: The central premise that 'the emergence of H₂ is natural and strongly correlates with the frequent co-occurrence of tokens' and that 'removing them results in significant performance degradation' is observational. This correlation should be quantified (e.g., via co-occurrence statistics, correlation coefficients, or ablation tables showing degradation magnitude across tasks) because a weaker or task-dependent correlation would undermine the accuracy-preservation claim that supports the reported speedups.
- [Theoretical Analysis] Theoretical Analysis: The submodular formulation and guarantee under 'mild assumptions' is a strength, but the assumptions must be explicitly enumerated and their validity verified in the experimental regimes (e.g., for long sequences on OPT-30B). If the assumptions do not hold in the evaluated settings, the guarantee does not automatically protect the accuracy of the 20%-retention policy.
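The quantification requested in the second major comment could take a form like the following. It is a minimal sketch assuming that per-token co-occurrence counts and accumulated attention scores have already been extracted. The arrays below are synthetic placeholders generated to be positively related, not measurements from the paper, and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Synthetic placeholder data: one entry per token.
rng = np.random.default_rng(0)
cooccurrence_counts = rng.integers(1, 50, size=200)
accumulated_attention = 0.5 * cooccurrence_counts + rng.normal(0.0, 1.0, size=200)

r = pearson_r(cooccurrence_counts, accumulated_attention)
print(f"Pearson r = {r:.3f}")
```

Reporting such a coefficient per layer and per task would show whether the relationship is uniform or task-dependent, which is the concern the referee raises.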
minor comments (2)
- [Abstract] The abstract states validation 'across a wide range of tasks' without naming the tasks or providing summary statistics; adding this information would improve clarity.
- Figures and tables reporting throughput and accuracy should include error bars or standard deviations over multiple runs to convey robustness, especially given the post-hoc selection of the 20% ratio.
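The error bars asked for in the last comment amount to a mean and a sample standard deviation over repeated runs. The sketch below assumes throughput has been measured several times under identical settings; `summarize` is a hypothetical helper and the numbers are made-up placeholders, not results from the paper.

```python
import statistics

def summarize(runs):
    """Return (mean, sample standard deviation) for a list of measurements."""
    mean = statistics.mean(runs)
    # The sample standard deviation needs at least two measurements.
    std = statistics.stdev(runs) if len(runs) > 1 else 0.0
    return mean, std

# Placeholder throughput measurements in tokens per second.
mean, std = summarize([10.0, 12.0, 14.0])
print(f"throughput: {mean:.1f} +/- {std:.1f} tokens/s")
```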
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity in the abstract, strengthening the empirical support for our core observations, and making the theoretical assumptions more explicit. We address each point below and have made revisions to the manuscript where appropriate.
- Referee: [Abstract] Abstract: The headline throughput claims (up to 29× over DeepSpeed/HF/FlexGen on OPT-6.7B and OPT-30B) require that eviction at the 20% heavy-hitter ratio preserves generation quality comparable to the full-cache baseline. The abstract reports validation across models but provides no error bars, specific task metrics, or direct accuracy comparisons with the full KV cache; without these, the speedups risk being a quality–speed trade-off rather than a pure efficiency win.
  Authors: We agree that the abstract should more explicitly convey that the reported speedups are achieved while preserving accuracy. In the revised manuscript, we have updated the abstract to state that H₂O at the 20% heavy-hitter ratio maintains generation quality comparable to the full KV cache, with details and direct comparisons provided in the empirical evaluation section. We have also incorporated error bars and specific task metrics (e.g., perplexity and accuracy on benchmarks) into the relevant figures and tables to facilitate these comparisons.
  Revision: yes
- Referee: [Abstract] Abstract / Empirical section: The central premise that 'the emergence of H₂ is natural and strongly correlates with the frequent co-occurrence of tokens' and that 'removing them results in significant performance degradation' is observational. This correlation should be quantified (e.g., via co-occurrence statistics, correlation coefficients, or ablation tables showing degradation magnitude across tasks) because a weaker or task-dependent correlation would undermine the accuracy-preservation claim that supports the reported speedups.
  Authors: The manuscript already includes ablation studies demonstrating performance degradation when H₂ tokens are removed. To address the request for quantification, we have added co-occurrence statistics and correlation coefficients between heavy-hitter tokens and frequent token co-occurrences in the revised empirical section. We have also expanded the ablation tables to report degradation magnitudes across tasks, providing stronger quantitative backing for the premise and its relation to accuracy preservation.
  Revision: yes
- Referee: [Theoretical Analysis] Theoretical Analysis: The submodular formulation and guarantee under 'mild assumptions' is a strength, but the assumptions must be explicitly enumerated and their validity verified in the experimental regimes (e.g., for long sequences on OPT-30B). If the assumptions do not hold in the evaluated settings, the guarantee does not automatically protect the accuracy of the 20%-retention policy.
  Authors: We thank the referee for this suggestion to strengthen the theoretical presentation. In the revised manuscript, we have explicitly enumerated the mild assumptions in the Theoretical Analysis section. We have also added a discussion verifying their validity within our experimental regimes, including for long sequences on models such as OPT-30B, supported by alignment between our empirical results and the theoretical predictions.
  Revision: yes
Circularity Check
No significant circularity; claims rest on empirical observation and independent submodular analysis.
full rationale
The paper defines heavy hitters directly from measured attention scores and reports an observational correlation with token co-occurrence plus degradation upon removal; these are presented as inputs from investigation rather than derived outputs. The H₂O policy is then constructed from those observations, formulated as a dynamic submodular problem, and supplied with a separate theoretical guarantee under explicitly mild assumptions. Throughput and accuracy results are measured on OPT, LLaMA, and GPT-NeoX rather than predicted by construction from the same inputs. The 20% retention ratio is an implementation parameter whose effect is validated experimentally, not a fitted value renamed as a prediction. No self-citation chains, self-definitional loops, or reductions of the central claims to tautologies appear in the derivation.
Axiom & Free-Parameter Ledger
free parameters (1)
- heavy-hitter retention ratio (20%)
axioms (2)
- domain assumption: Heavy hitters can be identified dynamically from attention scores without significant overhead
- domain assumption: Mild assumptions for submodular guarantee hold for transformer attention
invented entities (1)
- Heavy Hitters (H₂): no independent evidence
Forward citations
Cited by 19 Pith papers
- Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
  Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.
- Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
  Apple MPS decoding exhibits non-monotonic latency with spikes up to 21x due to KV cache interactions and execution regimes, unlike monotonic behavior on CPU and CUDA.
- MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
  MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
- Long Context Pre-Training with Lighthouse Attention
  Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
- How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers
  Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adapti...
- Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
  Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
- SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
  SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy ...
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
- KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
  KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
- Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
  Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
- Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
  Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
- Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
  Sub-token routing in LoRA-adapted transformers adds a finer compression axis for KV caches, with query-independent and query-aware designs that improve efficiency under reduced budgets when combined with token-level s...
- StreamingVLM: Real-Time Understanding for Infinite Video Streams
  StreamingVLM enables stable real-time understanding of infinite video streams at up to 8 FPS using a streaming KV cache and aligned SFT on overlapped chunks, with a 66.18% win rate over GPT-4O mini on a new two-hour v...
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
  KIVI applies asymmetric 2-bit quantization to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
  FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention hea...
- Position: LLM Inference Should Be Evaluated as Energy-to-Token Production
  LLM inference should be reframed and evaluated as energy-to-token production with a Token Production Function that accounts for power, cooling, and efficiency ceilings.
- StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
  Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
- HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
  HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
- Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
  This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.
Reference graph
Works this paper leans on
-
[1]
LaMDA: Language Models for Dialog Applications
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng- Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Wordcraft: story writing with large language models
Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. Wordcraft: story writing with large language models. In 27th International Conference on Intelligent User Interfaces, pages 841–852, 2022
work page 2022
-
[3]
Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[4]
Benchmarking large language models for news summarization
Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B Hashimoto. Benchmarking large language models for news summarization. arXiv preprint arXiv:2301.13848, 2023
-
[5]
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. arXiv preprint arXiv:2211.05102, 2022
-
[6]
An anomaly in space-time char- acteristics of certain programs running in a paging machine
Laszlo A Belady, Robert A Nelson, and Gerald S Shedler. An anomaly in space-time char- acteristics of certain programs running in a paging machine. Communications of the ACM, 12(6):349–353, 1969
work page 1969
-
[7]
Reformer: The Efficient Transformer
Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[8]
Flashattention: Fast and memory-efficient exact attention with io-awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022
work page 2022
-
[9]
Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[10]
Rethinking Attention with Performers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[11]
Transformers are rnns: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020
work page 2020
-
[12]
Fast Transformer Decoding: One Write-Head is All You Need
Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1911
-
[13]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[14]
Learning to compress prompts with gist tokens
Jesse Mu, Xiang Lisa Li, and Noah Goodman. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467, 2023
-
[15]
A framework for few-shot language model evaluation
Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation. In Zenodo. https://doi.org/10.5281/zenodo.5371628, September 2021
-
[16]
Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[17]
Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale
Reza Yazdani Aminabadi, Samyam Rajbhandari, Minjia Zhang, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Jeff Rasley, Shaden Smith, Olatunji Ruwase, et al. Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale. arXiv preprint arXiv:2207.00032, 2022. 11
-
[18]
HuggingFace. Hugging face accelerate. https://huggingface.co/docs/accelerate/ index
-
[19]
High-throughput generative inference of large language models with a single gpu
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865, 2023
-
[20]
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot,
Elias Frantar and Dan Alistarh. Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774, 2023
-
[21]
A Simple and Effective Pruning Approach for Large Language Models
Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity
Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pech- enizkiy, Yi Liang, Zhangyang Wang, and Shiwei Liu. Outlier weighed layerwise sparsity (owl): A missing secret sauce for pruning llms to high sparsity. arXiv preprint arXiv:2310.05175, 2023
-
[23]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[24]
Smoothquant: Accurate and efficient post-training quantization for large language models
Guangxuan Xiao, Ji Lin, Mickael Seznec, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2022
-
[25]
Zeroquant: Efficient and affordable post-training quantization for large-scale transformers
Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861, 2022
-
[26]
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[27]
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm. int8 (): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[28]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[29]
Colt5: Faster long-range transformers with conditional computation
Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontañón, Siddhartha Brahma, Yury Zemlyanskiy, David Uthus, Mandy Guo, James Lee-Thorp, Yi Tay, et al. Colt5: Faster long-range transformers with conditional computation. arXiv preprint arXiv:2303.09752, 2023
-
[30]
Dynamic context pruning for efficient and interpretable autoregressive transformers
Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, and Thomas Hoffmann. Dynamic context pruning for efficient and interpretable autoregressive transformers. arXiv preprint arXiv:2305.15805, 2023
-
[31]
Efficient transformers: A survey
Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020
-
[32]
Spatten: Efficient sparse attention architecture with cascade token and head pruning
Hanrui Wang, Zhekai Zhang, and Song Han. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High- Performance Computer Architecture (HPCA), pages 97–110. IEEE, 2021
work page 2021
-
[33]
The lru-k page replacement algorithm for database disk buffering
Elizabeth J O’neil, Patrick E O’neil, and Gerhard Weikum. The lru-k page replacement algorithm for database disk buffering. Acm Sigmod Record, 22(2):297–306, 1993
work page 1993
-
[34]
Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. Lrfu: A spectrum of policies that subsumes the least recently used and least frequently used policies. IEEE transactions on Computers, 50(12):1352–1361, 2001
work page 2001
-
[35]
On the expressive power of self-attention matrices
Valerii Likhosherstov, Krzysztof Choromanski, and Adrian Weller. On the expressive power of self-attention matrices. arXiv preprint arXiv:2106.03764, 2021
-
[36]
Inductive biases and variable creation in self-attention mechanisms
Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, pages 5793–5831. PMLR, 2022
work page 2022
-
[37]
Laszlo A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems journal, 5(2):78–101, 1966. 12
work page 1966
-
[39]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[40]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
GPT- NeoX-20B: An open-source autoregressive language model
Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Ho- race He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT- NeoX-20B: An open-source autoregressive language model. In Proceedings of the ACL Workshop on...
work page 2022
-
[42]
Choice of plausible alternatives: An evaluation of commonsense causal reasoning
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI spring symposium: logical formalizations of commonsense reasoning, pages 90–95, 2011
work page 2011
-
[43]
MathQA: Towards interpretable math word problem solving with operation- based formalisms
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Han- naneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation- based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long ...
work page 2019
-
[44]
Can a suit of armor conduct electricity? a new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018
work page 2018
-
[45]
Piqa: Reasoning about physical commonsense in natural language
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020
work page 2020
-
[46]
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding.arXiv preprint arXiv:1804.07461, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[47]
Winogrande: An adversarial winograd schema challenge at scale
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021
work page 2021
-
[48]
Shashi Narayan, Shay B Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[49]
Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond
Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. Abstractive text sum- marization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
- [50]
-
[51]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023
work page 2023
-
[52]
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
Lm-infinite: Simple on-the-fly length generalization for large language models
Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137, 2023
-
[54]
Compressive transformers for long-range sequence modelling
Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. In The International Conference on Learning Representations (ICLR), 2020.
-
[55]
Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015
-
[56]
Quantization and training of neural networks for efficient integer-arithmetic-only inference
Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018
-
[57]
Data-free quantization through weight equalization and bias correction
Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1325–1334, 2019
-
[58]
Improving neural network quantization without retraining using outlier channel splitting
Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. Improving neural network quantization without retraining using outlier channel splitting. In International conference on machine learning, pages 7543–7552. PMLR, 2019
-
[59]
Pruning Convolutional Neural Networks for Resource Efficient Inference
Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016
-
[60]
Rethinking the Value of Network Pruning
Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018
-
[61]
Filter pruning via geometric median for deep convolutional neural networks acceleration
Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, and Yi Yang. Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4340–4349, 2019
-
[62]
Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res., 22(241):1–124, 2021
-
[63]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015
-
[64]
On the efficacy of knowledge distillation
Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4794–4802, 2019
-
[65]
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling task-specific knowledge from bert into simple neural networks. arXiv preprint arXiv:1903.12136, 2019
-
[66]
Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021
-
[67]
Attention is all you need
Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017
-
[68]
Xlnet: Generalized autoregressive pretraining for language understanding
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32, 2019
-
[69]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019
-
[70]
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018
-
[71]
Radbert-cl: Factually-aware contrastive learning for radiology report classification
Ajay Jaiswal, Liyan Tang, Meheli Ghosh, Justin Rousseau, Yifan Peng, and Ying Ding. Radbert-cl: Factually-aware contrastive learning for radiology report classification. Proceedings of machine learning research, 158:196–208, 2021
-
[72]
End-to-end open-domain question answering with bertserini
Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. End-to-end open-domain question answering with bertserini. arXiv preprint arXiv:1902.01718, 2019.
-
[73]
Cognitive Graph for Multi-Hop Reading Comprehension at Scale
Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, and Jie Tang. Cognitive graph for multi-hop reading comprehension at scale. arXiv preprint arXiv:1905.05460, 2019
-
[74]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022
-
[75]
Harnessing the power of llms in practice: A survey on chatgpt and beyond
Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712, 2023
-
[76]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
-
[77]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020
-
[78]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
-
[79]
Language Models are Few-Shot Learners
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020
-
[80]
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022
-
[81]
Why Adam beats SGD for attention models
Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J Reddi, Sanjiv Kumar, and Suvrit Sra. Why Adam beats SGD for attention models, 2020