Dual Dimensionality for Local and Global Attention

Sirui Zeng; Xifeng Yan; Xuan Luo; Zhiyuan Wang

arxiv: 2606.18587 · v1 · pith:5IZZKZQAnew · submitted 2026-06-17 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-26 21:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords attention mechanisms · transformer efficiency · KV cache · dimensionality reduction · decoder-only models · pretraining · distance-adaptive representations

The pith

Distance-adaptive representations match full-dimensional attention performance by giving richer keys and values to local tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that decoder-only transformers can assign different representational capacity to tokens based on their distance from the current prediction. Local tokens, which most strongly influence the next output, receive full-dimensional key and value vectors, while distant tokens receive reduced-dimensional vectors such as one-quarter the original size. Experiments across pretraining runs from 70 million to 410 million parameters, plus continued fine-tuning on a 1-billion-parameter model, show this distance-adaptive scheme reaches accuracy levels comparable to uniform full-dimensional baselines. Uniformly lowering dimensionality for every token position instead produces clear drops in performance. The result questions the standard practice of keeping key and value dimensionality constant across the entire sequence.

Core claim

Decoder-only Transformers compute attention over the KV cache of preceding tokens with the same dimensionality for every position. The authors hypothesize that local tokens impose greater demands on representational capacity because they most strongly influence immediate next-token predictions, whereas distant tokens mainly provide long-range memory that tolerates lower dimensionality. Distance-Adaptive Representation implements this by preserving full-dimensional KV vectors inside a fixed local window and assigning reduced-dimensional vectors beyond the window. Across multiple pretraining scales and continued supervised fine-tuning, the approach matches the performance of full-dimensional m

What carries the argument

Distance-Adaptive Representation (DAR), which keeps full-dimensional KV representations inside a local context window and assigns reduced-dimensional representations to tokens beyond that window.
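As a rough illustration of this mechanism, the sketch below computes single-head attention for one query over a cache in which local tokens use all d coordinates and distant tokens use only d_down. This page does not describe the paper's actual projection mechanism, so the truncation and zero-padding scheme, the per-width score scaling, and the names `dar_attention`, `w`, and `d_down` are assumptions made purely for illustration.

```python
import math

def dar_attention(q, keys, values, w, d_down):
    """Single-query, single-head attention with distance-dependent KV width.

    Assumption (not from the paper): a distant token keeps only the first
    d_down coordinates of its key and value; the query is truncated to match,
    and truncated values are zero-padded back to length d before mixing.
    """
    n, d = len(keys), len(q)
    scores, padded_values = [], []
    for i in range(n):
        distance = n - 1 - i                  # 0 for the current token
        dim = d if distance < w else d_down   # window includes the current token
        score = sum(a * b for a, b in zip(q[:dim], keys[i][:dim]))
        scores.append(score / math.sqrt(dim))
        padded_values.append(values[i][:dim] + [0.0] * (d - dim))
    # Numerically stable softmax over all preceding tokens.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(wt * v[j] for wt, v in zip(weights, padded_values))
            for j in range(d)]
```

Setting `w` to the sequence length, or `d_down` equal to d, reduces this sketch to ordinary scaled dot-product attention, which is one way to check that a full-dimensional baseline and a DAR variant share the same attention computation.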

If this is right

Performance remains comparable to full-dimensional baselines across pretraining scales from 70M to 410M parameters.
Continued supervised fine-tuning on a 1B-scale model also reaches matching performance.
Uniform reduction of dimensionality at every token position produces worse results than the distance-adaptive approach.
KV cache memory footprint can be reduced during inference while preserving accuracy.
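The memory claim can be made concrete with a little arithmetic. The sketch below computes the fraction of the full-dimensional KV cache that remains when only the most recent `w` tokens keep all d coordinates. The function name and the example values (n = 4096, w = 256, d = 64, d_down = 16) are hypothetical and are not taken from the paper.

```python
def kv_cache_ratio(n, w, d, d_down):
    """Fraction of the full-dimensional KV cache kept under a local window.

    n: cached tokens, w: local window size, d: full per-token width,
    d_down: reduced width for tokens beyond the window.
    Keys and values shrink by the same factor, so the ratio covers both.
    """
    local = min(n, w)
    distant = n - local
    return (local * d + distant * d_down) / (n * d)

# Illustrative values only; the paper's window size is not given on this page.
print(kv_cache_ratio(n=4096, w=256, d=64, d_down=16))  # 0.296875
```

As n grows with `w` fixed, the ratio approaches d_down / d, so under these assumptions a 1/4 reduction factor bounds the long-sequence cache at roughly a quarter of its original size.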

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same local-versus-distant split could be applied to attention layers in non-decoder architectures.
Fixed local windows might be replaced by learned or task-dependent window sizes to further optimize capacity allocation.
Longer maximum sequence lengths could become practical under a fixed memory budget for the KV cache.

Load-bearing premise

Local tokens require richer representations than distant tokens because they matter more for predicting the immediate next output.

What would settle it

At the 410-million-parameter scale, run the same pretraining data and show that the distance-adaptive model produces more than a 0.5-point rise in validation perplexity relative to the matched full-dimensional baseline.
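A minimal sketch of how that check could be scripted, assuming perplexity is computed as the exponential of the mean per-token negative log-likelihood (natural log) on a shared validation set. The helper names and the default threshold argument are illustrative, not part of the paper.

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def exceeds_gap(ppl_dar, ppl_full, threshold=0.5):
    """True if the DAR model is worse than the baseline by more than threshold."""
    return (ppl_dar - ppl_full) > threshold
```

Both models would need to be scored on identical tokens for the difference to be meaningful.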

Figures

Figures reproduced from arXiv: 2606.18587 by Sirui Zeng, Xifeng Yan, Xuan Luo, Zhiyuan Wang.

**Figure 1.** Tokens within a local window of size w (including the current token x_n) are represented at dimensionality d, while tokens beyond the window are represented at a lower dimensionality d_down. The current token attends to all preceding tokens.

**Figure 2.** Document-length distribution (CDF) of three perplexity evaluation corpora, tokenized …

**Figure 3.** Average perplexity ratio relative to Vanilla (= 100%, shown as horizontal line) across …

Original abstract

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of its distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g. 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DAR matches full-dimensional KV performance while cutting dimensionality only for distant tokens, whereas uniform cuts hurt; the implementation still needs checking to confirm that no extra factors explain the result.

The main thing to know is that this distance-adaptive KV scheme keeps performance close to the full baseline across pretraining scales from 70M to 410M and then on 1B fine-tuning, while uniform dimensionality reduction does not.

What is new is the formalization of Distance-Adaptive Representation: a local window at full dimensionality, with reduced dimensionality (for example, 1/4) outside it. They run controlled comparisons that directly contrast the adaptive case against both full and uniform baselines.

The paper does well on the empirical side. The consistent match to full performance at multiple scales, plus the clear drop under uniform reduction, gives the asymmetry claim some grounding. The hypothesis that local tokens need richer reps while distant ones can be lower-dim is tested in a straightforward way.

The soft spot is the implementation of the variable dimensions. Producing and attending over lower-dim keys and values for distant tokens requires some mechanism, such as separate projections or head handling. If that mechanism changes parameter allocation or attention behavior for the local tokens relative to the full baseline, the performance parity might trace to those choices rather than the local-versus-distant demand. The stress-test note flags a real check that belongs in the methods.

This paper is for people working on KV cache reduction and efficient decoder-only inference. A reader focused on practical memory savings in LLMs would get value from the scale of the tests and the baseline contrast. It deserves a serious referee because the central empirical claim is testable and the scales are large enough to matter.

I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper proposes Distance-Adaptive Representation (DAR) for decoder-only Transformers, hypothesizing asymmetric representational demands: local tokens (within a context window) require full-dimensional keys/values while distant tokens can use reduced dimensionality (e.g., 1/4). Implemented in a controlled setting, DAR is evaluated on pretraining runs from 70M to 410M parameters and continued supervised fine-tuning on a 1B-scale model. Results indicate DAR matches full-dimensional baseline performance, whereas uniform dimensionality reduction across all positions degrades results. The work challenges the uniform KV dimensionality assumption and suggests potential KV cache reductions at inference.

Significance. If the central empirical claim holds under fully controlled conditions, the result would be significant for efficient Transformer design: it provides evidence that representational capacity can be allocated non-uniformly by token distance without performance loss, opening a direction for adaptive attention mechanisms that reduce memory usage during inference while preserving modeling quality.

major comments (2)

[§3] The skeptic concern is load-bearing: the central claim requires that DAR and the uniform-reduction baseline differ only in the hypothesized local-vs-distant asymmetry. §3 (Method) and the experimental setup do not provide sufficient detail on the projection mechanism for reduced-dimensional KV (separate matrices? shared weights? padding/concatenation?), leaving open the possibility that performance parity arises from uncontrolled differences in parameter count or attention computation rather than the distance-adaptive hypothesis.
[Table 2] Table 2 (or equivalent results table): the reported matching performance across pretraining scales lacks explicit controls for total parameter count or effective head capacity between DAR and the full-dimensional baseline; without these, the contrast to uniform reduction cannot be attributed solely to the local-window design.

minor comments (2)

The exact local window size and reduction factor used in each scale experiment should be stated explicitly (they are listed as free parameters in the design) rather than described only as 'e.g. 1/4'.
[§4] Clarify in the abstract and §4 whether the 1B-scale fine-tuning results use the same DAR configuration as the pretraining runs or a modified one.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on implementation details and experimental controls. We agree these clarifications will strengthen the paper and will revise accordingly.

Referee: [§3] The skeptic concern is load-bearing: the central claim requires that DAR and the uniform-reduction baseline differ only in the hypothesized local-vs-distant asymmetry. §3 (Method) and the experimental setup do not provide sufficient detail on the projection mechanism for reduced-dimensional KV (separate matrices? shared weights? padding/concatenation?), leaving open the possibility that performance parity arises from uncontrolled differences in parameter count or attention computation rather than the distance-adaptive hypothesis.

Authors: We agree that §3 requires additional detail to rule out alternative explanations. In the revision we will expand the method description to specify the projection mechanism for reduced-dimensional KV (including matrix sharing or separation and any padding/concatenation steps) and confirm that attention computation remains identical across conditions. This will make explicit that the sole controlled difference is the distance-based dimensionality allocation. revision: yes
Referee: [Table 2] Table 2 (or equivalent results table): the reported matching performance across pretraining scales lacks explicit controls for total parameter count or effective head capacity between DAR and the full-dimensional baseline; without these, the contrast to uniform reduction cannot be attributed solely to the local-window design.

Authors: We will add explicit parameter-count and head-capacity comparisons (for DAR, full baseline, and uniform-reduction models) to the results section and Table 2. This will allow direct verification that performance differences are attributable to the local-vs-distant asymmetry rather than capacity mismatches. revision: yes
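To show why the parameter-count comparison matters, the sketch below counts key and value projection weights for one attention layer under four hypothetical schemes. The scheme names and the two DAR variants (shared weights with truncation versus separate projections for distant tokens) are assumptions for illustration; this page does not say which mechanism the paper uses.

```python
def kv_projection_params(d_model, d_kv, d_down, scheme):
    """Count key + value projection weights (no biases) for one layer."""
    if scheme == "full":            # every token projected to d_kv
        return 2 * d_model * d_kv
    if scheme == "uniform":         # every token projected to d_down
        return 2 * d_model * d_down
    if scheme == "dar_shared":      # assumed: distant KV = truncated local KV
        return 2 * d_model * d_kv
    if scheme == "dar_separate":    # assumed: extra projections for distant KV
        return 2 * d_model * (d_kv + d_down)
    raise ValueError(f"unknown scheme: {scheme}")
```

Under these assumptions, a shared-weight DAR has the same parameter count as the full baseline, while a separate-projection DAR has more, which is the kind of capacity mismatch the referee asks the authors to rule out.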

Circularity Check

0 steps flagged

No circularity; empirical comparisons are independent of any fitted result or self-citation chain

The paper advances a hypothesis about local vs. distant token representational needs, formalizes it as the DAR method (local full-dimensional KV, distant reduced-dimensional KV), and reports direct empirical performance comparisons against full-dimensional and uniform-reduction baselines across multiple model scales. No equations, predictions, or uniqueness claims are shown to reduce by construction to fitted parameters, self-citations, or ansatzes imported from prior work by the same authors. The central claim rests on controlled experiments whose outcomes are not forced by the method definition itself. This is the expected non-finding for an architecture proposal validated by ablation-style scaling experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on two implementation choices (local window size and reduction factor) plus the standard transformer attention assumptions; no new entities are postulated.

free parameters (2)

local context window size
Defines the boundary separating full-dimensional from reduced-dimensional tokens; value not specified in abstract.
dimensionality reduction factor
Set to 1/4 of original dimensionality for distant tokens; chosen rather than derived.
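The two free parameters can be captured in a small configuration object. This is a hypothetical sketch: the class name `DARConfig`, its field names, and the divisibility check are assumptions, and no window size is filled in because none is stated in the abstract.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DARConfig:
    d: int              # full key/value dimensionality
    window: int         # local context window size (free parameter)
    reduction: int = 4  # distant tokens use d / reduction (free parameter)

    def __post_init__(self):
        if self.window < 1:
            raise ValueError("window must include at least the current token")
        if self.d % self.reduction != 0:
            raise ValueError("d must be divisible by the reduction factor")

    @property
    def d_down(self):
        """Reduced dimensionality assigned to tokens beyond the window."""
        return self.d // self.reduction

    def dim_at(self, distance):
        """Dimensionality for a token `distance` steps behind the current one."""
        return self.d if distance < self.window else self.d_down
```

Making both values explicit fields means each experiment's exact setting can be logged alongside its results, which is what the referee's first minor comment asks for.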

axioms (1)

[standard math] Decoder-only transformer attention computes over a KV cache of preceding tokens with uniform dimensionality by default.
Invoked as the baseline the paper modifies.

[1] [1]

GQA: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

2023

[2] [2]

GPT-NeoX: Large scale autoregressive language modeling in pytorch, 9 2023

Alex Andonian, Quentin Anthony, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Jason Phang, Shivanshu Purohit, Hailey Schoelkopf, Dashiell Stander, Tri Songz, Curt Tigges, Benjamin Thérien, Phil Wang, and Samuel Weinbach. GPT-NeoX: Large scale autoregressive lan...

2023

[3] [3]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

LongBench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages...

2024

[5] [5]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004

[6] [6]

Datasheet for the pile.arXiv preprint arXiv:2201.07311, 2022

Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the pile.arXiv preprint arXiv:2201.07311, 2022

work page arXiv 2022

[7] [7]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023

2023

[8] [8]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christo- pher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

2021

[9] [9]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

2022

[10] [10]

DeepSeek-V4: Towards highly efficient million-token context intelligence

DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. Technical Report, April 2026

2026

[11] [11]

SKVQ: Sliding-window key and value cache quantization for large language models.arXiv preprint arXiv:2405.06219, 2024

Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, and Dahua Lin. SKVQ: Sliding-window key and value cache quantization for large language models.arXiv preprint arXiv:2405.06219, 2024

work page arXiv 2024

[12] [12]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020

[13] [13]

A framework for few-shot language model evaluation, September 2021

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework...

2021

[14] [14]

Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Proceedings of the International Conference on Learning Representations (ICLR), 2021

2021

[15] [15]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[16] [16]

Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024

[17]

Matryoshka representation learning. Advances in Neural Information Processing Systems, 35:30233–30249, 2022

Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, et al. Matryoshka representation learning. Advances in Neural Information Processing Systems, 35:30233–30249, 2022

2022

[18]

Efficient memory management for large language model serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

2023

[19]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024

2024

[20]

SnapKV: LLM knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024

[21]

MatryoshkaKV: Adaptive KV compression via trainable orthogonal projection

Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, and Zhijie Deng. MatryoshkaKV: Adaptive KV compression via trainable orthogonal projection. arXiv preprint arXiv:2410.14731, 2024

2024

[22]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Daya Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024

2024

[23]

FineWeb-Edu: the finest collection of educational content, 2024

Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. FineWeb-Edu: the finest collection of educational content, 2024

2024

[24]

Pointer sentinel mixture models, 2016

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016

2016

[25]

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Michal Guerquin, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, Willia...

2024

[26]

Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

2020

[27]

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

2019

[28]

CommonsenseQA: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapoli...

2019

[29]

Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017

2017

[30]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

2023

[31]

XAttention: Block sparse attention with antidiagonal scoring

Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. XAttention: Block sparse attention with antidiagonal scoring. In Forty-second International Conference on Machine Learning, 2025

2025

[32]

HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

2019

[33]

H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

2023