arxiv: 2606.18587 · v1 · pith:5IZZKZQAnew · submitted 2026-06-17 · 💻 cs.CL · cs.AI

Dual Dimensionality for Local and Global Attention

Pith reviewed 2026-06-26 21:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords attention mechanisms · transformer efficiency · KV cache · dimensionality reduction · decoder-only models · pretraining · distance-adaptive representations

The pith

Distance-adaptive representations match full-dimensional attention performance by giving richer keys and values to local tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that decoder-only transformers can assign different representational capacity to tokens based on their distance from the current prediction. Local tokens, which most strongly influence the next output, receive full-dimensional key and value vectors, while distant tokens receive reduced-dimensional vectors such as one-quarter the original size. Experiments across pretraining runs from 70 million to 410 million parameters, plus continued fine-tuning on a 1-billion-parameter model, show this distance-adaptive scheme reaches accuracy levels comparable to uniform full-dimensional baselines. Uniformly lowering dimensionality for every token position instead produces clear drops in performance. The result questions the standard practice of keeping key and value dimensionality constant across the entire sequence.

Core claim

Decoder-only Transformers compute attention over the KV cache of preceding tokens with the same dimensionality for every position. The authors hypothesize that local tokens impose greater demands on representational capacity because they most strongly influence immediate next-token predictions, whereas distant tokens mainly provide long-range memory that tolerates lower dimensionality. Distance-Adaptive Representation implements this by preserving full-dimensional KV vectors inside a fixed local window and assigning reduced-dimensional vectors beyond the window. Across multiple pretraining scales and continued supervised fine-tuning, the approach matches the performance of full-dimensional m

What carries the argument

Distance-Adaptive Representation (DAR), which keeps full-dimensional KV representations inside a local context window and assigns reduced-dimensional representations to tokens beyond that window.
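The mechanism can be illustrated with a minimal NumPy sketch for a single head and a single query position. The review does not state how the reduced-dimensional keys and values are produced, so everything below is an assumption made for illustration: the names `dar_attention`, `P_k`, and `P_v` are hypothetical, distant keys and values are obtained by down-projection matrices, the query is projected with the same matrix to score distant keys, and distant values are projected back up to d before mixing. It should not be read as the authors' implementation.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def full_attention(q, K, V):
    """Standard single-head attention for one query over n cached tokens."""
    d = q.shape[0]
    return softmax(K @ q / np.sqrt(d)) @ V


def dar_attention(q, K, V, w, P_k, P_v):
    """Hypothetical distance-adaptive attention (illustrative assumption).

    q        : (d,) query of the current token
    K, V     : (n, d) keys/values of all tokens up to and including the current one
    w        : local window size (the last w tokens keep dimensionality d)
    P_k, P_v : (d, d_down) assumed down-projections for distant tokens
    """
    n, d = K.shape
    d_down = P_k.shape[1]
    split = max(n - w, 0)  # tokens [0, split) are distant

    # Distant tokens: only d_down numbers per key/value would be cached.
    K_far = K[:split] @ P_k
    V_far = V[:split] @ P_v
    # Local tokens: cached at full dimensionality d.
    K_loc, V_loc = K[split:], V[split:]

    # Score distant keys in the reduced space, local keys in the full space.
    s_far = K_far @ (P_k.T @ q) / np.sqrt(d_down)
    s_loc = K_loc @ q / np.sqrt(d)

    # One softmax over all preceding tokens, distant and local together.
    a = softmax(np.concatenate([s_far, s_loc]))

    # Map distant values back to d (assumed) and combine with local values.
    return (a[:split] @ V_far) @ P_v.T + a[split:] @ V_loc
```

Under these assumptions the sketch reduces to standard attention when the window covers the whole sequence, or when the projections are identity matrices with d_down = d, which gives a simple sanity check for any concrete variant.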

If this is right

  • Performance remains comparable to full-dimensional baselines across pretraining scales from 70M to 410M parameters.
  • Continued supervised fine-tuning on a 1B-scale model also reaches matching performance.
  • Uniform reduction of dimensionality at every token position produces worse results than the distance-adaptive approach.
  • KV cache memory footprint can be reduced during inference while preserving accuracy.
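The memory point in the last bullet can be made concrete with a small back-of-the-envelope calculation. The function name and the example numbers (4096 tokens, width 1024, window 512) are illustrative assumptions, not values reported in the paper; only the 1/4 reduction factor comes from the text above.

```python
def kv_cache_elements(n_tokens, d, w=None, d_down=None):
    """Count cached key + value scalars per layer for n_tokens positions.

    d      : per-token key (or value) width at full dimensionality
    w      : local window size; None means a uniform full-dimensional cache
    d_down : reduced width used for tokens outside the window
    """
    if w is None or d_down is None:
        return 2 * n_tokens * d
    n_local = min(n_tokens, w)      # tokens kept at width d
    n_far = n_tokens - n_local      # tokens stored at width d_down
    return 2 * (n_local * d + n_far * d_down)


# Hypothetical configuration: d_down = d / 4, as in the paper's example ratio.
full = kv_cache_elements(4096, 1024)
dar = kv_cache_elements(4096, 1024, w=512, d_down=256)
print(dar / full)  # 0.34375
```

With these assumed numbers the cache shrinks to roughly a third of its full size. The saving approaches the reduction factor itself as the sequence grows much longer than the window.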

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-versus-distant split could be applied to attention layers in non-decoder architectures.
  • Fixed local windows might be replaced by learned or task-dependent window sizes to further optimize capacity allocation.
  • Longer maximum sequence lengths could become practical under a fixed memory budget for the KV cache.

Load-bearing premise

Local tokens require richer representations than distant tokens because they matter more for predicting the immediate next output.

What would settle it

At the 410-million-parameter scale, train the distance-adaptive model and the matched full-dimensional baseline on the same pretraining data. A rise of more than 0.5 points in validation perplexity for the distance-adaptive model relative to the baseline would contradict the claim.
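A sketch of how such a check could be scripted, assuming perplexity is computed as the exponential of the mean per-token negative log-likelihood on a shared validation set. The function names and any numbers passed to them are hypothetical; the 0.5-point threshold is the one stated above.

```python
import math


def perplexity(total_nll, n_tokens):
    """Perplexity from summed negative log-likelihood (in nats) over n_tokens."""
    return math.exp(total_nll / n_tokens)


def contradicts_claim(ppl_dar, ppl_baseline, threshold=0.5):
    """True if the distance-adaptive model is worse by more than the threshold."""
    return (ppl_dar - ppl_baseline) > threshold
```

Both perplexities need to come from the same tokenizer and the same validation tokens, otherwise the difference is not comparable.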

Figures

Figures reproduced from arXiv: 2606.18587 by Sirui Zeng, Xifeng Yan, Xuan Luo, Zhiyuan Wang.

Figure 1: Tokens within a local window of size w (including the current token x_n) are represented at dimensionality d, while tokens beyond the window are represented at a lower dimensionality d_down. The current token attends to all preceding tokens.
Figure 2: Document-length distribution (CDF) of three perplexity evaluation corpora, tokenized…
Figure 3: Average perplexity ratio relative to Vanilla (= 100%, shown as horizontal line) across…
Original abstract

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of the token's distance from the prediction target. In natural language, however, the next word is most strongly influenced by the immediately preceding tokens. We hypothesize that local and distant tokens impose asymmetric demands on representational capacity: local tokens are more critical for predicting immediate outputs and thus require richer representations, whereas distant tokens primarily serve as long-range memory, for which lower-dimensional representations may suffice. We formalize this idea as Distance-Adaptive Representation (DAR), implemented in a controlled setting that preserves full-dimensional representations within a local context window while assigning reduced-dimensional representations (e.g. 1/4 of the original dimensionality) to tokens beyond that window. Across multiple pretraining scales (70M to 410M parameters), as well as continued supervised fine-tuning on a 1B-scale model, this approach closely matches the performance of full-dimensional baselines. In contrast, uniformly reducing dimensionality across all token positions leads to worse performance. These results challenge the common assumption that key and value dimensionality should be uniform across token positions. Our findings suggest a new direction for designing attention architectures that adaptively allocate representational capacity across sequences, enabling further reductions in KV cache during inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Distance-Adaptive Representation (DAR) for decoder-only Transformers, hypothesizing asymmetric representational demands: local tokens (within a context window) require full-dimensional keys/values while distant tokens can use reduced dimensionality (e.g., 1/4). Implemented in a controlled setting, DAR is evaluated on pretraining runs from 70M to 410M parameters and continued supervised fine-tuning on a 1B-scale model. Results indicate DAR matches full-dimensional baseline performance, whereas uniform dimensionality reduction across all positions degrades results. The work challenges the uniform KV dimensionality assumption and suggests potential KV cache reductions at inference.

Significance. If the central empirical claim holds under fully controlled conditions, the result would be significant for efficient Transformer design: it provides evidence that representational capacity can be allocated non-uniformly by token distance without performance loss, opening a direction for adaptive attention mechanisms that reduce memory usage during inference while preserving modeling quality.

major comments (2)
  1. [§3] The skeptic's concern is load-bearing: the central claim requires that DAR and the uniform-reduction baseline differ only in the hypothesized local-vs-distant asymmetry. §3 (Method) and the experimental setup do not provide sufficient detail on the projection mechanism for reduced-dimensional KV (separate matrices? shared weights? padding/concatenation?), leaving open the possibility that performance parity arises from uncontrolled differences in parameter count or attention computation rather than the distance-adaptive hypothesis.
  2. [Table 2] Table 2 (or equivalent results table): the reported matching performance across pretraining scales lacks explicit controls for total parameter count or effective head capacity between DAR and the full-dimensional baseline; without these, the contrast to uniform reduction cannot be attributed solely to the local-window design.
minor comments (2)
  1. The exact local window size and reduction factor used in each scale experiment should be stated explicitly (they are listed as free parameters in the design) rather than described only as 'e.g. 1/4'.
  2. [§4] Clarify in the abstract and §4 whether the 1B-scale fine-tuning results use the same DAR configuration as the pretraining runs or a modified one.
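The parameter-count concern in the major comments can be illustrated by counting key/value projection weights per layer under a few candidate mechanisms. The scheme names and the mechanisms themselves are assumptions chosen for illustration, since the method section's actual choice is exactly what the referee asks the authors to specify.

```python
def kv_projection_params(d_model, d, d_down, scheme):
    """Key + value projection weights per layer (biases ignored).

    Hypothetical schemes, none confirmed by the paper:
      "full"          : uniform full-dimensional baseline
      "uniform"       : every position reduced to d_down
      "dar_separate"  : separate full and reduced projection matrices
      "dar_truncated" : reduced vectors are the first d_down dims of the full ones
    """
    full = 2 * d_model * d
    down = 2 * d_model * d_down
    if scheme == "full":
        return full
    if scheme == "uniform":
        return down
    if scheme == "dar_separate":
        return full + down
    if scheme == "dar_truncated":
        return full  # shared weights, no extra parameters
    raise ValueError(f"unknown scheme: {scheme}")


for s in ("full", "uniform", "dar_separate", "dar_truncated"):
    print(s, kv_projection_params(1024, 1024, 256, s))
```

Under these assumptions a separate-matrix variant carries 25% more key/value projection weights than the baseline, while a truncation variant carries none, which is why the answer affects how the comparison with uniform reduction should be read.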

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on implementation details and experimental controls. We agree these clarifications will strengthen the paper and will revise accordingly.

point-by-point responses
  1. Referee: [§3] The skeptic's concern is load-bearing: the central claim requires that DAR and the uniform-reduction baseline differ only in the hypothesized local-vs-distant asymmetry. §3 (Method) and the experimental setup do not provide sufficient detail on the projection mechanism for reduced-dimensional KV (separate matrices? shared weights? padding/concatenation?), leaving open the possibility that performance parity arises from uncontrolled differences in parameter count or attention computation rather than the distance-adaptive hypothesis.

    Authors: We agree that §3 requires additional detail to rule out alternative explanations. In the revision we will expand the method description to specify the projection mechanism for reduced-dimensional KV (including matrix sharing or separation and any padding/concatenation steps) and confirm that attention computation remains identical across conditions. This will make explicit that the sole controlled difference is the distance-based dimensionality allocation. revision: yes

  2. Referee: [Table 2] Table 2 (or equivalent results table): the reported matching performance across pretraining scales lacks explicit controls for total parameter count or effective head capacity between DAR and the full-dimensional baseline; without these, the contrast to uniform reduction cannot be attributed solely to the local-window design.

    Authors: We will add explicit parameter-count and head-capacity comparisons (for DAR, full baseline, and uniform-reduction models) to the results section and Table 2. This will allow direct verification that performance differences are attributable to the local-vs-distant asymmetry rather than capacity mismatches. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical comparisons are independent of any fitted result or self-citation chain

full rationale

The paper advances a hypothesis about local vs. distant token representational needs, formalizes it as the DAR method (local full-dimensional KV, distant reduced-dimensional KV), and reports direct empirical performance comparisons against full-dimensional and uniform-reduction baselines across multiple model scales. No equations, predictions, or uniqueness claims are shown to reduce by construction to fitted parameters, self-citations, or ansatzes imported from prior work by the same authors. The central claim rests on controlled experiments whose outcomes are not forced by the method definition itself. This is the expected non-finding for an architecture proposal validated by ablation-style scaling experiments.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on two implementation choices (local window size and reduction factor) plus the standard transformer attention assumptions; no new entities are postulated.

free parameters (2)
  • local context window size
    Defines the boundary separating full-dimensional from reduced-dimensional tokens; value not specified in abstract.
  • dimensionality reduction factor
    Set to 1/4 of original dimensionality for distant tokens; chosen rather than derived.
axioms (1)
  • [standard math] Decoder-only transformer attention computes over a KV cache of preceding tokens with uniform dimensionality by default.
    Invoked as the baseline the paper modifies.

pith-pipeline@v0.9.1-grok · 5756 in / 1366 out tokens · 25484 ms · 2026-06-26T21:19:51.465145+00:00 · methodology

