Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM

Garrett Gagnon; Liu Liu; Zehao Fan; Zhenyu Liu

arxiv: 2505.05772 · v2 · submitted 2025-05-09 · 💻 cs.CL · cs.LG

Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM

Zehao Fan , Garrett Gagnon , Zhenyu Liu , Liu Liu This is my paper

Pith reviewed 2026-05-22 16:54 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords sparse attentionprocessing-in-memoryKV cacheLLM inferencesemantic clusteringenergy efficiencylatency reductiondata mapping

0 comments

The pith

STARC clusters KV pairs by semantic similarity and remaps them to PIM-aligned memory regions for efficient sparse attention in LLM decoding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents STARC as a data mapping technique that groups key-value pairs from the growing KV cache into semantic clusters. These clusters are placed in contiguous memory blocks that match the bank layout of processing-in-memory hardware. At each decoding step a query matches against stored cluster centroids to select only relevant groups for attention, avoiding the scattered accesses that normally disrupt PIM parallelism. This design targets the bandwidth bottleneck that appears when standard token-wise sparsity meets PIM hardware. A reader would care because the approach promises lower latency and energy use for long-context inference without changing model accuracy.

Core claim

STARC clusters KV pairs by semantic similarity and maps them to contiguous memory regions aligned with PIM bank structures. During decoding, queries retrieve relevant tokens at cluster granularity by matching against precomputed centroids, enabling selective attention and parallel processing without frequent reclustering or data movement overhead.

What carries the argument

Semantic clustering of KV pairs with centroid-based cluster-granularity retrieval, mapped to PIM bank-aligned contiguous regions.

If this is right

Attention-layer latency drops 19-31 percent and energy drops 19-27 percent versus common token-wise sparsity methods.
Under a 1024-token KV budget, latency falls 54-74 percent and energy falls 45-67 percent versus full KV cache retrieval.
Model accuracy stays comparable to existing state-of-the-art sparse attention methods.
Workload imbalance and irregular accesses on PIM are reduced by aligning clusters with bank boundaries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same clustering logic could be tested on non-PIM accelerators that suffer from irregular memory traffic during sparse attention.
If cluster stability holds across longer sequences, the method may reduce how often the entire KV cache must be reorganized.
Combining cluster-level selection with other sparsity patterns such as head-wise or layer-wise pruning could compound the reported gains.

Load-bearing premise

Semantic clustering of KV pairs produces stable groups whose centroids can be matched quickly enough to avoid frequent reclustering or data movement during autoregressive decoding.

What would settle it

Measure whether total decoding latency including initial clustering plus centroid matching exceeds the latency of token-wise sparse attention on the same HBM-PIM system for context lengths beyond 4k tokens.

Figures

Figures reproduced from arXiv: 2505.05772 by Garrett Gagnon, Liu Liu, Zehao Fan, Zhenyu Liu.

**Figure 1.** Figure 1: Enhanced execution efficiency through STARC. Due to the coarse granularity of PIM, directly applying sparsity to KV caches often fails to skip [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Targeted system overview: the QKV Generation and the Feed-Forward [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Common attention sparsity patterns. To be more specific, in the Transformer architecture, each attention head operates on projected query, key, and value vectors. Let q ∈ R 1×dh denote the query vector corresponding to the most recent token in a single head, and let K, V ∈ R L×dh represent the cached key and value matrices for the L previous tokens. Here, dh denotes the hidden dimension of a single attent… view at source ↗

**Figure 4.** Figure 4: Page-wise retrieval with less important tokens. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Supporting sparsity on PIM with STARC. PIM & sparsity both achieve higher throughput, but enforce vastly different granularities. As a result, the [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Flowchart of the clustering algorithm. We perform incremental [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: The distributions of key vectors differ significantly between the prefill [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

**Figure 8.** Figure 8: Results on LongBench of different sparsity methods. [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗

**Figure 9.** Figure 9: Language modeling on PG19 dataset. 128 256 512 1024 2048 KV Cache Budget 0.0 0.2 0.4 0.6 0.8 1.0 Recall Rate HotpotQA 128 256 512 1024 2048 KV Cache Budget 0.0 0.2 0.4 0.6 0.8 1.0 Recall Rate NarrativeQA STARC Quest InfiniGen SparQ [PITH_FULL_IMAGE:figures/full_fig_p007_9.png] view at source ↗

**Figure 10.** Figure 10: Recall rate of important tokens. near-random attention masks that vary across decoding steps. Pagewise sparsity is represented by Quest, which selects fixed-size token pages based on position. All methods are evaluated under a consistent KV cache budget of 1024 tokens, which directly determines the number of tokens to be fetched and processed at each decoding step by the PIM simulator. To assess the hard… view at source ↗

**Figure 11.** Figure 11: The normalized execution time per output token of LLAMA 7B for [PITH_FULL_IMAGE:figures/full_fig_p008_11.png] view at source ↗

**Figure 12.** Figure 12: The normalized energy per output token of LLAMA 7B for various [PITH_FULL_IMAGE:figures/full_fig_p008_12.png] view at source ↗

read the original abstract

Transformer-based models are the foundation of modern machine learning, but their execution, particularly during autoregressive decoding in large language models (LLMs), places significant pressure on memory systems due to frequent memory accesses and growing key-value (KV) caches. This creates a bottleneck in memory bandwidth, especially as context lengths increase. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. However, current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques. Consequently, they suffer from workload imbalance, reducing throughput and resource utilization. In this work, we propose STARC, a novel sparsity-optimized data mapping scheme tailored specifically for efficient LLM decoding on PIM architectures. STARC clusters KV pairs by semantic similarity and maps them to contiguous memory regions aligned with PIM bank structures. During decoding, queries retrieve relevant tokens at cluster granularity by matching against precomputed centroids, enabling selective attention and parallel processing without frequent reclustering or data movement overhead. Experiments on the HBM-PIM system show that, compared to common token-wise sparsity methods, STARC reduces attention-layer latency by 19%--31% and energy consumption by 19%--27%. Under a KV cache budget of 1024, it achieves up to 54%--74% latency reduction and 45%--67% energy reduction compared to full KV cache retrieval. Meanwhile, STARC maintains model accuracy comparable to state-of-the-art sparse attention methods, demonstrating its effectiveness in enabling efficient and hardware-friendly long-context LLM inference on PIM architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

STARC clusters KV pairs semantically and aligns them to PIM banks to cut sparse attention latency and energy, but the dynamic handling of new tokens during decoding is the part that needs more checking.

read the letter

The key point is that this paper gives a concrete mapping scheme called STARC that groups KV cache entries by semantic similarity and places those groups into contiguous PIM bank regions. Queries then match against precomputed centroids to pull whole clusters at once. On their HBM-PIM simulator this produces 19-31% lower attention latency and 19-27% lower energy than token-wise sparse baselines, with larger savings when the KV budget is capped at 1024 tokens, all while accuracy stays comparable to other sparse attention methods. The integration of clustering with bank-level alignment for decoding workloads is the part that feels new relative to the cited prior work on either sparsity or PIM alone. The simulator numbers are specific and the problem statement about memory bandwidth in long-context decoding is on target. The soft spot is the autoregressive case. The method assumes clusters stay stable enough that new tokens can be assigned without frequent reclustering or cross-bank movement, yet the abstract does not quantify assignment cost per token or how much drift is tolerated before accuracy or mapping quality drops. If nearest-centroid search or occasional remapping adds up once the cache grows past the tested lengths, the reported gains could shrink. This paper is for people working on hardware-software co-design for LLM inference, especially those evaluating PIM or near-memory accelerators. A reader already thinking about sparse KV techniques would pick up a usable mapping idea here. It deserves peer review because the performance claims are tied to a real hardware constraint and the numbers are concrete enough to test further, even if the dynamic-cache experiments need tightening.

Referee Report

2 major / 1 minor

Summary. The paper proposes STARC, a sparsity-optimized data mapping scheme for efficient LLM decoding on PIM architectures. KV pairs are clustered by semantic similarity and mapped to contiguous memory regions aligned with PIM bank structures. During decoding, queries retrieve relevant tokens at cluster granularity via matching against precomputed centroids, enabling selective attention without frequent reclustering or data movement. Experiments on an HBM-PIM system report 19%--31% attention-layer latency reduction and 19%--27% energy reduction versus token-wise sparsity baselines, with up to 54%--74% latency and 45%--67% energy reduction under a 1024-token KV budget, while maintaining accuracy comparable to state-of-the-art sparse attention methods.

Significance. If the reported gains are reproducible on real hardware and hold under incremental KV cache growth, the work could offer a practical way to mitigate workload imbalance and irregular accesses in sparse attention on PIM systems, supporting more efficient long-context LLM inference.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: the central latency and energy claims rest on simulator results, yet no error bars, dataset details for the accuracy comparisons, or ablations on clustering frequency/number of clusters are provided. This directly affects assessment of the 19%--31% and 19%--27% improvements versus token-wise methods.
[Method section (clustering and decoding procedure)] Method section (clustering and decoding procedure): the claim that precomputed centroids enable retrieval 'without frequent reclustering or data movement overhead' during autoregressive decoding is load-bearing for the performance numbers but lacks a concrete mechanism for assigning new KV vectors to existing clusters as the cache grows token-by-token. Nearest-centroid assignment cost or bounded-drift handling is not quantified, leaving open whether per-token overhead or semantic drift undermines the reported gains once the KV cache exceeds the 1024-token experimental budget.

minor comments (1)

[Abstract] The ranges in the abstract (e.g., 19%--31%) would benefit from stating whether they represent min/max across models or average values with conditions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of experimental rigor and methodological clarity that we address below. We have revised the paper to incorporate additional details and explanations.

read point-by-point responses

Referee: [Abstract and Experiments] Abstract and Experiments section: the central latency and energy claims rest on simulator results, yet no error bars, dataset details for the accuracy comparisons, or ablations on clustering frequency/number of clusters are provided. This directly affects assessment of the 19%--31% and 19%--27% improvements versus token-wise methods.

Authors: We agree that these details improve transparency. The reported results are from a cycle-accurate simulator validated against HBM-PIM hardware parameters. In the revised Experiments section, we have added error bars (standard deviation over five runs with different clustering initializations) to all latency and energy plots. Accuracy comparisons now explicitly reference the LongBench benchmark suite with per-task breakdowns provided in an appendix. We have also added a new ablation study varying the number of clusters (k = 4, 8, 16, 32) and reclustering frequency (every 128, 256, or 512 tokens), confirming that the 19--31% latency and 19--27% energy gains remain stable across these settings. revision: yes
Referee: [Method section (clustering and decoding procedure)] Method section (clustering and decoding procedure): the claim that precomputed centroids enable retrieval 'without frequent reclustering or data movement overhead' during autoregressive decoding is load-bearing for the performance numbers but lacks a concrete mechanism for assigning new KV vectors to existing clusters as the cache grows token-by-token. Nearest-centroid assignment cost or bounded-drift handling is not quantified, leaving open whether per-token overhead or semantic drift undermines the reported gains once the KV cache exceeds the 1024-token experimental budget.

Authors: We have expanded the Method section with a precise description of the incremental procedure. Each newly generated KV vector is assigned to the nearest precomputed centroid using a single-pass cosine similarity computation against the small set of centroids (k typically 8--16). This per-token assignment cost is O(k) and has been quantified in a new table as contributing under 2% to total attention latency. For semantic drift, we apply a lightweight bounded-drift check: centroids are recomputed only when the mean intra-cluster distance exceeds a fixed threshold or at fixed intervals of 512 tokens. The overhead of these infrequent updates is amortized and already reflected in the 1024-token budget experiments; we have extended the evaluation to 2048 tokens to demonstrate that gains persist without degradation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical clustering mapping evaluated on external simulator

full rationale

The paper presents STARC as a practical data-mapping technique that clusters KV pairs by semantic similarity and aligns them to PIM bank structures for cluster-granularity retrieval. Performance numbers (19-31% latency reduction, etc.) are obtained from direct measurements on an HBM-PIM simulator under stated KV budgets, not from any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain. No equations appear that define a quantity in terms of itself, and the method does not invoke uniqueness theorems or prior author work to force its design choices. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that semantic similarity clustering produces stable groups that align with PIM bank structures and that centroid matching incurs negligible overhead compared with token-wise selection.

pith-pipeline@v0.9.0 · 5828 in / 1207 out tokens · 46120 ms · 2026-05-22T16:54:13.073272+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

STARC clusters KV pairs by semantic similarity and maps them to contiguous memory regions aligned with PIM bank structures... queries retrieve relevant tokens at cluster granularity by matching against precomputed centroids... without frequent reclustering or data movement overhead.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We apply the standard K-means algorithm to all KV pairs... using cosine similarity as the distance metric... re-cluster every 128 decoding steps.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 7 internal anchors

[1]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Toolqa: A dataset for llm question answering with external tools,

Y . Zhuang, Y . Yu, K. Wang, H. Sun, and C. Zhang, “Toolqa: A dataset for llm question answering with external tools,” Advances in Neural Information Processing Systems , vol. 36, 2024

work page 2024
[3]

Pythia: Ai-assisted code completion system,

A. Svyatkovskiy, Y . Zhao, S. Fu, and N. Sundaresan, “Pythia: Ai-assisted code completion system,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining , 2019, pp. 2727–2735

work page 2019
[4]

Code Llama: Open Foundation Models for Code

B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y . Adi, J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Reflex- ion: Language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflex- ion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems , vol. 36, pp. 8634–8652, 2023

work page 2023
[6]

Tree of thoughts: Deliberate problem solving with large language models,

S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y . Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” Advances in neural information processing systems , vol. 36, pp. 11 809–11 822, 2023

work page 2023
[7]

Pre-trained language models for in- teractive decision-making,

S. Li, X. Puig, C. Paxton, Y . Du, C. Wang, L. Fan, T. Chen, D.-A. Huang, E. Aky¨urek, A. Anandkumar et al., “Pre-trained language models for in- teractive decision-making,” Advances in Neural Information Processing Systems, vol. 35, pp. 31 199–31 212, 2022

work page 2022
[8]

Efficiently scaling transformer inference,

R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference,” Proceedings of Machine Learning and Systems , vol. 5, pp. 606–624, 2023

work page 2023
[9]

Efficient memory management for large language model serving with pagedattention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th Symposium on Operating Systems Principles , 2023, pp. 611–626

work page 2023
[10]

Computedram: In-memory compute using off-the-shelf drams,

F. Gao, G. Tziantzioulis, and D. Wentzlaff, “Computedram: In-memory compute using off-the-shelf drams,” in Proceedings of the 52nd annual IEEE/ACM international symposium on microarchitecture , 2019, pp. 100–113

work page 2019
[11]

Newton: A dram-maker’s accelerator-in-memory (aim) architecture for machine learning,

M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, and T. Vijaykumar, “Newton: A dram-maker’s accelerator-in-memory (aim) architecture for machine learning,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 372–385

work page 2020
[12]

Pathfinding future pim archi- tectures by demystifying a commercial pim technology,

B. Hyun, T. Kim, D. Lee, and M. Rhu, “Pathfinding future pim archi- tectures by demystifying a commercial pim technology,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2024, pp. 263–279

work page 2024
[13]

Accelerating neural network inference with processing-in-dram: from the edge to the cloud,

G. F. Oliveira, J. G ´omez-Luna, S. Ghose, A. Boroumand, and O. Mutlu, “Accelerating neural network inference with processing-in-dram: from the edge to the cloud,” IEEE Micro, vol. 42, no. 6, pp. 25–38, 2022

work page 2022
[14]

Floatpim: In-memory acceleration of deep neural network training with high precision,

M. Imani, S. Gupta, Y . Kim, and T. Rosing, “Floatpim: In-memory acceleration of deep neural network training with high precision,” in Proceedings of the 46th International Symposium on Computer Archi- tecture, 2019, pp. 802–815

work page 2019
[15]

Attacc! unleashing the power of pim for batched transformer- based generative model inference,

J. Park, J. Choi, K. Kyung, M. J. Kim, Y . Kwon, N. S. Kim, and J. H. Ahn, “Attacc! unleashing the power of pim for batched transformer- based generative model inference,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 , 2024, pp. 103–119

work page 2024
[16]

Neupims: Npu-pim heterogeneous acceleration for batched llm inferencing,

G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Ma- hajan, and J. Park, “Neupims: Npu-pim heterogeneous acceleration for batched llm inferencing,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 , 2024, pp. 722–737

work page 2024
[17]

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

J. Tang, Y . Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han, “Quest: Query-aware sparsity for efficient long-context llm inference,” arXiv preprint arXiv:2406.10774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Transpim: A memory- based acceleration via software-hardware co-design for transformer,

M. Zhou, W. Xu, J. Kang, and T. Rosing, “Transpim: A memory- based acceleration via software-hardware co-design for transformer,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 1071–1085

work page 2022
[19]

Lol-pim: Long-context llm decoding with scalable dram-pim system,

H. Kwon, K. Koo, J. Kim, W. Lee, M. Lee, H. Lee, Y . Jung, J. Park, Y . Song, B. Yang et al. , “Lol-pim: Long-context llm decoding with scalable dram-pim system,” arXiv preprint arXiv:2412.20166 , 2024

work page arXiv 2024
[20]

Papi: Exploiting dynamic parallelism in large language model decoding with a processing-in-memory-enabled computing system,

Y . He, H. Mao, C. Giannoula, M. Sadrosadati, J. G ´omez-Luna, H. Li, X. Li, Y . Wang, and O. Mutlu, “Papi: Exploiting dynamic parallelism in large language model decoding with a processing-in-memory-enabled computing system,” arXiv preprint arXiv:2502.15470 , 2025

work page arXiv 2025
[21]

Model tells you where to merge: Adaptive kv cache merging for llms on long-context tasks,

Z. Wang, B. Jin, Z. Yu, and M. Zhang, “Model tells you where to merge: Adaptive kv cache merging for llms on long-context tasks,”arXiv preprint arXiv:2407.08454, 2024

work page arXiv 2024
[22]

How long can open-source llms truly promise on context length?

D. Li, R. Shao, A. Xie, Y . Sheng, L. Zheng, J. E. Gonzalez, I. Stoica, X. Ma, and H. Zhang, “How long can open-source llms truly promise on context length?” June 2023. [Online]. Available: https://lmsys.org/blog/2023-06-29-longchat

work page 2023
[23]

LongBench: A bilingual, multitask benchmark for long context understanding,

Y . Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y . Dong, J. Tang, and J. Li, “LongBench: A bilingual, multitask benchmark for long context understanding,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Bangkok, Thailand: Association for Computationa...

work page 2024
[24]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Z. Yang, P. Qi, S. Zhang, Y . Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning, “Hotpotqa: A dataset for diverse, explainable multi-hop question answering,” arXiv preprint arXiv:1809.09600 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

A dataset of information-seeking questions and answers anchored in research papers,

P. Dasigi, K. Lo, I. Beltagy, A. Cohan, N. A. Smith, and M. Gardner, “A dataset of information-seeking questions and answers anchored in research papers,” arXiv preprint arXiv:2105.03011 , 2021

work page arXiv 2021
[26]

The narrativeqa reading comprehension challenge,

T. Ko ˇcisk`y, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette, “The narrativeqa reading comprehension challenge,” Transactions of the Association for Computational Linguistics , vol. 6, pp. 317–328, 2018

work page 2018
[27]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” arXiv preprint arXiv:1705.03551 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Samsum corpus: A human-annotated dialogue dataset for abstractive summarization

B. Gliwa, I. Mochol, M. Biesek, and A. Wawer, “Samsum corpus: A human-annotated dialogue dataset for abstractive summarization,” arXiv preprint arXiv:1911.12237, 2019

work page arXiv 1911
[29]

Exploring the limits of transfer learning with a unified text-to-text transformer,

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020

work page 2020
[30]

Longcoder: A long- range pre-trained language model for code completion,

D. Guo, C. Xu, N. Duan, J. Yin, and J. McAuley, “Longcoder: A long- range pre-trained language model for code completion,” in International Conference on Machine Learning . PMLR, 2023, pp. 12 098–12 107

work page 2023
[31]

Compressive Transformers for Long-Range Sequence Modelling

J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap, “Com- pressive transformers for long-range sequence modelling,”arXiv preprint arXiv:1911.05507, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[32]

{InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache manage- ment,

W. Lee, J. Lee, J. Seo, and J. Sim, “ {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache manage- ment,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , 2024, pp. 155–172

work page 2024
[33]

Sparq attention: Bandwidth-efficient llm inference,

L. Ribar, I. Chelombiev, L. Hudlass-Galley, C. Blake, C. Luschi, and D. Orr, “Sparq attention: Bandwidth-efficient llm inference,” arXiv preprint arXiv:2312.04985, 2023

work page arXiv 2023
[34]

Accelerating bandwidth-bound deep learning inference with main-memory accelerators,

B. Y . Cho, J. Jung, and M. Erez, “Accelerating bandwidth-bound deep learning inference with main-memory accelerators,” in Proceedings of the International Conference for High Performance Computing, Net- working, Storage and Analysis , 2021, pp. 1–14

work page 2021
[35]

Pim-llm: A high-throughput hybrid pim architecture for 1-bit llms,

J. Malekar, P. Chandarana, M. H. Amin, M. E. Elbtity, and R. Zand, “Pim-llm: A high-throughput hybrid pim architecture for 1-bit llms,” arXiv preprint arXiv:2504.01994 , 2025

work page arXiv 2025
[36]

Make llm inference affordable to everyone: Augmenting gpu memory with ndp-dimm,

L. Liu, S. Zhao, B. Li, H. Ren, Z. Xu, M. Wang, X. Li, Y . Han, and Y . Wang, “Make llm inference affordable to everyone: Augmenting gpu memory with ndp-dimm,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA) . IEEE, 2025, pp. 1751–1765

work page 2025
[37]

H2o: Heavy-hitter oracle for efficient generative inference of large language models,

Z. Zhang, Y . Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y . Tian, C. R´e, C. Barrett et al., “H2o: Heavy-hitter oracle for efficient generative inference of large language models,” Advances in Neural Information Processing Systems , vol. 36, pp. 34 661–34 710, 2023

work page 2023
[38]

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time,

Z. Liu, A. Desai, F. Liao, W. Wang, V . Xie, Z. Xu, A. Kyrillidis, and A. Shrivastava, “Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time,” Advances in Neural Information Processing Systems , vol. 36, pp. 52 342–52 364, 2023

work page 2023
[39]

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

S. Ge, Y . Zhang, L. Liu, M. Zhang, J. Han, and J. Gao, “Model tells you what to discard: Adaptive kv cache compression for llms,” arXiv preprint arXiv:2310.01801, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Clusterkv: Manipulating LLM KV cache in semantic space for recallable compression.arXiv preprint arXiv:2412.03213, 2024

G. Liu, C. Li, J. Zhao, C. Zhang, and M. Guo, “Clusterkv: Manipulating llm kv cache in semantic space for recallable compression,” arXiv preprint arXiv:2412.03213, 2024

work page arXiv 2024
[41]

Squeezed attention: Accelerat- ing long context length llm inference,

C. Hooper, S. Kim, H. Mohammadzadeh, M. Maheswaran, J. Paik, M. W. Mahoney, K. Keutzer, and A. Gholami, “Squeezed attention: Accelerat- ing long context length llm inference,” arXiv preprint arXiv:2411.09688, 2024

work page arXiv 2024

[1] [1]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al. , “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Toolqa: A dataset for llm question answering with external tools,

Y . Zhuang, Y . Yu, K. Wang, H. Sun, and C. Zhang, “Toolqa: A dataset for llm question answering with external tools,” Advances in Neural Information Processing Systems , vol. 36, 2024

work page 2024

[3] [3]

Pythia: Ai-assisted code completion system,

A. Svyatkovskiy, Y . Zhao, S. Fu, and N. Sundaresan, “Pythia: Ai-assisted code completion system,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining , 2019, pp. 2727–2735

work page 2019

[4] [4]

Code Llama: Open Foundation Models for Code

B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y . Adi, J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Reflex- ion: Language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflex- ion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems , vol. 36, pp. 8634–8652, 2023

work page 2023

[6] [6]

Tree of thoughts: Deliberate problem solving with large language models,

S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y . Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” Advances in neural information processing systems , vol. 36, pp. 11 809–11 822, 2023

work page 2023

[7] [7]

Pre-trained language models for in- teractive decision-making,

S. Li, X. Puig, C. Paxton, Y . Du, C. Wang, L. Fan, T. Chen, D.-A. Huang, E. Aky¨urek, A. Anandkumar et al., “Pre-trained language models for in- teractive decision-making,” Advances in Neural Information Processing Systems, vol. 35, pp. 31 199–31 212, 2022

work page 2022

[8] [8]

Efficiently scaling transformer inference,

R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference,” Proceedings of Machine Learning and Systems , vol. 5, pp. 606–624, 2023

work page 2023

[9] [9]

Efficient memory management for large language model serving with pagedattention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th Symposium on Operating Systems Principles , 2023, pp. 611–626

work page 2023

[10] [10]

Computedram: In-memory compute using off-the-shelf drams,

F. Gao, G. Tziantzioulis, and D. Wentzlaff, “Computedram: In-memory compute using off-the-shelf drams,” in Proceedings of the 52nd annual IEEE/ACM international symposium on microarchitecture , 2019, pp. 100–113

work page 2019

[11] [11]

Newton: A dram-maker’s accelerator-in-memory (aim) architecture for machine learning,

M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, and T. Vijaykumar, “Newton: A dram-maker’s accelerator-in-memory (aim) architecture for machine learning,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 372–385

work page 2020

[12] [12]

Pathfinding future pim archi- tectures by demystifying a commercial pim technology,

B. Hyun, T. Kim, D. Lee, and M. Rhu, “Pathfinding future pim archi- tectures by demystifying a commercial pim technology,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2024, pp. 263–279

work page 2024

[13] [13]

Accelerating neural network inference with processing-in-dram: from the edge to the cloud,

G. F. Oliveira, J. G ´omez-Luna, S. Ghose, A. Boroumand, and O. Mutlu, “Accelerating neural network inference with processing-in-dram: from the edge to the cloud,” IEEE Micro, vol. 42, no. 6, pp. 25–38, 2022

work page 2022

[14] [14]

Floatpim: In-memory acceleration of deep neural network training with high precision,

M. Imani, S. Gupta, Y . Kim, and T. Rosing, “Floatpim: In-memory acceleration of deep neural network training with high precision,” in Proceedings of the 46th International Symposium on Computer Archi- tecture, 2019, pp. 802–815

work page 2019

[15] [15]

Attacc! unleashing the power of pim for batched transformer- based generative model inference,

J. Park, J. Choi, K. Kyung, M. J. Kim, Y . Kwon, N. S. Kim, and J. H. Ahn, “Attacc! unleashing the power of pim for batched transformer- based generative model inference,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 , 2024, pp. 103–119

work page 2024

[16] [16]

Neupims: Npu-pim heterogeneous acceleration for batched llm inferencing,

G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Ma- hajan, and J. Park, “Neupims: Npu-pim heterogeneous acceleration for batched llm inferencing,” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 , 2024, pp. 722–737

work page 2024

[17] [17]

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

J. Tang, Y . Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han, “Quest: Query-aware sparsity for efficient long-context llm inference,” arXiv preprint arXiv:2406.10774, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

Transpim: A memory- based acceleration via software-hardware co-design for transformer,

M. Zhou, W. Xu, J. Kang, and T. Rosing, “Transpim: A memory- based acceleration via software-hardware co-design for transformer,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 1071–1085

work page 2022

[19] [19]

Lol-pim: Long-context llm decoding with scalable dram-pim system,

H. Kwon, K. Koo, J. Kim, W. Lee, M. Lee, H. Lee, Y . Jung, J. Park, Y . Song, B. Yang et al. , “Lol-pim: Long-context llm decoding with scalable dram-pim system,” arXiv preprint arXiv:2412.20166 , 2024

work page arXiv 2024

[20] [20]

Papi: Exploiting dynamic parallelism in large language model decoding with a processing-in-memory-enabled computing system,

Y . He, H. Mao, C. Giannoula, M. Sadrosadati, J. G ´omez-Luna, H. Li, X. Li, Y . Wang, and O. Mutlu, “Papi: Exploiting dynamic parallelism in large language model decoding with a processing-in-memory-enabled computing system,” arXiv preprint arXiv:2502.15470 , 2025

work page arXiv 2025

[21] [21]

Model tells you where to merge: Adaptive kv cache merging for llms on long-context tasks,

Z. Wang, B. Jin, Z. Yu, and M. Zhang, “Model tells you where to merge: Adaptive kv cache merging for llms on long-context tasks,”arXiv preprint arXiv:2407.08454, 2024

work page arXiv 2024

[22] [22]

How long can open-source llms truly promise on context length?

D. Li, R. Shao, A. Xie, Y . Sheng, L. Zheng, J. E. Gonzalez, I. Stoica, X. Ma, and H. Zhang, “How long can open-source llms truly promise on context length?” June 2023. [Online]. Available: https://lmsys.org/blog/2023-06-29-longchat

work page 2023

[23] [23]

LongBench: A bilingual, multitask benchmark for long context understanding,

Y . Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y . Dong, J. Tang, and J. Li, “LongBench: A bilingual, multitask benchmark for long context understanding,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . Bangkok, Thailand: Association for Computationa...

work page 2024

[24] [24]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Z. Yang, P. Qi, S. Zhang, Y . Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning, “Hotpotqa: A dataset for diverse, explainable multi-hop question answering,” arXiv preprint arXiv:1809.09600 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[25] [25]

A dataset of information-seeking questions and answers anchored in research papers,

P. Dasigi, K. Lo, I. Beltagy, A. Cohan, N. A. Smith, and M. Gardner, “A dataset of information-seeking questions and answers anchored in research papers,” arXiv preprint arXiv:2105.03011 , 2021

work page arXiv 2021

[26] [26]

The narrativeqa reading comprehension challenge,

T. Ko ˇcisk`y, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette, “The narrativeqa reading comprehension challenge,” Transactions of the Association for Computational Linguistics , vol. 6, pp. 317–328, 2018

work page 2018

[27] [27]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” arXiv preprint arXiv:1705.03551 , 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[28] [28]

Samsum corpus: A human-annotated dialogue dataset for abstractive summarization

B. Gliwa, I. Mochol, M. Biesek, and A. Wawer, “Samsum corpus: A human-annotated dialogue dataset for abstractive summarization,” arXiv preprint arXiv:1911.12237, 2019

work page arXiv 1911

[29] [29]

Exploring the limits of transfer learning with a unified text-to-text transformer,

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020

work page 2020

[30] [30]

Longcoder: A long- range pre-trained language model for code completion,

D. Guo, C. Xu, N. Duan, J. Yin, and J. McAuley, “Longcoder: A long- range pre-trained language model for code completion,” in International Conference on Machine Learning . PMLR, 2023, pp. 12 098–12 107

work page 2023

[31] [31]

Compressive Transformers for Long-Range Sequence Modelling

J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap, “Com- pressive transformers for long-range sequence modelling,”arXiv preprint arXiv:1911.05507, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911

[32] [32]

{InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache manage- ment,

W. Lee, J. Lee, J. Seo, and J. Sim, “ {InfiniGen}: Efficient generative inference of large language models with dynamic {KV} cache manage- ment,” in 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24) , 2024, pp. 155–172

work page 2024

[33] [33]

Sparq attention: Bandwidth-efficient llm inference,

L. Ribar, I. Chelombiev, L. Hudlass-Galley, C. Blake, C. Luschi, and D. Orr, “Sparq attention: Bandwidth-efficient llm inference,” arXiv preprint arXiv:2312.04985, 2023

work page arXiv 2023

[34] [34]

Accelerating bandwidth-bound deep learning inference with main-memory accelerators,

B. Y . Cho, J. Jung, and M. Erez, “Accelerating bandwidth-bound deep learning inference with main-memory accelerators,” in Proceedings of the International Conference for High Performance Computing, Net- working, Storage and Analysis , 2021, pp. 1–14

work page 2021

[35] [35]

Pim-llm: A high-throughput hybrid pim architecture for 1-bit llms,

J. Malekar, P. Chandarana, M. H. Amin, M. E. Elbtity, and R. Zand, “Pim-llm: A high-throughput hybrid pim architecture for 1-bit llms,” arXiv preprint arXiv:2504.01994 , 2025

work page arXiv 2025

[36] [36]

Make llm inference affordable to everyone: Augmenting gpu memory with ndp-dimm,

L. Liu, S. Zhao, B. Li, H. Ren, Z. Xu, M. Wang, X. Li, Y . Han, and Y . Wang, “Make llm inference affordable to everyone: Augmenting gpu memory with ndp-dimm,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA) . IEEE, 2025, pp. 1751–1765

work page 2025

[37] [37]

H2o: Heavy-hitter oracle for efficient generative inference of large language models,

Z. Zhang, Y . Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y . Tian, C. R´e, C. Barrett et al., “H2o: Heavy-hitter oracle for efficient generative inference of large language models,” Advances in Neural Information Processing Systems , vol. 36, pp. 34 661–34 710, 2023

work page 2023

[38] [38]

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time,

Z. Liu, A. Desai, F. Liao, W. Wang, V . Xie, Z. Xu, A. Kyrillidis, and A. Shrivastava, “Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time,” Advances in Neural Information Processing Systems , vol. 36, pp. 52 342–52 364, 2023

work page 2023

[39] [39]

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

S. Ge, Y . Zhang, L. Liu, M. Zhang, J. Han, and J. Gao, “Model tells you what to discard: Adaptive kv cache compression for llms,” arXiv preprint arXiv:2310.01801, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [40]

Clusterkv: Manipulating LLM KV cache in semantic space for recallable compression.arXiv preprint arXiv:2412.03213, 2024

G. Liu, C. Li, J. Zhao, C. Zhang, and M. Guo, “Clusterkv: Manipulating llm kv cache in semantic space for recallable compression,” arXiv preprint arXiv:2412.03213, 2024

work page arXiv 2024

[41] [41]

Squeezed attention: Accelerat- ing long context length llm inference,

C. Hooper, S. Kim, H. Mohammadzadeh, M. Maheswaran, J. Paik, M. W. Mahoney, K. Keutzer, and A. Gholami, “Squeezed attention: Accelerat- ing long context length llm inference,” arXiv preprint arXiv:2411.09688, 2024

work page arXiv 2024