KVBuffer: IO-aware Serving for Linear Attention

Lin Zhong; Longwei Zou

arxiv: 2605.19049 · v1 · pith:54ILOMC2new · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 11:56 UTC · model grok-4.3

keywords: linear attention, KV buffer, serving system, decoding latency, speculative decoding, memory IO optimization, chunkwise computation

The pith

KVBuffer reduces linear attention decoding latency by buffering keys and values to enable chunked and batched state updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Linear attention keeps decoding cost constant with context length but existing systems update a large state recurrently at every token, causing heavy memory traffic. KVBuffer stores recent keys and values so that state updates can be deferred and performed in chunks or batches instead of per token. This preserves exact numerical results while cutting average memory access. The same buffer supports parallel verification of multiple draft tokens in speculative decoding and direct attention computation for short contexts without ever building the state.
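The deferral described above can be sketched in a few lines. This is a minimal illustration, not the paper's kernel: it assumes plain (ungated, unnormalized) linear attention with state `S = sum of outer(k, v)`, and the function names `recurrent_decode` and `buffered_decode` and the buffer size `m` are hypothetical.

```python
import numpy as np

def recurrent_decode(S, qs, ks, vs):
    """Per-token recurrence: the full state is updated at every step."""
    outs = []
    for q, k, v in zip(qs, ks, vs):
        S = S + np.outer(k, v)              # state update on every token
        outs.append(q @ S)
    return S, np.stack(outs)

def buffered_decode(S, qs, ks, vs, m):
    """Deferred updates: buffer up to m recent (k, v) pairs, fold them into S in one batch."""
    d_k, d_v = S.shape
    buf_k, buf_v = np.empty((0, d_k)), np.empty((0, d_v))
    outs = []
    for q, k, v in zip(qs, ks, vs):
        buf_k = np.vstack([buf_k, k])
        buf_v = np.vstack([buf_v, v])
        # Output = contribution of the stale state + contribution of buffered tokens.
        outs.append(q @ S + (q @ buf_k.T) @ buf_v)
        if len(buf_k) == m:                 # buffer full: one batched state update
            S = S + buf_k.T @ buf_v
            buf_k, buf_v = np.empty((0, d_k)), np.empty((0, d_v))
    # Flush leftovers only so the final states can be compared.
    return S + buf_k.T @ buf_v, np.stack(outs)

rng = np.random.default_rng(0)
T, d_k, d_v = 10, 8, 8
qs, ks, vs = (rng.normal(size=(T, d)) for d in (d_k, d_k, d_v))
S0 = np.zeros((d_k, d_v))
_, out_rec = recurrent_decode(S0, qs, ks, vs)
_, out_buf = buffered_decode(S0, qs, ks, vs, m=4)
print("max abs difference:", np.abs(out_rec - out_buf).max())
```

The two paths agree algebraically; in floating point they can differ by rounding because the additions happen in a different order.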

Core claim

By buffering recent keys and values, KVBuffer permits linear attention outputs to be computed through chunkwise or batched state updates rather than recurrent per-token updates, which lowers memory access volume and improves serving throughput while preserving exact numerical equivalence.

What carries the argument

KVBuffer, a buffer holding recent keys and values that defers and batches linear attention state updates to reduce IO costs.

If this is right

Decoding latency falls because state updates occur less often and in larger batches that exploit better memory locality.
Speculative decoding verifies draft tokens in parallel (four in the reported evaluation) without allocating a temporary state for each.
Short-context requests skip state creation and update entirely by reading directly from the key-value buffer.
Maximum concurrent requests rise as each request requires less memory bandwidth per generated token.
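The speculative-decoding consequence can be illustrated with a small sketch. It assumes plain linear attention without gating and a linear chain of drafts whose accepted tokens form a prefix; the function names are hypothetical, and folding accepted KVs into the state immediately is a simplification made here.

```python
import numpy as np

def verify_with_temp_states(S, q, k, v):
    """Recurrent verification: one temporary state is kept per draft token."""
    temp_states, outs = [], []
    for i in range(len(q)):
        S = S + np.outer(k[i], v[i])
        temp_states.append(S)               # held until acceptance is known
        outs.append(q[i] @ S)
    return np.stack(outs), temp_states

def verify_with_buffer(S, q, k, v):
    """Buffered verification: all drafts in one pass, no temporary states."""
    scores = np.tril(q @ k.T)               # causal mask: draft i sees drafts 0..i
    return q @ S + scores @ v

def commit_accepted(S, k, v, n_accepted):
    """Fold only the accepted prefix of buffered KVs into the state."""
    return S + k[:n_accepted].T @ v[:n_accepted]

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                       # four draft tokens
S = rng.normal(size=(d_k, d_v))
q, k, v = (rng.normal(size=(n, d)) for d in (d_k, d_k, d_v))
out_tmp, temp_states = verify_with_temp_states(S, q, k, v)
out_buf = verify_with_buffer(S, q, k, v)
print("outputs match:", np.allclose(out_tmp, out_buf))
print("state after accepting 2 drafts matches:",
      np.allclose(commit_accepted(S, k, v, 2), temp_states[1]))
```

The buffered path stores `n` key/value pairs instead of `n` full states, which is where the memory saving in this sketch comes from.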

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same buffering pattern could be applied to other recurrent state mechanisms that suffer from per-step IO.
Hardware-aware chunk sizing could further reduce latency on specific GPUs or TPUs.
Models with even longer contexts would see the largest relative gains because the fixed-size state becomes a smaller fraction of total traffic.

Load-bearing premise

That reordering state updates into chunks and batches produces identical numerical results to per-token recurrent updates while only changing memory access patterns.

What would settle it

Run identical token sequences through both the original recurrent linear attention and the KVBuffer chunked version, then verify that output logits or hidden states match to machine precision.
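A check of this kind might look like the sketch below. It assumes a scalar-decay recurrence `S_t = a_t * S_{t-1} + outer(k_t, v_t)`, which may differ from the recurrence actually used in Qwen3-Next; the function names, shapes, and chunk size are hypothetical choices for illustration.

```python
import numpy as np

def recurrent(q, k, v, a, dtype):
    """Reference path: gated per-token state update."""
    q, k, v, a = (x.astype(dtype) for x in (q, k, v, a))
    S = np.zeros((k.shape[1], v.shape[1]), dtype=dtype)
    outs = []
    for t in range(len(q)):
        S = a[t] * S + np.outer(k[t], v[t])
        outs.append(q[t] @ S)
    return np.stack(outs)

def chunked(q, k, v, a, m, dtype):
    """Chunked path: the state is only updated once every m tokens."""
    q, k, v, a = (x.astype(dtype) for x in (q, k, v, a))
    S = np.zeros((k.shape[1], v.shape[1]), dtype=dtype)
    outs = []
    for start in range(0, len(q), m):
        sl = slice(start, start + m)
        qc, kc, vc = q[sl], k[sl], v[sl]
        c = np.cumprod(a[sl])                       # c[j] = product of decays up to j
        decay = np.tril(c[:, None] / c[None, :])    # decay[j, i] = prod of a over (i, j]
        outs.append(c[:, None] * (qc @ S) + ((qc @ kc.T) * decay) @ vc)
        S = c[-1] * S + (kc * (c[-1] / c)[:, None]).T @ vc
    return np.concatenate(outs)

rng = np.random.default_rng(0)
T, d = 64, 16
q, k, v = (rng.normal(size=(T, d)) / np.sqrt(d) for _ in range(3))
a = rng.uniform(0.9, 1.0, size=T)                   # per-token scalar decay
for dtype in (np.float64, np.float32):
    diff = np.abs(recurrent(q, k, v, a, dtype) - chunked(q, k, v, a, 8, dtype)).max()
    print(np.dtype(dtype).name, "max abs difference:", diff)
```

Row `j` of each chunk corresponds to a decode step with `j + 1` buffered tokens. A difference on the order of rounding error is what one would expect here; bitwise equality is not guaranteed, because the two paths accumulate in a different order.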

Figures

Figures reproduced from arXiv: 2605.19049 by Lin Zhong, Longwei Zou.

**Figure 1.** KVBuffer design. We partition the memory for KV buffers into blocks, each of which can store 6 KVs. Each request has two blocks. During decoding, the serving system loads the state along with the buffered KVs to compute the attention output. When the buffer is full, the state is updated with all buffered KVs on the GPU, and the updated state is written back to the state slot.
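The layout in Figure 1 can be mimicked with a small data structure. This is a sketch under assumptions: the class `KVBufferPool` and its methods are hypothetical, the two blocks of a request are treated as one 12-entry buffer, the attention is plain ungated linear attention, and NumPy arrays stand in for GPU memory.

```python
import numpy as np

BLOCK_SIZE = 6            # KVs per block, as in Figure 1
BLOCKS_PER_REQUEST = 2    # blocks per request, as in Figure 1

class KVBufferPool:
    def __init__(self, num_blocks, num_slots, d_k, d_v):
        self.k = np.zeros((num_blocks, BLOCK_SIZE, d_k))   # key blocks
        self.v = np.zeros((num_blocks, BLOCK_SIZE, d_v))   # value blocks
        self.state = np.zeros((num_slots, d_k, d_v))       # state slots
        self.free_blocks = list(range(num_blocks))
        self.free_slots = list(range(num_slots))
        self.requests = {}

    def admit(self, rid):
        blocks = [self.free_blocks.pop() for _ in range(BLOCKS_PER_REQUEST)]
        self.requests[rid] = {"blocks": blocks, "slot": self.free_slots.pop(), "count": 0}

    def decode(self, rid, q, k, v):
        r = self.requests[rid]
        blk, off = divmod(r["count"], BLOCK_SIZE)
        self.k[r["blocks"][blk], off] = k                  # append the new KV
        self.v[r["blocks"][blk], off] = v
        r["count"] += 1
        ks = self.k[r["blocks"]].reshape(-1, k.shape[0])[: r["count"]]
        vs = self.v[r["blocks"]].reshape(-1, v.shape[0])[: r["count"]]
        out = q @ self.state[r["slot"]] + (q @ ks.T) @ vs  # state + buffered KVs
        if r["count"] == BLOCK_SIZE * BLOCKS_PER_REQUEST:  # buffer full
            self.state[r["slot"]] += ks.T @ vs             # write back to the state slot
            r["count"] = 0
        return out

pool = KVBufferPool(num_blocks=8, num_slots=4, d_k=8, d_v=8)
pool.admit("req-0")
rng = np.random.default_rng(0)
out = pool.decode("req-0", *rng.normal(size=(3, 8)))
print(out.shape)
```

Whether the real system uses the two blocks as a single buffer or alternates between them is not stated in the caption, so that detail should be read as a guess.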

**Figure 2.** Speculative decoding with KVBuffer. Existing serving systems only have a state pool and use the recurrent form for speculative decoding verification. Therefore, they have to store a temporary state for each draft token. After determining that the accepted draft tokens are 0 and 2, the state of this request is replaced by temporary state 2. In contrast, KVBuffer buffers the KV for each draft token and u…

**Figure 3.** Kernel latency of chunkwise decoding with KVBuffer. Latency is normalized by the corresponding recurrent decoding latency, i.e., the case with buffer size m = 0. Chunkwise decoding latency is averaged over a full KVBuffer cycle, including decoding with buffer occupancies from 0 to m − 1 and the state update latency.

[Plot residue from an adjacent figure: x-axis "Number of Draft Tokens", y-axis "Latency (ms)".]

**Figure 5.** End-to-end serving throughput with speculative decoding. By avoiding the storage of temporary states for draft tokens, KVBuffer sustains higher request rates and improves overall throughput.

[Plot residue from an adjacent figure: x-axis "Context Length", y-axis "Normalized Latency"; series "Recurrent Decoding", "Chunkwise Decoding", "KV-Only Decoding".]

Original abstract

Linear attention has recently gained significant attention for long-context inference due to its constant decoding cost with respect to context length. However, existing serving systems typically serve linear attention by recurrently computing and updating a large linear attention state in every decoding step. Since the state is much larger than the per-token key and value, recurrent decoding incurs substantial memory access and becomes inefficient for serving linear attention. In this paper, we propose KVBuffer, an IO-aware serving mechanism for linear attention. By buffering recent keys and values, KVBuffer enables serving systems to compute linear attention outputs in more flexible and memory-efficient ways. For decoding, KVBuffer enables chunkwise computation, which reduces average memory access and decoding latency by deferring state updates and applying them in batch. For speculative decoding, KVBuffer verifies draft tokens in parallel and avoids storing temporary states. For short contexts, KVBuffer computes attention outputs directly from buffered keys and values, without creating or updating the linear attention state. We implement KVBuffer in SGLang for Qwen3-Next. Our evaluations show that KVBuffer can reduce linear attention decoding latency by up to 45.17% and increase the maximum number of serving requests by 5x for speculative decoding when verifying four draft tokens.
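The memory-access argument in the abstract can be made concrete with a back-of-the-envelope traffic model. Everything here is an assumption for illustration: the head dimensions `d_k = d_v = 128`, the counting of elements rather than bytes, and the simplification that the state is read on every step but written back only once per buffer cycle.

```python
def recurrent_traffic(d_k, d_v):
    """Elements moved per decode step when the state is read and rewritten every token."""
    return 2 * d_k * d_v

def chunkwise_traffic(d_k, d_v, m):
    """Average elements moved per decode step over one buffer cycle of m tokens."""
    state, kv = d_k * d_v, d_k + d_v
    state_reads = m * state             # every step still reads the state
    state_writes = state                # but it is written back once per cycle
    kv_reads = kv * sum(range(m))       # step j re-reads the j KVs buffered so far
    kv_writes = m * kv                  # each step appends one KV
    return (state_reads + state_writes + kv_reads + kv_writes) / m

for m in (1, 2, 4, 8, 16):
    ratio = chunkwise_traffic(128, 128, m) / recurrent_traffic(128, 128)
    print(f"m={m:2d}  relative traffic={ratio:.3f}")
```

In this simplified model the state reads are unchanged, so the relative traffic cannot fall below 0.5, and large buffers eventually pay more for re-reading buffered KVs. Real kernels will differ, so the numbers are only indicative.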

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

KVBuffer buffers keys and values to enable chunked or batched linear attention updates during serving, which the implementation shows can cut latency and raise throughput, but the numbers rest on thin experimental reporting and an unverified assumption of numerical equivalence.

The main point is that KVBuffer keeps recent keys and values in a buffer so serving can avoid updating the full linear attention state on every token. Instead it defers updates and applies them in chunks or batches, or skips the state for short contexts, or verifies drafts in parallel without extra state storage. They built this into SGLang for Qwen3-Next and report up to 45% lower decoding latency plus 5x more concurrent requests under speculative decoding with four drafts. That combination of buffering plus the three operating modes is the concrete new piece relative to prior serving work on linear attention.

The practical payoff is clear: less memory traffic during recurrent decoding, which matters when the state is bigger than the per-token KV. The implementation itself looks like solid engineering that directly targets IO cost.

The soft spots sit in the evidence and the correctness argument. The abstract gives headline numbers but says nothing about baselines, hardware, batch sizes, or how latency was measured, so the size of the real gain is difficult to judge from what is shown. On the stress-test concern, the paper treats chunked state updates as producing identical results to per-token recurrence. For many linear attention recurrences this is algebraically true, yet the manuscript does not include a short equivalence derivation, associativity check, or side-by-side numerical comparison on the Qwen3-Next variant. If any normalization or scaling in that model breaks strict associativity, or if custom kernels change accumulation order, the outputs would diverge even if the serving code runs faster. A small verification section would remove that doubt.

This work is for people who run or extend inference servers for long-context linear attention models. A practitioner tuning SGLang or similar frameworks would get usable ideas from the buffering approach and the three modes. It is worth sending to peer review so the experimental details and the numerical-equivalence claim can be examined in full.

Referee Report

2 major / 1 minor

Summary. The paper proposes KVBuffer, an IO-aware serving system for linear attention models. It buffers recent keys and values to support chunkwise state updates during decoding (deferring and batching updates to reduce memory traffic), parallel draft verification in speculative decoding without temporary states, and direct KV-based attention for short contexts without maintaining the linear state. The method is implemented in SGLang for Qwen3-Next; the abstract reports up to 45.17% lower decoding latency and up to 5x higher maximum serving requests under speculative decoding with four draft tokens.

Significance. If the numerical equivalence of chunkwise updates holds and the performance numbers are reproducible, the work addresses a practical bottleneck in serving linear-attention models for long contexts by improving memory access patterns without changing the underlying algorithm. The concrete implementation and reported speedups constitute a strength; however, the absence of detailed experimental methodology limits the strength of the empirical claims.

major comments (2)

[Abstract and Evaluation] Abstract and Evaluation section: the reported 45.17% latency reduction and 5x request increase are presented without any description of hardware platform, baseline serving system, measurement methodology (e.g., median vs. mean, warm-up steps), or variance across runs. These omissions make it impossible to assess whether the gains are load-bearing or reproducible.
[Method (chunkwise decoding)] Method description of chunkwise computation: the paper states that KVBuffer “defers state updates and applies them in batch” while preserving correctness, yet provides neither an associativity argument for the specific linear-attention recurrence used in Qwen3-Next nor any numerical verification (e.g., maximum logit difference or FP-error bounds) that the batched result equals the per-token recurrent result to machine precision. This equivalence is load-bearing for the claim that the optimization is exact rather than approximate.

minor comments (1)

[Method] Notation for the linear-attention state update is introduced without an explicit equation reference; adding a numbered equation for S_t = f(S_{t-1}, k_t, v_t) would clarify the subsequent chunkwise reformulation.
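As an illustration of what such a numbered equation and its chunkwise reformulation could look like, the block below assumes a scalar-gated recurrence with decay \alpha_t and a state S_t of shape d_k × d_v. The symbols \alpha_t and S_0 (the state last written back) are assumptions made here, and the recurrence used by Qwen3-Next may differ.

```latex
% Assumed per-token recurrence and readout
S_t = \alpha_t S_{t-1} + k_t v_t^{\top}, \qquad o_t = S_t^{\top} q_t .

% Unrolling j steps from the last written-back state S_0
S_j = \Big(\prod_{l=1}^{j} \alpha_l\Big) S_0
      + \sum_{i=1}^{j} \Big(\prod_{l=i+1}^{j} \alpha_l\Big) k_i v_i^{\top} .

% Output computed from the stale state and the j buffered keys and values
o_j = \Big(\prod_{l=1}^{j} \alpha_l\Big) S_0^{\top} q_j
      + \sum_{i=1}^{j} \Big(\prod_{l=i+1}^{j} \alpha_l\Big) \big(k_i^{\top} q_j\big)\, v_i .
```

Under this assumed form, the output at step j needs only S_0 and the j buffered key-value pairs, so the state write can be postponed until the buffer is full. The identity is exact in real arithmetic; in floating point the two evaluation orders can differ by rounding.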

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have revised the paper to address the concerns about experimental reproducibility and the formal justification for chunkwise updates. Our point-by-point responses follow.

Referee: [Abstract and Evaluation] Abstract and Evaluation section: the reported 45.17% latency reduction and 5x request increase are presented without any description of hardware platform, baseline serving system, measurement methodology (e.g., median vs. mean, warm-up steps), or variance across runs. These omissions make it impossible to assess whether the gains are load-bearing or reproducible.

Authors: We agree that the original manuscript provided insufficient detail on the experimental setup, which limits assessment of reproducibility. In the revised version, we have added a new subsection titled 'Experimental Setup' in the Evaluation section. This subsection now specifies the hardware platform, the exact baseline serving system (vanilla SGLang without KVBuffer), the measurement methodology including use of median latency after warm-up steps, and reporting of variance across multiple runs. These additions directly address the referee's concerns and strengthen the empirical claims.

Revision: yes

Referee: [Method (chunkwise decoding)] Method description of chunkwise computation: the paper states that KVBuffer “defers state updates and applies them in batch” while preserving correctness, yet provides neither an associativity argument for the specific linear-attention recurrence used in Qwen3-Next nor any numerical verification (e.g., maximum logit difference or FP-error bounds) that the batched result equals the per-token recurrent result to machine precision. This equivalence is load-bearing for the claim that the optimization is exact rather than approximate.

Authors: We acknowledge that the original submission did not include an explicit associativity argument or numerical verification for the chunkwise updates, which is a valid concern given that equivalence is central to claiming an exact optimization. The linear attention recurrence in Qwen3-Next admits an associative formulation under the standard state-update rules (matrix scaling and outer-product accumulation). In the revised manuscript, we have added a dedicated paragraph in Section 3.2 with the associativity proof tailored to this recurrence, plus an appendix subsection reporting numerical verification (maximum logit difference below machine epsilon across long sequences). This confirms the batched result matches the recurrent result to floating-point precision.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on direct empirical measurements of an implemented system

The paper's central claims concern measured latency reductions (up to 45.17%) and throughput gains (5x) from an IO-aware serving mechanism implemented in SGLang for Qwen3-Next. These results are obtained by running the system and recording wall-clock times and request capacities rather than by any mathematical derivation, fitted parameter, or self-referential equation. The description of chunkwise state updates is presented as an engineering optimization whose correctness is implicitly verified by the reported end-to-end numbers; no algebraic identity is asserted as a first-principles result that later reduces to the same identity. Consequently the derivation chain contains no self-definitional, fitted-input, or self-citation load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that linear attention state updates commute with chunked computation when performed in batch, plus standard hardware assumptions about memory access costs. No free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Linear attention state updates can be deferred and batched without changing the final numerical output. This is implicit in the description of chunkwise computation and deferred state updates.

