PiKV: KV Cache Management System for Mixture of Experts

Ben Lengerich; Dong Liu; Yanxuan Yu; Ying Nian Wu

arXiv: 2508.06526 · v3 · pith:EJJXK7XLnew · submitted 2025-08-02 · 💻 cs.DC · cs.AI · cs.AR

Pith reviewed 2026-05-21 23:56 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.AR

keywords KV cache management · Mixture of Experts · distributed inference · memory optimization · large language models · parallel serving · cache compression

The pith

PiKV partitions KV caches across GPUs with expert-sharded storage and adaptive routing to lower memory and communication costs during MoE inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large-scale MoE models sparsify computation but still store dense KV caches that must stay synchronized across GPUs, creating a growing memory and communication bottleneck as context lengths increase. The paper presents PiKV as a framework that shards KV storage according to expert assignments, routes tokens to limit unnecessary cache accesses, schedules retention of only query-relevant entries, and adds compression modules inside the pipeline. A sympathetic reader cares because these changes would let models run longer contexts or larger scales on the same multi-GPU clusters without proportional rises in hardware or interconnect demand. The work is released as an open-source library positioned as an evolving system for comprehensive MoE KV management.
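To make the scaling concern concrete, the following back-of-the-envelope sketch estimates the size of a dense KV cache and the per-GPU share under an idealized even split. The function name `kv_cache_bytes` and every model dimension are illustrative assumptions, not values taken from the paper, and the even split ignores replication, padding, and metadata that a real system would carry.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer and token."""
    # The factor 2 covers the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical dimensions (fp16 storage); NOT taken from the paper.
dense_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32,
                             head_dim=128, seq_len=128 * 1024)

# Idealized assumption: the cache splits perfectly evenly over 8 GPUs.
num_gpus = 8
per_gpu_bytes = dense_bytes // num_gpus

print(f"dense: {dense_bytes / 2**30:.1f} GiB")
print(f"per GPU if evenly sharded: {per_gpu_bytes / 2**30:.1f} GiB")
```

Because the size is linear in `seq_len`, doubling the context doubles the dense footprint, which is the growth the paragraph above describes.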

Core claim

The paper claims that expert-sharded KV storage partitions caches across GPUs to match MoE expert distribution, PiKV routing reduces token-to-KV access, PiKV Scheduling adaptively retains query-relevant entries, and PiKV Compression modules shrink memory usage, together cutting the memory and communication overhead that limits multi-GPU and multi-node inference for MoE architectures.
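The idea of expert-sharded storage with routing-limited access can be illustrated with a toy, single-process sketch. This page does not describe PiKV's actual API, so the class name `ExpertShardedKVCache`, the round-robin placement rule, and the method names are all hypothetical; shards stand in for GPUs and no real communication happens.

```python
from collections import defaultdict


class ExpertShardedKVCache:
    """Toy cache: each expert's KV entries live on exactly one shard."""

    def __init__(self, num_experts, num_shards):
        # Hypothetical placement rule: experts assigned round-robin to shards.
        self.shard_of = {e: e % num_shards for e in range(num_experts)}
        # shards[shard_id][expert_id] -> list of (token_pos, key, value)
        self.shards = [defaultdict(list) for _ in range(num_shards)]

    def write(self, token_pos, expert_id, key, value):
        shard = self.shard_of[expert_id]
        self.shards[shard][expert_id].append((token_pos, key, value))

    def read(self, expert_ids):
        """Return entries for the routed experts and the shards touched."""
        entries, touched = [], set()
        for expert_id in expert_ids:
            shard = self.shard_of[expert_id]
            touched.add(shard)
            entries.extend(self.shards[shard][expert_id])
        return entries, touched


cache = ExpertShardedKVCache(num_experts=8, num_shards=4)
for pos in range(16):
    cache.write(pos, expert_id=pos % 8, key=f"k{pos}", value=f"v{pos}")

# A token routed to experts 1 and 5 only consults the shard holding them.
entries, touched = cache.read([1, 5])
print(sorted(touched), [pos for pos, _, _ in entries])
```

In this sketch a lookup touches only the shards that own the routed experts, rather than a globally synchronized cache; whether that preserves accuracy in a real model is exactly what the review asks the paper to demonstrate.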

What carries the argument

Expert-sharded KV storage paired with PiKV routing, scheduling, and compression modules that distribute and selectively retain cache entries across GPUs.

If this is right

KV caches are partitioned across GPUs according to expert assignments rather than kept fully dense and synchronized.
Token-to-KV access volume drops through specialized PiKV routing that avoids unnecessary lookups.
Only query-relevant cache entries are retained by the adaptive PiKV Scheduling step.
Memory footprint shrinks further once PiKV Compression modules are integrated into the caching pipeline.
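The retention and compression steps in the list above can be sketched in a few lines. The page does not say which scoring rule or compression scheme PiKV uses, so the dot-product relevance score, the fixed `budget`, and the symmetric int8 quantization below are stand-in assumptions chosen only because they are simple; all function names are hypothetical.

```python
def retain_query_relevant(entries, query, budget):
    """Keep the `budget` entries whose keys score highest against the query."""
    def score(entry):
        _, key, _ = entry
        return sum(q * k for q, k in zip(query, key))

    kept = sorted(entries, key=score, reverse=True)[:budget]
    return sorted(kept, key=lambda e: e[0])  # restore positional order


def quantize_int8(vector):
    """Symmetric per-vector int8 quantization: returns (scale, codes)."""
    max_abs = max(abs(x) for x in vector) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    return scale, [round(x / scale) for x in vector]


def dequantize_int8(scale, codes):
    return [c * scale for c in codes]


# Entries are (token_pos, key, value) with made-up numbers.
entries = [
    (0, [1.0, 0.0], [0.5, -0.25, 0.1]),
    (1, [0.0, 1.0], [0.2, 0.3, -0.4]),
    (2, [0.9, 0.1], [-0.6, 0.05, 0.7]),
]
kept = retain_query_relevant(entries, query=[1.0, 0.0], budget=2)
scale, codes = quantize_int8(kept[0][2])
restored = dequantize_int8(scale, codes)
print([pos for pos, _, _ in kept], codes)
```

Both steps are lossy: dropped entries can no longer be attended to, and quantized values are only approximately recovered, which is why the accuracy premise discussed below matters.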

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sharding and selective-retention logic could apply to other sparse attention patterns if expert-like grouping can be identified in KV access.
Community extensions of the open-source library might add hardware-specific optimizations for particular interconnect topologies.
If the memory reductions hold at larger scales, fixed GPU clusters could support substantially longer context windows than current dense-cache limits allow.

Load-bearing premise

Expert-sharded storage combined with the proposed routing and scheduling will preserve model accuracy and acceptable latency while cutting memory and communication costs.

What would settle it

A side-by-side measurement of accuracy, end-to-end latency, peak memory usage, and inter-GPU communication volume on a standard MoE model using baseline dense KV caching versus the complete PiKV pipeline.
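One way to organize such a comparison is to record the four quantities per run and report each as a change relative to the baseline. The sketch below is only a bookkeeping aid: the `RunMetrics` fields, the units, and the numbers are placeholders invented for illustration and are not measurements from the paper or from any system.

```python
from dataclasses import dataclass, fields


@dataclass
class RunMetrics:
    accuracy: float          # task score in [0, 1]
    latency_ms: float        # end-to-end latency
    peak_memory_gib: float   # peak KV cache memory
    comm_volume_gib: float   # inter-GPU traffic


def relative_change(baseline, candidate):
    """Fractional change of each metric relative to the baseline run."""
    return {
        f.name: (getattr(candidate, f.name) - getattr(baseline, f.name))
        / getattr(baseline, f.name)
        for f in fields(baseline)
    }


# Placeholder values, NOT real results.
baseline = RunMetrics(accuracy=0.80, latency_ms=100.0,
                      peak_memory_gib=64.0, comm_volume_gib=10.0)
candidate = RunMetrics(accuracy=0.80, latency_ms=90.0,
                       peak_memory_gib=32.0, comm_volume_gib=5.0)

for name, delta in relative_change(baseline, candidate).items():
    print(f"{name}: {delta:+.1%}")
```

Reporting accuracy alongside the three cost metrics keeps a memory saving from being read in isolation from any quality loss it may cause.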

Figures

Figures reproduced from arXiv: 2508.06526 by Ben Lengerich, Dong Liu, Yanxuan Yu, Ying Nian Wu.

**Figure 1.** PiKV Framework. … there is huge demand to deploy sparsely-gated Mixture-of-Experts (MoE) structures [7, 12] to reduce computation costs at scale. However, serving such models introduces significant system-level challenges. During inference, each token generation requires attending to the entire KV cache from prior tokens. For a 7B-scale MoE model with 128K context and 16 experts, the full KV cache can occupy …

**Figure 2.** KV cache memory usage comparison. Left: absolute …

**Figure 3.** Latency performance comparison. Left: latency at …

**Figure 5.** End-to-end performance comparison across dif…

**Figure 6.** Long-context performance analysis. PiKV main…

**Figure 7.** Compression-accuracy trade-off analysis. PiKV’s …

**Figure 8.** Single Ablation Study of PiKV with 3 Key Compo…

**Figure 9.** Scalability Analysis of PiKV with Sequence Length …

**Figure 10.** Memory analysis visualization. Top-left: Radar …

**Figure 11.** GPU utilization analysis. Top-left: 3D surface plot …

**Figure 13.** Execution time analysis. Top-left: Donut chart …

**Figure 15.** Communication analysis. Top-left: Chord diagram …

Original abstract

As large-scale language models continue to scale up in both size and context length, the memory and communication cost of key-value (KV) cache storage has become a major bottleneck in multi-GPU and multi-node inference. While MoE-based architectures sparsify computation across experts, the corresponding KV caches remain dense and globally synchronized, resulting in significant overhead. We introduce PiKV, a parallel and distributed KV cache serving framework tailored for MoE architecture. PiKV leverages expert-sharded KV storage to partition caches across GPUs, PiKV routing to reduce token-to-KV access, and a PiKV Scheduling to adaptively retain query-relevant entries. To further reduce memory usage, PiKV integrates PiKV Compression modules the caching pipeline for acceleration. PiKV is recently publicly available as an open-source software library: https://github.com/NoakLiu/PiKV. PiKV is still a living project, aiming to become a comprehesive KV Cache management system for MoE Architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PiKV sketches a sharded KV cache design for MoE inference but supplies no measurements or comparisons to show the components deliver on cost or accuracy.

Colleague, the main point is that this paper introduces PiKV as a distributed KV cache system built around expert-sharded storage, a custom routing step to cut token-to-cache lookups, adaptive scheduling that keeps query-relevant entries, and compression modules. The abstract frames this as a response to the fact that MoE models keep dense KV caches even when computation is sparse, and the authors have put the code on GitHub as a living project. That open release is the most concrete step so far and gives others a chance to inspect or extend the implementation. The design choices line up with known pain points in multi-GPU MoE serving, where memory and cross-device communication grow with context length. The paper does a reasonable job naming the four pieces and explaining at a high level how sharding, routing, scheduling, and compression are meant to fit together in the pipeline.

Beyond that, the write-up stays descriptive. There are no benchmarks, no accuracy or latency numbers, no ablations, and no direct comparisons to prior KV cache work such as paging, quantization, or eviction policies already used in production inference engines. Without those data it is hard to judge whether the proposed combination actually preserves model quality while cutting the claimed overheads under realistic long-context MoE workloads. The assumptions about routing and scheduling maintaining relevance without side effects remain untested in the text.

This paper is mainly for systems researchers and inference engineers who are already building or tuning MoE serving stacks and want to see one concrete architecture proposal plus runnable code. A reader looking for validated techniques or new theoretical bounds will find little to take away yet.

I would send it to peer review so the authors can add the missing experiments and clarify how the pieces relate to existing literature; the topic is timely enough that referees could usefully push for those additions rather than desk-rejecting outright.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes PiKV, a parallel and distributed KV cache serving framework for Mixture of Experts (MoE) architectures. It describes expert-sharded KV storage to partition caches across GPUs, PiKV routing to reduce token-to-KV access, PiKV Scheduling to adaptively retain query-relevant entries, and PiKV Compression modules integrated into the caching pipeline. The system is released as an open-source library aimed at addressing memory and communication bottlenecks in multi-GPU, long-context MoE inference.

Significance. If the proposed components prove effective, PiKV could provide a practical approach to reducing KV cache overheads in scaled MoE serving without sacrificing accuracy or latency. The open-source release is a strength that enables reproducibility and further development. At present, however, the lack of supporting measurements leaves the practical significance unestablished.

major comments (1)

Abstract: The central claims that expert-sharded KV storage, PiKV routing, PiKV Scheduling, and PiKV Compression together reduce memory/communication costs while maintaining acceptable accuracy and latency are presented without any experiments, benchmarks, ablations, or baseline comparisons. This is load-bearing for the contribution, as the manuscript supplies only high-level component descriptions and no data to confirm the design satisfies its necessary conditions under realistic MoE routing and long-context workloads.

minor comments (2)

Abstract: The phrase 'integrates PiKV Compression modules the caching pipeline' is grammatically incomplete and should read 'integrates PiKV Compression modules into the caching pipeline'.
Abstract: Typo 'comprehesive' should be corrected to 'comprehensive'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We agree that empirical validation is essential to substantiate the claims about memory and communication reductions in PiKV. The revised manuscript will include comprehensive experiments, benchmarks, ablations, and baseline comparisons to address this gap.

Referee: Abstract: The central claims that expert-sharded KV storage, PiKV routing, PiKV Scheduling, and PiKV Compression together reduce memory/communication costs while maintaining acceptable accuracy and latency are presented without any experiments, benchmarks, ablations, or baseline comparisons. This is load-bearing for the contribution, as the manuscript supplies only high-level component descriptions and no data to confirm the design satisfies its necessary conditions under realistic MoE routing and long-context workloads.

Authors: We agree that the current manuscript version presents the system design at a high level without quantitative results, which limits the ability to evaluate the practical impact. This was an oversight in the initial submission, as the focus was on describing the architecture of expert-sharded KV storage, PiKV routing, adaptive scheduling, and compression modules along with the open-source release. In the revised manuscript, we will add a dedicated experimental evaluation section. This will include benchmarks on multi-GPU MoE inference with long contexts, measuring KV cache memory footprint, inter-GPU communication volume, end-to-end latency, and accuracy retention against standard dense KV cache baselines. Ablation studies will isolate the contribution of each component (sharding, routing, scheduling, compression) under realistic MoE token routing patterns. The GitHub repository will be updated with the evaluation code and datasets for full reproducibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity: architectural proposal without derivations or self-referential reductions

The manuscript presents PiKV as a system architecture for KV cache management in MoE models, describing high-level components including expert-sharded KV storage to partition caches, PiKV routing to reduce token-to-KV access, PiKV Scheduling to retain query-relevant entries, and PiKV Compression modules. No equations, fitted parameters, predictions, or derivation chains appear in the provided text. The central claims concern the framework's design for reducing memory and communication costs; these are not shown to reduce to inputs by construction, nor do they rely on load-bearing self-citations or uniqueness theorems imported from prior author work. As a proposed software library and living project, the paper contains no mathematical or statistical steps that could exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a systems paper the work rests on standard distributed-computing assumptions about GPU memory hierarchies and network costs; no free parameters or new physical entities are introduced.

axioms (1)

Domain assumption: Standard assumptions about memory access patterns and communication latency in multi-GPU clusters hold for MoE inference workloads. Implicit in the design of sharded storage and scheduling.

pith-pipeline@v0.9.0 · 5738 in / 1185 out tokens · 50173 ms · 2026-05-21T23:56:37.593025+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 7 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report.
[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025).
[3] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. 2023. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508 (2023).
[4] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, et al. 2024. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069 (2024).
[5] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. 2024. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision. Springer, 19–35.
[6] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
[7] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning. PMLR, 5547–5569.
[8] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39.
[9] Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, and Pengfei Zuo. 2024. Cost-Efficient large language model serving for multi-turn conversations with CachedAttention. In 2024 USENIX Annual Technical Conference (USENIX ATC 24). 111–126.
[10] Zifan He, Yingqi Cao, Zongyue Qin, Neha Prakriya, Yizhou Sun, and Jason Cong. 2025. HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Luis Ch...
[11] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. LoRA: Low-rank adaptation of large language models. ICLR 1, 2 (2022), 3.
[12] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020).
[13] Dong Liu. 2024. Contemporary model compression on large language models inference. arXiv e-prints (2024), arXiv–2409.
[14] Dong Liu, Jiayi Zhang, Yifan Li, Yanxuan Yu, Ben Lengerich, and Ying Nian Wu. 2025. FastCache: Fast caching for diffusion transformer through learnable linear approximation. arXiv preprint arXiv:2505.20353 (2025).
[15] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16.
[16] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv:1701.06538 [cs.LG] https://arxiv.org/abs/1701.06538
[17] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning. PMLR, 31094–31116.
[18] Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, and Song Han. 2024. Quest: Query-aware sparsity for efficient long-context LLM inference. arXiv preprint arXiv:2406.10774 (2024).
[19] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453 [cs.CL] https://arxiv.org/abs/2309.17453
[20] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. 2023. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems 36 (2023), 34661–34710.
