Towards Distributed Inference of LLMs on a P2P Network

Krishanu Saini; Shabari S Nair

arxiv: 2606.17059 · v1 · pith:LMML7PNDnew · submitted 2026-05-07 · 💻 cs.DC · cs.AI

Towards Distributed Inference of LLMs on a P2P Network

Shabari S Nair , Krishanu Saini This is my paper

Pith reviewed 2026-06-30 23:12 UTC · model grok-4.3

classification 💻 cs.DC cs.AI

keywords decentralized routingprefix cachingLLM inferenceP2P networkKV cacheanti-entropyradix treedistributed serving

0 comments

The pith

Decentralized routing to the node with the longest estimated prefix match reduces LLM inference latency in a P2P network without centralized coordination or KV-cache transfers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes routing each inference request in a peer-to-peer cluster to whichever node holds the longest matching cached prefix, using only local radix trees and rough estimates refreshed through periodic anti-entropy exchanges. Because an incorrect estimate produces only a cache miss rather than an incorrect output, the system tolerates stale metadata and therefore needs no strong consistency or cache movement. Evaluation on simulated MMLU workloads indicates that the approach lowers latency when network delays are low and prefix distributions are skewed. The work matters because it removes the requirement for a central scheduler or full cache replication when many nodes must cooperate on large-model serving.

Core claim

A decentralized, prefix-cache-aware routing scheme for peer-to-peer LLM serving in which each node maintains a local radix tree of its cached prefixes, obtains asynchronously refreshed estimates of peer caches via periodic anti-entropy, and routes requests to the node offering the longest estimated prefix match; stale metadata produces only cache misses, so weak consistency suffices for correctness.

What carries the argument

Decentralized routing to the node with the longest estimated prefix match, supported by local radix trees and periodic anti-entropy cache estimates.

If this is right

No KV-cache transfer between nodes is required.
Weak consistency is sufficient because incorrect routing decisions cause only misses.
Latency gains appear under low communication delay and skewed prefix distributions.
High network latency and affinity-induced hotspots reduce or eliminate the benefit.
The scheme works without any central coordinator.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same longest-estimate routing could be applied to other distributed cache-sharing problems whose correctness tolerates misses.
In dynamic networks where nodes frequently join or leave, the anti-entropy mechanism would need to be augmented with node-discovery updates.
If prefix distributions in production are less skewed than the simulated MMLU workloads, the observed gains would shrink.

Load-bearing premise

Periodic anti-entropy exchanges keep the cache-size estimates accurate enough that the latency savings from correct longest-match routing exceed the overhead of the exchanges themselves.

What would settle it

A measurement, on a real deployment with the same prefix distribution, showing that average end-to-end inference latency under the decentralized longest-estimate router is no lower than under random routing.

Figures

Figures reproduced from arXiv: 2606.17059 by Krishanu Saini, Shabari S Nair.

**Figure 1.** Figure 1: Flow for a new incoming user request arriving at node M1. Each node has a local router that keeps lazily updated information of the prefixes processed by different nodes. The query gets routed to the best node using a cache-aware criterion. The response is processed and returned to the correspondence node. the same underlying model, operating on the same GPU hardware. Each node Pi has a specified amount o… view at source ↗

**Figure 2.** Figure 2: Mean per-request inference latency vs ∆comm comparison under decentralized routing versus a no-routing baseline on general dataset [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Per-node request distribution over time for two MMLU subjects under four configurations. Here, the y-axis indicates the percentage of requests processed by a node, averaged across a window of 300 past requests, moving with a stride of 30. Routing produces sharp specialization under skew (top-left) and the specialization–eviction cycle under uniform load (top-right); no-routing yields the ∼25% uniform basel… view at source ↗

**Figure 4.** Figure 4: Per-node share of requests over time for two MMLU subjects, showing the three-phase specialization–eviction cycle: a competitive phase (Zone 1) with no clear specialist, a specialization phase (Zone 2) where one node captures essentially 100% of subject traffic, and an eviction phase (Zone 3) where LRU pressure collapses the hotspot, and traffic redisperses. Here, the y-axis indicates the percentage of req… view at source ↗

read the original abstract

Prefix caching can reduce LLM inference latency by reusing KV caches across requests with shared prompts, but cluster-scale reuse is challenging because caches are partitioned across nodes. We propose a decentralized, prefix-cache-aware routing scheme for peer-to-peer LLM serving. Each node maintains a local radix tree of its own cached prefixes and asynchronously refreshed estimates of peer caches using periodic anti-entropy. Requests are routed to the node with the longest estimated prefix match, without centralized coordination or KV-cache transfer. Stale metadata only causes cache misses, not incorrect outputs, making weak consistency sufficient for correctness. Evaluation on simulated MMLU workloads show that decentralized routing improves latency under low communication delay and skewed prefix distributions, while high network latency and affinity-induced hotspots limit its benefits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a decentralized prefix-routing scheme for P2P LLM serving that relies on anti-entropy estimates and shows simulated latency gains under narrow conditions, but offers no real-system data.

read the letter

The core idea is routing each request to the peer with the longest estimated prefix match using local radix trees plus periodic anti-entropy updates, so that KV caches stay put and stale metadata only produces misses. That part is cleanly explained and the weak-consistency argument holds up on its own terms.

What the work actually contributes is the specific integration of those pieces for LLM prefix caching in a fully decentralized setting, plus the observation that correctness survives poor estimates. The simulations on MMLU workloads do show latency reductions when delay is low and prefixes are skewed, which is at least a concrete data point.

The main limitation is that all results come from simulation only, with no hardware measurements, no real P2P deployment, and no sensitivity analysis on how quickly anti-entropy must run or how representative the workload is. Gains disappear under higher latency or different prefix distributions, and the paper does not demonstrate that the estimates stay accurate enough in practice to deliver the claimed benefit. Citation details are also thin in the abstract, so it is difficult to gauge how much of the routing logic is incremental versus prior distributed-cache work.

This is aimed at systems researchers thinking about decentralized LLM inference rather than core model people. A reader already working on P2P serving or cache-aware routing could extract the routing logic and the weak-consistency point for their own thinking. It is not ready for broad claims about production systems.

I would send it to peer review. The idea is worth a closer look even if the current evidence is preliminary; referees can push for better validation or clearer novelty positioning.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a decentralized prefix-cache-aware routing scheme for P2P LLM serving. Nodes maintain local radix trees of their own cached prefixes and use periodic anti-entropy to asynchronously refresh estimates of peer caches. Requests are routed to the node offering the longest estimated prefix match, without centralized coordination or KV-cache movement. Stale metadata produces only cache misses (preserving correctness under weak consistency). Simulations on MMLU workloads report latency reductions under low communication delay and skewed prefix distributions, while high latency and affinity hotspots limit gains.

Significance. If the simulation results generalize, the work would demonstrate a scalable mechanism for cross-node KV-cache reuse in distributed LLM inference that requires neither central state nor data transfer. The observation that weak consistency suffices for correctness is a clear strength. The evaluation remains simulation-only with no hardware measurements, prototype, or error-bar analysis, so the practical significance is conditional on the unverified workload and delay assumptions.

major comments (2)

[Evaluation section] Evaluation section (and abstract): the reported latency improvements rest on the assumption that periodic anti-entropy keeps prefix estimates accurate enough to produce net cache-hit gains; no sensitivity analysis, error-rate measurements, or experiments under varying churn or non-MMLU prefix distributions are provided, which directly bears on whether the performance claim holds outside the simulated regimes.
[Abstract and evaluation] Abstract and evaluation: benefits are shown only for the chosen MMLU workloads and low-delay regimes; the manuscript supplies no bounds or additional workloads demonstrating when anti-entropy estimates remain sufficiently fresh to deliver the claimed latency reduction versus alternatives.

minor comments (1)

The abstract could explicitly note that all results are simulation-based and list the key workload and delay assumptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of the evaluation that warrant clarification and potential strengthening. We respond to each major comment below.

read point-by-point responses

Referee: [Evaluation section] Evaluation section (and abstract): the reported latency improvements rest on the assumption that periodic anti-entropy keeps prefix estimates accurate enough to produce net cache-hit gains; no sensitivity analysis, error-rate measurements, or experiments under varying churn or non-MMLU prefix distributions are provided, which directly bears on whether the performance claim holds outside the simulated regimes.

Authors: We agree that the current evaluation would benefit from explicit sensitivity analysis on anti-entropy parameters and prefix-distribution variation. In the revised version we will add (i) measurements of prefix-estimate error rate as a function of anti-entropy interval, (ii) results under controlled node churn, and (iii) additional synthetic workloads whose prefix skew differs from MMLU. These experiments remain within the existing simulation framework and directly test the robustness of the latency gains. revision: yes
Referee: [Abstract and evaluation] Abstract and evaluation: benefits are shown only for the chosen MMLU workloads and low-delay regimes; the manuscript supplies no bounds or additional workloads demonstrating when anti-entropy estimates remain sufficiently fresh to deliver the claimed latency reduction versus alternatives.

Authors: The manuscript already notes that gains diminish under high delay and hotspot affinity; we will expand this discussion with a simple analytic bound relating anti-entropy period, request arrival rate, and expected staleness, together with simulation results for two additional prefix distributions. This will clarify the operating regime in which the decentralized scheme outperforms centralized or non-cache-aware baselines. revision: partial

Circularity Check

0 steps flagged

No significant circularity; proposal and simulations are self-contained

full rationale

The paper proposes a decentralized prefix-cache routing scheme using local radix trees and periodic anti-entropy for peer estimates, routing to longest estimated match. Evaluation uses simulated MMLU workloads under varying delays and prefix skews. No load-bearing equations, predictions, or uniqueness claims reduce by construction to fitted inputs or self-citations; the latency benefits are presented as outcomes of the described mechanisms and simulation parameters, with correctness insulated by the weak-consistency property. This is the common case of an independent systems proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on domain assumptions about network behavior and workload properties rather than new mathematical derivations or invented physical entities.

axioms (2)

domain assumption Stale metadata from periodic anti-entropy only produces cache misses and never incorrect outputs
Invoked to justify that weak consistency suffices for correctness.
domain assumption Simulated MMLU workloads and network delay distributions are representative of real usage
Required for the reported latency improvements to generalize.

pith-pipeline@v0.9.1-grok · 5647 in / 1304 out tokens · 34233 ms · 2026-06-30T23:12:06.726244+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 4 canonical work pages

[1]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Attention is All You Need , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=
[2]

Efficient Memory Management for Large Language Model Serving with

Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion , booktitle=. Efficient Memory Management for Large Language Model Serving with
[3]

2024 , note=

Zheng, Lianmin and Yin, Liangsheng and Xie, Zhiqiang and Sun, Chuyue and Huang, Jeff and Yu, Cody Hao and Cao, Shiyi and Kozyrakis, Christos and Stoica, Ion and Gonzalez, Joseph E and Barrett, Clark and Sheng, Ying , booktitle=. 2024 , note=

2024
[4]

Ye, Lu and Tao, Ze and Huang, Yong and Li, Yang , booktitle=
[5]

arXiv preprint arXiv:2402.05099 , year=

Juravsky, Jordan and Brown, Bradley and Ehrlich, Ryan and Fu, Daniel Y and R. arXiv preprint arXiv:2402.05099 , year=

work page arXiv
[6]

Proceedings of Machine Learning and Systems (MLSys) , year=

Prompt Cache: Modular Attention Reuse for Low-Latency Inference , author=. Proceedings of Machine Learning and Systems (MLSys) , year=
[7]

Gao, Bin and He, Zhuomin and Sharma, Puru and Kang, Qingxuan and Jevdjic, Djordje and Deng, Junbo and Yang, Xingkun and Yu, Zhou and Zuo, Pengfei , journal=
[8]

Pan, Rui and Siriwardhana, Shamane and Patel, Tushar and Ren, Zhen and Kim, Ciby Mathew and Ousterhout, John and Tumanov, Alexey , booktitle=
[9]

Zheng, Zhen and Ji, Xin and Fang, Taosong and Zhou, Fanghao and Liu, Chuanjie and Peng, Gang , journal=
[10]

2025 , note=

Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen , booktitle=. 2025 , note=

2025
[11]

2025 , note=

Hu, Junhao and Huang, Wenrui and Wang, Weidong and Peng, Hao and Feng, Haoyi and Chen, Xusheng and Shan, Yizhou and Xie, Tao , booktitle=. 2025 , note=

2025
[12]

Yang, Jingbo and Hou, Bairu and Wei, Wei and Bao, Yujia and Chang, Shiyu , journal=
[13]

Wang, Qian and others , journal=
[14]

Wang, others , journal=
[15]

2025 , note=

Agarwal, Shubham and others , booktitle=. 2025 , note=

2025
[16]

2025 , note=

Srivatsa, Vikranth and He, Zijian and Abhyankar, Reyna and Li, Dongming and Zhang, Yiying , booktitle=. 2025 , note=

2025
[17]

2025 , note=

Qin, Ruoyu and Li, Zheming and He, Weiran and Cui, Jialei and Ren, Feng and Zhang, Mingxing and Wu, Yongwei and Zheng, Weimin and Xu, Xinran , booktitle=. 2025 , note=

2025
[18]

Hu, Cunchen and Huang, Heyang and Hu, Junhao and Xu, Jiang and Chen, Xusheng and Xie, Tao and Wang, Chenxi and Wang, Sa and Bao, Yungang and Sun, Ninghui and Shan, Yizhou , journal=
[19]

Liu, Yuhan and Yao, Jiayi and Cheng, Yihua and An, Yuwei and Chen, Xiaokun and Feng, Shaoting and Huang, Yuyang and Shen, Samuel and Zhang, Rui and Du, Kuntai and Jiang, Junchen , journal=
[20]

2024 , note=

Sun, Biao and Huang, Ziming and Zhao, Hanyu and Xiao, Wencong and Zhang, Xinyi and Li, Yong and Lin, Wei , booktitle=. 2024 , note=

2024
[21]

2024 , note=

Zhong, Yinmin and Liu, Shengyu and Chen, Junda and Hu, Jianbo and Zhu, Yibo and Liu, Xuanzhe and Jin, Xin and Zhang, Hao , booktitle=. 2024 , note=

2024
[22]

Proceedings of the 51st International Symposium on Computer Architecture (ISCA) , year=

Patel, Pratyush and Choukse, Esha and Zhang, Chaojie and Shah, Aashaka and Goiri,. Proceedings of the 51st International Symposium on Computer Architecture (ISCA) , year=
[23]

2024 , note=

Strati, Foteini and McAllister, Sara and Phanishayee, Amar and Tarnawski, Jakub and Klimovic, Ana , booktitle=. 2024 , note=

2024
[24]

2024 , note=

Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and Maire, Michael and Hoffmann, Henry and Holtzman, Ari and Jiang, Junchen , booktitle=. 2024 , note=

2024
[25]

Chen, Shiyang and others , journal=
[26]

Li, Weiqing and Jiang, Guochao and others , journal=
[27]

Proceedings of the International Conference on Learning Representations (ICLR) , year=

Ring Attention with Blockwise Transformers for Near-Infinite Context , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=
[28]

arXiv preprint arXiv:2311.09431 , year=

Striped Attention: Faster Ring Attention for Causal Transformers , author=. arXiv preprint arXiv:2311.09431 , year=

work page arXiv
[29]

2024 , note=

Wu, Bingyang and Liu, Shengyu and Zhong, Yinmin and Sun, Peng and Liu, Xuanzhe and Jin, Xin , booktitle=. 2024 , note=

2024
[30]

arXiv preprint arXiv:2411.01783 , year=

Context Parallelism for Scalable Million-Token Inference , author=. arXiv preprint arXiv:2411.01783 , year=

work page arXiv
[31]

Fang, Jiarui and Zhao, Shangchun , journal=
[32]

Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length

Agrawal, Amey and Chen, Junda and Goiri,. Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length. arXiv preprint arXiv:2409.17264 , year=

work page arXiv
[33]

Star Attention: Efficient

Acharya, Shantanu and Fernandez, Jared and Ginsburg, Boris , journal=. Star Attention: Efficient. 2025 , note=

2025
[34]

Lin, Bin and Peng, Tao and Zhang, Chen and Sun, Minmin and Li, Lanbo and Zhao, Hanyu and Xiao, Wencong and Xu, Qi and Qiu, Xiafei and Li, Shen and Ji, Zhigang and Li, Yong and Lin, Wei , journal=
[35]

Selective

Chu, Kexin and others , journal=. Selective
[36]

Learned Prefix Caching for Efficient

Anonymous , booktitle=. Learned Prefix Caching for Efficient
[37]

Anonymous , journal=
[38]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Dao, Tri and Fu, Dan and Ermon, Stefano and Rudra, Atri and R. Advances in Neural Information Processing Systems (NeurIPS) , volume=
[39]

Accelerating Self-Attentions for

Ye, Zihao and Chen, Lequn and Lai, Ruihang and Zhao, Yilong and Zheng, Size and Shao, Junru and Hou, Bohan and Jin, Hongyi and Zuo, Yifei and Yin, Liangsheng and Chen, Tianqi and Ceze, Luis , howpublished=. Accelerating Self-Attentions for
[40]

Yu, Gyeong-In and Jeong, Joo Seong and Kim, Geon-Woo and Kim, Soojeong and Chun, Byung-Gon , booktitle=
[41]

ACM Transactions on Storage , year=

Mooncake: A kvcache-centric disaggregated architecture for llm serving , author=. ACM Transactions on Storage , year=
[42]

ACM SIGOPS Operating Systems Review , volume=

Managing update conflicts in Bayou, a weakly connected replicated storage system , author=. ACM SIGOPS Operating Systems Review , volume=. 1995 , publisher=

1995
[43]

Proceedings of the VLDB Endowment , volume=

Coordination avoidance in database systems , author=. Proceedings of the VLDB Endowment , volume=. 2014 , publisher=

2014
[44]

2022 USENIX Annual Technical Conference (USENIX ATC 22) , pages=

Amazon \ DynamoDB \ : A scalable, predictably performant, and fully managed \ NoSQL \ database service , author=. 2022 USENIX Annual Technical Conference (USENIX ATC 22) , pages=

2022
[45]

ACM SIGOPS operating systems review , volume=

Dynamo: Amazon's highly available key-value store , author=. ACM SIGOPS operating systems review , volume=. 2007 , publisher=

2007
[46]

Journal of the ACM (JACM) , volume=

Consensus in the presence of partial synchrony , author=. Journal of the ACM (JACM) , volume=. 1988 , publisher=

1988
[47]

Implementing pushback: Router-based defense against DDoS attacks , author=

[1] [1]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Attention is All You Need , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

[2] [2]

Efficient Memory Management for Large Language Model Serving with

Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion , booktitle=. Efficient Memory Management for Large Language Model Serving with

[3] [3]

2024 , note=

Zheng, Lianmin and Yin, Liangsheng and Xie, Zhiqiang and Sun, Chuyue and Huang, Jeff and Yu, Cody Hao and Cao, Shiyi and Kozyrakis, Christos and Stoica, Ion and Gonzalez, Joseph E and Barrett, Clark and Sheng, Ying , booktitle=. 2024 , note=

2024

[4] [4]

Ye, Lu and Tao, Ze and Huang, Yong and Li, Yang , booktitle=

[5] [5]

arXiv preprint arXiv:2402.05099 , year=

Juravsky, Jordan and Brown, Bradley and Ehrlich, Ryan and Fu, Daniel Y and R. arXiv preprint arXiv:2402.05099 , year=

work page arXiv

[6] [6]

Proceedings of Machine Learning and Systems (MLSys) , year=

Prompt Cache: Modular Attention Reuse for Low-Latency Inference , author=. Proceedings of Machine Learning and Systems (MLSys) , year=

[7] [7]

Gao, Bin and He, Zhuomin and Sharma, Puru and Kang, Qingxuan and Jevdjic, Djordje and Deng, Junbo and Yang, Xingkun and Yu, Zhou and Zuo, Pengfei , journal=

[8] [8]

Pan, Rui and Siriwardhana, Shamane and Patel, Tushar and Ren, Zhen and Kim, Ciby Mathew and Ousterhout, John and Tumanov, Alexey , booktitle=

[9] [9]

Zheng, Zhen and Ji, Xin and Fang, Taosong and Zhou, Fanghao and Liu, Chuanjie and Peng, Gang , journal=

[10] [10]

2025 , note=

Yao, Jiayi and Li, Hanchen and Liu, Yuhan and Ray, Siddhant and Cheng, Yihua and Zhang, Qizheng and Du, Kuntai and Lu, Shan and Jiang, Junchen , booktitle=. 2025 , note=

2025

[11] [11]

2025 , note=

Hu, Junhao and Huang, Wenrui and Wang, Weidong and Peng, Hao and Feng, Haoyi and Chen, Xusheng and Shan, Yizhou and Xie, Tao , booktitle=. 2025 , note=

2025

[12] [12]

Yang, Jingbo and Hou, Bairu and Wei, Wei and Bao, Yujia and Chang, Shiyu , journal=

[13] [13]

Wang, Qian and others , journal=

[14] [14]

Wang, others , journal=

[15] [15]

2025 , note=

Agarwal, Shubham and others , booktitle=. 2025 , note=

2025

[16] [16]

2025 , note=

Srivatsa, Vikranth and He, Zijian and Abhyankar, Reyna and Li, Dongming and Zhang, Yiying , booktitle=. 2025 , note=

2025

[17] [17]

2025 , note=

Qin, Ruoyu and Li, Zheming and He, Weiran and Cui, Jialei and Ren, Feng and Zhang, Mingxing and Wu, Yongwei and Zheng, Weimin and Xu, Xinran , booktitle=. 2025 , note=

2025

[18] [18]

Hu, Cunchen and Huang, Heyang and Hu, Junhao and Xu, Jiang and Chen, Xusheng and Xie, Tao and Wang, Chenxi and Wang, Sa and Bao, Yungang and Sun, Ninghui and Shan, Yizhou , journal=

[19] [19]

Liu, Yuhan and Yao, Jiayi and Cheng, Yihua and An, Yuwei and Chen, Xiaokun and Feng, Shaoting and Huang, Yuyang and Shen, Samuel and Zhang, Rui and Du, Kuntai and Jiang, Junchen , journal=

[20] [20]

2024 , note=

Sun, Biao and Huang, Ziming and Zhao, Hanyu and Xiao, Wencong and Zhang, Xinyi and Li, Yong and Lin, Wei , booktitle=. 2024 , note=

2024

[21] [21]

2024 , note=

Zhong, Yinmin and Liu, Shengyu and Chen, Junda and Hu, Jianbo and Zhu, Yibo and Liu, Xuanzhe and Jin, Xin and Zhang, Hao , booktitle=. 2024 , note=

2024

[22] [22]

Proceedings of the 51st International Symposium on Computer Architecture (ISCA) , year=

Patel, Pratyush and Choukse, Esha and Zhang, Chaojie and Shah, Aashaka and Goiri,. Proceedings of the 51st International Symposium on Computer Architecture (ISCA) , year=

[23] [23]

2024 , note=

Strati, Foteini and McAllister, Sara and Phanishayee, Amar and Tarnawski, Jakub and Klimovic, Ana , booktitle=. 2024 , note=

2024

[24] [24]

2024 , note=

Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and Maire, Michael and Hoffmann, Henry and Holtzman, Ari and Jiang, Junchen , booktitle=. 2024 , note=

2024

[25] [25]

Chen, Shiyang and others , journal=

[26] [26]

Li, Weiqing and Jiang, Guochao and others , journal=

[27] [27]

Proceedings of the International Conference on Learning Representations (ICLR) , year=

Ring Attention with Blockwise Transformers for Near-Infinite Context , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=

[28] [28]

arXiv preprint arXiv:2311.09431 , year=

Striped Attention: Faster Ring Attention for Causal Transformers , author=. arXiv preprint arXiv:2311.09431 , year=

work page arXiv

[29] [29]

2024 , note=

Wu, Bingyang and Liu, Shengyu and Zhong, Yinmin and Sun, Peng and Liu, Xuanzhe and Jin, Xin , booktitle=. 2024 , note=

2024

[30] [30]

arXiv preprint arXiv:2411.01783 , year=

Context Parallelism for Scalable Million-Token Inference , author=. arXiv preprint arXiv:2411.01783 , year=

work page arXiv

[31] [31]

Fang, Jiarui and Zhao, Shangchun , journal=

[32] [32]

Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length

Agrawal, Amey and Chen, Junda and Goiri,. Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length. arXiv preprint arXiv:2409.17264 , year=

work page arXiv

[33] [33]

Star Attention: Efficient

Acharya, Shantanu and Fernandez, Jared and Ginsburg, Boris , journal=. Star Attention: Efficient. 2025 , note=

2025

[34] [34]

Lin, Bin and Peng, Tao and Zhang, Chen and Sun, Minmin and Li, Lanbo and Zhao, Hanyu and Xiao, Wencong and Xu, Qi and Qiu, Xiafei and Li, Shen and Ji, Zhigang and Li, Yong and Lin, Wei , journal=

[35] [35]

Selective

Chu, Kexin and others , journal=. Selective

[36] [36]

Learned Prefix Caching for Efficient

Anonymous , booktitle=. Learned Prefix Caching for Efficient

[37] [37]

Anonymous , journal=

[38] [38]

Advances in Neural Information Processing Systems (NeurIPS) , volume=

Dao, Tri and Fu, Dan and Ermon, Stefano and Rudra, Atri and R. Advances in Neural Information Processing Systems (NeurIPS) , volume=

[39] [39]

Accelerating Self-Attentions for

Ye, Zihao and Chen, Lequn and Lai, Ruihang and Zhao, Yilong and Zheng, Size and Shao, Junru and Hou, Bohan and Jin, Hongyi and Zuo, Yifei and Yin, Liangsheng and Chen, Tianqi and Ceze, Luis , howpublished=. Accelerating Self-Attentions for

[40] [40]

Yu, Gyeong-In and Jeong, Joo Seong and Kim, Geon-Woo and Kim, Soojeong and Chun, Byung-Gon , booktitle=

[41] [41]

ACM Transactions on Storage , year=

Mooncake: A kvcache-centric disaggregated architecture for llm serving , author=. ACM Transactions on Storage , year=

[42] [42]

ACM SIGOPS Operating Systems Review , volume=

Managing update conflicts in Bayou, a weakly connected replicated storage system , author=. ACM SIGOPS Operating Systems Review , volume=. 1995 , publisher=

1995

[43] [43]

Proceedings of the VLDB Endowment , volume=

Coordination avoidance in database systems , author=. Proceedings of the VLDB Endowment , volume=. 2014 , publisher=

2014

[44] [44]

2022 USENIX Annual Technical Conference (USENIX ATC 22) , pages=

Amazon \ DynamoDB \ : A scalable, predictably performant, and fully managed \ NoSQL \ database service , author=. 2022 USENIX Annual Technical Conference (USENIX ATC 22) , pages=

2022

[45] [45]

ACM SIGOPS operating systems review , volume=

Dynamo: Amazon's highly available key-value store , author=. ACM SIGOPS operating systems review , volume=. 2007 , publisher=

2007

[46] [46]

Journal of the ACM (JACM) , volume=

Consensus in the presence of partial synchrony , author=. Journal of the ACM (JACM) , volume=. 1988 , publisher=

1988

[47] [47]

Implementing pushback: Router-based defense against DDoS attacks , author=