KernelFlume: Elastic Core-Attention Scaling for Agentic Long-Context Decoding

Guangyu Xiang; Lin Zhang; Shaohuai Shi; Wenxiang Lin; Xiaowen Chu; Xueze Kang; Yuxin Wang

arxiv: 2606.29207 · v1 · pith:TEBNBAJXnew · submitted 2026-06-28 · 💻 cs.DC


Pith reviewed 2026-06-30 02:49 UTC · model grok-4.3

classification 💻 cs.DC

keywords KernelFlume · elastic core-attention scaling · KV cache disaggregation · long-context decoding · agentic workloads · LLM serving · query-first dispatch · UCX communication


The pith

KernelFlume disaggregates core attention into elastic weightless nodes so KV capacity scales without full model replicas for bursty long-context agentic workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that LLM serving for agents and long conversations can avoid the overhead of spinning up complete model copies by separating the stable projection and FFN kernels from attention computation. Weight nodes keep the dense operations, while separate attention nodes hold token-range KV partitions and grow or shrink with request demand. A routing table directs queries to the right attention nodes, and query-first dispatch plus inter-layer pipelining keeps per-token latency low by overlapping remote work with local computation. In real GPU tests with Llama-3.1-8B, the system holds p99 time per output token (TPOT) flat while cutting cost per million output tokens by up to 32 percent on A6000 and 61 percent on H100 compared with full-instance elastic scaling. The approach therefore lets serving systems add KV memory capacity at lower startup and memory cost when context lengths surge.

Core claim

KernelFlume disaggregates the stable projection/FFN path from core-attention computation: weight nodes run dense kernels, while weightless attention nodes store KV partitions and scale with token-range demand. A routing table maps token ranges to attention-node endpoints, updates at token boundaries, and drives pre-registered UCX communication outside CUDA Graphs. Query-first dispatch combined with inter-layer pipelining overlaps remote attention and communication with local work. Under dynamic long-context agentic workloads, this yields flat p99 TPOTs of approximately 74 ms on A6000 and 34 ms on H100, with cost reductions of up to 32 percent and 61 percent, respectively, relative to ServerlessLLM.

What carries the argument

The routing table that maps token ranges to attention-node endpoints and drives host-visible graph signals for UCX communication outside captured CUDA Graphs.
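
To make the routing idea concrete, here is a minimal Python sketch of a token-range routing table. It is an illustration under assumptions, not the paper's implementation: the class name, the endpoint labels `A0`/`A1`/`A2`, and the choice to append a new range on scale-up are all hypothetical. The A0/A1 split follows the example in Figure 5; what the table does once the context grows past T is truncated in the source, so the `A2` extension below is a guess.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class TokenRangeRoutingTable:
    """Maps half-open token ranges [start, end) to attention-node endpoints."""
    starts: list = field(default_factory=list)     # sorted range starts
    ends: list = field(default_factory=list)       # matching exclusive ends
    endpoints: list = field(default_factory=list)  # hypothetical endpoint labels

    def extend(self, length: int, endpoint: str) -> None:
        """Append a range after the last one (a scale-up applied at a token boundary)."""
        start = self.ends[-1] if self.ends else 0
        self.starts.append(start)
        self.ends.append(start + length)
        self.endpoints.append(endpoint)

    def lookup(self, token_pos: int) -> str:
        """Return the endpoint whose range contains token_pos."""
        i = bisect_right(self.starts, token_pos) - 1
        if i < 0 or token_pos >= self.ends[i]:
            raise KeyError(f"no attention node covers token {token_pos}")
        return self.endpoints[i]

    def capacity(self) -> int:
        """Total number of token positions currently routable."""
        return self.ends[-1] if self.ends else 0


T = 8192  # hypothetical initial KV capacity in tokens
table = TokenRangeRoutingTable()
table.extend(T // 2, "A0")  # A0 holds [0, T/2)
table.extend(T // 2, "A1")  # A1 holds [T/2, T)
table.extend(T // 2, "A2")  # assumed scale-up once the context grows beyond T
print(table.lookup(0), table.lookup(T // 2), table.lookup(T))
```

Keeping the ranges sorted makes each lookup a binary search, so the per-token routing cost stays small even as nodes are added; a real system would also need removal and rebalancing, which this sketch omits.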

If this is right

p99 TPOT remains flat at approximately 74 ms on A6000 and 34 ms on H100 during bursty demand.
Cost per million output tokens falls by up to 32 percent on A6000 and 61 percent on H100 versus full-instance scaling.
Simulation at larger model scales projects 56 to 66 percent cost reduction, widening to 80 to 85 percent with cheaper heterogeneous attention hardware.
The cost advantage holds into the million-token context range.
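
The cost metric in these bullets can be made explicit with a short calculation. The sketch below assumes the usual definition (fleet price per hour divided by output tokens per hour, scaled to one million tokens); the prices, GPU counts, and throughput are hypothetical placeholders rather than figures from the paper.

```python
def cost_per_million_output_tokens(fleet, tokens_per_second):
    """fleet: list of (gpu_count, usd_per_gpu_hour) pairs. Returns USD per 1M output tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    usd_per_hour = sum(count * price for count, price in fleet)
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


def relative_reduction(baseline_cost, candidate_cost):
    """Fractional saving of the candidate relative to the baseline."""
    return 1.0 - candidate_cost / baseline_cost


# Hypothetical scenario: the same throughput served two ways.
full_instances = [(4, 3.00)]            # four weight-bearing GPUs
disaggregated = [(2, 3.00), (2, 1.00)]  # two weight GPUs plus two cheaper attention GPUs
throughput = 2000.0                     # output tokens per second (assumed equal for both)

baseline = cost_per_million_output_tokens(full_instances, throughput)
candidate = cost_per_million_output_tokens(disaggregated, throughput)
print(f"baseline ${baseline:.2f}, candidate ${candidate:.2f}, "
      f"reduction {relative_reduction(baseline, candidate):.0%}")
```

The example isolates one lever, cheaper attention-node hardware at equal throughput. The reported savings also depend on startup behavior and batch fragmentation, which this arithmetic does not model.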

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The design could allow attention nodes to use cheaper or specialized hardware while weight nodes stay on high-end GPUs.
The same token-range routing and pipelining pattern might apply to other disaggregated serving setups that separate KV storage from dense compute.
Because the method avoids full model replication, it could reduce memory fragmentation when many concurrent long-context sessions share the same base weights.

Load-bearing premise

The separation of stable projection and FFN kernels from core-attention computation can be made elastic through a routing table and host-visible signals without adding unacceptable per-token latency.

What would settle it

Run the same dynamic long-context agentic trace on the A6000 testbed while scaling the number of attention nodes and check whether p99 TPOT rises above 74 ms or cost per million output tokens fails to drop below the ServerlessLLM baseline.
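
The latency half of this check reduces to a percentile computation over per-token latencies. The sketch below uses the nearest-rank definition of a percentile, which is one common convention and not necessarily the one the paper uses; the sample values are invented, and the 74 ms bound is the approximate A6000 figure quoted above.

```python
import math


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def tpot_stays_flat(tpot_ms, bound_ms=74.0, p=99):
    """True if the p-th percentile TPOT does not exceed the bound."""
    return percentile_nearest_rank(tpot_ms, p) <= bound_ms


# Hypothetical traces of 200 decode steps each, in milliseconds.
two_spikes = [70.0] * 198 + [150.0] * 2
three_spikes = [70.0] * 197 + [150.0] * 3
print(tpot_stays_flat(two_spikes), tpot_stays_flat(three_spikes))
```

With 200 samples, p99 ignores the two slowest steps, so a third slow step is enough to move the tail. That is why short scale-out stalls show up in p99 even when the median is unchanged.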

Figures

Figures reproduced from arXiv: 2606.29207 by Guangyu Xiang, Lin Zhang, Shaohuai Shi, Wenxiang Lin, Xiaowen Chu, Xueze Kang, Yuxin Wang.

**Figure 1.** Conceptual KV-capacity scaling during long-context decode (schematic; measured results in §6). ServerlessLLM adds KV capacity only by scaling full instances; each scale-out (shaded) incurs an instance-start spike that pushes TPOT past the decode SLO. KernelFlume scales KV capacity within the instance via weightless route updates that track demand, keeping TPOT flat.

**Figure 2.** Two static scaling strategies for KV capacity. Data Parallelism (DP) replicates the full model and isolates requests, suiting large batches of shorter sequences. Context Parallelism (CP) partitions a single sequence’s KV cache across GPUs, suiting long sequences.

**Figure 4.** KernelFlume’s architecture comprises weight nodes and an elastic set of attention nodes. (a) Full-instance scaling adds a complete weight-bearing instance. (b) KernelFlume adds a weightless attention node and extends the token-range routing table.

**Figure 5.** Elastic routing on a static graph. The topology-unaware weight-node CUDA Graph (top) contains local compute and host-visible readiness signals. The CPU-side routing loop (bottom) reads those signals and uses the routing table that maps token ranges to UCX endpoints. In this example, a request’s KV cache initially spans A0 $[0, T/2)$ and A1 $[T/2, T)$. When the context grows beyond $T$, the routing table is …

**Figure 6.** Layer-level timelines for normal attention (left) and query-first attention (right). With query-first attention, $Q$ is dispatched before the full QKV projection completes, allowing attention nodes to begin computing on historical KV in parallel with weight nodes’ remaining K/V projection.

**Figure 7.** Serialized (top) vs. pipelined (bottom) execution with $M=2$ microbatches across four layers. Pipelining fills weight nodes’ idle gaps by interleaving projection and FFN from alternating microbatches. … serial throughput is $G_{\text{pipe}} = \frac{T_p + T_A(N_A) + T_f}{\max\left(T_p + T_f,\; T_A(N_A)\right)}$. The ideal upper bound is 2× when the W-side stage and the A-side attention stage take similar time.

**Figure 9.** Topology-switch overhead. (a) Prepared route-install overhead distribution (mean scale-up: 7.2 µs, scale-down: 8.3 µs). (b) Comparison with full-instance scaling: prewarmed route-install takes 7.2 µs, and even no-prewarm A-node bring-up (412 ms, no weights) is shorter than the 1.1 s ServerlessLLM and 10.8 s full-instance disk cold start.

**Figure 10.** Decode latency of disaggregated execution relative to the non-disaggregated reference (a single GPU, no W–A split; 1.0×, lower is better) across six GQA/MHA models. With QFA, disaggregated decode stays within a few percent of the reference, and runs faster than it for the MHA models.

**Figure 12.** End-to-end elasticity under the Codex/SWE-bench Pro agentic trace on both testbeds. Top row: intra-node A6000; bottom row: cross-node 16×H100. Within each row, the left panels show TPOT CDFs and the right panels show TPOT over time for the Low, Medium, and High workloads. KernelFlume (green) keeps TPOT tight as variance grows; the static and ServerlessLLM baselines develop heavier tails.
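
The pipelining model quoted alongside Figure 7 can be evaluated numerically. The sketch below reads the flattened expression as a fraction, serial time over pipelined time per layer, which is a reconstruction; it is consistent with the stated 2× ideal bound. The stage times are hypothetical, and `t_a` stands for the attention-stage time $T_A(N_A)$ at some fixed number of attention nodes.

```python
def pipeline_gain(t_p, t_f, t_a):
    """G_pipe = (T_p + T_A + T_f) / max(T_p + T_f, T_A).

    t_p: projection time on the weight node
    t_f: FFN time on the weight node
    t_a: attention-stage time on the attention nodes, T_A(N_A)
    """
    if min(t_p, t_f, t_a) < 0:
        raise ValueError("stage times must be non-negative")
    bottleneck = max(t_p + t_f, t_a)
    if bottleneck == 0:
        raise ValueError("at least one stage time must be positive")
    return (t_p + t_a + t_f) / bottleneck


# Hypothetical per-layer stage times in arbitrary units.
for t_a in (0.5, 2.0, 6.0):
    print(t_a, pipeline_gain(t_p=1.0, t_f=1.0, t_a=t_a))
```

The gain peaks when the weight-side stage and the attention-side stage are balanced and falls off on either side, which suggests that the number of attention nodes, through $T_A(N_A)$, also controls how much of the pipelining benefit is realized.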

Original abstract

LLM serving is increasingly dominated by long and dynamic decode workloads from agents, reasoning models, and extended conversations. When bursty long-context demand exceeds deployed capacity, existing serving systems typically scale out by launching additional serving instances with model replicas. This instance-level elasticity increases KV capacity only by provisioning another full copy of the model, inheriting startup latency, memory overhead, and batch fragmentation. We present KernelFlume, a decode-centric architecture that disaggregates the stable projection/FFN path from core-attention computation: weight nodes execute dense projection/FFN kernels, while weightless attention nodes store token-range KV partitions and scale with request-state demand. To make this separation elastic, KernelFlume maintains a routing table that maps token ranges to attention-node endpoints. It updates routes at token boundaries and uses host-visible graph signals to drive pre-registered UCX endpoint communication outside the captured CUDA Graph. To preserve low per-token latency after disaggregation, KernelFlume combines query-first core-attention dispatch with inter-layer kernel pipelining, overlapping remote attention and communication with local projection/FFN work. On real GPU testbeds (intra-node A6000 and cross-node H100), under a dynamic long-context agentic workload serving Llama-3.1-8B, KernelFlume sustains flat p99 TPOTs of ~74 ms on A6000 and ~34 ms on H100, while lowering cost per million output tokens by up to 32% and 61%, respectively, relative to full-instance elastic scaling with ServerlessLLM, a state-of-the-art instance-startup method. Replaying the same trace at larger model scale in simulation projects a 56–66% cost reduction over ServerlessLLM, widening to 80–85% with cheaper heterogeneous attention-node hardware and persisting into the million-token context range.
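
Storing token-range KV partitions on separate nodes only works if attention computed per partition can be combined into the exact full-context result. The abstract does not say how KernelFlume combines partial results, so the pure-Python sketch below shows one standard way to do it, the log-sum-exp rescaling used by blockwise attention schemes, as an assumption rather than the paper's method. Function names and the toy vectors are hypothetical.

```python
import math


def partial_attention(q, keys, values):
    """Attention over one KV partition in mergeable form.

    Returns (max score, sum of exp(score - max), unnormalized weighted value sum).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return m, sum(weights), acc


def merge_partials(partials):
    """Combine per-partition results into the exact softmax-attention output."""
    m_all = max(m for m, _, _ in partials)
    denom = 0.0
    acc = [0.0] * len(partials[0][2])
    for m, d, a in partials:
        c = math.exp(m - m_all)  # rescale each partition to the global max score
        denom += c * d
        acc = [x + c * y for x, y in zip(acc, a)]
    return [x / denom for x in acc]


# Toy single-head example: four cached tokens split across two partitions.
q = [1.0, 0.5]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]

full = merge_partials([partial_attention(q, K, V)])
split = merge_partials([partial_attention(q, K[:2], V[:2]),
                        partial_attention(q, K[2:], V[2:])])
print(full, split)
```

Each partition returns only a scalar maximum, a scalar sum, and one vector per head, so the data sent back to the weight node is independent of how many tokens the partition holds.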

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

KernelFlume moves core-attention computation onto separate weightless nodes to make elastic scaling for long-context decode cheaper. Hardware measurements show cost reductions versus full-instance methods, but the latency overheads from routing and communication need clearer checks.


The main point is that this paper introduces a disaggregated serving architecture where attention computation is handled by separate weightless nodes that can be scaled elastically based on token demand, while the rest of the model stays on weight nodes. They claim this keeps p99 time per output token flat even as context grows dynamically.

What is new is the combination of a routing table for token ranges, updates at token boundaries, and the use of host-visible graph signals to trigger pre-registered UCX communication without breaking the CUDA Graph. The query-first dispatch and pipelining across layers are meant to hide the remote access latency behind local computation.
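
A toy timing model shows why dispatching the query early can hide remote latency. It is a deliberate simplification and not taken from the paper: it treats one layer as a Q projection, a remaining K/V projection, and a single remote round trip (communication plus attention over historical KV), and it ignores the attention over the newly produced token and any merge step. All durations are hypothetical.

```python
def normal_layer_latency(t_q, t_kv, t_remote):
    """Serialized: finish the whole QKV projection, then run the remote round trip."""
    return t_q + t_kv + t_remote


def query_first_layer_latency(t_q, t_kv, t_remote):
    """Query-first: send Q as soon as it is ready, so remote work overlaps the K/V projection."""
    return t_q + max(t_kv, t_remote)


def hidden_time(t_kv, t_remote):
    """Latency removed from the critical path by the overlap in this toy model."""
    return min(t_kv, t_remote)


# Hypothetical per-layer durations in microseconds.
t_q, t_kv, t_remote = 20, 40, 90
print(normal_layer_latency(t_q, t_kv, t_remote),
      query_first_layer_latency(t_q, t_kv, t_remote),
      hidden_time(t_kv, t_remote))
```

In this model the overlap can hide at most the shorter of the two overlapped stages. When the remote round trip is much longer than the K/V projection, most of it stays exposed, which is where inter-layer pipelining would have to take over.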

The paper does well in providing end-to-end results from actual GPU testbeds, both intra-node and cross-node, using a Llama model on what sounds like a realistic agentic workload trace. The reported cost savings of 32% on A6000 and 61% on H100, plus the simulated larger gains, give a practical sense of the benefit over full instance scaling with ServerlessLLM.

The soft spots center on whether the disaggregation is truly latency neutral. The stress test raises a fair point about potential non-overlapped time from routing updates and UCX calls when partitions change frequently. The abstract and available description do not include per-component timings or experiments that isolate these effects, so the flat p99 claims could be sensitive to how well the overlap works under bursty conditions. The larger scale results are simulations, which limits how much weight to put on them.

This paper is for people building or studying production LLM serving systems, particularly those focused on handling variable long-context loads from agents without over-provisioning full replicas. A reader working on inference optimization or disaggregated architectures would get value from the design choices and the hardware validation.

It deserves a serious referee. The idea addresses a real bottleneck with some measured evidence, even if additional details on the mechanisms would help.

Recommendation: send it to peer review.

Referee Report

2 major / 1 minor

Summary. The paper presents KernelFlume, a decode-centric architecture that disaggregates stable projection/FFN kernels (executed on weight nodes) from core-attention computation (executed on weightless attention nodes that store token-range KV partitions). Elasticity is achieved via a routing table mapping token ranges to attention-node endpoints, updated at token boundaries, with host-visible graph signals driving pre-registered UCX communication outside the CUDA Graph. Query-first dispatch and inter-layer pipelining are used to overlap remote attention/communication with local work. On intra-node A6000 and cross-node H100 testbeds serving Llama-3.1-8B under a dynamic long-context agentic workload, the system reports flat p99 TPOTs of ~74 ms and ~34 ms respectively, with cost-per-million-output-tokens reductions of up to 32% and 61% versus full-instance elastic scaling with ServerlessLLM; simulation projects 56-85% savings at larger scales and million-token contexts.

Significance. If the reported flat p99 TPOTs hold under the claimed mechanisms, the work addresses a practical bottleneck in serving bursty long-context agentic workloads by avoiding full-model replication and its associated startup, memory, and fragmentation costs. The concrete GPU testbed numbers and heterogeneous-hardware projections provide a falsifiable basis for evaluating cost-efficiency gains in production LLM serving systems.

major comments (2)

[Abstract] The central claim of latency-neutral disaggregation (flat p99 TPOTs of ~74 ms / ~34 ms) rests on query-first dispatch plus inter-layer pipelining fully overlapping remote attention/UCX with local projection/FFN work and on routing-table updates at token boundaries not introducing jitter. No per-component latency breakdown, ablation of the routing/pipelining mechanisms, error bars, or workload-trace description is supplied to isolate these overheads from the end-to-end numbers; any residual non-overlapped time would directly affect the reported tails.
[Abstract] The cost-reduction claims (32%/61% on A6000/H100, widening to 80-85% with heterogeneous attention nodes) are load-bearing for the contribution yet are presented only as aggregate outcomes relative to ServerlessLLM; without an accounting of how KV-partition scaling, routing-table maintenance, and UCX registration costs are measured or amortized, it is impossible to assess whether the savings are robust to changes in request-state demand or context length.

minor comments (1)

[Abstract] The abstract refers to 'real GPU testbeds (intra-node A6000 and cross-node H100)' and 'replaying the same trace at larger model scale in simulation' without specifying the exact hardware topology, interconnect, or simulation methodology; adding these details would improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the need for component-level breakdowns and cost accounting to support the latency and savings claims. We will perform a major revision incorporating the requested details, ablations, and analyses.


Referee: [Abstract] The central claim of latency-neutral disaggregation (flat p99 TPOTs of ~74 ms / ~34 ms) rests on query-first dispatch plus inter-layer pipelining fully overlapping remote attention/UCX with local projection/FFN work and on routing-table updates at token boundaries not introducing jitter. No per-component latency breakdown, ablation of the routing/pipelining mechanisms, error bars, or workload-trace description is supplied to isolate these overheads from the end-to-end numbers; any residual non-overlapped time would directly affect the reported tails.

Authors: We agree that isolating the overheads of query-first dispatch, inter-layer pipelining, and routing-table updates is important for validating the flat p99 TPOT claims. In the revised manuscript we will add a dedicated subsection with per-component latency breakdowns (local projection/FFN, remote attention, UCX communication) measured via CUDA events, plus ablations that disable pipelining and routing updates individually. Error bars from five independent runs will be reported on all end-to-end and component metrics. The workload trace (synthetic agentic long-context benchmark with context lengths drawn from 4k–128k tokens and bursty arrivals) will be described in expanded detail in Section 4.1, including arrival-rate distribution and context-growth model. revision: yes
Referee: [Abstract] The cost-reduction claims (32%/61% on A6000/H100, widening to 80-85% with heterogeneous attention nodes) are load-bearing for the contribution yet are presented only as aggregate outcomes relative to ServerlessLLM; without an accounting of how KV-partition scaling, routing-table maintenance, and UCX registration costs are measured or amortized, it is impossible to assess whether the savings are robust to changes in request-state demand or context length.

Authors: We concur that an explicit cost-component breakdown is required to demonstrate robustness. The revision will include a new cost-modeling subsection that itemizes (1) KV-partition memory scaling (measured per-token KV cache size on attention nodes), (2) routing-table maintenance (host CPU cycles per token-boundary update), and (3) UCX registration (amortized over attention-node lifetime). We will add sensitivity plots showing cost per million tokens versus context length and request-state demand, plus an appendix table with raw component costs for both testbeds and the simulated larger-scale scenarios. revision: yes
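
The first cost component named in this response, per-token KV cache size, follows directly from the model shape. The sketch below assumes the published Llama-3.1-8B configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128) and 16-bit KV entries; the 40 GiB of usable memory per attention node is a hypothetical figure, and the context lengths span the 4k–128k range mentioned in the rebuttal.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """KV cache bytes for one token of one sequence."""
    # The factor 2 covers one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_gib(context_tokens, bytes_per_token):
    """KV cache size in GiB for one sequence of the given length."""
    return context_tokens * bytes_per_token / 2**30


def attention_nodes_needed(context_tokens, bytes_per_token, usable_gib_per_node):
    """Minimum node count if KV is the only thing an attention node stores."""
    return math.ceil(kv_gib(context_tokens, bytes_per_token) / usable_gib_per_node)


LLAMA_3_1_8B = {"num_layers": 32, "num_kv_heads": 8, "head_dim": 128}
per_token = kv_bytes_per_token(**LLAMA_3_1_8B)

for tokens in (4 * 1024, 32 * 1024, 128 * 1024):
    print(tokens, kv_gib(tokens, per_token),
          attention_nodes_needed(tokens, per_token, usable_gib_per_node=40))
```

Because the per-token size is constant, KV memory grows linearly with context length, so a sensitivity plot of cost against context length would mostly reflect how many attention nodes are provisioned and how fully each one is used.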

Circularity Check

0 steps flagged

No circularity: architecture and empirical results only


The manuscript presents a systems design (disaggregation of projection/FFN from attention via routing table, UCX, query-first dispatch, and pipelining) together with end-to-end latency and cost measurements on A6000/H100 testbeds under an agentic trace. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Performance numbers are reported as direct observations relative to ServerlessLLM (an external baseline), not reductions to any internal definition or prior self-result. The derivation chain is therefore self-contained experimental reporting.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities; all fields left empty.

pith-pipeline@v0.9.1-grok · 5896 in / 1221 out tokens · 38105 ms · 2026-06-30T02:49:21.091993+00:00 · methodology


