SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

Jiahuan Yu; Mingtao Hu; Minjia Zhang; Zichao Lin

arXiv: 2601.20309 · v2 · pith:3S7CATW2new · submitted 2026-01-28 · 💻 cs.DC · cs.AI · cs.LG


Pith reviewed 2026-05-21 15:21 UTC · model grok-4.3

classification 💻 cs.DC · cs.AI · cs.LG

keywords LLM inference · SLO-aware scheduling · rotary scheduling · KV cache management · superchips · memory offloading · latency optimization


The pith

SuperInfer uses SLO-aware rotary scheduling and duplex memory transfers to improve TTFT SLO attainment by up to 74.7% on GH200 superchips.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LLM inference systems must balance strict latency targets against limited GPU memory, and high request rates often exhaust the KV cache and cause head-of-line blocking. Earlier offloading approaches over slower links could not keep up with tight time-to-first-token and time-between-tokens goals. SuperInfer solves this on superchips that tightly couple GPU and CPU through NVLink-C2C by rotating requests proactively according to SLOs and moving data with an optimized full-duplex engine. The result is a large rise in the fraction of requests that meet first-token latency targets while throughput and token generation speed stay comparable to current systems. This shows that hardware-specific co-design of scheduling and memory movement can make LLM serving more responsive under load.

Core claim

SuperInfer demonstrates that a proactive SLO-aware rotary scheduler together with a full-duplex KV-cache rotation engine on tightly coupled GPU-CPU superchips raises time-to-first-token SLO attainment rates by up to 74.7 percent while keeping time-between-tokens and throughput comparable to state-of-the-art LLM inference systems.
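To make the headline metric concrete, the following is a minimal sketch of how a TTFT SLO attainment rate could be computed from per-request timestamps. The `RequestTrace` type, its field names, and the 1-second target are illustrative assumptions and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request record; field names are illustrative."""
    arrival_s: float       # time the request arrived
    first_token_s: float   # time the first output token was produced


def ttft_slo_attainment(traces: list[RequestTrace], ttft_slo_s: float) -> float:
    """Fraction of requests whose time-to-first-token is within the SLO."""
    if not traces:
        return 0.0
    met = sum(1 for t in traces if (t.first_token_s - t.arrival_s) <= ttft_slo_s)
    return met / len(traces)


# Toy data: the second request waits 2.5 s and misses a 1 s target.
traces = [
    RequestTrace(arrival_s=0.0, first_token_s=0.4),
    RequestTrace(arrival_s=1.0, first_token_s=3.5),
    RequestTrace(arrival_s=2.0, first_token_s=2.9),
    RequestTrace(arrival_s=3.0, first_token_s=3.2),
]
rate = ttft_slo_attainment(traces, ttft_slo_s=1.0)
print(f"TTFT SLO attainment: {rate:.0%}")  # 3 of 4 requests meet the target
```

A TBT attainment rate could be computed the same way from gaps between consecutive token timestamps.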

What carries the argument

RotaSched is the proactive SLO-aware rotary scheduler that rotates requests to preserve responsiveness, paired with DuplexKV, the rotation engine that performs full-duplex transfers over NVLink-C2C.
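As a rough illustration of the rotation idea, here is a toy sketch in which requests move between waiting, running, and rotary states according to how much of their latency budget they have used. This is not RotaSched's actual policy: the `urgency` score, the `Request` fields, and the slot-based capacity model are all hypothetical simplifications.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    WAITING = "waiting"   # no KV cache yet, first token not produced
    RUNNING = "running"   # KV cache resident in GPU memory
    ROTARY = "rotary"     # paused, KV cache parked in CPU memory


@dataclass
class Request:
    rid: str
    state: State
    last_progress_s: float  # arrival time, or time of the last generated token
    slo_s: float            # TTFT budget if waiting, TBT budget otherwise


def urgency(req: Request, now_s: float) -> float:
    """Hypothetical score: fraction of the SLO budget already consumed."""
    return (now_s - req.last_progress_s) / req.slo_s


def rotate(requests: list[Request], gpu_slots: int, now_s: float) -> dict[str, list[str]]:
    """Give GPU slots to the most urgent requests and rotate the rest out."""
    ranked = sorted(requests, key=lambda r: urgency(r, now_s), reverse=True)
    keep = {r.rid for r in ranked[:gpu_slots]}
    plan: dict[str, list[str]] = {"swap_out": [], "swap_in": [], "admit": []}
    for r in requests:
        if r.rid in keep:
            if r.state is State.ROTARY:
                plan["swap_in"].append(r.rid)    # KV cache returns to the GPU
            elif r.state is State.WAITING:
                plan["admit"].append(r.rid)      # start prefill
            r.state = State.RUNNING
        elif r.state is State.RUNNING:
            plan["swap_out"].append(r.rid)       # KV cache moves to the CPU
            r.state = State.ROTARY
    return plan


# Four requests, two GPU slots (toy numbers).
reqs = [
    Request("A", State.RUNNING, last_progress_s=9.9, slo_s=0.2),
    Request("B", State.RUNNING, last_progress_s=9.95, slo_s=0.2),
    Request("C", State.ROTARY, last_progress_s=9.7, slo_s=0.2),
    Request("D", State.WAITING, last_progress_s=8.0, slo_s=2.0),
]
plan = rotate(reqs, gpu_slots=2, now_s=10.0)
print(plan)  # {'swap_out': ['A', 'B'], 'swap_in': ['C'], 'admit': ['D']}
```

In this toy run the swap-out and swap-in lists are both non-empty in the same step, which is the situation where transfers in both directions at once would matter.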

If this is right

High request rates no longer produce severe head-of-line blocking once KV cache space runs out.
Requests can be moved to CPU memory and back without violating tight TTFT or TBT targets.
Memory capacity on the superchip is used more effectively through coordinated rotation rather than static allocation.
Throughput stays comparable to existing systems while a much higher share of requests meet their latency SLOs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rotation idea may apply to other platforms that provide fast CPU-GPU memory links.
Proactive rather than reactive offloading could be combined with existing batching or quantization methods.
Hardware designers might prioritize even lower-latency interconnects to support higher rotation rates.

Load-bearing premise

The NVLink-C2C link between GPU and CPU on superchips supplies low-overhead full-duplex transfers that keep the system responsive at high request rates without creating new bottlenecks.

What would settle it

Running the same workload on hardware without a fast GPU-CPU interconnect or at request rates where transfer latency exceeds the SLO budget would show whether the reported gains remain.
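The condition "transfer latency exceeds the SLO budget" can be estimated on paper before running anything. The sketch below uses the standard KV-cache size formula and an idealized transfer time; the model shape, the two bandwidth figures, and the 10 ms budget are illustrative assumptions, not measurements from the paper.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Size of the K and V entries that one token adds across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def swap_time_s(num_tokens: int, bytes_per_token: int, bandwidth_bytes_per_s: float) -> float:
    """Idealized one-way transfer time, ignoring launch and queuing overheads."""
    return num_tokens * bytes_per_token / bandwidth_bytes_per_s


def fits_budget(swap_s: float, slo_budget_s: float) -> bool:
    """True if the transfer alone fits inside the latency budget."""
    return swap_s <= slo_budget_s


# Hypothetical model shape: 64 layers, 8 KV heads, head dim 128, 2-byte elements.
per_token = kv_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128, dtype_bytes=2)

# Hypothetical link speeds: a slower link and a faster link, in bytes per second.
for label, bandwidth in (("slow link", 25e9), ("fast link", 300e9)):
    t = swap_time_s(num_tokens=2048, bytes_per_token=per_token, bandwidth_bytes_per_s=bandwidth)
    verdict = "fits" if fits_budget(t, slo_budget_s=0.010) else "exceeds"
    print(f"{label}: {t * 1e3:.2f} ms for a 2048-token request, {verdict} a 10 ms budget")
```

Because the estimate ignores kernel launch cost and contention, it gives a lower bound on transfer time; a real test would compare it against measured rotation latency.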

Figures

Figures reproduced from arXiv: 2601.20309 by Jiahuan Yu, Mingtao Hu, Minjia Zhang, Zichao Lin.

**Figure 1.** Two static offloading policies: Waiting-First (WF) and Swapped-First (SF), and comparison of their P99 TTFT and TBT to FCFS under varying request rates (Qwen2.5-32B, ShareGPT). Adjacent text (Section 3.1, SLO-Aware Offloading as a Challenge): "While sophisticated scheduling techniques exist, they cannot fundamentally overcome the GPU memory pressure imposed by hardware constraints (more details in Appendix A). A natural idea to mi…"

**Figure 2.** P99 TTFT and TBT latency vs. swap bandwidth for vLLM with offloading (Qwen2.5-32B, ShareGPT, RPS=20).

**Figure 7.** Analogy: LLM serving stack on GH200 (left) vs. OS on CPU (right). Requests → threads, Hopper HBM → on-chip cache, Grace DRAM → main memory, KV cache → thread data. Adjacent text: "… based on their SLO status. It introduces a rotary state: a transient execution state where a request's progress is temporarily paused on GPU and its KV cache is swapped to the CPU, waiting for next rotation. This enables an LLM inference schedu…"

**Figure 9.** A conceptual example to show how LVF rotates executions. There are 4 requests in total, and HBM can only hold 2 requests. Numbers refer to VLT. Adjacent text: "… current system time, t_last is the time of last generated token, t_arr is request arrival time, t_run is the time a request begins in the running state"

**Figure 11.** Layer, block, and segment structure of the PagedAttention KV cache. Colors denote different requests, and numbers denote relative segment addresses.

**Figure 13.** Left figure shows how data races occur when the swap-in destination block is the swap-out source block. Right figure shows our eager block rotation, which breaks the data dependency to enable concurrent swap-in and swap-out. Adjacent text: "… the request finishes. DuplexKV eagerly swaps out synced blocks from HBM and DRAM in the background, even before preemption. These early transfers mark the corresponding HBM blocks as …"

**Figure 15.** Comparing execution flow of vLLM (up) and SuperInfer (down). Schedule and KV cache transfers (2 CUDA streams) are overlapped with model execution in SuperInfer. Adjacent text (Section 5.1, Evaluation Methodology): "Implementation. We implement SuperInfer in Python and C++ on top of vLLM (Kwon et al., 2023) (v0.6.6.post1), a widely used production-level LLM inference framework. Models and Workloads. We evaluate SuperIn…"

**Figure 16.** Comparison of SuperInfer against baselines across various models, datasets, and request rates (RPS). SuperInfer achieves significant improvements in TTFT SLO attainment over baselines, while preserving TBT SLO attainment comparable to others. Adjacent text: "… trade-off between request queuing and harmful evictions by estimating future memory occupancy. LTR approximates SJF using learning-based request length ranking. NEO …"

**Figure 20.** P99 TTFT and TBT of SuperInfer (various β_B values).

**Figure 18.** TTFT and TBT SLO attainment rate of SuperInfer under various α. Larger α leads to better TBT but worse TTFT.

**Figure 19.** P99 TTFT and TBT of SuperInfer (various β_F values). Adjacent text: "… with varying α ≥ 1 and fixed β_B = β_F = 0. It shows that larger α yields better TBT SLO attainment, as rotary requests get larger VLTs for prioritization. However, this comes at the cost of lower TTFT attainment, as waiting requests are relatively delayed. α = 3 offers a balanced sweet spot; further increases bring diminishing TBT benefits while signific…"

**Figure 21.** Comparing P99 TTFT and TBT of SuperInfer with various B_xfer. Higher B_xfer significantly reduces tail latencies.

**Figure 22.** Throughput of vLLM and SuperInfer on three models (Mixtral-8x7B, Qwen2.5-32B, Llama3-8B; ShareGPT and LMSYS datasets). Adjacent text: "… rotary states. Also, larger budgets yield greater improvements, confirming the necessity of high swap bandwidth for effective SLO-aware LLM serving with offloading. How does SuperInfer affect the throughput?"

**Figure 23.** KV cache usage and waiting request number for vLLM with FCFS and SJF-oracle scheduler. Model: Qwen2.5-32B, dataset: ShareGPT, RPS=20. Adjacent text (Appendix B, Bandwidth Measurement): "We measure the CPU-GPU bandwidth for GH200 and H200 using NVIDIA's open-source nvbandwidth tool (v0.8). We focus on two specific bidirectional copy engine (CE) test cases: host to device bidirectional memcpy - ce (Test ID 2); device to host bidire…"

**Figure 25.** Comparing vLLM with a variant that stores the KV cache in GH200's Unified Memory (UM). vLLM on UM shows significant TBT degradation. Adjacent text: "… 2024). This allows the Hopper GPU to directly access the Grace CPU's DRAM without incurring any page faults. GH200 does support page migration, but instead of being page-fault driven, it uses hardware access counters to track the access frequency of pages from both the CPU and GP…"

Original abstract

Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving. Code is available in https://github.com/Supercomputing-System-AI-Lab/SuperInfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SuperInfer delivers practical TTFT improvements on Superchips through rotary scheduling and duplex transfers, but lacks the ablations needed to confirm the hardware co-design is the key driver.


The key point with this paper is that SuperInfer combines a proactive SLO-aware rotary scheduler with a duplex KV cache rotation engine built for the NVLink-C2C links on GH200 Superchips, and it reports up to 74.7 percent better TTFT SLO attainment while keeping TBT and throughput in line with existing systems. The new part is the explicit co-design for this tightly coupled GPU-CPU setup. Earlier offloading strategies struggled with higher latency links, so focusing on low-overhead full-duplex transfers here is a reasonable step forward for hardware that supports it.

The work does well by running evaluations on real GH200 hardware with different models and datasets. Releasing the code is also helpful for anyone wanting to check or build on the implementation.

Where it is softer is in the evaluation details. The results are aggregate, so we do not see separate measurements of the actual transfer times or bandwidth use during rotations under heavy load. Without those, it is difficult to know whether the duplex engine is truly avoiding new bottlenecks or if the scheduler is doing most of the heavy lifting. The concern about unmeasured overheads holds up based on what is shown.

This paper is aimed at researchers and engineers working on high-throughput LLM serving, particularly those with access to Superchip-class hardware. Readers focused on practical latency management in inference systems will get the most out of the concrete numbers and the hardware-specific optimizations. It deserves a serious referee because the problem it tackles is current and the approach is tied to real hardware constraints. I would recommend sending this to peer review.

Referee Report

1 major / 1 minor

Summary. The paper presents SuperInfer, an LLM inference system for NVIDIA GH200 Superchips that combines RotaSched—a proactive, SLO-aware rotary scheduler—with DuplexKV, an engine for full-duplex KV-cache transfers over NVLink-C2C. The central claim is that this co-design improves TTFT SLO attainment rates by up to 74.7% relative to state-of-the-art systems while preserving comparable TBT and throughput, by mitigating head-of-line blocking when KV-cache capacity is exhausted.

Significance. If the empirical results hold, the work demonstrates that tightly coupled GPU-CPU architectures can materially improve responsiveness for LLM serving under high load, where prior PCIe-based offloading approaches have failed. The public release of code is a clear strength that supports reproducibility and follow-on research.

major comments (1)

Evaluation section: the reported TTFT SLO gains (up to 74.7%) are presented as aggregate outcomes of RotaSched + DuplexKV, yet the manuscript provides no direct instrumentation or ablation of NVLink-C2C transfer latency, bandwidth utilization, or queuing delays during rotations at peak request rates. This measurement gap is load-bearing for the hardware co-design claim, because the skeptic concern—that unmeasured transfer overheads could re-introduce HOL blocking—cannot be ruled out from the existing TTFT/TBT/throughput numbers alone.

minor comments (1)

The abstract and evaluation description refer to “various models and datasets” without enumerating them or reporting per-model variance; adding a table or explicit list would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our evaluation methodology. The concern about direct instrumentation of NVLink-C2C transfers is well-taken, and we address it point-by-point below while committing to targeted revisions.


Referee: Evaluation section: the reported TTFT SLO gains (up to 74.7%) are presented as aggregate outcomes of RotaSched + DuplexKV, yet the manuscript provides no direct instrumentation or ablation of NVLink-C2C transfer latency, bandwidth utilization, or queuing delays during rotations at peak request rates. This measurement gap is load-bearing for the hardware co-design claim, because the skeptic concern—that unmeasured transfer overheads could re-introduce HOL blocking—cannot be ruled out from the existing TTFT/TBT/throughput numbers alone.

Authors: We agree that isolating the NVLink-C2C transfer characteristics would strengthen the hardware co-design argument. Although the end-to-end TTFT improvements under high load already indicate that DuplexKV rotations do not reintroduce HOL blocking (as TBT and throughput remain comparable to baselines), we will add direct measurements in the revised evaluation section. Specifically, we will instrument and report: (1) per-rotation NVLink-C2C latency and achieved bandwidth at peak request rates, (2) queuing delays observed during full-duplex transfers, and (3) an ablation that disables DuplexKV optimizations while keeping RotaSched fixed. These additions will allow readers to directly assess whether transfer overheads remain negligible relative to the observed SLO gains.

revision: yes
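The measurements discussed in this exchange (per-rotation latency, achieved bandwidth, queuing delay) could be derived from a simple transfer log. The sketch below shows one way to summarize such a log; the `TransferRecord` type and its fields are hypothetical and do not describe SuperInfer's actual instrumentation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TransferRecord:
    """Hypothetical log entry for one KV-cache rotation transfer."""
    enqueue_s: float   # when the transfer was requested
    start_s: float     # when the copy actually began
    end_s: float       # when the copy finished
    num_bytes: int


def summarize(records: list[TransferRecord]) -> dict[str, float]:
    """Mean queuing delay, mean copy latency, and aggregate achieved bandwidth.

    Expects a non-empty list of records with positive copy durations.
    """
    total_bytes = sum(r.num_bytes for r in records)
    busy_s = sum(r.end_s - r.start_s for r in records)
    return {
        "mean_queue_delay_s": mean(r.start_s - r.enqueue_s for r in records),
        "mean_copy_latency_s": mean(r.end_s - r.start_s for r in records),
        "achieved_gb_per_s": total_bytes / busy_s / 1e9,
    }


# Toy records with round numbers, chosen only to make the arithmetic easy to follow.
records = [
    TransferRecord(enqueue_s=0.0, start_s=0.25, end_s=0.5, num_bytes=2_000_000_000),
    TransferRecord(enqueue_s=1.0, start_s=1.0, end_s=1.25, num_bytes=3_000_000_000),
]
summary = summarize(records)
print(summary)
```

Reporting queue delay separately from copy latency is what would distinguish a saturated link from a scheduler that issues transfers too late.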

Circularity Check

0 steps flagged

No circularity: empirical system evaluation with no derivation chain


The paper describes a systems artifact (RotaSched scheduler and DuplexKV engine) for GH200 Superchips and reports measured improvements in TTFT SLO attainment (up to 74.7%) from hardware experiments. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness theorems appear in the provided text. All central claims rest on direct empirical benchmarks rather than any reduction to inputs by construction, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Engineering systems paper with no mathematical free parameters, axioms, or invented entities; relies on hardware properties and empirical tuning.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GoodServe: Towards High-Goodput Serving of Agentic LLM Inferences over Heterogeneous Resources
cs.DC · 2026-05 · unverdicted · novelty 4.0

GoodServe proposes a predict-and-rectify routing system for agentic LLM inferences on heterogeneous GPUs that improves goodput by up to 27.4%.


[1] [1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

Ao, R., Luo, G., Simchi-Levi, D., and Wang, X. Optimiz- ing llm inference: Fluid-guided online scheduling with memory constraints.arXiv preprint arXiv:2504.11320,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Qwen Technical Report

Bai, J., Bai, S., Chu, Y ., Cui, Z., Dang, K., Deng, X., Fan, Y ., Ge, W., Han, Y ., Huang, F., et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Efficient llm serving on hybrid real-time and best-effort requests.arXiv preprint arXiv:2504.09590,

Borui, W., Juntao, Z., Chenyu, J., Chuanxiong, G., and Chuan, W. Efficient llm serving on hybrid real-time and best-effort requests.arXiv preprint arXiv:2504.09590,

work page arXiv

[5] [5]

Tokenflow: Responsive llm text streaming serving under request burst via preemptive scheduling.arXiv preprint arXiv:2510.02758, 2025a

Chen, J., Du, C., Liu, R., Yao, S., Yan, D., Liao, J., Liu, S., Wu, F., and Chen, G. Tokenflow: Responsive llm text streaming serving under request burst via preemptive scheduling.arXiv preprint arXiv:2510.02758, 2025a. Chen, W., He, S., Qu, H., Zhang, R., Yang, S., Chen, P., Zheng, Y ., Huai, B., and Chen, G. {IMPRESS}: An {Importance-Informed}{Multi-Tie...

work page arXiv

[6] [6]

Understanding data move- ment in tightly coupled heterogeneous systems: A case study with the grace hopper superchip.arXiv preprint arXiv:2408.11556,

Fusco, L., Khalilov, M., Chrapek, M., Chukkapalli, G., Schulthess, T., and Hoefler, T. Understanding data move- ment in tightly coupled heterogeneous systems: A case study with the grace hopper superchip.arXiv preprint arXiv:2408.11556,

work page arXiv

[7] [7]

and Zhai, J

He, J. and Zhai, J. Fastdecode: High-throughput gpu- efficient llm serving using heterogeneous pipelines.arXiv preprint arXiv:2403.11421,

work page arXiv

[8] [8]

Memserve: Con- text caching for disaggregated llm serving with elastic memory pool.arXiv preprint arXiv:2406.17565,

Hu, C., Huang, H., Hu, J., Xu, J., Chen, X., Xie, T., Wang, C., Wang, S., Bao, Y ., Sun, N., et al. Memserve: Con- text caching for disaggregated llm serving with elastic memory pool.arXiv preprint arXiv:2406.17565,

work page arXiv

[9] [9]

Slo-aware scheduling for large language model inferences.arXiv preprint arXiv:2504.14966,

Huang, J., Xiong, Y ., Yu, X., Huang, W., Li, E., Zeng, L., and Chen, X. Slo-aware scheduling for large language model inferences.arXiv preprint arXiv:2504.14966,

work page arXiv

[10] [10]

Mixtral of Experts

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024a. Jiang, C., Gao, L., Zarch, H. E., and Annavaram, M. Kvpr:...

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Adaserve: Accelerating multi-slo llm serving with slo-customized speculative decoding.arXiv preprint arXiv:2501.12162,

Li, Z., Chen, Z., Delacourt, R., Oliaro, G., Wang, Z., Chen, Q., Lin, S., Yang, A., Zhang, Z., Chen, Z., et al. Adaserve: Accelerating multi-slo llm serving with slo-customized speculative decoding.arXiv preprint arXiv:2501.12162,

work page arXiv

[12] [12]

Superof- fload: Unleashing the power of large-scale llm training on superchips.arXiv preprint arXiv:2509.21271,

Lian, X., Tanaka, M., Ruwase, O., and Zhang, M. Superof- fload: Unleashing the power of large-scale llm training on superchips.arXiv preprint arXiv:2509.21271,

work page arXiv

[13] [13]

DeepSeek-V3 Technical Report

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek- v3 technical report.arXiv preprint arXiv:2412.19437, 2024a. Liu, Y ., Li, H., Cheng, Y ., Ray, S., Huang, Y ., Zhang, Q., Du, K., Yao, J., Lu, S., Ananthanarayanan, G., et al. Cachegen: Kv cache compression and streaming for fast large language ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Memory offload- ing for large language model inference with latency slo guarantees.arXiv preprint arXiv:2502.08182,

Ma, C., Ye, Z., Zhao, H., Yang, Z., Fu, T., Han, J., Zhang, J., Luo, Y ., Wang, X., Wang, Z., et al. Memory offload- ing for large language model inference with latency slo guarantees.arXiv preprint arXiv:2502.08182,

[15] URL https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper?ncid=no-ncid. [Online; accessed 2025-10-26].

Patke, A., Reddy, D., Jha, S., Qiu, H., Pinto, C., Narayanaswami, C., Kalbarczyk, Z., and Iyer, R. Queue management for slo-oriented large language model serving. In Proceedings of the 2024 ACM Symposium on Cloud Computing, pp. 18–35.

[16] Qiu, H., Mao, W., Patke, A., Cui, S., Jha, S., Wang, C., Franke, H., Kalbarczyk, Z. T., Başar, T., and Iyer, R. K. Efficient interactive llm serving with proxy model-based sequence length prediction. arXiv preprint arXiv:2404.08509, 2024.

[17] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[18] Vellaisamy, P., Labonte, T., Chakraborty, S., Turner, M., Sury, S., and Shen, J. P. Characterizing and optimizing llm inference workloads on cpu-gpu coupled architectures. arXiv preprint arXiv:2504.11750.

[19] URL https://docs.vllm.ai/en/latest/usage/v1_guide.html. [Online; accessed 2025-10-27].

Wei, Z., Yen, J., Chen, J., Zhang, Z., Huang, Z., Chen, C., Yu, X., Gu, Y., Wu, C., Wang, Y., et al. Equinox: Holistic fair scheduling in serving large language models. arXiv preprint arXiv:2508.16646.

[20] Wu, B., Zhong, Y., Zhang, Z., Liu, S., Liu, F., Sun, Y., Huang, G., Liu, X., and Jin, X. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920.

[21] Xu, Y., Mao, Z., Mo, X., Liu, S., and Stoica, I. Pie: Pooling cpu memory for llm inference. arXiv preprint arXiv:2411.09317.

[22] Yang, A., Yu, B., Li, C., Liu, D., Huang, F., Huang, H., Jiang, J., Tu, J., Zhang, J., Zhou, J., et al. Qwen2.5-1m technical report. arXiv preprint arXiv:2501.15383.

[23] Yu, J., Taneja, A., Lin, J., and Zhang, M. Voltanallm: Feedback-driven frequency control and state-space routing for energy-efficient llm serving. arXiv preprint arXiv:2509.04827.

[24] Zhang, W., Wu, Z., Mu, Y., Liu, B., Lee, M., and Lai, F. Tempo: Application-aware llm serving with mixed slo requirements. arXiv preprint arXiv:2504.20068.

[25] Zhao, X., Jia, B., Zhou, H., Liu, Z., Cheng, S., and You, Y. Hetegen: Heterogeneous parallel inference for large language models on resource-constrained devices. arXiv preprint arXiv:2403.01164.

[26] Zheng, L., Chiang, W.-L., Sheng, Y., Li, T., Zhuang, S., Wu, Z., Zhuang, Y., Li, Z., Lin, Z., Xing, E. P., et al. Lmsys-chat-1m: A large-scale real-world llm conversation dataset. arXiv preprint arXiv:2309.11998.

[27] model and ShareGPT dataset (ShareGPT Team, 2023). We compare the First-Come-First-Serve (FCFS) and Shortest-Job-First with oracle generation length information (SJF-Oracle) policies. As shown in Fig. 23, both FCFS and SJF-Oracle fail to prevent TTFT SLO violations under memory pressure. Once KV cache storage is exhausted, the length of the waiting queue spikes...

[28] Comparing vLLM with a variant that stores the KV cache in GH200’s Unified Memory (UM). vLLM on UM shows significant TBT degradation. 2024). This allows the Hopper GPU to directly access the Grace CPU’s DRAM without incurring any page faults. GH200 does support page migration, but instead of being page-fault driven, it uses hardware access counters to track the a...
