C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG

Ali Zafar Sadiq; Haiying Shen; Mingye Zhang; Rui Yang; Shutian Luo; Wei Wang; Yue Cheng


Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.


T0 review · grok-4.3

C2CServe uses NVLink-C2C to stream LLM weights from CPU memory to MIG instances, cutting cold-start latency by up to 7.1x on GH200 while holding over 95% TTFT and TPOT attainment under C2C contention.

2026-05-20 02:07 UTC pith:NNWRY5JN

Load-bearing objection: C2CServe uses C2C to stream weights into MIG slices for serverless LLM serving and reports big cold-start gains, but the abstract leaves experimental details thin enough that the contention claims need checking.

arxiv 2605.19481 v1 pith:NNWRY5JN submitted 2026-05-19 cs.OS


Shutian Luo, Ali Zafar Sadiq, Rui Yang, Mingye Zhang, Haiying Shen, Wei Wang, Yue Cheng

classification cs.OS

keywords serverless LLM serving, MIG, NVLink-C2C, cold-start latency, HybridGEMM, GH200, GPU sharing, elastic serving

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that high-bandwidth CPU-GPU interconnects such as NVLink-C2C remove the HBM size barrier that prevents MIG slices from hosting modern LLM weights. Weights can therefore live in plentiful host memory and stream on demand, letting MIG instances switch models at request granularity instead of paying full reload costs on every cold start. C2CServe implements this shift with HybridGEMM, a kernel that tunes data movement between HBM and C2C with one knob, and a hierarchical scheduler that coordinates placement and chunking while reacting to measured contention. If the approach holds, serverless LLM platforms can avoid both the memory waste of dedicated GPUs and the long initialization delays of time-shared GPUs on the same hardware.

Core claim

By keeping LLM weights in CPU memory and streaming them over NVLink-C2C only when needed, C2CServe lets MIG instances change models between requests without reloading entire weight sets into limited HBM. HybridGEMM adapts its GEMM execution pattern to the mixed memory hierarchy using a single tuning parameter to keep bandwidth balanced across contending partitions. A hierarchical scheduler then aligns model placement, input chunk sizes, and kernel choice with runtime feedback to limit C2C interference. On GH200 hardware this combination delivers up to 7.1x lower cold-start latency for dense models and 4.6x for MoE models versus prior serverless systems, while preserving more than 95% TTFT and TPOT attainment under C2C contention.

What carries the argument

HybridGEMM, a heterogeneous-memory-aware GEMM kernel that adapts access patterns to balance HBM and C2C bandwidth across MIG partitions via a single tuning knob, together with the hierarchical scheduler that coordinates placement, chunking, and kernel selection under online contention feedback.
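To make the single-knob idea concrete, here is a toy bandwidth-balance model. It is not the paper's kernel. It assumes the knob, called `alpha` here, is the fraction of operand bytes read from HBM, with the remainder streamed over C2C, and that the two paths overlap perfectly. The function names and all bandwidth values are illustrative assumptions.

```python
def transfer_time(total_bytes: float, alpha: float, hbm_bw: float, c2c_bw: float) -> float:
    """Time to move total_bytes when a fraction alpha comes from HBM and the
    rest comes over C2C. Assumes (hypothetically) that both paths run fully in
    parallel, so the slower path determines the total time."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    hbm_time = alpha * total_bytes / hbm_bw
    c2c_time = (1.0 - alpha) * total_bytes / c2c_bw
    return max(hbm_time, c2c_time)


def balanced_alpha(hbm_bw: float, c2c_bw: float) -> float:
    """Value of alpha at which both paths finish together in this toy model.
    Setting alpha / hbm_bw == (1 - alpha) / c2c_bw and solving for alpha."""
    return hbm_bw / (hbm_bw + c2c_bw)


if __name__ == "__main__":
    # Illustrative units only: HBM three times faster than the C2C share.
    alpha = balanced_alpha(hbm_bw=3.0, c2c_bw=1.0)
    print(alpha, transfer_time(4.0, alpha, 3.0, 1.0))
```

In this model the balanced setting depends on the C2C bandwidth an instance actually receives, which is why a value tuned without contention may be wrong once other partitions start streaming.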

Load-bearing premise

C2C bandwidth stays sufficient and predictable when several MIG partitions contend for the link, and the single tuning knob plus hierarchical scheduler can keep performance stable without later manual fixes that would erase the reported gains.
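The premise can be illustrated with a back-of-the-envelope estimate. The sketch below assumes the link is divided equally among the instances that are streaming at the same moment, which is a simplification and not something the paper states. The function names and example numbers are hypothetical.

```python
def per_instance_bandwidth(link_bw_gbs: float, active_instances: int) -> float:
    """Bandwidth each streaming MIG instance receives, assuming (hypothetically)
    an equal split of the shared link among active instances."""
    if active_instances < 1:
        raise ValueError("need at least one active instance")
    return link_bw_gbs / active_instances


def weight_pass_time_s(weight_gb: float, link_bw_gbs: float, active_instances: int) -> float:
    """Seconds for one instance to stream its weights once over the shared link."""
    return weight_gb / per_instance_bandwidth(link_bw_gbs, active_instances)


if __name__ == "__main__":
    # Illustrative values: 16 GB of weights and a 400 GB/s usable link.
    for n in (1, 2, 4):
        print(n, weight_pass_time_s(16.0, 400.0, n))
```

Under equal sharing, the time for one pass over the weights grows linearly with the number of concurrent streamers, so the premise fails quickly if the scheduler lets too many partitions stream at once.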

What would settle it

Measure cold-start latency and TTFT/TPOT attainment while running many concurrent MIG instances at peak C2C load; if latency gains disappear or attainment falls below 95% without extra tuning, the central claim does not hold.
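A minimal sketch of that check follows. It assumes attainment means the fraction of requests whose TTFT (or TPOT) meets a fixed latency target. The review does not give the paper's exact definition or SLO values, so the record fields, targets, and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class RequestSample:
    """Latency measurements for one served request (field names are hypothetical)."""
    ttft_ms: float  # time to first token
    tpot_ms: float  # mean time per output token


def attainment(samples: Sequence[RequestSample],
               ttft_slo_ms: float,
               tpot_slo_ms: float) -> Tuple[float, float]:
    """Return (TTFT attainment, TPOT attainment) as fractions of requests
    that meet each latency target."""
    if not samples:
        raise ValueError("need at least one request sample")
    n = len(samples)
    ttft_ok = sum(1 for s in samples if s.ttft_ms <= ttft_slo_ms)
    tpot_ok = sum(1 for s in samples if s.tpot_ms <= tpot_slo_ms)
    return ttft_ok / n, tpot_ok / n


def claim_survives(samples: Sequence[RequestSample],
                   ttft_slo_ms: float,
                   tpot_slo_ms: float,
                   threshold: float = 0.95) -> bool:
    """True if both attainment figures stay at or above the threshold."""
    ttft_att, tpot_att = attainment(samples, ttft_slo_ms, tpot_slo_ms)
    return ttft_att >= threshold and tpot_att >= threshold
```

Running `claim_survives` on samples collected at peak C2C load, with no retuning between runs, is one way to apply the falsifier described above.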


If this is right

MIG instances can switch models at per-request granularity without full HBM weight reloads.
Cold-start latency falls by up to 7.1x for dense models and 4.6x for MoE models versus prior serverless baselines.
Over 95% TTFT and TPOT attainment is preserved even when multiple partitions share the C2C link.
Elastic serverless serving becomes practical on GH200 without dedicating whole GPUs or accepting long initialization times.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same streaming-plus-tuning pattern could be tested on future platforms that offer comparable CPU-GPU bandwidth.
Cloud operators might reduce GPU over-provisioning for variable LLM traffic by adopting MIG-plus-C2C placement.
Higher-contention workloads could expose whether the single-knob control remains sufficient or needs additional knobs.
Integration points with existing serverless runtimes would let the technique apply to wider model catalogs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

C2CServe uses C2C to stream weights into MIG slices for serverless LLM serving and reports big cold-start gains, but the abstract leaves experimental details thin enough that the contention claims need checking.


The main point is that C2CServe treats NVLink-C2C as a usable memory tier so MIG instances can pull LLM weights from host memory on demand instead of needing full models in their small HBM slices. This lets the system switch models at request granularity while keeping MIG isolation and accounting intact. The design adds HybridGEMM, a GEMM kernel that balances HBM and C2C accesses with one tuning knob, plus a hierarchical scheduler that coordinates placement, chunking, and kernel choice using online feedback to limit contention on the shared link. On GH200 the paper claims up to 7.1x lower cold-start latency for dense models and 4.6x for MoE models versus prior serverless systems, while holding TTFT and TPOT attainment above 95 percent even under C2C load.

That combination of C2C streaming, the heterogeneous kernel, and the scheduler is not in the earlier MIG or serverless literature the abstract cites, so the core idea is new. The work is useful because it directly attacks the HBM scarcity that makes MIG unattractive for large-model catalogs with bursty traffic. The framing of the tradeoff between dedicated GPUs and time-sharing is clear and practical. The reported speedups would matter for cloud operators if they hold under realistic multi-tenant conditions.

The soft spot is that the abstract states the attainment numbers and speedups without describing the baselines, workload traces, number of concurrent MIGs, or any bandwidth measurements under contention. The stress-test worry about C2C bandwidth becoming unpredictable when several partitions share the link therefore lands on the current evidence; if the full paper has saturation traces or ablation results showing the single knob and scheduler keep variance low without post-hoc fixes, that would close the gap. Otherwise the 95 percent claim rests on an assumption that may not generalize.

The paper shows straightforward systems thinking and cites the relevant prior work on MIG and serverless serving without obvious gaps. It is aimed at people building or tuning GPU sharing layers for inference. A reader working on elastic serving or new interconnects would pick up concrete design choices worth trying. The work is solid enough on the problem and the proposed mechanisms to deserve a serious referee, even if the experiments need more detail and scrutiny on the hardware assumptions.

Referee Report

2 major / 2 minor

Summary. The paper introduces C2CServe, a request-granularity serverless LLM serving system for MIG on GH200/GB200 that streams model weights over NVLink-C2C from CPU memory instead of requiring full HBM residency. It proposes HybridGEMM (a heterogeneous-memory GEMM kernel controlled by one tuning knob) and a hierarchical scheduler with online feedback to coordinate placement, chunking, and kernel selection under shared-C2C contention. Central empirical claims are up to 7.1× cold-start latency reduction for dense models and 4.6× for MoE models versus prior serverless systems, while sustaining >95% TTFT and TPOT attainment.

Significance. If the contention-handling results hold, the work shows how high-bandwidth CPU-GPU links can relax HBM constraints and enable more elastic multi-tenant LLM serving. The single-knob HybridGEMM plus feedback scheduler is a pragmatic design point; reproducible speedups on real GH200 hardware would be a useful data point for systems that must balance isolation, cold-start cost, and interconnect sharing.

major comments (2)

[§5] §5 (Evaluation, attainment results): the claim of >95% TTFT/TPOT under C2C contention is load-bearing for the 7.1×/4.6× latency gains, yet the section provides no worst-case bandwidth saturation traces, no explicit count of concurrent MIG partitions, and no saturation-threshold measurements. Without these, it is impossible to confirm that the hierarchical scheduler's online feedback keeps performance stable without post-hoc knob adjustments.
[§3.2] §3.2 (HybridGEMM): the single tuning knob is presented as sufficient to balance HBM and C2C access across partitions, but the design section contains no sensitivity analysis or ablation showing how GEMM performance and attainment degrade when C2C bandwidth varies under realistic multi-MIG contention. This directly affects whether the reported gains remain valid without manual retuning.
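The kind of sensitivity analysis requested in the second comment can be sketched with a toy model. The code below is not the paper's methodology. It assumes a hypothetical knob `alpha` that sets the fraction of bytes served from HBM, with perfectly overlapped HBM and C2C transfers, and it compares a knob fixed at its uncontended optimum against one retuned for each contention level.

```python
from typing import Iterable, List, Tuple


def step_time(total_bytes: float, alpha: float, hbm_bw: float, c2c_bw: float) -> float:
    """Toy cost model: HBM and C2C transfers overlap, the slower path dominates."""
    return max(alpha * total_bytes / hbm_bw, (1.0 - alpha) * total_bytes / c2c_bw)


def sensitivity_sweep(hbm_bw: float,
                      nominal_c2c_bw: float,
                      fractions: Iterable[float],
                      total_bytes: float = 1.0) -> List[Tuple[float, float]]:
    """For each fraction of nominal C2C bandwidth still available, return
    (fraction, slowdown of a static knob relative to a retuned knob)."""
    static_alpha = hbm_bw / (hbm_bw + nominal_c2c_bw)  # tuned once, without contention
    rows = []
    for fraction in fractions:
        c2c_bw = nominal_c2c_bw * fraction
        tuned_alpha = hbm_bw / (hbm_bw + c2c_bw)  # retuned for this contention level
        static_t = step_time(total_bytes, static_alpha, hbm_bw, c2c_bw)
        tuned_t = step_time(total_bytes, tuned_alpha, hbm_bw, c2c_bw)
        rows.append((fraction, static_t / tuned_t))
    return rows


if __name__ == "__main__":
    # Illustrative units: HBM three times faster than uncontended C2C.
    for fraction, slowdown in sensitivity_sweep(3.0, 1.0, [1.0, 0.75, 0.5, 0.25]):
        print(f"C2C at {fraction:.0%} of nominal: static knob is {slowdown:.2f}x slower")
```

A measured version of this table, taken on real hardware with the actual kernel, is what would show whether the reported gains survive without manual retuning.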

minor comments (2)

[Abstract] Abstract and §5: quantitative claims (speedups and attainment percentages) should briefly note the number of MIGs, workload traces, and whether error bars or multiple runs are reported, even at high level.
[Related Work] Related-work section: explicitly list and cite the exact state-of-the-art serverless baselines used in the comparison (including their MIG or time-sharing configurations).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation and design sections. We address each major comment below and will incorporate revisions to provide additional evidence on contention handling and design robustness.


Referee: [§5] §5 (Evaluation, attainment results): the claim of >95% TTFT/TPOT under C2C contention is load-bearing for the 7.1×/4.6× latency gains, yet the section provides no worst-case bandwidth saturation traces, no explicit count of concurrent MIG partitions, and no saturation-threshold measurements. Without these, it is impossible to confirm that the hierarchical scheduler's online feedback keeps performance stable without post-hoc knob adjustments.

Authors: We agree that more granular data on contention scenarios would strengthen the presentation of the >95% attainment results. The current evaluation reports aggregate TTFT/TPOT attainment under shared-C2C load, but the manuscript does not include the requested worst-case traces or explicit saturation thresholds. In the revised version we will add bandwidth saturation traces, state the exact number of concurrent MIG partitions used in each experiment, and report saturation-threshold measurements. These additions will show that the hierarchical scheduler's online feedback loop maintains the reported attainment levels without requiring post-hoc knob adjustments. revision: yes
Referee: [§3.2] §3.2 (HybridGEMM): the single tuning knob is presented as sufficient to balance HBM and C2C access across partitions, but the design section contains no sensitivity analysis or ablation showing how GEMM performance and attainment degrade when C2C bandwidth varies under realistic multi-MIG contention. This directly affects whether the reported gains remain valid without manual retuning.

Authors: The single tuning knob in HybridGEMM is intended to allow runtime adaptation to available C2C bandwidth via scheduler feedback. The current design section focuses on the kernel's heterogeneous-memory access patterns and overall system integration rather than exhaustive sensitivity data. We acknowledge that an explicit ablation under varying contention would better demonstrate robustness. In the revised §3.2 we will add a sensitivity analysis and ablation that quantifies GEMM performance and end-to-end attainment as C2C bandwidth is reduced under multi-MIG contention, confirming that the reported speedups hold without manual retuning. revision: yes
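The runtime adaptation the authors describe can be pictured as a small feedback loop. The sketch below is a guess at the general shape and not the paper's scheduler. It assumes the scheduler receives periodic measurements of the C2C bandwidth an instance is getting, smooths them with an exponentially weighted moving average, and sets a hypothetical knob `alpha` (the fraction of bytes served from HBM) so that both memory paths finish together.

```python
class KnobController:
    """Hypothetical online controller for a single HBM/C2C balance knob."""

    def __init__(self, hbm_bw: float, initial_c2c_bw: float, smoothing: float = 0.5):
        if not 0.0 < smoothing <= 1.0:
            raise ValueError("smoothing must lie in (0, 1]")
        self.hbm_bw = hbm_bw
        self.c2c_estimate = initial_c2c_bw  # smoothed view of available C2C bandwidth
        self.smoothing = smoothing

    def alpha(self) -> float:
        """Knob value that balances both paths under the current estimate."""
        return self.hbm_bw / (self.hbm_bw + self.c2c_estimate)

    def observe(self, measured_c2c_bw: float) -> float:
        """Fold in a new bandwidth measurement and return the updated knob."""
        self.c2c_estimate = (self.smoothing * measured_c2c_bw
                             + (1.0 - self.smoothing) * self.c2c_estimate)
        return self.alpha()


if __name__ == "__main__":
    controller = KnobController(hbm_bw=3.0, initial_c2c_bw=1.0)
    # Illustrative measurements: a co-resident tenant halves the available C2C share.
    for measurement in (1.0, 0.5, 0.5, 0.5):
        print(round(controller.observe(measurement), 3))
```

The smoothing factor trades reaction speed against stability. How quickly the real scheduler must react to keep attainment above 95% is exactly what the promised ablation would need to show.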

Circularity Check

0 steps flagged

No significant circularity; empirical system evaluation is self-contained


The paper introduces C2CServe as a systems artifact with HybridGEMM (single tuning knob) and a hierarchical scheduler using online feedback. Central results are direct latency and attainment measurements on GH200 hardware against external baselines. No equations, parameter fits renamed as predictions, self-citation load-bearing uniqueness theorems, or ansatz smuggling appear in the derivation. The evaluation chain relies on hardware measurements rather than internal reductions, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 2 invented entities

The system rests on the new HybridGEMM kernel and hierarchical scheduler whose behavior under real contention is not independently verified outside this work; the single tuning knob is a free parameter whose value selection is not detailed.

free parameters (1)

HybridGEMM tuning knob
Single knob that balances HBM versus C2C data access patterns; its setting is chosen to achieve the reported performance.

invented entities (2)

HybridGEMM (no independent evidence)
purpose: heterogeneous-memory-aware GEMM kernel that adapts access patterns across HBM and C2C
New kernel introduced to handle mixed memory bandwidth; no independent evidence supplied.
hierarchical scheduler (no independent evidence)
purpose: coordinates model placement, input chunking, and kernel selection with online feedback to mitigate C2C contention
New scheduler component; no independent evidence supplied.

pith-pipeline@v0.9.0 · 5856 in / 1355 out tokens · 40334 ms · 2026-05-20T02:07:02.178680+00:00 · methodology


Original abstract

Modern LLM serving is increasingly serverless in shape: large model catalogs, long-tail invocations, and multi-tenant demand. Existing GPU serving systems face a tradeoff: dedicated-GPU allocation wastes scarce HBM under sparse traffic, while GPU time sharing places model initialization and weight loading on the cold-start path. Spatial GPU sharing such as multi-instance GPU (MIG) provides isolation and accounting, but each slice has too little HBM for modern LLM weights. We observe that high-bandwidth CPU-GPU interconnects, such as NVLink-C2C (C2C) in NVIDIA GH200 and GB200 Superchips, change the memory constraint: model weights can reside in CPU memory and be streamed on demand to MIG instances, shifting model residency from scarce HBM to abundant host memory. Leveraging this capability, we present C2CServe, a request-granularity serverless LLM serving system that allows MIG instances to switch models across requests without reloading weights into HBM. C2CServe introduces HybridGEMM, a heterogeneous-memory-aware GEMM kernel that adapts data access patterns to balance HBM and C2C bandwidth across MIG partitions using a single tuning knob. To mitigate shared-C2C contention, C2CServe further uses a hierarchical scheduler that coordinates model placement, input chunking, and kernel selection with online feedback control. On GH200, C2CServe reduces cold-start latency by up to 7.1x for dense models and 4.6x for MoE models compared with state-of-the-art serverless LLM serving systems, while maintaining over 95% TTFT and TPOT attainment under C2C contention.

Figures

Figures reproduced from arXiv: 2605.19481 by Ali Zafar Sadiq, Haiying Shen, Mingye Zhang, Rui Yang, Shutian Luo, Wei Wang, Yue Cheng.

**Figure 1.** Multi-model serving approaches. Excerpt: … already catalog over a million models [8], and production traces [1, 33] from large-scale inference platforms show a pronounced long tail (detailed in § 2.1): a small fraction of models receives most requests, while the remaining models must still remain responsive to unpredictable invocations [33]. This long-tail workload closely matches the serverless …

**Figure 2.** LLM workload fluctuation in an Alibaba production cluster. Left: Hourly request rates of representative models. Right: Per-model active-time distribution across 59 active models.

**Figure 3.** Comparison of data access patterns across different GEMM tiling strategies. Excerpt (§ 3.1, Opportunity of Combining MIG and C2C): Serverless LLM serving requires high elasticity: low cold-start latency and fine-grained resource allocation. However, both are difficult when model weights must remain resident in scarce HBM. High-bandwidth CPU–GPU interconnects such as C2C change this tradeoff. MI…

**Figure 5.** Shape-dependent performance and bandwidth utilization on asymGEMM. Excerpt: … shared, narrowing the effective HBM-over-C2C bandwidth advantage.

**Figure 6.** Interference on shared C2C bandwidth. Excerpt: … more activation rows reuse the same parameter tiles and better amortize CPU-memory fetches. Thus, 𝑁 primarily increases interconnect pressure, whereas 𝑀 improves compute efficiency through higher parameter reuse. These results reveal a shape-dependent bottleneck shift: small shapes underutilize the GPU, large 𝑁 makes execution C2C-bound, and large 𝑀 makes direct CPU…

**Figure 7.** System architecture of C2CServe. Excerpt (§ 4, Overview of C2CServe Architecture): C2CServe is a Superchip-native serverless LLM serving system, with the overall architecture shown in …

**Figure 8.** Impact of 𝑀-dimension tile size in asymmetric GEMM. Excerpt: The optimal 𝛼 is runtime-dependent. Beyond workload shape and MIG partitioning, it must account for live C2C contention from co-resident tenants. As multiple MIG instances stream CPU-resident weights concurrently, each instance’s effective C2C bandwidth changes over time, making a static 𝛼 fragile. C2CServe therefore treats 𝛼 as a runtime tuning knob; it…

**Figure 9.** TTFT and TPOT comparison across baselines. Excerpt (§ 9.2.1, Full-GPU Serving Performance): We evaluate serving performance on the full GPU, as shown in …

**Figure 13.** Baseline integrated with HybridGEMM. Excerpt (§ 9.3, Dynamic Workload): We replay a production-derived dynamic workload using open-source models, as shown in …

**Figure 11.** Model-switch Overhead. Panels: (a) Workload pattern, Request Rate (Req/s) over Time (minutes); (b) Dense Models, TTFT (ms) over Time (minutes); (c) MoE Models, TTFT (ms) over Time (minutes). Legend: Llama-3B, Llama-8B, Mixtral-8x7B, Qwen3-30B-A3B; C2CServe, SLLM, Aegaeon, MoE-Inf, FineMoE.

**Figure 12.** MoE and Dense Models trace replay. Excerpt: … and Aegaeon run out of memory. Compared with Aegaeon, C2CServe improves latency by up to 7.1×. For MoE models, C2CServe reduces cold-start latency over MoE-Infinity and FineMoE by 4.6–5.0×, and outperforms ServerlessLLM by 1.95× on Qwen3-30B-A3B. Overall, C2CServe avoids HBM-capacity failures while maintaining low cold-start latency across dense and MoE workloads. § 9.2.3 …

**Figure 14.** Component-level comparison. Excerpt: … and C2C bandwidth budgets fit its runtime demand, reducing p99 TTFT to 0.64 s, a 1.94× improvement. This benefit appears even when the chunk controller and HybridGEMM are already active, showing that bandwidth-aware placement is essential for controlling tail latency under multi-tenant MIG execution. § 9.4.3 Chunk-size Control. We evaluate the effectiveness of the chunk controll…


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 6 internal anchors

[1] GenAI in Alibaba Cloud. https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2026-GenAI

[2] mini-sglang. https://github.com/sgl-project/mini-sglang

[3] NVIDIA CUDA Toolkit. https://developer.nvidia.com/cuda/toolkit

[4] PyTorch. https://pytorch.org/

[5] Time-slicing GPUs. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html

[6] NVIDIA pinned memory. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#page-locked-host-memory, 2022

[7] NVIDIA zero copy memory. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#zero-copy-memory, 2022

[8] Huggingface dataset. https://huggingface.co/datasets, 2023

[9] ShareGPT. https://sharegpt.com/, 2023

[10] CUDA memory management. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html, 2025

[11] NVIDIA CUTLASS. https://github.com/NVIDIA/cutlass, 2025

[12] NVIDIA GB200. https://www.nvidia.com/en-us/data-center/dgx-gb200/, 2025

[13] NVIDIA GH200. https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/, 2025

[14] cuBLAS: Basic linear algebra on NVIDIA GPUs. https://developer.nvidia.com/cublas, 2026

[15] NVIDIA Vera Rubin platform. https://www.nvidia.com/en-us/data-center/technologies/rubin/, 2026
[16] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[17] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In Proceedings of OSDI, 2024.

[18] Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh. Serving heterogeneous machine learning models on multi-GPU servers with spatio-temporal sharing. In Proceedings of USENIX ATC, 2022.

[19] Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. MuxServe: Flexible spatial-temporal multiplexing for multiple LLM serving. 2024.

[20] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv e-prints, 2024.

[21] Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. ServerlessLLM: Low-latency serverless inference for large language models. In Proceedings of OSDI, 2024.

[22] Multi Instance GPU. https://www.nvidia.com/en-us/technologies/multi-instance-gpu/, 2022

[23] Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. Mélange: Cost efficient large language model serving by exploiting GPU heterogeneity. arXiv preprint arXiv:2404.14527, 2024.

[24] Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In Proceedings of OSDI, 2022.

[25] Yongjun He, Haofeng Yang, Yao Lu, Ana Klimovic, and Gustavo Alonso. Resource multiplexing in tuning and serving large language models. In Proceedings of ATC, 2025.

[26] Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, et al. DEEPSERVE: Serverless large language model serving at scale. In Proceedings of USENIX ATC, 2025.

[27] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

[28] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[29] Jie Li, Laiping Zhao, Yanan Yang, Kunlin Zhan, and Keqiu Li. Tetris: Memory-efficient serverless inference through tensor sharing. In Proceedings of USENIX ATC, 2022.

[30] Ruihao Li, Shagnik Pal, Vineeth Narayan Pullu, Prasoon Sinha, Jeeho Ryoo, Lizy K John, and Neeraja J Yadwadkar. Oneiros: KV cache optimization through parameter remapping for multi-tenant LLM serving. In Proceedings of the 2025 ACM Symposium on Cloud Computing, pages 88–101, 2025.

[31] Xinyu Lian, Masahiro Tanaka, Olatunji Ruwase, and Minjia Zhang. SuperOffload: Unleashing the power of large-scale LLM training on superchips. In Proceedings of ASPLOS, 2026.
[32]

Flexpipe: Adapting dynamic llm serving through inflight pipeline refactoring in fragmented serverless clusters

Yanying Lin, Shijie Peng, Chengzhi Lu, Chengzhong Xu, and Kejiang Ye. Flexpipe: Adapting dynamic llm serving through inflight pipeline refactoring in fragmented serverless clusters. InProceedings of EuroSys, 2026

work page 2026
[33]

Under- standing diffusion model serving in production: A top-down analysis of workload, scheduling, and resource efficiency

Yanying Lin, Shuaipeng Wu, Shutian Luo, Hong Xu, Haiying Shen, Chong Ma, Min Shen, Le Chen, Chengzhong Xu, Lin Qu, et al. Under- standing diffusion model serving in production: A top-down analysis of workload, scheduling, and resource efficiency. InProceedings of ACM SoCC, 2025. 13 Conference’17, July 2017, Washington, DC, USA Shutian Luo, Ali Zafar Sadiq...

work page 2025
[34]

Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

Xueshen Liu, Yongji Wu, Yuncheng Yao, Danyang Zhuo, Ion Stoica, and Z Morley Mao. Foundry: Template-based cuda graph context material- ization for fast llm serving cold start.arXiv preprint arXiv:2604.06664, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

Sky- serve: Serving ai models across regions and clouds with spot instances

Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, and Ion Stoica. Sky- serve: Serving ai models across regions and clouds with spot instances. InProceedings of EuroSys, 2025

work page 2025
[36]

S-lora: Serving thousands of concurrent lora adapters

Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. S-lora: Serving thousands of concurrent lora adapters. 2023

work page 2023
[37]

Orion: Interference- aware, fine-grained gpu sharing for ml applications

Foteini Strati, Xianzhe Ma, and Ana Klimovic. Orion: Interference- aware, fine-grained gpu sharing for ml applications. InProceedings of EuroSys, pages 1075–1092, 2024

work page 2024
[38]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie- Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017

work page 2017
[40]

Zorua: A holistic approach to resource virtualization in gpus

Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B Gibbons, and Onur Mutlu. Zorua: A holistic approach to resource virtualization in gpus. InProceedings of MICRO, 2016

work page 2016
[41]

{ByteCheckpoint}: A unified checkpointing system for large foundation model development

Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mo- fan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, et al. {ByteCheckpoint}: A unified checkpointing system for large foundation model development. InProceedings of NSDI, 2025

work page 2025
[42]

Aegaeon: Effective gpu pooling for concurrent llm serving on the market

Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective gpu pooling for concurrent llm serving on the market. In Proceedings of SOSP, 2025

work page 2025
[43]

Pie: Pooling CPU memory for LLM inference

Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, and Ion Stoica. Pie: Pooling cpu memory for llm inference.arXiv preprint arXiv:2411.09317, 2024

work page arXiv 2024
[44]

MoE-Infinity: Efficient MoE inference on personal ma- chines with sparsity-aware expert cache.arXiv preprint arXiv:2401.14361,

Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. Moe- infinity: Efficient moe inference on personal machines with sparsity- aware expert cache.arXiv preprint arXiv:2401.14361, 2024

work page arXiv 2024
[45]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Taming latency- memory trade-off in moe-based llm serving via fine-grained expert offloading

Hanfei Yu, Xingqi Cui, Hong Zhang, and Hao Wang. Taming latency- memory trade-off in moe-based llm serving via fine-grained expert offloading. InProceedings of EuroSys, 2026

work page 2026
[47]

Superinfer: Slo- aware rotary scheduling and memory management for llm inference on superchips

Jiahuan Yu, Mingtao Hu, Zichao Lin, and Minjia Zhang. Superinfer: Slo- aware rotary scheduling and memory management for llm inference on superchips. 2026

work page 2026
[48]

Medusa: Accelerating serverless llm inference with materialization

Shaoxun Zeng, Minhui Xie, Shiwei Gao, Youmin Chen, and Youyou Lu. Medusa: Accelerating serverless llm inference with materialization. In Proceedings of ASPLOS, 2025. 14

