C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG
Pith reviewed 2026-05-20 02:07 UTC · model grok-4.3
The pith
C2CServe uses NVLink-C2C to stream LLM weights from CPU memory to MIG instances, cutting cold-start latency by up to 7.1x for dense models on GH200 while maintaining over 95% TTFT and TPOT attainment under C2C contention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By keeping LLM weights in CPU memory and streaming them over NVLink-C2C only when needed, C2CServe lets MIG instances change models between requests without reloading entire weight sets into limited HBM. HybridGEMM adapts its GEMM execution pattern to the mixed memory hierarchy using a single tuning parameter to keep bandwidth balanced across contending partitions. A hierarchical scheduler then aligns model placement, input chunk sizes, and kernel choice with runtime feedback to limit C2C interference. On GH200 hardware this combination delivers up to 7.1x lower cold-start latency for dense models and 4.6x for MoE models versus prior serverless systems, while maintaining over 95% TTFT and TPOT attainment under C2C contention.
What carries the argument
HybridGEMM, a heterogeneous-memory-aware GEMM kernel that adapts access patterns to balance HBM and C2C bandwidth across MIG partitions via a single tuning knob, together with the hierarchical scheduler that coordinates placement, chunking, and kernel selection under online contention feedback.
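The bandwidth-balancing idea can be illustrated with a small analytic sketch. This is not the paper's kernel: it assumes the tuning knob can be read as the fraction of weight bytes a partition reads from HBM, with the remainder streamed over C2C, and that both paths are read concurrently. The function names and the bandwidth figures are hypothetical.

```python
def stream_time(total_bytes, hbm_fraction, hbm_bw, c2c_bw):
    """Time to read the weights when HBM and C2C are read concurrently.

    hbm_fraction is the assumed knob: the share of bytes read from HBM.
    The slower of the two paths determines the finish time.
    """
    if not 0.0 <= hbm_fraction <= 1.0:
        raise ValueError("hbm_fraction must be in [0, 1]")
    hbm_time = total_bytes * hbm_fraction / hbm_bw
    c2c_time = total_bytes * (1.0 - hbm_fraction) / c2c_bw
    return max(hbm_time, c2c_time)


def balanced_fraction(hbm_bw, c2c_bw):
    """Knob value at which both paths finish at the same time."""
    return hbm_bw / (hbm_bw + c2c_bw)


# Hypothetical per-partition bandwidths in arbitrary units (assumptions).
knob = balanced_fraction(hbm_bw=300.0, c2c_bw=100.0)
print(knob)                                    # 0.75
print(stream_time(400.0, knob, 300.0, 100.0))  # 1.0
print(stream_time(400.0, 0.5, 300.0, 100.0))   # 2.0
```

Under this simplified model, a knob that is stale after the available C2C bandwidth drops leaves one path idle while the other finishes, which is why the review asks how the knob behaves under contention.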
If this is right
- MIG instances can switch models at per-request granularity without full HBM weight reloads.
- Cold-start latency falls by up to 7.1x for dense models and 4.6x for MoE models versus prior serverless baselines.
- Over 95% TTFT and TPOT attainment is preserved even when multiple partitions share the C2C link.
- Elastic serverless serving becomes practical on GH200 without dedicating whole GPUs or accepting long initialization times.
Where Pith is reading between the lines
- The same streaming-plus-tuning pattern could be tested on future platforms that offer comparable CPU-GPU bandwidth.
- Cloud operators might reduce GPU over-provisioning for variable LLM traffic by adopting MIG-plus-C2C placement.
- Higher-contention workloads could expose whether the single-knob control remains sufficient or needs additional knobs.
- Integration points with existing serverless runtimes would let the technique apply to wider model catalogs.
Load-bearing premise
C2C bandwidth stays sufficient and predictable when several MIG partitions contend for the link, and the single tuning knob plus hierarchical scheduler can keep performance stable without later manual fixes that would erase the reported gains.
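One way to picture the feedback loop this premise depends on is a simple additive-increase, multiplicative-decrease (AIMD) rule on the input chunk size. AIMD is a stand-in chosen for illustration; the source does not say which control law the scheduler uses, and the function name, step sizes, and bounds below are all hypothetical.

```python
def next_chunk_size(chunk, observed_tpot_ms, target_tpot_ms,
                    min_chunk=64, max_chunk=2048, step=64):
    """One AIMD step on the input chunk size (illustrative control law).

    On an SLO miss the chunk is halved to relieve the shared link;
    otherwise it grows by a fixed step to recover throughput.
    """
    if observed_tpot_ms > target_tpot_ms:
        return max(min_chunk, chunk // 2)
    return min(max_chunk, chunk + step)


# Hypothetical trace of observed TPOT values against a 50 ms target.
chunk = 512
for tpot in (60.0, 55.0, 40.0, 45.0):
    chunk = next_chunk_size(chunk, tpot, target_tpot_ms=50.0)
print(chunk)  # 256
```

If a loop of this kind cannot hold the latency targets without someone changing the bounds or step sizes by hand, the premise above fails.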
What would settle it
Measure cold-start latency and TTFT/TPOT attainment while running many concurrent MIG instances at peak C2C load; if latency gains disappear or attainment falls below 95% without extra tuning, the central claim does not hold.
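The check described above can be sketched as a small script over measured latencies. It assumes attainment means the fraction of requests whose latency is within a per-metric target; the target values and sample data are placeholders, not numbers from the paper.

```python
def attainment(latencies_ms, slo_ms):
    """Fraction of requests whose latency is within the SLO."""
    if not latencies_ms:
        raise ValueError("no measurements")
    met = sum(1 for x in latencies_ms if x <= slo_ms)
    return met / len(latencies_ms)


def claim_holds(ttft_ms, tpot_ms, ttft_slo_ms, tpot_slo_ms, threshold=0.95):
    """True if neither TTFT nor TPOT attainment falls below the threshold."""
    return (attainment(ttft_ms, ttft_slo_ms) >= threshold
            and attainment(tpot_ms, tpot_slo_ms) >= threshold)


# Placeholder measurements (assumptions): 20 requests per metric.
ttft = [180.0] * 20
tpot = [40.0] * 18 + [90.0] * 2   # two requests miss a 50 ms TPOT target
print(attainment(tpot, 50.0))                  # 0.9
print(claim_holds(ttft, tpot, 200.0, 50.0))    # False
```

Running the same check while raising the number of concurrent MIG instances would show where, if anywhere, attainment drops below the threshold.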
Original abstract
Modern LLM serving is increasingly serverless in shape: large model catalogs, long-tail invocations, and multi-tenant demand. Existing GPU serving systems face a tradeoff: dedicated-GPU allocation wastes scarce HBM under sparse traffic, while GPU time sharing places model initialization and weight loading on the cold-start path. Spatial GPU sharing such as multi-instance GPU (MIG) provides isolation and accounting, but each slice has too little HBM for modern LLM weights. We observe that high-bandwidth CPU-GPU interconnects, such as NVLink-C2C (C2C) in NVIDIA GH200 and GB200 Superchips, change the memory constraint: model weights can reside in CPU memory and be streamed on demand to MIG instances, shifting model residency from scarce HBM to abundant host memory. Leveraging this capability, we present C2CServe, a request-granularity serverless LLM serving system that allows MIG instances to switch models across requests without reloading weights into HBM. C2CServe introduces HybridGEMM, a heterogeneous-memory-aware GEMM kernel that adapts data access patterns to balance HBM and C2C bandwidth across MIG partitions using a single tuning knob. To mitigate shared-C2C contention, C2CServe further uses a hierarchical scheduler that coordinates model placement, input chunking, and kernel selection with online feedback control. On GH200, C2CServe reduces cold-start latency by up to 7.1x for dense models and 4.6x for MoE models compared with state-of-the-art serverless LLM serving systems, while maintaining over 95% TTFT and TPOT attainment under C2C contention.
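The memory constraint in the abstract comes down to simple arithmetic, sketched below. The slice size, model size, and link bandwidth are illustrative assumptions, not figures from the paper, and the helper names are hypothetical.

```python
def weight_bytes(num_params, bytes_per_param=2):
    """Size of the weights in bytes (2 bytes per parameter for FP16/BF16)."""
    return num_params * bytes_per_param


def fits_in_slice(num_params, slice_hbm_bytes, bytes_per_param=2):
    """True if the full weight set fits in one slice's HBM.

    Ignores KV cache and activations, so this is an optimistic bound.
    """
    return weight_bytes(num_params, bytes_per_param) <= slice_hbm_bytes


def stream_seconds(num_params, link_bytes_per_s, bytes_per_param=2):
    """Lower bound on the time to stream every weight once over a link."""
    return weight_bytes(num_params, bytes_per_param) / link_bytes_per_s


# Illustrative numbers only (assumptions, not measurements from the paper).
GB = 10**9
print(fits_in_slice(8 * 10**9, 12 * GB))    # False: 16 GB of weights, 12 GB slice
print(stream_seconds(8 * 10**9, 400 * GB))  # 0.04
```

With these placeholder values the weights do not fit in the slice, yet one full pass over them across the link takes tens of milliseconds, which is the trade the abstract describes.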
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces C2CServe, a request-granularity serverless LLM serving system for MIG on GH200/GB200 that streams model weights over NVLink-C2C from CPU memory instead of requiring full HBM residency. It proposes HybridGEMM (a heterogeneous-memory GEMM kernel controlled by one tuning knob) and a hierarchical scheduler with online feedback to coordinate placement, chunking, and kernel selection under shared-C2C contention. Central empirical claims are up to 7.1× cold-start latency reduction for dense models and 4.6× for MoE models versus prior serverless systems, while sustaining >95% TTFT and TPOT attainment.
Significance. If the contention-handling results hold, the work shows how high-bandwidth CPU-GPU links can relax HBM constraints and enable more elastic multi-tenant LLM serving. The single-knob HybridGEMM plus feedback scheduler is a pragmatic design point; reproducible speedups on real GH200 hardware would be a useful data point for systems that must balance isolation, cold-start cost, and interconnect sharing.
major comments (2)
- [§5] §5 (Evaluation, attainment results): the claim of >95% TTFT/TPOT under C2C contention is load-bearing for the 7.1×/4.6× latency gains, yet the section provides no worst-case bandwidth saturation traces, no explicit count of concurrent MIG partitions, and no saturation-threshold measurements. Without these, it is impossible to confirm that the hierarchical scheduler's online feedback keeps performance stable without post-hoc knob adjustments.
- [§3.2] §3.2 (HybridGEMM): the single tuning knob is presented as sufficient to balance HBM and C2C access across partitions, but the design section contains no sensitivity analysis or ablation showing how GEMM performance and attainment degrade when C2C bandwidth varies under realistic multi-MIG contention. This directly affects whether the reported gains remain valid without manual retuning.
minor comments (2)
- [Abstract] Abstract and §5: quantitative claims (speedups and attainment percentages) should briefly note the number of MIGs, workload traces, and whether error bars or multiple runs are reported, even at high level.
- [Related Work] Related-work section: explicitly list and cite the exact state-of-the-art serverless baselines used in the comparison (including their MIG or time-sharing configurations).
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation and design sections. We address each major comment below and will incorporate revisions to provide additional evidence on contention handling and design robustness.
Point-by-point responses
- Referee: [§5] §5 (Evaluation, attainment results): the claim of >95% TTFT/TPOT under C2C contention is load-bearing for the 7.1×/4.6× latency gains, yet the section provides no worst-case bandwidth saturation traces, no explicit count of concurrent MIG partitions, and no saturation-threshold measurements. Without these, it is impossible to confirm that the hierarchical scheduler's online feedback keeps performance stable without post-hoc knob adjustments.
  Authors: We agree that more granular data on contention scenarios would strengthen the presentation of the >95% attainment results. The current evaluation reports aggregate TTFT/TPOT attainment under shared-C2C load, but the manuscript does not include the requested worst-case traces or explicit saturation thresholds. In the revised version we will add bandwidth saturation traces, state the exact number of concurrent MIG partitions used in each experiment, and report saturation-threshold measurements. These additions will show that the hierarchical scheduler's online feedback loop maintains the reported attainment levels without requiring post-hoc knob adjustments.
  Revision: yes
- Referee: [§3.2] §3.2 (HybridGEMM): the single tuning knob is presented as sufficient to balance HBM and C2C access across partitions, but the design section contains no sensitivity analysis or ablation showing how GEMM performance and attainment degrade when C2C bandwidth varies under realistic multi-MIG contention. This directly affects whether the reported gains remain valid without manual retuning.
  Authors: The single tuning knob in HybridGEMM is intended to allow runtime adaptation to available C2C bandwidth via scheduler feedback. The current design section focuses on the kernel's heterogeneous-memory access patterns and overall system integration rather than exhaustive sensitivity data. We acknowledge that an explicit ablation under varying contention would better demonstrate robustness. In the revised §3.2 we will add a sensitivity analysis and ablation that quantifies GEMM performance and end-to-end attainment as C2C bandwidth is reduced under multi-MIG contention, confirming that the reported speedups hold without manual retuning.
  Revision: yes
Circularity Check
No significant circularity; empirical system evaluation is self-contained
The paper introduces C2CServe as a systems artifact with HybridGEMM (single tuning knob) and a hierarchical scheduler using online feedback. Central results are direct latency and attainment measurements on GH200 hardware against external baselines. The derivation contains no equations, no fitted parameters renamed as predictions, no uniqueness theorems that rest on self-citation, and no smuggled ansatz. The evaluation chain relies on hardware measurements rather than internal reductions, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
free parameters (1)
- HybridGEMM tuning knob
invented entities (2)
- HybridGEMM: no independent evidence
- hierarchical scheduler: no independent evidence