
arxiv: 2605.19481 · v1 · pith:NNWRY5JNnew · submitted 2026-05-19 · 💻 cs.OS

C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG

Pith reviewed 2026-05-20 02:07 UTC · model grok-4.3

classification 💻 cs.OS
keywords serverless LLM serving · MIG · NVLink-C2C · cold-start latency · HybridGEMM · GH200 · GPU sharing · elastic serving

The pith

C2CServe uses NVLink-C2C to stream LLM weights from CPU memory to MIG instances, cutting cold-start latency by up to 7.1x on GH200 while maintaining over 95% TTFT and TPOT attainment under C2C contention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that high-bandwidth CPU-GPU interconnects such as NVLink-C2C remove the HBM size barrier that prevents MIG slices from hosting modern LLM weights. Weights can therefore live in plentiful host memory and stream on demand, letting MIG instances switch models at request granularity instead of paying full reload costs on every cold start. C2CServe implements this shift with HybridGEMM, a kernel that tunes data movement between HBM and C2C with one knob, and a hierarchical scheduler that coordinates placement and chunking while reacting to measured contention. If the approach holds, serverless LLM platforms can avoid both the memory waste of dedicated GPUs and the long initialization delays of time-shared GPUs on the same hardware.

Core claim

By keeping LLM weights in CPU memory and streaming them over NVLink-C2C only when needed, C2CServe lets MIG instances change models between requests without reloading entire weight sets into limited HBM. HybridGEMM adapts its GEMM execution pattern to the mixed memory hierarchy using a single tuning parameter to keep bandwidth balanced across contending partitions. A hierarchical scheduler then aligns model placement, input chunk sizes, and kernel choice with runtime feedback to limit C2C interference. On GH200 hardware this combination delivers up to 7.1x lower cold-start latency for dense models and 4.6x for MoE models versus prior serverless systems, while maintaining over 95% TTFT and TPOT attainment under C2C contention.

What carries the argument

HybridGEMM, a heterogeneous-memory-aware GEMM kernel that adapts access patterns to balance HBM and C2C bandwidth across MIG partitions via a single tuning knob, together with the hierarchical scheduler that coordinates placement, chunking, and kernel selection under online contention feedback.
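
The balancing idea can be illustrated with a deliberately simplified model. The sketch below assumes the knob is a fraction `alpha` of weight bytes read from HBM, with the remainder streamed over C2C, and that the two transfers overlap perfectly. The excerpts on this page do not define the knob this way; the function names and the bandwidth values in the usage note are hypothetical.

```python
def transfer_time(alpha, weight_bytes, hbm_bw, c2c_bw):
    """Time to fetch weights when a fraction `alpha` comes from HBM.

    Assumption (not from the paper): the HBM and C2C paths overlap fully,
    so the slower path determines the total time.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    hbm_time = alpha * weight_bytes / hbm_bw
    c2c_time = (1.0 - alpha) * weight_bytes / c2c_bw
    return max(hbm_time, c2c_time)


def balanced_alpha(hbm_bw, c2c_bw):
    """Alpha at which both paths finish together under this toy model."""
    return hbm_bw / (hbm_bw + c2c_bw)
```

With made-up bandwidths of 3.0 and 1.0 (arbitrary units), `balanced_alpha(3.0, 1.0)` gives 0.75. In this toy model, if contention lowers the effective C2C bandwidth, the balance point moves toward HBM, which is one way to see why a static setting could be fragile.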

If this is right

  • MIG instances can switch models at per-request granularity without full HBM weight reloads.
  • Cold-start latency falls by up to 7.1x for dense models and 4.6x for MoE models versus prior serverless baselines.
  • Over 95% TTFT and TPOT attainment is preserved even when multiple partitions share the C2C link.
  • Elastic serverless serving becomes practical on GH200 without dedicating whole GPUs or accepting long initialization times.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same streaming-plus-tuning pattern could be tested on future platforms that offer comparable CPU-GPU bandwidth.
  • Cloud operators might reduce GPU over-provisioning for variable LLM traffic by adopting MIG-plus-C2C placement.
  • Higher-contention workloads could expose whether the single-knob control remains sufficient or needs additional knobs.
  • Integration points with existing serverless runtimes would let the technique apply to wider model catalogs.

Load-bearing premise

C2C bandwidth stays sufficient and predictable when several MIG partitions contend for the link, and the single tuning knob plus hierarchical scheduler can keep performance stable without later manual fixes that would erase the reported gains.

What would settle it

Measure cold-start latency and TTFT/TPOT attainment while running many concurrent MIG instances at peak C2C load; if latency gains disappear or attainment falls below 95% without extra tuning, the central claim does not hold.
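
A minimal sketch of how such a check could be scored, assuming "attainment" means the fraction of requests whose latency is within a service-level objective (SLO). That definition, the SLO thresholds, and the function names are assumptions for illustration; this page does not state how the paper computes attainment.

```python
def attainment(latencies_ms, slo_ms):
    """Fraction of requests whose latency is within the SLO."""
    if not latencies_ms:
        raise ValueError("no measurements")
    met = sum(1 for x in latencies_ms if x <= slo_ms)
    return met / len(latencies_ms)


def claim_holds(ttft_ms, tpot_ms, ttft_slo_ms, tpot_slo_ms, target=0.95):
    """True if both TTFT and TPOT attainment exceed the target."""
    return (attainment(ttft_ms, ttft_slo_ms) > target
            and attainment(tpot_ms, tpot_slo_ms) > target)
```

Running `claim_holds` on latencies collected at peak C2C load, without retuning between runs, would give a yes/no answer to the test described above.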

Figures

Figures reproduced from arXiv: 2605.19481 by Ali Zafar Sadiq, Haiying Shen, Mingye Zhang, Rui Yang, Shutian Luo, Wei Wang, Yue Cheng.

Figure 1
Figure 1: Multi-model serving approaches. … already catalog over a million models [8], and production traces [1, 33] from large-scale inference platforms show a pronounced long tail (detailed in § 2.1): a small fraction of models receives most requests, while the remaining models must still remain responsive to unpredictable invocations [33] (details in § 2.1). This long-tail workload closely matches the serverless …
Figure 2
Figure 2: LLM workload fluctuation in an Alibaba production cluster. Left: Hourly request rates of representative models. Right: Per-model active-time distribution across 59 active models.
Figure 3
Figure 3: Comparison of data access patterns across different GEMM tiling strategies. 3 Motivation for C2CServe. 3.1 Opportunity of Combining MIG and C2C. Serverless LLM serving requires high elasticity: low cold-start latency and fine-grained resource allocation. However, both are difficult when model weights must remain resident in scarce HBM. High-bandwidth CPU–GPU interconnects such as C2C change this tradeoff. MI…
Figure 5
Figure 5: Shape-dependent performance and bandwidth utilization on asymGEMM. … shared, narrowing the effective HBM-over-C2C bandwidth advantage.
Figure 6
Figure 6: Interference on shared C2C bandwidth. … more activation rows reuse the same parameter tiles and better amortize CPU-memory fetches. Thus, 𝑁 primarily increases interconnect pressure, whereas 𝑀 improves compute efficiency through higher parameter reuse. These results reveal a shape-dependent bottleneck shift: small shapes underutilize the GPU, large 𝑁 makes execution C2C-bound, and large 𝑀 makes direct CPU…
Figure 7
Figure 7: System architecture of C2CServe. 4 Overview of C2CServe Architecture. C2CServe is a Superchip-native serverless LLM serving system, with the overall architecture shown in …
Figure 8
Figure 8: Impact of 𝑀-dimension tile size in asymmetric GEMM. The optimal 𝛼 is runtime-dependent. Beyond workload shape and MIG partitioning, it must account for live C2C contention from co-resident tenants. As multiple MIG instances stream CPU-resident weights concurrently, each instance’s effective C2C bandwidth changes over time, making a static 𝛼 fragile. C2CServe therefore treats 𝛼 as a runtime tuning knob; it…
Figure 9
Figure 9: TTFT and TPOT comparison across baselines. 9.2 End-to-End Evaluation. 9.2.1 Full-GPU Serving Performance. We evaluate serving performance on the full GPU, as shown in …
Figure 13
Figure 13: Baseline integrated with HybridGEMM. 9.3 Dynamic Workload. We replay a production-derived dynamic workload using open-source models, as shown in …
Figure 11
Figure 11: Model-switch Overhead. [Plot residue: panels (a) Workload pattern (Request Rate in Req/s vs. Time in minutes), (b) Dense Models (TTFT in ms vs. Time in minutes), (c) MoE Models (TTFT in ms vs. Time in minutes). Legend: Llama-3B, Llama-8B, Mixtral-8x7B, Qwen3-30B-A3B; C2CServe, SLLM, Aegaeon, MoE-Inf, FineMoE.]
Figure 12
Figure 12: MoE and Dense Models trace replay. … and Aegaeon run out of memory. Compared with Aegaeon, C2CServe improves latency by up to 7.1×. For MoE models, C2CServe reduces cold-start latency over MoE-Infinity and FineMoE by 4.6–5.0×, and outperforms ServerlessLLM by 1.95× on Qwen3-30B-A3B. Overall, C2CServe avoids HBM-capacity failures while maintaining low cold-start latency across dense and MoE workloads. 9.2.3 …
Figure 14
Figure 14: Component-level comparison. … and C2C bandwidth budgets fit its runtime demand, reducing p99 TTFT to 0.64 s, a 1.94× improvement. This benefit appears even when the chunk controller and HybridGEMM are already active, showing that bandwidth-aware placement is essential for controlling tail latency under multi-tenant MIG execution. 9.4.3 Chunk-size Control. We evaluate the effectiveness of the chunk controll…
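
The shape argument in the Figure 6 excerpt (𝑁 raises interconnect pressure, 𝑀 raises parameter reuse) can be checked with simple arithmetic. This sketch assumes the usual convention of an 𝑀×𝐾 activation multiplied by a 𝐾×𝑁 weight matrix, 2 bytes per weight element, and each weight element fetched from CPU memory exactly once. None of these assumptions is confirmed by the excerpts, and the function names are illustrative.

```python
def weight_bytes(n, k, bytes_per_elem=2):
    """Bytes of the K x N weight matrix that must cross the interconnect."""
    return n * k * bytes_per_elem


def flops(m, n, k):
    """Multiply-add count for an (M x K) @ (K x N) GEMM."""
    return 2 * m * n * k


def flops_per_weight_byte(m, n, k, bytes_per_elem=2):
    """Compute performed per streamed weight byte (a reuse measure)."""
    return flops(m, n, k) / weight_bytes(n, k, bytes_per_elem)
```

Under these assumptions the ratio simplifies to `2 * m / bytes_per_elem`: doubling `m` doubles the work done per streamed byte, while doubling `n` doubles the streamed bytes and leaves the ratio unchanged.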
Original abstract

Modern LLM serving is increasingly serverless in shape: large model catalogs, long-tail invocations, and multi-tenant demand. Existing GPU serving systems face a tradeoff: dedicated-GPU allocation wastes scarce HBM under sparse traffic, while GPU time sharing places model initialization and weight loading on the cold-start path. Spatial GPU sharing such as multi-instance GPU (MIG) provides isolation and accounting, but each slice has too little HBM for modern LLM weights. We observe that high-bandwidth CPU–GPU interconnects, such as NVLink-C2C (C2C) in NVIDIA GH200 and GB200 Superchips, change the memory constraint: model weights can reside in CPU memory and be streamed on demand to MIG instances, shifting model residency from scarce HBM to abundant host memory. Leveraging this capability, we present C2CServe, a request-granularity serverless LLM serving system that allows MIG instances to switch models across requests without reloading weights into HBM. C2CServe introduces HybridGEMM, a heterogeneous-memory-aware GEMM kernel that adapts data access patterns to balance HBM and C2C bandwidth across MIG partitions using a single tuning knob. To mitigate shared-C2C contention, C2CServe further uses a hierarchical scheduler that coordinates model placement, input chunking, and kernel selection with online feedback control. On GH200, C2CServe reduces cold-start latency by up to 7.1x for dense models and 4.6x for MoE models compared with state-of-the-art serverless LLM serving systems, while maintaining over 95% TTFT and TPOT attainment under C2C contention.
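
To make the placement part of the scheduler concrete, here is a hypothetical greedy sketch. It assumes each MIG instance has been assigned a C2C bandwidth budget and that each model has an estimated bandwidth demand; the per-instance budgets, the instance names, and the "most spare budget first" rule are all illustrative assumptions, not the paper's algorithm.

```python
def place(demand_gbps, remaining_budget_gbps):
    """Pick the instance with the most spare C2C budget that fits the demand.

    Returns the chosen instance name, or None if no instance fits.
    Mutates `remaining_budget_gbps` to reserve the bandwidth.
    """
    candidates = [(spare, name)
                  for name, spare in remaining_budget_gbps.items()
                  if spare >= demand_gbps]
    if not candidates:
        return None
    _, name = max(candidates)  # largest spare budget wins
    remaining_budget_gbps[name] -= demand_gbps
    return name
```

Called with `{"mig0": 100.0, "mig1": 150.0}` and a demand of 120.0, the function chooses `"mig1"` and leaves it 30.0 of spare budget; a second request of the same size is rejected with `None`.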

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces C2CServe, a request-granularity serverless LLM serving system for MIG on GH200/GB200 that streams model weights over NVLink-C2C from CPU memory instead of requiring full HBM residency. It proposes HybridGEMM (a heterogeneous-memory GEMM kernel controlled by one tuning knob) and a hierarchical scheduler with online feedback to coordinate placement, chunking, and kernel selection under shared-C2C contention. Central empirical claims are up to 7.1× cold-start latency reduction for dense models and 4.6× for MoE models versus prior serverless systems, while sustaining >95% TTFT and TPOT attainment.

Significance. If the contention-handling results hold, the work shows how high-bandwidth CPU-GPU links can relax HBM constraints and enable more elastic multi-tenant LLM serving. The single-knob HybridGEMM plus feedback scheduler is a pragmatic design point; reproducible speedups on real GH200 hardware would be a useful data point for systems that must balance isolation, cold-start cost, and interconnect sharing.

major comments (2)
  1. [§5] §5 (Evaluation, attainment results): the claim of >95% TTFT/TPOT under C2C contention is load-bearing for the 7.1×/4.6× latency gains, yet the section provides no worst-case bandwidth saturation traces, no explicit count of concurrent MIG partitions, and no saturation-threshold measurements. Without these, it is impossible to confirm that the hierarchical scheduler's online feedback keeps performance stable without post-hoc knob adjustments.
  2. [§3.2] §3.2 (HybridGEMM): the single tuning knob is presented as sufficient to balance HBM and C2C access across partitions, but the design section contains no sensitivity analysis or ablation showing how GEMM performance and attainment degrade when C2C bandwidth varies under realistic multi-MIG contention. This directly affects whether the reported gains remain valid without manual retuning.
minor comments (2)
  1. [Abstract] Abstract and §5: quantitative claims (speedups and attainment percentages) should briefly note the number of MIGs, workload traces, and whether error bars or multiple runs are reported, even at high level.
  2. [Related Work] Related-work section: explicitly list and cite the exact state-of-the-art serverless baselines used in the comparison (including their MIG or time-sharing configurations).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation and design sections. We address each major comment below and will incorporate revisions to provide additional evidence on contention handling and design robustness.

  1. Referee: [§5] §5 (Evaluation, attainment results): the claim of >95% TTFT/TPOT under C2C contention is load-bearing for the 7.1×/4.6× latency gains, yet the section provides no worst-case bandwidth saturation traces, no explicit count of concurrent MIG partitions, and no saturation-threshold measurements. Without these, it is impossible to confirm that the hierarchical scheduler's online feedback keeps performance stable without post-hoc knob adjustments.

    Authors: We agree that more granular data on contention scenarios would strengthen the presentation of the >95% attainment results. The current evaluation reports aggregate TTFT/TPOT attainment under shared-C2C load, but the manuscript does not include the requested worst-case traces or explicit saturation thresholds. In the revised version we will add bandwidth saturation traces, state the exact number of concurrent MIG partitions used in each experiment, and report saturation-threshold measurements. These additions will show that the hierarchical scheduler's online feedback loop maintains the reported attainment levels without requiring post-hoc knob adjustments. revision: yes

  2. Referee: [§3.2] §3.2 (HybridGEMM): the single tuning knob is presented as sufficient to balance HBM and C2C access across partitions, but the design section contains no sensitivity analysis or ablation showing how GEMM performance and attainment degrade when C2C bandwidth varies under realistic multi-MIG contention. This directly affects whether the reported gains remain valid without manual retuning.

    Authors: The single tuning knob in HybridGEMM is intended to allow runtime adaptation to available C2C bandwidth via scheduler feedback. The current design section focuses on the kernel's heterogeneous-memory access patterns and overall system integration rather than exhaustive sensitivity data. We acknowledge that an explicit ablation under varying contention would better demonstrate robustness. In the revised §3.2 we will add a sensitivity analysis and ablation that quantifies GEMM performance and end-to-end attainment as C2C bandwidth is reduced under multi-MIG contention, confirming that the reported speedups hold without manual retuning. revision: yes
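
The runtime adaptation described in this response could take the shape of a simple feedback step. This sketch assumes the knob is a fraction `alpha` of weight traffic served from HBM and that the scheduler can measure effective C2C bandwidth; the update rule, the `gain` parameter, and the function name are hypothetical and are not taken from the paper.

```python
def update_alpha(alpha, measured_c2c_bw, hbm_bw, gain=0.5):
    """One feedback step for the (assumed) HBM fraction `alpha`.

    Moves alpha part of the way toward the point where HBM and C2C
    transfer times would match for the measured bandwidth, then clamps
    the result to [0, 1].
    """
    target = hbm_bw / (hbm_bw + measured_c2c_bw)
    new_alpha = alpha + gain * (target - alpha)
    return min(1.0, max(0.0, new_alpha))
```

A sensitivity study of the kind the referee asks for would sweep `measured_c2c_bw` downward and record how quickly, and how stably, such a loop settles.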

Circularity Check

0 steps flagged

No significant circularity; empirical system evaluation is self-contained


The paper introduces C2CServe as a systems artifact with HybridGEMM (single tuning knob) and a hierarchical scheduler using online feedback. Central results are direct latency and attainment measurements on GH200 hardware against external baselines. No equations, parameter fits renamed as predictions, self-citation load-bearing uniqueness theorems, or ansatz smuggling appear in the derivation. The evaluation chain relies on hardware measurements rather than internal reductions, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 2 invented entities

The system rests on the new HybridGEMM kernel and hierarchical scheduler whose behavior under real contention is not independently verified outside this work; the single tuning knob is a free parameter whose value selection is not detailed.

free parameters (1)
  • HybridGEMM tuning knob
    Single knob that balances HBM versus C2C data access patterns; its setting is chosen to achieve the reported performance.
invented entities (2)
  • HybridGEMM (no independent evidence)
    purpose: heterogeneous-memory-aware GEMM kernel that adapts access patterns across HBM and C2C
    New kernel introduced to handle mixed memory bandwidth; no independent evidence supplied.
  • hierarchical scheduler (no independent evidence)
    purpose: coordinates model placement, input chunking, and kernel selection with online feedback to mitigate C2C contention
    New scheduler component; no independent evidence supplied.



Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 6 internal anchors

  1. [1]

    GenAI in Alibaba Cloud. https://github.com/alibaba/clusterdata/tree/master/cluster-trace-v2026-GenAI

  2. [2]

    mini-sglang. https://github.com/sgl-project/mini-sglang

  3. [3]

    NVIDIA CUDA Toolkit. https://developer.nvidia.com/cuda/toolkit

  4. [4]

    PyTorch. https://pytorch.org/

  5. [5]

    Time-slicing GPUs. https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html

  6. [6]

    NVIDIA pinned memory. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#page-locked-host-memory, 2022

  7. [7]

    NVIDIA zero copy memory. https://docs.nvidia.com/cuda/cuda-c-programming-guide/#zero-copy-memory, 2022

  8. [8]

    Huggingface dataset. https://huggingface.co/datasets, 2023

  9. [9]

    ShareGPT. https://sharegpt.com/, 2023

  10. [10]

    CUDA memory management. https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html, 2025

  11. [11]

    NVIDIA CUTLASS. https://github.com/NVIDIA/cutlass, 2025

  12. [12]

    NVIDIA GB200. https://www.nvidia.com/en-us/data-center/dgx-gb200/, 2025

  13. [13]

    NVIDIA GH200. https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/, 2025

  14. [14]

    cuBLAS: Basic linear algebra on NVIDIA GPUs. https://developer.nvidia.com/cublas, 2026

  15. [15]

    NVIDIA Vera Rubin platform. https://www.nvidia.com/en-us/data-center/technologies/rubin/, 2026

  16. [16]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  17. [17]

    Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In Proceedings of OSDI, 2024

  18. [18]

    Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing

    Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh. Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing. In Proceedings of USENIX ATC, 2022

  19. [19]

    Muxserve: flexible spatial-temporal multiplexing for multiple llm serving

    Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. Muxserve: flexible spatial-temporal multiplexing for multiple llm serving. 2024

  20. [20]

    The Llama 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv e-prints, 2024

  21. [21]

    ServerlessLLM: Low-Latency serverless inference for large language models

    Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. ServerlessLLM: Low-Latency serverless inference for large language models. In Proceedings of OSDI, 2024

  22. [22]

    Multi Instance GPU. https://www.nvidia.com/en-us/technologies/multi-instance-gpu/, 2022

  23. [23]

    Mélange: Cost efficient large language model serving by exploiting gpu heterogeneity

    Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, and Ion Stoica. Mélange: Cost efficient large language model serving by exploiting gpu heterogeneity. arXiv preprint arXiv:2404.14527, 2024

  24. [24]

    Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences

    Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In Proceedings of OSDI, 2022

  25. [25]

    Resource multiplexing in tuning and serving large language models

    Yongjun He, Haofeng Yang, Yao Lu, Ana Klimovic, and Gustavo Alonso. Resource multiplexing in tuning and serving large language models. In Proceedings of ATC, 2025

  26. [26]

    DEEPSERVE: Serverless large language model serving at scale

    Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, et al. DEEPSERVE: Serverless large language model serving at scale. In Proceedings of USENIX ATC, 2025

  27. [27]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  28. [28]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  29. [29]

    Tetris: Memory-efficient serverless inference through tensor sharing

    Jie Li, Laiping Zhao, Yanan Yang, Kunlin Zhan, and Keqiu Li. Tetris: Memory-efficient serverless inference through tensor sharing. In Proceedings of USENIX ATC, 2022

  30. [30]

    Oneiros: Kv cache optimization through parameter remapping for multi-tenant llm serving

    Ruihao Li, Shagnik Pal, Vineeth Narayan Pullu, Prasoon Sinha, Jeeho Ryoo, Lizy K John, and Neeraja J Yadwadkar. Oneiros: Kv cache optimization through parameter remapping for multi-tenant llm serving. In Proceedings of the 2025 ACM Symposium on Cloud Computing, pages 88–101, 2025

  31. [31]

    Superoffload: Unleashing the power of large-scale llm training on superchips

    Xinyu Lian, Masahiro Tanaka, Olatunji Ruwase, and Minjia Zhang. Superoffload: Unleashing the power of large-scale llm training on superchips. In Proceedings of ASPLOS, 2026

  32. [32]

    Flexpipe: Adapting dynamic llm serving through inflight pipeline refactoring in fragmented serverless clusters

    Yanying Lin, Shijie Peng, Chengzhi Lu, Chengzhong Xu, and Kejiang Ye. Flexpipe: Adapting dynamic llm serving through inflight pipeline refactoring in fragmented serverless clusters. In Proceedings of EuroSys, 2026

  33. [33]

    Understanding diffusion model serving in production: A top-down analysis of workload, scheduling, and resource efficiency

    Yanying Lin, Shuaipeng Wu, Shutian Luo, Hong Xu, Haiying Shen, Chong Ma, Min Shen, Le Chen, Chengzhong Xu, Lin Qu, et al. Understanding diffusion model serving in production: A top-down analysis of workload, scheduling, and resource efficiency. In Proceedings of ACM SoCC, 2025

  34. [34]

    Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

    Xueshen Liu, Yongji Wu, Yuncheng Yao, Danyang Zhuo, Ion Stoica, and Z Morley Mao. Foundry: Template-based cuda graph context materialization for fast llm serving cold start. arXiv preprint arXiv:2604.06664, 2026

  35. [35]

    Skyserve: Serving ai models across regions and clouds with spot instances

    Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, and Ion Stoica. Skyserve: Serving ai models across regions and clouds with spot instances. In Proceedings of EuroSys, 2025

  36. [36]

    S-lora: Serving thousands of concurrent lora adapters

    Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. S-lora: Serving thousands of concurrent lora adapters. 2023

  37. [37]

    Orion: Interference-aware, fine-grained gpu sharing for ml applications

    Foteini Strati, Xianzhe Ma, and Ana Klimovic. Orion: Interference-aware, fine-grained gpu sharing for ml applications. In Proceedings of EuroSys, pages 1075–1092, 2024

  38. [38]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  39. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017

  40. [40]

    Zorua: A holistic approach to resource virtualization in gpus

    Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B Gibbons, and Onur Mutlu. Zorua: A holistic approach to resource virtualization in gpus. In Proceedings of MICRO, 2016

  41. [41]

    ByteCheckpoint: A unified checkpointing system for large foundation model development

    Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, et al. ByteCheckpoint: A unified checkpointing system for large foundation model development. In Proceedings of NSDI, 2025

  42. [42]

    Aegaeon: Effective gpu pooling for concurrent llm serving on the market

    Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective gpu pooling for concurrent llm serving on the market. In Proceedings of SOSP, 2025

  43. [43]

    Pie: Pooling cpu memory for llm inference

    Yi Xu, Ziming Mao, Xiangxi Mo, Shu Liu, and Ion Stoica. Pie: Pooling cpu memory for llm inference. arXiv preprint arXiv:2411.09317, 2024

  44. [44]

    Moe-infinity: Efficient moe inference on personal machines with sparsity-aware expert cache

    Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. Moe-infinity: Efficient moe inference on personal machines with sparsity-aware expert cache. arXiv preprint arXiv:2401.14361, 2024

  45. [45]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  46. [46]

    Taming latency-memory trade-off in moe-based llm serving via fine-grained expert offloading

    Hanfei Yu, Xingqi Cui, Hong Zhang, and Hao Wang. Taming latency-memory trade-off in moe-based llm serving via fine-grained expert offloading. In Proceedings of EuroSys, 2026

  47. [47]

    Superinfer: Slo-aware rotary scheduling and memory management for llm inference on superchips

    Jiahuan Yu, Mingtao Hu, Zichao Lin, and Minjia Zhang. Superinfer: Slo-aware rotary scheduling and memory management for llm inference on superchips. 2026

  48. [48]

    Medusa: Accelerating serverless llm inference with materialization

    Shaoxun Zeng, Minhui Xie, Shiwei Gao, Youmin Chen, and Youyou Lu. Medusa: Accelerating serverless llm inference with materialization. In Proceedings of ASPLOS, 2025