pith. machine review for the scientific record.

arxiv: 2604.08451 · v1 · submitted 2026-04-09 · 💻 cs.DC

Recognition: unknown

Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:44 UTC · model grok-4.3

classification 💻 cs.DC
keywords: Multi-Instance GPU (MIG) · GPU utilization · memory offloading · Nvlink-C2C · resource partitioning · HPC workloads · CPU offloading

The pith

Coarse-grained MIG slices often mismatch application compute and memory needs, but memory offloading to CPU over cache-coherent Nvlink-C2C can close the gap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how fixed-size GPU partitions affect real workloads such as NekRS, LAMMPS, Llama3, and Qiskit. It finds that multi-instance GPU sharing cuts overall waste and raises system throughput and energy efficiency, yet interference persists through shared resources such as the power budget, and the fixed slices remain too rigid for many codes. The authors therefore propose a memory-offloading method that moves selected traffic to the attached CPU across the Nvlink-C2C link. This matters for shared GPU clusters because it lets operators keep the isolation benefits of static slices while still adjusting the effective resource balance at finer granularity.

Core claim

Our performance-resource scaling results indicate that coarse-grained provisioning for tightly coupled compute and memory resources often mismatches application needs. To address this mismatch, we propose a memory-offloading scheme that leverages the cache-coherent Nvlink-C2C interconnect to bridge the gap between coarse-grained resource slices and reduce resource underutilization.

What carries the argument

A memory-offloading scheme that selectively routes memory traffic to the CPU across the cache-coherent Nvlink-C2C interconnect while keeping compute on the GPU partition.
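To make the mechanism concrete, here is a minimal CUDA sketch of the kind of selective offload the paper describes: one buffer stays in GPU memory, another is allocated as managed memory and advised to prefer the CPU, so kernels reach it over the coherent C2C link while compute stays on the GPU partition. This is an illustration under our own assumptions, not the authors' implementation; the paper's abstract does not specify which memory APIs its scheme uses.

```cpp
// Hypothetical illustration: keep compute on the GPU slice, but place one
// buffer in CPU-attached memory reachable over the cache-coherent link.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(const float *x, float *y, float a, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];   // y is device-resident, x may be CPU-resident
}

int main() {
    const size_t n = 1 << 26;     // ~256 MiB of floats
    float *x, *y;

    // Offloaded buffer: managed allocation advised to live on the CPU side.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemAdvise(x, n * sizeof(float), cudaMemAdviseSetAccessedBy, 0 /* GPU 0 */);

    // Hot buffer: ordinary device allocation inside the GPU partition.
    cudaMalloc(&y, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;   // CPU initializes in place, no copy

    scale<<<(n + 255) / 256, 256>>>(x, y, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(y);
    cudaFree(x);
    return 0;
}
```

On Grace Hopper class systems, system-allocated (malloc) memory is also directly accessible from the GPU over Nvlink-C2C (see [34] in the reference graph below), which is another plausible path for the same scheme; which route the authors intend is not stated in the abstract.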

If this is right

  • MIG partitioning alone yields measurable gains in system throughput and energy efficiency across scientific and AI codes.
  • Interference still occurs through shared resources such as power throttling even when instances are isolated.
  • Fine-grained CPU offloading can compensate for the rigidity of fixed MIG slices without altering the hardware partition sizes.
  • The approach applies directly to workloads whose memory access patterns can be identified and redirected.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interconnect-based offloading could be applied to other cache-coherent links, suggesting a general way to soften rigid hardware slices.
  • In multi-tenant schedulers this technique implies that resource allocation can be tuned at runtime by deciding which memory traffic stays on the GPU.
  • Codes with highly imbalanced compute-to-memory ratios would see the largest relative gains from the hybrid static-plus-offload model.

Load-bearing premise

Applications can be instrumented to move selected memory operations over Nvlink-C2C without creating new bottlenecks or requiring large code changes.
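A first-order way to stress this premise, before touching any application, is to compare GPU streaming bandwidth from a device-resident buffer against the same kernel reading a CPU-preferred managed buffer over the C2C link. The sketch below is a hypothetical probe, not an experiment from the paper; the buffer size, advice calls, and timing harness are our assumptions.

```cpp
// Hypothetical probe: GPU streaming bandwidth from device memory vs. from a
// CPU-resident managed buffer reached over the coherent interconnect.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void sum(const float *x, float *out, size_t n) {
    float acc = 0.0f;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        acc += x[i];
    atomicAdd(out, acc);   // crude reduction; enough to force every byte to move
}

static float time_sum(const float *x, float *out, size_t n) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    sum<<<1024, 256>>>(x, out, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    return ms;
}

int main() {
    const size_t n = 1 << 28;   // 1 GiB of floats
    float *dev, *offloaded, *out;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(dev, 0, n * sizeof(float));
    cudaMemset(out, 0, sizeof(float));

    cudaMallocManaged(&offloaded, n * sizeof(float));
    cudaMemAdvise(offloaded, n * sizeof(float), cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemAdvise(offloaded, n * sizeof(float), cudaMemAdviseSetAccessedBy, 0);
    for (size_t i = 0; i < n; ++i) offloaded[i] = 1.0f;   // touch on CPU so pages live there

    sum<<<1024, 256>>>(dev, out, n);   // warm-up launch, excluded from timing
    cudaDeviceSynchronize();

    double gib = (double)n * sizeof(float) / (1 << 30);
    printf("device-resident : %.1f GiB/s\n", gib / (time_sum(dev, out, n) / 1e3));
    printf("CPU-offloaded   : %.1f GiB/s\n", gib / (time_sum(offloaded, out, n) / 1e3));
    return 0;
}
```

A single-process probe like this only bounds the best case; the multi-tenant contention question raised below still needs concurrent offload streams from several instances.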

What would settle it

Applying the offloading scheme to the studied workloads under realistic multi-tenant loads and measuring whether utilization actually improves and whether the Nvlink-C2C link saturates.

Figures

Figures reproduced from arXiv: 2604.08451 by Gabin Schieffer, Ivy Peng, Jie Ren, Ruimin Shi.

Figure 1: Example MIG configuration on a 96 GB H100 GPU.
Figure 2: GPU compute resource utilization, measured as the SM …
Figure 3: GPU memory capacity (upper) and bandwidth (lower) …
Figure 4: GPU Performance-Resource Scaling for each applica…
Figure 5: System throughput for concurrent execution of seven …
Figure 6: Total energy consumed for concurrently running 7 …
Figure 7: Power consumption and throttling behavior for (a) …
Original abstract

Advances in GPU compute throughput and memory capacity brings significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application characteristics may result in imbalanced utilization. Multi-Instance GPU (MIG) is a promising approach to improve utilization by partitioning GPU compute and memory resources into fixed-size slices with isolation. Yet, its effectiveness and limitations in supporting HPC workloads remain an open question. We present a comprehensive system-level characterization of different GPU sharing options using real-world scientific, AI, and data analytics applications, including NekRS, LAMMPS, Llama3, and Qiskit. Our analysis reveals that while GPU sharing via MIG can significantly reduce resource underutilization, and enable system-level improvements in throughput and energy, interference still occurs through shared resources, such as power throttling. Our performance-resource scaling results indicate that coarse-grained provisioning for tightly coupled compute and memory resources often mismatches application needs. To address this mismatch, we propose a memory-offloading scheme that leverages the cache-coherent Nvlink-C2C interconnect to bridge the gap between coarse-grained resource slices and reduce resource underutilization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper characterizes GPU resource utilization and interference under different sharing mechanisms, including MIG partitioning, across real-world HPC, AI, and analytics workloads such as NekRS, LAMMPS, Llama3, and Qiskit. It reports that MIG reduces underutilization and improves system throughput and energy efficiency but that interference persists via shared resources such as power throttling. The authors identify mismatches arising from coarse-grained, tightly coupled compute/memory slices and propose a fine-grained memory-offloading scheme that exploits the cache-coherent Nvlink-C2C interconnect to bridge slice boundaries and further reduce underutilization.

Significance. If the offloading scheme can be shown to deliver measurable utilization gains without saturating the interconnect or requiring prohibitive application changes, the work would provide a practical path toward more elastic GPU provisioning in multi-tenant HPC and AI environments. The empirical characterization of production workloads supplies concrete evidence of current MIG limitations that is useful even if the proposed remedy requires further validation.

major comments (2)
  1. [Abstract / Proposed scheme] Abstract and proposed-scheme section: the central claim that the Nvlink-C2C memory-offloading scheme reduces resource underutilization is unsupported by any quantitative results, bandwidth measurements, latency data, or multi-tenant saturation experiments; the characterization of power throttling is presented, yet no corresponding evaluation of the offloading mechanism is supplied.
  2. [Proposed scheme] Proposed-scheme description: the assumption that applications can be selectively instrumented to route memory traffic over Nvlink-C2C without extensive rewrites or new bottlenecks is stated but not demonstrated; no evidence is given that the interconnect remains unsaturated under realistic concurrent offload loads from multiple tenants.
minor comments (1)
  1. [Abstract] The abstract lists the workloads but does not indicate which results (throughput, energy, utilization) are shown for each; a brief mapping would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and valuable feedback. The comments correctly identify that our manuscript emphasizes empirical characterization of MIG and related interference while the proposed offloading scheme remains unevaluated. We respond to each major comment below and indicate the corresponding revisions.

Point-by-point responses
  1. Referee: [Abstract / Proposed scheme] Abstract and proposed-scheme section: the central claim that the Nvlink-C2C memory-offloading scheme reduces resource underutilization is unsupported by any quantitative results, bandwidth measurements, latency data, or multi-tenant saturation experiments; the characterization of power throttling is presented, yet no corresponding evaluation of the offloading mechanism is supplied.

    Authors: We agree that the abstract and proposed-scheme section imply performance benefits from the Nvlink-C2C offloading scheme without supplying supporting measurements. The manuscript's core contribution is the characterization of utilization, interference (including power throttling), and coarse-grained slice mismatches across the listed workloads. The offloading scheme is presented as a targeted proposal to address the observed mismatches rather than as an evaluated solution. We will revise the abstract to foreground the characterization results and describe the scheme as a direction for future work. The proposed-scheme section will be updated to state explicitly that no quantitative evaluation, bandwidth, or saturation data are provided in this manuscript. revision: yes

  2. Referee: [Proposed scheme] Proposed-scheme description: the assumption that applications can be selectively instrumented to route memory traffic over Nvlink-C2C without extensive rewrites or new bottlenecks is stated but not demonstrated; no evidence is given that the interconnect remains unsaturated under realistic concurrent offload loads from multiple tenants.

    Authors: The scheme description does assume selective instrumentation is feasible with modest effort, yet we provide no concrete demonstration or saturation measurements. We will expand the section with workload-specific examples (e.g., directing allocations in NekRS and LAMMPS via existing memory APIs) to illustrate that changes can be localized. Because no multi-tenant offloading experiments were performed, we cannot supply saturation data; we will therefore add an explicit limitations paragraph noting the risk of interconnect contention and the need for future validation. revision: partial
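On the rebuttal's claim that changes can stay localized: in codes whose device buffers flow through a small allocation wrapper, the offload decision can indeed be confined to that wrapper. The sketch below is our own hypothetical illustration of such a shim, not code from NekRS or LAMMPS; the macro names, the size threshold, and the function name are invented for the example.

```cpp
// Hypothetical allocation shim: large buffers go to CPU-preferred managed
// memory reachable over Nvlink-C2C; everything else stays in device memory.
#include <cuda_runtime.h>
#include <cstddef>

#ifndef OFFLOAD_THRESHOLD_BYTES
#define OFFLOAD_THRESHOLD_BYTES (512ull << 20)   // invented 512 MiB cutoff
#endif

inline cudaError_t app_malloc(void **ptr, size_t bytes, int gpu = 0) {
#ifdef ENABLE_C2C_OFFLOAD
    if (bytes >= OFFLOAD_THRESHOLD_BYTES) {
        cudaError_t err = cudaMallocManaged(ptr, bytes);
        if (err != cudaSuccess) return err;
        cudaMemAdvise(*ptr, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
        cudaMemAdvise(*ptr, bytes, cudaMemAdviseSetAccessedBy, gpu);
        return cudaSuccess;
    }
#endif
    return cudaMalloc(ptr, bytes);   // default path: unchanged device allocation
}
```

Whether such a shim avoids new bottlenecks is exactly what the referee asks to be measured; a bandwidth probe of the kind sketched earlier in this review would be one starting point.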

Circularity Check

0 steps flagged

No circularity: empirical characterization and proposal are independent

full rationale

The paper presents direct system-level measurements of MIG partitioning, interference, and scaling behavior across real workloads, followed by a proposed offloading scheme to address observed mismatches. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation. The central claim rests on empirical observations of underutilization and interconnect properties rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard domain assumptions about GPU hardware behavior and interconnect performance; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (2)
  • domain assumption MIG slices provide sufficient isolation for the measured workloads
    Stated implicitly when claiming MIG reduces underutilization while still noting interference through shared resources.
  • domain assumption Nvlink-C2C offers low-latency cache-coherent access suitable for fine-grained offloading
    Central to the proposed scheme but not justified in the abstract.

pith-pipeline@v0.9.0 · 5506 in / 1206 out tokens · 44578 ms · 2026-05-10T16:44:04.301844+00:00 · methodology


Reference graph

Works this paper leans on

34 extracted references · 4 canonical work pages · 2 internal anchors

  [1] M. Han, H. Zhang, R. Chen, and H. Chen, "Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences," in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 539–558.

  [2] F. Strati, X. Ma, and A. Klimovic, "Orion: Interference-aware, fine-grained GPU sharing for ML applications," in Proceedings of the Nineteenth European Conference on Computer Systems, 2024, pp. 1075–1092.

  [3] C. Gregg, J. Dorn, K. Hazelwood, and K. Skadron, "Fine-grained resource sharing for concurrent GPGPU kernels," in 4th USENIX Workshop on Hot Topics in Parallelism (HotPar 12), 2012.

  [4] B. Li, T. Patel, S. Samsi, V. Gadepally, and D. Tiwari, "MISO: Exploiting multi-instance GPU capability on multi-tenant GPU clusters," in Proceedings of the 13th Symposium on Cloud Computing, 2022, pp. 173–189.

  [5] B. Li, V. Gadepally, S. Samsi, and D. Tiwari, "Characterizing multi-instance GPU for machine learning workloads," in 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2022, pp. 724–731.

  [6] B. Wu, Z. Zhang, Z. Bai, X. Liu, and X. Jin, "Transparent GPU sharing in container clouds for deep learning workloads," in 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 69–85.

  [7] A. Weaver, K. Kavi, D. Milojicic, R. P. H. Enriquez, N. Hogade, A. Mishra, and G. Mehta, "Granularity- and interference-aware GPU sharing with MPS," in SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, 2024, pp. 1630–1637.

  [8] Nvidia, "Nvidia Multi-Process Service overview (r575)," 2025.

  [9] M. Pavlidakis, G. Vasiliadis, S. Mavridis, A. Argyros, A. Chazapis, and A. Bilas, "Guardian: Safe GPU sharing in multi-tenant environments," in Proceedings of the 25th International Middleware Conference, 2024, pp. 313–326.

  [10] X. Shi, C. Cai, J. Du, Z. Zhu, and Z. Jia, "Nexus: Taming throughput-latency tradeoff in LLM serving via efficient GPU sharing," arXiv e-prints, 2025.

  [11] C. Tan, Z. Li, J. Zhang, Y. Cao, S. Qi, Z. Liu, Y. Zhu, and C. Guo, "Serving DNN models with multi-instance GPUs: A case of the reconfigurable machine scheduling problem," arXiv preprint arXiv:2109.11067, 2021.

  [12] G. Gilman and R. J. Walls, "Characterizing concurrency mechanisms for NVIDIA GPUs under deep learning workloads," ACM SIGMETRICS Performance Evaluation Review, vol. 49, no. 3, pp. 32–34, 2022.

  [13] Nvidia, "CUDA Runtime API documentation," 2025.

  [14] Nvidia, "CUDA Driver API documentation," 2025.

  [15] Nvidia, "Nvidia Multi-Instance GPU user guide," 2024.

  [16] J. D. McCalpin, "Memory bandwidth and machine balance in current high performance computers," IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19–25, Dec. 1995.

  [17] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou, "The Faiss library," arXiv preprint arXiv:2401.08281, 2024.

  [18] A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman, J. Gacon, S. Martiel, P. D. Nation, L. S. Bishop, A. W. Cross et al., "Quantum computing with Qiskit," arXiv preprint arXiv:2405.08810, 2024.

  [19] P. Fischer, S. Kerkemeier, M. Min, Y.-H. Lan, M. Phillips, T. Rathnayake, E. Merzari, A. Tomboulides, A. Karakus, N. Chalmers et al., "NekRS, a GPU-accelerated spectral element Navier–Stokes solver," Parallel Computing, vol. 114, p. 102982, 2022.

  [20] A. P. Thompson, H. M. Aktulga, R. Berger, D. S. Bolintineanu, W. M. Brown, P. S. Crozier, P. J. In 't Veld, A. Kohlmeyer, S. G. Moore, T. D. Nguyen et al., "LAMMPS - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales," Computer Physics Communications, vol. 271, p. 108171, 2022.

  [21] D. Santos-Martins, L. Solis-Vasquez, A. F. Tillack, M. F. Sanner, A. Koch, and S. Forli, "Accelerating AutoDock4 with GPUs and gradient-based local search," Journal of Chemical Theory and Computation, vol. 17, no. 2, pp. 1060–1073, 2021.

  [22] A. Karpathy, "llm.c: LLM training in simple, raw C/CUDA," https://github.com/karpathy/llm.c, 2024.

  [23] "llm.c," https://github.com/karpathy/llm.c, 2024.

  [24] G. Gerganov and contributors, "llama.cpp: Inference of LLaMA models in pure C/C++," https://github.com/ggerganov/llama.cpp, 2023.

  [25] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., "The Llama 3 herd of models," arXiv e-prints, 2024.

  [26] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in 2009 IEEE International Symposium on Workload Characterization (IISWC), IEEE, 2009, pp. 44–54.

  [27] F. Xu, J. Xu, J. Chen, L. Chen, R. Shang, Z. Zhou, and F. Liu, "iGniter: Interference-aware GPU resource provisioning for predictable DNN inference in the cloud," IEEE Transactions on Parallel and Distributed Systems, vol. 34, no. 3, pp. 812–827, 2022.

  [28] R. Ausavarungnirun, V. Miller, J. Landgraf, S. Ghose, J. Gandhi, A. Jog, C. J. Rossbach, and O. Mutlu, "MASK: Redesigning the GPU memory hierarchy to support multi-application concurrency," ACM SIGPLAN Notices, vol. 53, no. 2, pp. 503–518, 2018.

  [29] B. Li, Y. Wang, T. Wang, L. Eeckhout, J. Yang, A. Jaleel, and X. Tang, "STAR: Sub-entry sharing-aware TLB for multi-instance GPU," in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE, 2024, pp. 309–323.

  [30] B. Zhang, S. Li, and Z. Li, "MIGER: Integrating multi-instance GPU and multi-process service for deep learning clusters," in Proceedings of the 53rd International Conference on Parallel Processing, 2024, pp. 504–513.

  [31] B. Turkkan, P. Murali, P. Harsha, R. Arora, G. Vanloo, and C. Narayanaswami, "Optimal workload placement on multi-instance GPUs," arXiv preprint arXiv:2409.06646, 2024.

  [32] Y. Zhu, C. Wang, M. Calman, R. Nakazawa, and E. K. Lee, "Optimizing GPU multiplexing for efficient and cost-effective access to diverse large language models in GPU clusters," in 2024 32nd International Conference on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), IEEE, 2024, pp. 1–8.

  [33] F. Werner, M. Weisgut, and T. Rabl, "Towards memory disaggregation via NVLink C2C: Benchmarking CPU-requested GPU memory access," in Proceedings of the 4th Workshop on Heterogeneous Composable and Disaggregated Systems, 2025, pp. 8–14.

  [34] G. Schieffer, J. Wahlgren, J. Ren, J. Faj, and I. Peng, "Harnessing integrated CPU-GPU system memory for HPC: A first look into Grace Hopper," in Proceedings of the 53rd International Conference on Parallel Processing, 2024, pp. 199–209.