
arxiv: 2604.22126 · v1 · submitted 2026-04-24 · 💻 cs.DC

GICC: A High-Performance Runtime for GPU-Initiated Communication and Coordination in Modern HPC Systems

Pith reviewed 2026-05-08 10:03 UTC · model grok-4.3

classification 💻 cs.DC
keywords GPU-initiated communication · HPC runtime · Slingshot interconnect · InfiniBand · coordination latency · weak scaling · stencil computation · NIC completion signaling

The pith

GICC lets GPU kernels initiate network coordination directly without host involvement on the fast path.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GICC, a runtime that lets GPU kernels trigger NIC-level operations and cross-node coordination autonomously. On Slingshot, existing runtimes rely on host-driven progress and lack a bounded mechanism for recycling pre-staged NIC work across repeated GPU-triggered operations; on InfiniBand, GPU-initiated communication exists but incurs unnecessary locking and synchronization. GICC separates coordination semantics from data movement and has the NIC signal completion to both GPU and host memory, so a lightweight host thread can recycle resources without blocking the GPU path. This sustains repeated operations under finite NIC state. It matters for distributed GPU codes because halo exchanges can start as soon as boundary data is ready, which improves compute-communication overlap and scaling.

Core claim

GICC is a high-performance runtime enabling GPU-initiated communication and coordination on modern HPC systems by decoupling coordination semantics from data movement and introducing asynchronous resource reclamation through concurrent NIC completion signaling to GPU and host memory, which sustains repeated operations under finite NIC state.

What carries the argument

Asynchronous resource reclamation: the NIC signals completion to both GPU and host memory, so a lightweight host thread can recycle NIC resources without injecting latency into the GPU coordination path.
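The division of labor can be illustrated with a small host-only model. This is a sketch under stated assumptions, not GICC's implementation: `SlotPool`, `gpu_trigger`, `nic_complete`, and `host_recycle` are hypothetical names, Python lists stand in for GPU-visible and host-visible memory, and the "NIC" is an ordinary method call.

```python
from collections import deque


class SlotPool:
    """Toy model of a finite pool of pre-staged NIC work slots (hypothetical)."""

    def __init__(self, num_slots):
        self.free = deque(range(num_slots))   # slots ready to be triggered
        self.gpu_flags = [0] * num_slots      # stands in for GPU-visible memory
        self.host_flags = [0] * num_slots     # stands in for host-visible memory

    def gpu_trigger(self):
        """GPU path: take a pre-staged slot; None means NIC state is exhausted."""
        return self.free.popleft() if self.free else None

    def nic_complete(self, slot):
        """NIC path: signal completion to both memories."""
        self.gpu_flags[slot] += 1   # the GPU kernel would poll this and continue
        self.host_flags[slot] = 1   # the host thread would see this and recycle

    def host_recycle(self):
        """Host path: re-stage completed slots; returns how many were recycled."""
        recycled = 0
        for slot, done in enumerate(self.host_flags):
            if done:
                self.host_flags[slot] = 0
                self.free.append(slot)
                recycled += 1
        return recycled


pool = SlotPool(num_slots=2)
first, second = pool.gpu_trigger(), pool.gpu_trigger()
exhausted = pool.gpu_trigger() is None   # no slot left until the host recycles
pool.nic_complete(first)
pool.host_recycle()                      # makes `first` available again
```

In the model, the GPU path only ever reads `gpu_flags` and pops from `free`; all re-staging work sits in `host_recycle`, which is the separation the paper's claim depends on.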

If this is right

  • Stencil computations initiate halo exchanges as soon as boundary regions are computed, allowing finer-grained overlap between interior work and boundary transfers.
  • Per-coordination latency drops by up to 229x on Slingshot interconnects.
  • Weak scaling efficiency rises by up to 25 percent on Slingshot.
  • Put latency on InfiniBand is up to 1.95x lower than with NVSHMEM, which the paper attributes to removing unnecessary locking and synchronization.
  • An industrial stencil proxy on 64 AMD MI250X GCDs reaches 42 percent parallel efficiency versus 35.4 percent with MPI.
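To put the stencil-proxy figures in perspective, the arithmetic below assumes the common definition of parallel efficiency, E = S / p (achieved speedup divided by the device count). The baseline the paper measures against is not stated in this summary, so the implied speedups are illustrative only; `implied_speedup` is a hypothetical helper.

```python
def implied_speedup(efficiency, devices):
    """Speedup implied by parallel efficiency E = S / p (assumed definition)."""
    return efficiency * devices


DEVICES = 64                                  # AMD MI250X GCDs in the reported run
gicc = implied_speedup(0.42, DEVICES)         # about 26.9
mpi = implied_speedup(0.354, DEVICES)         # about 22.7
relative_gain = gicc / mpi - 1                # about 0.186, i.e. roughly 18.6 percent
point_gain = (0.42 - 0.354) * 100             # about 6.6 percentage points
```

The two ways of stating the gap differ noticeably: 6.6 percentage points of efficiency corresponds to a relative improvement of roughly 18.6 percent under this definition.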

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • CPU cores could be freed for other tasks in GPU-heavy clusters since the host thread only performs lightweight recycling.
  • The dual-signaling pattern may support more complex GPU-driven workflows such as dynamic load balancing across nodes.
  • Similar NIC-assisted reclamation could be adapted to other interconnect families to extend the approach beyond OFI and InfiniBand.
  • At larger scales the reduction in host involvement might lower overall system power draw during communication-heavy phases.

Load-bearing premise

The NIC must reliably signal completion to both GPU and host memory concurrently without injecting latency into the GPU path or exhausting state under repeated operations.

What would settle it

A high-frequency loop of GPU-triggered coordination operations that exhausts NIC state and then shows whether host recycling adds measurable delay to the GPU coordination path or eliminates the reported latency gains.
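Before running such a test on hardware, the expected shape of the result can be reasoned about with a toy timing model. The sketch below assumes each slot becomes reusable a fixed `recycle_latency` after it is issued (completion plus host re-staging) and that the GPU wants to issue one operation every `op_interval`; the function name, parameters, and time units are all hypothetical, and nothing here is measured data.

```python
import heapq


def simulate(num_slots, num_ops, op_interval, recycle_latency):
    """Return the stall time of each operation in a finite-slot model."""
    slot_free_at = [0.0] * num_slots   # min-heap of times when slots are usable
    heapq.heapify(slot_free_at)
    ready, stalls = 0.0, []
    for _ in range(num_ops):
        earliest = heapq.heappop(slot_free_at)   # soonest reusable slot
        issue = max(ready, earliest)             # wait if no slot is free yet
        stalls.append(issue - ready)
        heapq.heappush(slot_free_at, issue + recycle_latency)
        ready = issue + op_interval              # next operation becomes ready
    return stalls


fast_recycle = simulate(num_slots=4, num_ops=1000, op_interval=1.0, recycle_latency=3.0)
slow_recycle = simulate(num_slots=4, num_ops=1000, op_interval=1.0, recycle_latency=8.0)
```

In this model stalls vanish whenever `num_slots * op_interval >= recycle_latency`, so a real experiment would aim to push the trigger rate past that point and check whether GPU-visible latency stays flat.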

Figures

Figures reproduced from arXiv: 2604.22126 by Baodi Shan, Barbara Chapman, Mauricio Araya-Polo.

Figure 1: Overview of GICC. On InfiniBand, GPUs directly …
Figure 2: Four common communication modes between GPU …
Figure 3: Execution time breakdown of a phase-based GPU …
Figure 4: Boundary conditions for GPU-driven coordination on Slingshot (CXI) with libfabric manual progress. The host …
Figure 5: GICC put example (pseudocode) on OFI/CXI: the host pre-stages two put operations into the deferred work queue (DWQ) with thresholds 1 and 2 on the same trigger counter; a single GPU counter update to 2 releases both simultaneously. gicc_wait_until polls GPU-visible flags using comparison operators (e.g., GICC_CMP_GE for ≥). Consequently, GPU threads can initiate transfers but cannot autonomously determin…
Figure 6: Coordination microbenchmark with fixed computa…
Figure 7: P2P put latency on Tioga (HPE Slingshot 11 + AMD MI250X). [Plot text extracted with this caption: x-axis "Message Size" (4B, 256B, 16K, 1M); y-axis "Latency (µs)" (10^1 to 10^2); legend entries GICC, NVSHMEM, Thread, Warp, Block]
Figure 9: Weak-scaling efficiency of a 2D Jacobi stencil on …
Figure 10: Speedup of GICC over MPI for distributed matrix …
Figure 11: Performance comparison of GICC and MPI implementations for Minimod on Tioga (AMD MI250X + HPE Slingshot): …
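The threshold-triggered behavior described in the Figure 5 caption can be modeled in a few lines. This is a host-only sketch, not the paper's pseudocode or the libfabric API: `DeferredWorkQueue`, `prestage`, `set_counter`, `wait_until`, and the `CMP` table are hypothetical stand-ins (the source names only `gicc_wait_until` and `GICC_CMP_GE`).

```python
import operator

# Hypothetical stand-ins for comparison constants such as GICC_CMP_GE.
CMP = {"GE": operator.ge, "EQ": operator.eq, "GT": operator.gt}


class DeferredWorkQueue:
    """Toy model of threshold-triggered operations on one trigger counter."""

    def __init__(self):
        self.counter = 0
        self.pending = []    # (threshold, label) pairs pre-staged by the host
        self.released = []   # labels in the order the model "NIC" fired them

    def prestage(self, threshold, label):
        """Host side: queue an operation that fires once counter >= threshold."""
        self.pending.append((threshold, label))

    def set_counter(self, value):
        """GPU side: one counter write releases every operation it now covers."""
        self.counter = value
        fired = [op for op in self.pending if self.counter >= op[0]]
        self.pending = [op for op in self.pending if self.counter < op[0]]
        self.released.extend(label for _, label in sorted(fired))
        return len(fired)


def wait_until(read_flag, cmp_name, target, max_polls=1000):
    """Poll a flag until CMP[cmp_name](flag, target) holds; True on success."""
    for _ in range(max_polls):
        if CMP[cmp_name](read_flag(), target):
            return True
    return False


dwq = DeferredWorkQueue()
dwq.prestage(1, "put_a")             # threshold 1
dwq.prestage(2, "put_b")             # threshold 2, same trigger counter
fired_at_once = dwq.set_counter(2)   # a single update covers both thresholds
```

The bounded `max_polls` is a convenience of the model so that it always terminates; a device-side wait would typically spin until the condition holds.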
Original abstract

Distributed GPU applications increasingly rely on kernel-level, cross-node coordination to reduce launch overheads and improve compute-communication overlap, but such support is lacking. On OFI-based interconnects such as HPE Slingshot, which powers six of the top ten systems in the November 2025 Top500, including the top three, GPU kernels cannot autonomously drive distributed coordination: existing runtimes rely on host-driven progress and lack a bounded mechanism for recycling pre-staged NIC work across repeated GPU-triggered operations. On InfiniBand, GPU-initiated communication is possible, but current implementations incur unnecessary synchronization and locking overheads. This paper presents GICC, a framework that enables GPU kernels to directly trigger NIC-level operations without host involvement on the fast path. In stencils, GPU threads initiate halo exchanges as soon as boundary regions are computed, enabling fine-grained overlap between interior computation and boundary transfer. GICC decouples coordination semantics from data movement and introduces asynchronous resource reclamation: the NIC signals completion to both GPU and host memory, letting a lightweight host thread recycle NIC resources concurrently with GPU execution without injecting latency into the coordination path. This sustains GPU-driven coordination under finite NIC state, absent from existing OFI-based runtimes. We implement GICC on NVIDIA and AMD GPUs over InfiniBand and Slingshot. On Slingshot, GICC reduces per-coordination latency by up to 229x and improves weak scaling efficiency by up to 25%. On InfiniBand, it achieves up to 1.95x lower put latency than NVSHMEM by eliminating unnecessary locking and synchronization. On an industrial stencil proxy on 64 AMD MI250X GCDs, GPU-aware MPI incurs over 52% higher communication time than GICC, which achieves 42% parallel efficiency versus MPI's 35.4%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents GICC, a runtime framework enabling GPU kernels to directly initiate NIC-level communication and coordination on OFI-based interconnects (Slingshot) and InfiniBand without host involvement on the fast path. It decouples coordination semantics from data movement and introduces asynchronous resource reclamation, where the NIC signals completion concurrently to GPU and host memory so a lightweight host thread can recycle pre-staged work requests. The work is implemented on NVIDIA and AMD GPUs and evaluated on latency, weak scaling, and an industrial stencil proxy on 64 AMD MI250X GCDs, claiming up to 229x lower per-coordination latency on Slingshot, 1.95x lower put latency than NVSHMEM on InfiniBand, and 42% parallel efficiency versus MPI's 35.4%.

Significance. If the central mechanism and reported measurements hold under scrutiny, GICC would represent a meaningful advance for distributed GPU applications on modern HPC systems by enabling finer-grained compute-communication overlap and reducing host-driven progress overheads. The multi-vendor implementation and focus on Slingshot (dominant in recent Top500) are strengths; the approach could influence future runtime designs for GPU-initiated collectives and stencils.

major comments (2)
  1. [asynchronous resource reclamation mechanism] The description of asynchronous resource reclamation (abstract and mechanism overview) states that the NIC signals completion concurrently to both GPU and host memory to enable host-thread recycling without injecting latency into the GPU path. However, no details are provided on the exact signaling primitives (e.g., specific CQ or event mechanisms, atomicity guarantees, or per-signal NIC state consumption), nor is there microbenchmark evidence that repeated concurrent signals remain reliable under finite NIC state without contention, serialization, or GPU-visible latency spikes. This assumption is load-bearing for the claim of sustained GPU-driven coordination absent host fast-path involvement.
  2. [evaluation sections] Performance claims such as the 229x per-coordination latency reduction on Slingshot and 1.95x put latency improvement versus NVSHMEM rest on reported measurements, yet the manuscript provides no implementation details, benchmark descriptions, error bars, number of runs, or data exclusion criteria. This makes it impossible to verify whether the numbers support the central claims of improved weak scaling efficiency (up to 25%) and stencil proxy efficiency (42% vs. 35.4%).
minor comments (2)
  1. [abstract] The abstract and claims would benefit from explicit baselines for each 'up to' figure (e.g., which existing runtime or configuration yields the 229x latency reduction) to aid interpretation.
  2. [framework description] Notation for coordination primitives and resource recycling could be clarified with a small diagram or pseudocode to distinguish GPU-visible vs. host-visible paths.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and agree that additional details are needed in both the mechanism description and evaluation sections to strengthen the paper. We will incorporate the requested clarifications and supporting evidence in the revised version.

Point-by-point responses
  1. Referee: [asynchronous resource reclamation mechanism] The description of asynchronous resource reclamation (abstract and mechanism overview) states that the NIC signals completion concurrently to both GPU and host memory to enable host-thread recycling without injecting latency into the GPU path. However, no details are provided on the exact signaling primitives (e.g., specific CQ or event mechanisms, atomicity guarantees, or per-signal NIC state consumption), nor is there microbenchmark evidence that repeated concurrent signals remain reliable under finite NIC state without contention, serialization, or GPU-visible latency spikes. This assumption is load-bearing for the claim of sustained GPU-driven coordination absent host fast-path involvement.

    Authors: We agree that the current description of the asynchronous resource reclamation mechanism lacks sufficient low-level detail. In the revised manuscript, we will expand the relevant section to specify the signaling primitives: we use OFI completion queues (CQs) with event-based notifications that allow the NIC to post concurrent signals to both GPU-visible and host memory regions. Atomicity is guaranteed by the underlying NIC hardware (Slingshot and InfiniBand) for the per-signal state updates, with bounded NIC resource consumption per pre-staged work request. We will also add microbenchmark results (new Figure) showing sustained operation under repeated concurrent signals at high load, confirming no measurable contention, serialization, or GPU-visible latency spikes. These additions will directly support the claim that GICC enables sustained GPU-driven coordination without host fast-path involvement. revision: yes

  2. Referee: [evaluation sections] Performance claims such as the 229x per-coordination latency reduction on Slingshot and 1.95x put latency improvement versus NVSHMEM rest on reported measurements, yet the manuscript provides no implementation details, benchmark descriptions, error bars, number of runs, or data exclusion criteria. This makes it impossible to verify whether the numbers support the central claims of improved weak scaling efficiency (up to 25%) and stencil proxy efficiency (42% vs. 35.4%).

    Authors: We acknowledge that the evaluation section requires more transparency to allow independent verification. In the revised manuscript, we will add a dedicated subsection detailing the benchmark implementations, including exact test configurations (e.g., message sizes, iteration counts, and hardware setups on Slingshot and InfiniBand), descriptions of the latency and weak-scaling workloads, and the industrial stencil proxy. We will report error bars as standard deviations from 1000 iterations per point across 10 independent runs, and clarify data exclusion criteria (warm-up phase of 100 iterations excluded, no other filtering). These revisions will enable readers to assess the reported improvements, including the 229x latency reduction, 1.95x put latency gain, 25% weak-scaling improvement, and the 42% vs. 35.4% parallel efficiency on 64 AMD MI250X GCDs. revision: yes
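The reporting protocol promised in the second response can be sketched as follows. The pooling choice is an assumption: the rebuttal does not say whether the standard deviation is taken over all post-warm-up samples or over per-run means, and this sketch pools all samples. `summarize` and its arguments are hypothetical names.

```python
import statistics


def summarize(runs, warmup=100):
    """Mean and sample standard deviation of latency across independent runs.

    runs: list of lists, one list of per-iteration latency samples per run.
    warmup: number of leading samples dropped from every run.
    """
    kept = [sample for run in runs for sample in run[warmup:]]
    if len(kept) < 2:
        raise ValueError("need at least two samples after warm-up exclusion")
    return statistics.mean(kept), statistics.stdev(kept)


# Two tiny synthetic runs: 100 slow warm-up samples followed by two real ones.
synthetic_runs = [[100.0] * 100 + [1.0, 3.0], [100.0] * 100 + [1.0, 3.0]]
mean_latency, std_latency = summarize(synthetic_runs)
```

With the stated protocol the input would be 10 runs of per-iteration samples; the synthetic data above exists only to show that warm-up samples do not leak into the statistics.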

Circularity Check

0 steps flagged

No circularity; empirical implementation with direct measurements

full rationale

The paper presents GICC as an engineering runtime for GPU-initiated coordination on OFI and InfiniBand interconnects, describing design choices such as decoupling coordination from data movement and asynchronous NIC signaling for resource reclamation. All performance claims (e.g., 229x latency reduction, 1.95x lower put latency, 42% parallel efficiency) are stated as results of implementation and benchmarking on specific hardware, with no equations, derivations, fitted parameters, or predictions that reduce to prior inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the existence and correct functioning of the GICC implementation; no free parameters or invented physical entities are introduced, only a new software runtime.

axioms (1)
  • domain assumption: NIC hardware supports signaling completion to both GPU and host memory concurrently
    Invoked in the description of the asynchronous resource reclamation mechanism
invented entities (1)
  • GICC runtime framework (no independent evidence)
    purpose: Enables GPU kernels to trigger NIC operations and coordinate resource reclamation
    New software artifact introduced by the paper; no independent evidence outside the implementation itself


venue HPDC ’26, July 13–16, 2026, Cleveland, OH, USA