RAMC: Remote Access Memory Channels over HPE Slingshot

Matthew G. F. Dosanjh; Scott Levy; Whit Schonbein

arxiv: 2606.05094 · v1 · pith:5XL7SX5Hnew · submitted 2026-06-03 · 💻 cs.NI


Pith reviewed 2026-06-28 03:31 UTC · model grok-4.3

classification 💻 cs.NI

keywords: one-sided communication, RDMA, HPE Slingshot, MPI RMA, bandwidth, scalability, libfabric, heat diffusion


The pith

RAMC persistent channels deliver higher bandwidth than MPI on Slingshot networks while scaling to 19.6k processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Remote Access Memory Channels (RAMC) as an explicit one-sided communication library built for HPE Cray Slingshot hardware. RAMC uses persistent uni-directional channels and Slingshot memory region counters to avoid the symmetric memory and collective setup requirements of MPI RMA and OpenSHMEM. A heat diffusion code written with RAMC runs without difficulty at 19.6 thousand processes across 250 nodes. Microbenchmarks across libfabric versions show RAMC achieving 100-130% higher bandwidth than Cray MPI for 1B-4KiB messages under libfabric 1.15.2 and 30-45% higher under 2.3.1. The work positions RAMC as a more flexible alternative to both monolithic one-sided models and partitioned point-to-point interfaces.

Core claim

RAMC is a new explicit one-sided communication library based on the core concept of a persistent uni-directional communication channel that leverages Slingshot's unique memory region counters to enable efficient completion notification, addressing scalability and usability challenges in existing frameworks such as MPI RMA, OpenSHMEM, and PGAS models while outperforming Cray's proprietary MPI implementation in bandwidth for small-to-medium messages.

What carries the argument

Persistent uni-directional communication channel that uses Slingshot memory region counters for completion notification.

If this is right

RAMC applications scale without difficulty to 19.6 thousand processes across 250 nodes.
Bandwidth increases of approximately 100-130% occur for 1B-4KiB messages versus Cray MPI under libfabric 1.15.2.
Bandwidth increases of approximately 30-45% occur for the same message sizes under libfabric 2.3.1.
RAMC retains dynamic flexibility that partitioned MPI communication sacrifices.
Small message latencies remain an area identified for further improvement.
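
The percentage increases listed above can be read as bandwidth ratios with simple arithmetic: a 100% increase is a doubling. The following is a minimal sketch of that conversion. The helper name `increase_to_ratio` is illustrative and not from the paper, and the per-node figure assumes that "19.6 thousand" means roughly 19,600 processes spread evenly across the 250 nodes, which the abstract does not state.

```python
def increase_to_ratio(percent_increase: float) -> float:
    """Convert a percentage increase into a multiplicative ratio."""
    return 1.0 + percent_increase / 100.0


# Approximate ranges quoted in the abstract for 1B-4KiB messages.
reported = {
    "libfabric 1.15.2": (100.0, 130.0),
    "libfabric 2.3.1": (30.0, 45.0),
}

ratios = {
    version: (increase_to_ratio(low), increase_to_ratio(high))
    for version, (low, high) in reported.items()
}

# Assumption: about 19,600 processes, evenly distributed over 250 nodes.
processes = 19_600
nodes = 250
processes_per_node = processes / nodes

for version, (low, high) in ratios.items():
    print(f"{version}: {low:.2f}x to {high:.2f}x the Cray MPI bandwidth")
print(f"roughly {processes_per_node:.1f} processes per node")
```

Read this way, the gap between the two libfabric versions is large: the reported advantage shrinks from at least a doubling to less than a 1.5x ratio.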

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Channel-based designs like RAMC could be ported to other RDMA networks that expose similar completion counters.
Avoiding collective operations for window creation may benefit codes with irregular or dynamic communication patterns.
Production workloads with different message size distributions or synchronization needs could yield different relative performance.
The explicit channel model might complement or replace parts of implicit PGAS implementations without requiring full language changes.

Load-bearing premise

The chosen microbenchmarks and heat diffusion application represent production workloads and the observed bandwidth differences come from the RAMC design rather than unstated implementation or measurement differences.

What would settle it

Running the identical microbenchmarks on the same Slingshot hardware with the same libfabric versions and measurement methodology but finding no bandwidth advantage for RAMC over MPI would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2606.05094 by Matthew G. F. Dosanjh, Scott Levy, Whit Schonbein.

**Figure 2.** Target-side channel data structure. Labels in the figure: target status value address, target buffer address, status value.

**Figure 3.** Initiator-side channel data structure. … counter incrementing indicates the local buffer is available for reuse – that is, an ACK has been received from the target NIC confirming the data has been received – and in the case of a read, the counter indicates the retrieved data is visible to the application. Building on the Portals network programming API [2], Slingshot also provides counters that can be assoc…

**Figure 5.** The Bulletin Board. … when (at step (3)) the initiator returns to check the target’s status again, it is able to perform the write (step (4)). After the write, the initiator may test/wait on the local endpoint counter (configured to count write operations) to determine the source buffer may be reused, and updates its status value to indicate it has moved past the write phase (step (5)). Similarly, the target…

**Figure 6.** Iteration 1000 from a heat diffusion code using RAMC for …

**Figure 7.** Unidirectional bandwidth under libfabric 1.15.2 for OMB …

**Figure 8.** Unidirectional bandwidth under libfabric 2.3.1 for OMB …
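
The caption excerpts for Figures 3 and 5 describe a handshake in which the initiator checks the target's status, performs the write, and then relies on a counter to learn that the source buffer can be reused. The following is a toy, single-process model of that sequence. It is not the RAMC API and makes no libfabric calls; every class, field, and method name is hypothetical. The counters are advanced synchronously here, whereas on real hardware the NIC would update them asynchronously.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical target side: a buffer, a status flag, and a counter."""
    buffer: bytearray
    ready: bool = False   # status value the initiator inspects
    mr_counter: int = 0   # stands in for a memory region counter

    def post_ready(self) -> None:
        """Advertise that the buffer may be overwritten."""
        self.ready = True


@dataclass
class Initiator:
    """Hypothetical initiator side with a local write-completion counter."""
    target: Target
    write_counter: int = 0  # stands in for the local endpoint counter
    phase: str = "idle"

    def try_write(self, data: bytes) -> bool:
        # Check the target's posted status before writing.
        if not self.target.ready:
            return False
        if len(data) > len(self.target.buffer):
            raise ValueError("payload larger than target buffer")
        # The one-sided write lands directly in the target buffer.
        self.target.buffer[: len(data)] = data
        self.target.ready = False
        # Toy model: both counters advance immediately.
        self.target.mr_counter += 1
        self.write_counter += 1
        # Record that the write phase is over.
        self.phase = "written"
        return True


target = Target(buffer=bytearray(8))
initiator = Initiator(target=target)

first_attempt = initiator.try_write(b"heat")   # target has not posted yet
target.post_ready()
second_attempt = initiator.try_write(b"heat")  # now the write proceeds
```

In this model the target never polls its buffer contents: it would only need to compare `mr_counter` against the value it last observed, which is the role the abstract assigns to memory region counters.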

Original abstract

In this paper, we present Remote Access Memory Channels (RAMC), an explicit one-sided communication library designed to leverage the capabilities of HPE Cray Slingshot network hardware. Existing one-sided communication frameworks, such as MPI RMA and OpenSHMEM, rely on monolithic shared memory models that introduce scalability and usability challenges. These frameworks often assume symmetric memory regions or require blocking collective operations for window creation, which can mismatch user communication needs and hinder performance. Implicit models, such as PGAS and UPC, aim to simplify programming by treating local and remote memory as a unified region but ultimately rely on explicit mechanisms to implement data movement. MPI's recently-introduced partitioned communication API offers a persistent point-to-point interface but sacrifices the dynamic flexibility of RDMA. RAMC is designed to address these limitations. Based on the core concept of a persistent uni-directional communication channel, RAMC leverages Slingshot's unique memory region counters to enable efficient completion notification. Experiments with a RAMC-based heat diffusion code demonstrate RAMC has no difficulty scaling to 19.6 thousand processes across 250 nodes, and microbenchmark studies across multiple libfabric versions show RAMC can outperform Cray's proprietary MPI implementation (e.g., increases in bandwidth ranging from approx. 100%-130% for 1B-4KiB messages under libfabric 1.15.2, and from approx. 30%-45% under libfabric 2.3.1) while identifying additional areas for improvement, such as small message latencies.
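
Heat diffusion codes are commonly written as stencil updates with a halo exchange between neighbouring subdomains, and the halo exchange is where one-sided writes naturally fit. The paper's own code is not reproduced on this page, so the following is a generic 1-D sketch under that assumption, not the authors' implementation. The function names and the two-subdomain layout are illustrative; the "exchange" is a plain list assignment standing in for a remote write.

```python
def step(u, alpha):
    """One explicit finite-difference step; both endpoints are held fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new


def step_decomposed(left, right, alpha):
    """Same step on two subdomains that each carry one ghost cell.

    left  = [owned cells..., ghost]  (ghost mirrors right's first owned cell)
    right = [ghost, owned cells...]  (ghost mirrors left's last owned cell)
    """
    # Halo exchange: each side's boundary value is written into the
    # neighbour's ghost cell, the role a one-sided put would play.
    left[-1] = right[1]
    right[0] = left[-2]
    return step(left, alpha), step(right, alpha)


# A single hot cell in an 8-cell rod (illustrative initial condition).
initial = [0.0] * 8
initial[3] = 100.0

reference = initial[:]
left = initial[:4] + [0.0]    # four owned cells plus one ghost
right = [0.0] + initial[4:]   # one ghost plus four owned cells

for _ in range(10):
    reference = step(reference, 0.25)
    left, right = step_decomposed(left, right, 0.25)

# Drop the ghost cells to recover the global field.
combined = left[:-1] + right[1:]
```

The decomposed run reproduces the single-domain result because each subdomain only ever needs one value from its neighbour per step, which is why this kind of code is a natural fit for a persistent channel per neighbour pair.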

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RAMC introduces persistent uni-directional channels with Slingshot counter notifications, but the bandwidth gains over MPI rest on comparisons whose controls are not visible in the abstract.


RAMC is built around persistent one-way channels that use Slingshot memory-region counters for completion instead of the symmetric windows and collectives common in MPI RMA or OpenSHMEM. The paper spells out how those existing models can force extra synchronization or memory symmetry that does not match every workload.

The scaling result to 19.6k processes on 250 nodes is the clearest positive evidence. It shows the library runs at that size without immediate collapse, which is useful data for anyone targeting the same hardware.

The microbenchmark numbers are the part that draws attention: reported bandwidth lifts of roughly 100-130% on 1B-4KiB messages under libfabric 1.15.2 and 30-45% under 2.3.1. Those figures are concrete enough to notice.

The concern is whether those lifts can be credited to the channel design. The abstract gives no description of matched network settings, identical completion paths, or measurement timing between RAMC and the Cray MPI baseline. If those variables differed, the numbers could reflect implementation details rather than the persistent-channel abstraction. The stress-test note on this point holds up from what is shown.

The work is aimed at people who tune communication on Slingshot systems or build libraries for them. A referee could usefully check the implementation against the claimed advantages and ask for the missing experimental controls. It is worth sending out for review.

Referee Report

2 major / 0 minor

Summary. The paper introduces Remote Access Memory Channels (RAMC), an explicit one-sided communication library for HPE Cray Slingshot networks. RAMC is built around persistent uni-directional channels that exploit Slingshot memory-region counters for completion notification. It positions RAMC as addressing limitations in MPI RMA, OpenSHMEM, PGAS, and MPI partitioned communication. The central empirical claims are that a RAMC-based heat-diffusion code scales without difficulty to 19.6k processes on 250 nodes and that microbenchmarks show RAMC delivering 100-130% higher bandwidth than Cray MPI for 1B-4KiB messages under libfabric 1.15.2 (and 30-45% under 2.3.1).

Significance. If the performance and scaling claims are substantiated with adequate experimental controls, RAMC would represent a practical middle ground between fully explicit RDMA and monolithic shared-memory models for Slingshot-class hardware. The work could inform future one-sided library design and provide a concrete alternative for workloads that benefit from persistent uni-directional channels.

major comments (2)

[Abstract] The headline bandwidth claims (100-130% and 30-45% improvements for 1B-4KiB messages) are load-bearing for the contribution, yet the manuscript supplies no description of the experimental controls, baseline MPI configuration, completion-notification paths, timing methodology, or whether identical Slingshot memory-region counter usage was enforced between RAMC and Cray MPI. Without these details the attribution of gains to the persistent-channel abstraction cannot be verified.
[Abstract, heat-diffusion scaling result] The claim that RAMC “has no difficulty scaling to 19.6 thousand processes across 250 nodes” is presented without error bars, run-to-run variability, baseline MPI scaling data on the same platform, or discussion of potential confounds such as network configuration or collective overheads. This information is required to support the scalability assertion.
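
Both comments ask for run-to-run variability alongside the headline numbers. A minimal sketch of that reporting follows, using only the Python standard library. The sample values are made-up placeholders in arbitrary units, not measurements from the paper, and the function names are illustrative.

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) for repeated runs."""
    return statistics.mean(samples), statistics.stdev(samples)


def percent_increase(candidate_mean, baseline_mean):
    """Relative increase of candidate over baseline, in percent."""
    return (candidate_mean - baseline_mean) / baseline_mean * 100.0


# Placeholder samples in arbitrary units; NOT measurements from the paper.
candidate_runs = [10.0, 12.0, 14.0]
baseline_runs = [5.0, 6.0, 7.0]

candidate_mean, candidate_sd = summarize(candidate_runs)
baseline_mean, baseline_sd = summarize(baseline_runs)
increase = percent_increase(candidate_mean, baseline_mean)

print(f"candidate: {candidate_mean:.1f} +/- {candidate_sd:.1f}")
print(f"baseline:  {baseline_mean:.1f} +/- {baseline_sd:.1f}")
print(f"increase:  {increase:.0f}%")
```

Reporting the spread next to each mean lets a reader judge whether a claimed increase is larger than the noise between runs.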

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The two major comments both correctly identify that the abstract and main text lack sufficient experimental detail to support the central performance and scaling claims. We agree that these details are necessary for verification and will make the requested revisions.


Referee: [Abstract] The headline bandwidth claims (100-130% and 30-45% improvements for 1B-4KiB messages) are load-bearing for the contribution, yet the manuscript supplies no description of the experimental controls, baseline MPI configuration, completion-notification paths, timing methodology, or whether identical Slingshot memory-region counter usage was enforced between RAMC and Cray MPI. Without these details the attribution of gains to the persistent-channel abstraction cannot be verified.

Authors: We agree that the manuscript does not currently supply the requested experimental controls. The performance numbers are central to the contribution, and without those controls the attribution cannot be independently verified. We will add a dedicated experimental methodology subsection that describes the Cray MPI baseline configuration, the libfabric versions tested, the timing methodology, completion-notification paths, and explicit confirmation that memory-region counter usage was matched between RAMC and the MPI baseline. The revised text will also clarify how the persistent-channel design produces the observed gains.

Revision: yes

Referee: [Abstract, heat-diffusion scaling result] The claim that RAMC “has no difficulty scaling to 19.6 thousand processes across 250 nodes” is presented without error bars, run-to-run variability, baseline MPI scaling data on the same platform, or discussion of potential confounds such as network configuration or collective overheads. This information is required to support the scalability assertion.

Authors: The referee is correct that the scaling claim lacks the supporting statistical and comparative information needed to substantiate it. We will revise the results section to include error bars on all scaling plots, report run-to-run variability, add direct MPI scaling curves measured on the identical platform and network configuration, and discuss potential confounds including network settings and any collective operations. These additions will be referenced from the abstract.

Revision: yes

Circularity Check

0 steps flagged

No circularity; experimental claims rest on direct benchmarks


The paper introduces RAMC as a new one-sided library and supports its claims solely via microbenchmark bandwidth/latency numbers and a heat-diffusion scaling run at 19.6k processes. No equations, fitted parameters, predictions, or self-citation chains appear in the abstract or described content. All load-bearing statements are empirical comparisons to MPI under stated libfabric versions; these do not reduce to the inputs by construction. This is the expected non-finding for a systems paper whose central contribution is implementation and measurement rather than derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a software library and reports experimental results rather than a mathematical model; no free parameters, axioms, or invented physical entities are described.



Reference graph

Works this paper leans on

27 extracted references · 12 canonical work pages

[1] 2024. OSU Micro-Benchmarks (OMB) 7.5. Online. https://mvapich.cse.ohio-state.edu/benchmarks/ Accessed: 2025.

[2] Brian W. Barrett, Ron Brightwell, Ryan E. Grant, Whit Schonbein, Scott Hemmert, Kevin Pedretti, Keith Underwood, Rolf Riesen, Mathieu Barbe, Luiz H. Suraty Filho, Alexandre Ratchov, and Arthur B. Maccabe. 2022. The Portals 4.3 Network Programming Interface. Technical Report SAND2022-8810. Sandia National Laboratories, Albuquerque, New Mexico. https://www.s...

[3] Brian W. Barrett, Ron Brightwell, K. Scott Hemmert, Kyle B. Wheeler, and Keith D. Underwood. 2011. Using Triggered Operations to Offload Rendezvous Messages. In Recent Advances in the Message Passing Interface (Lecture Notes in Computer Science). Springer, Berlin, Heidelberg, 120–129. https://doi.org/10.1007/978-3-642-24449-0_15

[4] Roberto Belli and Torsten Hoefler. 2015. Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization. In 2015 IEEE International Parallel and Distributed Processing Symposium. 871–881. https://doi.org/10.1109/IPDPS.2015.30

[5] Patrick G. Bridges, Derek Schafer, Jack Lange, James B. White III, Anthony Skjellum, Evan Suggs, Thomas Hines, Purushotham Bangalore, Matthew G. F. Dosanjh, and Whit Schonbein. 2026. Co-Design and Evaluation of a CPU-Free MPI GPU Communication Abstraction and Implementation. arXiv:2602.15356 [cs.DC] https://arxiv.org/abs/2602.15356

[6] Ralph H. Castain, Joshua Hursey, Aurelien Bouteiller, and David Solt. 2018. PMIx: Process management for exascale environments. Parallel Comput. 79 (Nov. 2018), 9–29. https://doi.org/10.1016/j.parco.2018.08.002

[7–8] Matthew GF Dosanjh, Andrew Worley, Derek Schafer, Prema Soundararajan, Sheikh Ghafoor, Anthony Skjellum, Purushotham V Bangalore, and Ryan E Grant. 2021. Implementation and evaluation of MPI 4.0 partitioned communication libraries. Parallel Comput. 108 (2021), 102827.

[9] Kurt B. Ferreira, Patrick Bridges, and Ron Brightwell. 2008. Characterizing application sensitivity to OS interference using kernel-level noise injection. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (Austin, Texas) (SC ’08). IEEE Press, Article 19, 12 pages.

[10] Message Passing Interface Forum. 1997. MPI: A Message-Passing Interface Standard Version 2.0. https://www.mpi-forum.org/docs/mpi-2.0/mpi20-report.pdf

[11] Thomas Gillis, Ken Raffenetti, Hui Zhou, Yanfei Guo, and Rajeev Thakur. 2023. Quantifying the performance benefits of partitioned communication in mpi. In Proceedings of the 52nd International Conference on Parallel Processing. 285–294.

[12] Ryan E Grant, Matthew GF Dosanjh, Michael J Levenhagen, Ron Brightwell, and Anthony Skjellum. 2019. Finepoints: Partitioned multithreaded MPI communication. In International Conference on High Performance Computing. Springer, 330–350.

[13] K. Scott Hemmert, Brian Barrett, and Keith D. Underwood. 2010. Using Triggered Operations to Offload Collective Communication Operations. In Recent Advances in the Message Passing Interface (Lecture Notes in Computer Science). Springer, Berlin, Heidelberg, 249–256. https://doi.org/10.1007/978-3-642-15646-5_26

[14] Nathan Hjelm, Matthew GF Dosanjh, Ryan E Grant, Taylor Groves, Patrick Bridges, and Dorian Arnold. 2018. Improving MPI multi-threaded RMA communication performance. In Proceedings of the 47th International Conference on Parallel Processing. 1–11.

[15] Chung-Hsing Hsu, Neena Imam, Akhil Langer, Sreeram Potluri, and Chris J Newburn. 2020. An initial assessment of nvshmem for high performance computing. In 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 1–10.

[16] Weihang Jiang, Jiuxing Liu, Hyun-Wook Jin, Dhabaleswar K. Panda, Darius Buntinas, Rajeev Thakur, and William D. Gropp. 2004. Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Dieter Kranzlmüller, Péter Kacsuk, and Jack Dongarra (Eds.). Sp...

[17] Jithin Jose, Sreeram Potluri, Hari Subramoni, Xiaoyi Lu, Khaled Hamidouche, Karl Schulz, Hari Sundar, and Dhabaleswar K Panda. 2014. Designing scalable out-of-core sorting with hybrid MPI+PGAS programming models. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. 1–9.

[18] Scott Levy, Kurt B. Ferreira, Patrick Widener, Patrick G. Bridges, and Oscar H. Mondragon. 2016. How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging Latent Synchronization in MPI Collective Algorithms. In Proceedings of the 23rd European MPI Users’ Group Meeting (Edinburgh, United Kingdom) (EuroMPI ’16). Association for Computing Machiner... doi:10.1145/2966884.2966920

[19] Scott Levy, Whit Schonbein, and Craig Ulmer. 2024. Leveraging High-Performance Data Transfer to Offload Data Management Tasks to SmartNICs. In 2024 IEEE International Conference on Cluster Computing (CLUSTER). 346–356. https://doi.org/10.1109/CLUSTER59578.2024.00037

[20] Edgar A. León, Joseph Glenski, Mark Stock, Kim McMahon, William Loewe, Clark Snyder, Larry Kaplan, Srinath Vadlamani, Timothy I. Mattox, Trent D’Hooge, Brian Behlendorf, Nathan Hanford, Ramesh Pankajakshan, and Matthew L. Leininger. 2025. Breaking the System Noise Barrier at Exascale. In Proceedings of the International Conference for High Performance Co... doi:10.1145/3712285.3759793

[21] W. Pepper Marts, Donald A. Kruse, Matthew G. F. Dosanjh, Whit Schonbein, Scott Levy, and Patrick G. Bridges. 2024. CMB: A Configurable Messaging Benchmark to Explore Fine-Grained Communication. In 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid). 28–38. https://doi.org/10.1109/CCGrid59990.2024.00013

[22] Message Passing Interface Forum. 2012. MPI: A Message-Passing Interface Standard Version 3.0. https://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf

[23] Message Passing Interface Forum. 2021. MPI: A Message-Passing Interface Standard Version 4.0. https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf

[24] Naveen Namashivayam, Krishna Kandalla, Trey White, Nick Radcliffe, Larry Kaplan, and Mark Pagel. 2022. Exploring GPU Stream-Aware Message Passing using Triggered Operations. https://arxiv.org/abs/2208.04817

[25] Stephen W. Poole, Oscar Hernandez, Jeffery A. Kuehn, Galen M. Shipman, Anthony Curtis, and Karl Feind. 2011. OpenSHMEM - Toward a Unified RMA Model. Springer US, Boston, MA, 1379–1391. https://doi.org/10.1007/978-0-387-09766-4_490

[26] Yıltan Hassan Temuçin, Whit Schonbein, Scott Levy, Amirhossein Sojoodi, Ryan E Grant, and Ahmad Afsahi. 2024. Design and Implementation of MPI-Native GPU-Initiated MPI Partitioned Communication. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 436–447.

[27] K. D. Underwood, J. Coffman, R. Larsen, K. S. Hemmert, B. W. Barrett, R. Brightwell, and M. Levenhagen. 2011. Enabling Flexible Collective Communication Offload with Triggered Operations. In 2011 IEEE 19th Annual Symposium on High Performance Interconnects. 35–42. https://doi.org/10.1109/HOTI.2011.15
