
arxiv: 2605.03983 · v1 · submitted 2026-05-05 · 💻 cs.DC

Implementing True MPI Sessions and Evaluating MPI Initialization Scalability

Pith reviewed 2026-05-07 14:03 UTC · model grok-4.3

classification 💻 cs.DC
keywords MPI Sessions · MPI-4 · MPICH · scalability · initialization · communicators · exascale

The pith

True MPI Sessions implemented via MPICH refactoring remove the MPI_COMM_WORLD dependency and improve initialization scalability

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes the changes made to MPICH to support true MPI Sessions as specified in the MPI-4 standard. Sessions let applications build communicators from chosen process sets rather than requiring a single global world communicator. Prior MPICH support still used the global model internally, so the team performed a major refactoring to decouple the two paths. Evaluation of the updated code shows that explicit hierarchical designs become practical and deliver better scalability during MPI setup. A reader would care because this change targets potential bottlenecks in exascale systems where process counts are very large.

Core claim

True MPI Sessions, achieved through architectural refactoring that eliminates internal reliance on a global world communicator, allow explicit hierarchical communicator designs and produce measurable scalability gains in MPI initialization.

What carries the argument

The Sessions model, which builds communicators from process sets without depending on MPI_COMM_WORLD.
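In MPI-4 this flow is expressed with MPI_Session_init, MPI_Group_from_session_pset, and MPI_Comm_create_from_group. The sketch below is only a toy, in-memory model of those three steps in Python; it calls no MPI library, and the class names, method names, and the "app://node0" process set are assumptions made for illustration.

```python
# Toy model only: no MPI library is involved, and every class and method name
# below is hypothetical. "mpi://WORLD" and "mpi://SELF" are the process set
# names predefined by MPI-4; "app://node0" is made up for this example.
from dataclasses import dataclass


@dataclass(frozen=True)
class Comm:
    """Stand-in for a communicator: an ordered tuple of global process ids."""
    members: tuple

    @property
    def size(self):
        return len(self.members)

    def rank_of(self, process_id):
        return self.members.index(process_id)


class Session:
    """Stand-in for a session that resolves process-set names locally."""

    def __init__(self, my_id, psets):
        self.my_id = my_id
        self._psets = {name: tuple(ids) for name, ids in psets.items()}
        self._psets["mpi://SELF"] = (my_id,)

    def group_from_pset(self, name):
        # Mirrors the role of MPI_Group_from_session_pset: name -> group.
        return self._psets[name]

    def comm_from_group(self, group):
        # Mirrors the role of MPI_Comm_create_from_group: group -> communicator.
        if self.my_id not in group:
            raise ValueError("calling process is not a member of the group")
        return Comm(group)


# Process 2 of an 8-process job builds a 4-process communicator and never
# asks for the world set.
session = Session(my_id=2, psets={"mpi://WORLD": range(8), "app://node0": [0, 1, 2, 3]})
node_comm = session.comm_from_group(session.group_from_pset("app://node0"))
self_comm = session.comm_from_group(session.group_from_pset("mpi://SELF"))
print(node_comm.size, node_comm.rank_of(2), self_comm.size)
```

The point of the toy is the order of operations: the communicator is derived from a chosen set, so nothing in the path requires a world-sized object to exist first.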

If this is right

  • Applications written with the Sessions API can initialize without the overhead of constructing a global communicator.
  • Hierarchical communicator layouts become usable without global-state costs.
  • The traditional world model stays available for backward compatibility.
  • Initialization time grows more slowly with process count in the Sessions path.
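To make the hierarchical bullet concrete, here is a small counting model, assuming a two-level layout: one node-local communicator per node plus one communicator that joins a single leader per node. The function names and the 64 processes per node are hypothetical, and the numbers are arithmetic on the layout, not measurements of MPICH.

```python
# Counting model only, under stated assumptions: a two-level layout with one
# node-local communicator per node and one communicator joining a single
# leader per node. Function names and the 64 processes per node are
# hypothetical; this measures nothing about MPICH itself.


def flat_world_size(nodes, procs_per_node):
    """Membership of a single communicator spanning every process."""
    return nodes * procs_per_node


def hierarchical_membership(nodes, procs_per_node, is_leader):
    """Total membership of the communicators one process belongs to."""
    node_local = procs_per_node
    leaders = nodes if is_leader else 0
    return node_local + leaders


nodes, ppn = 2048, 64
print(flat_world_size(nodes, ppn))                  # 131072
print(hierarchical_membership(nodes, ppn, True))    # 2112
print(hierarchical_membership(nodes, ppn, False))   # 64
```

Under these assumptions a leader takes part in communicators totalling 2112 members and a non-leader in 64, against 131072 for a flat world. Whether that difference translates into time or memory savings is what the paper's evaluation addresses.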

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other MPI implementations may need similar internal decoupling to realize the full intent of the Sessions feature.
  • Reduced global state could simplify adding dynamic process management or fault tolerance in future MPI versions.
  • The work suggests that minimizing shared global data structures is a general route to better scaling in parallel runtimes.

Load-bearing premise

The refactoring preserves full correctness and performance for all existing applications that continue to use the traditional world communicator model.

What would settle it

A direct measurement of MPI initialization time versus process count, comparing the traditional world-communicator path against the true Sessions path on systems with thousands to millions of processes.
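One way such a measurement could be summarized is by the slope of initialization time against process count on a log-log scale. The sketch below assumes that approach; the function name is hypothetical, and the two input curves are synthetic placeholders generated from known formulas, not results from the paper.

```python
# Sketch of a scaling summary: least-squares slope of log(time) against
# log(process count). The curves below are synthetic placeholders built from
# known formulas so the fit can be checked; they are NOT measurements.
import math


def scaling_exponent(process_counts, times):
    """Return the fitted exponent a in time ~ c * count**a."""
    xs = [math.log(n) for n in process_counts]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


counts = [2 ** k for k in range(10, 21)]             # 1024 up to 1048576
linear_like = [1e-4 * n for n in counts]             # synthetic: exponent 1
sqrt_like = [1e-2 * math.sqrt(n) for n in counts]    # synthetic: exponent 0.5

print(round(scaling_exponent(counts, linear_like), 6))
print(round(scaling_exponent(counts, sqrt_like), 6))
```

With real timings in place of the synthetic curves, a lower fitted exponent for the Sessions path than for the world path would support the claim that its initialization time grows more slowly.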

Figures

Figures reproduced from arXiv: 2605.03983 by Hui Zhou, Kenneth Raffenetti, Michael Wilkins, Rajeev Thakur, Yanfei Guo.

Figure 1: MPICH architecture. The binding layer handles parameter checking and converts MPI object handles to internal structure pointers. The MPIR layer provides device-independent utilities, while the device layer implements hardware-specific functionality. MPICH maintains CH3 and CH4 as reference device implementations. Vendors may adopt CH3/CH4 or implement their own devices that conform to the ADI interface.…
Figure 2: A common process launch mechanism in MPI. The PMI server (e.g., mpiexec) launches PMI proxies (usually one per compute node), and each proxy launches the MPI processes for its node.
… other processes. In contrast, a collective initialization step involves coordination across processes, often requiring data exchange and execution barriers. In the pseudocode above, MPIR_Pre_Init is entirely local, but both MPI…
Figure 3: An example of data exchange using PMI. (1) P0 uses PMI_Put to send data to its PMI proxy. (2) All MPI processes call PMI_Barrier, during which proxies synchronize local data to the PMI server, and the PMI server propagates the data to other proxies. (3) P6 calls PMI_Get to retrieve the data from its local proxy.
… achieved during the collective PMI_Barrier. Both PMI_Put and PMI_Get are local operations that …
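The three-step exchange in the Figure 3 caption can be modeled in a few lines. This is a toy, in-memory sketch, not PMI itself: the class names, the barrier function, and the key and value strings are hypothetical, and it assumes that a put only becomes readable after the next barrier.

```python
# Toy model of the put / barrier / get exchange in the Figure 3 caption.
# No real PMI library is used; all names and values here are hypothetical.


class PMIServer:
    def __init__(self):
        self.kvs = {}


class PMIProxy:
    def __init__(self):
        self.pending = {}   # puts from local processes, not yet synchronized
        self.visible = {}   # data that local processes can read with get

    def put(self, key, value):
        # Local operation: only the process's own proxy is touched.
        self.pending[key] = value

    def get(self, key):
        # Local operation: reads what the last barrier delivered.
        return self.visible.get(key)


def barrier(server, proxies):
    """Collective step: proxies push data up, then the server propagates it."""
    for proxy in proxies:
        server.kvs.update(proxy.pending)
        proxy.pending.clear()
    for proxy in proxies:
        proxy.visible = dict(server.kvs)


server = PMIServer()
proxy_a, proxy_b = PMIProxy(), PMIProxy()    # P0 sits under A, P6 under B

proxy_a.put("addr-P0", "placeholder-address")    # step 1: P0 puts
before = proxy_b.get("addr-P0")                  # not visible yet
barrier(server, [proxy_a, proxy_b])              # step 2: everyone synchronizes
after = proxy_b.get("addr-P0")                   # step 3: P6 gets
print(before, after)
```

In this model all cross-node traffic happens inside the barrier, which matches the caption's description of put and get as local operations.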
Figure 4: Comparing MPI Initialization between the world model and the session model using mpich-dev. The session model measurements are split into session init and bootstrapping the self and the world communicators. (a) Initialization times in seconds. (b) Node memory usage in GB.
The results reveal several key observations. Local initialization, represented by MPI_Session_init, accounts for the majority of both i…
Figure 5: Comparing MPI Initialization between the world model and the session model using MPICH 4.3.0. (a) Initialization times in seconds. (b) Node memory usage in GB.
… system resource manager. Thus, further insights are needed to interpret this experiment.
[Figure 6, panel (a): OMPI-5.0.7 Init Time. x-axis: number of nodes (1 to 2048). y-axis: seconds (0 to 140). Series: MPI_Init, Session Init, Self Comm, World Comm.] …
Figure 6: Comparing MPI Initialization between the world model and the session model using Open MPI 5.0.7. (a) Initialization times in seconds. A zoomed section shows the same data from 1 to 128 nodes. (b) Node memory usage in GB.
4.3 Sparse World Initialization
By supporting true MPI Sessions, applications can bypass the creation of a global world communicator altogether. One compelling use case is a sparsely connec…
Figure 7 compares initialization time and memory usage between the traditional world model and the Sessions-based sparse model. The results confirm that constructing a sparse world using MPI Sessions reduces both initialization time and memory consumption relative to building a full world communicator in the Sessions model (…
Original abstract

Sessions is one of the major features introduced in the MPI-4 standard. It offers an alternative to the traditional world communicator model by allowing applications to construct communicators from process sets, thereby eliminating the dependency on MPI_COMM_WORLD. The Sessions model was proposed as a more scalable solution for exascale systems, where MPI_COMM_WORLD was viewed as a potential scalability bottleneck. However, supporting Sessions is a significant challenge for established codebases like MPICH due to the deep integration of the world model in traditional MPI implementations. Although MPICH added support for the MPI-4 standard upon its release, it still internally relied on a global world communicator. This approach enabled applications written using the Sessions model to function, but it did not fulfill the full design intent of Sessions, which was meant to decouple MPI from MPI_COMM_WORLD. We describe the MPICH effort to support true MPI Sessions, including a major internal refactoring. We describe the architectural changes required to support true Sessions and evaluate the resulting implementation's scalability. Our results demonstrate that true Sessions can offer significant scalability benefits by adopting explicit hierarchical designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper describes MPICH's implementation of true MPI Sessions via a major internal refactoring that removes the global MPI_COMM_WORLD dependency, allowing communicators to be built from process sets as intended by the MPI-4 standard. It details the required architectural changes and presents an evaluation of MPI initialization scalability, claiming that explicit hierarchical designs in the true Sessions model yield significant scalability benefits over prior 'fake Sessions' approaches that still relied on the world communicator.

Significance. If the reported scalability gains hold under broader testing, this work would be a meaningful contribution to exascale MPI design by validating the Sessions model as a practical, decoupled alternative to the traditional world-communicator approach. It provides concrete evidence from a production implementation that could inform both MPI library developers and application writers targeting large-scale systems.

major comments (1)
  1. [Evaluation section] The scalability results and claims focus exclusively on the new Sessions initialization path and hierarchical designs. No data, test-suite results, or performance comparisons are presented for legacy applications that continue to use MPI_COMM_WORLD and the traditional communicator model after the refactoring. This omission is load-bearing: the central claim, that the changes deliver true Sessions without side effects, requires evidence that the legacy paths retain full correctness and incur no additional overhead from the removal of global assumptions.
minor comments (2)
  1. [Architectural changes] The description of the refactoring would benefit from a clearer before/after diagram or pseudocode showing how global state was eliminated and how process-set-based communicator construction now operates.
  2. Ensure that all reported scalability numbers include the exact process counts, hardware configuration, and comparison baseline (e.g., pre-refactoring MPICH) so readers can reproduce the claimed benefits.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for major revision. The point raised about evaluating legacy paths is valid and we will strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: [Evaluation section] The scalability results and claims focus exclusively on the new Sessions initialization path and hierarchical designs. No data, test-suite results, or performance comparisons are presented for legacy applications that continue to use MPI_COMM_WORLD and the traditional communicator model after the refactoring. This omission is load-bearing: the central claim, that the changes deliver true Sessions without side effects, requires evidence that the legacy paths retain full correctness and incur no additional overhead from the removal of global assumptions.

    Authors: We agree that demonstrating the absence of side effects on legacy code is essential to support our claims. Although the primary contribution of the paper is the true Sessions implementation and its scalability advantages, the refactoring was designed to preserve full backward compatibility. In the revised version we will add a new subsection to the Evaluation section that reports: (1) results from the MPICH test suite confirming that all legacy MPI_Init, communicator creation, and collective operations continue to pass without modification, and (2) direct performance comparisons of MPI initialization latency for traditional MPI_COMM_WORLD-based codes before and after the refactoring, showing that the overhead remains within measurement noise. These additions will provide the concrete evidence requested.

    Revision: yes
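As a rough illustration of what "within measurement noise" could mean in a before/after comparison, the sketch below applies one possible criterion: the shift in medians must not exceed a multiple of the larger sample standard deviation. The function name, the factor of 2, and all timing values are hypothetical placeholders; this is not the authors' method, and the numbers are not results.

```python
# One possible "within noise" check, assumed for illustration only. The
# timing samples are made-up placeholders, not measurements from the paper.
import statistics


def within_noise(before, after, k=2.0):
    """True if the median shift is at most k times the larger sample stdev."""
    shift = abs(statistics.median(after) - statistics.median(before))
    noise = max(statistics.stdev(before), statistics.stdev(after))
    return shift <= k * noise


before = [1.00, 1.02, 0.98, 1.01, 0.99]        # placeholder seconds
after_same = [1.01, 0.99, 1.00, 1.02, 0.98]    # placeholder: no real change
after_slow = [1.30, 1.32, 1.28, 1.31, 1.29]    # placeholder: clear slowdown

print(within_noise(before, after_same))
print(within_noise(before, after_slow))
```

A real comparison would need repeated runs at each scale and a stated criterion, since a fixed threshold like this one is sensitive to how many samples are taken.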

Circularity Check

0 steps flagged

No circularity; implementation description and empirical evaluation are self-contained

Full rationale

The paper describes an engineering refactoring of MPICH to remove internal reliance on a global MPI_COMM_WORLD communicator and enable true MPI-4 Sessions. It reports architectural changes and scalability measurements from the resulting implementation. No equations, derivations, fitted parameters, or predictions appear. No self-citations are invoked as load-bearing premises for any result. The scalability claim rests on direct evaluation of the new code path rather than any reduction to the paper's own inputs by construction. This matches the default non-circular case for implementation-and-measurement papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the MPI-4 Sessions specification being correctly interpreted and on the assumption that hierarchical communicator construction is a viable and representative usage pattern. No free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)
  • Domain assumption: MPI-4 Sessions semantics can be realized without a global world communicator.
    Invoked throughout the description of the refactoring goal.

pith-pipeline@v0.9.0 · 5492 in / 1128 out tokens · 39882 ms · 2026-05-07T14:03:00.109426+00:00 · methodology

