Two Efficient Message-passing Exclusive Scan Algorithms
Pith reviewed 2026-05-07 15:16 UTC · model grok-4.3
The pith
Two algorithms compute exclusive prefix sums in message-passing systems with the minimum number of communication rounds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By considering natural invariants for the exclusive prefix sums problem, we present two different algorithms that are efficient in the number of communication rounds and in the number of applications of the ⊕ operator. The first algorithm consists of an inclusive scan phase and an exclusive scan phase and trades the number of communication rounds against the number of applications of the ⊕ operator. The smallest number of inclusive scan rounds with q=⌈log₂p⌉ rounds in total is q'≥q−log₂(2^q−p+1). The other algorithm is a modification of a round-optimal all-reduce algorithm, and the number of additional applications of the ⊕ operator is dependent on the number of bits set (popcount of) in p−1
What carries the argument
Natural invariants for the exclusive prefix sums problem that guide the two algorithms, one built from inclusive-plus-exclusive phases and the other from an all-reduce modification.
If this is right
- Both algorithms achieve at most ⌈log₂ p⌉ communication rounds, matching the lower bound for exclusive scans.
- The first algorithm requires at least q minus log base 2 of (2 to the q minus p plus 1) inclusive-scan rounds while keeping the total at q.
- The second algorithm adds a number of ⊕ applications determined by the popcount of p-1.
- The algorithms are intended for small input vectors where round count dominates performance.
- For large vectors the paper recommends switching to pipelined fixed-degree tree methods instead.
Where Pith is reading between the lines
- The invariant approach could be tested on other collectives that share similar prefix-sum structure.
- Hardware-specific choice between the two algorithms could be guided by measuring actual round versus operator costs.
- The formulas already cover non-power-of-two p, so direct implementation on irregular processor counts is possible.
Load-bearing premise
The communication model assumes bounded, one-ported send-receive capabilities between consecutively ranked processors and that the binary operator ⊕ is associative.
What would settle it
A concrete execution for a given p that requires more than ⌈log₂ p⌉ rounds total or applies ⊕ more times than the popcount bound on p-1 would falsify the efficiency claims.
read the original abstract
Parallel scan primitives compute element-wise inclusive or exclusive prefix sums of input vectors contributed by $p$ consecutively ranked processors under an associative, possibly expensive, binary operator $\oplus$. In message-passing systems with bounded, one-ported communication capabilities, at least $\lceil\log_2 p\rceil$ or $\lceil\log_2 (p-1)\rceil$ send-receive communication rounds are required to perform the scans. While there are well-known, simple algorithms for the inclusive scan that solve the problem in $\lceil\log_2 p\rceil$ send-receive communication rounds with $\lceil\log_2 p\rceil$ applications of the $\oplus$ operator, the exclusive scan is different and has been much less addressed. By considering natural invariants for the exclusive prefix sums problem, we present two different algorithms that are efficient in the number of communication rounds and in the number of applications of the $\oplus$ operator. The first algorithm consists of an inclusive scan phase and an exclusive scan phase and trades the number of communication rounds against the number of applications of the $\oplus$ operator. The smallest number of inclusive scan rounds with $q=\lceil\log_2 p\rceil$ rounds in total is $q'\geq q-\log_2(2^q-p+1)$. The other algorithm is a modification of a round-optimal all-reduce algorithm, and the number of additional applications of the $\oplus$ operator is dependent on the number of bits set (popcount of) in $p-1$. Both algorithms are relevant for small(er) input vectors where performance is dominated by the number of communication rounds. For large input vectors, other (pipelined, fixed-degree tree) algorithms must be used.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes two algorithms for the exclusive scan primitive in message-passing environments with one-ported communication and associative operators. By leveraging natural invariants, the first algorithm combines inclusive and exclusive phases with a specific lower bound on the number of rounds used for the inclusive scan: q' ≥ q - log₂(2^q - p + 1) for q = ⌈log₂ p⌉. The second algorithm adapts a round-optimal all-reduce method, incurring additional operator applications proportional to the popcount of p-1. The work emphasizes efficiency for small vectors where round count is the performance bottleneck.
Significance. If the algorithms correctly achieve the stated bounds, this contribution would be valuable for implementing efficient collective operations in parallel computing, particularly filling the gap in exclusive scan algorithms compared to the well-known inclusive ones. The strengths include the use of standard invariants without ad-hoc parameters and the provision of concrete, falsifiable bounds on communication and computation costs.
minor comments (2)
- [Abstract] The abstract states the round bound for the first algorithm but would benefit from a short concrete example (e.g., for p=5 or p=8) showing the achieved q' value to illustrate the trade-off.
- A comparison table or summary paragraph contrasting both new algorithms against the standard inclusive-scan baseline (in rounds and ⊕ applications) would improve readability and highlight the claimed improvements.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript on two efficient message-passing exclusive scan algorithms and for recommending minor revision. The summary accurately captures our approach of leveraging natural invariants for the first algorithm (with the stated bound on inclusive-scan rounds) and the popcount-based modification of all-reduce for the second. We are pleased that the potential value for small-vector collectives is recognized.
Circularity Check
No significant circularity
full rationale
The derivation relies on standard invariants of the exclusive prefix-sum problem under the one-ported associative model, together with explicit modifications of well-known inclusive-scan and all-reduce algorithms. No equation or bound is obtained by fitting a parameter to a subset of the target result, by self-definition, or by a load-bearing self-citation whose own justification reduces to the present paper. The stated round and operator counts (q' >= q - log2(2^q - p + 1), popcount(p-1)) follow directly from the communication model and the chosen invariants without circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The binary operator ⊕ is associative
- domain assumption Processors communicate under a one-ported, bounded send-receive model
Reference graph
Works this paper leans on
-
[1]
Amotz Bar-Noy, Shlomo Kipnis, and Baruch Schieber. An optimal algorithm for computing census functions in message-passing systems.Parallel Processing Letters, 3(1):19–23, 1993
work page 1993
- [2]
-
[3]
Efficient algorithms for all-to-all communications in multiport message-passing systems
Jehoshua Bruck, Ching-Tien Ho, Schlomo Kipnis, Eli Upfal, and Derrick Weathersby. Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8(11):1143–1156, 1997
work page 1997
-
[4]
Char- acterization of MPI usage on a production supercomputer
Sudheer Chunduri, Scott Parker, Pavan Balaji, Kevin Harms, and Kalyan Kumaran. Char- acterization of MPI usage on a production supercomputer. InInternational Conference for High Performance Computing, Networking, Storage, and Analysis (SC), pages 30:1–30:15. IEEE/ACM, 2018
work page 2018
-
[5]
Work-stealing prefix scan: Addressing load imbalance in large-scale image registration
Marcin Copik, Tobias Grosser, Torsten Hoefler, Paolo Bientinesi, and Benjamin Berkels. Work-stealing prefix scan: Addressing load imbalance in large-scale image registration. IEEE Transactions on Parallel and Distributed Systems, 33(3):523–535, 2022
work page 2022
-
[6]
W. Daniel Hillis and Jr. Guy L. Steele. Data parallel algorithms.Communications of the ACM, 29(12):1170–1183, 1986. 10
work page 1986
-
[7]
Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations.IEEE Transactions on Computers, 22(8):786–793, 1973
work page 1973
-
[8]
Kruskal, Larry Rudolph, and Marc Snir
Clyde P. Kruskal, Larry Rudolph, and Marc Snir. The power of parallel prefix.IEEE Transactions on Computers, C-34(10):965–968, 1985
work page 1985
-
[9]
A large-scale study of MPI usage in open-source HPC applications
Ignacio Laguna, Ryan Marshall, Kathryn Mohror, Martin Ruefenacht, Anthony Skjellum, and Nawrin Sultana. A large-scale study of MPI usage in open-source HPC applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (SC), pages 31:1–31:14, 2019
work page 2019
- [10]
-
[11]
Yen-Chun Lin and Ching-Sung Yeh. Efficient parallel prefix algorithms on multiport message-passing systems.Information Processing Letters, 71:91–95, 1999
work page 1999
-
[12]
Version 4.1, November 2nd 2023.www.mpi-forum.org
MPI Forum.MPI: A Message-Passing Interface Standard. Version 4.1, November 2nd 2023.www.mpi-forum.org
work page 2023
-
[13]
Peter Sanders, Jochen Speck, and Jesper Larsson Tr¨ aff. Two-tree algorithms for full band- width broadcast, reduction and scan.Parallel Computing, 35(12):581–594, 2009
work page 2009
-
[14]
Parallel prefix (scan) algorithms for MPI
Peter Sanders and Jesper Larsson Tr¨ aff. Parallel prefix (scan) algorithms for MPI. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 13th European PVM/MPI Users’ Group Meeting, volume 4192 ofLecture Notes in Computer Science, pages 49–57. Springer, 2006
work page 2006
-
[15]
InRecent Advances in Message Passing Interface
Jesper Larsson Tr¨ aff.mpicroscope: Towards an MPI benchmark tool for performance guideline verification. InRecent Advances in Message Passing Interface. 19th European MPI Users’ Group Meeting, volume 7490 ofLecture Notes in Computer Science, pages 100–109. Springer, 2012
work page 2012
-
[16]
Jesper Larsson Tr¨ aff. Optimal broadcast schedules in logarithmic time with applications to broadcast, reduction, all-broadcast and all-reduction.ACM Transactions on Parallel Computing, 12(3):1–21, 2025
work page 2025
-
[17]
Jesper Larsson Tr¨ aff. Optimal, non-pipelined reduce-scatter and allreduce algorithms with an application to all-to-all communication.ACM Transactions on Parallel Computing, 12(4):1–23, 2025
work page 2025
-
[18]
Decomposing MPI collectives for exploiting multi- lane communication
Jesper Larsson Tr¨ aff and Sascha Hunold. Decomposing MPI collectives for exploiting multi- lane communication. InIEEE International Conference on Cluster Computing (CLUS- TER), pages 270–280. IEEE Computer Society, 2020
work page 2020
-
[19]
Jesper Larsson Tr¨ aff and Ioannis Vardas. Library development with MPI: Attributes, request objects, group communicator creation, local reductions and datatypes. In30th European MPI Users’ Group Meeting (EuroMPI), 2023. 11 A An Experimental Evaluation We have implemented the two exclusive scan algorithms for and in MPI [12] exactly as shown in Algorithm ...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.