Two Efficient Message-passing Exclusive Scan Algorithms

Jesper Larsson Tr\"aff

arxiv: 2604.25667 · v1 · submitted 2026-04-28 · 💻 cs.DS · cs.DC

Two Efficient Message-passing Exclusive Scan Algorithms

Jesper Larsson Tr\"aff This is my paper

Pith reviewed 2026-05-07 15:16 UTC · model grok-4.3

classification 💻 cs.DS cs.DC

keywords exclusive scanprefix sumsmessage passingcommunication roundsassociative operatorall-reducecollective communicationparallel algorithms

0 comments

The pith

Two algorithms compute exclusive prefix sums in message-passing systems with the minimum number of communication rounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the exclusive scan operation, which calculates prefix sums without including the current element, across processors in a message-passing environment. Processors communicate only with their immediate neighbors in one direction at a time, requiring at least logarithmic rounds for any scan. While inclusive scans have straightforward solutions matching this bound, exclusive scans have received less attention. By identifying natural invariants that the exclusive sums must satisfy at each step, the authors construct two algorithms. One combines an inclusive scan with an exclusive phase to balance rounds and operator uses. The other adapts an all-reduce approach where extra operator applications depend on the binary representation of the processor count. These matter for performance when input sizes are small enough that communication overhead dominates computation.

Core claim

By considering natural invariants for the exclusive prefix sums problem, we present two different algorithms that are efficient in the number of communication rounds and in the number of applications of the ⊕ operator. The first algorithm consists of an inclusive scan phase and an exclusive scan phase and trades the number of communication rounds against the number of applications of the ⊕ operator. The smallest number of inclusive scan rounds with q=⌈log₂p⌉ rounds in total is q'≥q−log₂(2^q−p+1). The other algorithm is a modification of a round-optimal all-reduce algorithm, and the number of additional applications of the ⊕ operator is dependent on the number of bits set (popcount of) in p−1

What carries the argument

Natural invariants for the exclusive prefix sums problem that guide the two algorithms, one built from inclusive-plus-exclusive phases and the other from an all-reduce modification.

If this is right

Both algorithms achieve at most ⌈log₂ p⌉ communication rounds, matching the lower bound for exclusive scans.
The first algorithm requires at least q minus log base 2 of (2 to the q minus p plus 1) inclusive-scan rounds while keeping the total at q.
The second algorithm adds a number of ⊕ applications determined by the popcount of p-1.
The algorithms are intended for small input vectors where round count dominates performance.
For large vectors the paper recommends switching to pipelined fixed-degree tree methods instead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The invariant approach could be tested on other collectives that share similar prefix-sum structure.
Hardware-specific choice between the two algorithms could be guided by measuring actual round versus operator costs.
The formulas already cover non-power-of-two p, so direct implementation on irregular processor counts is possible.

Load-bearing premise

The communication model assumes bounded, one-ported send-receive capabilities between consecutively ranked processors and that the binary operator ⊕ is associative.

What would settle it

A concrete execution for a given p that requires more than ⌈log₂ p⌉ rounds total or applies ⊕ more times than the popcount bound on p-1 would falsify the efficiency claims.

read the original abstract

Parallel scan primitives compute element-wise inclusive or exclusive prefix sums of input vectors contributed by $p$ consecutively ranked processors under an associative, possibly expensive, binary operator $\oplus$. In message-passing systems with bounded, one-ported communication capabilities, at least $\lceil\log_2 p\rceil$ or $\lceil\log_2 (p-1)\rceil$ send-receive communication rounds are required to perform the scans. While there are well-known, simple algorithms for the inclusive scan that solve the problem in $\lceil\log_2 p\rceil$ send-receive communication rounds with $\lceil\log_2 p\rceil$ applications of the $\oplus$ operator, the exclusive scan is different and has been much less addressed. By considering natural invariants for the exclusive prefix sums problem, we present two different algorithms that are efficient in the number of communication rounds and in the number of applications of the $\oplus$ operator. The first algorithm consists of an inclusive scan phase and an exclusive scan phase and trades the number of communication rounds against the number of applications of the $\oplus$ operator. The smallest number of inclusive scan rounds with $q=\lceil\log_2 p\rceil$ rounds in total is $q'\geq q-\log_2(2^q-p+1)$. The other algorithm is a modification of a round-optimal all-reduce algorithm, and the number of additional applications of the $\oplus$ operator is dependent on the number of bits set (popcount of) in $p-1$. Both algorithms are relevant for small(er) input vectors where performance is dominated by the number of communication rounds. For large input vectors, other (pipelined, fixed-degree tree) algorithms must be used.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers two solid new constructions for exclusive scan that trade rounds against operator uses in a one-ported setting.

read the letter

This paper presents two algorithms for computing exclusive prefix sums using message passing with one-ported communication. The key contribution is showing how to achieve good bounds on both the number of rounds and the number of times the associative operator is applied, which matters when the operator is expensive or when latency is the bottleneck for small data. The first algorithm splits the work into an inclusive scan phase followed by an exclusive phase. It allows trading off rounds between the two, with the minimal number of inclusive rounds being at least q minus log base 2 of (2 to the q minus p plus 1), where q is ceiling log p. The second takes a round-optimal all-reduce and modifies it so that the extra operator applications depend on the popcount of p minus 1. Both are presented as improvements over the standard inclusive scan methods that don't directly extend to exclusive. These constructions are new in their specific invariants and modifications, and they seem sound based on the model assumptions of associativity and consecutive ranking. The paper is upfront that this is for small vectors and that pipelined approaches are better for large ones. A minor soft spot is that the abstract doesn't include the full proofs or code, so the exact savings would need checking in the details, but nothing suggests a flaw in the approach. The citation pattern looks standard for this area. This work is aimed at researchers and practitioners in parallel computing and high-performance computing who deal with collective operations. It would be of interest to anyone implementing or optimizing scan primitives in distributed environments. I would recommend sending it to peer review. The ideas are concrete and the bounds are specific enough to be worth a closer look by experts in the field.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes two algorithms for the exclusive scan primitive in message-passing environments with one-ported communication and associative operators. By leveraging natural invariants, the first algorithm combines inclusive and exclusive phases with a specific lower bound on the number of rounds used for the inclusive scan: q' ≥ q - log₂(2^q - p + 1) for q = ⌈log₂ p⌉. The second algorithm adapts a round-optimal all-reduce method, incurring additional operator applications proportional to the popcount of p-1. The work emphasizes efficiency for small vectors where round count is the performance bottleneck.

Significance. If the algorithms correctly achieve the stated bounds, this contribution would be valuable for implementing efficient collective operations in parallel computing, particularly filling the gap in exclusive scan algorithms compared to the well-known inclusive ones. The strengths include the use of standard invariants without ad-hoc parameters and the provision of concrete, falsifiable bounds on communication and computation costs.

minor comments (2)

[Abstract] The abstract states the round bound for the first algorithm but would benefit from a short concrete example (e.g., for p=5 or p=8) showing the achieved q' value to illustrate the trade-off.
A comparison table or summary paragraph contrasting both new algorithms against the standard inclusive-scan baseline (in rounds and ⊕ applications) would improve readability and highlight the claimed improvements.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript on two efficient message-passing exclusive scan algorithms and for recommending minor revision. The summary accurately captures our approach of leveraging natural invariants for the first algorithm (with the stated bound on inclusive-scan rounds) and the popcount-based modification of all-reduce for the second. We are pleased that the potential value for small-vector collectives is recognized.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation relies on standard invariants of the exclusive prefix-sum problem under the one-ported associative model, together with explicit modifications of well-known inclusive-scan and all-reduce algorithms. No equation or bound is obtained by fitting a parameter to a subset of the target result, by self-definition, or by a load-bearing self-citation whose own justification reduces to the present paper. The stated round and operator counts (q' >= q - log2(2^q - p + 1), popcount(p-1)) follow directly from the communication model and the chosen invariants without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard assumptions of the message-passing model and operator properties; no free parameters, invented entities, or ad-hoc axioms are introduced.

axioms (2)

domain assumption The binary operator ⊕ is associative
Required for prefix sums to be independent of evaluation order.
domain assumption Processors communicate under a one-ported, bounded send-receive model
Defines the communication cost model used to count rounds.

pith-pipeline@v0.9.0 · 5613 in / 1220 out tokens · 57990 ms · 2026-05-07T15:16:26.432145+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1]

An optimal algorithm for computing census functions in message-passing systems.Parallel Processing Letters, 3(1):19–23, 1993

Amotz Bar-Noy, Shlomo Kipnis, and Baruch Schieber. An optimal algorithm for computing census functions in message-passing systems.Parallel Processing Letters, 3(1):19–23, 1993

work page 1993
[2]

Blelloch

Guy E. Blelloch. Scans as primitive parallel operations.IEEE Transactions on Computers, 38(11):1526–1538, 1989

work page 1989
[3]

Efficient algorithms for all-to-all communications in multiport message-passing systems

Jehoshua Bruck, Ching-Tien Ho, Schlomo Kipnis, Eli Upfal, and Derrick Weathersby. Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8(11):1143–1156, 1997

work page 1997
[4]

Char- acterization of MPI usage on a production supercomputer

Sudheer Chunduri, Scott Parker, Pavan Balaji, Kevin Harms, and Kalyan Kumaran. Char- acterization of MPI usage on a production supercomputer. InInternational Conference for High Performance Computing, Networking, Storage, and Analysis (SC), pages 30:1–30:15. IEEE/ACM, 2018

work page 2018
[5]

Work-stealing prefix scan: Addressing load imbalance in large-scale image registration

Marcin Copik, Tobias Grosser, Torsten Hoefler, Paolo Bientinesi, and Benjamin Berkels. Work-stealing prefix scan: Addressing load imbalance in large-scale image registration. IEEE Transactions on Parallel and Distributed Systems, 33(3):523–535, 2022

work page 2022
[6]

Daniel Hillis and Jr

W. Daniel Hillis and Jr. Guy L. Steele. Data parallel algorithms.Communications of the ACM, 29(12):1170–1183, 1986. 10

work page 1986
[7]

Kogge and Harold S

Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations.IEEE Transactions on Computers, 22(8):786–793, 1973

work page 1973
[8]

Kruskal, Larry Rudolph, and Marc Snir

Clyde P. Kruskal, Larry Rudolph, and Marc Snir. The power of parallel prefix.IEEE Transactions on Computers, C-34(10):965–968, 1985

work page 1985
[9]

A large-scale study of MPI usage in open-source HPC applications

Ignacio Laguna, Ryan Marshall, Kathryn Mohror, Martin Ruefenacht, Anthony Skjellum, and Nawrin Sultana. A large-scale study of MPI usage in open-source HPC applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (SC), pages 31:1–31:14, 2019

work page 2019
[10]

Prasanna

Kartik Lakhotia, Fabrizio Petrini, Rajgopal Kannan, and Viktor K. Prasanna. Accelerat- ing prefix scan with in-network computing on intel PIUMA. In29th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 59–68, 2022

work page 2022
[11]

Efficient parallel prefix algorithms on multiport message-passing systems.Information Processing Letters, 71:91–95, 1999

Yen-Chun Lin and Ching-Sung Yeh. Efficient parallel prefix algorithms on multiport message-passing systems.Information Processing Letters, 71:91–95, 1999

work page 1999
[12]

Version 4.1, November 2nd 2023.www.mpi-forum.org

MPI Forum.MPI: A Message-Passing Interface Standard. Version 4.1, November 2nd 2023.www.mpi-forum.org

work page 2023
[13]

Two-tree algorithms for full band- width broadcast, reduction and scan.Parallel Computing, 35(12):581–594, 2009

Peter Sanders, Jochen Speck, and Jesper Larsson Tr¨ aff. Two-tree algorithms for full band- width broadcast, reduction and scan.Parallel Computing, 35(12):581–594, 2009

work page 2009
[14]

Parallel prefix (scan) algorithms for MPI

Peter Sanders and Jesper Larsson Tr¨ aff. Parallel prefix (scan) algorithms for MPI. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 13th European PVM/MPI Users’ Group Meeting, volume 4192 ofLecture Notes in Computer Science, pages 49–57. Springer, 2006

work page 2006
[15]

InRecent Advances in Message Passing Interface

Jesper Larsson Tr¨ aff.mpicroscope: Towards an MPI benchmark tool for performance guideline verification. InRecent Advances in Message Passing Interface. 19th European MPI Users’ Group Meeting, volume 7490 ofLecture Notes in Computer Science, pages 100–109. Springer, 2012

work page 2012
[16]

Optimal broadcast schedules in logarithmic time with applications to broadcast, reduction, all-broadcast and all-reduction.ACM Transactions on Parallel Computing, 12(3):1–21, 2025

Jesper Larsson Tr¨ aff. Optimal broadcast schedules in logarithmic time with applications to broadcast, reduction, all-broadcast and all-reduction.ACM Transactions on Parallel Computing, 12(3):1–21, 2025

work page 2025
[17]

Optimal, non-pipelined reduce-scatter and allreduce algorithms with an application to all-to-all communication.ACM Transactions on Parallel Computing, 12(4):1–23, 2025

Jesper Larsson Tr¨ aff. Optimal, non-pipelined reduce-scatter and allreduce algorithms with an application to all-to-all communication.ACM Transactions on Parallel Computing, 12(4):1–23, 2025

work page 2025
[18]

Decomposing MPI collectives for exploiting multi- lane communication

Jesper Larsson Tr¨ aff and Sascha Hunold. Decomposing MPI collectives for exploiting multi- lane communication. InIEEE International Conference on Cluster Computing (CLUS- TER), pages 270–280. IEEE Computer Society, 2020

work page 2020
[19]

Library development with MPI: Attributes, request objects, group communicator creation, local reductions and datatypes

Jesper Larsson Tr¨ aff and Ioannis Vardas. Library development with MPI: Attributes, request objects, group communicator creation, local reductions and datatypes. In30th European MPI Users’ Group Meeting (EuroMPI), 2023. 11 A An Experimental Evaluation We have implemented the two exclusive scan algorithms for and in MPI [12] exactly as shown in Algorithm ...

work page 2023

[1] [1]

An optimal algorithm for computing census functions in message-passing systems.Parallel Processing Letters, 3(1):19–23, 1993

Amotz Bar-Noy, Shlomo Kipnis, and Baruch Schieber. An optimal algorithm for computing census functions in message-passing systems.Parallel Processing Letters, 3(1):19–23, 1993

work page 1993

[2] [2]

Blelloch

Guy E. Blelloch. Scans as primitive parallel operations.IEEE Transactions on Computers, 38(11):1526–1538, 1989

work page 1989

[3] [3]

Efficient algorithms for all-to-all communications in multiport message-passing systems

Jehoshua Bruck, Ching-Tien Ho, Schlomo Kipnis, Eli Upfal, and Derrick Weathersby. Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8(11):1143–1156, 1997

work page 1997

[4] [4]

Char- acterization of MPI usage on a production supercomputer

Sudheer Chunduri, Scott Parker, Pavan Balaji, Kevin Harms, and Kalyan Kumaran. Char- acterization of MPI usage on a production supercomputer. InInternational Conference for High Performance Computing, Networking, Storage, and Analysis (SC), pages 30:1–30:15. IEEE/ACM, 2018

work page 2018

[5] [5]

Work-stealing prefix scan: Addressing load imbalance in large-scale image registration

Marcin Copik, Tobias Grosser, Torsten Hoefler, Paolo Bientinesi, and Benjamin Berkels. Work-stealing prefix scan: Addressing load imbalance in large-scale image registration. IEEE Transactions on Parallel and Distributed Systems, 33(3):523–535, 2022

work page 2022

[6] [6]

Daniel Hillis and Jr

W. Daniel Hillis and Jr. Guy L. Steele. Data parallel algorithms.Communications of the ACM, 29(12):1170–1183, 1986. 10

work page 1986

[7] [7]

Kogge and Harold S

Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations.IEEE Transactions on Computers, 22(8):786–793, 1973

work page 1973

[8] [8]

Kruskal, Larry Rudolph, and Marc Snir

Clyde P. Kruskal, Larry Rudolph, and Marc Snir. The power of parallel prefix.IEEE Transactions on Computers, C-34(10):965–968, 1985

work page 1985

[9] [9]

A large-scale study of MPI usage in open-source HPC applications

Ignacio Laguna, Ryan Marshall, Kathryn Mohror, Martin Ruefenacht, Anthony Skjellum, and Nawrin Sultana. A large-scale study of MPI usage in open-source HPC applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (SC), pages 31:1–31:14, 2019

work page 2019

[10] [10]

Prasanna

Kartik Lakhotia, Fabrizio Petrini, Rajgopal Kannan, and Viktor K. Prasanna. Accelerat- ing prefix scan with in-network computing on intel PIUMA. In29th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 59–68, 2022

work page 2022

[11] [11]

Efficient parallel prefix algorithms on multiport message-passing systems.Information Processing Letters, 71:91–95, 1999

Yen-Chun Lin and Ching-Sung Yeh. Efficient parallel prefix algorithms on multiport message-passing systems.Information Processing Letters, 71:91–95, 1999

work page 1999

[12] [12]

Version 4.1, November 2nd 2023.www.mpi-forum.org

MPI Forum.MPI: A Message-Passing Interface Standard. Version 4.1, November 2nd 2023.www.mpi-forum.org

work page 2023

[13] [13]

Two-tree algorithms for full band- width broadcast, reduction and scan.Parallel Computing, 35(12):581–594, 2009

Peter Sanders, Jochen Speck, and Jesper Larsson Tr¨ aff. Two-tree algorithms for full band- width broadcast, reduction and scan.Parallel Computing, 35(12):581–594, 2009

work page 2009

[14] [14]

Parallel prefix (scan) algorithms for MPI

Peter Sanders and Jesper Larsson Tr¨ aff. Parallel prefix (scan) algorithms for MPI. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 13th European PVM/MPI Users’ Group Meeting, volume 4192 ofLecture Notes in Computer Science, pages 49–57. Springer, 2006

work page 2006

[15] [15]

InRecent Advances in Message Passing Interface

Jesper Larsson Tr¨ aff.mpicroscope: Towards an MPI benchmark tool for performance guideline verification. InRecent Advances in Message Passing Interface. 19th European MPI Users’ Group Meeting, volume 7490 ofLecture Notes in Computer Science, pages 100–109. Springer, 2012

work page 2012

[16] [16]

Optimal broadcast schedules in logarithmic time with applications to broadcast, reduction, all-broadcast and all-reduction.ACM Transactions on Parallel Computing, 12(3):1–21, 2025

Jesper Larsson Tr¨ aff. Optimal broadcast schedules in logarithmic time with applications to broadcast, reduction, all-broadcast and all-reduction.ACM Transactions on Parallel Computing, 12(3):1–21, 2025

work page 2025

[17] [17]

Optimal, non-pipelined reduce-scatter and allreduce algorithms with an application to all-to-all communication.ACM Transactions on Parallel Computing, 12(4):1–23, 2025

Jesper Larsson Tr¨ aff. Optimal, non-pipelined reduce-scatter and allreduce algorithms with an application to all-to-all communication.ACM Transactions on Parallel Computing, 12(4):1–23, 2025

work page 2025

[18] [18]

Decomposing MPI collectives for exploiting multi- lane communication

Jesper Larsson Tr¨ aff and Sascha Hunold. Decomposing MPI collectives for exploiting multi- lane communication. InIEEE International Conference on Cluster Computing (CLUS- TER), pages 270–280. IEEE Computer Society, 2020

work page 2020

[19] [19]

Library development with MPI: Attributes, request objects, group communicator creation, local reductions and datatypes

Jesper Larsson Tr¨ aff and Ioannis Vardas. Library development with MPI: Attributes, request objects, group communicator creation, local reductions and datatypes. In30th European MPI Users’ Group Meeting (EuroMPI), 2023. 11 A An Experimental Evaluation We have implemented the two exclusive scan algorithms for and in MPI [12] exactly as shown in Algorithm ...

work page 2023