pith. sign in

arxiv: 2604.25667 · v1 · submitted 2026-04-28 · 💻 cs.DS · cs.DC

Two Efficient Message-passing Exclusive Scan Algorithms

Pith reviewed 2026-05-07 15:16 UTC · model grok-4.3

classification 💻 cs.DS cs.DC
keywords exclusive scanprefix sumsmessage passingcommunication roundsassociative operatorall-reducecollective communicationparallel algorithms
0
0 comments X

The pith

Two algorithms compute exclusive prefix sums in message-passing systems with the minimum number of communication rounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the exclusive scan operation, which calculates prefix sums without including the current element, across processors in a message-passing environment. Processors communicate only with their immediate neighbors in one direction at a time, requiring at least logarithmic rounds for any scan. While inclusive scans have straightforward solutions matching this bound, exclusive scans have received less attention. By identifying natural invariants that the exclusive sums must satisfy at each step, the authors construct two algorithms. One combines an inclusive scan with an exclusive phase to balance rounds and operator uses. The other adapts an all-reduce approach where extra operator applications depend on the binary representation of the processor count. These matter for performance when input sizes are small enough that communication overhead dominates computation.

Core claim

By considering natural invariants for the exclusive prefix sums problem, we present two different algorithms that are efficient in the number of communication rounds and in the number of applications of the ⊕ operator. The first algorithm consists of an inclusive scan phase and an exclusive scan phase and trades the number of communication rounds against the number of applications of the ⊕ operator. The smallest number of inclusive scan rounds with q=⌈log₂p⌉ rounds in total is q'≥q−log₂(2^q−p+1). The other algorithm is a modification of a round-optimal all-reduce algorithm, and the number of additional applications of the ⊕ operator is dependent on the number of bits set (popcount of) in p−1

What carries the argument

Natural invariants for the exclusive prefix sums problem that guide the two algorithms, one built from inclusive-plus-exclusive phases and the other from an all-reduce modification.

If this is right

  • Both algorithms achieve at most ⌈log₂ p⌉ communication rounds, matching the lower bound for exclusive scans.
  • The first algorithm requires at least q minus log base 2 of (2 to the q minus p plus 1) inclusive-scan rounds while keeping the total at q.
  • The second algorithm adds a number of ⊕ applications determined by the popcount of p-1.
  • The algorithms are intended for small input vectors where round count dominates performance.
  • For large vectors the paper recommends switching to pipelined fixed-degree tree methods instead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The invariant approach could be tested on other collectives that share similar prefix-sum structure.
  • Hardware-specific choice between the two algorithms could be guided by measuring actual round versus operator costs.
  • The formulas already cover non-power-of-two p, so direct implementation on irregular processor counts is possible.

Load-bearing premise

The communication model assumes bounded, one-ported send-receive capabilities between consecutively ranked processors and that the binary operator ⊕ is associative.

What would settle it

A concrete execution for a given p that requires more than ⌈log₂ p⌉ rounds total or applies ⊕ more times than the popcount bound on p-1 would falsify the efficiency claims.

read the original abstract

Parallel scan primitives compute element-wise inclusive or exclusive prefix sums of input vectors contributed by $p$ consecutively ranked processors under an associative, possibly expensive, binary operator $\oplus$. In message-passing systems with bounded, one-ported communication capabilities, at least $\lceil\log_2 p\rceil$ or $\lceil\log_2 (p-1)\rceil$ send-receive communication rounds are required to perform the scans. While there are well-known, simple algorithms for the inclusive scan that solve the problem in $\lceil\log_2 p\rceil$ send-receive communication rounds with $\lceil\log_2 p\rceil$ applications of the $\oplus$ operator, the exclusive scan is different and has been much less addressed. By considering natural invariants for the exclusive prefix sums problem, we present two different algorithms that are efficient in the number of communication rounds and in the number of applications of the $\oplus$ operator. The first algorithm consists of an inclusive scan phase and an exclusive scan phase and trades the number of communication rounds against the number of applications of the $\oplus$ operator. The smallest number of inclusive scan rounds with $q=\lceil\log_2 p\rceil$ rounds in total is $q'\geq q-\log_2(2^q-p+1)$. The other algorithm is a modification of a round-optimal all-reduce algorithm, and the number of additional applications of the $\oplus$ operator is dependent on the number of bits set (popcount of) in $p-1$. Both algorithms are relevant for small(er) input vectors where performance is dominated by the number of communication rounds. For large input vectors, other (pipelined, fixed-degree tree) algorithms must be used.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes two algorithms for the exclusive scan primitive in message-passing environments with one-ported communication and associative operators. By leveraging natural invariants, the first algorithm combines inclusive and exclusive phases with a specific lower bound on the number of rounds used for the inclusive scan: q' ≥ q - log₂(2^q - p + 1) for q = ⌈log₂ p⌉. The second algorithm adapts a round-optimal all-reduce method, incurring additional operator applications proportional to the popcount of p-1. The work emphasizes efficiency for small vectors where round count is the performance bottleneck.

Significance. If the algorithms correctly achieve the stated bounds, this contribution would be valuable for implementing efficient collective operations in parallel computing, particularly filling the gap in exclusive scan algorithms compared to the well-known inclusive ones. The strengths include the use of standard invariants without ad-hoc parameters and the provision of concrete, falsifiable bounds on communication and computation costs.

minor comments (2)
  1. [Abstract] The abstract states the round bound for the first algorithm but would benefit from a short concrete example (e.g., for p=5 or p=8) showing the achieved q' value to illustrate the trade-off.
  2. A comparison table or summary paragraph contrasting both new algorithms against the standard inclusive-scan baseline (in rounds and ⊕ applications) would improve readability and highlight the claimed improvements.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript on two efficient message-passing exclusive scan algorithms and for recommending minor revision. The summary accurately captures our approach of leveraging natural invariants for the first algorithm (with the stated bound on inclusive-scan rounds) and the popcount-based modification of all-reduce for the second. We are pleased that the potential value for small-vector collectives is recognized.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation relies on standard invariants of the exclusive prefix-sum problem under the one-ported associative model, together with explicit modifications of well-known inclusive-scan and all-reduce algorithms. No equation or bound is obtained by fitting a parameter to a subset of the target result, by self-definition, or by a load-bearing self-citation whose own justification reduces to the present paper. The stated round and operator counts (q' >= q - log2(2^q - p + 1), popcount(p-1)) follow directly from the communication model and the chosen invariants without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard assumptions of the message-passing model and operator properties; no free parameters, invented entities, or ad-hoc axioms are introduced.

axioms (2)
  • domain assumption The binary operator ⊕ is associative
    Required for prefix sums to be independent of evaluation order.
  • domain assumption Processors communicate under a one-ported, bounded send-receive model
    Defines the communication cost model used to count rounds.

pith-pipeline@v0.9.0 · 5613 in / 1220 out tokens · 57990 ms · 2026-05-07T15:16:26.432145+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    An optimal algorithm for computing census functions in message-passing systems.Parallel Processing Letters, 3(1):19–23, 1993

    Amotz Bar-Noy, Shlomo Kipnis, and Baruch Schieber. An optimal algorithm for computing census functions in message-passing systems.Parallel Processing Letters, 3(1):19–23, 1993

  2. [2]

    Blelloch

    Guy E. Blelloch. Scans as primitive parallel operations.IEEE Transactions on Computers, 38(11):1526–1538, 1989

  3. [3]

    Efficient algorithms for all-to-all communications in multiport message-passing systems

    Jehoshua Bruck, Ching-Tien Ho, Schlomo Kipnis, Eli Upfal, and Derrick Weathersby. Efficient algorithms for all-to-all communications in multiport message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 8(11):1143–1156, 1997

  4. [4]

    Char- acterization of MPI usage on a production supercomputer

    Sudheer Chunduri, Scott Parker, Pavan Balaji, Kevin Harms, and Kalyan Kumaran. Char- acterization of MPI usage on a production supercomputer. InInternational Conference for High Performance Computing, Networking, Storage, and Analysis (SC), pages 30:1–30:15. IEEE/ACM, 2018

  5. [5]

    Work-stealing prefix scan: Addressing load imbalance in large-scale image registration

    Marcin Copik, Tobias Grosser, Torsten Hoefler, Paolo Bientinesi, and Benjamin Berkels. Work-stealing prefix scan: Addressing load imbalance in large-scale image registration. IEEE Transactions on Parallel and Distributed Systems, 33(3):523–535, 2022

  6. [6]

    Daniel Hillis and Jr

    W. Daniel Hillis and Jr. Guy L. Steele. Data parallel algorithms.Communications of the ACM, 29(12):1170–1183, 1986. 10

  7. [7]

    Kogge and Harold S

    Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations.IEEE Transactions on Computers, 22(8):786–793, 1973

  8. [8]

    Kruskal, Larry Rudolph, and Marc Snir

    Clyde P. Kruskal, Larry Rudolph, and Marc Snir. The power of parallel prefix.IEEE Transactions on Computers, C-34(10):965–968, 1985

  9. [9]

    A large-scale study of MPI usage in open-source HPC applications

    Ignacio Laguna, Ryan Marshall, Kathryn Mohror, Martin Ruefenacht, Anthony Skjellum, and Nawrin Sultana. A large-scale study of MPI usage in open-source HPC applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (SC), pages 31:1–31:14, 2019

  10. [10]

    Prasanna

    Kartik Lakhotia, Fabrizio Petrini, Rajgopal Kannan, and Viktor K. Prasanna. Accelerat- ing prefix scan with in-network computing on intel PIUMA. In29th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 59–68, 2022

  11. [11]

    Efficient parallel prefix algorithms on multiport message-passing systems.Information Processing Letters, 71:91–95, 1999

    Yen-Chun Lin and Ching-Sung Yeh. Efficient parallel prefix algorithms on multiport message-passing systems.Information Processing Letters, 71:91–95, 1999

  12. [12]

    Version 4.1, November 2nd 2023.www.mpi-forum.org

    MPI Forum.MPI: A Message-Passing Interface Standard. Version 4.1, November 2nd 2023.www.mpi-forum.org

  13. [13]

    Two-tree algorithms for full band- width broadcast, reduction and scan.Parallel Computing, 35(12):581–594, 2009

    Peter Sanders, Jochen Speck, and Jesper Larsson Tr¨ aff. Two-tree algorithms for full band- width broadcast, reduction and scan.Parallel Computing, 35(12):581–594, 2009

  14. [14]

    Parallel prefix (scan) algorithms for MPI

    Peter Sanders and Jesper Larsson Tr¨ aff. Parallel prefix (scan) algorithms for MPI. In Recent Advances in Parallel Virtual Machine and Message Passing Interface. 13th European PVM/MPI Users’ Group Meeting, volume 4192 ofLecture Notes in Computer Science, pages 49–57. Springer, 2006

  15. [15]

    InRecent Advances in Message Passing Interface

    Jesper Larsson Tr¨ aff.mpicroscope: Towards an MPI benchmark tool for performance guideline verification. InRecent Advances in Message Passing Interface. 19th European MPI Users’ Group Meeting, volume 7490 ofLecture Notes in Computer Science, pages 100–109. Springer, 2012

  16. [16]

    Optimal broadcast schedules in logarithmic time with applications to broadcast, reduction, all-broadcast and all-reduction.ACM Transactions on Parallel Computing, 12(3):1–21, 2025

    Jesper Larsson Tr¨ aff. Optimal broadcast schedules in logarithmic time with applications to broadcast, reduction, all-broadcast and all-reduction.ACM Transactions on Parallel Computing, 12(3):1–21, 2025

  17. [17]

    Optimal, non-pipelined reduce-scatter and allreduce algorithms with an application to all-to-all communication.ACM Transactions on Parallel Computing, 12(4):1–23, 2025

    Jesper Larsson Tr¨ aff. Optimal, non-pipelined reduce-scatter and allreduce algorithms with an application to all-to-all communication.ACM Transactions on Parallel Computing, 12(4):1–23, 2025

  18. [18]

    Decomposing MPI collectives for exploiting multi- lane communication

    Jesper Larsson Tr¨ aff and Sascha Hunold. Decomposing MPI collectives for exploiting multi- lane communication. InIEEE International Conference on Cluster Computing (CLUS- TER), pages 270–280. IEEE Computer Society, 2020

  19. [19]

    Library development with MPI: Attributes, request objects, group communicator creation, local reductions and datatypes

    Jesper Larsson Tr¨ aff and Ioannis Vardas. Library development with MPI: Attributes, request objects, group communicator creation, local reductions and datatypes. In30th European MPI Users’ Group Meeting (EuroMPI), 2023. 11 A An Experimental Evaluation We have implemented the two exclusive scan algorithms for and in MPI [12] exactly as shown in Algorithm ...