arXiv: 2412.18629 · v3 · submitted 2024-12-21 · 🌌 astro-ph.IM · astro-ph.CO · astro-ph.GA

MPI-Rockstar: a Hybrid MPI and OpenMP Parallel Implementation of the Rockstar Halo finder

Ken Osato, Peter Behroozi, Satoshi Tanaka, Tomoaki Ishiyama, Tomoyuki Tokuue

Pith reviewed 2026-05-06 21:47 UTC · model claude-opus-4-7

keywords: cosmological N-body simulation · halo finder · Rockstar · MPI parallelization · OpenMP · hybrid parallel computing · subhalo identification · Fugaku supercomputer

The pith

A hybrid MPI and OpenMP rewrite of the Rockstar halo finder scales to over 100,000 processes and runs up to three times faster while reproducing the original halo statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper takes a halo finder that has become standard in cosmological simulation analysis and rebuilds its parallel layer so it can keep up with simulations that now hold a trillion or more particles. The original implementation coordinated worker processes through a single server using socket communication, which became the bottleneck around ten thousand processes and also produced an unmanageable number of open file descriptors on modern machines. The rewrite swaps sockets for MPI collective communication, threads the subhalo search with OpenMP inside each process, and reorders steps so that communication and computation overlap cleanly. On a leading supercomputer the rewrite holds about 90% strong-scaling efficiency over a 4× node range, runs up to three times faster than the original in matched environments, and completes halo finding on a two-trillion-particle snapshot using 786,432 cores. Halo mass functions match the original implementation to within 0.1% across most of the mass range, so the speedup does not come at the cost of changing the science output. Practical additions — native HDF5 output and extra halo shape descriptors such as the inertia tensor — make the catalogs easier to feed into downstream survey-era analysis.

Core claim

The authors rebuild a widely used phase-space halo finder so that it scales to the regime of trillion-particle cosmological simulations. They replace the original socket-based one-to-one communication, which bottlenecks at a single coordinating server process around ten thousand workers, with MPI collective communications and add OpenMP threading inside each process. By reordering communication and computation steps, the subhalo search is parallelized at both process and thread levels, while the hybrid layout cuts per-process memory pressure relative to flat-MPI. On the Fugaku machine the rewrite delivers about 90% strong-scaling efficiency from 256 to 1024 nodes, runs up to three times fast

What carries the argument

Replacing socket-based one-to-one communication with MPI collectives, plus reordering communication and computation so that the subhalo finding inside each Friends-of-Friends group is parallelized across both MPI processes and OpenMP threads. This removes the single-threaded server-process coordination bottleneck and the simultaneous file-descriptor explosion that limited the original code at large process counts.
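
To make the coordination argument concrete, here is a toy count of sequential communication steps. It is a minimal sketch under stated assumptions, not a measurement from the paper: `server_steps` and `binomial_tree_rounds` are hypothetical helpers, the server is assumed to service one worker at a time, and the collective is assumed to use a binomial tree, which is only one of several algorithms an MPI library may select.

```python
def server_steps(n_workers: int) -> int:
    """Sequential exchanges when a single server talks to each worker in turn."""
    return n_workers


def binomial_tree_rounds(n_procs: int) -> int:
    """Rounds needed by a binomial-tree broadcast or reduce: ceil(log2(n_procs))."""
    return (n_procs - 1).bit_length() if n_procs > 1 else 0


# Process counts chosen for illustration; 10_000 is the bottleneck scale quoted above.
for p in (64, 1024, 10_000, 131_072):
    print(f"{p:>7} processes: server {server_steps(p):>7} steps, "
          f"tree {binomial_tree_rounds(p):>2} rounds")
```

In this toy model the server's work grows linearly with the process count while the tree grows logarithmically. Real runtimes also depend on message sizes, network topology, and the overlap of communication with computation.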

If this is right

Halo and subhalo catalogs for next-generation trillion-particle cosmological simulations become tractable on existing supercomputers, removing a key post-processing bottleneck for upcoming galaxy surveys.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Editorial inference: validation rests on a single aggregate statistic (the halo mass function) and not on subhalo-by-subhalo or boundary-region comparisons; cross-domain Friends-of-Friends linking and thread ordering could in principle bias substructure properties in ways the mass-function check would not detect.

Load-bearing premise

That matching the halo mass function to within 0.1% is enough to certify that the reorganized communication and threading preserve every halo property of interest, including subhalos and halos straddling process boundaries, even though the code is non-deterministic and identical catalogs cannot be produced for a bit-level check.

What would settle it

Run MPI-Rockstar and the original Rockstar on the same simulation snapshot in environments where both fit, and compare halo mass function, subhalo statistics, and boundary-region halo properties. If agreement holds at the sub-percent level for the full halo population (not just near the mass-function midrange) and the reported strong-scaling and 2-trillion-particle Fugaku run reproduce, the central claim stands; systematic disagreement or scaling collapse beyond a few thousand processes would refute it.
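
The mass-function part of that comparison can be sketched as a per-bin fractional difference. This is an illustrative sketch only: the function names, bin edges, and halo masses are hypothetical, and a real comparison would read both catalogs from the same snapshot.

```python
import math


def mass_function_counts(masses, log10_min, log10_max, n_bins):
    """Count halos in equal-width bins of log10(mass)."""
    width = (log10_max - log10_min) / n_bins
    counts = [0] * n_bins
    for m in masses:
        idx = math.floor((math.log10(m) - log10_min) / width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def fractional_difference(counts_new, counts_ref):
    """Per-bin (new - ref) / ref; None where the reference bin is empty."""
    return [(n - r) / r if r > 0 else None
            for n, r in zip(counts_new, counts_ref)]


# Hypothetical masses: the new catalog gains one low-mass halo and
# loses one high-mass halo relative to the reference catalog.
ref_masses = [3e11] * 1000 + [2e12] * 100 + [4e13] * 10
new_masses = [3e11] * 1001 + [2e12] * 100 + [4e13] * 9

ref_counts = mass_function_counts(ref_masses, 11.0, 14.0, 3)
new_counts = mass_function_counts(new_masses, 11.0, 14.0, 3)
residuals = fractional_difference(new_counts, ref_counts)
print(ref_counts, new_counts, residuals)
```

In this made-up data a single halo shifts the populous bin by 0.1% and the sparse bin by 10%, which shows why residuals at the ends of the mass range need a scatter estimate before they can be interpreted.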

Original abstract

MPI-Rockstar is a massively parallel halo finder based on the Rockstar phase-space temporal halo finder code, which is one of the most extensively used halo finding codes. Compared to the original code, parallelized by a primitive socket communication library, we parallelized it in a hybrid way using MPI and OpenMP, which is suitable for analysis on the hybrid shared and distributed memory environments of modern supercomputers. This implementation can easily handle the analysis of more than a trillion particles on more than 100,000 parallel processes, enabling the production of a huge dataset for the next generation of cosmological surveys. As new functions to the original Rockstar code, MPI-Rockstar supports HDF5 as an output format and can output additional halo properties such as the inertia tensor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Solid HPC re-engineering of Rockstar with real scaling numbers; the science-equivalence check is thinner than it should be but doesn't undermine the contribution.

Quick take: this is a useful infrastructure paper. Rockstar's socket-based server architecture has been a known bottleneck at ~10^4 processes for years, and file-descriptor limits on modern HPC make even moderate runs awkward. Tokuue et al. replace the socket layer with MPI, add OpenMP threading inside processes, and reorder compute/comm to use collectives. The result scales at ~90% efficiency from 256 to 1024 Fugaku nodes on 4096³ and 2560³ snapshots, runs ~3× faster than original Rockstar in a matched 64-core test, and they demonstrate a 2-trillion-particle run on 786,432 cores. They also add HDF5 output and extra halo properties (inertia tensor). The code is public.

What's actually new: the engineering. The halo-finding science is unchanged by design, and the authors say so. Hybrid MPI+OpenMP is standard practice; what matters here is that someone did it correctly on the most-used phase-space halo finder and shipped it, which is non-trivial work that the community needs.

Soft spot, proportionate: the equivalence-to-original check is one halo mass function agreeing to <0.1% away from the tails. That's fine as a sanity check but light for a tool meant to feed next-generation survey catalogs. Rockstar's value-add over FoF/SO is 6D subhalo identification, which is exactly where reshuffled domain boundaries, new collective ordering, and thread-ordering inside the per-FoF subhalo finder could plausibly leave a few-percent imprint that an HMF integrates away. A subhalo mass function, V_max function, or spatial cross-match against the original on a small box would close this. Invoking Rockstar's non-determinism to explain residuals is honest but doesn't bound the systematic. The stress-test note is right about this; it's a real gap, not a fatal one.
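
A spatial cross-match of the kind suggested here could look like the following sketch. It is a brute-force illustration under assumptions: catalogs are hypothetical lists of `(x, y, z, mass)` tuples in a periodic box, matching is greedy in the order of the first catalog, and the matching radius `max_sep` is arbitrary. A production comparison would use a spatial tree or grid and a more careful matching criterion.

```python
import math


def periodic_delta(a, b, box):
    """Shortest signed separation b - a along one axis of a periodic box."""
    d = (b - a) % box
    return d - box if d > box / 2 else d


def cross_match(cat_a, cat_b, box, max_sep):
    """Greedily pair each halo in cat_a with its nearest unused halo in cat_b.

    Returns a list of (index_a, index_b, separation, mass_b / mass_a).
    """
    used = set()
    matches = []
    for i, (xa, ya, za, ma) in enumerate(cat_a):
        best_j, best_r = None, max_sep
        for j, (xb, yb, zb, _mb) in enumerate(cat_b):
            if j in used:
                continue
            r = math.sqrt(periodic_delta(xa, xb, box) ** 2
                          + periodic_delta(ya, yb, box) ** 2
                          + periodic_delta(za, zb, box) ** 2)
            if r <= best_r:
                best_j, best_r = j, r
        if best_j is not None:
            used.add(best_j)
            matches.append((i, best_j, best_r, cat_b[best_j][3] / ma))
    return matches


# Hypothetical catalogs in a 250-unit periodic box.
cat_a = [(1.0, 1.0, 1.0, 1e13), (249.9, 100.0, 50.0, 2e12), (120.0, 30.0, 200.0, 5e11)]
cat_b = [(0.1, 100.0, 50.0, 2.02e12), (1.0, 1.0, 1.05, 1e13), (60.0, 60.0, 60.0, 7e11)]

matches = cross_match(cat_a, cat_b, box=250.0, max_sep=0.5)
matched_fraction = len(matches) / len(cat_a)
print(matches, matched_fraction)
```

The second pair is matched across the periodic edge of the box, and the third halo has no counterpart within the radius, so the matched fraction in this toy case is two thirds.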

This isn't load-bearing for the paper's main claim, which is engineering. It does mean a careful user should run their own cross-check before publishing science with it — which is normal for any new finder.

Who it's for: anyone running trillion-particle cosmological sims, anyone building halo catalogs for LSST/Euclid/Roman-scale work, and HPC people interested in retrofitting socket-era astro codes.

Recommendation: accept. Send it to review only if a referee can ask for a subhalo-level or boundary-zone comparison plot. I'd cite it the next time I need to point at a Rockstar that actually runs at 10^5 cores.

Referee Report

4 major / 7 minor

Summary. The authors present MPI-Rockstar, a re-implementation of the widely used Rockstar phase-space halo finder in which the original socket-based one-to-one communications are replaced by MPI collectives and an OpenMP layer is added for intra-node thread parallelism. Reordering of computation and communication is introduced so that collectives can be used efficiently. New features include HDF5 output and additional halo properties such as the inertia tensor. The paper reports ~90% strong-scaling efficiency from 256 to 1024 Fugaku nodes for 4096³ and 2560³ snapshots, up to ~3× speedup over original Rockstar in a matched 64-core test (200 s vs 440 s), agreement of the halo mass function with original Rockstar to <0.1% away from the rare/small-N ends, and a successful 2-trillion-particle run on 16,384 Fugaku nodes (786,432 cores). The code is publicly released on GitHub.
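
The scaling and core-count figures in this summary can be checked with a few lines of arithmetic. The sketch below uses the conventional definition of strong-scaling efficiency; the function names are hypothetical, the 2 × 24 layout is the one quoted in the major comments, and the two wall-clock times are invented for illustration.

```python
def strong_scaling_efficiency(t_ref, n_ref, t_n, n):
    """Strong-scaling efficiency relative to a reference run; 1.0 is ideal."""
    return (t_ref * n_ref) / (t_n * n)


def implied_speedup(efficiency, n_ref, n):
    """Wall-clock speedup implied by a given efficiency going from n_ref to n nodes."""
    return efficiency * n / n_ref


# Figures quoted on this page.
speedup_256_to_1024 = implied_speedup(0.90, 256, 1024)
cores_per_node = 2 * 24          # MPI processes per node x OpenMP threads per process
total_cores = 16_384 * cores_per_node

# Hypothetical wall-clock times in seconds, for illustration only.
example_efficiency = strong_scaling_efficiency(100.0, 256, 27.8, 1024)

print(speedup_256_to_1024, total_cores, round(example_efficiency, 3))
```

Under this definition, 90% efficiency over a 4× increase in nodes corresponds to a 3.6× reduction in wall-clock time, and 16,384 nodes at 48 cores each gives the quoted 786,432 cores.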

Significance. If the performance and compatibility claims hold, MPI-Rockstar removes a real, widely encountered scaling ceiling of the original Rockstar (single-threaded server coordinator and file-descriptor exhaustion at ~10⁴ processes) and brings one of the field's standard halo-finding tools onto modern hybrid shared/distributed-memory machines at the scale demanded by trillion-particle next-generation surveys (LSST/Rubin, Euclid, Roman, DESI-class N-body suites). The strong-scaling result and the demonstrated 2-trillion-particle run on Fugaku are concrete, useful data points for the simulation community. The release of source code with a permissive workflow and the addition of HDF5 output and extra halo shape diagnostics are practical wins. As a software/engineering contribution this is a meaningful and well-scoped piece of work; the load-bearing scientific question is whether the new parallel reorganization preserves Rockstar's halo-finding semantics, which is currently underdocumented.

major comments (4)

[§1 Parallelization (validation)] The compatibility claim — that the reordered collective-based communication and the OpenMP-threaded subhalo finder reproduce Rockstar's results — is supported by a single aggregate statistic, the halo mass function agreeing to <0.1% away from the rare/small-N ends. The HMF is dominated by host halos and integrates over precisely the regimes where the new code most plausibly differs: 6D phase-space subhalo assignment within an FoF group that straddles MPI domains, FoF linking across reshuffled boundaries, and thread-order-dependent seed selection inside the per-FoF subhalo finder. Please supplement the HMF comparison with at least (i) a subhalo mass function and/or V_max function, (ii) a spatial cross-match of halo catalogs (positions, velocities, M_vir) between MPI-Rockstar and original Rockstar on an identical snapshot, and (iii) a diagnostic for halos in the near-boundary zone vs inter
[§1 (deviations at HMF endpoints)] The deviations from original Rockstar at the massive and least-massive ends of the HMF are attributed to 'small halo counts and resolution' without evidence. Given that non-determinism is simultaneously invoked to explain why bit-identical catalogs cannot be required, the reader has no quantitative way to separate Poisson/non-determinism scatter from a genuine bias introduced by the new parallelization. Please show the run-to-run scatter of original Rockstar (and of MPI-Rockstar) from repeated executions on the same snapshot, so that the residual difference between the two codes can be compared to the intrinsic non-deterministic spread.
[Fig. 2 / scaling configuration] The strong-scaling figure reports 2 MPI processes per node × 24 OpenMP threads per process as 'optimal' for the chosen snapshots, but no sweep over (MPI, OpenMP) configurations is shown, no breakdown of compute vs communication vs I/O time is given, and the comparison to original Rockstar (3× speedup, 200 s vs 440 s, 64 cores, 1024³ in 250 Mpc/h, z=0) is at a very different scale from the Fugaku scaling test. A short table or stacked-bar decomposition of where time is spent at, say, 256 and 1024 nodes, plus a justification of the chosen MPI×OpenMP balance, would substantially strengthen the engineering claims and let users choose configurations on other machines.
[New outputs (inertia tensor, HDF5)] The newly added halo properties (e.g., inertia tensor) and HDF5 output are listed but not validated. Since these are additions not present in original Rockstar, no internal consistency check exists by default. Please provide at least a sanity check (e.g., axis ratios versus halo mass for a known sample, or comparison against an external code such as VELOCIraptor on a small box) so users have a baseline before consuming these quantities for science.
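
The axis-ratio sanity check requested in the last major comment can be sketched as follows. This assumes the tensor is a symmetric second-moment (shape) tensor whose eigenvalues scale as the squared principal axis lengths; the exact definition and weighting used by MPI-Rockstar are not given on this page, so treat the convention as an assumption. The eigenvalues come from the standard closed-form trigonometric solution for symmetric 3×3 matrices, and the function names are hypothetical.

```python
import math


def symmetric_eigenvalues_3x3(a):
    """Eigenvalues, in descending order, of a real symmetric 3x3 matrix."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:
        # Already diagonal: the eigenvalues are the diagonal entries.
        return sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against round-off
    phi = math.acos(r) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    e2 = 3.0 * q - e1 - e3
    return [e1, e2, e3]


def axis_ratios(tensor):
    """Return (b/a, c/a) assuming eigenvalues scale as squared axis lengths."""
    e1, e2, e3 = symmetric_eigenvalues_3x3(tensor)
    return math.sqrt(e2 / e1), math.sqrt(e3 / e1)


# Hypothetical tensor: diag(4, 1, 0.25) rotated by 45 degrees about the z axis.
tensor = [[2.5, 1.5, 0.0],
          [1.5, 2.5, 0.0],
          [0.0, 0.0, 0.25]]
b_over_a, c_over_a = axis_ratios(tensor)
print(b_over_a, c_over_a)
```

Because the example tensor is a rotated diagonal matrix, the recovered ratios should not depend on the rotation. That frame independence is one property a validation figure would implicitly test.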

minor comments (7)

[Abstract / §1] It would help to state explicitly, up front, what is and is not preserved relative to original Rockstar: same FoF parameters, same phase-space metric, same unbinding procedure, etc. Currently the reader has to infer this from 'maintaining compatibility'.
[§1] The statement that the original Rockstar's server process 'starts to become a bottleneck around 10,000 processors' would benefit from a citation or a measurement, since this number motivates the work.
[Fig. 2] Please label axes with units, indicate which curve is which simulation in the caption (currently only described in text), show error bars or repeat-run scatter, and clarify whether 'time' includes I/O. An ideal-scaling reference normalization point should be stated.
[§1 (speedup comparison)] The 200 s vs 440 s comparison should specify the MPI×OpenMP layout used for MPI-Rockstar in that 64-core test, and the original Rockstar configuration (number of reader/writer/server processes), so the 3× claim is reproducible.
[References] Comparable modern halo finders (e.g., VELOCIraptor, HBT+, AHF, the GADGET-4 SUBFIND-HBT) are cited in passing; a one-sentence positioning of MPI-Rockstar relative to their parallelization strategies would aid readers choosing a tool.
[Reproducibility] Consider including, in the repository or supplementary material, the exact configuration files and snapshot specifications used to generate Fig. 2 and the 200 s/440 s comparison, so independent users can reproduce the timing claims.
[Typography] Several inconsistencies: 'shared and distributed memory' (abstract) vs 'shared- and distributed-memory' (body); 'one of the most extensively used halo finding codes' is repeated in summary and body; 'Supercomputer Fugaku' and 'supercomputer Fugaku' are both used.

Simulated Author's Rebuttal

4 responses · 1 unresolved

We thank the referee for a constructive report that correctly identifies the weakest part of the current manuscript: validation of semantic equivalence with original Rockstar rests on a single aggregate statistic, and the new outputs are documented but not sanity-checked. We agree with the spirit of all four major comments and will revise accordingly. Specifically, we will (1) add subhalo mass / V_max functions, a spatial halo-by-halo cross-match between original Rockstar and MPI-Rockstar, and a near-boundary vs interior diagnostic; (2) quantify the run-to-run non-deterministic scatter of both codes on the same snapshot so the inter-code residual can be compared to the intrinsic spread, replacing the current unsupported attribution of HMF endpoint deviations; (3) add a wall-clock decomposition (I/O, FoF+exchange, 6D subhalo finding, output) at 256 and 1024 Fugaku nodes together with a small (MPI×OpenMP) configuration sweep, and clarify the scope of the 64-core matched comparison vs the Fugaku scaling run; and (4) add an axis-ratio vs M_vir sanity figure for the new inertia-tensor output and an HDF5/ASCII consistency check. We note that a full cross-code shape comparison against VELOCIraptor is outside the scope of this software-release paper because of differing halo definitions, and we will state this caveat explicitly rather than overclaim.
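
Item (2) of this plan, comparing the inter-code residual with the run-to-run spread, can be sketched as below. The bin counts are invented, the helper names are hypothetical, and the sample standard deviation is used as the scatter estimate; the rebuttal speaks of an RMS, so the exact estimator is an assumption.

```python
import math
import statistics


def per_bin_mean_and_scatter(runs):
    """runs: one list of bin counts per repeated run. Returns (means, std devs)."""
    bins = list(zip(*runs))
    means = [statistics.fmean(b) for b in bins]
    scatters = [statistics.stdev(b) for b in bins]
    return means, scatters


def residual_in_sigma(runs_ref, runs_new):
    """Per-bin mean difference in units of the combined run-to-run scatter."""
    mean_ref, sd_ref = per_bin_mean_and_scatter(runs_ref)
    mean_new, sd_new = per_bin_mean_and_scatter(runs_new)
    out = []
    for mr, sr, mn, sn in zip(mean_ref, sd_ref, mean_new, sd_new):
        combined = math.sqrt(sr ** 2 + sn ** 2)
        out.append((mn - mr) / combined if combined > 0 else None)
    return out


# Hypothetical halo counts in three mass bins from five repeated runs per code.
runs_original = [[1000, 100, 10], [1002, 101, 10], [998, 99, 11],
                 [1001, 100, 9], [999, 100, 10]]
runs_mpi = [[1002, 100, 10], [1000, 100, 10], [1001, 101, 9],
            [1003, 99, 10], [999, 100, 11]]

sigmas = residual_in_sigma(runs_original, runs_mpi)
print(sigmas)
```

A residual well below one in these units would be consistent with intrinsic non-determinism, while a persistent offset of several units across bins would point to a genuine difference between the codes.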

Point-by-point responses

Referee: Validation rests on a single aggregate HMF comparison; please supplement with (i) subhalo mass / V_max function, (ii) spatial cross-match of halo catalogs (positions, velocities, M_vir), and (iii) a diagnostic for halos near MPI-domain boundaries vs interior.

Authors: We agree that the HMF alone is insufficient to establish semantic equivalence, especially since the regimes most affected by reordered communication are precisely those an aggregate HMF averages over. In the revised manuscript we will add (i) the subhalo mass function and the V_max function for both hosts and subhalos, computed on the same z=0, 1024^3 / 250 Mpc/h snapshot already used for the timing comparison; (ii) a spatial cross-match of halo catalogs between original Rockstar and MPI-Rockstar, reporting the matched fraction and the distributions of Δx, Δv, and ΔM_vir for matched pairs; and (iii) a boundary diagnostic in which we tag FoF groups whose particle members straddle an MPI sub-box boundary and compare the matched-halo residuals for that subset against interior halos. These will be included as a new validation subsection plus a figure. We note that, because Rockstar's seed selection is intrinsically non-deterministic (see next point), exact one-to-one matching is not expected; the relevant question is whether the residuals exceed the run-to-run spread of the original code, which we will address jointly with the next comment. revision: yes
Referee: HMF endpoint deviations are attributed to 'small halo counts and resolution' without evidence, while non-determinism is invoked to forbid bit-identical comparison. Please show the run-to-run scatter of original Rockstar and MPI-Rockstar so the inter-code residual can be benchmarked against the intrinsic spread.

Authors: This is a fair criticism. We will perform N≥5 repeated runs of both original Rockstar and MPI-Rockstar on the same 1024^3 / 250 Mpc/h z=0 snapshot with identical inputs and report (a) the run-to-run RMS of the HMF in each mass bin for each code separately, and (b) the difference between the two codes overlaid with these scatter envelopes. The same procedure will be applied to the subhalo mass function and V_max function added in response to the previous point. This will allow the reader to judge quantitatively whether the endpoint deviations are consistent with Poisson + non-deterministic seed-selection scatter, as we currently assert, or whether a residual bias remains. We will revise the wording in §1 accordingly, removing the unsupported assertion and replacing it with the measured comparison. revision: yes
Referee: No (MPI×OpenMP) sweep, no compute/communication/I/O breakdown, and the matched-environment comparison to original Rockstar is at a very different scale from the Fugaku test. Please add a time decomposition at, e.g., 256 and 1024 nodes and justify the chosen MPI/OpenMP balance.

Authors: We will add a short table giving the wall-clock decomposition into (a) I/O, (b) FoF + boundary exchange, (c) 6D subhalo finding, and (d) output writing, at 256 and 1024 Fugaku nodes for the 4096^3 / 2 Gpc/h snapshot, together with the same decomposition for at least three (MPI, OpenMP) configurations per node (e.g., 1×48, 2×24, 4×12) at a fixed node count. This will document why 2×24 was chosen as the production setting on Fugaku — primarily a balance between MPI collective cost (which favors fewer ranks) and per-rank memory footprint plus I/O concurrency (which favor more ranks) — and will give users on other machines an empirical basis for retuning. We also agree the 64-core matched test is at a different scale from the Fugaku run; we will clarify in the text that the 3× figure is a like-for-like single-node comparison against original Rockstar, while the Fugaku numbers characterize behavior in the regime where original Rockstar cannot run at all due to the server-coordinator and file-descriptor bottlenecks. revision: yes
Referee: The new outputs (inertia tensor, HDF5) are not validated; please provide at least a sanity check such as axis ratios vs halo mass, or comparison against an external code on a small box.

Authors: We accept this. For the HDF5 output we will add a short statement that bit-level equivalence with the ASCII output has been verified on the validation snapshot (identical halo IDs, identical column values to machine precision), and we will include a checksum-style test in the public repository. For the inertia tensor and derived axis ratios (c/a, b/a) we will add a figure showing the axis-ratio distribution as a function of M_vir on the validation snapshot and overlay the well-established trend that more massive halos are more triaxial (e.g., the qualitative behavior reported in the literature for ΛCDM halos), as a sanity check that the tensor is computed in the correct frame and with the correct particle membership. A direct cross-code comparison with VELOCIraptor on a small box is a reasonable next step but is more naturally a separate study, since the two codes use different halo definitions and particle membership criteria; we will note this caveat explicitly rather than attempt a full cross-code shape comparison in this software paper. revision: partial
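
The boundary diagnostic promised in the first response, tagging FoF groups whose members straddle a sub-box boundary, can be sketched as follows. The decomposition is assumed to be a regular grid of cubic sub-boxes with positions in [0, box); the actual domain decomposition in MPI-Rockstar is not described on this page, and the function names and particle positions are hypothetical.

```python
def domain_index(pos, box, n_per_side):
    """Grid index (ix, iy, iz) of the cubic sub-box containing a position."""
    cell = box / n_per_side
    return tuple(min(int(c // cell), n_per_side - 1) for c in pos)


def straddles_boundary(particles, box, n_per_side):
    """True if the particles of one group fall in more than one sub-box."""
    return len({domain_index(p, box, n_per_side) for p in particles}) > 1


# Hypothetical FoF groups in a 250-unit box split into 2 x 2 x 2 sub-boxes.
interior_group = [(10.0, 10.0, 10.0), (11.0, 10.5, 9.8)]
edge_group = [(124.9, 50.0, 50.0), (125.1, 50.0, 50.0)]

tags = {
    "interior_group": straddles_boundary(interior_group, 250.0, 2),
    "edge_group": straddles_boundary(edge_group, 250.0, 2),
}
print(tags)
```

With such a tag on every group, matched-halo residuals can be binned separately for the straddling and interior subsets and compared.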

standing simulated objections not resolved

A direct external-code cross-validation of the inertia tensor (e.g., against VELOCIraptor) is not provided in the revision; we offer only an internal axis-ratio-vs-mass sanity check, because differing halo definitions between codes make a clean shape comparison itself a research question rather than a software sanity check.

Circularity Check

0 steps flagged

No circularity: an engineering/parallelization paper whose claims are timing measurements and an aggregate-statistic comparison to an external reference code.

Full rationale

This is a software/methods paper presenting a hybrid MPI+OpenMP reimplementation of an existing halo finder (Rockstar). The load-bearing claims are (i) strong-scaling efficiency on Fugaku, (ii) wall-clock speedup vs. the original Rockstar in a matched environment, and (iii) halo mass function agreement with the original Rockstar to <0.1%. None of these claims is derived from the paper's own definitions in a way that makes it tautological: scaling and runtime numbers are externally measurable timing data, and the HMF comparison is against an independent reference implementation (the original Rockstar by Behroozi et al. 2013), not against the paper's own output. Self-citations (Behroozi 2013; Ishiyama et al. simulation papers) are used to identify the base algorithm and to provide datasets, not to prove the central claims. The skeptical reader's concern, that validation rests on a single aggregate statistic and could mask subhalo/boundary biases, is a coverage/correctness-of-validation concern, not circularity: the test is genuinely against an external benchmark, just possibly an insufficient one. That belongs in a correctness-risk review, not here. Score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a software/engineering paper, the axiom ledger is light. There are no fitted physical parameters, no invented entities, and no unproved theorems. The work rests on standard MPI/OpenMP semantics and on the correctness of the original Rockstar algorithm, both of which are external. The only paper-specific assumption is that the reordered communication preserves Rockstar's halo-finding behavior, which is checked empirically rather than proved.

pith-pipeline@v0.9.0 · 9516 in / 5500 out tokens · 82327 ms · 2026-05-06T21:47:49.160525+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Inferring Halo Mass and Scale Radius of Galaxy Clusters Using Convolutional Neural Networks and Uchuu-UniverseMachine Catalogs
astro-ph.CO 2026-04 unverdicted novelty 5.0

Convolutional neural networks can infer galaxy cluster virial masses and scale radii from 2D projected position and line-of-sight velocity distributions with nearly unbiased results and reduced scatter when richness i...