
arxiv: 2604.11008 · v1 · submitted 2026-04-13 · ⚛️ physics.flu-dyn · cs.PF

LCS.jl: A High-Performance, Multi-Platform Computational Model in Julia for Turbulent Particle-Laden Flows

Pith reviewed 2026-05-10 16:25 UTC · model grok-4.3

classification ⚛️ physics.flu-dyn cs.PF
keywords LCS.jl · particle-laden flows · turbulent multiphase flows · GPU acceleration · Julia language · direct numerical simulation · high-performance computing · scaling efficiency

The pith

LCS.jl is a Julia-based model for turbulent particle-laden flows that runs up to 18x faster on GPUs than on CPUs while matching Fortran performance in multi-process CPU runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces LCS.jl, a single-source multiphase turbulence simulation model written in Julia. Its GPU-native particle communication algorithm, based on prefix-scan, cuts the particle communication cost from approximately 78 percent of total runtime (when delegated to the CPU) to 10 percent. The model produces fluid and particle statistics that agree with prior studies, matches the speed of the Fortran implementation in many-process CPU runs, and runs up to 18 times faster on GPUs than on CPUs. On TSUBAME4.0, a GPU supercomputer, strong scaling efficiency stays above 85 percent up to 256 GPUs and weak scaling efficiency above 90 percent up to 216 GPUs. A trial heterogeneous CPU-GPU run reduced execution time by 72 percent relative to the CPU-only configuration.
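The percentages above are easier to compare once converted to ratios. The following is a minimal arithmetic sketch, not taken from the paper: the function names are hypothetical, and the second function assumes that the non-communication part of the runtime is unchanged when the communication algorithm is swapped, which the paper does not state.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional reduction in execution time into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def relative_runtime_after_comm_change(comm_before, comm_after):
    """New total runtime as a fraction of the old one.

    ASSUMPTION (not from the paper): all non-communication work takes the
    same time before and after; only the communication share changes.
    """
    other_work = 1.0 - comm_before          # share of old runtime that is not communication
    return other_work / (1.0 - comm_after)  # new total, in units of the old total


# A 72 percent reduction in execution time expressed as a speedup factor.
print(round(speedup_from_reduction(0.72), 2))

# Communication share falling from 78 percent to 10 percent of total runtime.
print(round(relative_runtime_after_comm_change(0.78, 0.10), 3))
```

Under these assumptions a 72 percent reduction is a speedup of roughly 3.6x, which is a different quantity from the 18x GPU-over-CPU figure and should not be conflated with it.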

Core claim

LCS.jl demonstrates that a portable Julia implementation with a prefix-scan GPU-native particle communication algorithm can deliver computational performance equivalent to Fortran while achieving 18 times speedup on GPUs, strong scaling efficiency above 85 percent up to 256 GPUs, and reduction of particle communication cost to 10 percent of execution time.
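For readers unfamiliar with the scaling metrics quoted in the claim, here is a short sketch of the conventional definitions. It assumes the standard textbook formulas (the paper may normalize differently), and every timing value in the usage lines is hypothetical, chosen only to illustrate the arithmetic.

```python
def strong_scaling_efficiency(t_ref, n_ref, t_n, n):
    """Fixed total problem size: ideal time scales as 1/n.

    Efficiency = (t_ref * n_ref) / (t_n * n), where t_ref is the time per
    step on the reference device count n_ref.
    """
    return (t_ref * n_ref) / (t_n * n)


def weak_scaling_efficiency(t_ref, t_n):
    """Fixed problem size per device: ideal time stays constant."""
    return t_ref / t_n


# Hypothetical timings (seconds per step), not measurements from the paper.
print(strong_scaling_efficiency(t_ref=4.0, n_ref=8, t_n=0.625, n=64))
print(weak_scaling_efficiency(t_ref=1.0, t_n=1.25))
```

With these definitions, "above 85 percent to 256 GPUs" means the measured time per step at 256 GPUs is at most about 1/0.85 times the ideal time extrapolated from the reference run.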

What carries the argument

GPU-native particle communication algorithm based on prefix-scan, implemented via KernelAbstractions.jl in Julia to enable single-source multi-platform execution.
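To make the idea of prefix-scan-based packing concrete, the sketch below shows the general technique (stream compaction with an exclusive prefix sum) in plain Python. It is an illustration of the technique only, not the LCS.jl implementation: the function names, the encoding of destinations as integers with -1 meaning "stays local", and the serial loops standing in for GPU threads are all assumptions.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] = sum(values[:i])."""
    if not values:
        return []
    return [0] + list(accumulate(values))[:-1]


def pack_particles(dest, num_neighbors):
    """Group indices of departing particles into one contiguous send buffer.

    dest[i] is the (hypothetical) neighbor index particle i must move to,
    or -1 if it stays on this device.
    """
    # Per-destination counts; a scan over them gives each segment's start offset.
    counts = [sum(1 for d in dest if d == n) for n in range(num_neighbors)]
    offsets = exclusive_scan(counts)
    send_buffer = [None] * sum(counts)
    for n in range(num_neighbors):
        # Flag particles bound for neighbor n, then scan the flags: the scan
        # value is the particle's slot inside segment n. On a GPU each write
        # below is independent, so no thread needs to wait on another.
        flags = [1 if d == n else 0 for d in dest]
        slots = exclusive_scan(flags)
        for i, flag in enumerate(flags):
            if flag:
                send_buffer[offsets[n] + slots[i]] = i
    return send_buffer, offsets, counts


buf, offs, cnts = pack_particles([-1, 0, 2, 0, -1, 1, 2, 0], num_neighbors=3)
print(buf, offs, cnts)
```

The design point is that the scan turns "where does my particle go in the buffer?" into a data-parallel computation, so packing can stay on the device instead of being handed to the CPU.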

If this is right

  • Researchers gain access to portable high-performance simulations without rewriting code for each platform.
  • Large-scale direct numerical simulations of multiphase turbulence become practical on GPU-dominated supercomputers.
  • Heterogeneous CPU-GPU execution can further shorten runtimes even when GPUs are not the dominant device.
  • Performance parity with Fortran lowers the barrier to using higher-level languages for fluid dynamics codes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The prefix-scan communication approach may reduce similar bottlenecks in other particle-tracking simulations outside turbulence.
  • Single-source multi-platform designs could extend to related domains such as aerosol transport or sediment flows.
  • Further tuning of heterogeneous execution might yield additional gains on mixed hardware clusters.

Load-bearing premise

The reported validation statistics and timing results assume that test conditions such as resolution, particle count, and hardware match those in the cited prior studies without undisclosed differences.

What would settle it

Side-by-side execution of the exact same particle-laden flow case on identical hardware using LCS.jl and the referenced Fortran code, with direct comparison of both output statistics and wall-clock times.

Figures

Figures reproduced from arXiv: 2604.11008 by Ryo Onishi (Institute of Science Tokyo), Taketo Tominaga (Institute of Science Tokyo).

Figure 1. Implementation of a finite-difference kernel in LCS.jl (a) and Fortran (b).
Figure 2. Energy spectra $E(k)$ normalized by the Kolmogorov velocity scale $(\varepsilon\nu^5)^{1/4}$, where $\varepsilon$ is the mean energy dissipation rate and $\nu$ is the kinematic viscosity, as a function of the normalized wavenumber $k l_\eta$, where $l_\eta = (\nu^3/\varepsilon)^{1/4}$ is the Kolmogorov length scale. Results are shown for resolutions $N^3 = 128^3$ through $2048^3$ ($Re_\lambda \approx 79.3$ to $536$). The dashed line indicates the Kolmogorov $k^{-5/3}$ scaling.
Figure 3. Radial distribution function at contact $g(r = R)$ as a function of the Taylor-microscale Reynolds number $Re_\lambda$. Stars denote the present results ($St = 0.2$ to $2.0$); other symbols denote the results ($St = 0.1$ to $8$) of Onishi et al. (2016).
Figure 4. Speedup of wall-clock time per time-integration step as a function of the number of GPUs ($N^3 = 1500^3$, $N_p = 750^3$), for three HALO communication optimizations: communication–computation overlap, time-blocking, and their combination. The speedup is defined as the ratio of execution time relative to the baseline with no optimization applied.
Figure 5. Wall-clock time per time-integration step as a function of the number of CPU processes for LCS.jl (Julia) and the Fortran implementation ($N^3 = 256^3$, $N_p = 128^3$).
Figure 6. Wall-clock time per time-integration step as a function of the number of devices for fixed total problem sizes. (a) GPU execution; (b) CPU execution. The black line indicates ideal strong scaling.
Figure 7. Wall-clock time per time-integration step as a function of the number of devices for a fixed problem size per device. (a) GPU execution; (b) CPU execution. The black horizontal lines indicate ideal weak scaling.
Figure 8. Communication patterns for (a) a non-power-of-two 3 × 3 × 3 GPU topology and (b) a power-of-two 4 × 4 × 4 GPU topology. GPUs within the same node are shown in black. In (a), certain GPUs such as GPU 9 have no intra-node neighbors in any direction. In (b), at least one direction of communication remains intra-node for all GPUs.
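The topology effect described in the Figure 8 caption can be reproduced with a small counting exercise. This sketch rests on several assumptions that are not stated on this page: four GPUs per node, ranks numbered with x varying fastest, consecutive ranks filling a node, and periodic boundaries in all three directions. Rank indices here therefore need not match the GPU labels used in the figure.

```python
def rank_of(x, y, z, dims):
    """Map Cartesian coordinates to a rank, x varying fastest (assumed layout)."""
    nx, ny, _ = dims
    return x + nx * (y + ny * z)


def intra_node_neighbor_counts(dims, gpus_per_node=4):
    """For each rank, count face neighbors (periodic) hosted on the same node.

    ASSUMPTION: ranks r with equal r // gpus_per_node share a node.
    """
    nx, ny, nz = dims
    counts = {}
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                me = rank_of(x, y, z, dims)
                neighbors = [
                    rank_of((x + 1) % nx, y, z, dims),
                    rank_of((x - 1) % nx, y, z, dims),
                    rank_of(x, (y + 1) % ny, z, dims),
                    rank_of(x, (y - 1) % ny, z, dims),
                    rank_of(x, y, (z + 1) % nz, dims),
                    rank_of(x, y, (z - 1) % nz, dims),
                ]
                my_node = me // gpus_per_node
                counts[me] = sum(1 for n in neighbors if n // gpus_per_node == my_node)
    return counts


for dims in [(3, 3, 3), (4, 4, 4)]:
    counts = intra_node_neighbor_counts(dims)
    isolated = [r for r, c in counts.items() if c == 0]
    print(dims, "ranks with no intra-node neighbor:", len(isolated))
```

Under this layout a 4 × 4 × 4 grid places each full x-row on one node, so every rank keeps its x-direction exchange on-node, whereas a 3 × 3 × 3 grid leaves some ranks with all six exchanges crossing the network.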
Original abstract

Multiphase turbulent flow phenomena are observed not only in industrial devices but also in environmental flows, and direct numerical simulation (DNS) plays a key role in their investigation. Many numerical models have been developed; nevertheless, few models are highly optimized for GPU platforms, which represent the current mainstream in high-performance computing (HPC). In this study, we developed LCS.jl (Lagrangian Cloud Simulator in Julia), a single-source and multi-platform multiphase turbulence simulation model implemented in Julia language and KernelAbstractions.jl. Validation results confirmed that the present fluid and particle statistics agree well with those obtained in prior studies. A GPU-native particle communication algorithm based on prefix-scan reduced the particle communication cost from approximately 78% (CPU-delegated) to 10% of total execution time. LCS.jl achieved computational performance equivalent to the Fortran implementation in many-processes computations. For GPUs, strong scaling efficiency was maintained above 85% (up to 256 GPUs) and weak scaling efficiency above 90% (up to 216 GPUs) on TSUBAME4.0 (a GPU supercomputer at the Institute of Science Tokyo). LCS.jl achieved a maximum speedup of 18.0x on GPUs over CPUs. A trial heterogeneous execution achieved a 72% reduction in execution time compared to the CPU-only configuration even in configurations where the GPU was not the primary compute device. These results demonstrate that LCS.jl is a multiphase turbulence simulation platform that achieves both portability and scalability across a variety of computational resource configurations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces LCS.jl, a Julia-based single-source multiphase turbulence simulation code using KernelAbstractions.jl for multi-platform execution including GPUs. It reports that fluid and particle statistics agree with prior studies, introduces a GPU-native prefix-scan particle communication algorithm that reduces communication cost from ~78% to 10% of runtime, achieves Fortran-level performance in multi-process runs, maintains strong scaling efficiency above 85% to 256 GPUs and weak scaling above 90% to 216 GPUs on TSUBAME4.0, delivers a maximum 18x GPU-over-CPU speedup, and shows a 72% execution-time reduction in heterogeneous CPU-GPU mode.

Significance. If the validation and timing results hold under conditions directly comparable to the cited Fortran baselines, the work provides a valuable portable high-performance platform for DNS of particle-laden turbulent flows. The demonstrated scaling to hundreds of GPUs, the communication optimization, and the heterogeneous execution results represent concrete strengths that could broaden access to such simulations while highlighting Julia's viability for production HPC scientific codes.

major comments (2)
  1. [Validation and Results sections] The claims of statistical agreement with prior studies and of specific speedups and scaling efficiencies (18.0x, >85% strong scaling to 256 GPUs, particle-communication reduction to 10%) are load-bearing, yet no table or explicit list supplies the exact grid resolutions, particle numbers, Reynolds numbers, domain sizes, or hardware node specifications used in those runs. Without these, direct comparability to the referenced Fortran implementations cannot be verified.
  2. [Performance evaluation] Timing results are presented without error bars, standard deviations from repeated runs, or a clear statement of the precise test-case parameters (grid points, particle count, process counts) used for the scaling and heterogeneous-execution measurements. This omission directly affects assessment of the robustness of the reported efficiencies and of the 72% execution-time reduction in heterogeneous mode.
minor comments (1)
  1. [Abstract] The phrase 'many-processes computations' is used without indicating the actual number of MPI processes or nodes; adding this detail would improve context for the Fortran-comparison claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential value of LCS.jl as a portable platform for multiphase turbulence DNS. We address each major comment below with specific commitments to revision where the manuscript can be strengthened.

Point-by-point responses
  1. Referee: [Validation and Results sections] The claims of statistical agreement with prior studies and of specific speedups and scaling efficiencies (18.0x, >85% strong scaling to 256 GPUs, particle-communication reduction to 10%) are load-bearing, yet no table or explicit list supplies the exact grid resolutions, particle numbers, Reynolds numbers, domain sizes, or hardware node specifications used in those runs. Without these, direct comparability to the referenced Fortran implementations cannot be verified.

    Authors: We agree that an explicit compilation of these parameters is necessary to enable direct verification against the cited Fortran baselines. In the revised manuscript we will add a dedicated table (placed in the Validation section and referenced from the Performance section) that lists, for every reported case, the grid resolution, particle count, Reynolds number, domain size, and the precise hardware node configuration on TSUBAME4.0. This addition will make the comparability claims fully verifiable without altering any numerical results. revision: yes

  2. Referee: [Performance evaluation] Timing results are presented without error bars, standard deviations from repeated runs, or a clear statement of the precise test-case parameters (grid points, particle count, process counts) used for the scaling and heterogeneous-execution measurements. This omission directly affects assessment of the robustness of the reported efficiencies and of the 72% execution-time reduction in heterogeneous mode.

    Authors: We accept that the lack of reported variability and explicit parameter statements weakens the robustness assessment. We will insert a new subsection (or expanded table) that states the exact grid points, particle counts, and MPI process counts used for every scaling and heterogeneous run. Where multiple independent timings were collected we will add error bars or standard deviations; where only single runs were performed under dedicated allocation we will state this explicitly and note the controlled execution environment. These changes improve transparency while preserving the original timing values. revision: partial
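The variability reporting discussed in this exchange amounts to a few lines of bookkeeping per benchmark case. The sketch below shows one way to summarize repeated timings; the function name and the sample values are hypothetical, and it assumes the repeated runs are independent measurements of the same configuration.

```python
import statistics


def summarize_timings(samples):
    """Return mean, sample standard deviation, and coefficient of variation."""
    if len(samples) < 2:
        # A single run has no spread to report; say so instead of printing 0.
        raise ValueError("need at least two repeated runs to estimate variability")
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)  # sample (n - 1) standard deviation
    return mean, std, std / mean


# Hypothetical wall-clock times per step (seconds) from three repeated runs.
mean, std, cv = summarize_timings([1.0, 1.1, 0.9])
print(f"{mean:.3f} s +/- {std:.3f} s (CV {cv:.1%})")
```

Reporting the coefficient of variation alongside each efficiency figure lets a reader judge whether a difference such as 85 versus 90 percent exceeds run-to-run noise.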

Circularity Check

0 steps flagged

No circularity: performance metrics are direct empirical measurements against external baselines

full rationale

The paper reports benchmarked runtime, scaling efficiency, and communication-cost reductions from explicit timing measurements on TSUBAME4.0 hardware. These quantities are obtained by running the implemented code and comparing wall-clock times to a separate Fortran reference implementation and to prior literature; no equations, fitted parameters, or self-referential definitions appear in the reported results. Validation statements simply assert statistical agreement with external studies without any derivation that reduces the speedup or scaling figures to the paper's own inputs by construction. The central claims therefore remain independent of any self-citation chain or definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The performance claims rest on the assumption that the implemented physics model is faithful to the referenced prior studies and that timing measurements isolate the new communication algorithm; no new physical constants or entities are introduced.

axioms (1)
  • Domain assumption: direct numerical simulation assumptions for incompressible Navier-Stokes flow with Lagrangian particles
    The model inherits standard DNS requirements for resolution and time-stepping from the multiphase turbulence literature.

pith-pipeline@v0.9.0 · 5596 in / 1344 out tokens · 79356 ms · 2026-05-10T16:25:33.635296+00:00 · methodology


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

  1. [1]

    Besard, T., Foket, C., and De Sutter, B.: Effective Extensible Programming: Unleashing Julia on GPUs, IEEE Transactions on Parallel and Distributed Systems, 30, 827–841, https://doi.org/10.1109/tpds.2018.2872064,

  2. [2]

    Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B.: Julia: A Fresh Approach to Numerical Computing, SIAM Review, 59, 65–98, https://doi.org/10.1137/141000671,

  3. [3]

    Bishnu, S., Strauss, R. R., and Petersen, M. R.: Comparing the Performance of Julia on CPUs versus GPUs and Julia-MPI versus Fortran-MPI: a case study with MPAS-Ocean (Version 7.1), Geoscientific Model Development, 16, 5539–5559, https://doi.org/10.5194/gmd-16-5539-2023,

  4. [4]

    Churavy, V.: KernelAbstractions.jl, https://doi.org/10.5281/zenodo.4021259.

    Harlow, F. H. and Welch, J. E.: Numerical Calculation of Time-Dependent Viscous Incompressible Flow of Fluid with Free Surface, Physics of Fluids, 8, 2182–2189, https://doi.org/10.1063/1.1761178,

  5. [5]

    Hirt, C. and Cook, J.: Calculating three-dimensional flows around structures and over rough terrain, Journal of Computational Physics, 10, 324–340, https://doi.org/10.1016/0021-9991(72)90070-8,

  6. [6]

    Simulations without gravitational effects, Journal of Fluid Mechanics, 796, 617–658, https://doi.org/10.1017/jfm.2016.238,

  7. [7]

    Morinishi, Y., Lund, T., Vasilyev, O., and Moin, P.: Fully Conservative Higher Order Finite Difference Schemes for Incompressible Flow, Journal of Computational Physics, 143, 90–124, https://doi.org/10.1006/jcph.1998.5962,

  8. [8]

    Onishi, R. and Seifert, A.: Reynolds-number dependence of turbulence enhancement on collision growth, Atmospheric Chemistry and Physics, 16, 12 441–12 455, https://doi.org/10.5194/acp-16-12441-2016,

  9. [9]

    Onishi, R., Baba, Y., and Takahashi, K.: Large-scale forcing with less communication in finite-difference simulations of stationary isotropic turbulence, Journal of Computational Physics, 230, 4088–4099, https://doi.org/10.1016/j.jcp.2011.02.034,

  10. [10]

    Onishi, R., Takahashi, K., and Vassilicos, J.: An efficient parallel simulation of interacting inertial particles in homogeneous isotropic turbulence, Journal of Computational Physics, 242, 809–827, https://doi.org/10.1016/j.jcp.2013.02.027,

  11. [11]

    Onishi, R., Matsuda, K., and Takahashi, K.: Lagrangian Tracking Simulation of Droplet Growth in Turbulence–Turbulence Enhancement of Autoconversion Rate, Journal of the Atmospheric Sciences, 72, 2591–2607, https://doi.org/10.1175/jas-d-14-0292.1,

  12. [12]

    Ramadhan, A., Wagner, G., Hill, C., Campin, J.-M., Churavy, V., Besard, T., Souza, A., Edelman, A., Ferrari, R., and Marshall, J.: Oceananigans.jl: Fast and friendly geophysical fluid dynamics on GPUs, Journal of Open Source Software, 5, 2018, https://doi.org/10.21105/joss.02018,

  13. [13]

    Direct numerical simulations, Journal of Fluid Mechanics, 335, 75–109, https://doi.org/10.1017/s0022112096004454,

  14. [14]

    Wang, L.-P., Wexler, A. S., and Zhou, Y.: Statistical mechanical description and modelling of turbulent collision of inertial particles, Journal of Fluid Mechanics, 415, 117–153, https://doi.org/10.1017/s0022112000008661,
