A Study on the Performance of Distributed Training of Data-driven CFD Simulations

Alejandro González-Barberá; Krzysztof Rojek; Paloma Barreda; Sergio Iserte

arxiv: 2604.27431 · v1 · submitted 2026-04-30 · 💻 cs.DC

Pith reviewed 2026-05-07 07:51 UTC · model grok-4.3

classification 💻 cs.DC

keywords distributed training · GPU computing · computational fluid dynamics · data-driven modeling · deep learning · time series forecasting · performance comparison

The pith

Distributed GPU training enables high-accuracy fluid state predictions in a fraction of traditional CFD solver time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares CPU-only, multi-GPU, and distributed training approaches for a time-series deep learning model that forecasts future states in a fluid simulation. It uses minor code adaptations to run the same model across these setups and measures both training speed and prediction accuracy. The central finding is that distributed GPU configurations deliver the fastest training while maintaining the high accuracy needed to serve as a practical substitute for full physics-based solving. This approach matters because traditional CFD calculations are slow and expensive to repeat, whereas a trained model can generate results much more quickly once the upfront training cost is paid.

Core claim

With some slight code adaptations, results show and compare, in different implementations, the benefits of distributed GPU-enabled training for predicting high-accuracy states in a fraction of the time needed by the computational fluid dynamics (CFD) solver.

What carries the argument

The time-series forecasting deep learning model trained under distributed GPU mode, which learns to map prior simulation states to future ones and thereby replaces repeated PDE solving.

If this is right

Once trained, the model generates future fluid states far more rapidly than the original solver can compute them.
Distributed training makes it feasible to handle larger datasets or more complex simulations that would otherwise exceed single-machine resources.
The same workflow can be applied to other time-dependent physics problems where data-driven surrogates are desired.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Modern deep learning frameworks have lowered the effort needed to move from single-node to distributed training, which may encourage more scientists to adopt these surrogate models.
The observed speedups could become even larger when scaling to bigger models or higher-resolution simulations common in industrial CFD.
Hybrid systems that use the trained model for most steps and fall back to the solver only for verification become more practical.

Load-bearing premise

The specific fluid simulation and chosen time-series deep learning model are representative of typical CFD tasks, and the minor code adaptations for each training mode do not create unfair differences in measured speed or accuracy.

What would settle it

Repeating the same experiment on a different fluid simulation or model architecture and finding that distributed GPU training yields no meaningful reduction in wall-clock time or that prediction accuracy falls below the level reported for the original CFD solver.

Figures

Figures reproduced from arXiv: 2604.27431 by Alejandro González-Barberá, Krzysztof Rojek, Paloma Barreda, Sergio Iserte.

**Figure 1.** Geometry of the reactor under study. Arrows represent the intended direction of the flow in the different areas of interest. Adjacent paper text: … approach. Then we provide interaction between AI and CFD solver for much faster analysis and reduced cost of trial & error experiments. The scope of this research includes quasi-steady state simulations, which use an iterative scheme to progress to convergence. Quasi-steady state mod…

**Figure 2.** RNN architecture. Next to the type of layer, the input and output data shapes of each layer are indicated. Adjacent paper text: Algorithm 1, Methods of the Generator. function INIT: Indexes ← array of samples per case (for this scenario all the cases have the same number of samples). Samples ← array of concatenated samples of all the cases. ON EPOCH END. function GET ITEM: Calculate sample indexes of the minibatch. Arrange samples…

**Figure 3.** Contour plot of the velocity magnitude vector field (U [m/s]) using either the conventional CFD solver (a) or AI-accelerated approach (b) after ten timesteps. Adjacent paper text: Spearman correlation, which assesses the monotonic relationship between two continuous or ordinal variables. The Pearson correlation varies from 0.98 for the 10th timestep to 1.0 for the converged state. The average Pearson correlation for all the…

**Figure 4.** Contour plot of the velocity magnitude vector field (U [m/s]) using either the conventional CFD solver (a) or AI-accelerated approach (b) after 20 timesteps.

**Figure 5.** Contour plot of the velocity magnitude vector field (U [m/s]) using either the conventional CFD solver (a) or AI-accelerated approach (b) in the converged state. Adjacent paper text: … (prediction errors). Residuals are a measure of how far from the regression line data points are. RMSE varies from 0.041 for the 10th timestep to 0.005 for the quasi-steady state. The average RMSE for all the validated timesteps is 0.023. Based …

**Figure 6.** Comparison of simulation results for the conventional CFD solver and AI-accelerated approach. Adjacent paper text: Distributed Learning. Another contribution of this work has been to evaluate the performance of training the model using many accelerators distributed in several nodes. For this purpose, this study leverages two different strategies which provide support for distributed training. While the first strategy presented …

**Figure 7.** Horovod deep neural network training time for different process configurations and layouts (nodes x processes).

**Figure 8.** Training time and speedups for different process configurations in Horovod. Adjacent paper text: … this regard, it looks like because of the nature of the problem it fits best in this configuration when using four nodes. Moreover, depending on our needs, we can determine a “sweet spot” configuration that achieves a balance between execution time and resources utilized. The “sweet spot” for this study may be set to the two-proces…
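The speedup and "sweet spot" reasoning behind Figures 7 and 8 can be sketched as a small calculation. This is a minimal illustration, not the authors' analysis: the `Run` class, the helper names, the 0.7 efficiency threshold, and all timings are hypothetical placeholders rather than values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One training configuration (hypothetical container, not from the paper)."""
    nodes: int
    procs_per_node: int
    seconds: float  # measured wall-clock training time

    @property
    def workers(self) -> int:
        return self.nodes * self.procs_per_node


def speedup(baseline: Run, run: Run) -> float:
    """How many times faster `run` is than `baseline`."""
    return baseline.seconds / run.seconds


def efficiency(baseline: Run, run: Run) -> float:
    """Speedup divided by the growth in worker count (1.0 means ideal scaling)."""
    return speedup(baseline, run) * baseline.workers / run.workers


def sweet_spot(baseline: Run, runs: list, min_efficiency: float = 0.7) -> Run:
    """Fastest run whose efficiency stays at or above `min_efficiency`."""
    acceptable = [r for r in runs if efficiency(baseline, r) >= min_efficiency]
    return min(acceptable, key=lambda r: r.seconds) if acceptable else baseline


# Placeholder timings, NOT measurements from the paper.
baseline = Run(nodes=1, procs_per_node=1, seconds=1000.0)
runs = [
    Run(nodes=1, procs_per_node=2, seconds=550.0),
    Run(nodes=2, procs_per_node=2, seconds=320.0),
    Run(nodes=4, procs_per_node=2, seconds=250.0),
]

for r in runs:
    print(f"{r.nodes}x{r.procs_per_node}: speedup={speedup(baseline, r):.2f}, "
          f"efficiency={efficiency(baseline, r):.2f}")
print("sweet spot:", sweet_spot(baseline, runs))
```

With these placeholder numbers the largest layout is the fastest but uses its workers least efficiently, so a threshold-based choice settles on a smaller layout. The threshold itself is a judgment call that depends on how scarce the resources are.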

Original abstract

Data-driven methods for computer simulations are blooming in many scientific areas. The traditional approach to simulating physical behaviors relies on solving partial differential equations (PDE). Since calculating these iterative equations is highly both computationally demanding and time-consuming, data-driven methods leverage artificial intelligence (AI) techniques to alleviate that workload. Data-driven methods have to be trained in advance to provide their subsequent fast predictions, however, the cost of the training stage is non-negligible. This paper presents a predictive model for inferencing future states of a specific fluid simulation that serves as a use case for evaluating different training alternatives. Particularly, this study compares the performance of only CPU, multiGPU, and distributed approaches for training a time series forecasting deep learning (DL) model. With some slight code adaptations, results show and compare, in different implementations, the benefits of distributed GPU-enabled training for predicting high-accuracy states in a fraction of the time needed by the computational fluid dynamics (CFD) solver.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This benchmarks distributed training speedups for a CFD time-series model but leaves accuracy equivalence across modes unverified.

This paper benchmarks different training setups for a time-series deep learning model that predicts future states in a fluid simulation. The key takeaway is that distributed GPU training delivers clear speedups over CPU-only or single-node setups, but the evidence that prediction accuracy stays high and equivalent across all modes is thin.

The new part is the application to this CFD use case. The authors take an existing time-series forecasting network, make slight code changes for CPU, multi-GPU, and distributed modes, and measure wall-clock training time. The results compare these to the time a traditional CFD solver would take, showing the data-driven approach can be much faster once trained.

It does well at addressing a real pain point: training these surrogate models is expensive, and distributed methods help. The comparisons are straightforward and give practical numbers that engineers could use when deciding on hardware for similar tasks.

The main soft spot is the accuracy verification. The claim rests on all modes producing models that give high-accuracy predictions, yet the description does not include specific error metrics like MSE on held-out data or checks that the errors are statistically the same. If the adaptations for distribution change how data is processed or how gradients are synced, convergence could suffer, and the speed advantage would not be free. This matches the stress-test concern, and without those numbers, the performance claims are harder to trust fully. Other details like the exact model size, dataset scale, and any baseline comparisons are not highlighted, which limits how far the findings generalize.

The work is incremental rather than introducing new algorithms, but the empirical focus is honest. This is the kind of paper that helps the scientific computing community scale their AI tools. Readers working on simulation surrogates or distributed training for physics problems would find the numbers useful. I recommend sending it for peer review. Referees can push for the missing accuracy tables and error bars, turning it into a more complete report.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical benchmarking study on training a time-series deep learning model to predict future states of a fluid simulation. It compares wall-clock training times and speedups across CPU-only, multi-GPU, and distributed multi-node GPU configurations, claiming that minor code adaptations enable distributed GPU training to deliver high-accuracy predictions in a fraction of the time required by traditional CFD solvers.

Significance. If the central claim holds with verified accuracy equivalence, the work would provide useful practical benchmarks for scaling data-driven surrogate models in scientific computing on HPC systems. It addresses a relevant pain point—the high cost of training DL models for CFD—by quantifying distributed-training benefits, which could inform adoption in large-scale simulation workflows. The empirical nature (no circular derivations) is a strength, but the absence of detailed accuracy metrics limits immediate impact.

major comments (2)

[Results] Results section: The paper reports timing and speedup results after 'slight code adaptations' for each training mode but provides no quantitative evidence (e.g., MSE, relative L2 error, or statistical tests) that prediction accuracy on held-out fluid states remains statistically indistinguishable across CPU, multi-GPU, and distributed modes. This equivalence is load-bearing for the claim that distributed training preserves 'high-accuracy' while reducing time.
[Experimental setup] Experimental setup / §3: The manuscript does not specify the exact model architecture (layers, units, time-series structure), dataset size (number of trajectories, time steps, train/test split), hyperparameter values, or how data sharding/batch synchronization differs across modes. Without these, it is impossible to evaluate whether the adaptations introduce systematic differences in convergence or fairness of the performance comparison.
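The accuracy-equivalence check requested in the first major comment could look roughly like the sketch below. It assumes each training mode yields predictions for the same held-out states. The function names, the mode labels, the array shapes, and the synthetic data are all hypothetical stand-ins, not results or code from the manuscript.

```python
import numpy as np


def mse(pred: np.ndarray, truth: np.ndarray) -> float:
    """Mean squared error over all cells and timesteps."""
    return float(np.mean((pred - truth) ** 2))


def relative_l2(pred: np.ndarray, truth: np.ndarray) -> float:
    """Norm of the error divided by the norm of the reference field."""
    return float(np.linalg.norm(pred - truth) / np.linalg.norm(truth))


def compare_modes(preds_by_mode: dict, truth: np.ndarray) -> dict:
    """Per-mode error metrics against the same held-out reference."""
    return {
        mode: {"mse": mse(pred, truth), "rel_l2": relative_l2(pred, truth)}
        for mode, pred in preds_by_mode.items()
    }


# Synthetic stand-in data: 20 held-out timesteps of a 64-cell field.
rng = np.random.default_rng(0)
truth = rng.normal(size=(20, 64))
preds = {
    "cpu": truth + 0.01 * rng.normal(size=truth.shape),
    "multi_gpu": truth + 0.01 * rng.normal(size=truth.shape),
    "distributed": truth + 0.01 * rng.normal(size=truth.shape),
}

report = compare_modes(preds, truth)
for mode, metrics in report.items():
    print(f"{mode:12s} mse={metrics['mse']:.2e} rel_l2={metrics['rel_l2']:.2e}")
```

Reporting the same metrics for every mode over several random seeds would also show whether any difference between modes exceeds the run-to-run variation.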

minor comments (2)

[Abstract] Abstract and introduction: 'High-accuracy' is used without a concrete threshold or reference to the error metric and baseline CFD solver error; adding a sentence with typical relative error values would clarify the claim.
Missing references to standard distributed training frameworks (e.g., Horovod, DeepSpeed, or PyTorch DDP) and prior work on distributed DL for scientific simulations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us identify areas where the manuscript can be improved for clarity and completeness. We address the major comments below and will revise the manuscript accordingly to incorporate the suggested enhancements.

Referee: [Results] Results section: The paper reports timing and speedup results after 'slight code adaptations' for each training mode but provides no quantitative evidence (e.g., MSE, relative L2 error, or statistical tests) that prediction accuracy on held-out fluid states remains statistically indistinguishable across CPU, multi-GPU, and distributed modes. This equivalence is load-bearing for the claim that distributed training preserves 'high-accuracy' while reducing time.

Authors: We thank the referee for highlighting this important point. Although the underlying model and training process are the same across configurations, with adaptations limited to enabling data parallelism for multi-GPU and distributed settings, the manuscript indeed lacks explicit accuracy metrics to demonstrate equivalence. To address this, we will include in the revised Results section quantitative accuracy measures such as mean squared error (MSE) and relative L2 error on a held-out test set for each training mode. We will also report any observed differences and perform basic statistical comparisons if variances are present. This addition will substantiate the 'high-accuracy' claim and confirm that distributed training does not compromise predictive performance. revision: yes
Referee: [Experimental setup] Experimental setup / §3: The manuscript does not specify the exact model architecture (layers, units, time-series structure), dataset size (number of trajectories, time steps, train/test split), hyperparameter values, or how data sharding/batch synchronization differs across modes. Without these, it is impossible to evaluate whether the adaptations introduce systematic differences in convergence or fairness of the performance comparison.

Authors: We agree that the experimental setup section requires more detail to ensure reproducibility and to allow proper evaluation of the comparisons. In the revised manuscript, we will expand §3 to provide: the full model architecture details including the number of layers, units, and the time-series forecasting structure (e.g., LSTM-based); dataset specifications such as the number of simulation trajectories, time steps per trajectory, and the train/test split; all hyperparameter settings including learning rate, batch size, number of epochs, and optimizer; and specifics on the distributed implementation, such as the framework used for distribution (e.g., PyTorch DDP), data sharding strategy, and batch synchronization mechanism. These additions will clarify that the core model remains unchanged and that any performance differences are attributable to the hardware and distribution setup rather than algorithmic variations. revision: yes
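The rebuttal's point that data parallelism leaves the core training step unchanged can be illustrated with a toy calculation. The sketch below uses plain NumPy and a linear model with an MSE loss, which is an assumption made for illustration and not the paper's network or its distributed framework. It shows that averaging gradients over equal-size shards reproduces the full-batch gradient.

```python
import numpy as np


def mse_gradient(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Gradient of mean((X @ w - y)**2) with respect to w."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)


def data_parallel_gradient(w, X, y, n_workers: int) -> np.ndarray:
    """Average per-shard gradients, mimicking an allreduce-mean step."""
    X_shards = np.split(X, n_workers)  # raises unless shards are equal-size
    y_shards = np.split(y, n_workers)
    grads = [mse_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    return np.mean(grads, axis=0)


# Toy data: 32 samples, 5 features (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
y = rng.normal(size=32)
w = rng.normal(size=5)

full = mse_gradient(w, X, y)
sharded = data_parallel_gradient(w, X, y, n_workers=4)
print("max abs difference:", np.max(np.abs(full - sharded)))
```

The equivalence holds only while the global batch is the same in both cases. If each worker keeps the single-device batch size, the effective batch grows with the worker count, and that is the situation where convergence and final accuracy can differ between modes.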

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with direct measurements

This paper performs an empirical comparison of wall-clock training times and prediction accuracy for a time-series DL model across CPU, multi-GPU, and distributed GPU setups on CFD fluid simulation data. No derivation chain, first-principles equations, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness claims exist. All reported results are direct experimental measurements (training duration, speedup factors, and accuracy on held-out states versus CFD ground truth) that stand independently of any internal reduction to inputs. Minor self-citations, if present, are not load-bearing for the central performance claims.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on empirical timing and accuracy measurements from a single use-case experiment rather than on new axioms, derivations, or postulated entities.

free parameters (1)

Model hyperparameters
Learning rate, batch size, and other training hyperparameters are typically tuned but are not specified or counted as free parameters in the provided abstract.

axioms (1)

Domain assumption: The chosen time-series DL architecture can achieve high-accuracy predictions on the fluid simulation data when trained with the tested configurations.
The claim of high-accuracy states assumes the model is appropriate and converges properly under the distributed setup.
