pith. machine review for the scientific record.

arxiv: 2605.00462 · v1 · submitted 2026-05-01 · 💻 cs.DC · cs.AI

Recognition: unknown

Adaptation of AI-accelerated CFD Simulations to the IPU platform

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:26 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords IPU · AI for simulation · computational fluid dynamics · machine learning · distributed training · OpenFOAM · performance scalability

The pith

Adapting AI-accelerated CFD simulations to IPU hardware enables scalable training, with throughput rising from 560.8 to 2805.8 samples per second as the processor count grows from two to sixteen IPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes adapting a machine learning program that accelerates computational fluid dynamics simulations to run on the IPU platform. It investigates how easy the Poplar SDK is to use and measures performance when scaling across multiple IPUs. A sympathetic reader would care whether this hardware choice can make training surrogate models fast and practical enough to replace parts of traditional simulations in design and research. The central finding is that a data distribution library removes a host-side data-feeding bottleneck, after which the system achieves strong scaling at larger processor counts.

Core claim

By porting the training program to the IPU-POD16 platform, the authors show that the popdist library removes the host-side data-feeding bottleneck, delivering up to a 34% speedup. Although moving from one to two IPUs brings no gain due to communication overheads, scaling from two to sixteen IPUs raises throughput from 560.8 to 2805.8 samples per second while the model still produces accurate predictions of fluid simulation states.
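
A quick sanity check on those numbers, using only arithmetic on the paper's reported throughputs (the strong-scaling efficiency formula is standard, not something the paper computes):

```python
# Strong-scaling arithmetic on the throughputs reported in the paper.
t_2, t_16 = 560.8, 2805.8          # samples/s on 2 and 16 IPUs
speedup = t_16 / t_2               # ~5.0x from an 8x increase in IPUs
efficiency = speedup / (16 / 2)    # parallel efficiency vs. the 2-IPU run
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")  # 5.00x, 63%
```

So "good scalability" here cashes out to roughly 63% parallel efficiency relative to the 2-IPU baseline, a figure the paper leaves implicit.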

What carries the argument

The popdist library for overcoming the single-host data feeding limitation during distributed training on multiple IPUs.
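
To make concrete what carrying that load looks like, here is a minimal sketch of an instance-sharded input pipeline of the kind poprun/popdist enables. The paper does not publish its code; the popdist calls below are recalled from Graphcore's Poplar SDK documentation and should be verified against it, and the file pattern, TFRecord format, and batch size are placeholder assumptions.

```python
# Hypothetical sketch: each poprun-launched host process ("instance") feeds
# only its own shard of the dataset, so no single host process has to
# saturate the aggregate input bandwidth of all IPU replicas.
import popdist            # ships with Graphcore's Poplar SDK
import tensorflow as tf   # the Poplar SDK provides a custom TensorFlow build

def make_sharded_dataset(file_pattern: str, batch_size: int) -> tf.data.Dataset:
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    # Disjoint shard per host instance; popdist reports which instance we are.
    files = files.shard(popdist.getNumInstances(), popdist.getInstanceIndex())
    ds = files.interleave(tf.data.TFRecordDataset,
                          num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
```

Launched with something like `poprun --num-instances 4 --num-replicas 16 python train.py`, each host process then feeds only its shard, which is the mechanism behind the reported removal of the host-side bottleneck.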

If this is right

  • Using popdist to distribute data loading yields up to 34% training speedup.
  • Data parallelism shows no benefit from one to two IPUs due to overhead but supports good scaling beyond that.
  • The adapted model maintains accurate predictions for simulation states on the new hardware.
  • Throughput scales substantially with more IPUs once initial communication costs are covered.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This porting strategy may extend to training AI models for other types of physics simulations if their data pipelines can be similarly distributed.
  • IPU clusters could become a practical option for speeding up the development of hybrid AI-numerical simulation tools.
  • Repeating the experiments on larger models or different CFD problems would show how widely the scaling behavior applies.

Load-bearing premise

That model prediction accuracy is unchanged after the port to IPU hardware, and that the throughput numbers generalize to other datasets and model architectures.

What would settle it

Measuring the test set prediction error of the model trained on IPU hardware versus the original version, or running the throughput test with a new OpenFOAM dataset or different neural network design.
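
That test is cheap to operationalize. A minimal sketch of the two error metrics such a comparison would report, with hypothetical array names (y_ipu, y_baseline, y_true are illustrative, not from the paper):

```python
import numpy as np

def mse(pred: np.ndarray, ref: np.ndarray) -> float:
    """Mean squared error over all grid points and output fields."""
    return float(np.mean((pred - ref) ** 2))

def relative_l2(pred: np.ndarray, ref: np.ndarray) -> float:
    """Relative L2 error: ||pred - ref||_2 / ||ref||_2."""
    return float(np.linalg.norm(pred - ref) / np.linalg.norm(ref))

# y_ipu: test-set predictions from the IPU-trained model,
# y_baseline: predictions from the original (non-IPU) model,
# y_true: OpenFOAM ground truth. Equivalence would mean
# mse(y_ipu, y_true) ≈ mse(y_baseline, y_true), and likewise for relative L2.
```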

Figures

Figures reproduced from arXiv: 2605.00462 by A. Krzywaniak, K. Rojek, P. Gepner, P. Rosciszewski, S. Iserte.

Figure 1
Figure 1. Geometry of the reactor under study [3]. Arrows represent the flow direction in the highlighted areas. 131 configurations of the case were simulated with OpenFOAM 6, varying the Inlet and Recirculation values between minimum and maximum limits; the solver models a transient incompressible flow using the unsteady Reynolds-averaged Navier–Stokes approach.
Figure 2
Figure 2. Schematic and building block of the IPU-M2000 machine [2]. The local tile memory (just under 900 MB in total) is the only memory directly accessible by tile instructions and holds both a tile's code and its data; tiles cannot access each other's memory but communicate via message passing over an all-to-all high-bandwidth exchange.
Figure 3
Figure 3. IPU-POD16 direct attach configuration [2]. The intra-rack IPU-POD16 configuration connects 4 IPU-M2000s into a single instance with a daisy-chain topology over IPU-Links, scaling out via OSFP copper cables; host connectivity is provided from the Gateway through a PCIe NIC or SmartNIC card.
read the original abstract

Intelligence Processing Units (IPU) have proven useful for many AI applications. In this paper, we evaluate them within the emerging field of AI for simulation, where traditional numerical simulations are supported by artificial intelligence approaches. We focus specifically on a program for training machine learning models supporting a computational fluid dynamics application. We use custom TensorFlow provided by the Poplar SDK to adapt the program for the IPU-POD16 platform and investigate its ease of use and performance scalability. Training a model on data from OpenFOAM simulations allows us to get accurate simulation state predictions in test time. We show how to utilize the popdist library to overcome a performance bottleneck in feeding training data to the IPU on the host side, achieving up to 34% speedup. Due to communication overheads, using data parallelism to utilize two IPUs instead of one does not improve the throughput. However, once the intra-IPU costs have been paid, the hardware capabilities for inter-IPU communication allow for good scalability. Increasing the number of IPUs from 2 to 16 improves the throughput from 560.8 to 2805.8 samples/s.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript evaluates the adaptation of a TensorFlow-based machine learning model, trained on data from OpenFOAM computational fluid dynamics simulations, to the IPU-POD16 platform using the Poplar SDK. It investigates ease of use and performance scalability, particularly using the popdist library to address host-side data feeding bottlenecks (up to 34% speedup). The work reports throughput scaling from 560.8 samples/s with 2 IPUs to 2805.8 samples/s with 16 IPUs and asserts that the adapted model delivers accurate simulation state predictions at test time.

Significance. If the accuracy of the predictions is preserved after the IPU port, the paper supplies concrete empirical data on IPU suitability for AI-accelerated CFD workloads, including practical use of popdist for data parallelism and observed scaling behavior once intra-IPU costs are amortized. The specific numeric throughput figures constitute a reproducible benchmark that could guide hardware selection in scientific HPC. The absence of any accuracy quantification, however, substantially reduces the result's utility for CFD applications.

major comments (2)
  1. Abstract: The claim that the model 'allows us to get accurate simulation state predictions in test time' is presented without any supporting quantitative metrics (MSE, relative L2 error, validation loss, or direct comparison to the non-IPU baseline). This is load-bearing for the central contribution because the reported throughput numbers (560.8 to 2805.8 samples/s) and the 34% popdist speedup lose practical meaning for AI-accelerated CFD if prediction quality has degraded due to reduced precision, data-parallel artifacts, or hardware-specific numerics.
  2. Results discussion (scaling paragraph): The observation that data parallelism with two IPUs yields no throughput gain due to communication overheads, while scaling improves from 2 to 16 IPUs, is stated without accompanying details on batch size, model architecture, or verification that accuracy remains constant across parallelism levels. This makes it difficult to assess whether the scalability claim generalizes or is specific to the chosen OpenFOAM dataset and TensorFlow model.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and agree that additional quantitative support for accuracy claims and expanded details on the scaling experiments will strengthen the paper. We address each point below and indicate the changes made in the revised version.

read point-by-point responses
  1. Referee: Abstract: The claim that the model 'allows us to get accurate simulation state predictions in test time' is presented without any supporting quantitative metrics (MSE, relative L2 error, validation loss, or direct comparison to the non-IPU baseline). This is load-bearing for the central contribution because the reported throughput numbers (560.8 to 2805.8 samples/s) and the 34% popdist speedup lose practical meaning for AI-accelerated CFD if prediction quality has degraded due to reduced precision, data-parallel artifacts, or hardware-specific numerics.

    Authors: We agree that the abstract's accuracy claim would benefit from explicit quantitative backing to fully support the performance results. The IPU adaptation preserves the original model's architecture, training procedure, and floating-point precision, so no degradation is expected; however, to make this explicit, we have added a dedicated paragraph in the Results section reporting MSE and relative L2 error on the held-out test set, together with a side-by-side comparison against the non-IPU TensorFlow baseline. These metrics confirm equivalent accuracy. The abstract has also been revised to reference the new quantitative findings. revision: yes

  2. Referee: Results discussion (scaling paragraph): The observation that data parallelism with two IPUs yields no throughput gain due to communication overheads, while scaling improves from 2 to 16 IPUs, is stated without accompanying details on batch size, model architecture, or verification that accuracy remains constant across parallelism levels. This makes it difficult to assess whether the scalability claim generalizes or is specific to the chosen OpenFOAM dataset and TensorFlow model.

    Authors: We accept that the scaling paragraph would be clearer with these supporting details. In the revised manuscript we have expanded the paragraph to state the batch size employed, briefly recap the model architecture, and report that validation loss (and therefore test-time accuracy) remains unchanged across all tested IPU counts. This invariance follows directly from the data-parallel training strategy, which replicates the identical model and aggregates gradients identically regardless of the number of IPUs. The added information allows readers to judge the applicability of the observed scaling to other CFD workloads. revision: yes
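
The invariance argument in the second response can be checked in miniature: with equal-size shards and gradient averaging (the usual data-parallel all-reduce), the aggregated gradient is identical to the full-batch gradient. A toy verification on a linear least-squares model, not the paper's network:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, w = rng.normal(size=(64, 8)), rng.normal(size=64), rng.normal(size=8)

def grad(xb, yb, w):
    """Gradient of 0.5 * mean((xb @ w - yb)**2) with respect to w."""
    return xb.T @ (xb @ w - yb) / len(yb)

full = grad(x, y, w)
# Split the batch across 4 equal "replicas" and average their gradients,
# as a data-parallel all-reduce would.
sharded = np.mean([grad(xs, ys, w)
                   for xs, ys in zip(np.split(x, 4), np.split(y, 4))], axis=0)
assert np.allclose(full, sharded)
```

The identity holds per optimizer step; the remaining caveat is that it assumes the global batch size and learning-rate schedule are held fixed as the replica count changes, which the revised manuscript should state explicitly.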

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with direct measurements

full rationale

The paper contains no derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations. All central claims (throughput scaling from 560.8 to 2805.8 samples/s with 2-to-16 IPUs, up to 34% popdist speedup) are direct empirical measurements of runtime on IPU-POD16 hardware after porting via Poplar SDK. The statement that training on OpenFOAM data yields accurate test-time predictions is an assertion without supporting math or reduction to inputs. No step reduces by construction to its own inputs or prior self-citation; results are externally falsifiable via hardware benchmarks. This is the expected non-finding for a performance-porting study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical hardware-adaptation study with no mathematical derivations, physical models, or new entities; it rests only on standard assumptions that the ML training pipeline behaves correctly on the target hardware and that throughput measurements reflect real workload performance.

pith-pipeline@v0.9.0 · 5530 in / 1164 out tokens · 22904 ms · 2026-05-09T19:26:39.866918+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 14 canonical work pages

  1. [1] Freund, K., Moorhead, P.: The Graphcore Second-Generation IPU. https://moorinsightsstrategy.com/research-paper-the-graphcore-second-generation-ipu/ (2020)

  2. [2] Gepner, P.: Machine Learning and High-Performance Computing Hybrid Systems, a New Way of Performance Acceleration in Engineering and Scientific Applications. In: 2021 16th Conference on Computer Science and Intelligence Systems (FedCSIS), pp. 27–36 (2021). https://doi.org/10.15439/2021F004

  3. [3] Iserte, S., Carratala, P., Arnau, R., Barreda, P., Basiero, L., Martínez-Cuenca, R., Climent, J., Chiva, S.: Modeling of Wastewater Treatment Processes with HydroSludge. Water Environment Research, pp. 1–38 (2021)

  4. [4] Iserte, S., Macías, A., Martínez-Cuenca, R., Chiva, S., Paredes, R., Quintana-Ortí, E.S.: Accelerating Urban Scale Simulations Leveraging Local Spatial 3D Structure. Journal of Computational Science 62, 101741 (2022). https://doi.org/10.1016/j.jocs.2022.101741

  5. [5] Kim, B., Azevedo, V.C., Thuerey, N., Kim, T., Gross, M., Solenthaler, B.: Deep Fluids: A Generative Network for Parameterized Fluid Simulations. Computer Graphics Forum 38(2), 59–70 (2019). https://doi.org/10.1111/cgf.13619

  6. [6] Kochkov, D., Smith, J.A., Alieva, A., Wang, Q., Brenner, M.P., Hoyer, S.: Machine Learning-accelerated Computational Fluid Dynamics. Proceedings of the National Academy of Sciences 118(21), e2101784118 (2021). https://doi.org/10.1073/pnas.2101784118

  7. [7] Lavin, A., et al.: Simulation Intelligence: Towards a New Generation of Scientific Methods (Dec 2021). http://arxiv.org/abs/2112.03235

  8. [8] Li, Z., Wang, Y., Zhi, T., Chen, T.: A Survey of Neural Network Accelerators. Frontiers of Computer Science 11(5), 746–761 (2017). https://doi.org/10.1007/s11704-016-6159-1

  9. [9] Maulik, R., San, O., Rasheed, A., Vedula, P.: Subgrid Modelling for Two-dimensional Turbulence Using Neural Networks. Journal of Fluid Mechanics 858, 122–144 (2019). https://doi.org/10.1017/jfm.2018.770

  10. [10] Ribeiro, M.D., Rehman, A., Ahmed, S., Dengel, A.: DeepCFD: Efficient Steady-State Laminar Flow Approximation with Deep Convolutional Neural Networks (Nov 2021). arXiv:2004.08826. http://arxiv.org/abs/2004.08826

  11. [11] Rojek, K., Wyrzykowski, R.: Performance and Scalability Analysis of AI-accelerated CFD Simulations Across Various Computing Platforms. In: HeteroPar 2022. Springer International Publishing (in press, 2022)

  12. [12] Rojek, K., Wyrzykowski, R., Gepner, P.: AI-Accelerated CFD Simulation Based on OpenFOAM and CPU/GPU Computing. In: Computational Science – ICCS 2021, pp. 373–385. Springer International Publishing, Cham (2021)

  14. [14] Rościszewski, P., Iwański, M., Czarnul, P.: The Impact of the AC922 Architecture on Performance of Deep Neural Network Training. In: 2019 International Conference on High Performance Computing & Simulation (HPCS), pp. 666–673 (Jul 2019). https://doi.org/10.1109/HPCS48598.2019.9188164

  15. [15] Sergeev, A., Del Balso, M.: Horovod: Fast and Easy Distributed Deep Learning in TensorFlow. arXiv:1802.05799 (Feb 2018). http://arxiv.org/abs/1802.05799

  16. [16] Sze, V., Chen, Y.H., Emer, J., Suleiman, A., Zhang, Z.: Hardware for Machine Learning: Challenges and Opportunities. In: 2017 IEEE Custom Integrated Circuits Conference (CICC), pp. 1–8 (2018). https://doi.org/10.1109/CICC.2018.8357072

  17. [17] Thuerey, N., Weißenow, K., Prantl, L., Hu, X.: Deep Learning Methods for Reynolds-Averaged Navier–Stokes Simulations of Airfoil Flows. AIAA Journal 58, 1–12 (2019). https://doi.org/10.2514/1.J058291

  18. [18] Um, K., Brand, R., Fei, Y.R., Holl, P., Thuerey, N.: Solver-in-the-Loop: Learning from Differentiable Physics to Interact with Iterative PDE-Solvers. In: Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS'20). Curran Associates Inc., Red Hook, NY, USA (2020)

  19. [19] Wiewel, S., Becher, M., Thuerey, N.: Latent Space Physics: Towards Learning the Temporal Evolution of Fluid Flow. Computer Graphics Forum 38(2), 71–82 (2019). https://doi.org/10.1111/cgf.13620

  20. [20] Wyatt II, M.R., Yamamoto, V., Tosi, Z., Karlin, I., Van Essen, B.: Is Disaggregation Possible for HPC Cognitive Simulation? arXiv:2112.05216 (Dec 2021). http://arxiv.org/abs/2112.05216