pith. machine review for the scientific record.

arxiv: 2605.00462 · v1 · submitted 2026-05-01 · 💻 cs.DC · cs.AI

Recognition: unknown

Adaptation of AI-accelerated CFD Simulations to the IPU platform

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:26 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords IPU · AI for simulation · computational fluid dynamics · machine learning · distributed training · OpenFOAM · performance scalability

The pith

Adapting AI-accelerated CFD simulations to IPU hardware enables scalable training, with throughput rising from 560.8 to 2805.8 samples per second as the processor count grows from two to sixteen IPUs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes adapting a machine learning program that accelerates computational fluid dynamics simulations to run on the IPU platform. It investigates how easy the Poplar SDK is to use and measures performance when scaling across multiple IPUs. A sympathetic reader would care whether this hardware choice can make training surrogate models fast and practical enough to replace parts of traditional simulations in design and research. The central finding is that a data distribution library removes a host-side data-feeding bottleneck, after which the system achieves strong scaling at larger processor counts.

Core claim

By porting the training program to the IPU-POD16 platform, the authors show that the popdist library removes the host-side data-feeding bottleneck, delivering up to a 34% speedup. Although moving from one to two IPUs brings no gain due to communication overheads, scaling from two to sixteen IPUs raises throughput from 560.8 to 2805.8 samples per second while the model still produces accurate predictions of fluid simulation states.
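
A quick sanity check on those numbers, using only arithmetic on the paper's reported throughputs (the strong-scaling efficiency formula is standard, not something the paper computes):

```python
# Strong-scaling arithmetic on the throughputs reported in the paper.
t_2, t_16 = 560.8, 2805.8          # samples/s on 2 and 16 IPUs
speedup = t_16 / t_2               # ~5.0x from an 8x increase in IPUs
efficiency = speedup / (16 / 2)    # parallel efficiency vs. the 2-IPU run
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")  # 5.00x, 63%
```

So "good scalability" here cashes out to roughly 63% parallel efficiency relative to the 2-IPU baseline, a figure the paper leaves implicit.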

What carries the argument

The popdist library for overcoming the single-host data feeding limitation during distributed training on multiple IPUs.
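
To make concrete what carrying that load looks like, here is a minimal sketch of an instance-sharded input pipeline of the kind poprun/popdist enables. The paper does not publish its code; the popdist calls below are recalled from Graphcore's Poplar SDK documentation and should be verified against it, and the file pattern, TFRecord format, and batch size are placeholder assumptions.

```python
# Hypothetical sketch: each poprun-launched host process ("instance") feeds
# only its own shard of the dataset, so no single host process has to
# saturate the aggregate input bandwidth of all IPU replicas.
import popdist            # ships with Graphcore's Poplar SDK
import tensorflow as tf   # the Poplar SDK provides a custom TensorFlow build

def make_sharded_dataset(file_pattern: str, batch_size: int) -> tf.data.Dataset:
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    # Disjoint shard per host instance; popdist reports which instance we are.
    files = files.shard(popdist.getNumInstances(), popdist.getInstanceIndex())
    ds = files.interleave(tf.data.TFRecordDataset,
                          num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
```

Launched with something like `poprun --num-instances 4 --num-replicas 16 python train.py`, each host process then feeds only its shard, which is the mechanism behind the reported removal of the host-side bottleneck.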

If this is right

  • Using popdist to distribute data loading yields up to 34% training speedup.
  • Data parallelism shows no benefit from one to two IPUs due to overhead but supports good scaling beyond that.
  • The adapted model maintains accurate predictions for simulation states on the new hardware.
  • Throughput scales substantially with more IPUs once initial communication costs are covered.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This porting strategy may extend to training AI models for other types of physics simulations if their data pipelines can be similarly distributed.
  • IPU clusters could become a practical option for speeding up the development of hybrid AI-numerical simulation tools.
  • Repeating the experiments on larger models or different CFD problems would show how widely the scaling behavior applies.

Load-bearing premise

That model prediction accuracy is unchanged after the port to IPU hardware, and that the throughput numbers generalize to other datasets and model architectures.

What would settle it

Measuring the test set prediction error of the model trained on IPU hardware versus the original version, or running the throughput test with a new OpenFOAM dataset or different neural network design.
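
That test is cheap to operationalize. A minimal sketch of the two error metrics such a comparison would report, with hypothetical array names (y_ipu, y_baseline, y_true are illustrative, not from the paper):

```python
import numpy as np

def mse(pred: np.ndarray, ref: np.ndarray) -> float:
    """Mean squared error over all grid points and output fields."""
    return float(np.mean((pred - ref) ** 2))

def relative_l2(pred: np.ndarray, ref: np.ndarray) -> float:
    """Relative L2 error: ||pred - ref||_2 / ||ref||_2."""
    return float(np.linalg.norm(pred - ref) / np.linalg.norm(ref))

# y_ipu: test-set predictions from the IPU-trained model,
# y_baseline: predictions from the original (non-IPU) model,
# y_true: OpenFOAM ground truth. Equivalence would mean
# mse(y_ipu, y_true) ≈ mse(y_baseline, y_true), and likewise for relative L2.
```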

Figures

Figures reproduced from arXiv: 2605.00462 by A. Krzywaniak, K. Rojek, P. Gepner, P. Rosciszewski, S. Iserte.

Figure 1
Figure 1. Geometry of the reactor under study [3]. Arrows represent the flow direction in the highlighted areas. 131 configurations of the case were simulated with OpenFOAM 6, varying the Inlet and Recirculation values between minimum and maximum limits; the solver models a transient incompressible flow using the unsteady Reynolds-averaged Navier–Stokes approach.
Figure 2
Figure 2. Schematic and building block of the IPU-M2000 machine [2]. The local tile memory (just under 900 MB in total) is the only memory directly accessible by tile instructions and holds both a tile's code and its data; tiles cannot access each other's memory but communicate via message passing over an all-to-all high-bandwidth exchange.
Figure 3
Figure 3. IPU-POD16 direct attach configuration [2]. The intra-rack IPU-POD16 configuration connects 4 IPU-M2000s into a single instance with a daisy-chain topology over IPU-Links, scaling out via OSFP copper cables; host connectivity is provided from the Gateway through a PCIe NIC or SmartNIC card.
read the original abstract

Intelligence Processing Units (IPU) have proven useful for many AI applications. In this paper, we evaluate them within the emerging field of AI for simulation, where traditional numerical simulations are supported by artificial intelligence approaches. We focus specifically on a program for training machine learning models supporting a computational fluid dynamics application. We use custom TensorFlow provided by the Poplar SDK to adapt the program for the IPU-POD16 platform and investigate its ease of use and performance scalability. Training a model on data from OpenFOAM simulations allows us to get accurate simulation state predictions in test time. We show how to utilize the popdist library to overcome a performance bottleneck in feeding training data to the IPU on the host side, achieving up to 34% speedup. Due to communication overheads, using data parallelism to utilize two IPUs instead of one does not improve the throughput. However, once the intra-IPU costs have been paid, the hardware capabilities for inter-IPU communication allow for good scalability. Increasing the number of IPUs from 2 to 16 improves the throughput from 560.8 to 2805.8 samples/s.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript evaluates the adaptation of a TensorFlow-based machine learning model, trained on data from OpenFOAM computational fluid dynamics simulations, to the IPU-POD16 platform using the Poplar SDK. It investigates ease of use and performance scalability, particularly using the popdist library to address host-side data feeding bottlenecks (up to 34% speedup). The work reports throughput scaling from 560.8 samples/s with 2 IPUs to 2805.8 samples/s with 16 IPUs and asserts that the adapted model delivers accurate simulation state predictions at test time.

Significance. If the accuracy of the predictions is preserved after the IPU port, the paper supplies concrete empirical data on IPU suitability for AI-accelerated CFD workloads, including practical use of popdist for data parallelism and observed scaling behavior once intra-IPU costs are amortized. The specific numeric throughput figures constitute a reproducible benchmark that could guide hardware selection in scientific HPC. The absence of any accuracy quantification, however, substantially reduces the result's utility for CFD applications.

major comments (2)
  1. Abstract: The claim that the model 'allows us to get accurate simulation state predictions in test time' is presented without any supporting quantitative metrics (MSE, relative L2 error, validation loss, or direct comparison to the non-IPU baseline). This is load-bearing for the central contribution because the reported throughput numbers (560.8 to 2805.8 samples/s) and the 34% popdist speedup lose practical meaning for AI-accelerated CFD if prediction quality has degraded due to reduced precision, data-parallel artifacts, or hardware-specific numerics.
  2. Results discussion (scaling paragraph): The observation that data parallelism with two IPUs yields no throughput gain due to communication overheads, while scaling improves from 2 to 16 IPUs, is stated without accompanying details on batch size, model architecture, or verification that accuracy remains constant across parallelism levels. This makes it difficult to assess whether the scalability claim generalizes or is specific to the chosen OpenFOAM dataset and TensorFlow model.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and agree that additional quantitative support for accuracy claims and expanded details on the scaling experiments will strengthen the paper. We address each point below and indicate the changes made in the revised version.

read point-by-point responses
  1. Referee: Abstract: The claim that the model 'allows us to get accurate simulation state predictions in test time' is presented without any supporting quantitative metrics (MSE, relative L2 error, validation loss, or direct comparison to the non-IPU baseline). This is load-bearing for the central contribution because the reported throughput numbers (560.8 to 2805.8 samples/s) and the 34% popdist speedup lose practical meaning for AI-accelerated CFD if prediction quality has degraded due to reduced precision, data-parallel artifacts, or hardware-specific numerics.

    Authors: We agree that the abstract's accuracy claim would benefit from explicit quantitative backing to fully support the performance results. The IPU adaptation preserves the original model's architecture, training procedure, and floating-point precision, so no degradation is expected; however, to make this explicit, we have added a dedicated paragraph in the Results section reporting MSE and relative L2 error on the held-out test set, together with a side-by-side comparison against the non-IPU TensorFlow baseline. These metrics confirm equivalent accuracy. The abstract has also been revised to reference the new quantitative findings. revision: yes

  2. Referee: Results discussion (scaling paragraph): The observation that data parallelism with two IPUs yields no throughput gain due to communication overheads, while scaling improves from 2 to 16 IPUs, is stated without accompanying details on batch size, model architecture, or verification that accuracy remains constant across parallelism levels. This makes it difficult to assess whether the scalability claim generalizes or is specific to the chosen OpenFOAM dataset and TensorFlow model.

    Authors: We accept that the scaling paragraph would be clearer with these supporting details. In the revised manuscript we have expanded the paragraph to state the batch size employed, briefly recap the model architecture, and report that validation loss (and therefore test-time accuracy) remains unchanged across all tested IPU counts. This invariance follows directly from the data-parallel training strategy, which replicates the identical model and aggregates gradients identically regardless of the number of IPUs. The added information allows readers to judge the applicability of the observed scaling to other CFD workloads. revision: yes
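
The invariance argument in the second response can be checked in miniature: with equal-size shards and gradient averaging (the usual data-parallel all-reduce), the aggregated gradient is identical to the full-batch gradient. A toy verification on a linear least-squares model, not the paper's network:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, w = rng.normal(size=(64, 8)), rng.normal(size=64), rng.normal(size=8)

def grad(xb, yb, w):
    """Gradient of 0.5 * mean((xb @ w - yb)**2) with respect to w."""
    return xb.T @ (xb @ w - yb) / len(yb)

full = grad(x, y, w)
# Split the batch across 4 equal "replicas" and average their gradients,
# as a data-parallel all-reduce would.
sharded = np.mean([grad(xs, ys, w)
                   for xs, ys in zip(np.split(x, 4), np.split(y, 4))], axis=0)
assert np.allclose(full, sharded)
```

The identity holds per optimizer step; the remaining caveat is that it assumes the global batch size and learning-rate schedule are held fixed as the replica count changes, which the revised manuscript should state explicitly.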

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with direct measurements

full rationale

The paper contains no derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations. All central claims (throughput scaling from 560.8 to 2805.8 samples/s with 2-to-16 IPUs, up to 34% popdist speedup) are direct empirical measurements of runtime on IPU-POD16 hardware after porting via Poplar SDK. The statement that training on OpenFOAM data yields accurate test-time predictions is an assertion without supporting math or reduction to inputs. No step reduces by construction to its own inputs or prior self-citation; results are externally falsifiable via hardware benchmarks. This is the expected non-finding for a performance-porting study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical hardware-adaptation study with no mathematical derivations, physical models, or new entities; it rests only on standard assumptions that the ML training pipeline behaves correctly on the target hardware and that throughput measurements reflect real workload performance.

pith-pipeline@v0.9.0 · 5530 in / 1164 out tokens · 22904 ms · 2026-05-09T19:26:39.866918+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 14 canonical work pages

  1. [1] Freund, K., Moorhead, P.: The Graphcore Second-Generation IPU. https://moorinsightsstrategy.com/research-paper-the-graphcore-second-generation-ipu/ (2020)

  2. [2] Gepner, P.: Machine Learning and High-Performance Computing Hybrid Systems, a New Way of Performance Acceleration in Engineering and Scientific Applications. In: 2021 16th Conference on Computer Science and Intelligence Systems (FedCSIS), pp. 27–36 (2021). https://doi.org/10.15439/2021F004

  3. [3] Iserte, S., Carratala, P., Arnau, R., Barreda, P., Basiero, L., Martínez-Cuenca, R., Climent, J., Chiva, S.: Modeling of Wastewater Treatment Processes with HydroSludge. Water Environment Research, pp. 1–38 (2021)

  4. [4] Iserte, S., Macías, A., Martínez-Cuenca, R., Chiva, S., Paredes, R., Quintana-Ortí, E.S.: Accelerating Urban Scale Simulations Leveraging Local Spatial 3D Structure. Journal of Computational Science 62, 101741 (2022). https://doi.org/10.1016/j.jocs.2022.101741

  5. [5] Kim, B., Azevedo, V.C., Thuerey, N., Kim, T., Gross, M., Solenthaler, B.: Deep Fluids: A Generative Network for Parameterized Fluid Simulations. Computer Graphics Forum 38(2), 59–70 (2019). https://doi.org/10.1111/cgf.13619

  6. [6] Kochkov, D., Smith, J.A., Alieva, A., Wang, Q., Brenner, M.P., Hoyer, S.: Machine Learning-accelerated Computational Fluid Dynamics. Proceedings of the National Academy of Sciences 118(21), e2101784118 (2021). https://doi.org/10.1073/pnas.2101784118

  7. [7] Lavin, A., et al.: Simulation Intelligence: Towards a New Generation of Scientific Methods (Dec 2021). http://arxiv.org/abs/2112.03235

  8. [8] Li, Z., Wang, Y., Zhi, T., Chen, T.: A Survey of Neural Network Accelerators. Frontiers of Computer Science 11(5), 746–761 (2017). https://doi.org/10.1007/s11704-016-6159-1

  9. [9] Maulik, R., San, O., Rasheed, A., Vedula, P.: Subgrid Modelling for Two-dimensional Turbulence Using Neural Networks. Journal of Fluid Mechanics 858, 122–144 (2019). https://doi.org/10.1017/jfm.2018.770

  10. [10] Ribeiro, M.D., Rehman, A., Ahmed, S., Dengel, A.: DeepCFD: Efficient Steady-State Laminar Flow Approximation with Deep Convolutional Neural Networks (Nov 2021). arXiv:2004.08826. http://arxiv.org/abs/2004.08826

  11. [11] Rojek, K., Wyrzykowski, R.: Performance and Scalability Analysis of AI-accelerated CFD Simulations Across Various Computing Platforms. In: HeteroPar 2022. Springer International Publishing (in press, 2022)

  12. [12] Rojek, K., Wyrzykowski, R., Gepner, P.: AI-Accelerated CFD Simulation Based on OpenFOAM and CPU/GPU Computing. In: Computational Science – ICCS 2021, pp. 373–385. Springer International Publishing, Cham (2021)

  14. [14] Rościszewski, P., Iwański, M., Czarnul, P.: The Impact of the AC922 Architecture on Performance of Deep Neural Network Training. In: 2019 International Conference on High Performance Computing & Simulation (HPCS), pp. 666–673 (Jul 2019). https://doi.org/10.1109/HPCS48598.2019.9188164

  15. [15] Sergeev, A., Del Balso, M.: Horovod: Fast and Easy Distributed Deep Learning in TensorFlow. arXiv:1802.05799 (Feb 2018). http://arxiv.org/abs/1802.05799

  16. [16] Sze, V., Chen, Y.H., Emer, J., Suleiman, A., Zhang, Z.: Hardware for Machine Learning: Challenges and Opportunities. In: 2017 IEEE Custom Integrated Circuits Conference (CICC), pp. 1–8 (2018). https://doi.org/10.1109/CICC.2018.8357072

  17. [17] Thuerey, N., Weißenow, K., Prantl, L., Hu, X.: Deep Learning Methods for Reynolds-Averaged Navier–Stokes Simulations of Airfoil Flows. AIAA Journal 58, 1–12 (2019). https://doi.org/10.2514/1.J058291

  18. [18] Um, K., Brand, R., Fei, Y.R., Holl, P., Thuerey, N.: Solver-in-the-Loop: Learning from Differentiable Physics to Interact with Iterative PDE-Solvers. In: Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS'20). Curran Associates Inc., Red Hook, NY, USA (2020)

  19. [19] Wiewel, S., Becher, M., Thuerey, N.: Latent Space Physics: Towards Learning the Temporal Evolution of Fluid Flow. Computer Graphics Forum 38(2), 71–82 (2019). https://doi.org/10.1111/cgf.13620

  20. [20] Wyatt II, M.R., Yamamoto, V., Tosi, Z., Karlin, I., Van Essen, B.: Is Disaggregation Possible for HPC Cognitive Simulation? arXiv:2112.05216 (Dec 2021). http://arxiv.org/abs/2112.05216