Dynamic Load Balancing for Uncertainty Quantification with Applications in Bayesian Inversion
Pith reviewed 2026-06-25 20:02 UTC · model grok-4.3
The pith
A dynamic load balancer distributes heterogeneous sampling tasks in uncertainty quantification without prior workload assumptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The load balancer is effective at distributing the sampling requests with an average node idle time of close to a millisecond, while not making any prior assumptions about the workload.
What carries the argument
The dynamic load balancer that assigns tasks on the fly according to observed runtimes and loose dependencies.
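As a minimal sketch of what "on the fly" assignment means in general, the toy simulation below compares a pull-based dispatcher with a fixed round-robin partition. It is not the UM-Bridge implementation: the function names, the two-worker setup, and the task durations are all hypothetical and chosen only for illustration. Durations are used to advance the simulated clock, never to decide where a task goes.

```python
import heapq

def dynamic_makespan(durations, n_workers):
    """Give each task, in arrival order, to the worker that becomes free first.

    The dispatch decision never looks at a task's duration; it only reacts to
    the simulated moment at which a worker finishes its current task.
    """
    free_at = [(0.0, w) for w in range(n_workers)]  # (time worker is free, worker id)
    heapq.heapify(free_at)
    for d in durations:
        t, w = heapq.heappop(free_at)        # earliest-free worker pulls the next task
        heapq.heappush(free_at, (t + d, w))  # the duration only advances the clock
    return max(t for t, _ in free_at)

def static_makespan(durations, n_workers):
    """Round-robin partition that is fixed before any task runs."""
    loads = [0.0] * n_workers
    for i, d in enumerate(durations):
        loads[i % n_workers] += d
    return max(loads)

# Hypothetical workload: two expensive tasks mixed with cheap ones.
durations = [100.0, 1.0, 100.0, 1.0, 1.0, 1.0, 1.0, 1.0]
dynamic = dynamic_makespan(durations, 2)
static = static_makespan(durations, 2)
print(f"dynamic: {dynamic}, static: {static}")
```

In this toy case the round-robin split happens to place both expensive tasks on the same worker, which is the kind of imbalance a reactive scheduler avoids without needing a cost model.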
If this is right
- Heterogeneous UQ tasks can be scheduled efficiently on HPC systems without static partitioning.
- Multilevel simulations whose runtimes span orders of magnitude remain practical under dynamic assignment.
- Bayesian inversion via multilevel sampling incurs minimal idle time on distributed hardware.
- Language-agnostic coupling of UQ software to simulations benefits from runtime-only balancing.
Where Pith is reading between the lines
- The same reactive balancing principle could apply to other UQ methods whose task costs are hard to predict in advance.
- Hardware utilization gains might allow larger ensemble sizes in existing Bayesian inversion studies.
- Similar on-the-fly assignment could be tested in optimization or ensemble Kalman filter workflows with comparable heterogeneity.
Load-bearing premise
The dependencies between tasks are loose enough that the balancer can handle them by reacting to observed runtimes alone.
What would settle it
Measure average node idle time on the same multilevel sampling workload; if it rises well above one millisecond while the balancer still receives no workload model, the central claim is falsified.
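One way such a check could be scripted is sketched below. The abstract does not define the idle-time metric, so the definition used here (the gap between one task ending and the next task starting on the same node) is an assumption, as are the log format and the example timestamps.

```python
from collections import defaultdict
from statistics import mean

def idle_gaps(records):
    """Return every gap between consecutive tasks on the same node.

    records: iterable of (node_id, start_time, end_time), times in seconds.
    Assumed metric: idle time = next task's start minus previous task's end.
    """
    by_node = defaultdict(list)
    for node, start, end in records:
        by_node[node].append((start, end))
    gaps = []
    for intervals in by_node.values():
        intervals.sort()  # order each node's tasks by start time
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            gaps.append(max(0.0, next_start - prev_end))
    return gaps

# Hypothetical log for two nodes; the numbers are invented for illustration.
log = [
    ("n0", 0.000, 1.000), ("n0", 1.001, 3.000),
    ("n1", 0.000, 0.500), ("n1", 0.502, 0.900), ("n1", 0.903, 2.000),
]
gaps = idle_gaps(log)
mean_idle_ms = 1000 * mean(gaps)
print(f"{len(gaps)} gaps, mean idle time {mean_idle_ms:.3f} ms")
```

Reporting the spread of the gaps alongside the mean, for example with `statistics.stdev`, would make the comparison against the one-millisecond figure more informative.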
Original abstract
Uncertainty Quantification (UQ) workflows present a particular scheduling challenge in high performance computing environments, as they typically generate large numbers of heterogeneous model evaluations with loose but non-trivial dependencies between tasks. A static one-size-fits-all approach in traditional schedulers is inadequate to handle heterogeneous tasks optimally. We introduce an improved load balancer in the UQ and Modelling Bridge (UM-Bridge) framework aimed at mitigating these issues; UM-Bridge is a language-agnostic interface developed to couple UQ software with numerical simulation. As a realistic example, we test the load balancer with a Bayesian inverse problem solved via multilevel delayed acceptance sampling. The underlying forward problem is a hierarchy of tsunami simulations enabled through ExaHyPE, whose runtimes span several orders of magnitude and loose dependencies between levels make the workload particularly challenging to schedule. Our results indicate the load balancer is effective at distributing the sampling requests with an average node idle time of close to a millisecond, while not making any prior assumptions about the workload.
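To illustrate why delayed acceptance sampling produces a heterogeneous workload, the sketch below implements the basic two-level delayed acceptance scheme that multilevel variants build on. It is not the paper's sampler: the two Gaussian log-densities are hypothetical stand-ins for a coarse and a fine simulation, and the step size and seed are arbitrary. The point is only that the expensive model is evaluated for a subset of proposals, decided at runtime.

```python
import math
import random

def log_coarse(x):
    """Cheap surrogate (hypothetical): unnormalised N(0.2, 1.2^2)."""
    return -0.5 * ((x - 0.2) / 1.2) ** 2

def log_fine(x):
    """Expensive target (hypothetical): unnormalised N(0, 1)."""
    return -0.5 * x * x

def delayed_acceptance(n_steps, step=1.0, seed=0):
    rng = random.Random(seed)
    x, lc, lf = 0.0, log_coarse(0.0), log_fine(0.0)
    samples, n_fine = [], 0
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step)  # symmetric random-walk proposal
        lc_y = log_coarse(y)
        # Stage 1: screen the proposal using the coarse model only.
        if math.log(1.0 - rng.random()) < lc_y - lc:
            lf_y = log_fine(y)  # fine model runs only if stage 1 passes
            n_fine += 1
            # Stage 2: correct for the mismatch between coarse and fine models.
            if math.log(1.0 - rng.random()) < (lf_y - lf) - (lc_y - lc):
                x, lc, lf = y, lc_y, lf_y
        samples.append(x)
    return samples, n_fine

samples, n_fine = delayed_acceptance(2000)
print(f"coarse evaluations: {len(samples)}, fine evaluations: {n_fine}")
```

Because the number of fine evaluations depends on random stage-1 outcomes, a scheduler cannot know in advance how many expensive requests will arrive or when.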
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an improved dynamic load balancer within the UM-Bridge framework for uncertainty quantification workflows. These workflows involve large numbers of heterogeneous model evaluations with loose dependencies, which static schedulers handle poorly. The balancer is tested on a Bayesian inverse problem solved via multilevel delayed acceptance sampling, where the forward model is a hierarchy of tsunami simulations in ExaHyPE with runtimes spanning orders of magnitude. The central empirical result is that the balancer distributes sampling requests effectively, achieving an average node idle time of close to one millisecond while making no prior assumptions about the workload.
Significance. If the central empirical result holds under scrutiny, the work offers a practical demonstration of runtime-adaptive scheduling for UQ applications on HPC systems without requiring a priori workload models. The choice of a multilevel tsunami simulation with loose inter-level dependencies as the test case provides a concrete, challenging example that strengthens the applicability claim. The language-agnostic nature of UM-Bridge is a further asset for coupling diverse UQ and simulation codes.
major comments (2)
- [Abstract] The reported average node idle time of close to a millisecond is presented as the primary quantitative evidence of effectiveness, yet the abstract supplies no error bars, standard deviation, number of nodes or runs, baseline comparisons against static or other dynamic schedulers, or definition of how idle time was measured. This measurement detail is load-bearing for the central claim.
- [Abstract (final paragraph)] The manuscript states that the balancer manages loose but non-trivial dependencies between tasks in the multilevel hierarchy on the fly without any workload model. However, no section details the exact runtime information (e.g., task completion signals, queue monitoring) used by the balancer or how dependency resolution is performed dynamically; this information is required to substantiate that no implicit assumptions are encoded.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript accordingly to strengthen the presentation of our results.
Point-by-point responses
- Referee: [Abstract] The reported average node idle time of close to a millisecond is presented as the primary quantitative evidence of effectiveness, yet the abstract supplies no error bars, standard deviation, number of nodes or runs, baseline comparisons against static or other dynamic schedulers, or definition of how idle time was measured. This measurement detail is load-bearing for the central claim.
  Authors: We agree that the abstract should be more self-contained to support the central empirical claim. In the revised version we will expand the abstract to report the number of nodes and independent runs, include error bars or standard deviation on the idle-time figure, provide a concise definition of the idle-time metric, and briefly reference the static-scheduler baseline comparison that appears in the results section.
  Revision: yes
- Referee: [Abstract (final paragraph)] The manuscript states that the balancer manages loose but non-trivial dependencies between tasks in the multilevel hierarchy on the fly without any workload model. However, no section details the exact runtime information (e.g., task completion signals, queue monitoring) used by the balancer or how dependency resolution is performed dynamically; this information is required to substantiate that no implicit assumptions are encoded.
  Authors: We acknowledge that the current manuscript does not provide an explicit description of the runtime signals and dependency-resolution mechanism. We will add a dedicated subsection (likely in the Implementation or Methods section) that details the task-completion signals, queue-monitoring approach, and on-the-fly dependency handling used by the dynamic balancer, thereby clarifying that no a-priori workload model is required.
  Revision: yes
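For readers unfamiliar with the terms in this exchange, the sketch below shows one generic way completion signals and a ready queue can resolve dependencies at runtime. It is an assumption-laden illustration, not a description of the balancer in the paper, whose mechanism the manuscript reportedly does not yet document. The task names, durations, and the `run` function are hypothetical.

```python
import heapq
from collections import deque

def run(tasks, n_workers):
    """Simulate dependency-aware dispatch driven by completion signals.

    tasks: {name: (duration, [names of tasks it depends on])}
    Returns (order in which tasks finish, total simulated time).
    """
    waiting = {name: set(deps) for name, (_, deps) in tasks.items()}
    dependents = {name: [] for name in tasks}
    for name, (_, deps) in tasks.items():
        for dep in deps:
            dependents[dep].append(name)
    ready = deque(name for name, deps in waiting.items() if not deps)
    running = []  # heap of (finish_time, task name)
    idle = n_workers
    now, order = 0.0, []
    while ready or running:
        # Hand ready tasks to idle workers; durations are not consulted here.
        while ready and idle:
            name = ready.popleft()
            heapq.heappush(running, (now + tasks[name][0], name))
            idle -= 1
        # Completion signal: the earliest running task finishes.
        now, done = heapq.heappop(running)
        idle += 1
        order.append(done)
        # Release any task whose dependencies are now all satisfied.
        for child in dependents[done]:
            waiting[child].discard(done)
            if not waiting[child]:
                ready.append(child)
    return order, now

# Hypothetical two-level workload: each fine run waits for one coarse run.
tasks = {
    "coarse_1": (1.0, []),
    "coarse_2": (1.0, []),
    "fine_1": (10.0, ["coarse_1"]),
    "fine_2": (10.0, ["coarse_2"]),
}
order, makespan = run(tasks, n_workers=2)
print(order, makespan)
```

In this sketch the only runtime information the dispatcher uses is which tasks have finished; whether the actual balancer relies on more than that is exactly what the referee asks the authors to document.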
Circularity Check
No significant circularity; empirical measurement of scheduler performance
The paper introduces a dynamic load balancer in the UM-Bridge framework and evaluates it experimentally on a multilevel tsunami simulation for Bayesian inversion. The central claim rests on runtime measurements (average node idle time near 1 ms) obtained without prior workload assumptions. No derivation chain, equations, or predictions are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The result is a direct empirical observation on a concrete application and does not invoke uniqueness theorems, ansatzes smuggled via citation, or renaming of known results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: UM-Bridge supplies a language-agnostic interface that correctly couples UQ software to numerical simulators.