
arxiv: 2512.09060 · v2 · submitted 2025-12-09 · 📊 stat.CO · stat.ML

All Emulators are Wrong, Many are Useful, and Some are More Useful Than Others: A Reproducible Comparison of Computer Model Surrogates

Pith reviewed 2026-05-16 23:03 UTC · model grok-4.3

classification 📊 stat.CO stat.ML
keywords emulators · surrogate modeling · benchmarking · reproducibility · computer experiments · R package · statistical emulation · simulation studies

The pith

A reproducible comparison of 29 emulators shows clear differences in usefulness across test functions and real datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes performance patterns among state-of-the-art emulators by running all of them under identical conditions on 60 canonical test functions plus 40 real emulation datasets. It introduces the duqling R package to enforce consistent syntax, automatic scaling, and full reproducibility so that rankings and diagnostics can be trusted and extended. A sympathetic reader cares because surrogate models stand in for expensive computer simulations in science and engineering, and the wrong choice wastes resources while the right one improves accuracy and speed. The results give concrete guidance on when particular methods excel instead of declaring any universal winner.

Core claim

By applying the duqling framework to standardize inputs, outputs, and evaluation, the study produces detailed empirical profiles of 29 emulators that reveal systematic strengths and weaknesses, allowing practitioners to match methods to problem type rather than relying on general claims.

What carries the argument

The duqling R package, which supplies unified syntax and automatic internal scaling for running reproducible emulator comparison studies.
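
The idea of a unified interface with internal input scaling can be illustrated with a minimal Python sketch. This is not duqling's API: `fit_unit_scaler`, `scaled_emulator`, and the toy `nearest_neighbor_fit` emulator are hypothetical names invented for illustration, and the assumed contract (a `fit(X, y)` function that returns a `predict(X_new)` function) is an assumption rather than anything the paper specifies.

```python
# Hypothetical sketch of "unified syntax + automatic internal scaling".
# None of these names come from duqling; they are illustrative only.

def fit_unit_scaler(X):
    """Return a function that maps each input column onto [0, 1]."""
    cols = list(zip(*X))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # guard constant columns

    def scale(rows):
        return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)]
                for row in rows]

    return scale


def scaled_emulator(fit):
    """Wrap fit(X, y) -> predict(X_new) so that scaling happens internally."""
    def wrapped_fit(X, y):
        scale = fit_unit_scaler(X)
        predict = fit(scale(X), y)
        return lambda X_new: predict(scale(X_new))

    return wrapped_fit


def nearest_neighbor_fit(X, y):
    """Toy emulator: return the response of the closest training point."""
    def predict(X_new):
        out = []
        for row in X_new:
            dists = [sum((a - b) ** 2 for a, b in zip(row, xi)) for xi in X]
            out.append(y[dists.index(min(dists))])
        return out

    return predict


# Two inputs on very different scales.
X_train = [[0.0, 100.0], [1.0, 300.0], [2.0, 200.0]]
y_train = [1.0, 2.0, 3.0]
X_new = [[0.1, 190.0]]

raw_prediction = nearest_neighbor_fit(X_train, y_train)(X_new)
scaled_prediction = scaled_emulator(nearest_neighbor_fit)(X_train, y_train)(X_new)
```

In this toy case the unscaled emulator is dominated by the second input and picks a different neighbor than the scaled one, which is the kind of inconsistency that uniform internal scaling is meant to remove from a comparison.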

Load-bearing premise

The 60 test functions and 40 real datasets capture enough variety that the observed performance differences generalize to other emulation problems.

What would settle it

A new collection of test problems and datasets, chosen independently, that produces substantially different relative rankings among the same 29 emulators.
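
One way to make "substantially different relative rankings" concrete is a rank correlation between the orderings obtained on the two suites. The sketch below computes Spearman's correlation in plain Python; the emulator labels and scores are made-up placeholders, not results from the paper, and the choice of Spearman's correlation is an assumption about how one might run this check.

```python
def ranks(values):
    """Rank values (1 = smallest), averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


# Placeholder mean scores (lower is better) on two hypothetical suites.
original_suite = {"emulator_a": 0.02, "emulator_b": 0.05,
                  "emulator_c": 0.11, "emulator_d": 0.30}
independent_suite = {"emulator_a": 0.04, "emulator_b": 0.03,
                     "emulator_c": 0.20, "emulator_d": 0.25}

names = sorted(original_suite)
rho = spearman([original_suite[m] for m in names],
               [independent_suite[m] for m in names])
```

A value of `rho` near 1 would indicate that the rankings carry over to the new suite; a value near 0 or below would count against the load-bearing premise.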

Figures

Figures reproduced from arXiv: 2512.09060 by Devin Francom, Graham C. Gibson, Kellin N. Rumsey, Reid Morris.

Figure 1
Figure 1: Cumulative rank-plot for n = 1000, NSR = 0 setting. The curve for each emulator shows the proportion of cases that the method was at least top r out of 29 in terms of CRPS, for r = 1, …, 29. Axes: rank or better (x), % of scenarios (y). Legend (Method): bppr, tbass, bass, qbass, rgasp, bart, hetgp, gp, svecgp, fitcgp, bcmgp, apce, deepgp, rvm, btreelm, bootrf, rffgp, lagp, confrf, ngboost, spce, bnn, bcart, blm, alcgp, mpgp, …
Figure 2
Figure 2: Cumulative rank-plot for n = 1000, NSR = 0.1 setting. The curve for each emulator shows the proportion of cases that the method was at least top r out of 29 in terms of CRPS, for r = 1, …, 29.
Surrounding text (Section 4, Results): In this section, we compare the results of 29 emulators across 60 synthetic test functions and 40 real-world simulation datasets. Given the scope of the study, the analysis here is necessarily high-lev…
Figure 3
Figure 3: A heatmap of CRPS (truncated between 0.001 and 1.0 for visual simplicity) for the synthetic test functions in the n = 1000, NSR = 0 setting.
Surrounding text: … we demonstrate how duqling can be used to conduct more granular, targeted investigations. All performance metrics are computed on out-of-sample test sets. For synthetic test functions, we use maximin Latin hypercube designs with n_test = 1000 [McKay et al., 1979]. For…
Figure 4
Figure 4: A heatmap of CRPS (truncated between 0.001 and 1.0 for visual simplicity) for the synthetic test functions in the n = 1000, NSR = 0.1 setting.
Surrounding text:
• n = 1000, NSR = 0.1, 10 replications, 60 test functions. Excluded: TREEGP.
• n = 500, NSR = 0, 10 replications, 60 test functions. All emulators.
• n = 500, NSR = 0.1, 10 replications, 60 test functions. All emulators (supplement only).
• n = 5000, NSR = 0, 10 rep…
Figure 5
Figure 5: Pareto frontier of emulators based on average rela…
Figure 6
Figure 6: Cumulative rank-plot for n = 500, NSR = 0 setting. The curve for each emulator shows the proportion of cases that the method was at least top r out of 29 in terms of CRPS, for r = 1, …, 29. Axes: rank or better (x), % of scenarios (y). Legend (Method): svecgp, tbass, bass, bppr, qbass, bcmgp, bart, apce, lagp, spce, confrf, rvm, btreelm, bootrf, fitcgp, mpgp, rffgp, alcgp, bnn, ngboost, bcart, blm, blasso, baselin…
Figure 7
Figure 7: Cumulative rank-plot for n = 5000, NSR = 0 setting. The curve for each emulator shows the proportion of cases that the method was at least top r out of 29 in terms of CRPS, for r = 1, …, 29.
Surrounding text: … noise-free setting. Due to runtime costs and practical relevance, we omit the noisy case here; additional figures for this regime are available in the supplement.
Figure 8
Figure 8: Cumulative rank-plot for the real datasets wit…
Figure 9
Figure 9: Cumulative rank-plot for the real datasets wit…
Figure 10
Figure 10: Pareto frontier of emulators based on average rel…
Figure 11
Figure 11: A heatmap of CRPS (truncated between 0.001 and 1.0 for visual simplicity) for all of the real datasets. Grey boxes indicate emulator / dataset pairs that were not attempted for computational purposes.
Surrounding text (Section 5, Discussion and Analysis; 5.1, There is no free lunch (or best emulator)): In uncertainty quantification, it is common to ask: which emulator is the best? But this is rarely the most productive question to ask. …
Figure 12
Figure 12: A heatmap of CRPS (truncated between 0.001 and 1.0 for visual simplicity) for all of the real datasets. Grey boxes indicate emulator / dataset pairs that were not attempted for computational purposes.
Surrounding text: … catch-all emulator, it was designed to be effective for challenging nonstationary emulation problems where sophisticated tools are required and, in these settings, it does exceptionally well. On test functio…
Figure 13
Figure 13: Boxplots of CRPS across ten replications/folds fo…
Figure 14
Figure 14: Boxplots of CRPS across ten replications in the…
Figure 15
Figure 15: The Pareto plot of Figure 5a is recreated here, with the addition of 16 new LAGP emulators with various hyperparameter values.
Surrounding text: … Gramacy and Apley [2015] for details). The analysis in this section also uses a smaller nugget (specifically, the response variance divided by 10^7) as suggested by the LAGP documentation. For the simulation scenarios considered, this choice has very little impact on the results. …
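
The cumulative rank plots described in the captions of Figures 1, 2, 6, and 7 can be sketched in a few lines of Python. This is an illustrative reconstruction from the caption wording only, not the authors' plotting code: the function name, the input layout (scenario → method → CRPS), and the toy scores are assumptions, and ties are broken by method name for simplicity.

```python
def cumulative_rank_curves(crps):
    """For each method, the share of scenarios in which it ranks r or better.

    crps: dict mapping scenario -> dict mapping method -> CRPS (lower is better).
    Returns a dict mapping method -> list whose entry r-1 is the proportion of
    scenarios where the method ranked in the top r.
    """
    methods = sorted(next(iter(crps.values())))
    n_methods = len(methods)
    counts = {m: [0] * n_methods for m in methods}

    for scores in crps.values():
        # Stable sort: ties are broken alphabetically by method name.
        ordered = sorted(methods, key=lambda m: scores[m])
        for position, method in enumerate(ordered):  # position 0 is the best
            for r in range(position, n_methods):
                counts[method][r] += 1

    n_scenarios = len(crps)
    return {m: [c / n_scenarios for c in counts[m]] for m in methods}


# Toy scores for three placeholder methods on four placeholder scenarios.
toy_crps = {
    "f1": {"a": 0.1, "b": 0.2, "c": 0.3},
    "f2": {"a": 0.2, "b": 0.1, "c": 0.3},
    "f3": {"a": 0.1, "b": 0.3, "c": 0.2},
    "f4": {"a": 0.3, "b": 0.1, "c": 0.2},
}
curves = cumulative_rank_curves(toy_crps)
```

Each curve is non-decreasing and ends at 1.0; a method whose curve rises early is frequently among the best, which is how the plots are read.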
original abstract

Accurate and efficient surrogate modeling is essential for modern computational science, and there are a staggering number of emulation methods to choose from. With new methods being developed all the time, comparing the relative strengths and weaknesses of different methods remains a challenge due to inconsistent benchmarking practices and (sometimes) limited reproducibility and transparency. In this work, we present a large-scale, fully reproducible comparison of $29$ distinct emulators across $60$ canonical test functions and $40$ real emulation datasets. To facilitate rigorous, apples-to-apples comparisons, we introduce the R package \texttt{duqling}, which streamlines reproducible simulation studies using a consistent, simple syntax, and automatic internal scaling of inputs. This framework allows researchers to compare emulators in a unified environment and makes it possible to replicate or extend previous studies with minimal effort, even across different publications. Our results provide detailed empirical insight into the strengths and weaknesses of state-of-the-art emulators and offer guidance for both method developers and practitioners selecting a surrogate for new data. We discuss best practices for emulator comparison and highlight how \texttt{duqling} can accelerate research in emulator design and application.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a large-scale, fully reproducible empirical comparison of 29 distinct emulator methods across 60 canonical test functions and 40 real-world emulation datasets. It introduces the duqling R package to enforce consistent syntax, automatic input scaling, and apples-to-apples evaluation, with the goal of providing detailed insights into emulator strengths/weaknesses and practical guidance for surrogate selection.

Significance. If the reported performance patterns hold under broader conditions, the work supplies a valuable, transparent benchmark that could reduce ad-hoc comparisons in surrogate modeling and accelerate both method development and practitioner choices. The reproducibility infrastructure (duqling) is a concrete strength that directly addresses the field's documented inconsistencies in benchmarking.

major comments (2)
  1. [Section 4] Section 4 (Results): The guidance claim that the rankings 'offer guidance for both method developers and practitioners selecting a surrogate for new data' is load-bearing on the representativeness of the 60+40 suite. No coverage argument, sensitivity analysis to suite composition, or stratification by problem characteristics (input dimension, smoothness, stationarity, noise) is presented; without this, the stability of relative rankings beyond the chosen collection remains unverified.
  2. [Section 3.2] Section 3.2 (Experimental design): The handling of emulator-specific hyperparameters is described at a high level but lacks explicit documentation of the tuning protocol (e.g., cross-validation folds, optimization budget, or default settings) applied uniformly across all 29 methods; this detail is necessary to confirm the 'apples-to-apples' claim.
minor comments (2)
  1. [Figure 3] Figure 3 and Table 4: axis labels and legend entries use inconsistent abbreviation conventions for emulator names; a single glossary table would improve readability.
  2. [Abstract] Abstract: the phrase 'some are more useful than others' is vague; a single sentence summarizing the top-performing class (e.g., Gaussian processes vs. neural nets) would strengthen the take-home message.
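
The stratification asked for in major comment 1 could be approached along the following lines. This is a hedged sketch, not an analysis from the paper: the record layout, the `mean_rank_by_stratum` helper, the low/high dimension cut-off of 5, and all scores are hypothetical placeholders.

```python
from collections import defaultdict


def mean_rank_by_stratum(records, stratum_of):
    """Average within-scenario rank of each method, grouped by stratum.

    records: list of dicts with keys "scenario", "method", "crps", plus any
             scenario descriptors that stratum_of needs (e.g. "dim", "nsr").
    stratum_of: function mapping a record to a hashable stratum label.
    """
    by_scenario = defaultdict(list)
    for rec in records:
        by_scenario[rec["scenario"]].append(rec)

    rank_sums = defaultdict(lambda: defaultdict(float))
    scenario_counts = defaultdict(int)
    for recs in by_scenario.values():
        stratum = stratum_of(recs[0])
        scenario_counts[stratum] += 1
        ranked = sorted(recs, key=lambda r: r["crps"])  # rank 1 = lowest CRPS
        for rank, rec in enumerate(ranked, start=1):
            rank_sums[stratum][rec["method"]] += rank

    return {s: {m: total / scenario_counts[s] for m, total in sums.items()}
            for s, sums in rank_sums.items()}


def dimension_and_noise(rec):
    # The cut-off of 5 inputs is an arbitrary illustrative choice.
    return ("low-dim" if rec["dim"] <= 5 else "high-dim", rec["nsr"])


# Placeholder records: two methods on four scenarios.
toy_records = [
    {"scenario": "s1", "dim": 2, "nsr": 0.0, "method": "a", "crps": 0.1},
    {"scenario": "s1", "dim": 2, "nsr": 0.0, "method": "b", "crps": 0.2},
    {"scenario": "s2", "dim": 3, "nsr": 0.0, "method": "a", "crps": 0.1},
    {"scenario": "s2", "dim": 3, "nsr": 0.0, "method": "b", "crps": 0.3},
    {"scenario": "s3", "dim": 20, "nsr": 0.0, "method": "a", "crps": 0.5},
    {"scenario": "s3", "dim": 20, "nsr": 0.0, "method": "b", "crps": 0.2},
    {"scenario": "s4", "dim": 30, "nsr": 0.0, "method": "a", "crps": 0.1},
    {"scenario": "s4", "dim": 30, "nsr": 0.0, "method": "b", "crps": 0.2},
]
stratified = mean_rank_by_stratum(toy_records, dimension_and_noise)
```

If mean ranks differ sharply between strata, as they do in this toy data, a single pooled ranking hides exactly the structure the referee is asking about.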

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the reproducibility infrastructure, and recommendation for minor revision. Their comments are constructive and help strengthen the manuscript's claims. We address each major comment below, indicating the revisions we will incorporate.

point-by-point responses
  1. Referee: [Section 4] Section 4 (Results): The guidance claim that the rankings 'offer guidance for both method developers and practitioners selecting a surrogate for new data' is load-bearing on the representativeness of the 60+40 suite. No coverage argument, sensitivity analysis to suite composition, or stratification by problem characteristics (input dimension, smoothness, stationarity, noise) is presented; without this, the stability of relative rankings beyond the chosen collection remains unverified.

    Authors: We agree that a formal sensitivity analysis would further support the generalizability of the guidance. The 60 canonical functions are drawn from established benchmarks in the surrogate modeling literature, while the 40 real-world datasets span input dimensions 2–50, varying smoothness, stationarity, and noise levels. In revision we will add a stratification of results by input dimension and noise level in Section 4, plus an explicit limitations paragraph in the discussion acknowledging that the current suite, though diverse, does not exhaustively cover all possible problem classes. This provides partial but concrete strengthening without a full re-analysis of alternative suites. revision: partial

  2. Referee: [Section 3.2] Section 3.2 (Experimental design): The handling of emulator-specific hyperparameters is described at a high level but lacks explicit documentation of the tuning protocol (e.g., cross-validation folds, optimization budget, or default settings) applied uniformly across all 29 methods; this detail is necessary to confirm the 'apples-to-apples' claim.

    Authors: We appreciate the request for explicit documentation. In the revised manuscript we will expand Section 3.2 with a precise description of the uniform tuning protocol: 5-fold cross-validation where supported, a fixed budget of 50 hyperparameter evaluations for optimization-based methods, and package defaults otherwise. A supplementary table will enumerate the exact procedure for each of the 29 emulators. These additions will be included in the next version. revision: yes
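
The tuning protocol named in the second response (5-fold cross-validation with a fixed budget of 50 hyperparameter evaluations) can be sketched as a random search. Everything else here is an assumption made for illustration: the toy Gaussian kernel smoother, the log-uniform search range for its bandwidth, the random-search strategy itself, and the synthetic sine data are not taken from the paper.

```python
import math
import random


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and split into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def kernel_predict(x_train, y_train, x, bandwidth):
    """Toy emulator: Gaussian kernel-weighted average of training responses."""
    weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in x_train]
    total = sum(weights)
    if total == 0.0:  # all weights underflowed; fall back to the mean
        return sum(y_train) / len(y_train)
    return sum(w * yi for w, yi in zip(weights, y_train)) / total


def cv_error(x, y, bandwidth, folds):
    """Mean squared held-out error over the given folds."""
    sq_err = 0.0
    for fold in folds:
        held = set(fold)
        x_tr = [x[i] for i in range(len(x)) if i not in held]
        y_tr = [y[i] for i in range(len(y)) if i not in held]
        for i in fold:
            sq_err += (kernel_predict(x_tr, y_tr, x[i], bandwidth) - y[i]) ** 2
    return sq_err / len(x)


def tune(x, y, budget=50, k=5, seed=0):
    """Random search: `budget` bandwidth draws, each scored by k-fold CV."""
    rng = random.Random(seed)
    folds = kfold_indices(len(x), k, rng)  # same folds for every candidate
    trials = []
    for _ in range(budget):
        bandwidth = 10 ** rng.uniform(-2, 0)  # log-uniform on [0.01, 1]
        trials.append((cv_error(x, y, bandwidth, folds), bandwidth))
    best_error, best_bandwidth = min(trials)
    return best_error, best_bandwidth, trials


# Synthetic, noise-free training data for illustration.
x = [i / 39 for i in range(40)]
y = [math.sin(2 * math.pi * xi) for xi in x]
best_error, best_bandwidth, trials = tune(x, y)
```

Fixing the folds and the evaluation budget across methods is what keeps the tuning effort comparable; methods without a tunable parameter would simply skip this step and use package defaults, as the response describes.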

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with no derivations or self-referential predictions

full rationale

The paper conducts a direct, reproducible comparison of 29 emulators on 60 canonical test functions and 40 real datasets via the duqling package. No equations, fitted parameters, uniqueness theorems, or predictions are present that reduce by construction to the paper's own inputs or prior self-citations. Performance metrics are computed from fresh runs on the chosen suite; rankings and guidance follow from those measurements without any load-bearing self-referential step. This is the standard honest outcome for an empirical benchmarking study.
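
The figure captions report performance as CRPS (continuous ranked probability score). As a reference point, the sketch below evaluates the standard closed form of CRPS for a Gaussian predictive distribution and averages it over a test set. Whether each of the 29 emulators is scored through a Gaussian predictive or through posterior samples is not stated on this page, so the Gaussian assumption and the placeholder predictions are illustrative only.

```python
import math


def crps_gaussian(mu, sigma, y):
    """CRPS of a N(mu, sigma^2) predictive distribution at observation y.

    Closed form: sigma * ( z * (2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi) ),
    with z = (y - mu) / sigma. Lower is better.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))


def mean_crps(predictions, observations):
    """Average CRPS over paired (mu, sigma) predictions and observations."""
    scores = [crps_gaussian(mu, sigma, y)
              for (mu, sigma), y in zip(predictions, observations)]
    return sum(scores) / len(scores)


# Placeholder out-of-sample predictions (mean, standard deviation) and truths.
toy_predictions = [(0.0, 1.0), (1.0, 0.5), (2.0, 2.0)]
toy_observations = [0.0, 1.2, 1.0]
toy_score = mean_crps(toy_predictions, toy_observations)
```

CRPS rewards predictive distributions that are both centered on the truth and appropriately narrow, which is why it suits a comparison of emulators that report uncertainty.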

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking paper with no mathematical derivations, fitted parameters, or new postulated entities. The central contribution is the comparison framework and results.

pith-pipeline@v0.9.0 · 5524 in / 1054 out tokens · 25747 ms · 2026-05-16T23:03:34.215474+00:00 · methodology

