All Emulators are Wrong, Many are Useful, and Some are More Useful Than Others: A Reproducible Comparison of Computer Model Surrogates

Devin Francom; Graham C. Gibson; Kellin N. Rumsey; Reid Morris

arxiv: 2512.09060 · v2 · submitted 2025-12-09 · 📊 stat.CO · stat.ML

Pith reviewed 2026-05-16 23:03 UTC · model grok-4.3

keywords emulators · surrogate modeling · benchmarking · reproducibility · computer experiments · R package · statistical emulation · simulation studies

The pith

A reproducible comparison of 29 emulators shows clear differences in usefulness across test functions and real datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes performance patterns among state-of-the-art emulators by running all of them under identical conditions on 60 canonical test functions plus 40 real emulation datasets. It introduces the duqling R package to enforce consistent syntax, automatic scaling, and full reproducibility so that rankings and diagnostics can be trusted and extended. A sympathetic reader cares because surrogate models stand in for expensive computer simulations in science and engineering, and the wrong choice wastes resources while the right one improves accuracy and speed. The results give concrete guidance on when particular methods excel instead of declaring any universal winner.

Core claim

By applying the duqling framework to standardize inputs, outputs, and evaluation, the study produces detailed empirical profiles of 29 emulators that reveal systematic strengths and weaknesses, allowing practitioners to match methods to problem type rather than relying on general claims.

What carries the argument

The duqling R package, which supplies unified syntax and automatic internal scaling for running reproducible emulator comparison studies.
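To make "automatic internal scaling" concrete, here is a minimal sketch in Python. It assumes that scaling means mapping each input from known box bounds onto the unit hypercube before an emulator is fit. The function names `to_unit_cube` and `from_unit_cube` are hypothetical and are not part of duqling's interface; the package's actual mechanism is not described on this page.

```python
def to_unit_cube(rows, lower, upper):
    """Map each row from the box [lower, upper] to [0, 1]^d (assumed scaling rule)."""
    for lo, hi in zip(lower, upper):
        if hi <= lo:
            raise ValueError("each upper bound must exceed its lower bound")
    return [[(x - lo) / (hi - lo) for x, lo, hi in zip(row, lower, upper)]
            for row in rows]


def from_unit_cube(rows, lower, upper):
    """Inverse map from [0, 1]^d back to the native input ranges."""
    return [[lo + u * (hi - lo) for u, lo, hi in zip(row, lower, upper)]
            for row in rows]


# Hypothetical two-input problem with native ranges [0, 10] and [10, 30].
X = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
U = to_unit_cube(X, lower=[0.0, 10.0], upper=[10.0, 30.0])
```

Doing this once, inside the comparison framework, means every emulator sees inputs on the same scale regardless of the units of the original simulator.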

Load-bearing premise

The 60 test functions and 40 real datasets capture enough variety that the observed performance differences generalize to other emulation problems.

What would settle it

A new collection of test problems and datasets, chosen independently, that produces substantially different relative rankings among the same 29 emulators.
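One way to quantify "substantially different relative rankings" is a rank correlation between the per-emulator scores obtained on the two suites. The sketch below computes a Spearman correlation in plain Python. The choice of Spearman correlation and the function names are assumptions made for illustration, not a procedure taken from the paper.

```python
def average_ranks(scores):
    """Rank scores ascending (1 = best, lower is better); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(scores_a, scores_b):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if every score in one vector is tied.
    """
    ra, rb = average_ranks(scores_a), average_ranks(scores_b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5
```

Here `scores_a` and `scores_b` would hold one summary score per emulator (for example, mean CRPS) on the original and the independently chosen suite. A value near 1 indicates the ordering carried over; a value near 0 or below indicates it did not.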

Figures

Figures reproduced from arXiv: 2512.09060 by Devin Francom, Graham C. Gibson, Kellin N. Rumsey, Reid Morris.

**Figure 1.** Cumulative rank-plot for n = 1000, NSR = 0 setting. The curve for each emulator shows the proportion of cases that the method was at least top r out of 29 in terms of CRPS, for r = 1, …, 29. Axes: rank or better (x), % of scenarios (y). Legend (truncated in source): bppr, tbass, bass, qbass, rgasp, bart, hetgp, gp, svecgp, fitcgp, bcmgp, apce, deepgp, rvm, btreelm, bootrf, rffgp, lagp, confrf, ngboost, spce, bnn, bcart, blm, alcgp, mpgp…
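The curves described in this caption can be computed from a table of per-scenario CRPS values. The following is a small sketch under stated assumptions: every scenario reports a score for every method, lower CRPS is better, and ties are broken alphabetically. The function name and the data layout are illustrative only.

```python
def cumulative_rank_curves(crps):
    """crps: {scenario: {method: CRPS}}; lower is better.

    Returns {method: [share of scenarios with rank <= r, for r = 1..M]}.
    Ties are broken by alphabetical method order (an assumption).
    """
    scenarios = list(crps)
    methods = sorted(crps[scenarios[0]])
    m = len(methods)
    counts = {name: [0] * m for name in methods}
    for s in scenarios:
        ordered = sorted(methods, key=lambda name: crps[s][name])
        for position, name in enumerate(ordered):  # position 0 is the best method
            for r in range(position, m):
                counts[name][r] += 1
    return {name: [c / len(scenarios) for c in counts[name]] for name in methods}


# Toy input with hypothetical method names and made-up scores.
toy = {
    "f1": {"a": 0.1, "b": 0.2, "c": 0.3},
    "f2": {"a": 0.2, "b": 0.1, "c": 0.3},
}
curves = cumulative_rank_curves(toy)
```

A method whose curve rises quickly is near the top in most scenarios; a curve that only reaches 100% at the largest r belongs to a method that is sometimes last.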

**Figure 2.** Cumulative rank-plot for n = 1000, NSR = 0.1 setting. The curve for each emulator shows the proportion of cases that the method was at least top r out of 29 in terms of CRPS, for r = 1, …, 29.

Adjacent text in the paper (Section 4, Results): In this section, we compare the results of 29 emulators across 60 synthetic test functions and 40 real-world simulation datasets. Given the scope of the study, the analysis here is necessarily high-lev…
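CRPS (continuous ranked probability score) is the metric behind these rank plots. As a reminder of what it measures, here is a sketch of two standard ways to evaluate it for a single observation: the closed form for a Gaussian predictive distribution and the sample-based estimator E|X − y| − ½·E|X − X′|. How the paper computes CRPS for each emulator is not stated on this page, so treat this as background rather than the authors' procedure.

```python
import math


def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a N(mu, sigma^2) prediction at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))


def crps_samples(y, samples):
    """Sample-based CRPS estimate: mean |x - y| minus half the mean |x - x'|."""
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2
```

Lower is better. When the predictive distribution collapses to a point, the sample-based form reduces to absolute error, which is why CRPS can compare emulators that do and do not report uncertainty.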

**Figure 3.** A heatmap of CRPS (truncated between 0.001 and 1.0 for visual simplicity) for the synthetic test functions in the n = 1000, NSR = 0 setting.

Adjacent text in the paper: …we demonstrate how duqling can be used to conduct more granular, targeted investigations. All performance metrics are computed on out-of-sample test sets. For synthetic test functions, we use maximin Latin hypercube designs with n_test = 1000 [McKay et al., 1979]. For…
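The excerpt mentions maximin Latin hypercube designs. A rough sketch of the idea follows: generate several random Latin hypercubes and keep the one whose closest pair of points is farthest apart. This random-search version is an assumption chosen for brevity; it is not an optimized maximin design and is not claimed to be the routine used in the paper.

```python
import random


def latin_hypercube(n, d, rng):
    """One random Latin hypercube: each column has exactly one point per 1/n stratum."""
    columns = []
    for _ in range(d):
        strata = list(range(n))
        rng.shuffle(strata)
        columns.append([(k + rng.random()) / n for k in strata])
    return [[columns[j][i] for j in range(d)] for i in range(n)]


def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two design points."""
    best = float("inf")
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            best = min(best, dist)
    return best


def maximin_lhs(n, d, candidates=50, seed=0):
    """Keep the candidate design with the largest minimum pairwise distance."""
    rng = random.Random(seed)
    designs = [latin_hypercube(n, d, rng) for _ in range(candidates)]
    return max(designs, key=min_pairwise_distance)


design = maximin_lhs(n=10, d=2)
```

The pairwise-distance check is quadratic in n, so for a design of 1000 points a dedicated design library would be the practical choice.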

**Figure 4.** A heatmap of CRPS (truncated between 0.001 and 1.0 for visual simplicity) for the synthetic test functions in the n = 1000, NSR = 0.1 setting.

Adjacent text in the paper (list of simulation settings, truncated in source):

• n = 1000, NSR = 0.1, 10 replications, 60 test functions. Excluded: TREEGP.
• n = 500, NSR = 0, 10 replications, 60 test functions. All emulators.
• n = 500, NSR = 0.1, 10 replications, 60 test functions. All emulators (supplement only).
• n = 5000, NSR = 0, 10 rep…

**Figure 5.** Pareto frontier of emulators based on average rela… (caption truncated in source)
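For readers unfamiliar with Pareto frontiers, the sketch below identifies the non-dominated methods given two criteria that are both minimized. The caption is truncated, so the two criteria here are placeholders labelled `error` and `runtime`; the method names and numbers are made up for illustration.

```python
def pareto_frontier(points):
    """points: {name: (error, runtime)}, both criteria to be minimized.

    A method is on the frontier if no other method is at least as good on
    both criteria and strictly better on at least one.
    """
    frontier = []
    for name, (err, time) in points.items():
        dominated = any(
            (e <= err and t <= time) and (e < err or t < time)
            for other, (e, t) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


# Hypothetical methods: (error, runtime in seconds).
toy = {"A": (0.10, 100.0), "B": (0.20, 10.0), "C": (0.30, 50.0), "D": (0.15, 100.0)}
front = pareto_frontier(toy)
```

In this toy input, C is dominated by B and D is dominated by A, so only A and B remain: one is the most accurate, the other the fastest.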

**Figure 6.** Cumulative rank-plot for n = 500, NSR = 0 setting. The curve for each emulator shows the proportion of cases that the method was at least top r out of 29 in terms of CRPS, for r = 1, …, 29. Axes: rank or better (x), % of scenarios (y). Legend (truncated in source): svecgp, tbass, bass, bppr, qbass, bcmgp, bart, apce, lagp, spce, confrf, rvm, btreelm, bootrf, fitcgp, mpgp, rffgp, alcgp, bnn, ngboost, bcart, blm, blasso, baselin…

**Figure 7.** Cumulative rank-plot for n = 5000, NSR = 0 setting. The curve for each emulator shows the proportion of cases that the method was at least top r out of 29 in terms of CRPS, for r = 1, …, 29.

Adjacent text in the paper: …noise-free setting. Due to runtime costs and practical relevance, we omit the noisy case here; additional figures for this regime are available in the supplement.

**Figure 8.** Cumulative rank-plot for the real datasets wit… (caption truncated in source)

**Figure 9.** Cumulative rank-plot for the real datasets wit… (caption truncated in source)

**Figure 10.** Pareto frontier of emulators based on average rel… (caption truncated in source)

**Figure 11.** A heatmap of CRPS (truncated between 0.001 and 1.0 for visual simplicity) for all of the real datasets. Grey boxes indicate emulator / dataset pairs that were not attempted for computational purposes.

Adjacent text in the paper (Section 5, Discussion and Analysis; 5.1, There is no free lunch (or best emulator)): In uncertainty quantification, it is common to ask: which emulator is the best? But this is rarely the most productive question to ask.…

**Figure 12.** A heatmap of CRPS (truncated between 0.001 and 1.0 for visual simplicity) for all of the real datasets. Grey boxes indicate emulator / dataset pairs that were not attempted for computational purposes.

Adjacent text in the paper: …catch-all emulator, it was designed to be effective for challenging nonstationary emulation problems where sophisticated tools are required and, in these settings, it does exceptionally well. On test functio…

**Figure 13.** Boxplots of CRPS across ten replications/folds fo… (caption truncated in source)

**Figure 14.** Boxplots of CRPS across ten replications in the… (caption truncated in source)

**Figure 15.** The Pareto plot of Figure 5a is recreated here, with the addition of 16 new LAGP emulators with various hyperparameter values.

Adjacent text in the paper: …Gramacy and Apley [2015] for details). The analysis in this section also uses a smaller nugget (specifically, the response variance divided by $10^7$) as suggested by the LAGP documentation. For the simulation scenarios considered, this choice has very little impact on the results. …

Original abstract

Accurate and efficient surrogate modeling is essential for modern computational science, and there are a staggering number of emulation methods to choose from. With new methods being developed all the time, comparing the relative strengths and weaknesses of different methods remains a challenge due to inconsistent benchmarking practices and (sometimes) limited reproducibility and transparency. In this work, we present a large-scale, fully reproducible comparison of $29$ distinct emulators across $60$ canonical test functions and $40$ real emulation datasets. To facilitate rigorous, apples-to-apples comparisons, we introduce the R package \texttt{duqling}, which streamlines reproducible simulation studies using a consistent, simple syntax, and automatic internal scaling of inputs. This framework allows researchers to compare emulators in a unified environment and makes it possible to replicate or extend previous studies with minimal effort, even across different publications. Our results provide detailed empirical insight into the strengths and weaknesses of state-of-the-art emulators and offer guidance for both method developers and practitioners selecting a surrogate for new data. We discuss best practices for emulator comparison and highlight how \texttt{duqling} can accelerate research in emulator design and application.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The duqling package and the scale of the 29-method benchmark are the useful parts; the performance rankings are tied to one specific collection of problems.

The main thing this paper brings is the duqling R package and the large, unified comparison it makes possible. A single framework with consistent syntax and automatic input scaling for running 29 emulators on 60 test functions plus 40 real datasets is a practical step forward. Most prior work in this area uses smaller or inconsistent test sets, so having one place where people can replicate or extend the same setup saves time and reduces the usual apples-to-oranges issues. The results give a clearer empirical picture of where different methods tend to do better or worse, which is the kind of reference that both method developers and people picking a surrogate for new work can use. The reproducibility emphasis and the discussion of best practices are also straightforward wins.

The soft spot is exactly the one the stress test flags. The rankings and the guidance for practitioners rest on how well those 60 functions and 40 datasets cover the space of real emulation problems. Without a coverage argument or checks on how the orderings shift when you change the suite, it is hard to know how much the results would move on a different collection of problems with higher dimensions, rougher surfaces, or different noise. That does not make the benchmark useless, but it does limit how far the specific recommendations travel.

This is for computational statisticians and engineers who build or use surrogates and want a shared reference point or a tool to run their own studies. The package alone gives it enough substance to deserve peer review, even if the generalizability section needs more work in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a large-scale, fully reproducible empirical comparison of 29 distinct emulator methods across 60 canonical test functions and 40 real-world emulation datasets. It introduces the duqling R package to enforce consistent syntax, automatic input scaling, and apples-to-apples evaluation, with the goal of providing detailed insights into emulator strengths/weaknesses and practical guidance for surrogate selection.

Significance. If the reported performance patterns hold under broader conditions, the work supplies a valuable, transparent benchmark that could reduce ad-hoc comparisons in surrogate modeling and accelerate both method development and practitioner choices. The reproducibility infrastructure (duqling) is a concrete strength that directly addresses the field's documented inconsistencies in benchmarking.

major comments (2)

Section 4 (Results): The guidance claim that the rankings 'offer guidance for both method developers and practitioners selecting a surrogate for new data' is load-bearing on the representativeness of the 60+40 suite. No coverage argument, sensitivity analysis to suite composition, or stratification by problem characteristics (input dimension, smoothness, stationarity, noise) is presented; without this, the stability of relative rankings beyond the chosen collection remains unverified.
Section 3.2 (Experimental design): The handling of emulator-specific hyperparameters is described at a high level but lacks explicit documentation of the tuning protocol (e.g., cross-validation folds, optimization budget, or default settings) applied uniformly across all 29 methods; this detail is necessary to confirm the 'apples-to-apples' claim.
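The sensitivity analysis to suite composition raised in the Section 4 comment could take several forms. One simple option, sketched here as an assumption rather than a method from the paper, is to bootstrap the test problems: resample scenarios with replacement, recompute each method's mean rank, and record how often each method comes out on top.

```python
import random


def mean_ranks(crps, scenarios):
    """Average rank of each method over the given scenarios (1 = best, lower CRPS wins)."""
    methods = sorted(crps[scenarios[0]])
    totals = {m: 0.0 for m in methods}
    for s in scenarios:
        ordered = sorted(methods, key=lambda m: crps[s][m])
        for position, m in enumerate(ordered):
            totals[m] += position + 1
    return {m: totals[m] / len(scenarios) for m in methods}


def bootstrap_win_share(crps, draws=1000, seed=0):
    """Share of bootstrap resamples of the suite in which each method has the best mean rank."""
    rng = random.Random(seed)
    names = list(crps)
    wins = {m: 0 for m in crps[names[0]]}
    for _ in range(draws):
        resampled = [rng.choice(names) for _ in names]
        ranks = mean_ranks(crps, resampled)
        wins[min(ranks, key=ranks.get)] += 1
    return {m: w / draws for m, w in wins.items()}
```

If the winning share is spread across several methods, the headline ordering depends heavily on which problems happen to be in the suite. Note that this only probes variation within the existing collection; it cannot show how rankings would behave on problem types the suite omits.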

minor comments (2)

Figure 3 and Table 4: axis labels and legend entries use inconsistent abbreviation conventions for emulator names; a single glossary table would improve readability.
Abstract: the phrase 'some are more useful than others' is vague; a single sentence summarizing the top-performing class (e.g., Gaussian processes vs. neural nets) would strengthen the take-home message.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the reproducibility infrastructure, and recommendation for minor revision. Their comments are constructive and help strengthen the manuscript's claims. We address each major comment below, indicating the revisions we will incorporate.

Referee: Section 4 (Results): The guidance claim that the rankings 'offer guidance for both method developers and practitioners selecting a surrogate for new data' is load-bearing on the representativeness of the 60+40 suite. No coverage argument, sensitivity analysis to suite composition, or stratification by problem characteristics (input dimension, smoothness, stationarity, noise) is presented; without this, the stability of relative rankings beyond the chosen collection remains unverified.

Authors: We agree that a formal sensitivity analysis would further support the generalizability of the guidance. The 60 canonical functions are drawn from established benchmarks in the surrogate modeling literature, while the 40 real-world datasets span input dimensions 2–50, varying smoothness, stationarity, and noise levels. In revision we will add a stratification of results by input dimension and noise level in Section 4, plus an explicit limitations paragraph in the discussion acknowledging that the current suite, though diverse, does not exhaustively cover all possible problem classes. This provides partial but concrete strengthening without a full re-analysis of alternative suites.

Revision: partial

Referee: Section 3.2 (Experimental design): The handling of emulator-specific hyperparameters is described at a high level but lacks explicit documentation of the tuning protocol (e.g., cross-validation folds, optimization budget, or default settings) applied uniformly across all 29 methods; this detail is necessary to confirm the 'apples-to-apples' claim.

Authors: We appreciate the request for explicit documentation. In the revised manuscript we will expand Section 3.2 with a precise description of the uniform tuning protocol: 5-fold cross-validation where supported, a fixed budget of 50 hyperparameter evaluations for optimization-based methods, and package defaults otherwise. A supplementary table will enumerate the exact procedure for each of the 29 emulators. These additions will be included in the next version.

Revision: yes
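To illustrate what a uniform protocol of "5-fold cross-validation with a fixed budget of 50 hyperparameter evaluations" might look like, here is a self-contained sketch. Everything beyond those two numbers is an assumption: the toy one-dimensional kernel smoother stands in for an emulator, the folds are interleaved rather than random, and the search is a log-uniform random search over a single bandwidth.

```python
import math
import random


def kernel_smoother(x_train, y_train, bandwidth):
    """Toy stand-in for an emulator: Gaussian-kernel weighted average."""
    def predict(x):
        weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in x_train]
        total = sum(weights)
        if total == 0.0:  # all weights underflowed; fall back to the training mean
            return sum(y_train) / len(y_train)
        return sum(w * yi for w, yi in zip(weights, y_train)) / total
    return predict


def cv_error(x, y, bandwidth, folds=5):
    """Mean squared error averaged over interleaved held-out folds."""
    n = len(x)
    errors = []
    for k in range(folds):
        test = [i for i in range(n) if i % folds == k]
        train = [i for i in range(n) if i % folds != k]
        predict = kernel_smoother([x[i] for i in train], [y[i] for i in train], bandwidth)
        errors.append(sum((predict(x[i]) - y[i]) ** 2 for i in test) / len(test))
    return sum(errors) / folds


def tune(x, y, budget=50, seed=0):
    """Random search over log-uniform bandwidths in [0.01, 1] with a fixed budget."""
    rng = random.Random(seed)
    trials = [10 ** rng.uniform(-2, 0) for _ in range(budget)]
    return min((cv_error(x, y, h), h) for h in trials)  # (best CV error, bandwidth)


# Toy data: 50 noise-free evaluations of sin(2*pi*x) on [0, 1].
x = [i / 49 for i in range(50)]
y = [math.sin(2 * math.pi * xi) for xi in x]
best_error, best_bandwidth = tune(x, y)
```

Fixing both the number of folds and the evaluation budget is what makes the comparison even-handed: no method gets more tuning effort than another.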

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with no derivations or self-referential predictions

The paper conducts a direct, reproducible comparison of 29 emulators on 60 canonical test functions and 40 real datasets via the duqling package. No equations, fitted parameters, uniqueness theorems, or predictions are present that reduce by construction to the paper's own inputs or prior self-citations. Performance metrics are computed from fresh runs on the chosen suite; rankings and guidance follow from those measurements without any load-bearing self-referential step. This is the standard honest outcome for an empirical benchmarking study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking paper with no mathematical derivations, fitted parameters, or new postulated entities. The central contribution is the comparison framework and results.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 1 internal anchor

[1] E. Baker, P. Barbillon, A. Fadikar, R. B. Gramacy, R. Herbei, D. Higdon, J. Huang, L. R. Johnson, A. Mondal, B. Pires, et al. Stochastic simulators: An overview with opportunities. arXiv preprint arXiv:2002.01321, 2020a. E. Baker, P. Challenor, and M. Eames. Predicting the output from a stochastic computer model when a deterministic approximation is a...

[2] M. J. Brown, A. Gowardhan, M. Nelson, M. Williams, and E. R. Pardyjak. Evaluation of the quic wind and dispersion models using the joint urban 2003 field experiment dataset. In AMS 8th Symp. Urban Env, Phoenix, USA, pages 10–16,

[3] M. Gu, J. Palomo, and J. O. Berger. Robustgasp: Robust gaussian stochastic process emulation in r. arXiv preprint arXiv:1801.01874,

[4] A. Hlobilová, S. Marelli, and B. Sudret. Surrogate modeling benchmark - two-dimensional heat diffusion model. Sept. 2024a. doi: 10.5281/zenodo.12701147. URL https://doi.org/10.5281/zenodo.12701147. A. Hlobilová, S. Marelli, and B. Sudret. Surrogate modeling benchmark - one-dimensional diffusion model. Sept. 2024b. doi: 10.5281/zenodo.12704504. U...

[5] doi: 10.25583/1670467. URL https://doi.org/10.25583/1670467. E. Laloy and D. Jacques. Emulation of CPU-demanding reactive transport models: a comparison of Gaussian processes, polynomial chaos expansion, and deep neural networks. Computational Geosciences, 23(5):1193–1215,

[6] C. Rudin, C. Zhong, L. Semenova, M. Seltzer, R. Parr, J. Liu, S. Katta, J. Donnelly, H. Chen, and Z. Boner. Amazing things come from having many good models. arXiv preprint arXiv:2407.04846,

[7] K. N. Rumsey, D. Francom, and S. A. Vander Wiel. Discovering active subspaces for high-dimensional computer models. arXiv preprint arXiv:2307.11241, 2023a. K. N. Rumsey, G. Huerta, and J. D. Tucker. A localized ensemble of approximate gaussian processes for fast sequential emulation. Stat, 12(1):e576, 2023b. K. N. Rumsey, D. Francom, and A. Shen. G...

[8] A. Sauer, A. Cooper, and R. B. Gramacy. Non-stationary Gaussian process surrogates. arXiv preprint arXiv:2305.19242, 2023a. A. Sauer, A. Cooper, and R. B. Gramacy. Vecchia-approximated deep Gaussian processes for computer experiments. Journal of Computational and Graphical Statistics, 32(3):824–837, 2023b. A. Sauer, R. B. Gramacy, and D. Higdon. Activ...

[9] B. Sudret, S. Marelli, and J. Wiart. Surrogate models for uncertainty quantification: An overview. In 2017 11th European conference on antennas and propagation (EUCAP), pages 793–797. IEEE,

[10] S. Surjanovic and D. Bingham. Virtual library of simulation experiments: test functions and datasets. Simon Fraser University, Burnaby, BC, Canada, accessed May, 13:2015,
