All Emulators are Wrong, Many are Useful, and Some are More Useful Than Others: A Reproducible Comparison of Computer Model Surrogates
Pith reviewed 2026-05-16 23:03 UTC · model grok-4.3
The pith
A reproducible comparison of 29 emulators shows clear differences in usefulness across test functions and real datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying the duqling framework to standardize inputs, outputs, and evaluation, the study produces detailed empirical profiles of 29 emulators that reveal systematic strengths and weaknesses, allowing practitioners to match methods to problem type rather than relying on general claims.
What carries the argument
The duqling R package, which supplies unified syntax and automatic internal scaling for running reproducible emulator comparison studies.
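As a rough illustration of what "automatic internal scaling" can mean, the sketch below maps each input column to the unit interval before fitting. It is a generic min-max scaler written in Python, not duqling's actual implementation or API; the function names `fit_unit_scaler` and `apply_unit_scaler` and the toy data are hypothetical.

```python
def fit_unit_scaler(X):
    """Return one (min, range) pair per input column of X (a list of rows)."""
    params = []
    for column in zip(*X):
        lo, hi = min(column), max(column)
        # Assumption: a constant column gets range 1.0 so it maps to 0.
        params.append((lo, (hi - lo) or 1.0))
    return params


def apply_unit_scaler(X, params):
    """Map every row of X into [0, 1]^d using the fitted (min, range) pairs."""
    return [[(x - lo) / rng for x, (lo, rng) in zip(row, params)] for row in X]


# Toy inputs on very different scales (hypothetical values).
X_train = [[0.0, 100.0], [5.0, 300.0], [10.0, 200.0]]
params = fit_unit_scaler(X_train)
X_scaled = apply_unit_scaler(X_train, params)
print(X_scaled)
```

Fitting the scaler on the training inputs and reusing the same parameters for test inputs keeps every emulator working on the same unit hypercube, which is one way to make comparisons less sensitive to the original units.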
Load-bearing premise
The 60 test functions and 40 real datasets capture enough variety that the observed performance differences generalize to other emulation problems.
What would settle it
A new collection of test problems and datasets, chosen independently, that produces substantially different relative rankings among the same 29 emulators.
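One way to quantify "substantially different relative rankings" is a rank correlation between the emulator orderings obtained on the two suites. The sketch below computes Spearman's rho with the Python standard library; the emulator names and error values are made-up placeholders, not results from the paper.

```python
from statistics import mean


def average_ranks(scores):
    """Rank scores ascending (1 = smallest), averaging ranks across ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is entirely tied.
    """
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical mean errors for four placeholder emulators on two suites.
original_suite = {"emu_a": 0.10, "emu_b": 0.25, "emu_c": 0.40, "emu_d": 0.55}
independent_suite = {"emu_a": 0.12, "emu_b": 0.50, "emu_c": 0.30, "emu_d": 0.60}

names = sorted(original_suite)
rho = spearman([original_suite[n] for n in names],
               [independent_suite[n] for n in names])
print(round(rho, 3))
```

A rho near 1 would suggest the rankings carry over to the new suite, while a value near 0 or below would support the objection. What counts as "substantially different" is a judgment call that this sketch does not settle.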
Original abstract
Accurate and efficient surrogate modeling is essential for modern computational science, and there are a staggering number of emulation methods to choose from. With new methods being developed all the time, comparing the relative strengths and weaknesses of different methods remains a challenge due to inconsistent benchmarking practices and (sometimes) limited reproducibility and transparency. In this work, we present a large-scale, fully reproducible comparison of $29$ distinct emulators across $60$ canonical test functions and $40$ real emulation datasets. To facilitate rigorous, apples-to-apples comparisons, we introduce the R package \texttt{duqling}, which streamlines reproducible simulation studies using a consistent, simple syntax, and automatic internal scaling of inputs. This framework allows researchers to compare emulators in a unified environment and makes it possible to replicate or extend previous studies with minimal effort, even across different publications. Our results provide detailed empirical insight into the strengths and weaknesses of state-of-the-art emulators and offer guidance for both method developers and practitioners selecting a surrogate for new data. We discuss best practices for emulator comparison and highlight how \texttt{duqling} can accelerate research in emulator design and application.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a large-scale, fully reproducible empirical comparison of 29 distinct emulator methods across 60 canonical test functions and 40 real-world emulation datasets. It introduces the duqling R package to enforce consistent syntax, automatic input scaling, and apples-to-apples evaluation, with the goal of providing detailed insights into emulator strengths/weaknesses and practical guidance for surrogate selection.
Significance. If the reported performance patterns hold under broader conditions, the work supplies a valuable, transparent benchmark that could reduce ad-hoc comparisons in surrogate modeling and accelerate both method development and practitioner choices. The reproducibility infrastructure (duqling) is a concrete strength that directly addresses the field's documented inconsistencies in benchmarking.
major comments (2)
- [Section 4] Section 4 (Results): The guidance claim that the rankings 'offer guidance for both method developers and practitioners selecting a surrogate for new data' is load-bearing on the representativeness of the 60+40 suite. No coverage argument, sensitivity analysis to suite composition, or stratification by problem characteristics (input dimension, smoothness, stationarity, noise) is presented; without this, the stability of relative rankings beyond the chosen collection remains unverified.
- [Section 3.2] Section 3.2 (Experimental design): The handling of emulator-specific hyperparameters is described at a high level but lacks explicit documentation of the tuning protocol (e.g., cross-validation folds, optimization budget, or default settings) applied uniformly across all 29 methods; this detail is necessary to confirm the 'apples-to-apples' claim.
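The stratification requested in the first major comment could look roughly like the following: group per-problem errors by a problem characteristic and check whether the preferred emulator changes between strata. All records, emulator names, and the dimension threshold of 10 are hypothetical and chosen only to show the mechanics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (problem, input_dimension, emulator, error).
results = [
    ("p1", 2, "emu_a", 0.10), ("p1", 2, "emu_b", 0.20),
    ("p2", 3, "emu_a", 0.15), ("p2", 3, "emu_b", 0.30),
    ("p3", 20, "emu_a", 0.50), ("p3", 20, "emu_b", 0.25),
    ("p4", 40, "emu_a", 0.60), ("p4", 40, "emu_b", 0.35),
]


def dimension_stratum(dim, threshold=10):
    """Assign a problem to a coarse stratum by input dimension (assumed cut)."""
    return "low-dim" if dim < threshold else "high-dim"


def mean_error_by_stratum(records):
    """Average the error of each emulator within each stratum."""
    buckets = defaultdict(list)
    for _problem, dim, emulator, err in records:
        buckets[(dimension_stratum(dim), emulator)].append(err)
    return {key: mean(vals) for key, vals in buckets.items()}


summary = mean_error_by_stratum(results)

# Pick the emulator with the lowest mean error in each stratum.
best = {}
for (stratum, emulator), err in summary.items():
    if stratum not in best or err < summary[(stratum, best[stratum])]:
        best[stratum] = emulator
print(best)
```

If the best emulator differs across strata, as it does in this toy data, a single pooled ranking hides information that practitioners would need. The same grouping could be repeated for smoothness, stationarity, or noise level.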
minor comments (2)
- [Figure 3] Figure 3 and Table 4: axis labels and legend entries use inconsistent abbreviation conventions for emulator names; a single glossary table would improve readability.
- [Abstract] Abstract: the phrase 'some are more useful than others' is vague; a single sentence summarizing the top-performing class (e.g., Gaussian processes vs. neural nets) would strengthen the take-home message.
Simulated Author's Rebuttal
We thank the referee for their positive summary, recognition of the reproducibility infrastructure, and recommendation for minor revision. Their comments are constructive and help strengthen the manuscript's claims. We address each major comment below, indicating the revisions we will incorporate.
Point-by-point responses
- Referee: [Section 4] Section 4 (Results): The guidance claim that the rankings 'offer guidance for both method developers and practitioners selecting a surrogate for new data' is load-bearing on the representativeness of the 60+40 suite. No coverage argument, sensitivity analysis to suite composition, or stratification by problem characteristics (input dimension, smoothness, stationarity, noise) is presented; without this, the stability of relative rankings beyond the chosen collection remains unverified.
  Authors: We agree that a formal sensitivity analysis would further support the generalizability of the guidance. The 60 canonical functions are drawn from established benchmarks in the surrogate modeling literature, while the 40 real-world datasets span input dimensions 2–50, varying smoothness, stationarity, and noise levels. In revision we will add a stratification of results by input dimension and noise level in Section 4, plus an explicit limitations paragraph in the discussion acknowledging that the current suite, though diverse, does not exhaustively cover all possible problem classes. This provides partial but concrete strengthening without a full re-analysis of alternative suites.
  revision: partial
- Referee: [Section 3.2] Section 3.2 (Experimental design): The handling of emulator-specific hyperparameters is described at a high level but lacks explicit documentation of the tuning protocol (e.g., cross-validation folds, optimization budget, or default settings) applied uniformly across all 29 methods; this detail is necessary to confirm the 'apples-to-apples' claim.
  Authors: We appreciate the request for explicit documentation. In the revised manuscript we will expand Section 3.2 with a precise description of the uniform tuning protocol: 5-fold cross-validation where supported, a fixed budget of 50 hyperparameter evaluations for optimization-based methods, and package defaults otherwise. A supplementary table will enumerate the exact procedure for each of the 29 emulators. These additions will be included in the next version.
  revision: yes
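To make the protocol named in the second response concrete, here is a minimal sketch of 5-fold cross-validation combined with a fixed budget of 50 hyperparameter evaluations. The stand-in model (a Gaussian-kernel smoother), the random-search strategy, the bandwidth range, and the toy data are all assumptions for illustration; none of them is taken from the paper or from any of the 29 emulators.

```python
import math
import random


def kernel_predict(x_train, y_train, x_new, bandwidth):
    """Nadaraya-Watson smoother with a Gaussian kernel (stand-in emulator)."""
    weights = [math.exp(-0.5 * ((x_new - x) / bandwidth) ** 2) for x in x_train]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed: fall back to the training mean
        return sum(y_train) / len(y_train)
    return sum(w * y for w, y in zip(weights, y_train)) / total


def cv_rmse(x, y, bandwidth, n_folds=5):
    """Cross-validated RMSE with interleaved folds (index modulo n_folds)."""
    sq_err = 0.0
    for fold in range(n_folds):
        train = [i for i in range(len(x)) if i % n_folds != fold]
        test = [i for i in range(len(x)) if i % n_folds == fold]
        xt, yt = [x[i] for i in train], [y[i] for i in train]
        for i in test:
            sq_err += (kernel_predict(xt, yt, x[i], bandwidth) - y[i]) ** 2
    return math.sqrt(sq_err / len(x))


def tune(x, y, budget=50, seed=0):
    """Random search: draw `budget` log-uniform bandwidths, keep the best."""
    rng = random.Random(seed)
    evaluated = []
    for _ in range(budget):
        bandwidth = 10 ** rng.uniform(-2, 0)  # assumed range [0.01, 1]
        evaluated.append((cv_rmse(x, y, bandwidth), bandwidth))
    return min(evaluated), len(evaluated)


# Toy one-dimensional problem (hypothetical): one period of a sine wave.
x = [i / 39 for i in range(40)]
y = [math.sin(2 * math.pi * xi) for xi in x]

(best_rmse, best_bw), n_evals = tune(x, y)
print(n_evals, best_bw, best_rmse)
```

Fixing both the fold count and the evaluation budget gives each tunable method the same amount of search effort, which is the point of the uniformity claim. Methods that expose no tunable hyperparameters would skip `tune` and run with package defaults, as the response describes.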
Circularity Check
No circularity: purely empirical benchmarking with no derivations or self-referential predictions
The paper conducts a direct, reproducible comparison of 29 emulators on 60 canonical test functions and 40 real datasets via the duqling package. No equations, fitted parameters, uniqueness theorems, or predictions are present that reduce by construction to the paper's own inputs or prior self-citations. Performance metrics are computed from fresh runs on the chosen suite; rankings and guidance follow from those measurements without any load-bearing self-referential step. This is the standard honest outcome for an empirical benchmarking study.