pith. sign in

arxiv: 2605.16075 · v1 · pith:H4YUJQ4Bnew · submitted 2026-05-15 · 📊 stat.ME · stat.CO

REX-SUB: A Scalable Subsampling Strategy for Modeling Large Spatial Datasets

Pith reviewed 2026-05-20 15:24 UTC · model grok-4.3

classification 📊 stat.ME stat.CO
keywords spatial subsamplingGaussian processesVecchia approximationlarge spatial datasetsprediction errorexchange algorithmgeostatistics
0
0 comments X

The pith

REX-SUB selects small subsamples from large spatial datasets to reduce prediction errors in Gaussian process models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop REX-SUB, a method that uses a randomized exchange algorithm to choose a small number of locations from millions of spatial observations. This selection aims to minimize prediction errors when fitting a Gaussian process model to the subset. They combine it with the Vecchia approximation to keep computations fast even on the chosen points. Results from simulations and a precipitable water dataset indicate better mean squared prediction errors and interval scores than other subsampling techniques. This approach addresses the challenge of fitting geostatistical models to massive data collections that would otherwise be too large for standard methods.

Core claim

Through a simulation study and an application to a remotely sensed precipitable water dataset, REX-SUB yields lower mean squared prediction errors and interval scores compared to competing subsampling strategies by efficiently selecting subsamples that optimize a prediction-error criterion within a Vecchia-approximated Gaussian process framework.

What carries the argument

The randomized exchange algorithm that starts with a random subsample and iteratively swaps locations to minimize a prediction-error criterion, embedded with the Vecchia approximation to the Gaussian process likelihood for fast sparse-matrix computations on the selected points.

If this is right

  • Lower mean squared prediction errors when predicting at unsampled locations from the full spatial field.
  • Improved interval scores that better calibrate uncertainty in spatial predictions.
  • Feasible fitting of Gaussian process models on datasets containing millions of locations where direct computation is prohibitive.
  • Better performance than random sampling or other fixed subsampling strategies in both controlled simulations and real remotely sensed data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same exchange criterion could be tested on non-Gaussian spatial processes if the approximation is adjusted accordingly.
  • REX-SUB might support adaptive monitoring designs where new measurement locations are chosen sequentially to reduce future prediction error.
  • Direct comparisons against more recent subsampling techniques on additional environmental datasets would clarify when the method is most advantageous.

Load-bearing premise

The prediction-error criterion optimized inside the exchange algorithm serves as a faithful proxy for how the fitted model will perform on the full dataset rather than only the chosen subsample.

What would settle it

A new simulation or application in which the locations chosen by REX-SUB produce higher mean squared prediction errors or worse interval scores than a competing subsampling method or even simple random selection on the same data.

Figures

Figures reproduced from arXiv: 2605.16075 by Ben Seiyon Lee, Nicholas Rios.

Figure 1
Figure 1. Figure 1: Two LHDs with n = 6 points in d = 2 dimensions. Figure from Zhao et al. (2018). that attempt to maximize the minimum distance between points; this is done by the maximinLHS function in the R package lhs (Carnell, 2024). Naturally, LHDs can be used to subsample n points from N points in d−dimensional space by partitioning the space into n × n regions, and then using the LHD to select n regions. Once the n r… view at source ↗
Figure 2
Figure 2. Figure 2: Total precipitable water data obtained via the MODIS Terra Satellite obtained [PITH_FULL_IMAGE:figures/full_fig_p021_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Example of Subsample Locations for REX-SUB and LHS when [PITH_FULL_IMAGE:figures/full_fig_p022_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of MSPE, Interval Scores, and coverage of 100 subsamples of size [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of the MSPE, Interval Scores, and coverage of 100 subsamples of [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗
read the original abstract

Recent advances in data collection technologies have led to the emergence of massive spatial datasets, with measurements obtained at millions of spatial locations. Geostatistical models typically employ Gaussian processes (GPs) to capture spatial dependence, but standard GP fitting becomes prohibitive at such scales. A promising solution is optimal subsampling, where a subset of locations is selected that optimizes a criterion. In this study, we propose a randomized exchange algorithm for subsampling (REX-SUB) which efficiently selects small subsamples that minimize prediction errors in the fitted spatial GP models. To further improve computational efficiency, we embed a scalable Vecchia approximation to the GP's joint likelihood, which takes advantage of sparsity in the precision matrix to enable fast inference on the selected subsamples. Through a simulation study and an application to a remotely sensed precipitable water dataset, we show that REX-SUB yields lower mean squared prediction errors and interval scores compared to competing subsampling strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes REX-SUB, a randomized exchange algorithm for subsampling large spatial datasets to fit Gaussian process models. It embeds the Vecchia approximation to the GP likelihood for computational scalability and selects subsamples by minimizing a prediction-error criterion. Through simulations and an application to a remotely sensed precipitable water dataset, the authors report that REX-SUB produces lower mean squared prediction errors and interval scores than competing subsampling strategies.

Significance. If the performance gains survive direct validation against exact GP evaluations, the method could provide a practical, scalable tool for geostatistical modeling of massive spatial data where full GP inference is infeasible. The algorithmic combination of exchange subsampling with Vecchia sparsity is a reasonable engineering approach, and the empirical comparisons on simulated and real data supply initial supporting evidence.

major comments (2)
  1. The exchange algorithm selects subsamples by minimizing a prediction-error criterion computed from the Vecchia-approximated likelihood. No section reports a direct comparison of Vecchia-based versus exact-GP prediction errors on the adaptively chosen subsamples. This is load-bearing for the central claim because the headline result (lower MSPE and interval scores) rests on the untested assumption that the approximate criterion induces the same ranking as the exact GP.
  2. Simulation Study: the reported gains are shown for a fixed Vecchia conditioning set size, but the manuscript contains no sensitivity analysis to this parameter or to the exchange neighborhood definition. Because Vecchia accuracy depends on both the conditioning sets and the spatial configuration of the chosen points, it remains unclear whether the superiority over competitors persists under reasonable changes to these free parameters.
minor comments (2)
  1. Abstract: the statement that REX-SUB 'yields lower mean squared prediction errors and interval scores' would be strengthened by a brief quantitative indication of the magnitude of improvement and the identity of the competing strategies.
  2. Notation: the free parameters (subsample size m and Vecchia conditioning set size) should be explicitly listed in the experimental sections so that readers can reproduce the precise settings used for the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and describe the revisions that will be incorporated to strengthen the manuscript.

read point-by-point responses
  1. Referee: The exchange algorithm selects subsamples by minimizing a prediction-error criterion computed from the Vecchia-approximated likelihood. No section reports a direct comparison of Vecchia-based versus exact-GP prediction errors on the adaptively chosen subsamples. This is load-bearing for the central claim because the headline result (lower MSPE and interval scores) rests on the untested assumption that the approximate criterion induces the same ranking as the exact GP.

    Authors: We agree this validation would strengthen the central claim. While exact GP evaluation is computationally prohibitive for the large-scale settings that motivate the work, we will add a new subsection to the simulation study that performs the comparison on smaller datasets (n=500 and n=1000) where exact GP computations remain feasible. In this analysis we will report the overlap in selected subsamples, the correlation between Vecchia and exact criteria, and the resulting MSPE and interval scores to confirm that the approximate criterion preserves the ranking of candidate subsamples. revision: yes

  2. Referee: Simulation Study: the reported gains are shown for a fixed Vecchia conditioning set size, but the manuscript contains no sensitivity analysis to this parameter or to the exchange neighborhood definition. Because Vecchia accuracy depends on both the conditioning sets and the spatial configuration of the chosen points, it remains unclear whether the superiority over competitors persists under reasonable changes to these free parameters.

    Authors: We acknowledge the value of demonstrating robustness. The revised manuscript will include additional simulation results that vary the Vecchia conditioning set size (m = 5, 10, 20, 30) and the exchange neighborhood size (k = 10, 20, 30 nearest neighbors). These experiments will be reported alongside the original results to show that the performance advantages of REX-SUB over the competing subsampling strategies are maintained across these reasonable parameter choices. revision: yes

Circularity Check

0 steps flagged

No significant circularity: algorithmic method evaluated on external benchmarks

full rationale

The paper introduces REX-SUB as a randomized exchange algorithm that selects subsamples by minimizing a prediction-error criterion computed under the Vecchia approximation. All performance claims (lower MSPE and interval scores versus competitors) are established through separate simulation studies with known truth and an application to a real remotely sensed dataset. No section presents a closed-form derivation, optimality condition, or prediction that reduces by the paper's own equations to a fitted quantity defined on the same data. The Vecchia approximation is a standard external technique, and the central results rest on empirical comparisons rather than self-referential definitions or self-citation chains.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on the standard assumption that spatial dependence is well captured by a Gaussian process with a stationary covariance function, plus the modeling choice that a fixed-size subsample plus Vecchia sparsity pattern is sufficient to approximate the full joint likelihood.

free parameters (2)
  • subsample size m
    Chosen by the user; performance depends on this value but no automatic selection rule is described in the abstract.
  • Vecchia conditioning set size
    Controls the sparsity pattern and approximation accuracy; must be set by the practitioner.
axioms (1)
  • domain assumption The spatial field is a realization of a Gaussian process with known parametric covariance family.
    Invoked when the prediction criterion is defined and when the Vecchia approximation is applied.

pith-pipeline@v0.9.0 · 5689 in / 1381 out tokens · 29057 ms · 2026-05-20T15:24:54.308037+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 1 internal anchor

  1. [1]

    Journal of Statistical Planning and Inference , volume=

    Information-based optimal subdata selection for big data logistic regression , author=. Journal of Statistical Planning and Inference , volume=. 2020 , publisher=

  2. [2]

    Technometrics , volume=

    A comparison of algorithms for constructing exact D-optimal designs , author=. Technometrics , volume=. 1980 , publisher=

  3. [3]

    Journal of statistical planning and inference , volume=

    Minimax and maximin distance designs , author=. Journal of statistical planning and inference , volume=. 1990 , publisher=

  4. [4]

    Journal of Quality Technology , volume=

    A-optimal versus D-optimal design of screening experiments , author=. Journal of Quality Technology , volume=. 2021 , publisher=

  5. [5]

    arXiv preprint arXiv:2309.00720 , year=

    Information-based Optimal Subdata Selection for Clusterwise Linear Regression , author=. arXiv preprint arXiv:2309.00720 , year=

  6. [6]

    Journal of Statistical Theory and Practice , volume=

    Model-robust subdata selection for big data , author=. Journal of Statistical Theory and Practice , volume=. 2021 , publisher=

  7. [7]

    The New England Journal of Statistics in Data Science , volume=

    Subdata selection with a large number of variables , author=. The New England Journal of Statistics in Data Science , volume=. 2023 , publisher=

  8. [8]

    Journal of the American Statistical Association , volume=

    Information-based optimal subdata selection for big data linear regression , author=. Journal of the American Statistical Association , volume=. 2019 , publisher=

  9. [9]

    Atmospheric Environment , volume=

    MODIS Collection 6.1 aerosol optical depth products over land and ocean: validation and comparison , author=. Atmospheric Environment , volume=. 2019 , publisher=

  10. [10]

    Spatial Statistics , volume=

    Hierarchical statistical modeling of big spatial datasets using the exponential family of distributions , author=. Spatial Statistics , volume=. 2013 , publisher=

  11. [11]

    Journal of Computational and Graphical Statistics , volume=

    A computationally efficient projection-based approach for spatial generalized linear mixed models , author=. Journal of Computational and Graphical Statistics , volume=. 2018 , publisher=

  12. [12]

    The cryosphere , volume=

    Bedmap2: improved ice bed, surface and thickness datasets for Antarctica , author=. The cryosphere , volume=. 2013 , publisher=

  13. [13]

    Computational Statistics & Data Analysis , volume=

    Vecchia--Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data , author=. Computational Statistics & Data Analysis , volume=. 2021 , publisher=

  14. [14]

    Journal of the American Statistical Association , volume=

    I-optimal design of mixture experiments , author=. Journal of the American Statistical Association , volume=. 2016 , publisher=

  15. [15]

    2013 , publisher=

    Theory of optimal experiments , author=. 2013 , publisher=

  16. [16]

    Journal of Multivariate analysis , volume=

    Asymptotic properties of a maximum likelihood estimator with data from a Gaussian process , author=. Journal of Multivariate analysis , volume=. 1991 , publisher=

  17. [17]

    Journal of the American Statistical Association , volume =

    Hao Zhang , title =. Journal of the American Statistical Association , volume =. 2004 , publisher =. doi:10.1198/016214504000000241 , URL =

  18. [18]

    The Annals of Statistics , pages=

    Uniform asymptotic optimality of linear predictions of a random field using an incorrect second-order structure , author=. The Annals of Statistics , pages=. 1990 , publisher=

  19. [19]

    Journal of the American statistical Association , volume=

    Strictly proper scoring rules, prediction, and estimation , author=. Journal of the American statistical Association , volume=. 2007 , publisher=

  20. [20]

    Rasmussen, Carl Edward and Williams, Christopher K. I. , title =. 2005 , month =. doi:10.7551/mitpress/3206.001.0001 , url =

  21. [21]

    International Conference on Artificial Intelligence and Statistics , pages=

    On random subsampling of Gaussian process regression: A graphon-based analysis , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2020 , organization=

  22. [22]

    Statistica Sinica , volume=

    Gaussian process prediction using design-based subsampling , author=. Statistica Sinica , volume=. 2022 , publisher=

  23. [23]

    Statistica Sinica , volume=

    Efficient Gaussian process modeling using experimental design-based subagging , author=. Statistica Sinica , volume=. 2018 , publisher=

  24. [24]

    Technometrics , volume=

    Comparison of three methods for selecting values of input variables in the analysis of output from a computer code , author=. Technometrics , volume=

  25. [25]

    2024 , note =

    lhs: Latin Hypercube Samples , author =. 2024 , note =

  26. [26]

    Journal of the Royal Statistical Society: Series B (Methodological) , volume=

    Estimation and model identification for continuous spatial processes , author=. Journal of the Royal Statistical Society: Series B (Methodological) , volume=. 1988 , publisher=

  27. [27]

    Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=

    Approximating likelihoods for large spatial data sets , author=. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , volume=. 2004 , publisher=

  28. [28]

    Hierarchical nearest-neighbor

    Datta, Abhirup and Banerjee, Sudipto and Finley, Andrew O and Gelfand, Alan E , date-added =. Hierarchical nearest-neighbor. Journal of the American Statistical Association , number =

  29. [29]

    Vecchia approximations of

    Katzfuss, Matthias and Guinness, Joseph and Gong, Wenlong and Zilber, Daniel , journal=. Vecchia approximations of. 2020 , publisher=

  30. [30]

    Technometrics , volume=

    Permutation and grouping methods for sharpening Gaussian process approximations , author=. Technometrics , volume=. 2018 , publisher=

  31. [31]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , pages=

    Graphical methods for Order-of-Addition experiments , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , pages=. 2025 , publisher=

  32. [32]

    Computational Statistics & Data Analysis , volume=

    Exact designs for order-of-addition experiments under a transition-effect model , author=. Computational Statistics & Data Analysis , volume=. 2025 , publisher=

  33. [33]

    Computational Statistics & Data Analysis , volume=

    TA algorithms for D-optimal OofA Mixture designs , author=. Computational Statistics & Data Analysis , volume=. 2022 , publisher=

  34. [34]

    1990 , publisher=

    Spline models for observational data , author=. 1990 , publisher=

  35. [35]

    2011 , publisher=

    Statistics for spatio-temporal data , author=. 2011 , publisher=

  36. [36]

    2003 , publisher=

    Hierarchical modeling and analysis for spatial data , author=. 2003 , publisher=

  37. [37]

    Annual Review of Statistics and Its Application , volume=

    Basis-function models in spatial statistics , author=. Annual Review of Statistics and Its Application , volume=. 2022 , publisher=

  38. [38]

    Journal of Agricultural, Biological and Environmental Statistics , volume=

    A case study competition among methods for analyzing large spatial data , author=. Journal of Agricultural, Biological and Environmental Statistics , volume=. 2019 , publisher=

  39. [39]

    Technometrics , volume=

    A scalable partitioned approach to model massive nonstationary non-Gaussian spatial datasets , author=. Technometrics , volume=. 2023 , publisher=

  40. [40]

    Journal of Computational and Graphical Statistics , volume=

    Adaptive Bayesian nonstationary modeling for large spatial datasets using covariance approximations , author=. Journal of Computational and Graphical Statistics , volume=. 2014 , publisher=

  41. [41]

    Bayesian Analysis , volume=

    Incorporating subsampling into Bayesian models for high-dimensional spatial data , author=. Bayesian Analysis , volume=. 2024 , publisher=

  42. [42]

    Test , volume=

    Comparing and selecting spatial predictors using local criteria , author=. Test , volume=. 2015 , publisher=

  43. [43]

    Journal of Computational and Graphical Statistics , volume=

    An approach to incorporate subsampling into a generic Bayesian hierarchical model , author=. Journal of Computational and Graphical Statistics , volume=. 2021 , publisher=

  44. [44]

    Statistical Science , volume=

    A general framework for Vecchia approximations of Gaussian processes , author=. Statistical Science , volume=. 2021 , publisher=

  45. [45]

    An explicit link between

    Lindgren, Finn and Rue, H. An explicit link between. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , number =

  46. [46]

    Environmental and Ecological Statistics , volume=

    A process-convolution approach to modelling temperatures in the North Atlantic Ocean , author=. Environmental and Ecological Statistics , volume=. 1998 , publisher=

  47. [47]

    and Johannesson, G

    Cressie, N. and Johannesson, G. , date-added =. Journal of the Royal Statistical Society: Series B (Statistical Methodology) , number =

  48. [48]

    and Gelfand, A.E

    Banerjee, S. and Gelfand, A.E. and Finley, A.O. and Sang, H. , journal =

  49. [49]

    Covariance tapering for interpolation of large spatial datasets , volume =

    Furrer, Reinhard and Genton, Marc G and Nychka, Douglas , journal=. Covariance tapering for interpolation of large spatial datasets , volume =

  50. [50]

    Limitations on low rank approximations for covariance matrices of spatial data , volume =

    Stein, Michael L , journal =. Limitations on low rank approximations for covariance matrices of spatial data , volume =

  51. [51]

    Vecchia, A. V. , date-added =. Estimation and model identification for continuous spatial processes , volume =. Journal of the Royal Statistical Society, Series B, Methodological , keywords =

  52. [52]

    The Annals of Applied Statistics , pages=

    Covariance approximation for large multivariate spatial data sets with an application to multiple climate model errors , author=. The Annals of Applied Statistics , pages=. 2011 , publisher=

  53. [53]

    Environmetrics: The official journal of the International Environmetrics Society , volume=

    A high frequency kriging approach for non-stationary environmental processes , author=. Environmetrics: The official journal of the International Environmetrics Society , volume=. 2001 , publisher=

  54. [54]

    Journal of the American Statistical Association , volume=

    Analyzing nonstationary spatial data using piecewise Gaussian processes , author=. Journal of the American Statistical Association , volume=. 2005 , publisher=

  55. [55]

    Local likelihood estimation for covariance functions with spatially-varying parameters: the convoSPAT package for R

    Local likelihood estimation for covariance functions with spatially-varying parameters: the convoSPAT package for R , author=. arXiv preprint arXiv:1507.08613 , year=

  56. [56]

    Fast sampling of

    Rue, Havard , journal=. Fast sampling of. 2001 , publisher=

  57. [57]

    Journal of Computational and Graphical Statistics , volume=

    A multiresolution Gaussian process model for the analysis of large spatial datasets , author=. Journal of Computational and Graphical Statistics , volume=. 2015 , publisher=

  58. [58]

    Journal of the american statistical association , volume=

    Nonseparable, stationary covariance functions for space--time data , author=. Journal of the american statistical association , volume=. 2002 , publisher=

  59. [59]

    Journal of the American Statistical association , volume=

    Classes of nonseparable, spatio-temporal stationary covariance functions , author=. Journal of the American Statistical association , volume=. 1999 , publisher=

  60. [60]

    Journal of the American Statistical Association , volume=

    Stable and efficient multiple smoothing parameter estimation for generalized additive models , author=. Journal of the American Statistical Association , volume=. 2004 , publisher=

  61. [61]

    Journal of Statistical Software , volume=

    FRK: An R package for spatial and spatio-temporal prediction with large datasets , author=. Journal of Statistical Software , volume=

  62. [62]

    Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=

    Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations , author=. Journal of the Royal Statistical Society Series B: Statistical Methodology , volume=. 2009 , publisher=

  63. [63]

    A multi-resolution approximation for massive spatial datasets , volume =

    Katzfuss, Matthias , journal =. A multi-resolution approximation for massive spatial datasets , volume =

  64. [64]

    Environmetrics: The official journal of the International Environmetrics Society , volume=

    Spatial modelling using a new class of nonstationary covariance functions , author=. Environmetrics: The official journal of the International Environmetrics Society , volume=. 2006 , publisher=

  65. [65]

    Ecology , volume=

    The basis function approach for modeling autocorrelation in ecological data , author=. Ecology , volume=. 2017 , publisher=

  66. [66]

    hetgp: Heteroskedastic Gaussian process modeling and sequential design in

    Binois, Micka. hetgp: Heteroskedastic Gaussian process modeling and sequential design in. Journal of Statistical Software , volume=

  67. [67]

    Springer Series in Statistics , year=

    The Design and Analysis of Computer Experiments , author=. Springer Series in Statistics , year=

  68. [68]

    arXiv preprint arXiv:2209.11280 , year=

    Scalable Gaussian Process Hyperparameter Optimization via Coverage Regularization , author=. arXiv preprint arXiv:2209.11280 , year=

  69. [69]

    Space Weather , volume=

    GeoDGP: One-hour ahead global probabilistic geomagnetic perturbation forecasting using deep Gaussian process , author=. Space Weather , volume=. 2025 , publisher=

  70. [70]

    SIAM/ASA Journal on Uncertainty Quantification , volume=

    Fully bayesian inference for latent variable gaussian process models , author=. SIAM/ASA Journal on Uncertainty Quantification , volume=. 2023 , publisher=

  71. [71]

    Statistics and Computing , volume=

    Gaussian process learning via Fisher scoring of Vecchia’s approximation , author=. Statistics and Computing , volume=. 2021 , publisher=