The Theory and Practice of Highly Scalable Gaussian Process Regression with Nearest Neighbours

Anthony Stephenson; Robert Allison; Tomasz Maciazek

arxiv: 2604.07267 · v1 · submitted 2026-04-08 · 📊 stat.ML · cs.LG

The Theory and Practice of Highly Scalable Gaussian Process Regression with Nearest Neighbours

Robert Allison , Tomasz Maciazek , Anthony Stephenson This is my paper

Pith reviewed 2026-05-10 17:12 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords Gaussian process regressionnearest neighborsscalabilityuniversal consistencyminimax ratesmean squared errornegative log-likelihoodhyperparameter robustness

0 comments

The pith

Nearest-neighbor Gaussian process regression achieves universal consistency and Stone's minimax rate under mild assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theoretical framework showing that predicting with only the nearest neighbors in Gaussian process regression delivers the same large-sample behavior as using the full training set. It establishes almost sure pointwise limits for mean squared error, calibration, and negative log-likelihood, then proves the method is universally consistent and attains the optimal minimax L2-risk rate. This matters because it supplies the missing statistical justification for replacing cubic-cost full GPs with a linear-cost scalable alternative that already works well in practice. A reader cares because the results explain why the approximation remains reliable even when hyperparameters are not finely tuned.

Core claim

Under mild regularity assumptions, nearest-neighbor Gaussian process regression admits almost sure pointwise limits for MSE, calibration coefficient, and negative log-likelihood. The L2-risk is universally consistent and attains Stone's minimax rate n^{-2α/(2p+d)}. MSE converges uniformly over compact hyper-parameter sets, and its derivatives with respect to lengthscale, kernel scale, and noise variance vanish asymptotically with explicit rates.

What carries the argument

Nearest-neighbor selection for GP prediction, which reduces the per-test-point cost from cubic in n to linear while preserving the stated asymptotic limits and rates.

If this is right

The NN approximation can replace full GP regression for large datasets without sacrificing asymptotic optimality.
Hyper-parameter tuning becomes less critical because derivatives of MSE vanish, explaining observed robustness.
Both geospatial NNGP and general GPnn methods rest on the same rigorous foundation.
The results extend the applicability of GP models to regimes where cubic scaling was previously prohibitive.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same nearest-neighbor analysis may apply to other local approximations in kernel methods or Bayesian nonparametrics.
Empirical checks on the rate could be performed by subsampling large datasets and tracking risk decay against n.
The framework suggests studying whether similar limits hold for non-stationary kernels or heteroscedastic noise.

Load-bearing premise

Mild regularity assumptions on the regression function, kernel, and data distribution are sufficient for the almost-sure limits, universal consistency, and minimax rate to hold.

What would settle it

A simulation or real dataset where the regression function or kernel violates the regularity conditions and the observed L2-risk fails to reach or beats the claimed minimax rate.

Figures

Figures reproduced from arXiv: 2604.07267 by Anthony Stephenson, Robert Allison, Tomasz Maciazek.

**Figure 1.** Figure 1: The GP nn workflow, including the above-described cheap hyper-parameter estimation and the calibration step of the predictive variance (see the main text for explanation). H¨older-like property (AR.2-3), responses having the H¨older property (AR.4) and to data distributions/noise models having the properties (AR.5) and (AR.6) specified below. (AR.3) The normalised covariance function of the GP sample pat… view at source ↗

**Figure 2.** Figure 2: Experiment results on a suite of UCI datasets. Optimal calibration performance is [PITH_FULL_IMAGE:figures/full_fig_p021_2.png] view at source ↗

**Figure 3.** Figure 3: and [PITH_FULL_IMAGE:figures/full_fig_p028_3.png] view at source ↗

**Figure 5.** Figure 5: Risk landscape (shifted) as a function of the hyperparameters and training set [PITH_FULL_IMAGE:figures/full_fig_p029_5.png] view at source ↗

**Figure 6.** Figure 6: NNGP regression with PX uniform on dX = 2 disk. Log–log plots of the risk derivatives ∂ϕRbn for ν = 1/2. Dashed reference lines show the fitted slopes. ϕ Fitted Theory R2 ˆℓ 0.3239 0.3333 . . . 0.9971 σˆ 2 f 0.3269 0.3333 . . . 0.9927 σˆ 2 ξ 0.3041 0.3333 . . . 0.9911 ˆb 0.7754 0.6666 . . . 0.9090 [PITH_FULL_IMAGE:figures/full_fig_p030_6.png] view at source ↗

read the original abstract

Gaussian process ($GP$) regression is a widely used non-parametric modeling tool, but its cubic complexity in the training size limits its use on massive data sets. A practical remedy is to predict using only the nearest neighbours of each test point, as in Nearest Neighbour Gaussian Process ($NNGP$) regression for geospatial problems and the related scalable $GPnn$ method for more general machine-learning applications. Despite their strong empirical performance, the large-$n$ theory of $NNGP/GPnn$ remains incomplete. We develop a theoretical framework for $NNGP$ and $GPnn$ regression. Under mild regularity assumptions, we derive almost sure pointwise limits for three key predictive criteria: mean squared error ($MSE$), calibration coefficient ($CAL$), and negative log-likelihood ($NLL$). We then study the $L_2$-risk, prove universal consistency, and show that the risk attains Stone's minimax rate $n^{-2\alpha/(2p+d)}$, where $\alpha$ and $p$ capture regularity of the regression problem. We also prove uniform convergence of $MSE$ over compact hyper-parameter sets and show that its derivatives with respect to lengthscale, kernel scale, and noise variance vanish asymptotically, with explicit rates. This explains the observed robustness of $GPnn$ to hyper-parameter tuning. These results provide a rigorous statistical foundation for $NNGP/GPnn$ as a highly scalable and principled alternative to full $GP$ models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper closes the large-n theory gap for NNGP/GPnn with almost-sure limits, universal consistency, and the usual minimax rate under standard nonparametric assumptions.

read the letter

The main takeaway is that nearest-neighbor Gaussian process methods now have proper asymptotic backing. The authors derive almost-sure pointwise limits for MSE, calibration, and NLL, prove universal consistency, and show the L2 risk reaches Stone's minimax rate under the usual smoothness and design conditions. They also establish uniform convergence of MSE over compact hyperparameter sets and that the derivatives with respect to lengthscale, kernel scale, and noise variance vanish at explicit rates, which matches the observed robustness to tuning.

Referee Report

0 major / 2 minor

Summary. The manuscript develops a theoretical framework for nearest-neighbour Gaussian process (NNGP/GPnn) regression. Under mild regularity assumptions on the regression function, kernel, and data distribution, it derives almost-sure pointwise limits for the mean squared error (MSE), calibration coefficient (CAL), and negative log-likelihood (NLL). It then establishes universal consistency of the L2-risk, shows that the risk attains Stone's minimax rate n^{-2α/(2p+d)}, proves uniform convergence of MSE over compact hyper-parameter sets, and demonstrates that the derivatives of MSE with respect to lengthscale, kernel scale, and noise variance vanish asymptotically with explicit rates.

Significance. If the derivations hold, the paper supplies a rigorous statistical foundation for scalable GP methods that are already used in geospatial and general ML applications. The almost-sure limits, universal consistency, attainment of the minimax rate, and asymptotic robustness to hyper-parameter choice are substantive contributions that adapt classical nonparametric arguments to the NN-GP predictor. These results explain observed empirical behaviour and position NNGP/GPnn as principled large-scale alternatives to full GPs.

minor comments (2)

The statement of the regularity conditions (Hölder/Sobolev smoothness, kernel properties, design density) appears in the main theorems but could be collected in a single preliminary section or assumption box for easier reference.
Notation for the number of neighbours k_n and its growth rate relative to n is introduced gradually; an early table or remark summarising the admissible regimes would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, detailed summary of our contributions, and recommendation to accept. We are pleased that the almost-sure limits, universal consistency, minimax rate attainment, and asymptotic robustness results are viewed as substantive.

Circularity Check

0 steps flagged

No significant circularity; derivations rest on external regularity assumptions

full rationale

The paper's central claims consist of almost-sure pointwise limits for MSE, CAL and NLL, universal consistency, attainment of Stone's minimax rate, and uniform convergence of MSE and its hyper-parameter derivatives. All results are explicitly conditioned on mild regularity assumptions (Hölder/Sobolev smoothness of the target function, standard kernel and design-density conditions) that are stated as external to the fitted model and are the usual nonparametric-regression hypotheses. The proofs adapt classical bias-variance arguments to the nearest-neighbor GP predictor without any reduction of a claimed prediction to a fitted quantity, without self-definitional equations, and without load-bearing self-citations. The minimax-rate derivation follows the standard decomposition once the neighbor count is allowed to grow with n, remaining independent of the paper's own fitted values. Consequently the derivation chain is self-contained against external benchmarks and exhibits no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on mild regularity assumptions whose precise statement is not supplied in the abstract; no free parameters or new postulated entities are introduced.

axioms (1)

domain assumption Mild regularity assumptions on the regression function, kernel, and data distribution
Invoked throughout to obtain almost sure pointwise limits, universal consistency, and the minimax rate.

pith-pipeline@v0.9.0 · 5563 in / 1497 out tokens · 50456 ms · 2026-05-10T17:12:47.834120+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

We have h ∆ ˜A i ij = K2 N ij − h (K∞ N )2 i ij = 2ˆσ2 ξ ˆσ2 f −k θ(xi,x j) + mX k=1 ˆσ4 f −k θ(xi,x k)kθ(xj,x k) = 2ˆσ2 ξ ˆσ2 f ϵij + ˆσ4 f mX k=1 (ϵik +ϵ jk −ϵ ikϵjk)

Let us set ˜ϵA = ˜rA =:ϵ E,2 and ˜ϵb = ˜rb. We have h ∆ ˜A i ij = K2 N ij − h (K∞ N )2 i ij = 2ˆσ2 ξ ˆσ2 f −k θ(xi,x j) + mX k=1 ˆσ4 f −k θ(xi,x k)kθ(xj,x k) = 2ˆσ2 ξ ˆσ2 f ϵij + ˆσ4 f mX k=1 (ϵik +ϵ jk −ϵ ikϵjk). Thus, h ∆ ˜A i ij <2ˆσ2 ξ ˆσ2 f +mˆσ4 f , which implies that ϵE,2 = 1 ˆσ2 ξ +mˆσ2 f 2 max 1≤i≤m mX j=1 h ∆ ˜A i ij < m 2ˆσ2 ξ ˆσ2 f +mˆσ4 f ˆσ2...

work page 2002
[2]

Letf(u) =u ρ 2(u) andg(u) =L 2 u2p+1, thusf(0) =g(0) = 0

What is more, lim u→0+ (u ρ2(u))′ = lim u→0+ u(ρ 2(u))′ + lim u→0+ ρ2(u) = lim u→0+ u(ρ 2(u))′, 78 Theory and Practice ofN N GPandGP nn sinceρ(0) = 0. Letf(u) =u ρ 2(u) andg(u) =L 2 u2p+1, thusf(0) =g(0) = 0. Because ρ(u) satisfies (G.15), we also have 0≤f(u)≤g(u). This implies the following bound for the right-sided derivative off(u) f ′(0+) = lim h→0+ f...

work page
[3]

hold, we have that ∂M SE(x ∗, X) ∂ˆℓ n→∞− − − →0 (G.45) uniformly onS. 86 Theory and Practice ofN N GPandGP nn Proof(ForGP nn– theN N GPcounterpart follows straightforwardly using the same techniques.) We follow the same strategy as in the proof of Theorem 15. Let us consider the derivative with respect to∂(ˆσ 2 ξ). The proofs for the remaining derivative...

work page 2018

[1] [1]

We have h ∆ ˜A i ij = K2 N ij − h (K∞ N )2 i ij = 2ˆσ2 ξ ˆσ2 f −k θ(xi,x j) + mX k=1 ˆσ4 f −k θ(xi,x k)kθ(xj,x k) = 2ˆσ2 ξ ˆσ2 f ϵij + ˆσ4 f mX k=1 (ϵik +ϵ jk −ϵ ikϵjk)

Let us set ˜ϵA = ˜rA =:ϵ E,2 and ˜ϵb = ˜rb. We have h ∆ ˜A i ij = K2 N ij − h (K∞ N )2 i ij = 2ˆσ2 ξ ˆσ2 f −k θ(xi,x j) + mX k=1 ˆσ4 f −k θ(xi,x k)kθ(xj,x k) = 2ˆσ2 ξ ˆσ2 f ϵij + ˆσ4 f mX k=1 (ϵik +ϵ jk −ϵ ikϵjk). Thus, h ∆ ˜A i ij <2ˆσ2 ξ ˆσ2 f +mˆσ4 f , which implies that ϵE,2 = 1 ˆσ2 ξ +mˆσ2 f 2 max 1≤i≤m mX j=1 h ∆ ˜A i ij < m 2ˆσ2 ξ ˆσ2 f +mˆσ4 f ˆσ2...

work page 2002

[2] [2]

Letf(u) =u ρ 2(u) andg(u) =L 2 u2p+1, thusf(0) =g(0) = 0

What is more, lim u→0+ (u ρ2(u))′ = lim u→0+ u(ρ 2(u))′ + lim u→0+ ρ2(u) = lim u→0+ u(ρ 2(u))′, 78 Theory and Practice ofN N GPandGP nn sinceρ(0) = 0. Letf(u) =u ρ 2(u) andg(u) =L 2 u2p+1, thusf(0) =g(0) = 0. Because ρ(u) satisfies (G.15), we also have 0≤f(u)≤g(u). This implies the following bound for the right-sided derivative off(u) f ′(0+) = lim h→0+ f...

work page

[3] [3]

hold, we have that ∂M SE(x ∗, X) ∂ˆℓ n→∞− − − →0 (G.45) uniformly onS. 86 Theory and Practice ofN N GPandGP nn Proof(ForGP nn– theN N GPcounterpart follows straightforwardly using the same techniques.) We follow the same strategy as in the proof of Theorem 15. Let us consider the derivative with respect to∂(ˆσ 2 ξ). The proofs for the remaining derivative...

work page 2018