pith. machine review for the scientific record.

arxiv: 2605.05523 · v1 · submitted 2026-05-06 · 📊 stat.ML · cs.LG · stat.CO

Recognition: unknown

Permutation-preserving Functions and Neural Vecchia Covariance Kernels

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 15:32 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · stat.CO
keywords Gaussian processes · Vecchia approximation · covariance kernels · permutation symmetry · neural networks · non-stationary kernels · kriging coefficients · scalable GPs

The pith

Neural networks can learn the kriging coefficients and conditional standard deviations of a Vecchia factorization to produce scalable, non-stationary Gaussian process kernels while respecting permutation symmetry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the deterministic quantities in a Vecchia approximation uniquely define a covariance matrix and can be learned directly by neural networks. It derives a universal representation for functions that stay unchanged under any reordering of the conditioning sets, then builds network architectures that enforce this invariance by construction. This symmetry-aware parameterization yields more stable training and better data efficiency than generic networks while still allowing fully non-stationary kernels. The resulting models remain computationally linear in the number of observations, preserving the scalability that originally made Vecchia approximations attractive for large spatial data.
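
To make that parameterization concrete, here is a minimal numpy sketch, not the authors' code, of how kriging coefficients b_i and conditional standard deviations d_i assemble into a covariance matrix: stacking the b_i into a strictly lower-triangular matrix B gives Σ = (I − B)⁻¹ D² (I − B)⁻ᵀ, which is positive definite for any fixed ordering as long as every d_i is positive. Function and variable names are illustrative.

```python
# Minimal sketch (not the paper's code): the deterministic Vecchia quantities
# b_i (kriging coefficients) and d_i (conditional standard deviations) define
# y_i | y_{c(i)} ~ N(b_i' y_{c(i)}, d_i^2).  Stacking the b_i into a strictly
# lower-triangular B gives (I - B) y = D eps, so Sigma = (I-B)^{-1} D^2 (I-B)^{-T}.
import numpy as np

def vecchia_to_covariance(cond_sets, coeffs, cond_sd):
    """cond_sets[i]: indices of the conditioning set of point i (subset of 0..i-1);
    coeffs[i]: kriging coefficients b_i; cond_sd[i]: conditional std dev d_i."""
    n = len(cond_sd)
    B = np.zeros((n, n))
    for i, (cs, b) in enumerate(zip(cond_sets, coeffs)):
        B[i, cs] = b                                        # strictly lower triangular
    L = np.linalg.solve(np.eye(n) - B, np.diag(cond_sd))    # L = (I - B)^{-1} D
    return L @ L.T                                          # PD whenever all d_i > 0

# Three points, each conditioning on all previous ones.
cond_sets = [[], [0], [0, 1]]
coeffs = [np.array([]), np.array([0.6]), np.array([0.3, 0.4])]
cond_sd = np.array([1.0, 0.8, 0.7])
Sigma = vecchia_to_covariance(cond_sets, coeffs, cond_sd)
print(np.linalg.eigvalsh(Sigma))                            # all eigenvalues positive
```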

Core claim

We introduce a novel framework for constructing scalable and flexible covariance kernels for Gaussian processes by directly learning the covariance structure under a regression-type parameterization induced by Vecchia approximations. Deep neural architectures model the kriging coefficients and conditional standard deviations, exploiting the permutation-equivariant structure of the conditioning sets to enforce permutation-preserving representations.

What carries the argument

The universal representation for permutation-preserving functions, derived from the permutation-equivariant structure of the conditioning sets in the Vecchia factorization; this representation is used to construct neural network layers that respect the symmetry when predicting kriging coefficients and conditional standard deviations.
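
A minimal PyTorch sketch of what such a symmetry-respecting layer can look like, assuming the Deep Sets form ρ(∑_j φ(x_j)) for the permutation-invariant part; the layer widths, the Softplus head for the conditional standard deviation, and the class name are illustrative assumptions rather than the paper's exact architecture. The kriging coefficients would use the equivariant counterpart, producing one weight per conditioning point.

```python
# Sketch of a permutation-invariant head in the spirit of the universal
# representation f(X) = rho(sum_j phi(x_j)), applied to one conditioning set.
# Sizes and names are illustrative assumptions, not the authors' architecture.
import torch
import torch.nn as nn

class InvariantSigmaHead(nn.Module):
    def __init__(self, d_in, d_latent=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_in, d_latent), nn.ReLU(),
                                 nn.Linear(d_latent, d_latent))
        self.rho = nn.Sequential(nn.Linear(d_latent, d_latent), nn.ReLU(),
                                 nn.Linear(d_latent, 1), nn.Softplus())  # sigma > 0

    def forward(self, X):                 # X: (batch, m, d_in) conditioning sets
        pooled = self.phi(X).sum(dim=1)   # sum over the set -> order-insensitive
        return self.rho(pooled)           # conditional standard deviation

net = InvariantSigmaHead(d_in=4)
X = torch.randn(2, 5, 4)
perm = torch.randperm(5)
# Reordering the conditioning set leaves the output unchanged by construction.
assert torch.allclose(net(X), net(X[:, perm, :]), atol=1e-5)
```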

If this is right

  • The learned kernels remain computationally linear in the number of observations, allowing application to large datasets without losing Vecchia scalability (a cost sketch follows this list).
  • Non-stationary covariance structure can be expressed directly through the neural parameterization without assuming stationarity or fixed kernel forms.
  • Training stability improves because the network targets are deterministic quantities rather than noisy likelihood evaluations.
  • Data efficiency increases because the permutation symmetry reduces the effective number of distinct training examples the network must see.
  • The same architecture can be reused across different conditioning set sizes by construction of the universal representation.
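
A minimal numpy sketch of the cost argument behind the first bullet, under the assumption that the kriging weights and conditional standard deviations arrive as plain arrays (in the paper they would come from the neural parameterization): the Vecchia log-likelihood is a sum of n univariate conditional terms, each touching only an m-point conditioning set, so evaluation is O(nm).

```python
# Sketch of the linear-cost evaluation: n conditional terms, each O(m).
import numpy as np

def vecchia_loglik(y, cond_sets, coeffs, cond_sd):
    ll = 0.0
    for i in range(len(y)):                          # n terms
        mu = coeffs[i] @ y[cond_sets[i]]             # O(m) kriging predictor
        ll += (-0.5 * np.log(2 * np.pi) - np.log(cond_sd[i])
               - 0.5 * ((y[i] - mu) / cond_sd[i]) ** 2)
    return ll                                        # total cost O(n * m)
```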

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same symmetry-derived representation could be applied to other sequential factorizations or to autoregressive models outside Gaussian processes.
  • Because the network predicts deterministic quantities, the approach may combine more cleanly with variational inference or sparse approximations than black-box kernel networks.
  • A direct test would compare predictive performance on spatial datasets when the same neural capacity is given either the symmetry-respecting architecture or a generic feed-forward network without explicit permutation handling.

Load-bearing premise

That neural outputs for kriging coefficients and conditional standard deviations will automatically produce a valid positive-definite covariance matrix for any input configuration, and that enforcing permutation preservation is enough to guarantee stable training without further regularization.

What would settle it

Finding input configurations where the learned kriging coefficients and conditional standard deviations yield a covariance matrix with negative eigenvalues, or showing that training destabilizes, or that the predicted quantities change, when the points within each conditioning set are randomly reordered.
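
A hedged numpy sketch of that test; the toy predictor below is a stand-in for the trained network (it depends on the conditioning set only through order-insensitive quantities, so it is permutation-preserving by construction), and none of the names or numbers come from the paper. It assembles the implied covariance, reports the smallest eigenvalue, and checks that shuffling a conditioning set leaves the predicted quantities unchanged.

```python
# Sketch of the falsification test: assemble the implied covariance from the
# predicted kriging coefficients and conditional sds, look for negative
# eigenvalues, and confirm the outputs do not move when a conditioning set is
# reordered.  toy_predictor is a stand-in for the trained network.
import numpy as np

rng = np.random.default_rng(0)
locs = rng.uniform(size=(50, 2))                          # 50 spatial locations
m = 5                                                     # conditioning-set size

def toy_predictor(x_i, X_cond):
    """Stand-in for the neural map (x_i, conditioning set) -> (b_i, d_i)."""
    pooled = (X_cond - x_i).sum(axis=0)                   # order-insensitive summary
    d_i = 0.5 + 1.0 / (1.0 + np.linalg.norm(pooled))      # strictly positive sd
    dists = np.linalg.norm(X_cond - x_i, axis=1)
    b_i = np.exp(-dists) / (1.0 + np.exp(-dists).sum())   # one weight per neighbour
    return b_i, d_i

n = len(locs)
B, d = np.zeros((n, n)), np.ones(n)
for i in range(1, n):
    cond = np.argsort(np.linalg.norm(locs[:i] - locs[i], axis=1))[:m]
    b_i, d[i] = toy_predictor(locs[i], locs[cond])
    B[i, cond] = b_i
L = np.linalg.solve(np.eye(n) - B, np.diag(d))
Sigma = L @ L.T
print("min eigenvalue:", np.linalg.eigvalsh(Sigma).min())   # should be > 0

# Shuffling a conditioning set must leave (b_i, d_i) pointwise unchanged.
cond = np.arange(m)
b1, d1 = toy_predictor(locs[10], locs[cond])
perm = rng.permutation(m)
b2, d2 = toy_predictor(locs[10], locs[cond[perm]])
assert np.isclose(d1, d2) and np.allclose(b1[perm], b2)     # invariant sd, equivariant weights
```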

Figures

Figures reproduced from arXiv: 2605.05523 by Jian Cao, Nian Liu, Ying Lin.

Figure 1
Figure 1. Illustration of the architecture of W_σ. d_A and d_L denote the dimensions of the augmented input space and the latent feature space, respectively, while Σ represents summation over the rows of the input matrix (i.e., aggregation over the conditioning set). Throughout this paper, we focus on the basic augmentation strategy described in (2), under which d_A = 2d. The choice of latent dimension d_L depends on the scale and complexity of the training data; … view at source ↗
Figure 2
Figure 2. Illustration of the proposed universal architecture for permutation-preserving functions: per-element features φ are aggregated by summation and passed through ρ. view at source ↗
Figure 3
Figure 3. Predictive MSE using the proposed NeuVec kernel and the benchmarks. view at source ↗
Figure 4
Figure 4. Predictive NLL using the proposed NeuVec kernel and the benchmarks. view at source ↗
original abstract

We introduce a novel framework for constructing scalable and flexible covariance kernels for Gaussian processes (GPs) by directly learning the covariance structure under a regression-type parameterization induced by Vecchia approximations, using deep neural architectures. Specifically, we model kriging coefficients and conditional standard deviations, deterministic quantities that uniquely characterize the covariance, providing stable and informative learning targets. Exploiting the permutation-equivariant structure of conditioning sets in the Vecchia factorization, we derive a universal representation for permutation-preserving functions and design neural architectures that respect this symmetry, leading to improved training stability and data efficiency. The proposed approach enables expressive, non-stationary kernel learning while maintaining computational scalability, thereby bridging classical GP methodology with modern deep learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a framework for constructing scalable covariance kernels for Gaussian processes by learning kriging coefficients and conditional standard deviations in a Vecchia factorization using deep neural networks. It derives a universal representation for permutation-preserving functions based on the equivariant structure of conditioning sets and designs neural architectures that respect this symmetry to enable expressive non-stationary kernel learning with improved stability and data efficiency.

Significance. If the central derivation holds and produces ordering-independent kernels, the work would meaningfully bridge classical GP approximations with modern neural architectures, offering a principled way to learn flexible, scalable non-stationary covariances while preserving positive-definiteness and computational advantages of Vecchia methods.

major comments (2)
  1. The manuscript's core claim that the permutation-preserving neural representation yields a well-defined covariance kernel (symmetric and independent of point ordering) is load-bearing but insufficiently demonstrated. While Vecchia factorization produces a PD matrix for any fixed ordering and positive conditional variances, the paper must explicitly show (via the universal representation) that different orderings of the same point set produce identical joint covariances; without this, the output is not a proper kernel but an ordering-dependent approximation.
  2. The weakest assumption—that the learned kriging coefficients and conditional standard deviations automatically guarantee a consistent kernel for arbitrary configurations—requires additional verification. The architecture's equivariance on conditioning sets does not automatically imply invariance of the full covariance matrix under global permutations unless proven or empirically tested for non-stationary cases.
minor comments (2)
  1. Clarify the precise definition of 'permutation-preserving' versus 'permutation-equivariant' in the theoretical section, as the distinction affects whether the kernel is ordering-independent.
  2. Include a small-scale sanity check (e.g., 5-10 points) showing that likelihood and predictions remain unchanged under random reordering of inputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results.

point-by-point responses
  1. Referee: The manuscript's core claim that the permutation-preserving neural representation yields a well-defined covariance kernel (symmetric and independent of point ordering) is load-bearing but insufficiently demonstrated. While Vecchia factorization produces a PD matrix for any fixed ordering and positive conditional variances, the paper must explicitly show (via the universal representation) that different orderings of the same point set produce identical joint covariances; without this, the output is not a proper kernel but an ordering-dependent approximation.

    Authors: We agree that an explicit demonstration is necessary to fully substantiate the claim. Our derivation of the universal representation for permutation-preserving functions in Section 3 is intended to ensure that the neural network outputs kriging coefficients and conditional standard deviations that are independent of the specific ordering, thereby yielding the same joint covariance for any permutation of the points. This follows because the Vecchia factorization reconstructs the joint density as a product of conditionals, and the symmetry-preserving architecture guarantees consistent parameter values across orderings. To address this concern directly, we will add a dedicated proposition in the revised manuscript that formally proves the invariance of the resulting covariance matrix under permutations, leveraging the universal representation theorem. revision: yes

  2. Referee: The weakest assumption—that the learned kriging coefficients and conditional standard deviations automatically guarantee a consistent kernel for arbitrary configurations—requires additional verification. The architecture's equivariance on conditioning sets does not automatically imply invariance of the full covariance matrix under global permutations unless proven or empirically tested for non-stationary cases.

    Authors: We acknowledge that while the equivariant architecture provides the necessary symmetry, an explicit verification for the full covariance matrix is valuable, particularly for non-stationary kernels. In addition to the theoretical proof mentioned above, we will include numerical experiments in the revised version that compute the covariance matrices for multiple random orderings of the same set of points and demonstrate that they are identical (up to numerical precision) even when the underlying process is non-stationary. This will provide empirical confirmation alongside the theoretical guarantee. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation of permutation-preserving representation is independent of fitted outputs

full rationale

The paper claims to derive a universal representation for permutation-preserving functions from the permutation-equivariant structure of Vecchia conditioning sets, then uses it to design equivariant neural architectures for modeling kriging coefficients and conditional standard deviations. These quantities are stated to uniquely characterize the covariance by construction of the Vecchia factorization itself. No equations or steps are shown that fit a parameter on data and then rename the fitted value as a 'prediction' of a related quantity. No self-citations are invoked as load-bearing uniqueness theorems. The central claim (that the resulting model yields a well-defined, ordering-independent kernel) rests on the mathematical properties of Vecchia approximations plus the equivariance of the NN, which are external to the fitting procedure and not reduced to the inputs by definition. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5411 in / 1058 out tokens · 51674 ms · 2026-05-08T15:32:51.851380+00:00 · methodology

discussion (0)

