Three Costs of Amortizing Gaussian Process Inference with Neural Processes

Robin Young

arxiv: 2605.21798 · v1 · pith:BHDQ7INSnew · submitted 2026-05-20 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-22 08:58 UTC · model grok-4.3

keywords: neural processes, gaussian processes, KL divergence, amortization error, information bottleneck, label contamination, kernel smoothness

The pith

The KL divergence between Gaussian process and latent neural process predictions decomposes into three interpretable costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper bounds the Kullback-Leibler divergence between the predictive distributions of Gaussian processes and a class of latent neural processes that amortize inference. It breaks this divergence down into label contamination from using labels to predict label-independent quantities, an information bottleneck due to finite-dimensional representations, and amortization error from a shared encoder network. The bottleneck term is shown to decay exponentially in the representation dimension for squared-exponential kernels and polynomially for Matérn kernels, directly tying network architecture to kernel properties. This decomposition explains the approximation costs of using neural networks for faster inference in place of exact GP methods.
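As a point of reference for the quantity being bounded, the sketch below evaluates the closed-form KL divergence between two univariate Gaussian predictives at a single target point. The function name `kl_gaussian` and the direction KL(p || q), with p standing in for the GP predictive and q for the LNP predictive, are assumptions made for illustration; the summary above does not state which direction the paper uses.

```python
import math

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for scalar Gaussians."""
    return 0.5 * (
        math.log(var_q / var_p)                   # mismatch in log-variance
        + (var_p + (mu_p - mu_q) ** 2) / var_q    # variance ratio plus squared mean error
        - 1.0
    )

# Identical predictives give zero divergence.
print(kl_gaussian(0.0, 1.0, 0.0, 1.0))
# A pure mean error of 1 at unit variance.
print(kl_gaussian(0.0, 1.0, 1.0, 1.0))
# A pure variance error (the approximation is overdispersed).
print(kl_gaussian(0.0, 1.0, 0.0, 2.0))
```

The formula makes visible that mean errors and variance errors enter the divergence through separate terms, which is why a cost that only affects the variance path can be isolated from the others.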

Core claim

For latent neural processes, the KL divergence to the GP predictive decomposes into three terms: label contamination, which remains O(1) in general; an information bottleneck whose truncation term decays as O(e^{-c d^{2/d_x}}) for squared-exponential kernels and O(d^{-2ν/d_x}) for Matérn-ν kernels; and amortization error. These results identify persistent costs and yield two recommendations: predict variance from context locations alone, and replace mean aggregation with second-order pooling.
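To get a feel for how the two stated rates behave as the input dimension d_x grows, the following sketch simply evaluates the rate expressions with all hidden constants set to 1. The function names and the choice c = 1 are placeholders, not values from the paper, so only the relative trends are meaningful.

```python
import math

def se_bottleneck_rate(d, d_x, c=1.0):
    """Shape of the squared-exponential rate exp(-c * d^(2/d_x)); c is a placeholder."""
    return math.exp(-c * d ** (2.0 / d_x))

def matern_bottleneck_rate(d, d_x, nu):
    """Shape of the Matern-nu rate d^(-2*nu/d_x), leading constant omitted."""
    return d ** (-2.0 * nu / d_x)

for d_x in (1, 2, 4):
    for d in (16, 64, 256):
        se = se_bottleneck_rate(d, d_x)
        mat = matern_bottleneck_rate(d, d_x, nu=2.5)
        print(f"d_x={d_x} d={d:4d}  SE~{se:.3e}  Matern-2.5~{mat:.3e}")
```

Both expressions depend on d only through a power with exponent proportional to 1/d_x, so the same representation dimension buys less accuracy as the input dimension increases.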

What carries the argument

The three-term decomposition of the KL divergence between GP and LNP predictives, with explicit rates for the information bottleneck term.

If this is right

The information bottleneck decays exponentially with representation dimension for squared-exponential kernels on R^{d_x}.
Label contamination is O(1) in general; only its observation-noise component decays, as O(1/n).
Predicting variance from context locations alone avoids label contamination.
Second-order pooling can reduce the amortization error compared to mean aggregation.
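The label-independence that motivates the location-only variance recommendation can be checked directly on an exact GP. The sketch below is a standard Cholesky-based GP regression with a unit-variance squared-exponential kernel and fixed hyperparameters; the kernel settings, noise level, and data are arbitrary choices for illustration. Two different label vectors at the same context locations give different predictive means but the same predictive variance.

```python
import numpy as np

def se_kernel(a, b, lengthscale=0.5):
    """Unit-variance squared-exponential kernel on 1-D inputs."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(x_ctx, y_ctx, x_tgt, noise_var=0.01):
    """Exact GP posterior mean and latent variance at x_tgt (fixed hyperparameters)."""
    K = se_kernel(x_ctx, x_ctx) + noise_var * np.eye(len(x_ctx))
    K_s = se_kernel(x_ctx, x_tgt)
    L = np.linalg.cholesky(K)                      # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_ctx))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha                           # depends on the labels
    var = 1.0 - np.sum(v**2, axis=0)               # depends on locations only
    return mean, var

rng = np.random.default_rng(0)
x_ctx = rng.uniform(-2.0, 2.0, size=10)
x_tgt = np.linspace(-2.0, 2.0, 5)
y_a = np.sin(x_ctx)             # one set of labels
y_b = rng.normal(size=10)       # unrelated labels at the same locations

mean_a, var_a = gp_posterior(x_ctx, y_a, x_tgt)
mean_b, var_b = gp_posterior(x_ctx, y_b, x_tgt)
print("variances equal:", np.allclose(var_a, var_b))
print("means equal:", np.allclose(mean_a, mean_b))
```

Note that this holds for fixed kernel hyperparameters; if hyperparameters were fit to the labels, the variance would depend on them indirectly.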

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These bounds may inform how to scale representation dimensions for high input dimensions to achieve better approximations.
Similar decomposition approaches could apply to other amortized inference methods beyond neural processes.
Empirical tests could verify if the predicted decay rates match observed improvements in predictive accuracy.

Load-bearing premise

The bounds and decomposition hold for the specific class of latent neural processes that use a single finite-dimensional representation from the encoder for both mean and variance predictions.

What would settle it

Compute the KL divergence numerically for a fixed GP and varying representation dimensions d in a latent neural process, and check whether the observed decay matches O(e^{-c d^{2/d_x}}) for a squared-exponential kernel.
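One way such a check could be set up is a semi-log regression: if the measured bottleneck term behaves like A·e^{-c d^{2/d_x}}, then its logarithm is linear in d^{2/d_x}. The sketch below shows only the fitting step. The `errors` array is synthetic, generated from the assumed functional form so that the fit can be verified; it is not data from the paper, and `fit_se_decay` is a hypothetical helper name.

```python
import numpy as np

def fit_se_decay(dims, errors, d_x):
    """Least-squares fit of log(error) = log(A) - c * d**(2/d_x); returns (c, A)."""
    t = np.asarray(dims, dtype=float) ** (2.0 / d_x)
    slope, intercept = np.polyfit(t, np.log(errors), 1)
    return -slope, np.exp(intercept)

# Synthetic placeholder measurements drawn from the assumed form.
d_x = 2
dims = np.array([2, 4, 8, 16, 32])
true_c, true_A = 0.3, 5.0
errors = true_A * np.exp(-true_c * dims ** (2.0 / d_x))

c_hat, A_hat = fit_se_decay(dims, errors, d_x)
print(f"recovered c = {c_hat:.4f}, A = {A_hat:.4f}")
```

With real measurements, systematic curvature in the residuals of this fit would suggest that the observed decay does not follow the predicted form over the tested range of d.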

Figures

Figures reproduced from arXiv: 2605.21798 by Robin Young.

**Figure 1.** Forward-pass schematic of a latent neural process with the three error sources annotated. The dashed box encloses the …

**Figure 2.** Empirical verification of Theorem 1. The Full protocol (red) plateaus at an O(1) floor across two orders of magnitude in n. The Noise-only protocol (blue) decays as 1/n on the early-n side, consistent with the σ_ε²/n term, before saturating at the latent-z Monte Carlo floor of the variance estimator. Fitting power laws Var ∝ n^α on the early-n range n ∈ {5, 10, 20, 50, 100} where the noise contribution d…

Original abstract

Neural processes amortize Gaussian process inference, replacing the exact $O(n^3)$ posterior with a learned $O(n)$ map from context sets to predictive distributions. For a class of latent neural processes, we bound the Kullback--Leibler (KL) divergence between the GP and LNP predictives, decomposing it into three interpretable sources, namely label contamination as the neural process uses label values to estimate a quantity that is label-independent in the exact GP, an information bottleneck because the finite-dimensional representation cannot resolve the full context geometry, and amortization error from a single encoder network shared across all contexts. The bottleneck truncation term decays in the representation dimension $d$ as $O(e^{-cd^{2/d_x}})$ for squared-exponential kernels on $\mathbb{R}^{d_x}$ where $c > 0$ is a kernel-dependent constant and as $O(d^{-2\nu/d_x})$ for Mat\'ern-$\nu$ kernels, directly linking architecture sizing to kernel smoothness and input dimension. The label contamination term is $O(1)$ in general, with only the observation-noise component decaying as $O(1/n)$, identifying a persistent cost of routing uncertainty estimation through a label-dependent representation. These results characterize the costs of amortization within the analyzed class and yield architectural recommendations to predict variance from context locations alone in the GP-amortization regime, and replace mean aggregation with second-order pooling to close the dominant amortization gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper decomposes the KL between GP and LNP predictives into label contamination, information bottleneck, and amortization error, with explicit kernel-dependent decay rates for the bottleneck.

The main contribution is a three-way split of the KL divergence between exact GP predictives and those of a latent neural process, plus concrete decay rates for the bottleneck term that depend on kernel smoothness and input dimension. The rates are O(e^{-c d^{2/d_x}}) for squared-exponential kernels and O(d^{-2ν/d_x}) for Matérn-ν kernels, both of which follow from standard Mercer eigenvalue decay. This connects representation size to kernel properties more explicitly than prior neural process work did. The paper also isolates label contamination as an O(1) term in general, with only the noise part shrinking as O(1/n), and uses that to recommend predicting variance from locations alone and replacing mean aggregation with second-order pooling. Those suggestions follow naturally once the terms are separated.

The derivations appear to stay within the stated class of LNPs that route a finite representation to both mean and variance heads, so the analysis holds inside that scope. The main soft spot is exactly that scope: if someone uses a label-independent variance head or a different encoder structure, the contamination term shifts and the bounds need re-derivation. The constants in the rates are not discussed in detail, so it is unclear how large the representation dimension needs to be in practice before the bottleneck becomes negligible. No circularity shows up because the rates come from the kernel spectrum rather than from the training data itself.

This is aimed at people building amortized GP or neural process models who want quantitative rules for architecture choices rather than purely empirical tuning. A reader who cares about formal error analysis in probabilistic ML would get something usable from the decomposition. The formal content is sharp enough to deserve a serious referee even if the proofs need tightening in review.

Referee Report

0 major / 4 minor

Summary. The paper claims that for a specific class of latent neural processes (LNPs) whose encoder produces a finite-dimensional representation used for both mean and variance prediction, the KL divergence between GP and LNP predictives can be bounded and decomposed into three terms: label contamination (from routing label-dependent information through the encoder for a label-independent GP quantity), information bottleneck (finite representation cannot capture full context geometry), and amortization error (shared encoder across contexts). Explicit rates are given for the bottleneck truncation: O(e^{-c d^{2/d_x}}) for squared-exponential kernels and O(d^{-2ν/d_x}) for Matérn-ν kernels. Label contamination is O(1) in general (with observation-noise part O(1/n)), yielding architectural recommendations such as location-only variance heads and second-order pooling.

Significance. If the derivations hold, the work is significant for providing the first explicit, interpretable decomposition of amortization costs when replacing exact GP inference with neural processes. The rates tie representation dimension directly to kernel smoothness and input dimension via standard Mercer eigenvalue decay, offering concrete architecture-sizing guidance. Identification of a persistent O(1) label-contamination term explains a fundamental limitation and motivates the suggested fixes. This supplies theoretical grounding in an area dominated by empirical results and could influence design of future amortized probabilistic models.

minor comments (4)

[Introduction / §2] The precise definition of the analyzed LNP class (finite-dimensional representation for both mean and variance) should be stated with a diagram or pseudocode in the introduction or §2 to make the scope unambiguous for readers.
Add explicit citations to the original Neural Processes paper (Garnelo et al.) and to standard references on Mercer eigenvalue decay rates for squared-exponential and Matérn kernels.
[§2] Notation for context/target sets, encoder, and representation dimension d versus input dimension d_x should be introduced consistently and early; current usage risks confusion with GP literature conventions.
[Conclusion] The discussion of architectural recommendations (location-only variance head, second-order pooling) would benefit from a short table summarizing which term each change targets and the expected improvement.
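To make the pooling recommendation concrete, here is a minimal sketch contrasting mean aggregation with one possible form of second-order pooling: the average of per-point feature outer products, concatenated with the mean. This particular definition, and the names `mean_pool` and `second_order_pool`, are assumptions for illustration; the summary does not specify the exact operator the paper proposes. Both aggregators are permutation-invariant, but only the second distinguishes context sets that share a mean and differ in spread.

```python
import numpy as np

def mean_pool(features):
    """Mean aggregation over a context set; features has shape (n, h)."""
    return features.mean(axis=0)

def second_order_pool(features):
    """Mean plus averaged outer products (upper triangle), one assumed variant."""
    n, h = features.shape
    second = features.T @ features / n          # (h, h) second-moment matrix
    iu = np.triu_indices(h)                     # symmetric, so keep upper triangle
    return np.concatenate([features.mean(axis=0), second[iu]])

# Two context sets with the same mean but different spread.
a = np.array([[1.0], [-1.0]])
b = np.array([[2.0], [-2.0]])
print("mean pooling:", mean_pool(a), mean_pool(b))
print("second-order pooling:", second_order_pool(a), second_order_pool(b))
```

The output dimension grows from h to h + h(h+1)/2, so the extra information comes at the price of a larger aggregate representation.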

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and the recommendation for minor revision. The referee summary accurately captures the paper's contributions, including the decomposition of the KL divergence into label contamination, information bottleneck, and amortization error, along with the explicit decay rates for different kernels. We appreciate the recognition of the architectural implications, such as location-only variance prediction and second-order pooling.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper derives an explicit KL divergence bound between GP and LNP predictives by decomposing it into label contamination, information bottleneck, and amortization error terms. The bottleneck decay rates O(e^{-c d^{2/d_x}}) for squared-exponential kernels and O(d^{-2ν/d_x}) for Matérn kernels follow from standard Mercer eigenvalue decay results that are independent of the present work. The O(1) label-contamination term is isolated by tracing the label-dependent encoder path, and the architectural recommendations follow directly as consequences of which term dominates. No fitted parameters are relabeled as predictions, no load-bearing step reduces to a self-citation chain, and the analysis is scoped to a precisely defined class of latent neural processes whose finite-dimensional representation is used for both mean and variance. The derivation is therefore self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Analysis rests on standard properties of Gaussian processes, KL divergence, and kernel smoothness; no new free parameters or invented entities are introduced in the abstract. The finite-dimensional representation and label-dependent encoder are domain assumptions for the LNP class studied.

axioms (2)

domain assumption: The neural process belongs to the latent class whose encoder maps context sets to a finite-dimensional representation used for both mean and variance.
Explicitly stated as the setting for which the three-cost decomposition holds.
domain assumption: Kernels are squared-exponential or Matérn-ν on R^{d_x}.
Required for the stated decay rates of the bottleneck term.
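The dependence of the rates on the kernel family can be illustrated numerically through the eigenvalue spectrum of a Gram matrix, used here as a finite-sample stand-in for the Mercer spectrum. The sketch compares a squared-exponential kernel with a Matérn-1/2 (exponential) kernel on a 1-D grid and reports the fraction of total spectral mass left after the top d eigenvalues. The grid, lengthscale, and the use of this tail fraction as a proxy for the bottleneck term are illustrative assumptions, not the paper's construction.

```python
import numpy as np

lengthscale = 0.2
x = np.linspace(0.0, 1.0, 200)
r = np.abs(x[:, None] - x[None, :])            # pairwise distances

K_se = np.exp(-0.5 * (r / lengthscale) ** 2)   # squared-exponential
K_m12 = np.exp(-r / lengthscale)               # Matern with nu = 1/2

def tail_fraction(K, d):
    """Share of the total eigenvalue mass outside the top d eigenvalues."""
    eig = np.clip(np.linalg.eigvalsh(K), 0.0, None)[::-1]  # descending, round-off clipped
    return eig[d:].sum() / eig.sum()

for d in (5, 10, 20):
    print(f"d={d:2d}  SE tail={tail_fraction(K_se, d):.3e}  "
          f"Matern-1/2 tail={tail_fraction(K_m12, d):.3e}")
```

The smooth kernel's tail collapses quickly while the rough kernel's tail shrinks slowly, which is the qualitative contrast between the exponential and polynomial rates quoted above.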

