An Analysis of Posterior Collapse, Parameterization and Initialization in Variational Deep Gaussian Processes

Daniel Hernández-Lobato; Francisco Javier Sáez-Maldonado; Juan Maroñas

arxiv: 2606.25882 · v1 · pith:VWVIV6RLnew · submitted 2026-06-24 · 💻 cs.LG

Pith reviewed 2026-06-25 20:11 UTC · model grok-4.3

classification 💻 cs.LG

keywords: variational inference · deep Gaussian processes · posterior collapse · initialization · parameterization · linear prior mean · DSVI

The pith

The linear prior mean in variational DGPs improves the conditioning of the optimization problem at initialization; its benefit does not come from avoiding the non-injective pathology.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies posterior collapse in variational deep Gaussian processes and traces it to the interaction between the DSVI algorithm and the linear prior mean function used in all but the final layer. It shows that this linear mean helps by making the optimization landscape better conditioned right at the start of training, not by preventing some deep-model pathology as had been thought. The authors introduce a zero prior mean initialization that matches the linear-mean case at the first step, allowing the prior to be chosen for modeling reasons instead of optimization constraints. The analysis covers three common DGP parameterizations and explains the stability advantage of the whitened one.

Core claim

The benefit of the linear prior mean function does not arise from avoiding the non-injective pathology in very deep DGPs, as previously believed, but from improving the conditioning of the optimization problem at initialization. An alternative zero prior mean initialization that mimics a linear prior mean DGP at initialization enables successful training of DGPs without imposing optimization-driven constraints on the prior, and this initialization prevents posterior collapse while achieving performance comparable to or better than the linear-mean version across the studied parameterizations.
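To make the phrase "mimics a linear prior mean DGP at initialization" concrete, here is a minimal NumPy sketch under stated assumptions. It is not taken from the paper and may differ from the authors' procedure: the RBF kernel, the helper names (`rbf`, `linear_like_init`, `predictive_mean`), the jitter value, and the choice of setting the variational mean to `Z @ W` are all illustrative. The sketch only shows that a zero-mean sparse GP layer whose variational mean is a linear map of the inducing inputs reproduces that map at the inducing locations.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def linear_like_init(Z, W, jitter=1e-6):
    """Hypothetical init: pick the mean of q(u) so a ZERO-mean layer outputs ~ Z @ W at Z."""
    Kzz = rbf(Z, Z) + jitter * np.eye(Z.shape[0])
    L = np.linalg.cholesky(Kzz)      # Kzz = L @ L.T
    m = Z @ W                        # non-whitened variational mean
    m_v = np.linalg.solve(L, m)      # whitened variational mean, defined by m = L @ m_v
    return Kzz, L, m, m_v

def predictive_mean(X, Z, Kzz, m):
    """Sparse-GP predictive mean under a zero prior mean: K_xz Kzz^{-1} m."""
    return rbf(X, Z) @ np.linalg.solve(Kzz, m)

Z = np.linspace(-2.0, 2.0, 5)[:, None]   # 5 inducing inputs in 1-D (assumption)
W = np.eye(1)                            # identity weight standing in for the linear mean
Kzz, L, m, m_v = linear_like_init(Z, W)
mean_at_Z = predictive_mean(Z, Z, Kzz, m)  # close to Z @ W, up to the jitter
```

Far from the inducing inputs `rbf(X, Z)` decays to zero, so the predictive mean of this sketch reverts to the zero prior mean; the imitation of a linear mean only holds where inducing points are placed.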

What carries the argument

The zero prior mean initialization strategy that matches linear-mean conditioning at the first training step, together with the analysis linking DSVI, linear prior means, and posterior collapse across three parameterizations.

If this is right

DGPs can be trained without forcing the prior to satisfy optimization convenience.
Whitened parameterizations yield more stable convergence and reduce posterior collapse risk.
Not every DGP parameterization benefits equally from a linear prior mean.
The proposed initialization yields performance comparable to or better than the linear-mean baseline.
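The difference between the whitened and non-whitened parameterizations can be illustrated with the standard Gaussian KL term of a sparse variational GP. The sketch below is an assumption-laden illustration, not the paper's analysis: the function name `kl_to_zero_mean_gaussian`, the toy RBF kernel matrix, and the identity initialization are chosen here for demonstration. It uses the textbook relation u = L v with Kzz = L Lᵀ, so that the whitened prior is N(0, I) and the non-whitened prior is N(0, Kzz).

```python
import numpy as np

def kl_to_zero_mean_gaussian(m, S, K):
    """KL( N(m, S) || N(0, K) ) for an M-dimensional Gaussian (m is a 1-D vector)."""
    M = m.shape[0]
    trace_term = np.trace(np.linalg.solve(K, S))
    mahalanobis = m @ np.linalg.solve(K, m)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (trace_term + mahalanobis - M + logdet_K - logdet_S)

# Toy inducing inputs and RBF kernel matrix (illustrative values only).
Z = np.linspace(-2.0, 2.0, 5)
Kzz = np.exp(-0.5 * (Z[:, None] - Z[None, :]) ** 2) + 1e-6 * np.eye(5)
L = np.linalg.cholesky(Kzz)
I = np.eye(5)
zero = np.zeros(5)

# Identity covariance and zero mean in each parameterization.
kl_whitened_init = kl_to_zero_mean_gaussian(zero, I, I)       # q(v) = p(v) = N(0, I)
kl_nonwhitened_init = kl_to_zero_mean_gaussian(zero, I, Kzz)  # q(u) = N(0, I), p(u) = N(0, Kzz)

# The same distribution written in both parameterizations has the same KL.
m_v, S_v = np.full(5, 0.3), 0.5 * I
kl_v = kl_to_zero_mean_gaussian(m_v, S_v, I)
kl_u = kl_to_zero_mean_gaussian(L @ m_v, L @ S_v @ L.T, Kzz)
```

With a zero mean and identity covariance, the whitened KL is exactly zero because q coincides with the prior, whereas the non-whitened KL for the same nominal initialization depends on the kernel matrix. The two parameterizations describe the same family of distributions, so any difference in training behaviour comes from the coordinates in which the optimizer moves, not from the objective value.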

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar initialization-focused fixes may apply to posterior collapse in other variational deep models.
The relative importance of the first training step could be tested by varying step size or optimizer in controlled ablations.
Allowing priors to be set purely by modeling assumptions may change how practitioners choose mean functions in other Gaussian process models.

Load-bearing premise

The optimization dynamics at the first training step dominate the entire training trajectory for the three parameterizations.

What would settle it

A controlled experiment showing that zero-mean initialized DGPs still exhibit posterior collapse even when their initial conditioning matches that of linear-mean DGPs.

Figures

Figures reproduced from arXiv: 2606.25882 by Daniel Hernández-Lobato, Francisco Javier Sáez-Maldonado, Juan Maroñas.

**Figure 2.** From top to bottom, we display the predictive distribution of two layers …

**Figure 3.** Same predictive distributions as the ones displayed in Fig. 2, but when the …

**Figure 4.** ZERO (top) and PCA (bottom) DGP models with the output layer variational covariance initialized to $S = I$, in GPFLOW’s unwhitened parameterization. … from our actual lack of knowledge about the inducing point function evaluations. Thus, despite uncertainty arising from the inducing points (through $S$) and from the prior (through $K_{XX}$), the information provided by the inducing point locations together with t…

**Figure 5.** ZERO (top) and PCA (bottom) DGP models with the output layer variational covariance initialized to $S_v = I$, in the whitened parameterization.

**Figure 6.** Initial predictive distributions of a ZERO prior mean and a PCA prior mean DGP with 7 and 20 inducing points using the GPFLOW non-whitened parameterization. … the predictive mean will be $\mu_{q_f} = 0$, as indicated by Eq. (37) …

**Figure 7.** Predictive mean of the inner layer GP of a DGP with the standard non-whitened parameterization for different numbers of inducing points and the PCA prior mean function. … the number of inducing points increases, the predictive mean becomes nearly constant and equal to zero. It is only equal to $X$ when we are far away from the inducing points. This indicates that under this parameterization, a PCA prior mean D…

**Figure 8.** Each row shows coordinate updates, different initializations, and update order …

**Figure 9.** Training curves reporting the ELBO, the ELL term, and the KLD term of a PCA prior mean DGP on the toy problem, for both the whitened and the non-whitened parameterization of GPFLOW. Figures show that the non-whitened parameterization is more unstable. … the KLD at initialization reveals opposite observations for other values of $\alpha$. In particular, $\mathrm{KLD}_{\text{non-whitened}} = \tfrac{1}{2}\,\ldots$

**Figure 10.** Shows the predictive distribution of a DGP with 2 layers with the ZERO prior mean. The whitened parameterization is used. In all cases, the variational covariances are initialized at $10^{-5} I$. We consider three scenarios: (a) when the variational mean is initialized to zero; (b) when the proposed initialization is used for the variational mean at each layer, but the inducing points are randomly chosen from …

**Figure 11.** Initial predictive distributions of the ZERO-Points-MY-W model using different initial length-scale $\ell$ values.

**Figure 12.** Predictive distribution (mean and standard deviation) of …

**Figure 13.** Predictive distribution (mean and standard deviation) of …

**Figure 14.** Predictive distribution of both ZERO-W prior mean DGP (top 3 rows) and the PCA-W prior mean DGP (bottom 3 rows) initialized with $S_v^{l} = I$ in the inner layers and $S_v^{L} = 10^{-5} I$ in the output layer. Blue points show training data. Using a high variance on the inner layers leads to a noisy optimization which yields poor solutions.

**Figure 15.** (top-row) Predictive distribution, layer-wise …

**Figure 16.** (left) Initial predictive distributions of all the models that have …

**Figure 17.** Predictive distributions of all trained models with …

**Figure 18.** Test log-likelihood (right is better) in all …

**Figure 19.** KLD (left) and likelihood variance (right) obtained by the ZERO-W and ZERO-Points-M0-W models in the Kin8nm and Yacht datasets, for each train/test split. No mode collapse is shown in Kin8nm. Seven out of 20 splits suffer from posterior collapse in Yacht.

**Figure 20.** KLD and ELL obtained by the four models during training on the Yacht dataset, when using 5 layers, for a representative train/test split. The posterior collapse of the ZERO-W model is shown by the KLD, which becomes zero, and induces a poor ELL term in the ELBO.

**Figure 21.** KLD (left) and likelihood variance (right) obtained by the ZERO-NWR and ZERO-Points-M0-NWR models in the Boston, Power, and Yacht (with 4 layer model) datasets, for each train/test split.

**Figure 22.** KLD and RMSE obtained by the two-layer ZERO-Points-MY-W model when varying the length-scale $\ell$ at initialization in the steps dataset.

**Figure 23.** ZERO (top) and PCA (bottom) DGP models with the output layer variational covariance initialized to $S_v = 1.3 I$ and $S_v = 0.8 I$, in the whitened parameterization.

**Figure 24.** ZERO and PCA mean DGPs with the variational covariance $S = I$ for the inner layer and $S = 10^{-5} I$ in the output layer, varying the number of inducing points in the non-whitened parameterization.

**Figure 25.** ZERO and PCA mean DGPs with 10 and 100 inducing points using the GPFLOW non-whitened parameterization, with $S = I$ in the output layer and $S = 10^{-5} I$ in the inner layers.

**Figure 26.** ZERO (top) and PCA (bottom) DGP models with the output layer variational covariance initialized to break prior information in GPFLOW’s unwhitened parameterization. A horizontal line at 1 shows how variance far from the inducing point goes beyond the prior.

**Figure 27.** Coordinate updates experiment with 64 inducing points.

**Figure 28.** Predictive distributions of the ZERO-W and PCA-W DGP models when initialized with $S_v = I$ in all layers.

**Figure 29.** Layer-wise KLD during training of both ZERO-W and PCA-W models initialized with $S_v^{l} = I$ in the inner layers and $S_v^{L} = 10^{-5} I$ in the output layer. In some cases, the ZERO-W model can not escape from the local optimum of zero KLD in some layers, especially with a big number of inducing points. With a lower number of inducing points, it is more likely that the ZERO-W escapes from the posterior collapse, …

**Figure 30.** KLD and likelihood variance during training of the 5-layer whitened models in the toy dataset using $\lambda = 10^{-2}$.

**Figure 31.** Layer-wise KLD in the last 10 epochs of training and likelihood variance value during training of the 5-layer PCA-NWR trained using $\lambda = 10^{-2}$.

**Figure 32.** Log KLD during the first 10 epochs in the toy dataset. Non-visible curves overlap with visible ones. This figure shows how the NWR model with $\lambda = 10^{-2}$ results in a very unstable optimization, manifested through the peaks in the learning curve.

**Figure 33.** Test RMSE (left is better) in all UCI datasets.

**Figure 34.** Comparison of parameterization across different models and parameter initializa…

**Figure 35.** Learning curves of both parameterizations, for each split, for the …

**Figure 36.** Final KLD and likelihood variance parameter across several datasets, model deepness, and parameterization. The left column represents the curves for models in which the difference between the PCA and ZERO model performance is the highest across all the splits. The right column represents the same, but when the model performance is the lowest.

Original abstract

DGPs are probabilistic models with remarkable prediction performance that concatenate GPs across several layers. Exact inference in DGPs is intractable, and variational inference is often used to approximate the posterior with a parametric distribution tuned by minimizing the Kullback-Leibler divergence. Moreover, finding a good VI approximation is challenging. In particular, a problem of VI is posterior collapse, where VI converges to a variational posterior that matches the prior. In variational DGPs, this implies explaining the data as noise. This work studies posterior collapse in DGPs and identifies its connection to the DSVI algorithm and the widely used linear prior mean function employed in all but the last layer. We show that the benefit of the linear prior mean does not arise from avoiding the non-injective pathology in very deep DGPs, as previously believed, but from improving the conditioning of the optimization problem at initialization. Thus, we propose an alternative initialization of a zero prior mean DGP that mimics a DGP with a linear prior mean at initialization. This enables successful training of DGPs without imposing optimization-driven constraints on the prior, allowing to choose the prior based on modeling assumptions rather than optimization convenience. Our analysis considers three common parameterizations of DGPs and shows that not all of them benefit from a linear prior mean. We also explain why a whitened parameterization of the \DGP provides more stable convergence, something often assumed from experience, but lacking a rigorous analysis. Furthermore, we show that this stability is also beneficial to avoid the posterior collapse problem. Extensive experiments validate our findings: the proposed initialization prevents posterior collapse, improves stability, and achieves performance comparable to (and sometimes better than) DGPs with a linear prior mean.
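For orientation, the objective behind the ELL and KLD terms reported in the figure captions can be written in the standard form of the doubly stochastic variational bound for an $L$-layer DGP. The notation below is assumed here and is not copied from the paper: $f_n^{L}$ is the output-layer function value at input $n$, and $U^{l}$ are the inducing variables of layer $l$.

```latex
\mathcal{L}_{\mathrm{ELBO}}
  = \underbrace{\sum_{n=1}^{N} \mathbb{E}_{q(f_n^{L})}\!\left[\log p\!\left(y_n \mid f_n^{L}\right)\right]}_{\text{ELL}}
  \;-\; \underbrace{\sum_{l=1}^{L} \mathrm{KL}\!\left(q(U^{l}) \,\|\, p(U^{l})\right)}_{\text{KLD}}
```

If every $q(U^{l})$ equals its prior $p(U^{l})$, the KLD term is zero and the ELL term can then only be fitted through hyperparameters such as the likelihood noise variance. This is the regime the abstract describes as explaining the data as noise.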

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's new zero-mean initialization for DGPs matches linear-mean results by fixing initial conditioning, but the claim that this controls the full training path rests on an unverified assumption.

The core contribution is a zero-mean initialization that copies linear-mean behavior at step zero, letting people drop the linear prior mean without triggering posterior collapse in variational DGPs. The paper also supplies a clearer account of why that linear mean helped in the first place—conditioning of the DSVI objective rather than any non-injective pathology—and shows the effect differs across the three standard parameterizations. The stability argument for the whitened version is spelled out more explicitly than in prior work.

The analysis stays inside existing DSVI and DGP formulations, which keeps the claims grounded. Experiments are reported to confirm that the new init prevents collapse and reaches comparable accuracy, sometimes better.

The weakest part is the leap from “matches at initialization” to “controls the entire trajectory.” Nothing in the abstract or stress-test note shows that variational parameters or inducing posteriors stay aligned after the first gradient step; if the zero-mean model leaves the well-conditioned region later, the substitution does not hold. That assumption is load-bearing and not directly tested. The soundness score in the reader’s note already flags the lack of methods and baseline detail, which makes it hard to judge how robust the performance numbers are.

This is useful reading for people who actually train deep GPs or similar variational models and want a concrete initialization fix plus parameterization guidance. It is not a broad theoretical advance, but the practical angle is clear enough that a serious editor should send it to referees rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The paper analyzes posterior collapse in variational deep Gaussian processes (DGPs) under three common parameterizations. It argues that the benefit of the linear prior mean function (used in all but the final layer) arises from improving the conditioning of the optimization problem at initialization rather than from avoiding non-injective pathologies in deep models. The authors propose an alternative zero-mean initialization that mimics the linear-mean DGP at step zero, claim this prevents collapse while allowing modeling-driven prior choice, provide an analysis of why the whitened parameterization yields more stable convergence, and report that experiments confirm comparable or better performance without the linear-mean constraint.

Significance. If the initialization equivalence holds across the full optimization trajectory, the work would allow DGPs to be trained with priors chosen for modeling reasons rather than optimization convenience, while supplying a rigorous account of whitened-DGP stability that is currently assumed from experience. The explicit comparison across standard, whitened, and other parameterizations is a clear strength; the derivation linking DSVI, linear means, and conditioning at initialization is also potentially useful if the trajectory-level claim is substantiated.

major comments (2)

[Abstract and initialization analysis sections] The central substitution of a zero-mean initialization for the linear prior mean rests on the unverified assumption that first-step conditioning benefits dominate the entire training trajectory for all three parameterizations. No derivation or experiment is shown establishing that the variational parameters or inducing-point posteriors remain aligned after the first gradient update; if later steps allow the zero-mean model to escape the well-conditioned basin, the performance equivalence fails.
[Experiments section] The claim that the proposed initialization prevents posterior collapse and achieves comparable performance is supported only by the statement that 'extensive experiments validate our findings.' Without reported details on data splits, quantitative metrics, baselines, or controls for post-hoc choices, it is impossible to assess whether the results actually establish trajectory-level equivalence rather than initial-value equivalence.

minor comments (1)

[Abstract] The abstract contains the notation '\DGP'; this should be rendered consistently as 'DGP' throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and have made revisions to strengthen the paper accordingly.

Referee: [Abstract and initialization analysis sections] The central substitution of a zero-mean initialization for the linear prior mean rests on the unverified assumption that first-step conditioning benefits dominate the entire training trajectory for all three parameterizations. No derivation or experiment is shown establishing that the variational parameters or inducing-point posteriors remain aligned after the first gradient update; if later steps allow the zero-mean model to escape the well-conditioned basin, the performance equivalence fails.

Authors: We acknowledge that our primary analysis focuses on the benefits at initialization. While the manuscript presents empirical evidence from extensive experiments showing that the proposed initialization prevents posterior collapse and achieves comparable performance, we agree that a more rigorous examination of the parameter trajectories would provide stronger support for the claim that the benefits persist throughout training. In the revised manuscript, we will include additional experiments that track the evolution of variational parameters and inducing point posteriors over the course of optimization for both initializations. revision: yes
Referee: [Experiments section] The claim that the proposed initialization prevents posterior collapse and achieves comparable performance is supported only by the statement that 'extensive experiments validate our findings.' Without reported details on data splits, quantitative metrics, baselines, or controls for post-hoc choices, it is impossible to assess whether the results actually establish trajectory-level equivalence rather than initial-value equivalence.

Authors: The experiments section of the manuscript does provide details on the datasets, metrics, and comparisons, but we recognize that the presentation may not have been sufficiently explicit or comprehensive. To address this, we will revise the experiments section to include more detailed descriptions of the experimental setup, including data splits, quantitative results with standard deviations, baseline comparisons, and controls to demonstrate that the performance equivalence holds beyond the initial step. revision: yes
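The trajectory-level check discussed in this exchange could be approximated by logging the per-layer KLD during training and looking for layers whose KLD falls to (near) zero. The following is a hypothetical helper, not part of the paper or of any library: the name `first_collapse_step`, the tolerance, and the toy history values are assumptions made for illustration.

```python
import numpy as np

def first_collapse_step(kl_history, tol=1e-2):
    """Per layer, return the first step at which KL < tol, or -1 if that never happens.

    kl_history: array-like of shape (num_steps, num_layers) with per-layer KL values.
    """
    kl_history = np.asarray(kl_history, dtype=float)
    below = kl_history < tol
    first = below.argmax(axis=0)          # index of the first True per column (0 if none)
    first[~below.any(axis=0)] = -1        # mark layers that never drop below tol
    return first

# Made-up history: layer 0 drifts to the prior, layer 1 keeps a non-trivial posterior.
history = [
    [5.0, 5.0],
    [2.0, 6.0],
    [0.5, 7.0],
    [0.001, 8.0],
]
steps = first_collapse_step(history)
```

A layer that is initialized exactly at its prior would be flagged at step 0 by this helper, so in practice the log would need to be read together with the likelihood variance and the ELL term before calling a run collapsed.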

Circularity Check

0 steps flagged

No circularity; analysis grounded in standard DSVI/DGP formulations and external experiments

The paper examines posterior collapse via connections to the existing DSVI algorithm and linear prior mean functions used in prior DGP literature. The proposed zero-mean initialization is motivated by mimicking behavior at step 0 and is validated empirically across three parameterizations, without defining any quantities in terms of fitted outputs or renaming predictions. No load-bearing self-citations, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation appear in the derivation. The central claims rest on analysis of standard formulations and experimental benchmarks rather than reducing to self-defined inputs by construction, making the work self-contained against external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard assumptions of variational inference for Gaussian processes (existence of a well-defined ELBO, properties of the KL divergence, and the validity of the DSVI algorithm) and on the modeling assumptions that DGPs are composed of independent GP layers with chosen mean and kernel functions. No free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: Variational inference via minimization of the KL divergence between the variational posterior and the true posterior yields a valid approximation for DGPs.
Invoked throughout the abstract when discussing posterior collapse and the ELBO optimization.
domain assumption: The DSVI algorithm is a correct instantiation of variational inference for the DGP model class.
The abstract explicitly connects posterior collapse to the DSVI algorithm.
