The Value of Covariance Matching in Gaussian DDPMs and the Lanczos Sampler
Pith reviewed 2026-05-22 07:33 UTC · model grok-4.3
The pith
Matching the full posterior covariance in Gaussian DDPMs reduces path KL error to O(1/T^2)
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Matching the full posterior covariance breaks the Omega(1/T) path-KL error barrier of isotropic reverse covariances, yielding an order-wise improvement to O(1/T^2). The Lanczos Gaussian sampler realizes this with approximation error decaying exponentially in the number of Lanczos steps, each requiring a single Jacobian-vector product of the posterior mean.
What carries the argument
The Lanczos Gaussian sampler, a training-free method that generates samples from the optimal reverse covariance by performing Lanczos iterations on covariance-vector products obtained via Jacobian-vector products.
If this is right
- Full covariance matching improves the path KL divergence scaling to O(1/T^2) instead of Omega(1/T).
- The Lanczos Gaussian sampler achieves this matching with exponential error decay in the number of steps.
- Three Lanczos steps suffice to outperform strong diagonal-covariance baselines like OCM-DDPM on image benchmarks.
- No dense covariance storage or auxiliary models are needed for practical implementation.
Where Pith is reading between the lines
- Extending full covariance matching to other types of diffusion models could further enhance sampling efficiency.
- Reducing the required number of denoising steps while maintaining quality might be possible with this approach.
- Applications in conditional generation could benefit from more accurate trajectory matching.
Load-bearing premise
Covariance-vector products can be reliably obtained through Jacobian-vector products of the posterior mean, and a small fixed number of Lanczos steps is enough to achieve the O(1/T^2) improvement.
What would settle it
Computing the path KL divergence for large T with full covariance matching and observing whether the error decreases quadratically rather than linearly with 1/T.
Figures
read the original abstract
A central error measure in Gaussian DDPMs is the path-space KL divergence between the exact reverse chain and the learned Gaussian reverse process. This quantity is especially relevant for procedures such as classifier guidance, which perturb the entire reverse trajectory rather than only the terminal sample. Prior analyses show that standard isotropic reverse covariances suffer an unavoidable $\Omega(1/T)$ path-KL error as the number of denoising steps $T$ grows. We show that matching the full posterior covariance breaks this barrier, yielding an order-wise improvement that reduces the path KL to $O(1/T^2)$. To make full covariance matching practical, we introduce the Lanczos Gaussian sampler (LGS), a training-free, matrix-free method for sampling from the optimal reverse covariance using only covariance-vector products, which are available through Jacobian-vector products of the posterior mean. LGS avoids dense covariance storage and auxiliary covariance models. We prove that LGS approximation error decays exponentially in the number of Lanczos steps, where each Lanczos step requires a single Jacobian-vector product. Empirically, using only just three such steps improves sample quality over strong diagonal-covariance baselines, including OCM-DDPM, across standard image benchmarks. This identifies full covariance matching as both theoretically valuable and practically accessible for fast DDPM sampling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that replacing isotropic reverse covariances with the exact posterior covariance in Gaussian DDPMs breaks the Ω(1/T) path-space KL barrier, yielding an O(1/T²) bound. It introduces the Lanczos Gaussian sampler (LGS), a training-free method that approximates the optimal covariance via Krylov subspaces using only Jacobian-vector products of the posterior mean, proves exponential decay of the approximation error in the number of Lanczos steps, and reports empirical gains over diagonal baselines (including OCM-DDPM) on standard image benchmarks with as few as three steps.
Significance. If the central claims hold, the work supplies a parameter-free derivation of an order-wise improvement in path KL for diffusion models together with a matrix-free sampler whose convergence rate is proven to be exponential in the Lanczos dimension. These elements are directly relevant to classifier-guided and other trajectory-level sampling procedures.
major comments (2)
- [Section 3 (path-KL derivation) and Theorem on LGS convergence] The O(1/T²) path-KL claim is derived under exact posterior covariance matching (presumably via telescoping or Girsanov expansion in the reverse SDE). The manuscript does not quantify how the residual covariance error from fixed-k Lanczos truncation propagates into the integrated path KL when T→∞. If the per-step covariance mismatch remains O(1) or decays slower than 1/T, the leading Ω(1/T) term may be reintroduced, undermining the order-wise improvement.
- [Proof of Lanczos approximation error decay] The exponential convergence proof for LGS is stated for a fixed covariance matrix. It is unclear whether the rate remains exponential (or even polynomial) uniformly over the sequence of time-dependent posterior covariances that arise in a DDPM schedule, especially when the score network is only approximately learned.
minor comments (2)
- [Section 2] Notation for the covariance-vector product via Jacobian-vector product of the posterior mean should be introduced earlier and used consistently to avoid ambiguity between exact and approximate quantities.
- [Experiments] The experimental section would benefit from an ablation showing path-KL or FID as a function of Lanczos steps k for fixed T, to illustrate the practical trade-off.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments on our manuscript. The points raised regarding error propagation in the path KL and uniformity of the Lanczos convergence rate are valuable and will strengthen the theoretical presentation. We address each major comment below and indicate the revisions we plan to incorporate.
read point-by-point responses
-
Referee: [Section 3 (path-KL derivation) and Theorem on LGS convergence] The O(1/T²) path-KL claim is derived under exact posterior covariance matching (presumably via telescoping or Girsanov expansion in the reverse SDE). The manuscript does not quantify how the residual covariance error from fixed-k Lanczos truncation propagates into the integrated path KL when T→∞. If the per-step covariance mismatch remains O(1) or decays slower than 1/T, the leading Ω(1/T) term may be reintroduced, undermining the order-wise improvement.
Authors: We agree that a complete propagation analysis is required to rigorously confirm that the order-wise improvement is preserved under Lanczos approximation. The O(1/T²) bound in Section 3 is derived for exact posterior covariance matching via the Girsanov change-of-measure argument. For LGS with fixed k, the covariance error per step is bounded by the exponential decay term from the Lanczos theorem, which is independent of T but can be driven below any polynomial threshold by modest increases in k. We will add a new lemma to Section 3 that decomposes the path KL into the exact-matching contribution plus a remainder controlled by the integrated covariance mismatch. Under the standard Lipschitz and boundedness assumptions on the score, this remainder is O(ε(k)/T), where ε(k) decays exponentially in k. Consequently, for any fixed k the leading term remains O(1/T²) plus a higher-order perturbation whose prefactor is negligible in the regime of practical interest; the Ω(1/T) barrier is still broken. We will also note that k may be allowed to grow slowly with T (e.g., k = O(log T)) to obtain a uniform O(1/T²) guarantee. This analysis will be included in the revised manuscript. revision: yes
-
Referee: [Proof of Lanczos approximation error decay] The exponential convergence proof for LGS is stated for a fixed covariance matrix. It is unclear whether the rate remains exponential (or even polynomial) uniformly over the sequence of time-dependent posterior covariances that arise in a DDPM schedule, especially when the score network is only approximately learned.
Authors: The Lanczos convergence theorem applies to any fixed symmetric positive-definite matrix and yields exponential decay whose rate depends on the eigenvalue distribution. In the DDPM setting the posterior covariances Σ_t vary continuously with t and, under the standard assumptions of the diffusion schedule and data distribution, possess uniformly bounded condition numbers and eigenvalue gaps across the discrete time grid. We will extend the existing proof by adding a uniformity argument: we bound the relevant spectral quantities (e.g., the maximal eigenvalue ratio) uniformly in t using the Lipschitz continuity of the learned score and the smoothness of the forward process. The same Lanczos procedure is then applied to the approximate posterior covariance induced by the learned score; the exponential rate therefore holds relative to that approximate covariance, with the same uniformity. The revised proof will explicitly state these uniform bounds and the resulting exponential decay that is independent of T and of the approximation quality of the score (provided the score remains Lipschitz). revision: yes
Circularity Check
No significant circularity; derivation is self-contained via explicit KL expansion
full rationale
The paper derives the path-KL error bounds directly from the difference between isotropic and full posterior covariance in the reverse transitions, using standard telescoping or Girsanov-type expansions on the discrete reverse chain. This yields the claimed Omega(1/T) to O(1/T^2) improvement as a mathematical consequence of exact covariance matching, without fitting parameters to the target error or redefining quantities in terms of themselves. The Lanczos Gaussian sampler is presented as an independent algorithmic contribution with its own exponential convergence proof for fixed covariance, relying on Jacobian-vector products rather than any self-referential construction. No load-bearing self-citations, uniqueness theorems from prior author work, or smuggled ansatzes appear in the derivation chain. The analysis is therefore independent of its inputs and externally falsifiable via the stated assumptions on covariance-vector products.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The learned reverse process is Gaussian.
Reference graph
Works this paper leans on
-
[1]
Fan Bao, Chongxuan Li, Jiacheng Sun, Jun Zhu, and Bo Zhang. Estimating the optimal covariance with imperfect mean in diffusion probabilistic models.arXiv preprint arXiv:2206.07309, 2022a. Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models.arXiv preprint arXiv:2...
-
[2]
Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions.arXiv preprint arXiv:2209.11215,
-
[3]
URLhttps://arxiv.org/abs/2308.12240. Jane K Cullum and Ralph A Willoughby.Lanczos algorithms for large symmetric eigenvalue computations: Vol. I: Theory. SIAM,
-
[4]
A sharp kl-convergence analysis for diffusion models under minimal assumptions
Nishant Jain and Tong Zhang. A sharp kl-convergence analysis for diffusion models under minimal assumptions. arXiv preprint arXiv:2508.16306,
-
[5]
Optimal convergence analysis of DDPM for general distributions.arXiv preprint arXiv:2510.27562,
Yuchen Jiao, Yuchen Zhou, and Gen Li. Optimal convergence analysis of ddpm for general distributions.arXiv preprint arXiv:2510.27562,
-
[6]
Gen Li, Yuting Wei, Yuxin Chen, and Yuejie Chi
URLhttps://arxiv.org/abs/2206.06227. Gen Li, Yuting Wei, Yuxin Chen, and Yuejie Chi. Towards faster non-asymptotic convergence for diffusion-based generative models,
-
[7]
Towards faster non-asymptotic convergence for diffusion-based generative models
URLhttps://arxiv.org/abs/2306.09251. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. InProceedings of the IEEE international conference on computer vision, pages 3730–3738,
-
[8]
Hila Manor and Tomer Michaeli. On the posterior distribution in denoising: Application to uncertainty quantification.arXiv preprint arXiv:2309.13598,
-
[9]
Zijing Ou, Mingtian Zhang, Andi Zhang, Tim Z Xiao, Yingzhen Li, and David Barber. Improving probabilistic diffusion models with optimal diagonal covariance matching.arXiv preprint arXiv:2406.10808,
-
[10]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502,
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[11]
A The Value of Covariance Matching in Classifier Guided DDPMs We end this Section by asking why one might care about thepathKL as opposed to simply say the KL at the endpoint, KL(q(x1)∥p θ(x1)). Here we note that the path KL is relevant to algorithms that rely on faithful reconstruction of the entire path. One such example is classifier guided diffusion m...
work page 2021
-
[12]
We observe that the final step of Algorithm 2 requires computingT 1/2 m e1
which has been shown to be the most numerically stable variant of the Lanczos iteration [Cullum and Willoughby, 2002, Saad, 2011]. We observe that the final step of Algorithm 2 requires computingT 1/2 m e1. Since Tm is a small m×m symmetric tridiagonal matrix,T 1/2 m e1 can be computed efficiently by diagonalizingT m. D Omitted Proofs from Section 3 In th...
work page 2002
-
[13]
D.1 Proof of Proposition 2 Proposition 13(Restatement of Proposition 2).The unconditional path KL satisfies KL(q(x1:T )∥p θ(x1:T )) =L(θ)− L ∗ + KL(q(xT )∥ N(0, I)) Proof.By construction of the learned DDPM reverse kernel [Ho et al., 2020], pθ(x1:T ) =p(x T ) TY t=2 pθ(xt−1 |x t) =N(0, I) TY t=2 pθ(xt−1 |x t) As for q(x1:T ), the chain rule together with ...
work page 2020
-
[14]
Again, since X0|t and Z are independent,Cov(X 0|t|Z) = Cov(X0|t)
where the expectation is taken over the random vector √γX0|t +Z . Again, since X0|t and Z are independent,Cov(X 0|t|Z) = Cov(X0|t). Hence g′′(0) =− 1 2 E tr(Cov(X0|t|Z)2) =− 1 2tr(Cov(X0|t)2). As for the third derivative, we proceed by defining Yγ := √γX0|t +Z, ¯Xγ :=X 0|t −E[X 0|t|Yγ], T γ :=E[ ¯X ⊗3 γ |Yγ]. By the third-derivative formula for Gaussian c...
work page 2024
-
[15]
[2024], which makes our results directly comparable
for ImageNet 64×64 ; these are the same checkpoints redistributed by Ou et al. [2024], which makes our results directly comparable. A Hessian–vector product (HVP) of logp θ(xt, t) against a vector v is computed by a single forward pass through ϵθ followed by a single backward pass via PyTorch’s torch.autograd.grad with create_graph=Falseon the inner produ...
work page 2024
-
[16]
We report ratios of the median wall-clock time of each method to the median wall-clock time of OCM-DDPM, alongside the above heuristic estimate for each method. Table 3: Wall-clock time per sample (seconds; n= 99 post-warmup; T= 50 steps).ratio = median ÷ OCM-DDPM median for the same dataset.heuristicis the predicted ratio derived in the text above (same ...
- [17]
- [18]
- [19]
- [20]
- [21]
- [22]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.