Consistency of variational approximations under bounded Kullback-Leibler divergence

Hien Duy Nguyen; Jacob Westerhout; Julyan Arbel; Thomas Guilmeau

arxiv: 2606.13230 · v1 · pith:TSDHHWZJnew · submitted 2026-06-11 · 🧮 math.ST · stat.TH


Pith reviewed 2026-06-27 05:10 UTC · model grok-4.3


keywords: variational inference · posterior consistency · Kullback-Leibler divergence · tightness · metric spaces · Bayesian inference · generalized posteriors


The pith

On general metric spaces, a uniform bound on Kullback-Leibler divergence from approximations to tight targets forces the approximations to be tight.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that variational approximations inherit posterior consistency when their Kullback-Leibler divergence to the targets stays uniformly bounded. On any metric space, this bound transfers tightness from the target sequence to the approximating sequence. When the targets converge weakly to a Dirac measure at the true parameter, the same convergence holds for the variational sequence. Logarithmic-moment conditions are supplied to verify the bounded-divergence requirement for smooth generalized posteriors.

Core claim

On a general metric space, a uniform bound on the Kullback-Leibler divergence from the approximating measures to a tight sequence of target measures forces the approximating sequence to be tight. It follows that if the target posteriors converge weakly to a Dirac mass at the true parameter, then any variational sequence with bounded Kullback-Leibler divergence to the targets is also consistent.
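One standard way to see why a bounded divergence can transfer tightness is the data-processing inequality applied to the two-cell partition {A, A^c}. The sketch below is an illustration under that assumption and is not necessarily the argument used in the paper. The symbols A (a measurable set), K (a compact set), C (the uniform divergence bound) and \delta (the target mass outside K) are introduced here for illustration only.

```latex
% Data processing on the partition {A, A^c}, then h(q) <= log 2 for the binary entropy h:
\begin{align*}
\mathrm{KL}(Q \,\|\, P)
  &\ge Q(A)\log\frac{Q(A)}{P(A)} + \bigl(1 - Q(A)\bigr)\log\frac{1 - Q(A)}{1 - P(A)} \\
  &\ge Q(A)\log\frac{1}{P(A)} - \log 2 .
\end{align*}
% Rearranging, whenever 0 < P(A) < 1:
\[
Q(A) \le \frac{\mathrm{KL}(Q \,\|\, P) + \log 2}{\log\{1 / P(A)\}} .
\]
% With \sup_n KL(Q_n || P_n) <= C and 0 < P_n(K^c) <= \delta < 1 for all n:
\[
\sup_n Q_n(K^c) \le \frac{C + \log 2}{\log(1/\delta)} \longrightarrow 0
\quad \text{as } \delta \to 0 .
\]
```

If P_n(K^c) = 0, then Q_n(K^c) = 0 as well, because a finite divergence forces Q_n to be absolutely continuous with respect to P_n.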

What carries the argument

The uniform bound on Kullback-Leibler divergence, which transfers tightness from the target sequence to the variational approximating sequence.

If this is right

If target posteriors converge weakly to a Dirac at the true parameter, variational approximations with bounded KL are consistent.
Logarithmic-moment conditions on the data suffice to establish the bounded-KL hypothesis for smooth generalized posteriors.
The tightness transfer holds on arbitrary metric spaces, including infinite-dimensional settings.
The result supplies a general sufficient condition for consistency of variational methods whenever the targets are consistent.
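The tightness transfer summarised in the points above can be checked numerically in a toy case. The following is a minimal sketch, assuming the elementary bound Q(A) <= (KL(Q || P) + log 2) / log(1 / P(A)), which follows from the data-processing inequality and is not claimed to be the paper's own inequality. The Gaussian pair Q = N(mean, 1), P = N(0, 1) and the function names are hypothetical choices made for illustration.

```python
import math


def two_sided_tail(t, mean=0.0):
    """P(|X| > t) for X ~ N(mean, 1), via the complementary error function."""
    upper = 0.5 * math.erfc((t - mean) / math.sqrt(2.0))  # P(X > t)
    lower = 0.5 * math.erfc((t + mean) / math.sqrt(2.0))  # P(X < -t)
    return upper + lower


def kl_mass_bound(kl, target_mass):
    """Upper bound on Q(A) when KL(Q || P) <= kl and P(A) = target_mass in (0, 1)."""
    return min(1.0, (kl + math.log(2.0)) / math.log(1.0 / target_mass))


def check(mean, t):
    """Return (Q(A), bound) for A = {|x| > t}, Q = N(mean, 1), P = N(0, 1)."""
    kl = 0.5 * mean ** 2  # closed form of KL(N(mean, 1) || N(0, 1))
    q_mass = two_sided_tail(t, mean)
    p_mass = two_sided_tail(t, 0.0)
    return q_mass, kl_mass_bound(kl, p_mass)


if __name__ == "__main__":
    for mean in (0.0, 1.0, 3.0):
        for t in (2.0, 4.0, 6.0):
            q_mass, bound = check(mean, t)
            print(f"mean={mean:.1f} t={t:.1f} Q(A)={q_mass:.3e} bound={bound:.3e}")
```

The bound is loose, but it depends on Q only through the divergence, which is what allows it to hold uniformly over a whole sequence of approximations.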

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tightness argument could be adapted to other f-divergences if they control total variation or weak convergence in a comparable way.
In practice, the log-moment conditions may be easier to check than direct tightness of the variational family itself.
The result suggests that posterior consistency proofs for variational methods can reduce to verifying a single uniform bound rather than reproving convergence from scratch.

Load-bearing premise

The sequence of target measures must itself be tight.

What would settle it

A tight sequence of target measures on a metric space together with approximating measures whose Kullback-Leibler divergences remain uniformly bounded, yet whose sequence fails to be tight, would falsify the main claim.
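To make the role of the bounded-divergence hypothesis concrete, the sketch below contrasts two hypothetical Gaussian approximating sequences for a hypothetical target P_n = N(0, 1/n) that concentrates at theta_0 = 0. One sequence keeps KL(Q_n || P_n) constant and concentrates with the target; the other has a divergence growing like n/2 and fails to concentrate. The example is illustrative only: it does not appear in the paper, and it is not a counterexample to the claim, since the failing sequence violates the uniform bound.

```python
import math


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) for univariate Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)


def mass_outside_ball(mu, var, eps):
    """P(|X| > eps) for X ~ N(mu, var)."""
    scale = math.sqrt(2.0 * var)
    upper = 0.5 * math.erfc((eps - mu) / scale)  # P(X > eps)
    lower = 0.5 * math.erfc((eps + mu) / scale)  # P(X < -eps)
    return upper + lower


def compare(n, eps=0.5, shift=2.0):
    """Return ((kl, mass), (kl, mass)) for the bounded-KL and unbounded-KL sequences."""
    var = 1.0 / n  # hypothetical target P_n = N(0, 1/n)
    # Bounded case: mean shift / sqrt(n), so KL(Q_n || P_n) = shift^2 / 2 for every n.
    mu_good = shift / math.sqrt(n)
    good = (gaussian_kl(mu_good, var, 0.0, var), mass_outside_ball(mu_good, var, eps))
    # Unbounded case: mean fixed at 1, so KL(Q_n || P_n) = n / 2.
    bad = (gaussian_kl(1.0, var, 0.0, var), mass_outside_ball(1.0, var, eps))
    return good, bad


if __name__ == "__main__":
    for n in (1, 100, 10000):
        good, bad = compare(n)
        print(f"n={n}: bounded KL={good[0]:.2f}, mass={good[1]:.3e}; "
              f"unbounded KL={bad[0]:.1f}, mass={bad[1]:.3f}")
```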

Original abstract

Variational methods are widely used to approximate posterior distributions in Bayesian inference when exact computation is infeasible. We study when such approximations inherit posterior consistency. Our first result shows that, on a general metric space, a uniform bound on the Kullback-Leibler divergence from the approximating measures to a tight sequence of target measures forces the approximating sequence to be tight. It follows that if the target posteriors converge weakly to a Dirac mass at the true parameter, then any variational sequence with bounded Kullback-Leibler divergence to the targets is also consistent. We also give simple logarithmic-moment conditions that verify this boundedness condition, and illustrate them for smooth generalised posterior distributions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper links bounded KL(Q_n || P_n) to tightness of the variational sequence when targets are tight, then to consistency, but the general-metric-space claim runs into the usual Polish-space requirement for Prohorov.

The core contribution is a tightness result: if the targets P_n are tight and KL(Q_n || P_n) stays bounded, then the approximating measures Q_n are tight on a metric space. From there the paper concludes that if P_n converges weakly to a Dirac at the true value, the Q_n sequence is consistent as well. It also supplies logarithmic-moment conditions that make the bounded-KL assumption easy to check for smooth generalised posteriors.

That tightness step is the genuinely new piece. Most existing variational-consistency work either assumes the approximating family is already compact or works in Euclidean space with explicit tail controls. Giving a direct KL-to-tightness implication that applies once the targets are known to be tight is cleaner than the usual route.

The moment conditions are a practical plus. They turn the abstract bounded-KL hypothesis into something one can verify from the model and the prior without having to compute the variational objective itself.

The soft spot is exactly the one the stress-test flags. The abstract states the result for a “general metric space,” yet the passage from tightness to relative compactness (needed to extract convergent subsequences whose only possible limit is the Dirac) relies on Prohorov’s theorem, which requires the space to be Polish. The paper gives no separability or completeness assumption, and the abstract does not indicate that the proof sidesteps this requirement. If the full argument really works without those conditions, that would be worth highlighting; otherwise the claim needs to be restricted to Polish spaces.

The rest of the development looks standard and internally consistent. No circularity or invented objects appear.

This is the kind of note that belongs in a journal that publishes theoretical statistics. Readers working on guarantees for variational Bayes or generalised posteriors will want to see the details. It deserves a serious referee who can check the topological hypotheses and the moment calculations.

Referee Report

1 major / 0 minor

Summary. The paper claims that on a general metric space, a uniform bound on KL(Q_n || P_n) for a tight sequence of target measures {P_n} implies tightness of the approximating sequence {Q_n}. It follows that if {P_n} converges weakly to a Dirac mass at the true parameter, then any variational sequence with bounded KL to the targets is consistent. The paper also supplies logarithmic-moment conditions to verify the bounded-KL hypothesis and illustrates them on smooth generalised posteriors.

Significance. If the central tightness implication holds under the stated hypotheses, the result supplies a broadly applicable criterion linking bounded KL to consistency of variational approximations, extending beyond case-by-case analyses. The logarithmic-moment verification conditions constitute a concrete, checkable strength that could be used in applications.

major comments (1)

[Abstract] Abstract (and presumably §2 or the main theorem statement): the result is stated for a 'general metric space,' yet the passage from tightness of {Q_n} to weak convergence (hence consistency) to the Dirac limit of {P_n} relies on relative compactness. Prohorov's theorem requires the space to be Polish (separable and complete); on a non-separable or incomplete metric space tightness need not yield relatively compact subsequences, so the consistency conclusion does not follow in full generality. This assumption is load-bearing for the central claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the precise observation on topological assumptions. We address the comment below.

Referee: [Abstract] Abstract (and presumably §2 or the main theorem statement): the result is stated for a 'general metric space,' yet the passage from tightness of {Q_n} to weak convergence (hence consistency) to the Dirac limit of {P_n} relies on relative compactness. Prohorov's theorem requires the space to be Polish (separable and complete); on a non-separable or incomplete metric space tightness need not yield relatively compact subsequences, so the consistency conclusion does not follow in full generality. This assumption is load-bearing for the central claim.

Authors: We agree that the comment is correct. The first result (bounded KL divergence implies tightness of the approximating sequence) holds on arbitrary metric spaces. However, the passage from tightness to relative compactness, and hence to weak convergence to the Dirac measure, invokes Prohorov's theorem and therefore requires the underlying space to be Polish. We will revise the abstract, the statement of the main theorem, and the surrounding discussion to explicitly assume that the metric space is Polish. This does not change the tightness implication but correctly restricts the consistency conclusion to the setting where Prohorov's theorem applies.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely theoretical derivation of tightness from bounded KL on metric spaces

The paper presents a mathematical theorem establishing that a uniform bound on KL(Q_n || P_n) implies tightness of {Q_n} when {P_n} is tight, followed by a consistency implication when P_n converges weakly to a Dirac. No parameters are fitted, no predictions are made from subsets of data, and no self-citations or ansatzes are invoked as load-bearing steps in the provided abstract or description. The derivation is self-contained as a direct proof in measure-theoretic probability, with no reduction of outputs to inputs by construction. The skeptic's concern about Polish vs. general metric spaces pertains to correctness of the statement (Prohorov's theorem), not to circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Theoretical result in probability on metric spaces; relies on standard properties of KL divergence, weak convergence, and tightness.

axioms (2)

[standard math] Kullback-Leibler divergence is well-defined and non-negative on probability measures on a metric space. Invoked throughout the consistency statements in the abstract.
[standard math] Weak convergence to a Dirac measure implies consistency of the sequence. Used to link tightness to the final consistency conclusion.
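
For reference, the two standard facts listed above can be written out as follows. This is a sketch using the usual textbook definitions, which are assumed here to match the paper's conventions. The symbols \varphi, \theta_0 and \varepsilon are notation introduced for illustration.

```latex
% Definition of the Kullback-Leibler divergence between probability measures Q and P:
\[
\mathrm{KL}(Q \,\|\, P) =
\begin{cases}
\displaystyle \int \log\frac{\mathrm{d}Q}{\mathrm{d}P}\,\mathrm{d}Q, & Q \ll P, \\[1ex]
+\infty, & \text{otherwise.}
\end{cases}
\]
% Non-negativity when Q << P, by Jensen's inequality with the convex function \varphi(x) = x \log x:
\[
\mathrm{KL}(Q \,\|\, P)
  = \int \varphi\!\left(\frac{\mathrm{d}Q}{\mathrm{d}P}\right)\mathrm{d}P
  \ge \varphi\!\left(\int \frac{\mathrm{d}Q}{\mathrm{d}P}\,\mathrm{d}P\right)
  = \varphi(1) = 0 .
\]
% Weak convergence to a Dirac mass on a metric space (\Theta, d), in terms of ball probabilities:
\[
Q_n \Rightarrow \delta_{\theta_0}
\quad\Longleftrightarrow\quad
Q_n\bigl(\{\theta : d(\theta, \theta_0) \ge \varepsilon\}\bigr) \to 0
\ \text{ for every } \varepsilon > 0 .
\]
```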

pith-pipeline@v0.9.1-grok · 5644 in / 1269 out tokens · 27813 ms · 2026-06-27T05:10:54.633260+00:00 · methodology
