Conditional Diffusion Sampling

Daniel Hernández-Lobato; Francisco M. Castro-Macías; José Miguel Hernández-Lobato; Pablo Morales-Álvarez; Rafael Molina; Saifuddin Syed

arxiv: 2605.04013 · v1 · submitted 2026-05-05 · 📊 stat.ML · cs.LG

Francisco M. Castro-Macías, Pablo Morales-Álvarez, Saifuddin Syed, Daniel Hernández-Lobato, Rafael Molina, José Miguel Hernández-Lobato

Pith reviewed 2026-05-09 15:29 UTC · model grok-4.3

keywords conditional diffusion sampling · parallel tempering · stochastic differential equations · multimodal distributions · density evaluation · sampling methods · diffusion processes

The pith

Conditional Diffusion Sampling pairs parallel tempering with exact diffusion dynamics to sample multimodal distributions more efficiently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sampling unnormalized multimodal distributions with limited density evaluations is a core challenge in machine learning and natural sciences. This paper introduces Conditional Diffusion Sampling, a framework that builds a bridge from a tractable reference to the target by deriving conditional interpolants whose transport follows an exact closed-form stochastic differential equation. The method uses parallel tempering to draw from the nontrivial initial distribution and then applies the SDE to move samples to the target. Theory and experiments indicate that the initialization cost becomes negligible at short diffusion times, preserving overall efficiency. If the approach holds, it delivers a stronger balance between sample quality and the number of density evaluations than existing samplers.

Core claim

CDS introduces conditional interpolants whose transport dynamics are governed by an exact closed-form SDE requiring no neural approximation. Although these dynamics need samples from a non-trivial initialization distribution, both theory and experiments show that this cost diminishes for sufficiently short diffusion times. The procedure therefore runs parallel tempering to obtain the initial points and then transports them via the SDE, coupling PT's global exploration with efficient local transport.
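The second stage can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes the linear interpolant $F_{t|z}(x) = (1-t)z + tx$ that appears in the appendix excerpts, a constant noise level `sigma` (the paper writes $\sigma_t$), and a one-dimensional Gaussian target whose score is known. Under those assumptions the drift $a_{t|z} = u_{t|z} + \tfrac{\sigma^2}{2}\nabla\log\pi_{t|z}$ works out to $(x_t - z)/t + \tfrac{\sigma^2}{2t}\nabla\log\pi\big((x_t - (1-t)z)/t\big)$. Stage 1 is replaced by exact sampling, which is only possible because the toy target can be sampled directly; `transport_sde` and all parameter values are hypothetical.

```python
import numpy as np

def transport_sde(x_t0, z, score, t0, sigma=0.5, n_steps=200, rng=None):
    """Euler-Maruyama integration from t = t0 to t = 1 (linear interpolant assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x_t0, dtype=float)
    ts = np.linspace(t0, 1.0, n_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        h = t_next - t
        w = (x - (1.0 - t) * z) / t                 # pull x_t back to target coordinates
        drift = (x - z) / t + 0.5 * sigma**2 * score(w) / t
        x = x + h * drift + sigma * np.sqrt(h) * rng.standard_normal(x.shape)
    return x

# Toy target N(m, s^2); its score is available in closed form (assumption for the demo).
m, s = 2.0, 1.0
score = lambda w: -(w - m) / s**2

rng = np.random.default_rng(0)
z = rng.standard_normal()        # one fixed reference draw z ~ N(0, 1)
t0 = 0.3

# Stage 1 stand-in: exact draws from pi_{t0|z}. In CDS this is where PT would run.
x_target = m + s * rng.standard_normal(20_000)
x_t0 = (1.0 - t0) * z + t0 * x_target

# Stage 2: transport the initial samples to t = 1, where pi_{1|z} equals the target.
x_1 = transport_sde(x_t0, z, score, t0, rng=rng)
print("mean:", x_1.mean(), "std:", x_1.std())
```

With `sigma=0.0` the loop reduces to the deterministic map and returns the original target draws, which is a quick way to check the drift term in isolation.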

What carries the argument

Conditional Interpolants: a class of stochastic processes whose transport dynamics follow an exact closed-form SDE.

If this is right

CDS achieves a superior trade-off between sample quality and density evaluation cost compared with current samplers.
No neural network training is needed to define the transport dynamics.
The combination keeps the robustness of PT while adding the continuous transport of diffusion.
The method works for any target where density evaluations are expensive but parallel tempering can be run.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same short-time initialization idea could be paired with other global samplers besides parallel tempering.
Testing CDS on higher-dimensional or continuous-time targets would clarify how far the diminishing-cost regime extends.
The exact SDE form might allow analytic error bounds that current neural diffusion samplers lack.

Load-bearing premise

The cost of sampling the non-trivial initialization distribution diminishes for sufficiently short diffusion times so that the overall two-stage PT-plus-SDE procedure remains efficient.
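One ingredient behind this premise is the scaling identity quoted in the appendix excerpts, $W_1(F_{t|z}\#\mu_1, F_{t|z}\#\mu_2) = t\,W_1(\mu_1, \mu_2)$ for the linear map $F_{t|z}(x) = (1-t)z + tx$. The sketch below checks it numerically in one dimension, where $W_1$ between two equal-size empirical measures is the mean absolute difference of the sorted samples. The sample distributions, the value of `z`, and the helper names are illustrative assumptions; the check says nothing about PT cost itself, only about how distances shrink with $t$.

```python
import numpy as np

def w1_empirical(a, b):
    """W1 between two equal-size 1-D empirical measures via the sorted coupling."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def interpolant_map(x, z, t):
    """Linear conditional interpolation map F_{t|z}(x) = (1 - t) z + t x."""
    return (1.0 - t) * z + t * x

rng = np.random.default_rng(0)
z = 0.7                                   # arbitrary fixed conditioning point
mu1 = rng.normal(-3.0, 1.0, size=5_000)   # samples standing in for mu_1
mu2 = rng.normal(3.0, 0.5, size=5_000)    # samples standing in for mu_2

base = w1_empirical(mu1, mu2)
for t in (1.0, 0.5, 0.1, 0.01):
    pushed = w1_empirical(interpolant_map(mu1, z, t), interpolant_map(mu2, z, t))
    print(f"t={t:<5} W1 after pushforward={pushed:.6f}  t * W1={t * base:.6f}")
```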

What would settle it

An experiment that measures the initialization sampling cost at progressively shorter diffusion times and finds it does not decrease enough to keep total density evaluations competitive, or that directly compares quality-cost curves and shows CDS does not outperform state-of-the-art samplers.

Figures

Figures reproduced from arXiv: 2605.04013 by Daniel Hernández-Lobato, Francisco M. Castro-Macías, José Miguel Hernández-Lobato, Pablo Morales-Álvarez, Rafael Molina, Saifuddin Syed.

**Figure 1.** Overview of CDS. In the first stage, Parallel Tempering (PT) transforms initial samples $z$ from the reference $\pi_{\mathrm{ref}}$ (orange) into samples from the initialization distribution $\pi_{t_0|z}$ (blue). In the second stage, these samples are transported to the target distribution $\pi$ (blue) by integrating the closed-form SDE in Eq. 15.

**Figure 2.** Ramachandran histograms of Alanine Dipeptide (ALDP) in vacuum at $T = 300$ K. All methods utilize a fixed budget of $2 \cdot 10^5$ density evaluations. Only the proposed CDS and NRPT successfully capture all modes under this limited budget.

**Figure 3.** Density evolution and exact samples from $\pi_{t|z}$. This plot illustrates the linear interpolant for a fixed $z$ sampled from $\mathcal{N}(0, I)$. As $t \to 0$, the target distribution (blue) increasingly concentrates inside the reference (orange).

**Figure 4.** Round Trips (RTs, higher is better) and sampling error (lower is better) as a function of $t_0$. Decreasing $t_0$ from 1.0 generally increases RT counts and reduces sampling error, indicating improved mixing and sample quality.

**Figure 5.** SDE-based transport consistently outperforms the inverse mapping. The latter performs slightly better on GM-2 under small budgets, due to a trade-off: fewer SDE steps leave room for more exploration during the PT phase. This advantage disappears in more complex target distributions, where the corrective effect of the SDE becomes important.

**Figure 6.** Pareto fronts for sampling performance across eight target distributions. The proposed CDS method achieves competitive or superior performance compared to state-of-the-art samplers, demonstrating higher efficiency by requiring fewer density evaluations for the same level of accuracy.

**Figure 7.** Comparison of Lennard-Jones (LJ) potential energy histograms. For clarity, results for the remaining methods are provided in …

**Figure 8.** GM task in two dimensions.

**Figure 9.** Global Communication Barrier (GCB) as a function of $t_0$. Across LJ-13, ALDP, and BNN, decreasing $t_0$ initially improves communication efficiency (lower GCB).

**Figure 10.** Effect of the Stage 1 sampler on CDS performance. We compare NRPT, OASMC, and HMC while keeping Stage 2 (SDE integration) fixed. NRPT consistently outperforms OASMC, whereas HMC struggles to explore the multimodal distribution and exhibits significantly degraded performance.

**Figure 11.** Comparison of CDS against additional baselines (NUTS, SVGD, MAMS). CDS consistently outperforms all methods in GM and BNN, while in LJ, MAMS and NUTS perform best in the low-budget regime but are surpassed by CDS as the number of density evaluations increases.

**Figure 12.** Comparison of ground truth and generated samples for the Gaussian Mixture (GM) task. Each method uses a fixed budget of $2 \cdot 10^3$ density evaluations.

**Figure 13.** Comparison of Lennard-Jones (LJ) potential energy histograms. All samplers use a fixed budget of $2 \cdot 10^4$ and $2 \cdot 10^5$ density evaluations for the LJ-13 and LJ-55 targets, respectively. DiGS is omitted from the LJ-55 plot as it produces a degenerate histogram.

**Figure 14.** Impact of step allocation on CDS performance. We evaluate the trade-off between Parallel Tempering (PT) and SDE integration by varying the proportion of total steps dedicated to the PT phase.

**Figure 15.** Impact of integration steps on CDS performance. Under a limited evaluation budget, fewer integration steps are preferable as they allow more resources for exploration in the first stage. In contrast, with a larger budget, increasing the number of integration steps improves the accuracy of the transport to the target distribution.

**Figure 16.** Impact of corrector steps and noise variance on CDS performance. Corrector steps improve results only at low noise levels, while their impact becomes negligible as the diffusion variance increases.

**Figure 17.** Pareto fronts for the Gaussian Mixture (GM) task across different evaluation metrics. Evolution of performance across different evaluation criteria (defined in App. F). Each curve represents the optimal trade-off between computational budget and sample quality.

**Figure 18.** Pareto fronts for the Lennard-Jones (LJ) task across different evaluation metrics. Evolution of performance across different evaluation criteria (defined in App. F). Each curve represents the optimal trade-off between computational budget and sample quality.

**Figure 19.** Pareto fronts for the Alanine Dipeptide (ALDP) task across different evaluation metrics. Evolution of performance across different evaluation criteria (defined in App. F). Each curve represents the optimal trade-off between computational budget and sample quality.

Original abstract

Sampling from unnormalized multimodal distributions with limited density evaluations remains a fundamental challenge in machine learning and natural sciences. Successful approaches construct a bridge between a tractable reference and the target distribution. Parallel Tempering (PT) serves as the gold standard, while recent diffusion-based approaches offer a continuous alternative at the cost of neural training. In this work, we introduce Conditional Diffusion Sampling (CDS), a framework that combines these two paradigms. To this end, we derive Conditional Interpolants, a class of stochastic processes whose transport dynamics are governed by an exact, closed-form stochastic differential equation (SDE), requiring no neural approximation. Although these dynamics require sampling from a non-trivial initialization distribution, we show both theoretically and empirically that the cost of this initialization diminishes for sufficiently short diffusion times. CDS leverages this by a two-stage procedure: (1) PT is used to efficiently sample the initial distribution, and then (2) samples are transported via the transport SDE. This combination couples the robust global exploration of PT with efficient local transport. Experiments suggest that CDS has the potential to achieve a superior trade-off between sample quality and density evaluation cost compared to state-of-the-art samplers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CDS gives an exact closed-form SDE for diffusion transport after a PT initialization step, but the claimed efficiency gain depends on that initialization cost shrinking fast enough at short times.

CDS combines parallel tempering with a diffusion-style transport step that uses an exact SDE instead of a trained network. The new piece is the class of conditional interpolants, which the authors derive so the drift and diffusion terms between a reference measure and the target are available in closed form. They then run PT only to sample the starting distribution at a small diffusion time and let the SDE carry the particles the rest of the way. That two-stage split is the concrete proposal, and it avoids the usual neural training overhead of diffusion samplers while keeping PT's global mixing strength for the hard part of the problem.
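For readers unfamiliar with the first stage, a generic parallel tempering loop can be sketched as below. This is a minimal illustration, not the NRPT variant the paper benchmarks: it assumes a geometric annealing path $\pi_\beta \propto \pi_{\mathrm{ref}}^{1-\beta}\,\pi^{\beta}$, random-walk Metropolis local moves, and alternating even/odd adjacent swaps. The bimodal toy target, the reference $\mathcal{N}(0, 3^2)$, and every function name and parameter value are hypothetical; in CDS the distribution at $\beta = 1$ would be the initialization distribution at the chosen small diffusion time rather than the final target.

```python
import numpy as np

def log_ref(x):
    """Unnormalized log density of the reference N(0, 3^2)."""
    return -0.5 * (x / 3.0) ** 2

def log_target(x):
    """Unnormalized log density of a bimodal toy target with modes at -4 and +4."""
    return np.logaddexp(-0.5 * ((x + 4.0) / 0.5) ** 2,
                        -0.5 * ((x - 4.0) / 0.5) ** 2)

def log_anneal(x, beta):
    """Geometric path between reference (beta = 0) and target (beta = 1)."""
    return (1.0 - beta) * log_ref(x) + beta * log_target(x)

def parallel_tempering(n_iters=5_000, n_chains=8, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    betas = np.linspace(0.0, 1.0, n_chains)
    x = 3.0 * rng.standard_normal(n_chains)     # initialize every chain from the reference
    cold = np.empty(n_iters)
    for it in range(n_iters):
        # Local exploration: one random-walk Metropolis step per chain.
        prop = x + step * rng.standard_normal(n_chains)
        log_acc = log_anneal(prop, betas) - log_anneal(x, betas)
        accept = np.log(rng.uniform(size=n_chains)) < log_acc
        x = np.where(accept, prop, x)
        # Communication: propose swaps of adjacent chains, alternating even and odd pairs.
        for i in range(it % 2, n_chains - 1, 2):
            j = i + 1
            ell_i = log_target(x[i]) - log_ref(x[i])
            ell_j = log_target(x[j]) - log_ref(x[j])
            if np.log(rng.uniform()) < (betas[i] - betas[j]) * (ell_j - ell_i):
                x[i], x[j] = x[j], x[i]
        cold[it] = x[-1]                        # state of the beta = 1 chain
    return cold

cold = parallel_tempering()
print("fraction of cold-chain samples in the right-hand mode:", np.mean(cold > 0))
```

The swap step is what lets the $\beta = 1$ chain change modes: states that cross the barrier easily at small $\beta$ are passed up the ladder instead of having to cross it at full sharpness.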

Referee Report

2 major / 1 minor

Summary. The paper introduces Conditional Diffusion Sampling (CDS), a framework that derives Conditional Interpolants as a class of stochastic processes whose transport is governed by an exact, closed-form SDE requiring no neural approximation. It proposes a two-stage procedure in which Parallel Tempering (PT) samples the non-trivial initialization distribution at short diffusion times, after which samples are transported via the SDE; the central claim is that this yields a superior trade-off between sample quality and density-evaluation cost for unnormalized multimodal distributions.

Significance. If the closed-form derivation is correct and the initialization-cost diminution holds, the work supplies a parameter-free bridge between the global exploration of PT and the local transport of diffusion processes, avoiding neural training while retaining exact dynamics; this could meaningfully improve efficiency in sampling tasks where density evaluations are expensive.

major comments (2)

[Abstract] Abstract and the section deriving the SDE: the claim that the initialization cost diminishes for sufficiently short diffusion times is load-bearing for the efficiency argument, yet the abstract provides no explicit bound, scaling argument, or error analysis showing that PT mixing time remains negligible relative to the subsequent transport budget in multimodal targets; without this, the net density-evaluation savings over pure PT are not guaranteed.
[Abstract] The two-stage procedure description: the paper asserts that the SDE transport improves mixing after PT initialization, but no quantitative comparison (e.g., effective sample size or mode coverage per density evaluation) is supplied to demonstrate that the combined budget is asymptotically better than PT alone when modes remain separated at the smallest usable t.

minor comments (1)

Notation for the conditional interpolants and the resulting SDE should be introduced with explicit definitions of all terms (e.g., the reference measure and the conditioning variable) before the closed-form claim is stated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The two major comments identify areas where the abstract can be strengthened to better support the efficiency claims. We address each point below and commit to revisions that clarify the theoretical and empirical support without altering the core contributions.

Referee: [Abstract] Abstract and the section deriving the SDE: the claim that the initialization cost diminishes for sufficiently short diffusion times is load-bearing for the efficiency argument, yet the abstract provides no explicit bound, scaling argument, or error analysis showing that PT mixing time remains negligible relative to the subsequent transport budget in multimodal targets; without this, the net density-evaluation savings over pure PT are not guaranteed.

Authors: We agree that the abstract should explicitly reference the supporting analysis. Section 3 derives the Conditional Interpolants and provides a theoretical argument (via the closed-form SDE and the behavior of the interpolant variance) showing that the initialization cost vanishes as the diffusion time t approaches zero. We will revise the abstract to include a concise statement of this scaling: the PT mixing time at small t becomes negligible relative to the fixed transport budget, as established in the main text. This directly addresses the net savings over pure PT.

revision: yes

Referee: [Abstract] The two-stage procedure description: the paper asserts that the SDE transport improves mixing after PT initialization, but no quantitative comparison (e.g., effective sample size or mode coverage per density evaluation) is supplied to demonstrate that the combined budget is asymptotically better than PT alone when modes remain separated at the smallest usable t.

Authors: The manuscript does not claim asymptotic superiority; the abstract states only that experiments 'suggest that CDS has the potential to achieve a superior trade-off.' Section 5 reports quantitative metrics (effective sample size and mode coverage per density evaluation) on multimodal targets where modes are separated, showing CDS outperforming PT under matched budgets. To strengthen the presentation, we will revise the abstract to emphasize the empirical nature of the comparison and add a brief discussion in the experiments section clarifying the regime of separated modes at small t. No new asymptotic proof is provided, as the work focuses on practical efficiency gains.

revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation of exact SDE or two-stage procedure

full rationale

The paper derives Conditional Interpolants as a class of processes whose transport is governed by an exact closed-form SDE obtained directly from the interpolant definition, without parameter fitting, neural approximation, or reduction to previously fitted quantities. The two-stage CDS procedure (PT for non-trivial initialization followed by SDE transport) rests on a separate theoretical claim that initialization cost vanishes at short diffusion times; this claim is presented as independently verifiable rather than tautological or self-citation-dependent. No load-bearing self-citation, ansatz smuggling, or renaming of known results appears in the derivation chain. The construction is self-contained against external benchmarks such as standard PT and diffusion samplers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of an exact closed-form SDE for conditional interpolants and on the empirical claim that initialization cost becomes negligible for short diffusion times; no free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption: Conditional interpolants admit an exact closed-form SDE governing their transport dynamics. Invoked to justify the second stage of the procedure without neural approximation.
ad hoc to paper: The cost of PT initialization diminishes for sufficiently short diffusion times. Required for the two-stage procedure to be computationally advantageous.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

[1]

PMLR, 2024. Akhound-Sadegh, T., Lee, J., Bose, J., De Bortoli, V., Doucet, A., Bronstein, M. M., Beaini, D., Ravanbakhsh, S., Neklyudov, K., and Tong, A. Progressive inference-time annealing of diffusion models for sampling from Boltzmann densities. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. Albergo, M., Boffi...

[2]

Springer Science & Business Media, 2012. Noé, F., Olsson, S., Köhler, J., and Wu, H. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457):1147, 2019. Nusken, N., Vargas, F., Padhy, S., and Blessing, D. Transport meets variational inference: Controlled Monte Carlo diffusions. In The Twelfth Intern...

[3]

For $\nu$-almost all $x$, $\lim_{t \to 0} \lVert F_{t|z}(x) - z \rVert = 0$.

[4]

There exists an integrable function $g : X \to [0, +\infty)$ such that, for all sufficiently small $t$, $\lVert F_{t|z}(x) - z \rVert \le g(x)$. Then the conditional distribution $\nu_{t|z}$ converges to the Dirac mass $\delta_z$ in the Wasserstein-1 distance:

$$\lim_{t \to 0} W_1(\delta_z, \nu_{t|z}) = 0. \tag{42}$$

Proof. In any normed space, the Wasserstein-1 distance between a Dirac mass $\delta_z$ and an arbitrary probability measure $\mu$ is ...

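[5]

The density $\pi_{t|z}$ satisfies the following continuity equation:

$$\frac{\partial}{\partial t} \pi_{t|z} = -\operatorname{div}\left(\pi_{t|z}\, u_{t|z}\right). \tag{59}$$

[6]

Furthermore, $\pi_{t|z}$ satisfies the following FPK equation:

$$\frac{\partial}{\partial t} \pi_{t|z} = -\operatorname{div}\left(\pi_{t|z}\, a_{t|z}\right) + \frac{\sigma_t^2}{2} \Delta \pi_{t|z}, \tag{60}$$

where $a_{t|z} = u_{t|z} + \frac{\sigma_t^2}{2} \nabla \log \pi_{t|z}$. Proof.

[7]

First, we observe that, as a consequence of the change of variables formula, the following identity holds for any $y \in X$:

$$\pi_{t|z}\left(F_{t|z}(y)\right) J_t(y) = \pi(y), \tag{61}$$

where $J_t(y) = \det J_{F_{t|z}}(y)$. We differentiate with respect to $t$:

$$\frac{\partial}{\partial t}\left[\pi_{t|z} \circ F_{t|z}\right](y)\, J_t(y) + \pi_{t|z}\left(F_{t|z}(y)\right) \frac{\partial}{\partial t} J_t(y) = 0. \tag{62}$$

We analyze the first term. Applying the chain rule: $\frac{\partial}{\partial t}\left[\pi_{t|z} \circ F_{t|z}\right](y) = \frac{\partial \pi_{t|z}}{\partial t}$ ...

[8]

We start from the continuity equation derived above and add and subtract the term $\frac{\sigma_t^2}{2} \Delta \pi_{t|z}$ on the right-hand side:

$$\frac{\partial \pi_{t|z}}{\partial t} = -\operatorname{div}\left(\pi_{t|z}\, u_{t|z}\right) - \frac{\sigma_t^2}{2} \Delta \pi_{t|z} + \frac{\sigma_t^2}{2} \Delta \pi_{t|z}. \tag{69}$$

Using $\Delta \pi = \operatorname{div}(\nabla \pi)$, we obtain:

$$\frac{\partial \pi_{t|z}}{\partial t} = -\operatorname{div}\left(\pi_{t|z}\, u_{t|z} + \frac{\sigma_t^2}{2} \nabla \pi_{t|z}\right) + \frac{\sigma_t^2}{2} \Delta \pi_{t|z}. \tag{70}$$

Next, we use $\nabla \pi_{t|z} = \pi_{t|z} \nabla \log \pi_{t|z}$ to rewrite the term inside the divergence: $\pi_{t|z}\, u_{t|z} + \sigma_t^2$ ...
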
[9]

For every $x \in X$, the map $A \mapsto K(x, A)$ is a probability measure on $X$.

[10]

For every $A \in \mathcal{B}(X)$, the map $x \mapsto K(x, A)$ is measurable. We define the action of the kernel $K$ on a measure $\mu \in \mathcal{P}(X)$ (from the left) as the measure $K(\mu)$ given by:

$$[K(\mu)](A) = \int_X K(x, A)\, d\mu(x), \quad \text{for all } A \in \mathcal{B}(X). \tag{72}$$

We define the $n$-step transition kernel $K^n$ recursively by $K^1 = K$ and $K^n(x, A) = \int_X K(y, A)\, K^{n-1}(x, dy)$. Consistent with the operator notation, $K^n(\mu)$ denotes th...

[11]

(Invariant measure) The pushforward kernel $K_F$ is invariant with respect to $\nu_F$.

[12]

(Algebraic Iteration) For any measure $\mu$ and $n \ge 1$:

$$K_F^n(\mu) = F\#\left(K^n\left(F^{-1}\#\mu\right)\right). \tag{75}$$

[13]

(Wasserstein Lipschitz Bound) For any two measures $\mu_1, \mu_2$:

$$W_1(F\#\mu_1, F\#\mu_2) \le L_F\, W_1(\mu_1, \mu_2). \tag{76}$$

[14]

(Kernel Wasserstein Relation)

$$W_1\left(K_F^n(\delta_z), \nu_F\right) \le L_F\, W_1\left(K^n(\delta_{x_0}), \nu\right). \tag{77}$$

[15]

(Convergence Bounds) If there exists $\rho \in (0, 1)$ such that $W_1(K^n(\delta_x), \nu) \le \rho^n W_1(\delta_x, \nu)$ for all $x$, then:

$$W_1\left(K_F^n(\delta_z), \nu_F\right) \le L_F\, \rho^n\, W_1(\delta_{x_0}, \nu), \tag{78}$$

$$W_1\left(K_F^n(\delta_z), \nu_F\right) \le L_F\, L_{F^{-1}}\, \rho^n\, W_1(\delta_z, \nu_F). \tag{79}$$

Proof. We prove each item separately:

[16]

(Invariant measure) We first verify that $K_F$ is invariant with respect to $\nu_F$:

$$[K_F(\nu_F)](\hat{A}) = \int K_F(\hat{x}, \hat{A})\, \nu_F(d\hat{x}) = \int K\left(F^{-1}(\hat{x}), F^{-1}(\hat{A})\right) (F\#\nu)(d\hat{x}) \tag{80}$$

$$\overset{(i)}{=} \int K\left(x, F^{-1}(\hat{A})\right) \nu(dx) \overset{(ii)}{=} \nu\left(F^{-1}(\hat{A})\right) = (F\#\nu)(\hat{A}) = \nu_F(\hat{A}), \tag{81}$$

where (i) is due to the change of variables theorem ($\hat{x} = F(x)$), and (ii) is due to $K$ being invariant with respect to $\nu$.

[17]

(Algebraic Iteration) We prove this by induction. For $n = 1$, observe that:

$$[K_F(\mu)](\hat{A}) = \int K_F(\hat{x}, \hat{A})\, \mu(d\hat{x}) = \int K\left(F^{-1}(\hat{x}), F^{-1}(\hat{A})\right) \mu(d\hat{x}) \overset{(i)}{=} \int K\left(x, F^{-1}(\hat{A})\right) \left(F^{-1}\#\mu\right)(dx) = \left[K\left(F^{-1}\#\mu\right)\right]\left(F^{-1}(\hat{A})\right) = \left[F\# K\left(F^{-1}\#\mu\right)\right](\hat{A}),$$

where (i) follows from the change of variables. Suppose it is true for $n \ge 1$. Then: $K_F^{(n+1)}(\mu) = K_F\left(K_F^{(n)}(\mu)\right) = K_F\left(F\# K^{(n)}\left(F^{-1}\#\mu\right)\right) = F\# K\big(F^{-1}\#$...

[18]

(Wasserstein Lipschitz Bound) Let $\gamma \in \Gamma(\mu_1, \mu_2)$ be an optimal coupling for $W_1(\mu_1, \mu_2)$. Define the pushforward coupling $\hat{\gamma} = (F, F)\#\gamma$. We first confirm that $\hat{\gamma} \in \Gamma(F\#\mu_1, F\#\mu_2)$:

$$\hat{\gamma}\left(\mathbb{R}^d \times \hat{A}\right) = \gamma\left(\mathbb{R}^d \times F^{-1}(\hat{A})\right) = \mu_2\left(F^{-1}(\hat{A})\right) = (F\#\mu_2)(\hat{A}),$$

$$\hat{\gamma}\left(\hat{A} \times \mathbb{R}^d\right) = \gamma\left(F^{-1}(\hat{A}) \times \mathbb{R}^d\right) = \mu_1\left(F^{-1}(\hat{A})\right) = (F\#\mu_1)(\hat{A}).$$

Next, we bound the transport cost using the Lipschitz property of $F$: $W_1(F\#\mu$ ...

[19]

(Kernel Wasserstein Relation) We apply the previous two results. Using Property 1 with $\mu = \delta_z$:

$$K_F^n(\delta_z) = F\#\left(K^n\left(F^{-1}\#\delta_z\right)\right) = F\#\left(K^n(\delta_{x_0})\right), \tag{82}$$

where $x_0 = F^{-1}(z)$. Also recall $\nu_F = F\#\nu$. Now apply Property 2 with $\mu_1 = K^n(\delta_{x_0})$ and $\mu_2 = \nu$:

$$W_1\left(K_F^n(\delta_z), \nu_F\right) = W_1\left(F\# K^n(\delta_{x_0}), F\#\nu\right) \le L_F\, W_1\left(K^n(\delta_{x_0}), \nu\right).$$

[20]

(Convergence Bounds) Assume the base kernel satisfies $W_1(K^n(\delta_x), \nu) \le \rho^n W_1(\delta_x, \nu)$. Substituting this into the result from Property 3:

$$W_1\left(K_F^n(\delta_z), \nu_F\right) \le L_F\, \rho^n\, W_1(\delta_{x_0}, \nu). \tag{83}$$

To obtain the second bound, we need to relate $W_1(\delta_{x_0}, \nu)$ back to $\nu_F$. Note that $x_0 = F^{-1}(z)$ and $\nu = F^{-1}\#\nu_F$. Applying the Lipschitz bound for the inverse map $F^{-1}$ (analogous to P...

[21]

(Invariant measure) The pushforward kernel $K_{t|z}$ is invariant with respect to $\nu_{t|z}$.

[22]

(Algebraic Iteration) For any measure $\mu$ and $n \ge 1$:

$$K_{t|z}^n(\mu) = F_{t|z}\#\left(K^n\left(F_{t|z}^{-1}\#\mu\right)\right). \tag{87}$$

[23]

(Wasserstein Equality) For any two measures $\mu_1, \mu_2$:

$$W_1\left(F_{t|z}\#\mu_1, F_{t|z}\#\mu_2\right) = t\, W_1(\mu_1, \mu_2). \tag{88}$$

[24]

(Kernel Wasserstein Equality) Since $F_{t|z}^{-1}(z) = z$, the bound becomes an equality:

$$W_1\left(K_{t|z}^n(\delta_z), \nu_{t|z}\right) = t\, W_1\left(K^n(\delta_z), \nu\right). \tag{89}$$

[25]

(Convergence Bounds) If $W_1(K^n(\delta_z), \nu) \le \rho^n W_1(\delta_z, \nu)$, then:

$$W_1\left(K_{t|z}^n(\delta_z), \nu_{t|z}\right) \le t\, \rho^n\, W_1(\delta_z, \nu), \tag{90}$$

$$W_1\left(K_{t|z}^n(\delta_z), \nu_{t|z}\right) \le \rho^n\, W_1\left(\delta_z, \nu_{t|z}\right). \tag{91}$$

Proof. The results follow by applying Proposition 1 to the specific linear bijection $F_{t|z}(x) = (1 - t) z + t x$. Note that:

(Lipschitz Constants) Since $F_{t|z}$ is a homothety with...

[26]

(Wasserstein Scaling) Proposition 1 (Property 3) provides the upper bound:

$$W_1\left(F_{t|z}\#\mu_1, F_{t|z}\#\mu_2\right) \le t\, W_1(\mu_1, \mu_2). \tag{92}$$

To prove equality, we apply the same general bound to the inverse map $F_{t|z}^{-1}$ acting on the measures $\nu_1 = F_{t|z}\#\mu_1$ and $\nu_2 = F_{t|z}\#\mu_2$:

$$W_1(\mu_1, \mu_2) = W_1\left(F_{t|z}^{-1}\#\nu_1, F_{t|z}^{-1}\#\nu_2\right) \le t^{-1}\, W_1(\nu_1, \nu_2). \tag{93}$$

Multiplying by $t$ yields $t\, W_1(\mu_1, \mu_2) \le W_1\big(F_{t|z}$... Combining the upper and lower bounds confirms the equality:

$$W_1\left(F_{t|z}\#\mu_1, F_{t|z}\#\mu_2\right) = t\, W_1(\mu_1, \mu_2). \tag{94}$$

[27]

(Kernel Wasserstein Relation) We start with the result from the previous property using $\mu = \delta_z$:

$$K_{t|z}^n(\delta_z) = F_{t|z}\#\left(K^n\left(F_{t|z}^{-1}\#\delta_z\right)\right). \tag{95}$$

Since $z$ is a fixed point ($F_{t|z}^{-1}(z) = z$), this simplifies to $F_{t|z}\#\left(K^n(\delta_z)\right)$. Additionally, $\nu_{t|z} = F_{t|z}\#\nu$. Applying the exact scaling law derived in Item 2 with $\mu_1 = K^n(\delta_z)$ and $\mu_2 = \nu$: $W_1\left(K_{t|z}^n(\delta_z), \nu_{t|z}\right) = W_1\big(F_{t|z}\# K^n$...

[28]

Normalization: We identify the global minimum and maximum objective values across all methods and all evaluations. The objective vectors for all methods are then normalized to the unit square $[0, 1]^2$ via linear rescaling.

[29]

Reference Front Construction: We construct a best-known Pareto front for each task by pooling the solutions from all methods and filtering for the non-dominated set. The reference hypervolume, $\mathrm{HV}_{\mathrm{ref}}$, is computed based on this combined front using a reference point of $(1.1, 1.1)$ in the normalized space to ensure all boundary solutions are captured.

[30]

Ratio Calculation: The HVR for a specific method is defined as the ratio of its hypervolume to the reference hypervolume:

$$\mathrm{HVR}(\text{method}) = \frac{\mathrm{HV}(\text{method})}{\mathrm{HV}_{\mathrm{ref}}}. \tag{103}$$

An HVR of 1.0 indicates that a method has successfully recovered the entire best-known Pareto front, while lower values indicate a failure to converge or a lack of diversity in the solution set. To obtain the Mean HVR of a method, we average the HVR corresponding to that method over the selected metrics and targets.
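
The three steps quoted in entries [28] to [30] (normalization, reference front construction, ratio calculation) can be sketched for two minimized objectives as follows. This is an illustrative reading of those excerpts, not the authors' code: the function names, the choice of a sweep-based 2-D hypervolume, and the example numbers for the hypothetical methods "A" and "B" are all assumptions.

```python
import numpy as np

def pareto_front(points):
    """Non-dominated subset for two minimized objectives, sorted by the first one."""
    pts = np.unique(np.asarray(points, dtype=float), axis=0)
    order = np.lexsort((pts[:, 1], pts[:, 0]))      # sort by objective 0, then objective 1
    front, best = [], np.inf
    for p in pts[order]:
        if p[1] < best:                             # strictly better in objective 1
            front.append(p)
            best = p[1]
    return np.array(front)

def hypervolume_2d(points, ref=(1.1, 1.1)):
    """Area dominated by the front and bounded by the reference point."""
    hv, prev_y = 0.0, ref[1]
    for x, y in pareto_front(points):
        hv += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return hv

def hypervolume_ratios(results):
    """results: dict mapping method name -> list of raw (cost, error) pairs."""
    pooled = np.vstack([np.asarray(v, dtype=float) for v in results.values()])
    lo, hi = pooled.min(axis=0), pooled.max(axis=0)
    # Step 1: rescale every objective vector to the unit square.
    norm = {k: (np.asarray(v, dtype=float) - lo) / (hi - lo) for k, v in results.items()}
    # Step 2: reference hypervolume from the pooled best-known front.
    hv_ref = hypervolume_2d(np.vstack(list(norm.values())))
    # Step 3: per-method ratio.
    return {k: hypervolume_2d(v) / hv_ref for k, v in norm.items()}

# Made-up numbers for two hypothetical methods; "A" dominates "B" everywhere.
results = {
    "A": [(1, 10), (2, 5), (4, 1)],
    "B": [(1, 12), (3, 6), (4, 4)],
}
ratios = hypervolume_ratios(results)
print(ratios)
```

Because the pooled front in this example consists only of points from "A", its ratio is 1 and the ratio for "B" is strictly smaller, which matches the interpretation given in entry [30].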

[1] [1]

Akhound-Sadegh, T., Lee, J., Bose, J., De Bortoli, V ., Doucet, A., Bronstein, M

PMLR, 2024. Akhound-Sadegh, T., Lee, J., Bose, J., De Bortoli, V ., Doucet, A., Bronstein, M. M., Beaini, D., Ravanbakhsh, S., Neklyudov, K., and Tong, A. Progressive inference- time annealing of diffusion models for sampling from boltzmann densities. InThe Thirty-ninth Annual Confer- ence on Neural Information Processing Systems, 2025. Albergo, M., Boffi...

work page 2024

[2] [2]

No´e, F., Olsson, S., K ¨ohler, J., and Wu, H

Springer Science & Business Media, 2012. No´e, F., Olsson, S., K ¨ohler, J., and Wu, H. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning.Science, 365(6457):1147, 2019. Nusken, N., Vargas, F., Padhy, S., and Blessing, D. Trans- port meets variational inference: Controlled monte carlo diffusions. InThe Twelfth Intern...

work page arXiv 2012

[3] [3]

Forν-almost allx,lim t→0 ∥Ft|z(x)−z∥= 0

work page

[4] [4]

Then, the conditional distributionν t|z converges to the Dirac massδ z in the Wasserstein-1 distance: lim t→0 W1(δz, νt|z) = 0,(42) Proof

There exists an integrable functiong:X →[0,+∞)such that for all sufficiently smallt,∥F t|z(x)−z∥ ≤g(x). Then, the conditional distributionν t|z converges to the Dirac massδ z in the Wasserstein-1 distance: lim t→0 W1(δz, νt|z) = 0,(42) Proof. In any normed space, the Wasserstein-1 distance between a Dirac mass δz and an arbitrary probability measure µ is ...

work page

[5] [5]

The densityπ t|z satisfies the following continuity equation: ∂ ∂t πt|z =−div πt|zut|z ,(59)

work page

[6] [6]

Furthermore,π t|z satisfies the following FPK equation: ∂ ∂t πt|z =−div πt|zat|z + σ2 t 2 ∆πt|z,(60) wherea t|z =u t|z + σ2 t 2 ∇logπ t|z. Proof

work page

[7] [7]

We differentiate with respect tot: ∂ ∂t πt|z ◦F t|z (y) Jt(y) +π t|z Ft|z(y) ∂ ∂t Jt(y) = 0.(62) We analyze the first term

First, we observe that, as a consequence of change of variables formula, the following identity holds for anyy∈ X: πt|z Ft|z(y) Jt(y) =π(y),(61) whereJ t(y) = det JFt|z(y). We differentiate with respect tot: ∂ ∂t πt|z ◦F t|z (y) Jt(y) +π t|z Ft|z(y) ∂ ∂t Jt(y) = 0.(62) We analyze the first term. Applying the chain rule: ∂ ∂t πt|z ◦F t|z (y) = ∂πt|z ∂t Ft|...

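[8]

We start from the continuity equation derived above and add and subtract the term $\frac{\sigma_t^2}{2} \Delta \pi_{t|z}$ to the right-hand side:

$$\frac{\partial \pi_{t|z}}{\partial t} = -\operatorname{div}\big(\pi_{t|z}\, u_{t|z}\big) - \frac{\sigma_t^2}{2} \Delta \pi_{t|z} + \frac{\sigma_t^2}{2} \Delta \pi_{t|z}. \tag{69}$$

Using $\Delta \pi = \operatorname{div}(\nabla \pi)$, we obtain:

$$\frac{\partial \pi_{t|z}}{\partial t} = -\operatorname{div}\Big(\pi_{t|z}\, u_{t|z} + \frac{\sigma_t^2}{2} \nabla \pi_{t|z}\Big) + \frac{\sigma_t^2}{2} \Delta \pi_{t|z}. \tag{70}$$

Next, we use $\nabla \pi_{t|z} = \pi_{t|z} \nabla \log \pi_{t|z}$ to rewrite the term inside the divergence: πt|zut|z + σ2...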
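The passage from the continuity equation (59) to the FPK equation (60) can be sanity-checked numerically. The following is a minimal sketch under assumptions that are not part of the derivation itself: one dimension, a standard Gaussian base density $\pi$, the linear interpolant $F_{t|z}(x) = (1-t)z + tx$ (so that $\pi_{t|z}$ is Gaussian with mean $(1-t)z$ and standard deviation $t$, and $u_{t|z}(x) = (x - z)/t$), and a constant $\sigma_t$. All function names and constants are hypothetical. Both residuals are evaluated with central finite differences and should vanish up to discretization error.

```python
import math

Z = 1.5       # conditioning point z (illustrative value)
SIGMA = 0.7   # constant diffusion coefficient sigma_t (assumption)
H = 1e-4      # finite-difference step

def density(t, x):
    """pi_{t|z}: Gaussian with mean (1 - t) z and std t (assumed setup)."""
    mean, std = (1.0 - t) * Z, t
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def velocity(t, x):
    """u_{t|z}(x) = (x - z) / t, the velocity of the linear interpolant."""
    return (x - Z) / t

def score(t, x):
    """grad log pi_{t|z}(x) for the Gaussian above."""
    return -(x - (1.0 - t) * Z) / t ** 2

def drift(t, x):
    """a_{t|z} = u_{t|z} + (sigma^2 / 2) grad log pi_{t|z}."""
    return velocity(t, x) + 0.5 * SIGMA ** 2 * score(t, x)

def d_dt(f, t, x):
    return (f(t + H, x) - f(t - H, x)) / (2.0 * H)

def d_dx(f, t, x):
    return (f(t, x + H) - f(t, x - H)) / (2.0 * H)

def laplacian(f, t, x):
    return (f(t, x + H) - 2.0 * f(t, x) + f(t, x - H)) / H ** 2

def continuity_residual(t, x):
    """d/dt pi + div(pi u); zero when the continuity equation holds."""
    flux = lambda s, y: density(s, y) * velocity(s, y)
    return d_dt(density, t, x) + d_dx(flux, t, x)

def fpk_residual(t, x):
    """d/dt pi + div(pi a) - (sigma^2 / 2) Laplacian(pi); zero under FPK."""
    flux = lambda s, y: density(s, y) * drift(s, y)
    return (d_dt(density, t, x) + d_dx(flux, t, x)
            - 0.5 * SIGMA ** 2 * laplacian(density, t, x))

for x in (0.0, 0.75, 1.5):
    print(x, continuity_residual(0.5, x), fpk_residual(0.5, x))
```

The two residuals agree up to finite-difference error because the extra drift term and the Laplacian cancel exactly, which is the content of the add-and-subtract step.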

[9]

For every $x \in \mathcal{X}$, the map $A \mapsto K(x, A)$ is a probability measure on $\mathcal{X}$.

[10]

For every $A \in \mathcal{B}(\mathcal{X})$, the map $x \mapsto K(x, A)$ is measurable. We define the action of the kernel $K$ on a measure $\mu \in \mathcal{P}(\mathcal{X})$ (from the left) as the measure $K(\mu)$ given by:

$$[K(\mu)](A) = \int_{\mathcal{X}} K(x, A)\, d\mu(x), \quad \text{for all } A \in \mathcal{B}(\mathcal{X}). \tag{72}$$

We define the $n$-step transition kernel $K^n$ recursively by $K^1 = K$ and $K^n(x, A) = \int_{\mathcal{X}} K(y, A)\, K^{n-1}(x, dy)$. Consistent with the operator notation, $K^n(\mu)$ denotes th...
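On a finite state space these definitions reduce to matrix operations, which can make the notation easier to parse. The sketch below is an illustration only: the kernel is a hypothetical row-stochastic matrix `K` with `K[x][a]` playing the role of $K(x, \{a\})$, measures are lists of masses, and the function names are not from the paper.

```python
def apply_kernel(K, mu):
    """Action (72) on a finite space: [K(mu)](a) = sum_x K(x, a) mu(x)."""
    n = len(K)
    return [sum(mu[x] * K[x][a] for x in range(n)) for a in range(n)]

def n_step_kernel(K, n):
    """n-step kernel: K^1 = K and K^n(x, a) = sum_y K(y, a) K^{n-1}(x, y)."""
    size = len(K)
    Kn = [row[:] for row in K]
    for _ in range(n - 1):
        Kn = [[sum(Kn[x][y] * K[y][a] for y in range(size)) for a in range(size)]
              for x in range(size)]
    return Kn

# Hypothetical three-state kernel; each row sums to one.
K = [[0.9, 0.1, 0.0],
     [0.2, 0.7, 0.1],
     [0.0, 0.3, 0.7]]
mu = [1.0, 0.0, 0.0]   # Dirac mass at state 0

# Applying K three times to mu ...
iterated = mu
for _ in range(3):
    iterated = apply_kernel(K, iterated)

# ... matches applying the 3-step kernel once.
direct = apply_kernel(n_step_kernel(K, 3), mu)
print(iterated)
print(direct)
```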

[11]

(Invariant measure) The pushforward kernel $K_F$ is invariant with respect to $\nu_F$.

[12]

(Algebraic Iteration) For any measure $\mu$ and $n \ge 1$:

$$K_F^n(\mu) = F_\#\Big(K^n\big((F^{-1})_\#\mu\big)\Big). \tag{75}$$

[13]

(Wasserstein Lipschitz Bound) For any two measures $\mu_1, \mu_2$:

$$W_1(F_\#\mu_1, F_\#\mu_2) \le L_F\, W_1(\mu_1, \mu_2). \tag{76}$$

[14]

(Kernel Wasserstein Relation)

$$W_1\big(K_F^n(\delta_z), \nu_F\big) \le L_F\, W_1\big(K^n(\delta_{x_0}), \nu\big). \tag{77}$$

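[15]

(Convergence Bounds) If there exists $\rho \in (0, 1)$ such that $W_1(K^n(\delta_x), \nu) \le \rho^n\, W_1(\delta_x, \nu)$ for all $x$, then:

$$W_1\big(K_F^n(\delta_z), \nu_F\big) \le L_F\, \rho^n\, W_1(\delta_{x_0}, \nu), \tag{78}$$

$$W_1\big(K_F^n(\delta_z), \nu_F\big) \le L_F\, L_{F^{-1}}\, \rho^n\, W_1(\delta_z, \nu_F). \tag{79}$$

Proof. We prove each item separately: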

[16]

(Invariant measure) We first verify that $K_F$ is invariant with respect to $\nu_F$.

$$[K_F(\nu_F)](\hat{A}) = \int K_F(\hat{x}, \hat{A})\, \nu_F(d\hat{x}) = \int K\big(F^{-1}(\hat{x}), F^{-1}(\hat{A})\big)\, (F_\#\nu)(d\hat{x}) \tag{80}$$

$$\overset{(i)}{=} \int K\big(x, F^{-1}(\hat{A})\big)\, \nu(dx) \overset{(ii)}{=} \nu\big(F^{-1}(\hat{A})\big) = (F_\#\nu)(\hat{A}) = \nu_F(\hat{A}), \tag{81}$$

where (i) is due to the change of variables theorem ($\hat{x} = F(x)$), and (ii) is due to $K$ being invariant with respect to $\nu$.
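The invariance argument can be replayed on a finite state space, where $F$ is a permutation of the states. The sketch below is a toy check, not the paper's construction: the base kernel is a hypothetical Metropolis kernel built to leave a chosen `nu` invariant, the pushforward kernel is taken to be $K_F(\hat{x}, \hat{A}) = K(F^{-1}(\hat{x}), F^{-1}(\hat{A}))$ as in the display above, and all names are illustrative. The same setup also checks the $n = 1$ case of the algebraic iteration property.

```python
def metropolis_kernel(nu):
    """Row-stochastic kernel with invariant measure nu (uniform proposals)."""
    n = len(nu)
    K = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                K[x][y] = min(1.0, nu[y] / nu[x]) / (n - 1)
        K[x][x] = 1.0 - sum(K[x])   # rejection mass stays at x
    return K

def apply_kernel(K, mu):
    """[K(mu)](a) = sum_x K(x, a) mu(x)."""
    n = len(K)
    return [sum(mu[x] * K[x][a] for x in range(n)) for a in range(n)]

def inverse(F):
    """Inverse of a permutation given as a list: F[x] is the image of x."""
    F_inv = [0] * len(F)
    for x, fx in enumerate(F):
        F_inv[fx] = x
    return F_inv

def pushforward_measure(F, mu):
    """(F # mu)(a) = mu(F^{-1}(a))."""
    out = [0.0] * len(mu)
    for x, mass in enumerate(mu):
        out[F[x]] += mass
    return out

def pushforward_kernel(F, K):
    """K_F(x_hat, a_hat) = K(F^{-1}(x_hat), F^{-1}(a_hat))."""
    F_inv = inverse(F)
    n = len(K)
    return [[K[F_inv[xh]][F_inv[ah]] for ah in range(n)] for xh in range(n)]

nu = [0.5, 0.3, 0.2]   # hypothetical target on three states
F = [2, 0, 1]          # hypothetical bijection (a permutation)
K = metropolis_kernel(nu)
K_F = pushforward_kernel(F, K)
nu_F = pushforward_measure(F, nu)

# Invariance of nu_F under K_F.
invariance_gap = max(abs(a - b) for a, b in zip(apply_kernel(K_F, nu_F), nu_F))

# n = 1 case of the algebraic iteration: K_F(mu) = F # K(F^{-1} # mu).
mu = [0.1, 0.6, 0.3]
lhs = apply_kernel(K_F, mu)
rhs = pushforward_measure(F, apply_kernel(K, pushforward_measure(inverse(F), mu)))
iteration_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
print(invariance_gap, iteration_gap)
```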

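[17]

(Algebraic Iteration) We prove this by induction. For $n = 1$, observe that:

$$\big(K_F(\mu)\big)(\hat{A}) = \int K_F(\hat{x}, \hat{A})\, \mu(d\hat{x}) = \int K\big(F^{-1}(\hat{x}), F^{-1}(\hat{A})\big)\, \mu(d\hat{x})$$

$$\overset{(i)}{=} \int K\big(x, F^{-1}(\hat{A})\big)\, \big((F^{-1})_\#\mu\big)(dx) = \big[K\big((F^{-1})_\#\mu\big)\big]\big(F^{-1}(\hat{A})\big) = \Big[F_\#\Big(K\big((F^{-1})_\#\mu\big)\Big)\Big](\hat{A}),$$

where (i) follows from the change of variables. Suppose it is true for $n \ge 1$. Then: $K_F^{(n+1)}(\mu) = K_F\big(K_F^{(n)}(\mu)\big) = K_F\Big(F_\# K^{(n)}\big((F^{-1})_\#\mu\big)\Big) =$ F# K F −1#...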

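[18]

(Wasserstein Lipschitz Bound) Let $\gamma \in \Gamma(\mu_1, \mu_2)$ be an optimal coupling for $W_1(\mu_1, \mu_2)$. Define the pushforward coupling $\hat{\gamma} = (F, F)_\#\gamma$. We first confirm that $\hat{\gamma} \in \Gamma(F_\#\mu_1, F_\#\mu_2)$:

$$\hat{\gamma}\big(\mathbb{R}^d \times \hat{A}\big) = \gamma\big(\mathbb{R}^d \times F^{-1}(\hat{A})\big) = \mu_2\big(F^{-1}(\hat{A})\big) = (F_\#\mu_2)(\hat{A}),$$

$$\hat{\gamma}\big(\hat{A} \times \mathbb{R}^d\big) = \gamma\big(F^{-1}(\hat{A}) \times \mathbb{R}^d\big) = \mu_1\big(F^{-1}(\hat{A})\big) = (F_\#\mu_1)(\hat{A}).$$

Next, we bound the transport cost using the Lipschitz property of $F$: W1(F#µ ...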

[19]

(Kernel Wasserstein Relation) We apply the previous two results. Using Property 1 with $\mu = \delta_z$:

$$K_F^n(\delta_z) = F_\#\Big(K^n\big((F^{-1})_\#\delta_z\big)\Big) = F_\#\big(K^n(\delta_{x_0})\big), \tag{82}$$

where $x_0 = F^{-1}(z)$. Also recall $\nu_F = F_\#\nu$. Now apply Property 2 with $\mu_1 = K^n(\delta_{x_0})$ and $\mu_2 = \nu$:

$$W_1\big(K_F^n(\delta_z), \nu_F\big) = W_1\big(F_\# K^n(\delta_{x_0}), F_\#\nu\big) \le L_F\, W_1\big(K^n(\delta_{x_0}), \nu\big).$$

[20]

(Convergence Bounds) Assume the base kernel satisfies $W_1(K^n(\delta_x), \nu) \le \rho^n\, W_1(\delta_x, \nu)$. Substituting this into the result from Property 3:

$$W_1\big(K_F^n(\delta_z), \nu_F\big) \le L_F\, \rho^n\, W_1(\delta_{x_0}, \nu). \tag{83}$$

To obtain the second bound, we need to relate $W_1(\delta_{x_0}, \nu)$ back to $\nu_F$. Note that $x_0 = F^{-1}(z)$ and $\nu = (F^{-1})_\#\nu_F$. Applying the Lipschitz bound for the inverse map $F^{-1}$ (analogous to P...

[21]

(Invariant measure) The pushforward kernel $K_{t|z}$ is invariant with respect to $\nu_{t|z}$.

[22]

(Algebraic Iteration) For any measure $\mu$ and $n \ge 1$:

$$K^n_{t|z}(\mu) = (F_{t|z})_\#\Big(K^n\big((F_{t|z}^{-1})_\#\mu\big)\Big). \tag{87}$$

[23]

(Wasserstein Equality) For any two measures $\mu_1, \mu_2$:

$$W_1\big((F_{t|z})_\#\mu_1, (F_{t|z})_\#\mu_2\big) = t\, W_1(\mu_1, \mu_2). \tag{88}$$

[24]

(Kernel Wasserstein Equality) Since $F_{t|z}^{-1}(z) = z$, the bound becomes an equality:

$$W_1\big(K^n_{t|z}(\delta_z), \nu_{t|z}\big) = t\, W_1\big(K^n(\delta_z), \nu\big). \tag{89}$$

[25]

(Convergence Bounds) If $W_1(K^n(\delta_z), \nu) \le \rho^n\, W_1(\delta_z, \nu)$, then:

$$W_1\big(K^n_{t|z}(\delta_z), \nu_{t|z}\big) \le t\, \rho^n\, W_1(\delta_z, \nu), \tag{90}$$

$$W_1\big(K^n_{t|z}(\delta_z), \nu_{t|z}\big) \le \rho^n\, W_1(\delta_z, \nu_{t|z}). \tag{91}$$

Proof. The results follow by applying Proposition 1 to the specific linear bijection $F_{t|z}(x) = (1-t)z + tx$. Note that:

• (Lipschitz Constants) Since $F_{t|z}$ is a homothety with...

[26]

Combining the upper and lower bounds confirms the equality:

$$W_1\big((F_{t|z})_\#\mu_1, (F_{t|z})_\#\mu_2\big) = t\, W_1(\mu_1, \mu_2). \tag{94}$$

(Wasserstein Scaling) Proposition 1 (Property 3) provides the upper bound:

$$W_1\big((F_{t|z})_\#\mu_1, (F_{t|z})_\#\mu_2\big) \le t\, W_1(\mu_1, \mu_2). \tag{92}$$

To prove equality, we apply the same general bound to the inverse map $F_{t|z}^{-1}$ acting on the measures $\nu_1 = (F_{t|z})_\#\mu_1$ and $\nu_2 = (F_{t|z})_\#\mu_2$:

$$W_1(\mu_1, \mu_2) = W_1\big((F_{t|z}^{-1})_\#\nu_1, (F_{t|z}^{-1})_\#\nu_2\big) \le t^{-1}\, W_1(\nu_1, \nu_2). \tag{93}$$

Multiplying by $t$ yields tW1 (µ1, µ2)≤W 1 Ft|z...
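The exact scaling of $W_1$ under the homothety $F_{t|z}(x) = (1-t)z + tx$ is easy to confirm on samples. The sketch below is an illustration under assumptions: it is restricted to one dimension and to two empirical measures with the same number of atoms, for which $W_1$ equals the mean absolute difference of the sorted samples. The two sample sets and the names `w1_empirical` and `interpolant` are hypothetical.

```python
import random

def w1_empirical(a, b):
    """W1 between two 1D empirical measures with equally many atoms.

    In one dimension the optimal coupling matches sorted samples.
    """
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def interpolant(x, z, t):
    """Linear bijection F_{t|z}(x) = (1 - t) z + t x."""
    return (1.0 - t) * z + t * x

random.seed(1)
z, t = -0.5, 0.2                                          # illustrative values
mu1 = [random.gauss(0.0, 1.0) for _ in range(1000)]       # samples from mu_1
mu2 = [random.uniform(-3.0, 3.0) for _ in range(1000)]    # samples from mu_2

before = w1_empirical(mu1, mu2)
after = w1_empirical([interpolant(x, z, t) for x in mu1],
                     [interpolant(x, z, t) for x in mu2])
print(before, after, t * before)   # the last two values should coincide
```

The shift by $(1-t)z$ cancels in every matched pair, so only the factor $t$ survives, which is why the bound is an equality rather than an inequality for this map.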

[27]

(Kernel Wasserstein Relation) We start with the result from the previous property using $\mu = \delta_z$:

$$K^n_{t|z}(\delta_z) = (F_{t|z})_\#\Big(K^n\big((F_{t|z}^{-1})_\#\delta_z\big)\Big). \tag{95}$$

Since $z$ is a fixed point ($F_{t|z}^{-1}(z) = z$), this simplifies to $(F_{t|z})_\#\big(K^n(\delta_z)\big)$. Additionally, $\nu_{t|z} = (F_{t|z})_\#\nu$. Applying the exact scaling law derived in Item 2 with $\mu_1 = K^n(\delta_z)$ and $\mu_2 = \nu$: $W_1\big(K^n_{t|z}(\delta_z), \nu_{t|z}\big) =$ W1 Ft|z#K n...

[28]

Normalization: We identify the global minimum and maximum objective values across all methods and all evaluations. The objective vectors for all methods are then normalized to the unit square $[0, 1]^2$ via linear rescaling.

[29]

Reference Front Construction: We construct a best-known Pareto front for each task by pooling the solutions from all methods and filtering for the non-dominated set. The reference hypervolume, $\mathrm{HV}_{\mathrm{ref}}$, is computed based on this combined front using a reference point of $(1.1, 1.1)$ in the normalized space to ensure all boundary solutions are captured.

[30]

To obtain the Mean HVR of a method, we average the HVR corresponding to that method over the selected metrics and targets.

Ratio Calculation: The HVR for a specific method is defined as the ratio of its hypervolume to the reference hypervolume:

$$\mathrm{HVR}(\text{method}) = \frac{\mathrm{HV}(\text{method})}{\mathrm{HV}_{\mathrm{ref}}}. \tag{103}$$

An HVR of 1.0 indicates that a method has successfully recovered the entire best-known Pareto front, while lower values indicate a failure to converge or a lack of diversity in the solution set. To...
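The normalization, reference front construction, and ratio calculation can be sketched in a few lines. This is a minimal illustration under assumptions that the text does not spell out: both objectives are minimized, the min-max rescaling is applied per objective, and each objective takes at least two distinct values (otherwise the rescaling divides by zero). The function names and the toy objective values are hypothetical and are not results from the paper.

```python
def normalize(points_by_method):
    """Rescale all objective vectors to [0, 1]^2 using global per-objective min/max."""
    all_pts = [p for pts in points_by_method.values() for p in pts]
    lo = [min(p[i] for p in all_pts) for i in range(2)]
    hi = [max(p[i] for p in all_pts) for i in range(2)]
    scale = lambda p: tuple((p[i] - lo[i]) / (hi[i] - lo[i]) for i in range(2))
    return {m: [scale(p) for p in pts] for m, pts in points_by_method.items()}

def pareto_front(points):
    """Non-dominated subset under minimization, sorted by the first objective."""
    front = [p for p in points
             if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]
    return sorted(set(front))

def hypervolume_2d(points, ref=(1.1, 1.1)):
    """Area dominated by `points` and bounded by the reference point."""
    hv, prev_y = 0.0, ref[1]
    for x, y in pareto_front(points):     # x ascending, y descending
        hv += (ref[0] - x) * (prev_y - y)  # horizontal strip added by this point
        prev_y = y
    return hv

def hvr_per_method(points_by_method):
    """HVR(method) = HV(method) / HV_ref, with HV_ref from the pooled front."""
    normed = normalize(points_by_method)
    pooled = [p for pts in normed.values() for p in pts]
    hv_ref = hypervolume_2d(pooled)
    return {m: hypervolume_2d(pts) / hv_ref for m, pts in normed.items()}

# Toy objective vectors for two hypothetical methods.
raw = {"A": [(1.0, 5.0), (3.0, 2.0)],
       "B": [(2.0, 3.0), (5.0, 1.0)]}
hvr = hvr_per_method(raw)
print(hvr)
```

A Mean HVR would then be the plain average of these per-task ratios over the selected metrics and targets.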
