Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)

Aleksei Leonov; Alexander Korotin; David Li; Evgeny Burnaev; Iaroslav Koshelev; Nikita Gushchin; Nikita Kornilov; Tikhon Mavrin

arxiv: 2509.22459 · v4 · pith:PDS2SGDOnew · submitted 2025-09-26 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-21 22:12 UTC · model grok-4.3

keywords distillation · matching models · diffusion models · flow matching · real data supervision · one-step generation · generative models

The pith

RealUID is a universal distillation framework that lets any matching model use real data directly to train one-step generators without GANs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RealUID as a distillation approach that trains fast one-step generators from slow iterative matching models such as diffusion and flow models. It shows how real data can be added directly into the distillation loss to guide the student model, removing the need for extra discriminator networks and adversarial training. The method supplies a single theoretical setup that recovers earlier distillation techniques for flow matching and diffusion as special cases while also covering extensions like bridge matching and stochastic interpolants. A sympathetic reader would care because the approach simplifies the creation of efficient, high-quality generative models that improve when real data is available, and it avoids the extra complexity and instability often tied to GAN-based supervision.

Core claim

RealUID is a universal inverse distillation framework for matching models that seamlessly incorporates real data into the distillation procedure without GANs and supplies a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models while extending to Bridge Matching and Stochastic Interpolants.

What carries the argument

The RealUID distillation loss, which directly combines guidance from a pre-trained teacher matching model with explicit matching to real data samples.
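
As a rough illustration of how such a loss can combine the two signals, the sketch below estimates a two-term objective for the auxiliary "fake" model: one squared-error term on noised generator samples weighted by α with targets scaled by β/α, and one on noised real samples weighted by 1 − α with targets scaled by (1 − β)/(1 − α). This weighting follows the real-data-modified matching loss as it appears in the paper's appendix. The function names, the list-of-vectors inputs, and the handling of the α = 1 edge case are assumptions made for this example and are not taken from the authors' code.

```python
def squared_error(pred, target, scale):
    """Mean over samples of ||pred - scale * target||^2 for lists of vectors."""
    total = 0.0
    for p_vec, t_vec in zip(pred, target):
        total += sum((p - scale * t) ** 2 for p, t in zip(p_vec, t_vec))
    return total / len(pred)


def real_um_loss(fake_on_gen, target_gen, fake_on_real, target_real, alpha, beta):
    """Monte Carlo estimate of a real-data-modified matching loss (hypothetical helper).

    fake_on_gen  : fake-model outputs on noised generator samples
    target_gen   : conditional targets for those generator samples
    fake_on_real : fake-model outputs on noised real samples
    target_real  : conditional targets for those real samples
    alpha, beta  : coefficients in (0, 1]
    """
    if not (0.0 < alpha <= 1.0 and 0.0 < beta <= 1.0):
        raise ValueError("alpha and beta must lie in (0, 1]")
    gen_term = alpha * squared_error(fake_on_gen, target_gen, beta / alpha)
    if alpha == 1.0:
        # Assumption of this sketch: the squared form is only used with beta = 1
        # when alpha = 1, since the real-data target scale divides by (1 - alpha).
        if beta != 1.0:
            raise ValueError("this squared form needs alpha < 1 when beta < 1")
        return gen_term  # real-data weight is zero: the data-free case
    real_scale = (1.0 - beta) / (1.0 - alpha)
    real_term = (1.0 - alpha) * squared_error(fake_on_real, target_real, real_scale)
    return gen_term + real_term
```

With α = β = 1 the real-data term drops out and only the generator-sample term is left, which is the data-free setting described above.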

If this is right

Earlier data-free distillation techniques for flow and diffusion become direct instances of RealUID when the real-data term is omitted.
The same procedure applies without modification to bridge matching and stochastic interpolants.
Distilled one-step models gain improved sample quality from real-data supervision while retaining the speed advantage of single-step inference.
No extra discriminator network is required when moving from data-free to real-data distillation.
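
The first consequence in this list can be checked on the linearized form of the loss given in the paper's appendix. The block below restates that form, with δ denoting a reparameterization of the fake model relative to the teacher, x^θ denoting generator samples, and x^* denoting real samples; the typesetting is a reconstruction from the extracted text and may differ from the paper's exact layout.

```latex
\begin{aligned}
\mathcal{L}^{\alpha,\beta}_{\text{R-UID}}(\delta, p^\theta_0)
 ={}& \mathbb{E}_{t \sim [0,T]}\,
      \mathbb{E}_{x^\theta_0 \sim p^\theta_0,\; x^\theta_t \sim p^\theta_t(\cdot \mid x^\theta_0)}
      \Big[ -\alpha \|\delta_t(x^\theta_t)\|^2
            + 2\alpha \langle \delta_t(x^\theta_t), f^*_t(x^\theta_t) \rangle
            - 2\beta \langle \delta_t(x^\theta_t), f^\theta_t(x^\theta_t \mid x^\theta_0) \rangle \Big] \\
 &+ \mathbb{E}_{t \sim [0,T]}\,
      \mathbb{E}_{x^*_0 \sim p^*_0,\; x^*_t \sim p^*_t(\cdot \mid x^*_0)}
      \Big[ -(1-\alpha) \|\delta_t(x^*_t)\|^2
            + 2(1-\alpha) \langle \delta_t(x^*_t), f^*_t(x^*_t) \rangle
            - 2(1-\beta) \langle \delta_t(x^*_t), f^*_t(x^*_t \mid x^*_0) \rangle \Big] \\[4pt]
\alpha = \beta = 1 \;\Longrightarrow\;
\mathcal{L}^{1,1}_{\text{R-UID}}(\delta, p^\theta_0)
 ={}& \mathbb{E}_{t \sim [0,T]}\,
      \mathbb{E}_{x^\theta_0 \sim p^\theta_0,\; x^\theta_t \sim p^\theta_t(\cdot \mid x^\theta_0)}
      \Big[ -\|\delta_t(x^\theta_t)\|^2
            + 2 \langle \delta_t(x^\theta_t), f^*_t(x^\theta_t) \rangle
            - 2 \langle \delta_t(x^\theta_t), f^\theta_t(x^\theta_t \mid x^\theta_0) \rangle \Big]
\end{aligned}
```

Every coefficient in the real-data expectation is a multiple of (1 − α) or (1 − β), so setting α = β = 1 removes that expectation entirely and leaves the data-free UID loss.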

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be tested on newly proposed matching-model variants to check whether the direct real-data term remains stable across different noise schedules.
By removing the adversarial component, RealUID may lower the hyper-parameter burden when practitioners adapt distillation to new data domains.
The unification suggests that future matching-model papers could adopt RealUID as a default fast-sampling baseline rather than re-deriving separate distillation losses.

Load-bearing premise

Real data can be incorporated directly into the distillation loss for any matching model without introducing instability or requiring additional adversarial components.

What would settle it

Run RealUID distillation on a standard flow-matching teacher using real image data and measure whether the resulting one-step model shows higher FID or training divergence than a comparable GAN-based real-data baseline on the same dataset and architecture.

Figures

Figures reproduced from arXiv: 2509.22459 by Aleksei Leonov, Alexander Korotin, David Li, Evgeny Burnaev, Iaroslav Koshelev, Nikita Gushchin, Nikita Kornilov, Tikhon Mavrin.

**Figure 1.** Pipeline of our RealUID distillation framework (§3) with the direct incorporation of real data $p^*_0$ adjusted by hyperparameters α, β ∈ (0, 1]. In the figure, it is depicted for Flow Matching models predicting denoised samples. It distills a costly frozen teacher model $f^*$ (blue) into a one-step generator $G_\theta$ (red) upon min-max optimization of the $\mathcal{L}^{\alpha,\beta}_{\text{R-UID}}(f, p^\theta_0)$ loss over fake model $f$ (green) and generato…

**Figure 2.** Evolution of FID during CIFAR-10 distillation for (i) the baseline RealUID (α = 1.0, β = 1.0), (ii) the best-performing RealUID configurations, and (iii) subsequent fine-tuning, evaluated in both unconditional and conditional settings. The performances of Teacher Flow and UID+GAN are indicated by horizontal reference lines in their respective colors. Methods that incorporate real data—best-performing RealU…

**Figure 3.** RealUID loss for 1D-Gaussians under various coefficients (α, β).
• Configuration β = α = 1 (UID loss) does not affect uncovered real data points $x_t$ with $p^\theta_t(x_t) \to 0$, $p^*(x_t) \gg 0$: $l_t(x_t, \beta, \alpha) \approx \frac{\|p^\theta_t(x_t) \cdot f^*_t(x_t) - p^\theta_t(x_t) \cdot f^\theta_t(x_t)\|^2}{p^\theta_t(x_t)} = \|f^*_t(x_t) - f^\theta_t(x_t)\|^2 \, p^\theta_t(x_t) \to 0$.
• Configuration β = α < 1 does not affect uncovered real data points $x_t$ with $p^\theta_t(x_t) \to 0$, $p^*(x_t) \gg 0$: l…
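
The behaviour described in the Figure 3 caption can be reproduced numerically with a short sketch. It evaluates the pointwise distance in the form given in the paper's appendix, l_t = ‖(p*(β − α) + α p^θ)·f* − β p^θ·f^θ‖² / ((1 − α) p* + α p^θ), at a single scalar point. The function name and the numeric values chosen for the densities and model outputs are illustrative assumptions; the pair α = 0.94, β = 0.96 is borrowed from the configuration listed for Figure 6.

```python
def pointwise_distance(p_real, p_gen, f_teacher, f_student, alpha, beta):
    """Pointwise RealUID distance at one scalar point x_t (hypothetical helper).

    p_real, p_gen        : densities p*_t(x_t) and p^theta_t(x_t)
    f_teacher, f_student : teacher and student model outputs at x_t
    """
    numerator = ((p_real * (beta - alpha) + alpha * p_gen) * f_teacher
                 - beta * p_gen * f_student) ** 2
    denominator = (1.0 - alpha) * p_real + alpha * p_gen
    return numerator / denominator


# An "uncovered" real point: the generator assigns it almost no mass.
p_real, p_gen = 0.3, 1e-12
f_teacher, f_student = 1.0, -2.0

uid = pointwise_distance(p_real, p_gen, f_teacher, f_student, alpha=1.0, beta=1.0)
equal = pointwise_distance(p_real, p_gen, f_teacher, f_student, alpha=0.9, beta=0.9)
unequal = pointwise_distance(p_real, p_gen, f_teacher, f_student, alpha=0.94, beta=0.96)

print(f"alpha = beta = 1      : {uid:.3e}")
print(f"alpha = beta = 0.9    : {equal:.3e}")
print(f"alpha = 0.94, beta = 0.96: {unequal:.3e}")
```

Under this formula the first two settings give a contribution that vanishes with p^θ, while the setting with β ≠ α keeps a term proportional to p*(β − α)·f*, so the uncovered point still contributes to the distance.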

**Figure 4.** Uncurated samples for unconditional generation by the one-step RealUID (α = 1.0, β = 1.0) trained on CIFAR-10. Quantitative results are reported in …

**Figure 5.** Uncurated samples for unconditional generation by the one-step RealUID (α = 1.0, β = 1.0) + GAN ($\lambda^{G_\theta}_{\text{adv}} = 0.3$, $\lambda^{D}_{\text{adv}} = 1$) trained on CIFAR-10. Quantitative results are reported in …

**Figure 6.** Uncurated samples for unconditional generation by the one-step RealUID (α = 0.94, β = 0.96) trained on CIFAR-10. Quantitative results are reported in …

**Figure 7.** Uncurated samples for unconditional generation by the one-step RealUID (α = 0.94, β = 0.96 | $\alpha_{\text{FT}}$ = 0.94, $\beta_{\text{FT}}$ = 1.0) trained on CIFAR-10. Quantitative results are reported in …

**Figure 8.** Uncurated samples for conditional generation by the one-step RealUID (α = 1.0, β = 1.0) trained on CIFAR-10. Quantitative results are reported in …

**Figure 9.** Uncurated samples for conditional generation by the one-step RealUID (α = 1.0, β = 1.0) + GAN ($\lambda^{G_\theta}_{\text{adv}} = 0.3$, $\lambda^{D}_{\text{adv}} = 1$) trained on CIFAR-10. Quantitative results are reported in …

**Figure 10.** Uncurated samples for conditional generation by the one-step RealUID (α = 0.98, β = 0.96) trained on CIFAR-10. Quantitative results are reported in …

**Figure 11.** Uncurated samples for conditional generation by the one-step RealUID (α = 0.98, β = 0.96 | $\alpha_{\text{FT}}$ = 0.94, $\beta_{\text{FT}}$ = 1.0) trained on CIFAR-10. Quantitative results are reported in …

Original abstract

While achieving exceptional generative quality, modern diffusion, flow, and other matching models suffer from slow inference, as they require many steps of iterative generation. Recent distillation methods address this problem by training efficient one-step generators under the guidance of a pre-trained teacher model. However, these methods are often constrained to only one specific framework, e.g., only to diffusion or only to flow models. Furthermore, these methods are originally data-free, and to benefit from the usage of real data, it is required to use an additional complex adversarial training with an extra discriminator model. In this paper, we present RealUID, a universal distillation framework for all matching models that seamlessly incorporates real data into the distillation procedure without GANs. Our RealUID approach offers a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models, and can be also extended to their modifications, such as Bridge Matching and Stochastic Interpolants. The code can be found in https://github.com/David-cripto/RealUID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RealUID gives a unified distillation loss that folds real data into matching models without GANs and recovers prior methods as special cases, but the generalization proof needs to address whether the mixture preserves contraction properties.

Colleague, the main point is that RealUID defines a single objective for distilling one-step generators from any matching model by mixing real data into the inverse distillation loss, and it shows how flow-matching and diffusion losses emerge as special cases while sketching extensions to bridge matching and stochastic interpolants. The paper does a solid job keeping the framing simple and avoiding extra discriminator networks, which is a practical win over earlier data-free or GAN-based approaches. The GitHub code link also lets people check the implementation directly.

The soft spot sits in the theoretical coverage. The construction replaces the teacher marginal with a real-plus-teacher mixture, yet the derivation does not appear to supply explicit bounds on how that mixture perturbs the Lipschitz constant or fixed-point uniqueness that earlier proofs used. If the velocity field for some extended models is only locally contractive, the claimed simple foundation may require extra regularity conditions that are not stated. Experiments would need to test stability on the new model families to close that gap.

This work is aimed at people who already train flow or diffusion models and want faster inference plus real-data supervision without adding adversarial components. A reader who has implemented distillation losses before will get the most from the unification.

The paper shows honest engagement with the distillation literature and ships reproducible code, so it deserves a serious referee to verify the math details and run the ablations. I would send it to review with a request to clarify the mixture-weight assumptions.

Referee Report

1 major / 2 minor

Summary. The paper introduces RealUID, a universal inverse distillation framework for matching models (including Flow Matching, Diffusion, Bridge Matching, and Stochastic Interpolants). It claims to seamlessly incorporate real-data supervision into the distillation loss without GANs or extra discriminators, while providing a simple theoretical foundation that recovers prior distillation methods as special cases.

Significance. If the central derivation holds and the real-data mixture preserves the required contraction or fixed-point properties across model families, the result would be significant: it unifies existing distillation techniques under one objective, removes the need for adversarial components, and enables direct use of real data for one-step generators. The open-source code link is a positive factor for reproducibility.

major comments (1)

[§3.2–3.3] The universality claim rests on the RealUID loss (mixture of real data and teacher marginal inside the inverse distillation objective) generalizing without additional regularity assumptions. No explicit bound is derived on how the real-data mixture weight perturbs the contraction mapping or fixed-point uniqueness relied upon by prior data-free proofs for Flow Matching and Diffusion; if the perturbation is non-contractive for some vector fields, the claimed simple theoretical foundation does not cover the full family of matching models.

minor comments (2)

[Abstract / §1] The abstract and introduction state that the framework 'covers previous distillation methods' but do not include a short table or explicit reduction showing how the RealUID objective specializes to the cited Flow-Matching and Diffusion losses.
[§3] Notation for the mixture weight and the inverse-distillation operator should be introduced once with a clear definition before being used in the loss equations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review. The feedback highlights an important point regarding the rigor of the theoretical claims. We address the major comment below and commit to revisions that strengthen the presentation without altering the core contributions.

Referee: [§3.2–3.3] The universality claim rests on the RealUID loss (mixture of real data and teacher marginal inside the inverse distillation objective) generalizing without additional regularity assumptions. No explicit bound is derived on how the real-data mixture weight perturbs the contraction mapping or fixed-point uniqueness relied upon by prior data-free proofs for Flow Matching and Diffusion; if the perturbation is non-contractive for some vector fields, the claimed simple theoretical foundation does not cover the full family of matching models.

Authors: We appreciate the referee's careful reading of the theoretical sections. The current derivation in §3.2–3.3 shows that RealUID recovers the data-free inverse distillation objectives exactly when the real-data weight is set to zero, and that the combined loss remains a well-defined expectation under the same marginals used in prior work. We agree that an explicit perturbation bound on the contraction constant would make the universality statement more complete. In the revised manuscript we will add a short subsection (or extended remark) in §3.3 that (i) recalls the Lipschitz and contraction assumptions from the referenced Flow Matching and Diffusion proofs, (ii) treats the real-data term as a bounded perturbation whose Lipschitz constant is controlled by the data distribution's regularity, and (iii) states a sufficient condition on the real-data weight 1 − α such that the overall operator remains contractive whenever that weight is below a threshold determined by the original contraction gap. This addition uses only the regularity already standard in the literature and does not introduce new assumptions. We believe the revised argument will directly address the concern while preserving the paper's claim of a simple unifying foundation.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: derivation introduces independent real-data mixture term without reducing to fitted inputs or self-citation chains

The paper defines RealUID by constructing an inverse distillation objective that mixes real data samples with teacher-generated samples inside the loss, then shows this recovers prior Flow Matching and Diffusion distillation losses as special cases while extending to Bridge Matching and Stochastic Interpolants. This construction adds an explicit real-data supervision term that is not present in the cited data-free baselines, and the universality claim rests on algebraic substitution within the loss rather than on any parameter being fitted to the target result or on a self-citation that forbids alternatives. No equation is presented in which the claimed prediction equals its own input by definition, and the theoretical foundation is stated to be simple and assumption-light without invoking uniqueness theorems from the authors' prior work as load-bearing. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unshown theoretical foundation and the assumption that real data integrates stably.

axioms (1)

domain assumption: A pre-trained teacher matching model can provide reliable guidance for distilling a one-step student when real data is added to the objective.
This premise is required for any teacher-guided distillation and is invoked by the claim that real data is incorporated seamlessly.

pith-pipeline@v0.9.0 · 5734 in / 1216 out tokens · 30240 ms · 2026-05-21T22:12:03.695922+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Universal Matching loss LUM(f, p0) … min-max optimization of Universal Inverse Distillation (UID) loss … RealUID loss with real data (α, β ∈ (0,1])

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Decoupled Weight Decay Regularization

URLhttps://openreview.net/forum?id=XVjTT1nw5z. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff- instruct: A universal approach for transferring knowledge from pre-trained diffusion models. Advances in Neu...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[2] [2]

=E t∼[0,T] Exθ 0∼pθ 0,xθ t ∼pθ t (·|xθ 0)[−α∥δt(xθ t )∥2 + 2α⟨δt(xθ t ), f ∗ t (xθ t )⟩ −2β⟨δ t(xθ t ), f θ t (xθ t |xθ 0)⟩] +Et∼[0,T] Ex∗ 0 ∼p∗ 0 ,x∗ t ∼p∗ t (·|x∗ t )[−(1−α)∥δ t(x∗ t )∥2 + 2(1−α)⟨δ t(x∗ t ), f ∗ t (x∗ t )⟩ −2(1−β)⟨δ t(x∗ t ), f ∗ t (x∗ t |x∗ 0)⟩]. This form provides an alternative definition of coefficients α and β: they define the prop...

work page

[3] [3]

15 the real data, i.e,L α,β R-UID(δ, pθ

=E t∼[0,T] Exθ t ∼pθ t [−α∥δt(xθ t )∥2 + 2α⟨δt(xθ t ), f ∗ t (xθ t )⟩ −2β⟨δ t(xθ t ), f θ t (xθ t )⟩] +Et∼[0,T] Ex∗ t ∼p∗ t [−(1−α)∥δ t(x∗ t )∥2 + 2(1−α)⟨δ t(x∗ t ), f ∗ t (x∗ t )⟩ −2(1−β)⟨δ t(x∗ t ), f ∗ t (x∗ t )⟩].(20) Then, we rescale the generated data terms in RealUID loss (20) using the equality pθ t (xt) = pθ t (xt) p∗ t (xt) p∗ t (xt) for xt ∈R D...

work page

[4] [4]

Finally, we maximize the loss w.r.t

= Et∼[0,T] Ex∗ t ∼p∗ t −[(1−α) +α pθ t (x∗ t ) p∗ t (x∗ t )]∥δt(x∗ t )∥2 + 2[(β−α) +α pθ t (x∗ t ) p∗ t (x∗ t )]⟨δt(x∗ t ), f ∗ t (x∗ t )⟩ −2β pθ t (x∗ t ) p∗ t (x∗ t ) ⟨δt(x∗ t ), f θ t (x∗ t )⟩ . Finally, we maximize the loss w.r.t. δt(x∗ t ) for each x∗ t and t as a quadratic function. The maximum is achieved when δt(x∗ t ) = [(β−α) +α pθ t (x∗ t ) p∗ ...

work page

[5] [5]

It is easy to see that when pθ 0 =p ∗ 0 and f θ =f ∗ this distance achieves its minimal value 0

=E t∼[0,T] Ex∗ t ∼p∗ t   ∥f ∗ t (x∗ t )·((β−α) +α pθ t (x∗ t ) p∗ t (x∗ t ))−f θ t (x∗ t )·β pθ t (x∗ t ) p∗ t (x∗ t ) ∥2 (1−α) +α pθ t (x∗ t ) p∗ t (x∗ t )   . It is easy to see that when pθ 0 =p ∗ 0 and f θ =f ∗ this distance achieves its minimal value 0. Moreover, optimal fake model in this case matches the teacherf ∗, i.e., arg max f Lα,β R-UID(f,...


[6] [6]

$$= \int_{x_t} l_t(x_t,\beta,\alpha)\,dx_t, \qquad l_t(x_t,\beta,\alpha) := \frac{\big\|\big(p_t^*(x_t)(\beta-\alpha)+\alpha p_t^\theta(x_t)\big)\cdot f_t^*(x_t) - \beta p_t^\theta(x_t)\cdot f_t^\theta(x_t)\big\|^2}{(1-\alpha)p_t^*(x_t)+\alpha p_t^\theta(x_t)},$$

where $l_t(x_t,\beta,\alpha)$ denotes the distance for the particular point $x_t$. The total distance is mostly made up of two groups of points: incorrectly generated points from the generator's main domain, i.e., $p_t^\theta(x_t)\gg 0$, $p^*($...
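To get a feel for the pointwise distance $l_t(x_t,\beta,\alpha)$ defined above, the sketch below evaluates it directly from its formula at a few hand-picked points. The density values, vectors and coefficient choices are hypothetical; the "uncovered" case (real mass but no generated mass) is our own illustrative probe of the formula, not a result quoted from the paper.

```python
def pointwise_distance(p_star, p_theta, f_star, f_theta, alpha, beta):
    """l_t(x_t, beta, alpha) for scalar densities and vector-valued f* and f^theta."""
    w_teacher = p_star * (beta - alpha) + alpha * p_theta
    w_student = beta * p_theta
    numerator = sum((w_teacher * fs - w_student * ft) ** 2
                    for fs, ft in zip(f_star, f_theta))
    denominator = (1.0 - alpha) * p_star + alpha * p_theta
    return numerator / denominator

# Densities and functions agree at the point: the distance vanishes.
matched = pointwise_distance(0.7, 0.7, [1.0, -2.0], [1.0, -2.0], alpha=0.9, beta=0.95)

# A point with real mass but no generated mass, probed with alpha == beta ...
uncovered_equal = pointwise_distance(1.0, 0.0, [2.0], [0.0], alpha=0.9, beta=0.9)

# ... and with beta slightly larger than alpha.
uncovered_diff = pointwise_distance(1.0, 0.0, [2.0], [0.0], alpha=0.9, beta=0.95)

print(matched, uncovered_equal, uncovered_diff)
```

In this toy setting the point without generated mass contributes nothing when the two coefficients coincide and a positive amount once they differ, which can be read off the factor $(\beta-\alpha)$ in the numerator.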


[7] [7]

In turn, after real data incorporation, we obtain our RealUID loss (Theorem 2)

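$$= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^\theta\sim p_0^\theta,\,x_t^\theta\sim p_t^\theta(\cdot|x_0^\theta)}\Big[-\|\delta_t(x_t^\theta)\|^2 + 2\langle\delta_t(x_t^\theta), f_t^*(x_t^\theta)\rangle - 2\langle\delta_t(x_t^\theta), f_t^\theta(x_t^\theta|x_0^\theta)\rangle\Big].$$

In turn, after real data incorporation, we obtain our RealUID loss (Theorem 2). Putting the explicit values for RealUM loss (17) in RealUID loss (18), we get the explicit formula: $\mathcal{L}^{\alpha,\beta}_{\text{R-UID}}(\delta, p_0^\theta)$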


[8] [8]

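$$= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^\theta\sim p_0^\theta,\,x_t^\theta\sim p_t^\theta(\cdot|x_0^\theta)}\Big[-\alpha\|\delta_t(x_t^\theta)\|^2 + 2\alpha\langle\delta_t(x_t^\theta), f_t^*(x_t^\theta)\rangle - 2\beta\langle\delta_t(x_t^\theta), f_t^\theta(x_t^\theta|x_0^\theta)\rangle\Big]$$
$$\quad + \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^*\sim p_0^*,\,x_t^*\sim p_t^*(\cdot|x_0^*)}\Big[-(1-\alpha)\|\delta_t(x_t^*)\|^2 + 2(1-\alpha)\langle\delta_t(x_t^*), f_t^*(x_t^*)\rangle - 2(1-\beta)\langle\delta_t(x_t^*), f_t^*(x_t^*|x_0^*)\rangle\Big].$$

These two formulas give an alternative explanation of how to add real data into arbitrary ...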


[9] [9]

on generated data $p_0^\theta\in\mathcal{P}(\mathbb{R}^D)$ with coefficients $\alpha,\beta,\gamma\in(0,1]$: $\mathcal{L}^{\alpha,\beta,\gamma}_{\text{R-UID}}(\delta, p_0^\theta)$


[10] [10]

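Optionally, one can change the default reparameterization $\delta = f^* - f$ or substitute the sampled real data term $f_t^*(x_t^*|x_0^*)$ with the unconditional teacher $f_t^*(x_t^*)$

$$:= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^\theta\sim p_0^\theta,\,x_t^\theta\sim p_t^\theta(\cdot|x_0^\theta)}\Big[-\gamma\|\delta_t(x_t^\theta)\|^2 + 2\alpha\langle\delta_t(x_t^\theta), f_t^*(x_t^\theta)\rangle - 2\beta\langle\delta_t(x_t^\theta), f_t^\theta(x_t^\theta|x_0^\theta)\rangle\Big]$$
$$\quad + \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^*\sim p_0^*,\,x_t^*\sim p_t^*(\cdot|x_0^*)}\Big[-(1-\gamma)\|\delta_t(x_t^*)\|^2 + 2(1-\alpha)\langle\delta_t(x_t^*), f_t^*(x_t^*)\rangle - 2(1-\beta)\langle\delta_t(x_t^*), f_t^*(x_t^*|x_0^*)\rangle\Big].$$

Optionally, one can change the default reparameterization $\delta = f^* - f$ or substitute sampled real...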


[11] [11]

The distances being minimized for RealUID (Lemma 2) and General RealUID (Lemma 3) are almost identical except the scale factor in the denominator

$$= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^*\sim p_t^*}\left[\frac{\big\|\big(p_t^*(x_t^*)(\beta-\alpha)+\alpha p_t^\theta(x_t^*)\big)\cdot f_t^*(x_t^*) - \beta p_t^\theta(x_t^*)\cdot f_t^\theta(x_t^*)\big\|^2}{p_t^*(x_t^*)\big((1-\gamma)p_t^*(x_t^*)+\gamma p_t^\theta(x_t^*)\big)}\right].$$

The distances being minimized for RealUID (Lemma 2) and General RealUID (Lemma 3) are almost identical except for the scale factor in the denominator. Thus, we keep the same recommendations for choosing co...


[12] [12]

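$$= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^\theta\sim p_t^\theta}\Big[-\gamma\|\delta_t(x_t^\theta)\|^2 + 2\alpha\langle\delta_t(x_t^\theta), f_t^*(x_t^\theta)\rangle - 2\beta\langle\delta_t(x_t^\theta), f_t^\theta(x_t^\theta)\rangle\Big]$$
$$\quad + \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^*\sim p_t^*}\Big[-(1-\gamma)\|\delta_t(x_t^*)\|^2 + 2(1-\alpha)\langle\delta_t(x_t^*), f_t^*(x_t^*)\rangle - 2(1-\beta)\langle\delta_t(x_t^*), f_t^*(x_t^*)\rangle\Big].$$

Then, we rescale the generated data terms in the General RealUID loss using the equality $p_t^\theta(x_t) = \frac{p_t^\theta(x_t)}{p_t^*(x_t)}\,p_t^*(x_t)$ for $x_t\in$...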


[13] [13]

Then we maximize the loss w.r.t

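$$= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^*\sim p_t^*}\Big[-\Big[(1-\gamma)+\gamma\tfrac{p_t^\theta(x_t^*)}{p_t^*(x_t^*)}\Big]\|\delta_t(x_t^*)\|^2 + 2\Big[(\beta-\alpha)+\alpha\tfrac{p_t^\theta(x_t^*)}{p_t^*(x_t^*)}\Big]\langle\delta_t(x_t^*), f_t^*(x_t^*)\rangle - 2\beta\tfrac{p_t^\theta(x_t^*)}{p_t^*(x_t^*)}\langle\delta_t(x_t^*), f_t^\theta(x_t^*)\rangle\Big].$$

Then we maximize the loss w.r.t. $\delta_t(x_t^*)$ for each $x_t^*$ and $t$ as a quadratic function. The maximum is achieved when $\delta_t(x_t^*) = \big[(\beta-\alpha)+\alpha\,p_t^\theta(x_t^*)/p_t^*(x$...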


[14] [14]

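$$= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^*\sim p_t^*}\left[\frac{\Big\|f_t^*(x_t^*)\cdot\Big((\beta-\alpha)+\alpha\frac{p_t^\theta(x_t^*)}{p_t^*(x_t^*)}\Big) - f_t^\theta(x_t^*)\cdot\beta\frac{p_t^\theta(x_t^*)}{p_t^*(x_t^*)}\Big\|^2}{(1-\gamma)+\gamma\frac{p_t^\theta(x_t^*)}{p_t^*(x_t^*)}}\right].$$

A.3 SID WITH REAL DATA

We recall that the data-free UID loss (Theorem 1), which is equivalent to the SiD loss with $\alpha_{\text{SiD}} = 1/2$, can be restated via the linearization technique with $\delta = f - f^*$ as $\mathcal{L}_{\text{UID}}(\delta, p_0^\theta)$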


[15] [15]

In turn, after real data incorporation, we obtain our RealUID loss (Theorem 2)

$$= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^\theta\sim p_0^\theta,\,x_t^\theta\sim p_t^\theta(\cdot|x_0^\theta)}\Big[-\|\delta_t(x_t^\theta)\|^2 + 2\langle\delta_t(x_t^\theta), f_t^*(x_t^\theta)\rangle - 2\langle\delta_t(x_t^\theta), f_t^\theta(x_t^\theta|x_0^\theta)\rangle\Big]. \tag{23}$$

In turn, after real data incorporation, we obtain our RealUID loss (Theorem 2). Putting the explicit values for RealUM loss (17) in RealUID loss (18), we get the explicit formula: $\mathcal{L}^{\alpha,\beta}_{\text{R-UID}}(\delta, p_0^\theta)$


[16] [16]

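$$= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^\theta\sim p_0^\theta,\,x_t^\theta\sim p_t^\theta(\cdot|x_0^\theta)}\Big[-\alpha\|\delta_t(x_t^\theta)\|^2 + 2\alpha\langle\delta_t(x_t^\theta), f_t^*(x_t^\theta)\rangle - 2\beta\langle\delta_t(x_t^\theta), f_t^\theta(x_t^\theta|x_0^\theta)\rangle\Big]$$
$$\quad + \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^*\sim p_0^*,\,x_t^*\sim p_t^*(\cdot|x_0^*)}\Big[-(1-\alpha)\|\delta_t(x_t^*)\|^2 + 2(1-\alpha)\langle\delta_t(x_t^*), f_t^*(x_t^*)\rangle - 2(1-\beta)\langle\delta_t(x_t^*), f_t^*(x_t^*|x_0^*)\rangle\Big].$$

These two formulas give an alternative explanation of how to add real data into arbitrary ...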


[17] [17]

Following the structure of generator SiD loss, we propose to scale the first coefficient in our RealUID loss during generator updates

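$$= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^\theta\sim p_0^\theta,\,x_t^\theta\sim p_t^\theta(\cdot|x_0^\theta)}\Big[-2\alpha_{\text{SiD}}\|\delta_t(x_t^\theta)\|^2 + 2\langle\delta_t(x_t^\theta), f_t^*(x_t^\theta)\rangle - 2\langle\delta_t(x_t^\theta), f_t^\theta(x_t^\theta|x_0^\theta)\rangle\Big].$$

Following the structure of the generator SiD loss, we propose to scale the first coefficient in our RealUID loss during generator updates. The whole SiD pipeline with real data, determined by coefficients $\alpha,\beta\in(0,1]$, $\alpha_{\text{SiD}}$ and teacher f ...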


[18] [18]

Minimize the real data modified UM loss $\mathcal{L}^{\alpha,\beta}_{\text{R-UM}}(f, p_0^\theta)$


[19] [19]

(Def. 2) for the fake model $f$ via several update steps: $\mathcal{L}^{\alpha,\beta}_{\text{R-UM}}(f, p_0^\theta)$


[20] [20]

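$$= \alpha\cdot\underbrace{\mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^\theta\sim p_0^\theta,\,x_t^\theta\sim p_t^\theta(\cdot|x_0^\theta)}\Big\|f_t(x_t^\theta) - \tfrac{\beta}{\alpha}\,f_t^\theta(x_t^\theta|x_0^\theta)\Big\|^2}_{\text{generated data } p_0^\theta \text{ term}} + (1-\alpha)\cdot\underbrace{\mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^*\sim p_0^*,\,x_t^*\sim p_t^*(\cdot|x_0^*)}\Big\|f_t(x_t^*) - \tfrac{1-\beta}{1-\alpha}\,f_t^*(x_t^*|x_0^*)\Big\|^2}_{\text{real data } p_0^* \text{ term}}$$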
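As a rough illustration of the real-data-modified UM loss above, the following sketch computes a Monte Carlo estimate from small batches of already-evaluated vectors. It is not the authors' training code: the function name `real_um_loss`, the batch layout (lists of plain Python vectors) and the restriction to $0 < \alpha < 1$ (so that the factor $(1-\beta)/(1-\alpha)$ is defined) are assumptions of this sketch.

```python
def scaled_sq_dist(pred, target, scale):
    """||pred - scale * target||^2 for two equally sized vectors."""
    return sum((p - scale * t) ** 2 for p, t in zip(pred, target))

def real_um_loss(pred_gen, cond_gen, pred_real, cond_real, alpha, beta):
    """Batch estimate of the loss.

    pred_gen[i]  : fake model output f_t(x_t) on a generated sample
    cond_gen[i]  : conditional target f^theta_t(x_t | x_0) for that sample
    pred_real[j] : fake model output f_t(x_t) on a noised real sample
    cond_real[j] : conditional target f*_t(x_t | x_0) for that sample
    """
    if not (0.0 < alpha < 1.0 and 0.0 < beta <= 1.0):
        raise ValueError("this sketch assumes 0 < alpha < 1 and 0 < beta <= 1")
    gen_term = sum(scaled_sq_dist(f, c, beta / alpha)
                   for f, c in zip(pred_gen, cond_gen)) / len(pred_gen)
    real_term = sum(scaled_sq_dist(f, c, (1.0 - beta) / (1.0 - alpha))
                    for f, c in zip(pred_real, cond_real)) / len(pred_real)
    return alpha * gen_term + (1.0 - alpha) * real_term

# Hypothetical two-dimensional batches.
pred_gen = [[1.0, 0.0], [0.0, 1.0]]
cond_gen = [[1.0, 0.0], [0.0, 2.0]]
pred_real = [[2.0, 2.0]]
cond_real = [[2.0, 0.0]]

loss_equal = real_um_loss(pred_gen, cond_gen, pred_real, cond_real, alpha=0.5, beta=0.5)
loss_zero = real_um_loss(cond_gen, cond_gen, cond_real, cond_real, alpha=0.5, beta=0.5)
print(loss_equal, loss_zero)
```

With $\alpha = \beta$ both rescaling factors equal 1, so the estimate reduces to a plain $\alpha$-weighted mix of two regression losses.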


[21] [21]

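Make a generator update step minimizing the loss $\mathcal{L}^{\alpha,\beta}_{\text{R-UID},\alpha_{\text{SiD}}}(p_0^\theta)$ with $\delta = f - f^*$: $\mathcal{L}^{\alpha,\beta}_{\text{R-UID},\alpha_{\text{SiD}}}(p_0^\theta)$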


[22] [22]

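We keep the same recommendations for choosing coefficients $\alpha, \beta$ as we discuss in Appendix A.1.2

$$= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^\theta\sim p_0^\theta,\,x_t^\theta\sim p_t^\theta(\cdot|x_0^\theta)}\Big[-2\alpha_{\text{SiD}}\cdot\alpha\cdot\|\delta_t(x_t^\theta)\|^2 + 2\alpha\langle\delta_t(x_t^\theta), f_t^*(x_t^\theta)\rangle - 2\beta\langle\delta_t(x_t^\theta), f_t^\theta(x_t^\theta|x_0^\theta)\rangle\Big].$$

We keep the same recommendations for choosing coefficients $\alpha, \beta$ as we discuss in Appendix A.1.2. The optimal choice is slightly different values $\alpha\neq\beta$ that are both close to 1. Following (Zhou et al., 2024a), the best choice for $\alpha_{\text{SiD}}$ i...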
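The generator loss above is a batch average of three inner-product terms. The sketch below only evaluates that expression for given vectors. It takes $\delta$ as a direct input rather than building it from a fake model, and it ignores how gradients flow to the generator; the function name `generator_loss` and all numbers are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equally sized vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def generator_loss(deltas, teacher, cond_student, alpha, beta, alpha_sid):
    """Batch mean of -2*a_sid*alpha*||d||^2 + 2*alpha*<d, f*> - 2*beta*<d, f^theta(.|x0)>."""
    total = 0.0
    for d, f_star, f_cond in zip(deltas, teacher, cond_student):
        total += (-2.0 * alpha_sid * alpha * dot(d, d)
                  + 2.0 * alpha * dot(d, f_star)
                  - 2.0 * beta * dot(d, f_cond))
    return total / len(deltas)

deltas = [[1.0, 0.0]]          # delta_t evaluated on one generated sample
teacher = [[2.0, 1.0]]         # f*_t(x_t)
cond_student = [[1.0, 1.0]]    # f^theta_t(x_t | x_0)

# alpha = beta = 1 removes the real-data reweighting, leaving the data-free form.
loss_half = generator_loss(deltas, teacher, cond_student, alpha=1.0, beta=1.0, alpha_sid=0.5)
loss_one = generator_loss(deltas, teacher, cond_student, alpha=1.0, beta=1.0, alpha_sid=1.0)
print(loss_half, loss_one)
```

Changing `alpha_sid` only rescales the quadratic term, which is the modification the text proposes for generator updates.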


[23] [23]

and student $f^\theta := \arg\min_f \mathcal{L}_{\text{UM}}(f, p_0^\theta)$


[24] [24]

In this case, the connection with the inverse optimization disappears

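functions. In this case, the connection with the inverse optimization disappears. For a fixed point $x_t^\theta$ and time $t$, we derive:

$$\|f_t^*(x_t^\theta) - f_t^\theta(x_t^\theta)\| = \max_{\delta_t(x_t^\theta)} \Big\langle \frac{\delta_t(x_t^\theta)}{\|\delta_t(x_t^\theta)\|},\, f_t^*(x_t^\theta) - f_t^\theta(x_t^\theta)\Big\rangle = \max_{\delta_t(x_t^\theta)} \mathbb{E}_{x_0^\theta\sim p_0^\theta(\cdot|x_t^\theta)}\Big[\Big\langle \frac{\delta_t(x_t^\theta)}{\|\delta_t(x_t^\theta)\|},\, f_t^*(x_t^\theta)\Big\rangle - \Big\langle \frac{\delta_t(x_t^\theta)}{\|\delta_t(x_t^\theta)\|},\, f_t^\theta(x_t^\theta|x_0^\theta)\Big\rangle\Big]. \tag{24}$$

...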
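The first equality in (24) is the variational form of the Euclidean norm: the inner product with a unit vector is at most the norm, with equality when the unit vector points along the argument. A small numerical sketch follows, where `v` is a hypothetical stand-in for $f_t^*(x_t^\theta) - f_t^\theta(x_t^\theta)$ and random directions are drawn with a fixed seed.

```python
import math
import random

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(vi * vi for vi in v))

def normalized_inner(delta, v):
    """<delta / ||delta||, v>, the quantity maximized over delta in (24)."""
    return sum(d * vi for d, vi in zip(delta, v)) / norm(delta)

v = [3.0, -4.0]                      # illustrative difference vector, norm 5
at_optimum = normalized_inner(v, v)  # choosing delta = v attains the norm

rng = random.Random(0)
best_random = max(
    normalized_inner([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)], v)
    for _ in range(2000)
)
print(at_optimum, best_random)
```

No random direction exceeds the value reached at `delta = v`, as the Cauchy-Schwarz inequality guarantees.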


[25] [25]

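for min-max optimization to solve $\min_\theta \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^\theta\sim p_t^\theta}\|f_t^*(x_t^\theta) - f_t^\theta(x_t^\theta)\|$ is: $\min_\theta\max_f \hat{\mathcal{L}}_{\text{UID}}(f, p_0^\theta)$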


[26] [26]

We need to split two summands in the linearized representation (24) into generated and real data parts with weights α,(1−α) and β,(1−β)

$$:= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^\theta\sim p_0^\theta,\,x_t^\theta\sim p_t^\theta(\cdot|x_0^\theta)}\Big\langle \frac{f_t^*(x_t^\theta) - f_t(x_t^\theta)}{\|f_t^*(x_t^\theta) - f_t(x_t^\theta)\|},\, f_t^*(x_t^\theta) - f_t^\theta(x_t^\theta|x_0^\theta)\Big\rangle. \tag{25}$$

Adding real data. Following the intuition from the proof for RealUID in Appendix A.1.1, we can incorporate real data in the Normalized UID loss (25) as well. We need to split the two summands in the linearized representation...


[27] [27]

on generated data $p_0^\theta\in\mathcal{P}(\mathbb{R}^D)$ with coefficients $\alpha,\beta\in(0,1]$: $\hat{\mathcal{L}}^{\alpha,\beta}_{\text{R-UID}}(f, p_0^\theta)$


[28] [28]

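$$:= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^\theta\sim p_t^\theta,\,x_0^\theta\sim p_0^\theta(\cdot|x_t^\theta)}\Big\langle \frac{f_t^*(x_t^\theta) - f_t(x_t^\theta)}{\|f_t^*(x_t^\theta) - f_t(x_t^\theta)\|},\, \alpha\cdot f_t^*(x_t^\theta) - \beta\cdot f_t^\theta(x_t^\theta|x_0^\theta)\Big\rangle$$
$$\quad + \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^*\sim p_t^*,\,x_0^*\sim p_0^*(\cdot|x_t^*)}\Big\langle \frac{f_t^*(x_t^*) - f_t(x_t^*)}{\|f_t^*(x_t^*) - f_t(x_t^*)\|},\, (1-\alpha)\cdot f_t^*(x_t^*) - (1-\beta)\cdot f_t^*(x_t^*|x_0^*)\Big\rangle.$$

Similar to the proof of RealUID distance Lemma 2, we can show that min-m...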


[29] [29]

This distance attains its minimum when $p_0^\theta = p_0^*$, justifying the procedure

$$= \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^*\sim p_t^*}\Big\|\Big((\beta-\alpha)+\alpha\frac{p_t^\theta(x_t^*)}{p_t^*(x_t^*)}\Big)\cdot f_t^*(x_t^*) - \beta\frac{p_t^\theta(x_t^*)}{p_t^*(x_t^*)}\cdot f_t^\theta(x_t^*)\Big\|.$$

This distance attains its minimum when $p_0^\theta = p_0^*$, justifying the procedure.

A.5 DMD APPROACH WITH REAL DATA

Distribution Matching Distillation (Luo et al., 2023; Wang et al., 2023; Yin et al., 2024b;a) (DMD) approach distills Gaussian diffu...


[30] [30]

The final algorithm alternates updates for the fake model and the generator similar to SiD approach

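and student score $s^\theta = \arg\min_s \mathcal{L}_{\text{DSM}}(s, p_0^\theta)$ at each time moment:

$$\mathbb{E}_{t\sim[0,T]}\,\frac{d D_{KL}(p_t^\theta\,\|\,p_t^*)}{d\theta} = \mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{z\sim p_Z,\,x_0^\theta = G_\theta(z),\,x_t^\theta\sim p_t^\theta}\Big[\big(s_t^\theta(x_t^\theta) - s_t^*(x_t^\theta)\big)\frac{dG_\theta}{d\theta}\Big].$$

The final algorithm alternates updates for the fake model and the generator, similar to the SiD approach. We would like to highlight that DMD does not fit our UID framework. The UID loss i...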


[31] [31]

Then apply the generator parameters update based on the KL divergence between mixed distributions

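$$:= \alpha\cdot\underbrace{\mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^\theta\sim p_0^\theta,\,x_t^\theta\sim p_t^\theta(\cdot|x_0^\theta)}\big\|s_t(x_t^\theta) - s_t^\theta(x_t^\theta|x_0^\theta)\big\|^2}_{\text{generated data } p_0^\theta \text{ term}} + (1-\alpha)\cdot\underbrace{\mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_0^*\sim p_0^*,\,x_t^*\sim p_t^*(\cdot|x_0^*)}\big\|s_t(x_t^*) - s_t^*(x_t^*|x_0^*)\big\|^2}_{\text{real data } p_0^* \text{ term}}.$$

Then apply the generator parameters update based on the KL divergence between mixed distributions. Lemma 4 (DMD with real data). Consider real...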


[32] [32]

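First, we use (Wang et al., 2023, Lemma 1) which says that, for any two distributions $p, q\in\mathcal{P}(\mathbb{R}^D)$ and point $x\in\mathbb{R}^D$, we have $\frac{\delta D_{KL}(q\|p)}{\delta q}[x] = \log q(x) - \log p(x) + 1$

$$:= \mathbb{E}_{t\sim[0,T]}\,D_{KL}\big(\alpha\cdot p_t^\theta + (1-\alpha)\cdot p_t^*\,\big\|\,p_t^*\big).$$

First, we use (Wang et al., 2023, Lemma 1) which says that, for any two distributions $p, q\in\mathcal{P}(\mathbb{R}^D)$ and point $x\in\mathbb{R}^D$, we have $\frac{\delta D_{KL}(q\|p)}{\delta q}[x] = \log q(x) - \log p(x) + 1$. Second, for the parametrization $x_0^\theta = G_\theta(z)$, $z\sim p_Z$ and a fixed point $x_t$, we have (Wang et al., 2023, Lemma 2) $\frac{\delta p_t^\theta(x_t)}{\delta p_0^\theta}[\theta] = \int_z p_t^\theta($...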
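The mixed-distribution objective above compares $\alpha\cdot p_t^\theta + (1-\alpha)\cdot p_t^*$ against $p_t^*$ at each time. The sketch below evaluates that KL divergence for a single time slice on a three-point toy space; the probability vectors are hypothetical, and the sketch assumes the real distribution is positive wherever the mixture is.

```python
import math

def kl(q, p):
    """Discrete KL divergence D_KL(q || p); terms with q_i = 0 contribute zero."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def mixed_kl(p_theta, p_star, alpha):
    """D_KL(alpha * p_theta + (1 - alpha) * p_star || p_star) for one time slice."""
    mixture = [alpha * a + (1.0 - alpha) * b for a, b in zip(p_theta, p_star)]
    return kl(mixture, p_star)

p_star = [0.2, 0.5, 0.3]    # stands in for p*_t
p_theta = [0.6, 0.1, 0.3]   # stands in for p^theta_t

at_match = mixed_kl(p_star, p_star, alpha=0.7)     # generator already matches the data
mismatch = mixed_kl(p_theta, p_star, alpha=0.5)
plain = kl(p_theta, p_star)                        # the alpha = 1 (no mixing) case
print(at_match, mismatch, plain)
```

The mixed divergence vanishes when the two distributions coincide and, by convexity of the KL divergence in its first argument, never exceeds $\alpha$ times the unmixed divergence.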


[33] [33]

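$$:= \alpha\cdot\mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^\theta\sim p_t^\theta,\,x_0^\theta\sim p_0^\theta(\cdot|x_t^\theta)}\big\|s_t(x_t^\theta) - s_t^\theta(x_t^\theta|x_0^\theta)\big\|^2 + (1-\alpha)\cdot\mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^*\sim p_t^*,\,x_0^*\sim p_0^*(\cdot|x_t^*)}\big\|s_t(x_t^*) - s_t^*(x_t^*|x_0^*)\big\|^2.$$

This loss is equivalent to the following sequence

$$\min_s\Big\{\alpha\,\mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^\theta\sim p_t^\theta}\big\|s_t(x_t^\theta) - s_t^\theta(x_t^\theta)\big\|^2 + (1-\alpha)\,\mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^*\sim p_t^*}\big\|s_t(x_t^*) - s_t^*(x_t^*)\big\|^2\Big\},$$

$\min_s\big\{\alpha\,\mathbb{E}_{t\sim[0,T]}\,\mathbb{E}_{x_t^\theta\sim p_t^\theta}\|s$...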
