
arxiv: 2509.22459 · v4 · pith:PDS2SGDOnew · submitted 2025-09-26 · 📊 stat.ML · cs.LG

Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)

Pith reviewed 2026-05-21 22:12 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords: distillation · matching models · diffusion models · flow matching · real data supervision · one-step generation · generative models

The pith

RealUID is a universal distillation framework that lets any matching model use real data directly to train one-step generators without GANs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RealUID as a distillation approach that trains fast one-step generators from slow iterative matching models such as diffusion and flow. It demonstrates how real data can be added straight into the distillation loss to guide the student model, removing the need for extra discriminator networks and adversarial training. The method supplies a single theoretical setup that recovers earlier distillation techniques for flow matching and diffusion as special cases while also covering extensions like bridge matching and stochastic interpolants. A sympathetic reader would care because the approach simplifies the creation of efficient, high-quality generative models that improve when real data is available and avoids the extra complexity and instability often tied to GAN-based supervision.

Core claim

RealUID is a universal inverse distillation framework for matching models that seamlessly incorporates real data into the distillation procedure without GANs and supplies a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models while extending to Bridge Matching and Stochastic Interpolants.

What carries the argument

The RealUID distillation loss, which directly combines guidance from a pre-trained teacher matching model with explicit matching to real data samples.
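To make the "teacher guidance plus real data" combination concrete, here is a minimal sketch of the fake-model loss in the form it appears in the paper's appendix: a generated-data term weighted by α with target scale β/α, plus a real-data term weighted by 1−α with target scale (1−β)/(1−α). The function name `real_um_loss`, the scalar (1D) samples, and the pair-list interface are assumptions made for illustration; they are not the authors' code.

```python
def real_um_loss(f, gen_pairs, real_pairs, alpha, beta):
    """Monte Carlo estimate of a real-data-modified matching loss (scalar toy).

    f          : fake model, maps a noisy sample x_t to a prediction
    gen_pairs  : (x_t, target) pairs built from generator samples
    real_pairs : (x_t, target) pairs built from real samples
    alpha,beta : mixing coefficients in (0, 1]
    """
    if not (0.0 < alpha <= 1.0 and 0.0 < beta <= 1.0):
        raise ValueError("alpha and beta must lie in (0, 1]")

    # Generated-data term: alpha * E |f(x_t) - (beta/alpha) * target|^2
    scale_gen = beta / alpha
    gen_term = sum((f(x) - scale_gen * y) ** 2 for x, y in gen_pairs) / len(gen_pairs)
    loss = alpha * gen_term

    # With alpha == 1 the real-data weight (1 - alpha) is zero; this squared
    # form is only well defined there if beta == 1 as well (data-free case).
    if alpha == 1.0:
        if beta != 1.0:
            raise ValueError("this squared form needs beta == 1 when alpha == 1")
        return loss

    # Real-data term: (1-alpha) * E |f(x_t) - ((1-beta)/(1-alpha)) * target|^2
    scale_real = (1.0 - beta) / (1.0 - alpha)
    real_term = sum((f(x) - scale_real * y) ** 2 for x, y in real_pairs) / len(real_pairs)
    return loss + (1.0 - alpha) * real_term
```

Setting `alpha = beta = 1.0` drops the real-data term and leaves an ordinary regression loss on generator samples, which is the data-free special case.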

If this is right

  • Earlier data-free distillation techniques for flow and diffusion become direct instances of RealUID when the real-data term is omitted.
  • The same procedure applies without modification to bridge matching and stochastic interpolants.
  • Distilled one-step models gain improved sample quality from real-data supervision while retaining the speed advantage of single-step inference.
  • No extra discriminator network is required when moving from data-free to real-data distillation.
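The first bullet can be checked by substitution. The explicit loss below is transcribed from the paper's appendix as extracted for this page, so treat its exact form as an assumption rather than a verified quotation; δ is the reparameterized fake model, f* the teacher, and f^θ the student target.

```latex
\begin{aligned}
\mathcal{L}^{\alpha,\beta}_{\text{R-UID}}(\delta, p^\theta_0)
 ={}& \mathbb{E}_{t\sim[0,T]}\,
      \mathbb{E}_{x^\theta_0\sim p^\theta_0,\; x^\theta_t\sim p^\theta_t(\cdot\mid x^\theta_0)}
      \Big[-\alpha\|\delta_t(x^\theta_t)\|^2
           + 2\alpha\langle\delta_t(x^\theta_t), f^*_t(x^\theta_t)\rangle
           - 2\beta\langle\delta_t(x^\theta_t), f^\theta_t(x^\theta_t\mid x^\theta_0)\rangle\Big] \\
 &+ \mathbb{E}_{t\sim[0,T]}\,
      \mathbb{E}_{x^*_0\sim p^*_0,\; x^*_t\sim p^*_t(\cdot\mid x^*_0)}
      \Big[-(1-\alpha)\|\delta_t(x^*_t)\|^2
           + 2(1-\alpha)\langle\delta_t(x^*_t), f^*_t(x^*_t)\rangle
           - 2(1-\beta)\langle\delta_t(x^*_t), f^*_t(x^*_t\mid x^*_0)\rangle\Big]
\end{aligned}
```

With α = β = 1 every coefficient in the second expectation is zero, and the first expectation becomes the data-free loss:

```latex
\mathcal{L}_{\text{UID}}(\delta, p^\theta_0)
 = \mathbb{E}_{t\sim[0,T]}\,
   \mathbb{E}_{x^\theta_0\sim p^\theta_0,\; x^\theta_t\sim p^\theta_t(\cdot\mid x^\theta_0)}
   \Big[-\|\delta_t(x^\theta_t)\|^2
        + 2\langle\delta_t(x^\theta_t), f^*_t(x^\theta_t)\rangle
        - 2\langle\delta_t(x^\theta_t), f^\theta_t(x^\theta_t\mid x^\theta_0)\rangle\Big]
```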

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be tested on newly proposed matching-model variants to check whether the direct real-data term remains stable across different noise schedules.
  • By removing the adversarial component, RealUID may lower the hyper-parameter burden when practitioners adapt distillation to new data domains.
  • The unification suggests that future matching-model papers could adopt RealUID as a default fast-sampling baseline rather than re-deriving separate distillation losses.

Load-bearing premise

Real data can be incorporated directly into the distillation loss for any matching model without introducing instability or requiring additional adversarial components.

What would settle it

Run RealUID distillation on a standard flow-matching teacher using real image data and measure whether the resulting one-step model shows higher FID or training divergence than a comparable GAN-based real-data baseline on the same dataset and architecture.
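The comparison above hinges on a Fréchet-style distance between feature statistics of real and generated samples. The sketch below is a deliberately simplified stand-in: it assumes diagonal covariances and takes feature vectors as plain lists, whereas standard FID uses Inception features and full covariance matrices. The function name `diagonal_frechet_distance` is hypothetical.

```python
import math
from statistics import fmean, pvariance


def diagonal_frechet_distance(feats_a, feats_b):
    """Frechet distance between two Gaussians fitted with diagonal covariance.

    For diagonal covariances the distance reduces, per dimension, to
    (mu_a - mu_b)^2 + (sigma_a - sigma_b)^2. This is NOT full FID.
    """
    dims = len(feats_a[0])
    total = 0.0
    for d in range(dims):
        col_a = [row[d] for row in feats_a]
        col_b = [row[d] for row in feats_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        sd_a = math.sqrt(pvariance(col_a, mu_a))
        sd_b = math.sqrt(pvariance(col_b, mu_b))
        total += (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2
    return total
```

Under this protocol one would compute the distance once for the RealUID student and once for the GAN-based baseline against the same real feature set, and also track the value across training steps to detect divergence.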

Figures

Figures reproduced from arXiv: 2509.22459 by Aleksei Leonov, Alexander Korotin, David Li, Evgeny Burnaev, Iaroslav Koshelev, Nikita Gushchin, Nikita Kornilov, Tikhon Mavrin.

Figure 1: Pipeline of our RealUID distillation framework (§3) with the direct incorporation of real data $p^*_0$ adjusted by hyperparameters α, β ∈ (0, 1]. In the figure, it is depicted for Flow Matching models predicting denoised samples. It distills a costly frozen teacher model $f^*$ (blue) into a one-step generator $G_\theta$ (red) upon min-max optimization of the $\mathcal{L}^{\alpha,\beta}_{\text{R-UID}}(f, p^\theta_0)$ loss over the fake model $f$ (green) and generato…
Figure 2: Evolution of FID during CIFAR-10 distillation for (i) the baseline RealUID (α = 1.0, β = 1.0), (ii) the best-performing RealUID configurations, and (iii) subsequent fine-tuning, evaluated in both unconditional and conditional settings. The performances of Teacher Flow and UID+GAN are indicated by horizontal reference lines in their respective colors. Methods that incorporate real data—best-performing RealU…
Figure 3: RealUID loss for 1D-Gaussians under various coefficients (α, β).
  • Configuration β = α = 1 (UID loss) does not affect uncovered real data points $x_t$ with $p^\theta_t(x_t) \to 0$, $p^*(x_t) \gg 0$: $l_t(x_t, \beta, \alpha) \approx \|p^\theta_t(x_t)\, f^*_t(x_t) - p^\theta_t(x_t)\, f^\theta_t(x_t)\|^2 / p^\theta_t(x_t) = \|f^*_t(x_t) - f^\theta_t(x_t)\|^2\, p^\theta_t(x_t) \to 0$.
  • Configuration β = α < 1 does not affect uncovered real data points $x_t$ with $p^\theta_t(x_t) \to 0$, $p^*(x_t) \gg 0$: $l$…
Figure 4: Uncurated samples for unconditional generation by the one-step RealUID (α = 1.0, β = 1.0) trained on CIFAR-10. Quantitative results are reported in …
Figure 5: Uncurated samples for unconditional generation by the one-step RealUID (α = 1.0, β = 1.0) + GAN ($\lambda^{G_\theta}_{\mathrm{adv}} = 0.3$, $\lambda^{D}_{\mathrm{adv}} = 1$) trained on CIFAR-10. Quantitative results are reported in …
Figure 6: Uncurated samples for unconditional generation by the one-step RealUID (α = 0.94, β = 0.96) trained on CIFAR-10. Quantitative results are reported in …
Figure 7: Uncurated samples for unconditional generation by the one-step RealUID (α = 0.94, β = 0.96 | $\alpha_{FT}$ = 0.94, $\beta_{FT}$ = 1.0) trained on CIFAR-10. Quantitative results are reported in …
Figure 8: Uncurated samples for conditional generation by the one-step RealUID (α = 1.0, β = 1.0) trained on CIFAR-10. Quantitative results are reported in …
Figure 9: Uncurated samples for conditional generation by the one-step RealUID (α = 1.0, β = 1.0) + GAN ($\lambda^{G_\theta}_{\mathrm{adv}} = 0.3$, $\lambda^{D}_{\mathrm{adv}} = 1$) trained on CIFAR-10. Quantitative results are reported in …
Figure 10: Uncurated samples for conditional generation by the one-step RealUID (α = 0.98, β = 0.96) trained on CIFAR-10. Quantitative results are reported in …
Figure 11: Uncurated samples for conditional generation by the one-step RealUID (α = 0.98, β = 0.96 | $\alpha_{FT}$ = 0.94, $\beta_{FT}$ = 1.0) trained on CIFAR-10. Quantitative results are reported in …
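The point-wise loss discussed in the Figure 3 caption can be evaluated directly. The sketch below assumes the scalar (1D) form l_t = ((p*(β−α) + α p^θ) f* − β p^θ f^θ)² / ((1−α) p* + α p^θ), as transcribed from the paper's appendix; the function name and the sample numbers are illustrative assumptions.

```python
def pointwise_distance(p_star, p_theta, f_star, f_theta, alpha, beta):
    """Point-wise RealUID distance l_t(x_t, beta, alpha) in the scalar case.

    p_star, p_theta : real and generated densities at x_t
    f_star, f_theta : teacher and student predictions at x_t
    """
    denom = (1.0 - alpha) * p_star + alpha * p_theta
    if denom <= 0.0:
        raise ValueError("mixture density must be positive at x_t")
    numer = ((p_star * (beta - alpha) + alpha * p_theta) * f_star
             - beta * p_theta * f_theta) ** 2
    return numer / denom


# An "uncovered" real point: real density is large, generator density is tiny.
uncovered = dict(p_star=0.4, p_theta=1e-8, f_star=1.0, f_theta=-1.0)
l_uid = pointwise_distance(alpha=1.0, beta=1.0, **uncovered)      # data-free UID
l_equal = pointwise_distance(alpha=0.94, beta=0.94, **uncovered)  # beta == alpha < 1
l_split = pointwise_distance(alpha=0.94, beta=0.96, **uncovered)  # beta != alpha
```

In this toy setting the first two values are numerically negligible, matching the caption's statement that β = α leaves uncovered real points unpenalized, while β ≠ α yields a value close to p*(β−α)² f*² / (1−α), so such points do contribute to the loss.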
Original abstract

While achieving exceptional generative quality, modern diffusion, flow, and other matching models suffer from slow inference, as they require many steps of iterative generation. Recent distillation methods address this problem by training efficient one-step generators under the guidance of a pre-trained teacher model. However, these methods are often constrained to only one specific framework, e.g., only to diffusion or only to flow models. Furthermore, these methods are originally data-free, and to benefit from the usage of real data, it is required to use an additional complex adversarial training with an extra discriminator model. In this paper, we present RealUID, a universal distillation framework for all matching models that seamlessly incorporates real data into the distillation procedure without GANs. Our RealUID approach offers a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models, and can be also extended to their modifications, such as Bridge Matching and Stochastic Interpolants. The code can be found in https://github.com/David-cripto/RealUID.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces RealUID, a universal inverse distillation framework for matching models (including Flow Matching, Diffusion, Bridge Matching, and Stochastic Interpolants). It claims to seamlessly incorporate real-data supervision into the distillation loss without GANs or extra discriminators, while providing a simple theoretical foundation that recovers prior distillation methods as special cases.

Significance. If the central derivation holds and the real-data mixture preserves the required contraction or fixed-point properties across model families, the result would be significant: it unifies existing distillation techniques under one objective, removes the need for adversarial components, and enables direct use of real data for one-step generators. The open-source code link is a positive factor for reproducibility.

major comments (1)
  1. [§3.2–3.3] The universality claim rests on the RealUID loss (mixture of real data and teacher marginal inside the inverse distillation objective) generalizing without additional regularity assumptions. No explicit bound is derived on how the real-data mixture weight perturbs the contraction mapping or fixed-point uniqueness relied upon by prior data-free proofs for Flow Matching and Diffusion; if the perturbation is non-contractive for some vector fields, the claimed simple theoretical foundation does not cover the full family of matching models.
minor comments (2)
  1. [Abstract / §1] The abstract and introduction state that the framework 'covers previous distillation methods' but do not include a short table or explicit reduction showing how the RealUID objective specializes to the cited Flow-Matching and Diffusion losses.
  2. [§3] Notation for the mixture weight and the inverse-distillation operator should be introduced once with a clear definition before being used in the loss equations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review. The feedback highlights an important point regarding the rigor of the theoretical claims. We address the major comment below and commit to revisions that strengthen the presentation without altering the core contributions.

Point-by-point responses
  1. Referee: [§3.2–3.3] The universality claim rests on the RealUID loss (mixture of real data and teacher marginal inside the inverse distillation objective) generalizing without additional regularity assumptions. No explicit bound is derived on how the real-data mixture weight perturbs the contraction mapping or fixed-point uniqueness relied upon by prior data-free proofs for Flow Matching and Diffusion; if the perturbation is non-contractive for some vector fields, the claimed simple theoretical foundation does not cover the full family of matching models.

    Authors: We appreciate the referee's careful reading of the theoretical sections. The current derivation in §3.2–3.3 shows that RealUID recovers the data-free inverse distillation objectives exactly when the real-data weight is set to zero, and that the combined loss remains a well-defined expectation under the same marginals used in prior work. We agree that an explicit perturbation bound on the contraction constant would make the universality statement more complete. In the revised manuscript we will add a short subsection (or extended remark) in §3.3 that (i) recalls the Lipschitz and contraction assumptions from the referenced Flow Matching and Diffusion proofs, (ii) treats the real-data term as a bounded perturbation whose Lipschitz constant is controlled by the data distribution's regularity, and (iii) states a sufficient condition on the mixture weight α such that the overall operator remains contractive whenever α is below a threshold determined by the original contraction gap. This addition uses only the regularity already standard in the literature and does not introduce new assumptions. We believe the revised argument will directly address the concern while preserving the paper's claim of a simple unifying foundation.

    Revision: yes
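The threshold argument promised in this response follows a standard pattern, sketched here in generic form. The operators T and R, the constants κ and L, and the weight ε are placeholders introduced for illustration only; they are not the paper's notation, and whether the RealUID update actually fits this template is exactly what the referee asks the authors to show.

```latex
\|T(u)-T(v)\| \le \kappa\,\|u-v\| \ (\kappa < 1),
\qquad
\|R(u)-R(v)\| \le L\,\|u-v\|
\;\Longrightarrow\;
\|(T+\varepsilon R)(u)-(T+\varepsilon R)(v)\| \le (\kappa+\varepsilon L)\,\|u-v\| .
```

By the triangle inequality, T + εR therefore remains a contraction whenever ε < (1 − κ)/L, where 1 − κ is the contraction gap of the unperturbed operator.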

Circularity Check

0 steps flagged

No significant circularity: derivation introduces independent real-data mixture term without reducing to fitted inputs or self-citation chains

full rationale

The paper defines RealUID by constructing an inverse distillation objective that mixes real data samples with teacher-generated samples inside the loss, then shows this recovers prior Flow Matching and Diffusion distillation losses as special cases while extending to Bridge Matching and Stochastic Interpolants. This construction adds an explicit real-data supervision term that is not present in the cited data-free baselines, and the universality claim rests on algebraic substitution within the loss rather than on any parameter being fitted to the target result or on a self-citation that forbids alternatives. No equation is presented in which the claimed prediction equals its own input by definition, and the theoretical foundation is stated to be simple and assumption-light without invoking uniqueness theorems from the authors' prior work as load-bearing. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the unshown theoretical foundation and the assumption that real data integrates stably.

axioms (1)
  • domain assumption: A pre-trained teacher matching model can provide reliable guidance for distilling a one-step student when real data is added to the objective.
    This premise is required for any teacher-guided distillation and is invoked by the claim that real data is incorporated seamlessly.



