Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
Pith reviewed 2026-05-21 22:12 UTC · model grok-4.3
The pith
RealUID is a universal distillation framework that lets any matching model use real data directly to train one-step generators without GANs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RealUID is a universal inverse distillation framework for matching models that seamlessly incorporates real data into the distillation procedure without GANs and supplies a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models while extending to Bridge Matching and Stochastic Interpolants.
What carries the argument
The RealUID distillation loss, which directly combines guidance from a pre-trained teacher matching model with explicit matching to real data samples.
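For reference, the loss can be written out explicitly. The block below is a reconstruction of the explicit formula the paper gives for the RealUID loss (Theorem 2 there), so the notation should be checked against the paper. It assumes f* is the teacher, f^θ(·|x_0) is the conditional target on generated data, δ = f* − f is the reparameterized fake model (the sign convention under which expanding ‖f* − c‖² − ‖f − c‖² yields these terms), and α, β ∈ (0,1] split the terms between generated samples p^θ and real samples p*.

```latex
\begin{aligned}
\mathcal{L}^{\alpha,\beta}_{\text{R-UID}}(\delta, p^\theta_0)
={}& \mathbb{E}_{t\sim[0,T]}\,
\mathbb{E}_{x^\theta_0\sim p^\theta_0,\; x^\theta_t\sim p^\theta_t(\cdot\mid x^\theta_0)}
\Big[-\alpha\,\|\delta_t(x^\theta_t)\|^2
+ 2\alpha\,\langle \delta_t(x^\theta_t),\, f^*_t(x^\theta_t)\rangle
- 2\beta\,\langle \delta_t(x^\theta_t),\, f^\theta_t(x^\theta_t\mid x^\theta_0)\rangle\Big] \\
&+ \mathbb{E}_{t\sim[0,T]}\,
\mathbb{E}_{x^*_0\sim p^*_0,\; x^*_t\sim p^*_t(\cdot\mid x^*_0)}
\Big[-(1-\alpha)\,\|\delta_t(x^*_t)\|^2
+ 2(1-\alpha)\,\langle \delta_t(x^*_t),\, f^*_t(x^*_t)\rangle
- 2(1-\beta)\,\langle \delta_t(x^*_t),\, f^*_t(x^*_t\mid x^*_0)\rangle\Big].
\end{aligned}
```

The first expectation is the teacher-guided term on generated data and the second is the explicit real-data term. With α = β = 1 the second expectation vanishes.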
If this is right
- Earlier data-free distillation techniques for flow and diffusion become direct instances of RealUID when the real-data term is omitted.
- The same procedure applies without modification to bridge matching and stochastic interpolants.
- Distilled one-step models gain improved sample quality from real-data supervision while retaining the speed advantage of single-step inference.
- No extra discriminator network is required when moving from data-free to real-data distillation.
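The first consequence can be checked numerically. The sketch below is a minimal Monte Carlo version of the RealUID objective at a single time step, assuming the loss has a generated-data term weighted by α, β and a real-data term weighted by 1 − α, 1 − β, as in the paper's explicit formula. The function names and the random stand-in arrays are illustrative assumptions, not the authors' code.

```python
import numpy as np


def _dot(a, b):
    """Row-wise inner product for arrays of shape (batch, dim)."""
    return np.sum(a * b, axis=1)


def realuid_loss(delta_gen, teacher_gen, target_gen,
                 delta_real, teacher_real, target_real,
                 alpha, beta):
    """Monte Carlo estimate of the RealUID objective at one time step.

    *_gen arrays are evaluated at noised generator samples,
    *_real arrays at noised real samples (hypothetical inputs).
    """
    gen_term = (-alpha * _dot(delta_gen, delta_gen)
                + 2.0 * alpha * _dot(delta_gen, teacher_gen)
                - 2.0 * beta * _dot(delta_gen, target_gen))
    real_term = (-(1.0 - alpha) * _dot(delta_real, delta_real)
                 + 2.0 * (1.0 - alpha) * _dot(delta_real, teacher_real)
                 - 2.0 * (1.0 - beta) * _dot(delta_real, target_real))
    return float(gen_term.mean() + real_term.mean())


def data_free_uid_loss(delta_gen, teacher_gen, target_gen):
    """Data-free UID objective: only the generated-data term, unit weights."""
    gen_term = (-_dot(delta_gen, delta_gen)
                + 2.0 * _dot(delta_gen, teacher_gen)
                - 2.0 * _dot(delta_gen, target_gen))
    return float(gen_term.mean())


# Random stand-ins for network outputs; real use would plug in model evaluations.
rng = np.random.default_rng(0)
arrays = [rng.normal(size=(8, 3)) for _ in range(6)]

full = realuid_loss(*arrays, alpha=1.0, beta=1.0)
data_free = data_free_uid_loss(*arrays[:3])
print(abs(full - data_free) < 1e-12)
```

With α = β = 1 every real-data coefficient is zero, so the two estimates coincide. Any other choice of coefficients brings the real-data term into play without adding a discriminator.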
Where Pith is reading between the lines
- The framework could be tested on newly proposed matching-model variants to check whether the direct real-data term remains stable across different noise schedules.
- By removing the adversarial component, RealUID may lower the hyper-parameter burden when practitioners adapt distillation to new data domains.
- The unification suggests that future matching-model papers could adopt RealUID as a default fast-sampling baseline rather than re-deriving separate distillation losses.
Load-bearing premise
Real data can be incorporated directly into the distillation loss for any matching model without introducing instability or requiring additional adversarial components.
What would settle it
Run RealUID distillation on a standard flow-matching teacher using real image data and measure whether the resulting one-step model shows higher FID or training divergence than a comparable GAN-based real-data baseline on the same dataset and architecture.
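The FID comparison in this test reduces to a Fréchet distance between Gaussians fitted to feature embeddings. The sketch below shows that computation with NumPy only. It assumes the features (for FID, usually Inception embeddings) have already been extracted into arrays of shape (n, d) with d ≥ 2; the function name and the synthetic features are illustrative assumptions.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances
    # up to numerical noise, hence the .real and the clip.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Synthetic stand-ins for embeddings of real and generated images.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))
student_feats = real_feats + np.array([3.0, 0.0, 0.0, 0.0])  # same spread, shifted mean

print(frechet_distance(real_feats, real_feats))     # close to 0
print(frechet_distance(real_feats, student_feats))  # close to 9, the squared mean shift
```

In the proposed experiment the same function would be applied twice, once to the RealUID student and once to the GAN-based baseline, each against the same real-data features.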
Original abstract
While achieving exceptional generative quality, modern diffusion, flow, and other matching models suffer from slow inference, as they require many steps of iterative generation. Recent distillation methods address this problem by training efficient one-step generators under the guidance of a pre-trained teacher model. However, these methods are often constrained to only one specific framework, e.g., only to diffusion or only to flow models. Furthermore, these methods are originally data-free, and to benefit from the usage of real data, it is required to use an additional complex adversarial training with an extra discriminator model. In this paper, we present RealUID, a universal distillation framework for all matching models that seamlessly incorporates real data into the distillation procedure without GANs. Our RealUID approach offers a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models, and can be also extended to their modifications, such as Bridge Matching and Stochastic Interpolants. The code can be found in https://github.com/David-cripto/RealUID.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RealUID, a universal inverse distillation framework for matching models (including Flow Matching, Diffusion, Bridge Matching, and Stochastic Interpolants). It claims to seamlessly incorporate real-data supervision into the distillation loss without GANs or extra discriminators, while providing a simple theoretical foundation that recovers prior distillation methods as special cases.
Significance. If the central derivation holds and the real-data mixture preserves the required contraction or fixed-point properties across model families, the result would be significant: it unifies existing distillation techniques under one objective, removes the need for adversarial components, and enables direct use of real data for one-step generators. The open-source code link is a positive factor for reproducibility.
major comments (1)
- [§3.2–3.3] The universality claim rests on the RealUID loss (mixture of real data and teacher marginal inside the inverse distillation objective) generalizing without additional regularity assumptions. No explicit bound is derived on how the real-data mixture weight perturbs the contraction mapping or fixed-point uniqueness relied upon by prior data-free proofs for Flow Matching and Diffusion; if the perturbation is non-contractive for some vector fields, the claimed simple theoretical foundation does not cover the full family of matching models.
minor comments (2)
- [Abstract / §1] The abstract and introduction state that the framework 'covers previous distillation methods' but do not include a short table or explicit reduction showing how the RealUID objective specializes to the cited Flow-Matching and Diffusion losses.
- [§3] Notation for the mixture weight and the inverse-distillation operator should be introduced once with a clear definition before being used in the loss equations.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The feedback highlights an important point regarding the rigor of the theoretical claims. We address the major comment below and commit to revisions that strengthen the presentation without altering the core contributions.
Point-by-point responses
- Referee: [§3.2–3.3] The universality claim rests on the RealUID loss (mixture of real data and teacher marginal inside the inverse distillation objective) generalizing without additional regularity assumptions. No explicit bound is derived on how the real-data mixture weight perturbs the contraction mapping or fixed-point uniqueness relied upon by prior data-free proofs for Flow Matching and Diffusion; if the perturbation is non-contractive for some vector fields, the claimed simple theoretical foundation does not cover the full family of matching models.
  Authors: We appreciate the referee's careful reading of the theoretical sections. The current derivation in §3.2–3.3 shows that RealUID recovers the data-free inverse distillation objectives exactly when the real-data weight is set to zero, and that the combined loss remains a well-defined expectation under the same marginals used in prior work. We agree that an explicit perturbation bound on the contraction constant would make the universality statement more complete. In the revised manuscript we will add a short subsection (or extended remark) in §3.3 that (i) recalls the Lipschitz and contraction assumptions from the referenced Flow Matching and Diffusion proofs, (ii) treats the real-data term as a bounded perturbation whose Lipschitz constant is controlled by the data distribution's regularity, and (iii) states a sufficient condition on the mixture weight α such that the overall operator remains contractive whenever α is below a threshold determined by the original contraction gap. This addition uses only the regularity already standard in the literature and does not introduce new assumptions. We believe the revised argument will directly address the concern while preserving the paper's claim of a simple unifying foundation.
  Revision: yes
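The sufficient condition promised in item (iii) would presumably follow a standard perturbation argument. As an illustration only, with placeholders that are not defined in the paper (T for the data-free operator with contraction constant κ, R for the real-data perturbation with Lipschitz constant L, and w for the weight on that perturbation):

```latex
\|T(u) - T(v)\| \le \kappa\,\|u - v\| \ \ (\kappa < 1),
\qquad
\|R(u) - R(v)\| \le L\,\|u - v\|
\;\;\Longrightarrow\;\;
\|(T + wR)(u) - (T + wR)(v)\| \le (\kappa + wL)\,\|u - v\|.
```

Under these assumptions T + wR stays contractive whenever w < (1 − κ)/L, so the admissible weight is set by the contraction gap 1 − κ. Whether the RealUID operator actually decomposes this way is exactly what the referee asks the authors to establish.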
Circularity Check
No significant circularity: derivation introduces independent real-data mixture term without reducing to fitted inputs or self-citation chains
full rationale
The paper defines RealUID by constructing an inverse distillation objective that mixes real data samples with teacher-generated samples inside the loss, then shows this recovers prior Flow Matching and Diffusion distillation losses as special cases while extending to Bridge Matching and Stochastic Interpolants. This construction adds an explicit real-data supervision term that is not present in the cited data-free baselines, and the universality claim rests on algebraic substitution within the loss rather than on any parameter being fitted to the target result or on a self-citation that forbids alternatives. No equation is presented in which the claimed prediction equals its own input by definition, and the theoretical foundation is stated to be simple and assumption-light without invoking uniqueness theorems from the authors' prior work as load-bearing. The derivation is therefore self-contained against external benchmarks.
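The algebraic substitution can be made concrete at a single point. As the paper's appendix is read here, once generated-data terms are reweighted by the density ratio r = p^θ_t(x)/p*_t(x), the objective is a concave quadratic in δ: −A‖δ‖² + 2⟨δ, v⟩, with A = (1 − α) + αr and v = ((β − α) + αr)·f* − βr·f^θ. Its maximizer is v/A and its maximum is ‖v‖²/A. The sketch below checks this numerically; all function names and the random vectors are illustrative assumptions.

```python
import numpy as np


def _coeffs(f_teacher, f_student, ratio, alpha, beta):
    """Quadratic coefficient A and linear direction v of the pointwise objective."""
    a = (1.0 - alpha) + alpha * ratio
    v = ((beta - alpha) + alpha * ratio) * f_teacher - beta * ratio * f_student
    return a, v


def pointwise_objective(delta, f_teacher, f_student, ratio, alpha, beta):
    """Value of -A * ||delta||^2 + 2 * <delta, v> at a single point."""
    a, v = _coeffs(f_teacher, f_student, ratio, alpha, beta)
    return float(-a * (delta @ delta) + 2.0 * (delta @ v))


def optimal_delta(f_teacher, f_student, ratio, alpha, beta):
    """Closed-form maximizer v / A of the concave quadratic."""
    a, v = _coeffs(f_teacher, f_student, ratio, alpha, beta)
    return v / a


def pointwise_distance(f_teacher, f_student, ratio, alpha, beta):
    """Maximum value ||v||^2 / A, i.e. the distance the generator minimizes."""
    a, v = _coeffs(f_teacher, f_student, ratio, alpha, beta)
    return float(v @ v / a)


rng = np.random.default_rng(0)
f_teacher, f_student = rng.normal(size=3), rng.normal(size=3)
ratio, alpha, beta = 0.7, 0.9, 0.95  # arbitrary illustrative values

best = optimal_delta(f_teacher, f_student, ratio, alpha, beta)
at_best = pointwise_objective(best, f_teacher, f_student, ratio, alpha, beta)
nearby = pointwise_objective(best + 0.1 * rng.normal(size=3),
                             f_teacher, f_student, ratio, alpha, beta)

print(abs(at_best - pointwise_distance(f_teacher, f_student, ratio, alpha, beta)) < 1e-9)
print(nearby < at_best)
# Matched case: equal densities (ratio 1) and student field equal to the teacher.
print(pointwise_distance(f_teacher, f_teacher, 1.0, alpha, beta) < 1e-20)
```

The matched case gives v = 0, so the distance vanishes when the generator reproduces the data distribution and the student field equals the teacher, which is the property the rationale above relies on.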
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A pre-trained teacher matching model can provide reliable guidance for distilling a one-step student when real data is added to the objective.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Universal Matching loss L_UM(f, p_0) … min-max optimization of Universal Inverse Distillation (UID) loss … RealUID loss with real data (α, β ∈ (0,1])
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.