Demystifying Transition Matching: When and Why It Can Beat Flow Matching

Jaihoon Kim; Minhyuk Sung; Rajarshi Saha; Youngsuk Park

arxiv: 2510.17991 · v3 · pith:L4VJOGSFnew · submitted 2025-10-20 · 💻 cs.LG · cs.CV


Pith reviewed 2026-05-22 13:23 UTC · model grok-4.3

classification 💻 cs.LG cs.CV

keywords transition matching · flow matching · KL divergence · Gaussian mixture · generative modeling · sampling steps · covariance preservation · unimodal distribution


The pith

Transition matching attains strictly lower KL divergence than flow matching for finite-step sampling of unimodal Gaussians.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that transition matching outperforms flow matching when sampling from a unimodal Gaussian distribution using a finite number of steps, as measured by KL divergence. This happens because the stochastic updates in transition matching maintain the target distribution's covariance, while the deterministic updates in flow matching systematically underestimate it. The authors characterize the convergence rates and show that transition matching reaches the target faster under a fixed computational budget. They further show that this advantage extends to Gaussian mixture targets in local-unimodality regimes, particularly when the modes are well separated and the variances are non-negligible. Controlled experiments on Gaussians and applications to image and video generation back up the theoretical comparisons.

Core claim

When the target is a unimodal Gaussian distribution, transition matching (TM) attains strictly lower KL divergence than flow matching (FM) for any finite number of steps. The improvement arises from the stochastic difference latent updates in TM, which preserve the target covariance that deterministic FM underestimates. For Gaussian mixtures, the sampling dynamics approximate the unimodal case in local-unimodality regimes, and the approximation error decreases as the minimal distance between component means increases. TM outperforms FM when the target distribution has well-separated modes and non-negligible variances, but the advantage diminishes as the target variance approaches zero.
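The covariance argument can be checked with a minimal sketch in one dimension. It assumes a linear path X_t = (1 - t) X_0 + t X_1 with independent X_0 ~ N(0, 1) and X_1 ~ N(mu, sigma^2), a difference latent V = X_1 - X_0, and a TM step that draws V from the exact posterior p(V | X_t). These modeling choices, and the function names `path_var`, `drift_slope`, `fm_final_var`, and `tm_final_var`, are illustrative assumptions rather than the paper's implementation (in particular, the paper's inner sampler with S steps is not modeled). Because every update is affine in Gaussian variables, the variance can be propagated in closed form.

```python
import math


def path_var(t: float, sigma: float) -> float:
    """Variance of X_t = (1 - t) X_0 + t X_1 under the assumed linear path."""
    return (1.0 - t) ** 2 + (t * sigma) ** 2


def drift_slope(t: float, sigma: float) -> float:
    """Slope a_t in E[V | X_t = x] = mu + a_t * (x - t * mu)."""
    return (t * sigma**2 - (1.0 - t)) / path_var(t, sigma)


def fm_final_var(n_steps: int, sigma: float) -> float:
    """Sample variance after n_steps deterministic Euler steps (FM)."""
    h = 1.0 / n_steps
    var = 1.0  # source distribution N(0, 1)
    for n in range(n_steps):
        var *= (1.0 + h * drift_slope(n * h, sigma)) ** 2
    return var


def tm_final_var(n_steps: int, sigma: float) -> float:
    """Sample variance after n_steps stochastic steps x <- x + h * V (TM)."""
    h = 1.0 / n_steps
    var = 1.0
    for n in range(n_steps):
        t = n * h
        # Var(V | X_t) under the exact posterior (an idealizing assumption).
        posterior_var = sigma**2 / path_var(t, sigma)
        var = (1.0 + h * drift_slope(t, sigma)) ** 2 * var + h**2 * posterior_var
    return var


if __name__ == "__main__":
    for n_steps in (2, 4, 8):
        print(n_steps, fm_final_var(n_steps, 1.0), tm_final_var(n_steps, 1.0))
```

Under these assumptions, with sigma = 1 and two steps the deterministic recursion ends at variance 0.25 while the stochastic recursion ends at the target variance 1.0, which is the qualitative gap the core claim describes.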

What carries the argument

Stochastic difference latent updates in transition matching that preserve the target covariance structure, in contrast to deterministic updates in flow matching.

If this is right

TM achieves strictly lower KL divergence to the target than FM for any finite number of steps on unimodal Gaussians.
TM converges faster than FM under a fixed compute budget in the unimodal Gaussian setting.
The performance advantage of TM holds for Gaussian mixtures in local-unimodality regimes with increasing mode separation.
The advantage of TM over FM diminishes as the target variance approaches zero.
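The last point can be made concrete with a short derivation. It is a sketch under assumptions not quoted from the paper: a linear path $X_t = (1-t)X_0 + tX_1$ with independent $X_0 \sim \mathcal{N}(0, I_d)$ and $X_1 \sim \mathcal{N}(\mu, \sigma^2 I_d)$, and a difference latent $V = X_1 - X_0$. Standard Gaussian conditioning then gives the law of $V$ given $X_t$.

```latex
\begin{align*}
s_t &:= (1-t)^2 + t^2\sigma^2
  && \text{(per-coordinate variance of } X_t\text{)} \\
\operatorname{Cov}(V, X_t) &= \bigl(t\sigma^2 - (1-t)\bigr) I_d,
  \qquad \operatorname{Var}(V) = (1+\sigma^2) I_d \\
V \mid X_t = x &\sim \mathcal{N}\!\left(
  \mu + \frac{t\sigma^2 - (1-t)}{s_t}\,(x - t\mu),\;
  \frac{\sigma^2}{s_t}\, I_d \right) \\
\frac{\sigma^2}{s_t} &\longrightarrow 0
  \quad \text{as } \sigma \to 0 \text{ for any fixed } t < 1 .
\end{align*}
```

The conditional mean is the velocity a deterministic update would follow, and the conditional variance $\sigma^2 / s_t$ is the only source of randomness in the stochastic update. As $\sigma \to 0$ that variance vanishes, so in this simplified setting each stochastic step collapses onto the deterministic one. The derivation does not by itself quantify how the KL gap behaves in that limit.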

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the covariance preservation mechanism generalizes, TM could reduce the number of sampling steps needed in high-dimensional generative tasks.
Local-unimodality regimes suggest TM may excel on data with distinct, separated clusters rather than highly overlapping modes.
Future work could test whether the finite-step KL improvement persists in non-Gaussian unimodal distributions.
The convergence to FM as variance nears zero implies a continuous transition between the two methods.

Load-bearing premise

The sampling dynamics for Gaussian mixtures approximate the unimodal Gaussian case in local-unimodality regimes, with the approximation error decreasing as the minimal distance between component means increases.
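A rough way to see why separation matters is to look at the posterior component weight along the path. The sketch below assumes an equal-weight, two-component isotropic mixture with component standard deviation `sigma` and mean distance `d_min`, the linear path X_t = (1 - t) X_0 + t X_1, and evaluation at the path center of one component. The function name `own_center_weight` and this setup are hypothetical simplifications, not the paper's definition of its local-unimodality regime.

```python
import math


def path_var(t: float, sigma: float) -> float:
    """Per-coordinate variance of X_t given the component (assumed linear path)."""
    return (1.0 - t) ** 2 + (t * sigma) ** 2


def own_center_weight(t: float, d_min: float, sigma: float) -> float:
    """Posterior weight of component k at x = t * mu_k.

    Given component j, X_t ~ N(t * mu_j, path_var * I), so at x = t * mu_k the
    other component's density is smaller by exp(-(t * d_min)^2 / (2 * path_var)).
    """
    log_ratio = -((t * d_min) ** 2) / (2.0 * path_var(t, sigma))
    return 1.0 / (1.0 + math.exp(log_ratio))


if __name__ == "__main__":
    for d_min in (8.0, 45.0):
        weights = [round(own_center_weight(t, d_min, 1.0), 6) for t in (0.05, 0.1, 0.2, 0.5)]
        print(d_min, weights)
```

In this toy setting a weight near 1 means the conditional distribution at that point is dominated by a single component, and a larger `d_min` pushes the weight toward 1 at earlier times. How this translates into a bound on the KL gap is exactly what the premise leaves open.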

What would settle it

Compute the KL divergence after a small finite number of steps for both TM and FM on a standard unimodal Gaussian and check if TM's value is strictly smaller.
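One possible version of this check is a small Monte Carlo experiment. The sketch below is an assumption-laden illustration, not the paper's protocol: it uses a one-dimensional target N(mu, sigma^2), a linear path with independent coupling, an exact Gaussian posterior for the difference latent in the TM step, and a moment-matched Gaussian estimate of KL(generated || target). The names `run_sampler` and `gaussian_kl_estimate` are hypothetical.

```python
import math
import random


def run_sampler(method: str, n_steps: int, mu: float, sigma: float,
                n_samples: int, rng: random.Random) -> list:
    """Draw samples with deterministic ("fm") or stochastic ("tm") updates."""
    h = 1.0 / n_steps
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)  # source N(0, 1)
        for n in range(n_steps):
            t = n * h
            s_t = (1.0 - t) ** 2 + (t * sigma) ** 2
            # Posterior mean of V = X_1 - X_0 given X_t = x.
            v = mu + (t * sigma**2 - (1.0 - t)) / s_t * (x - t * mu)
            if method == "tm":
                # Add exact posterior noise (idealized TM step, an assumption).
                v += sigma / math.sqrt(s_t) * rng.gauss(0.0, 1.0)
            x += h * v
        samples.append(x)
    return samples


def gaussian_kl_estimate(samples: list, mu: float, sigma: float) -> float:
    """KL(N(m, v) || N(mu, sigma^2)) with m, v fitted to the samples."""
    n = len(samples)
    m = sum(samples) / n
    v = sum((x - m) ** 2 for x in samples) / (n - 1)
    ratio = v / sigma**2
    return 0.5 * (ratio + (m - mu) ** 2 / sigma**2 - 1.0 - math.log(ratio))


if __name__ == "__main__":
    rng = random.Random(0)
    for method in ("fm", "tm"):
        xs = run_sampler(method, 2, 3.0, 1.0, 20000, rng)
        print(method, gaussian_kl_estimate(xs, 3.0, 1.0))
```

The moment-matched estimate is reasonable here only because every update is affine in Gaussian variables, so the generated distribution is itself Gaussian. For non-Gaussian targets a different KL estimator would be needed.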

Figures

Figures reproduced from arXiv: 2510.17991 by Jaihoon Kim, Minhyuk Sung, Rajarshi Saha, Youngsuk Park.

**Figure 2.** Qualitative Visualization of Unimodal Gaussian Target. Each panel shows the source $\mathcal{N}(0, I_d)$ and target $\mathcal{N}(\mu, \sigma^2 I_d)$ distributions with the generated samples of FM (left) and TM (right). With a small number of steps ($N = 2$), FM produces samples with reduced variance, whereas TM ($N = 1$, $S = 2$) preserves the target variance.

**Figure 3.** Effect of $D_{\min}$ on $p(V \mid X)$. Visualization of $p(V \mid X)$ using cosine-similarity histograms between difference latent samples $\tilde{V}^{(m)}_{t_n} \sim p(V \mid X_{t_n})$ and $\mathbb{E}[V \mid X_{t_n}]$ for $D_{\min} \in \{8, 45\}$. For $D_{\min} = 8$ (top) the distribution remains multimodal, whereas for $D_{\min} = 45$ (bottom) it concentrates near 1 at earlier $t_n$, indicating unimodality. A larger $D_{\min}$ tightens Cor. 2, so at a fixed $t_n$ the mixture is closer to p(V | X,…

**Figure 4.** KL Divergence for Mixture of Gaussians Target. Transition Matching (TM) shows lower KL divergence than Flow Matching (FM) as the modes are more separated (red curve, larger $D_{\min}$). The inset highlights the region near $N = 8$.

**Figure 5.** Quantitative Evaluation vs. Wall Clock Time. Quantitative comparison of FM (orange) and TM (green) in class-conditioned image generation (left, middle) and frame-conditioned video generation (right), plotted against wall clock time measured in seconds.

**Figure 6.** Unimodal Gaussian KL Divergence against Wall Clock Time. We compare Flow Matching (FM) and Transition Matching (TM) on unimodal Gaussian KL divergence against wall clock time. For TM, we fix $S \in \{1, 2\}$. The inset zooms in on a reference region: at matched wall-clock compute, TM achieves lower KL than FM in the low-step regime.

**Figure 7.** Effect of Target Variance on $p(V \mid X)$. High-dimensional mixture of Gaussians with two variance settings, $\sigma \in \{1.0, 0.001\}$. Each panel shows the histogram of cosine similarities between difference latent samples $\tilde{V}_{t_n} \sim p(V \mid X_{t_n})$ and their expectation $\mathbb{E}[V \mid X_{t_n}]$ at timestep $t_n$. For $\sigma = 1.0$, the difference latent samples form a unimodal distribution with non-negligible variance, whereas for $\sigma = 0.001$, they con…

**Figure 8.** Class-Conditioned Image Generation Qualitative Results.

**Figure 9.** Frame-Conditioned Video Generation Results.

Abstract

Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TM beats FM on finite-step KL for unimodal Gaussians because stochastic updates preserve target covariance while deterministic ones underestimate it, but the mixture extension rests on an unquantified local-unimodality approximation.


The key takeaway is that Transition Matching can strictly outperform Flow Matching in finite-step KL divergence for unimodal Gaussians. The reason is that TM's stochastic latent updates preserve the target's covariance, while FM's deterministic ones fall short. They back this with a proof and show faster convergence under fixed compute. This is new: a finite-step comparison with explicit conditions on mode separation and variance for when TM wins. It moves past just showing better samples to explaining why.

The Gaussian case is handled well. The math is direct, and the controlled experiments confirm the predictions without obvious post-hoc tuning.

The main soft spot is in the Gaussian mixture section. They argue that when modes are far enough apart, the sampling stays in a local-unimodality regime where the unimodal advantage applies. The error shrinks with greater separation, but no rate or bound is given on how the KL gap changes with finite distances. This makes it hard to judge if the advantage holds for typical mixtures or gets lost to mode interactions. The real data experiments on images and video are referenced to show relevance, but details are light so it's difficult to assess how cleanly they isolate the TM-FM difference.

This paper is for people building or analyzing generative samplers who need guidance on choosing between these methods based on the target distribution's properties. A reader interested in theoretical backing for empirical wins will find value here. It deserves peer review. The central Gaussian result is grounded enough to warrant checking the derivations, and the overall question is timely even if the mixture analysis could be tightened with better bounds.

Referee Report

1 major / 1 minor

Summary. The paper investigates when and why Transition Matching (TM) can outperform Flow Matching (FM). For unimodal Gaussian targets it proves that TM attains strictly lower finite-step KL divergence than FM, with the improvement arising from stochastic difference latent updates that preserve target covariance (while deterministic FM underestimates it). Convergence rates are characterized under a fixed compute budget. The analysis is extended to Gaussian mixtures by identifying local-unimodality regimes in which the sampling dynamics approximate the unimodal case, with the approximation error decreasing as the minimal distance between component means increases; TM is therefore favored when modes are well separated and variances are non-negligible. When target variance approaches zero the TM update converges to the FM update. Theoretical claims are supported by controlled Gaussian experiments and extended to image and video generation tasks.

Significance. If the unimodal-Gaussian proof and the local-unimodality characterization hold, the work supplies a concrete mechanistic explanation for the empirical edge of TM and delineates precise regimes (well-separated modes, non-negligible variance) in which the advantage appears. The explicit stochastic-versus-deterministic comparison and the finite-step KL result constitute a clear theoretical contribution that could inform the design of future matching-based generative models.

major comments (1)

Extension to Gaussian mixtures: the statement that sampling dynamics locally approximate the unimodal regime, with approximation error decreasing as minimal mean distance grows, is invoked to extend the finite-step KL advantage. No explicit rate or bound is supplied on how the KL difference scales with mode separation, leaving open whether cross-mode leakage for finite separations erodes or reverses the claimed TM advantage before the large-separation limit is reached. This quantification is load-bearing for the mixture claim.

minor comments (1)

The abstract and experimental section would benefit from explicit reporting of the number of sampling steps, the precise KL values obtained, and the statistical significance of the TM-FM gap on the Gaussian test cases.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review, which highlights both the strengths of our theoretical analysis and an important point for strengthening the mixture extension. We address the major comment below and will incorporate the requested quantification in the revision.


Referee: Extension to Gaussian mixtures: the statement that sampling dynamics locally approximate the unimodal regime, with approximation error decreasing as minimal mean distance grows, is invoked to extend the finite-step KL advantage. No explicit rate or bound is supplied on how the KL difference scales with mode separation, leaving open whether cross-mode leakage for finite separations erodes or reverses the claimed TM advantage before the large-separation limit is reached. This quantification is load-bearing for the mixture claim.

Authors: We agree that the current manuscript provides only a qualitative statement that the approximation error decreases with increasing minimal mean distance, without an explicit rate or bound on the resulting KL divergence difference. This leaves the precise scaling for finite separations unquantified. In the revised manuscript we will add a bound on the difference between the finite-step KL divergences of TM and FM, expressed in terms of the minimal inter-mode distance. The bound will be obtained by controlling the perturbation to the velocity field induced by neighboring modes via a Lipschitz assumption on the score and an exponential decay of the cross-term contributions, thereby clarifying the separation threshold at which the TM advantage is preserved.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via explicit update-rule comparison


The paper derives the TM advantage for unimodal Gaussians by directly comparing the stochastic difference latent updates (which preserve target covariance) against deterministic FM updates (which underestimate it), yielding a strict finite-step KL inequality. Convergence rates under fixed compute are then characterized from the same dynamics. Extension to mixtures invokes local-unimodality approximation whose error decreases with mode separation, without any reduction of the central claim to a fitted parameter, self-citation chain, or ansatz smuggled from prior work. No equation or step is shown to equal its input by construction, and the analysis remains independent of the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard properties of KL divergence for Gaussians and on a domain assumption that local sampling dynamics in separated-mode mixtures behave like the unimodal case. No free parameters or new invented entities appear in the abstract.

axioms (2)

standard math: KL divergence between finite-step sampling trajectories and the target Gaussian can be compared exactly using covariance preservation properties.
Invoked directly in the unimodal Gaussian proof.
domain assumption: Local sampling dynamics around each mode in a Gaussian mixture approximate the unimodal Gaussian dynamics when component means are sufficiently separated.
Used when extending the KL comparison to mixtures.
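For reference, the first axiom rests on the standard closed form for the KL divergence between two Gaussians, reproduced here as background. The specialization to equal means and isotropic covariances is an illustrative simplification chosen for this note, not a formula quoted from the paper.

```latex
\begin{align*}
\mathrm{KL}\bigl(\mathcal{N}(m_0, \Sigma_0)\,\|\,\mathcal{N}(m_1, \Sigma_1)\bigr)
 &= \tfrac{1}{2}\Bigl[\operatorname{tr}\bigl(\Sigma_1^{-1}\Sigma_0\bigr)
   + (m_1 - m_0)^{\top}\Sigma_1^{-1}(m_1 - m_0)
   - d + \log\frac{\det\Sigma_1}{\det\Sigma_0}\Bigr], \\
\mathrm{KL}\bigl(\mathcal{N}(m, v I_d)\,\|\,\mathcal{N}(m, \sigma^2 I_d)\bigr)
 &= \frac{d}{2}\Bigl[\frac{v}{\sigma^2} - 1 - \log\frac{v}{\sigma^2}\Bigr].
\end{align*}
```

The second line is zero only when $v = \sigma^2$ and grows as $v / \sigma^2$ moves away from 1 in either direction, so any sampler that shrinks the generated variance below the target variance pays a strictly positive KL cost.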

pith-pipeline@v0.9.0 · 5776 in / 1358 out tokens · 50065 ms · 2026-05-22T13:23:31.379356+00:00 · methodology


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 4 internal anchors

[1] Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Goku: Flow based video generative foundation models. arXiv preprint arXiv:2502.04896.

[2] Ruiqi Gao, Emiel Hoogeboom, Jonathan Heek, Valentin De Bortoli, Kevin P Murphy, and Tim Salimans. Diffusion meets flow matching: Two sides of the same coin. 2024. URL https://diffusionflow.github.io

[3] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603.

[4] Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, et al. TripoSG: High-fidelity 3D shape synthesis using large-scale rectified flow models. arXiv preprint arXiv:2502.06608.

[5] Gleb Ryzhakov, Svetlana Pavlova, Egor Sevriugov, and Ivan Oseledets. Explicit flow matching: On the theory of flow matching algorithms with applications. arXiv preprint arXiv:2402.03232.

[6] Christopher Scarvelis, Haitz Sáez de Ocáriz Borde, and Justin Solomon. Closed-form diffusion models. arXiv preprint arXiv:2310.12395.

[7] Neta Shaul, Uriel Singer, Itai Gat, and Yaron Lipman. Transition matching: Scalable and flexible generative modeling. arXiv preprint arXiv:2506.23589.

[8] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717.

[9] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314.

[10] Yilun Xu, Shangyuan Tong, and Tommi Jaakkola. Stable target field for reduced variance score estimation in diffusion models. arXiv preprint arXiv:2302.00670.
[11] $\overset{(i)}{=} \mathbb{E}_{X \sim q}[\log q(X) - \log p_1(X)] = \mathbb{E}_{X \sim q}[\log q(X) - \log \phi_k(X)] - \log \pi_k + \mathbb{E}_{X \sim q}[\log w(X, k)] = \mathrm{KL}(q \,\|\, \phi_k) + \log \frac{1}{\pi_k} + \Delta$, (71) where (i) follows from the definition of KL divergence, and $\Delta$ denotes $\mathbb{E}_{X \sim q}[\log w(X, k)] \in [\log(1 - \epsilon), 0]$. The bounds on $\Delta$ are obtained by noting that $1 - \epsilon \le w(x, k) \le 1$, completing the proof. C Related Work. Diffusion and Flow Models. Diffusion models Sohl...

[12] into a latent representation of size 16×16×16. Frame-Conditioned Video Generation. For video generation, we adopt the History-Guided Diffusion framework (Song et al., 2025), which utilizes 3D DiT blocks. Each video frame is encoded by a pretrained VAE into a latent tensor of shape 16 × 16 × 16, and th...

[13] The leftmost column shows the conditioning frame used for video generation, and the two groups on the right show three frames generated by FM and TM, respectively. Videos produced by FM often exhibit artifacts, including missing content from the conditioning frame (e.g., the presenter hand in row 1 and the right leg of the baby in row 2), whereas TM prese...
