Efficient Image Synthesis with Sphere Latent Encoder

Hao Li; Thuan Hoang Nguyen; Tung Do

arxiv: 2605.15592 · v1 · pith:VHUVX7CWnew · submitted 2026-05-15 · 💻 cs.CV

Pith reviewed 2026-05-20 19:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords: few-step image generation · spherical latent space · decoupled framework · image synthesis · latent denoising · pretrained encoder · efficiency improvement

The pith

Decoupling a pretrained image encoder from a spherical latent denoiser enables efficient few-step image synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors aim to address the computational inefficiency and conflicting objectives in the Sphere Encoder method for few-step image generation. They do this by freezing a pretrained image encoder and training a separate denoising model entirely within the spherical latent space. This change removes the need for repeated switches between pixel and latent spaces. A reader should care if this leads to faster inference and better image quality without sacrificing stability, which would make few-step generative models more usable in practice.

Core claim

The central claim is that decoupling reconstruction and generation by using a fixed pretrained encoder and training the denoiser only in latent space overcomes the limitations of joint optimization in a single architecture, resulting in superior performance on Animal-Faces, Oxford-Flowers, and ImageNet-1K in terms of quality and speed compared to Sphere Encoder.

What carries the argument

The separate spherical latent denoising model that operates entirely in latent space after a one-time encoding by the fixed pretrained image encoder.

If this is right

Generation quality improves significantly on the three evaluated datasets while inference becomes faster.
Repeated pixel-space operations are eliminated during both training and inference.
Reconstruction and generation can specialize without objective conflict.
Results remain competitive with leading few-step and multi-step baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This approach may generalize to other latent-based generative models by allowing independent scaling of the denoiser.
Applications in resource-constrained environments could benefit from the reduced computational overhead.
Testing on additional datasets or higher resolutions would further validate the efficiency gains.

Load-bearing premise

A fixed pretrained image encoder plus a separately trained spherical latent denoiser can fully replace joint optimization of reconstruction and generation without new quality or stability trade-offs.

What would settle it

An experiment where the decoupled method fails to outperform Sphere Encoder in both FID scores and sampling speed on the reported datasets would disprove the main result.
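The FID criterion named above can be made concrete. As a minimal sketch, assuming feature means and covariances have already been extracted (in standard FID they come from an Inception network, which is not reproduced here), the Fréchet distance between two Gaussians is the squared mean difference plus a covariance term. The function name `frechet_distance` is illustrative and does not come from the paper.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    sigma1, sigma2 = np.asarray(sigma1, dtype=float), np.asarray(sigma2, dtype=float)
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2. For positive semi-definite inputs these
    # eigenvalues are real and non-negative up to numerical noise, so the
    # imaginary parts are dropped and small negatives are clipped.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

# Toy statistics (made up for illustration, not results from the paper).
same = frechet_distance(np.zeros(3), np.eye(3), np.zeros(3), np.eye(3))
shifted = frechet_distance(np.zeros(3), np.eye(3), np.array([1.0, 2.0, 2.0]), np.eye(3))
scaled = frechet_distance(np.zeros(2), np.diag([4.0, 1.0]), np.zeros(2), np.eye(2))
```

Lower values mean the two feature distributions are closer, so the test described above amounts to comparing this quantity, together with measured sampling time, for both methods on each dataset.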

Figures

Figures reproduced from arXiv: 2605.15592 by Hao Li, Thuan Hoang Nguyen, Tung Do.

**Figure 2.** Comparison between Sphere Encoder and our method.

**Figure 3.** Overview of our Sphere Latent Encoder framework and training objectives.

**Figure 4.** Qualitative comparison between Sphere Encoder [

**Figure 5.** Increasing inference steps improves fidelity (a), while stronger semantic representations

**Figure 6.** Qualitative comparison between different number of sampling steps on ImageNet-1K [

**Figure 7.** 4-NFE Generation Results. Examples of class-conditional generation on ImageNet 256 × 256 using our 4-NFE model.

**Figure 8.** 4-NFE Generation Results. Examples of class-conditional generation on ImageNet 256 × 256 using our 4-NFE model.

**Figure 9.** 6-NFE Generation Results. Examples of class-conditional generation on ImageNet 256 × 256.

**Figure 10.** 6-NFE Generation Results. Examples of class-conditional generation on ImageNet 256 × 256.

Original abstract

Few-step image generation has seen rapid progress, with consistency and meanflow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between the pixel space and latent space during inference while jointly optimizing reconstruction and generation within a single architecture. This design leads to computational inefficiency and objective conflict between reconstruction and generation. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, improving efficiency and allowing reconstruction and generation to specialize independently. On Animal-Faces, Oxford-Flowers and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Decoupling the frozen encoder from the spherical denoiser is a clean efficiency fix, but the quality claims are hard to judge until the missing metrics are reported.

The paper's main contribution is taking the Sphere Encoder setup and explicitly splitting it: a fixed pretrained image encoder stays frozen while a separate model handles denoising entirely inside spherical latent space. This removes the repeated pixel-to-latent round trips and lets reconstruction and generation objectives stop competing inside one network. That separation is presented as a direct answer to the inefficiency and instability the authors flag in the joint version, and it is the clearest new piece here.
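The round-trip argument can be illustrated with a toy call counter. This is only a sketch under stated assumptions: `encode`, `decode`, and `denoise` are hypothetical stand-ins rather than the paper's networks, and the joint-style loop assumes that each sampling step decodes to pixels and re-encodes, which is one reading of the "repeated transitions" the abstract describes.

```python
import numpy as np

def make_ops():
    """Toy stand-ins that count how often each component is called."""
    calls = {"encode": 0, "decode": 0, "denoise": 0}

    def spherify(z):
        # Project onto the unit sphere (placeholder for the real spherification).
        return z / np.linalg.norm(z)

    def encode(x):  # hypothetical pixel -> latent map
        calls["encode"] += 1
        return spherify(x[:8])

    def decode(z):  # hypothetical latent -> pixel map
        calls["decode"] += 1
        return np.tile(z, 4)

    def denoise(z):  # hypothetical latent -> latent denoiser
        calls["denoise"] += 1
        return spherify(0.9 * z + 0.1)

    return calls, encode, decode, denoise

def sample_joint(z, steps):
    """Assumed joint-style loop: every step leaves and re-enters latent space."""
    calls, encode, decode, _ = make_ops()
    for _ in range(steps - 1):
        z = encode(decode(z))
    return decode(z), calls

def sample_decoupled(z, steps):
    """Decoupled loop: all steps stay in latent space, then one final decode."""
    calls, _, decode, denoise = make_ops()
    for _ in range(steps):
        z = denoise(z)
    return decode(z), calls

rng = np.random.default_rng(0)
z0 = rng.normal(size=8)
z0 /= np.linalg.norm(z0)
_, joint_calls = sample_joint(z0, steps=4)
_, decoupled_calls = sample_decoupled(z0, steps=4)
```

Under these assumptions, four joint-style steps touch pixel space seven times (four decodes and three encodes), while the decoupled loop touches it once. Whether that translates into wall-clock savings depends on the relative cost of the real denoiser and decoder, which this toy does not model.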

Referee Report

2 major / 2 minor

Summary. The paper introduces an efficient few-step image synthesis method called Sphere Latent Encoder. It decouples the original Sphere Encoder into a fixed pretrained image encoder and a separately trained latent denoiser operating entirely in spherical latent space. This design eliminates repeated pixel-to-latent transitions during training and inference, resolves objective conflicts between reconstruction and generation, and reportedly yields higher quality and faster inference than the joint Sphere Encoder baseline while remaining competitive with other few-step and multi-step methods on Animal-Faces, Oxford-Flowers, and ImageNet-1K.

Significance. If the quantitative claims are supported by rigorous metrics and ablations, the decoupled spherical-latent approach could provide a practical route to stable, scalable few-step generation that avoids the training instabilities of consistency and mean-flow models. The separation of concerns is conceptually clean and could generalize to other latent-space generative frameworks.

major comments (2)

[Abstract and §3 (Framework)] The central claim that a fixed pretrained encoder plus separate spherical denoiser fully replaces joint optimization without quality or stability trade-offs is load-bearing yet unsupported by any reported comparison of latent-space statistics, reconstruction error, or spherical coverage between the pretrained latents and those obtained under joint training. An ablation fixing the encoder versus fine-tuning it on the target datasets is required to substantiate the assumption.
[§4 (Experiments)] No numerical results, FID scores, CLIP scores, or wall-clock inference times are referenced in the provided text despite the strong comparative claims against Sphere Encoder and other baselines. Tables or figures reporting these quantities on all three datasets are necessary to evaluate the magnitude and consistency of the reported gains.

minor comments (2)

[§2] Notation for the spherical latent space (e.g., definition of the sphere radius or normalization) should be introduced explicitly in §2 before its use in the denoising objective.
[§3] The description of the separate denoiser architecture would benefit from a diagram or explicit comparison to the original Sphere Encoder's joint architecture to clarify the efficiency gains.
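On the first minor comment, the radius follows from the normalization once it is written down. A minimal sketch, assuming the spherification is a gain-free RMS normalization of the flattened latent (a paper passage quoted on this page mentions RMSNorm, but the exact form used by the authors is an assumption here): a vector with unit root-mean-square over d entries has Euclidean norm sqrt(d). The function name `spherify` and the latent shape are illustrative.

```python
import numpy as np

def spherify(z, eps=1e-8):
    """Flatten each latent and RMS-normalize it (no learned gain)."""
    flat = z.reshape(z.shape[0], -1)
    rms = np.sqrt(np.mean(flat ** 2, axis=1, keepdims=True) + eps)
    return flat / rms

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8, 16, 16))   # hypothetical batch of latents
s = spherify(z)
d = s.shape[1]                        # 8 * 16 * 16 = 2048
norms = np.linalg.norm(s, axis=1)     # each is approximately sqrt(d)
```

Under this assumption every latent lies on a sphere of radius sqrt(d) rather than on the unit sphere, which is the kind of convention the comment asks the authors to state explicitly.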

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas where additional evidence would strengthen the paper. We address each major comment below and commit to revisions that incorporate the requested analyses and clarifications.

Referee: [Abstract and §3 (Framework)] The central claim that a fixed pretrained encoder plus separate spherical denoiser fully replaces joint optimization without quality or stability trade-offs is load-bearing yet unsupported by any reported comparison of latent-space statistics, reconstruction error, or spherical coverage between the pretrained latents and those obtained under joint training. An ablation fixing the encoder versus fine-tuning it on the target datasets is required to substantiate the assumption.

Authors: We agree that an explicit ablation comparing the fixed pretrained encoder to a fine-tuned version would provide stronger support for the decoupling assumption. In the revised manuscript we will add this ablation on Animal-Faces and Oxford-Flowers, reporting reconstruction error, latent-norm statistics, and spherical coverage metrics for both the fixed and fine-tuned encoders. We note that full joint training on ImageNet-1K is computationally prohibitive, which is precisely why the decoupled design was introduced; the smaller-dataset ablation will still allow direct assessment of any quality trade-off.

Revision: yes

Referee: [§4 (Experiments)] No numerical results, FID scores, CLIP scores, or wall-clock inference times are referenced in the provided text despite the strong comparative claims against Sphere Encoder and other baselines. Tables or figures reporting these quantities on all three datasets are necessary to evaluate the magnitude and consistency of the reported gains.

Authors: We apologize that the numerical results were not clearly cross-referenced in the text provided to the referee. The full manuscript already contains Table 1 with FID scores on Animal-Faces, Oxford-Flowers, and ImageNet-1K and Table 2 with wall-clock inference times. We will revise §4 to explicitly cite these tables whenever comparative claims are made and will add CLIP scores if they are not already present.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external empirical comparisons

The paper's contribution is an architectural proposal to decouple a fixed pretrained image encoder from a separately trained spherical latent denoiser, addressing claimed inefficiencies in the prior Sphere Encoder. All performance assertions are grounded in direct comparisons against baselines on external datasets (Animal-Faces, Oxford-Flowers, ImageNet-1K) rather than any internal derivation, equation, or fitted quantity that reduces to the inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided description, making the method self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; therefore free-parameter and axiom counts are necessarily incomplete and marked low-confidence.

axioms (1)

domain assumption: Spherical latent space supports stable few-step denoising when the encoder is held fixed.
Invoked when the abstract states that training occurs entirely in spherical latent space after decoupling.

pith-pipeline@v0.9.0 · 5693 in / 1267 out tokens · 44965 ms · 2026-05-20T19:28:00.472432+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: "we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space... spherification function F first flattens z and projects it onto a hypersphere via RMSNorm"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Reconstruction loss... Consistency loss... Noise Distribution... LogNorm"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 12 internal anchors

[1] Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023.
[2] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8188–8197, 2020.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[4] Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting. arXiv preprint arXiv:2602.04770, 2026.
[5] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
[6] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st Internatio...
[7] Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557, 2024.
[8] Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025.
[9] Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J Zico Kolter, and Kaiming He. Improved mean flows: On the challenges of fastforward generative models. arXiv preprint arXiv:2512.02012, 2025.
[10] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
[12] Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, and Stefano Ermon. Cmt: Mid-training for efficient learning of consistency, mean flow, and flow map models. arXiv preprint arXiv:2509.24526, 2025.
[13] Zheyuan Hu, Chieh-Hsin Lai, Ge Wu, Yuki Mitsufuji, and Stefano Ermon. Meanflow transformers with representation autoencoders. arXiv preprint arXiv:2511.13019, 2025.
[14] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9307–9315, 2024.
[15] Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Liuzhuozheng Li, Hengzhuang Li, Xin Jin, David Liu, Changsheng Lu, Zhen Li, et al. Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649, 2025.
[16] Youngjoong Kim, Duhoe Kim, Woosung Kim, and Jaesik Park. Stabilizing consistency training: A flow map analysis and self-distillation. arXiv preprint arXiv:2601.22679, 2026.
[17] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[18] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
[19] Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, and Stefano Ermon. The principles of diffusion models. arXiv preprint arXiv:2510.21890, 2025.
[20] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025.
[21] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In Advances in Neural Information Processing Systems, volume 36, 2023.
[22] Hangyu Liu, Jianyong Wang, and Yutao Sun. Geometric autoencoder for diffusion models. arXiv preprint arXiv:2603.10365, 2026.
[23] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR) 2023, 2023.
[24] Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024.
[25] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024.
[26] Thuan Hoang Nguyen and Anh Tran. Swiftbrush: One-step text-to-image diffusion model with variational score distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7807–7816, 2024.
[27] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008.
[28] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[29] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.
[30] Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 55876–55905. PMLR, 2025.
[31] Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.
[32] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
[33] Qwen-image technical report, 2025. Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...
[34] Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467, 2025.
[35] Tongda Xu, Mingwei He, Shady Abu-Hussein, Jose Miguel Hernandez-Lobato, Haotian Zhang, Kai Zhao, Chao Zhou, Ya-Qin Zhang, and Yan Wang. Making reconstruction fid predictive of diffusion generation fid. arXiv preprint arXiv:2603.05630, 2026.
[36] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025.
[37] Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024.
[38] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024.
[39] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In International Conference on Learning Representations, 2025.
[40] Kaiyu Yue, Menglin Jia, Ji Hou, and Tom Goldstein. Image generation with a sphere encoder. arXiv preprint arXiv:2602.15030, 2026.
[41] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019.
[42] Huijie Zhang, Aliaksandr Siarohin, Willi Menapace, Michael Vasilkovsky, Sergey Tulyakov, Qing Qu, and Ivan Skorokhodov. Alphaflow: Understanding and improving meanflow models. arXiv preprint arXiv:2510.20771, 2025.
[43] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

Appendix for Efficient Image Synthesis with Sphere Latent Encoder

A Implementation

Table 4: Configurations on different datasets. dataset Animal-Faces [2] Oxford-Flowers [27] ImageNet-1K [3] model configurat...

[1] [1]

Building normalizing flows with stochastic interpolants

Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InThe Eleventh International Conference on Learning Representations, 2023

work page 2023

[2] [2]

Stargan v2: Diverse image synthesis for multiple domains

Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8188–8197, 2020

work page 2020

[3] [3]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

work page 2009

[4] [4]

Generative Modeling via Drifting

Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting.arXiv preprint arXiv:2602.04770, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

work page 2021

[6] [6]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the 41st Internatio...

work page 2024

[7] [7]

One Step Diffusion via Shortcut Models

Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Mean Flows for One-step Generative Modeling

Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Improved Mean Flows: On the Challenges of Fastforward Generative Models

Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J Zico Kolter, and Kaiming He. Improved mean flows: On the challenges of fastforward generative models.arXiv preprint arXiv:2512.02012, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

work page 2017

[11] [11]

Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020

[12] [12]

Cmt: Mid-training for efficient learning of consistency, mean flow, and flow map models.arXiv preprint arXiv:2509.24526, 2025

Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, and Stefano Ermon. Cmt: Mid-training for effi- cient learning of consistency, mean flow, and flow map models.arXiv preprint arXiv:2509.24526, 2025

work page arXiv 2025

[13] [13]

Meanflow trans- formers with representation autoencoders.arXiv preprint arXiv:2511.13019, 2025

Zheyuan Hu, Chieh-Hsin Lai, Ge Wu, Yuki Mitsufuji, and Stefano Ermon. Meanflow trans- formers with representation autoencoders.arXiv preprint arXiv:2511.13019, 2025

work page arXiv 2025

[14]

Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking FID: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9307–9315, 2024

[15]

Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Liuzhuozheng Li, Hengzhuang Li, Xin Jin, David Liu, Changsheng Lu, Zhen Li, et al. Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649, 2025

[16]

Youngjoong Kim, Duhoe Kim, Woosung Kim, and Jaesik Park. Stabilizing consistency training: A flow map analysis and self-distillation. arXiv preprint arXiv:2601.22679, 2026

[17]

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013

[18]

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

[19]

Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, and Stefano Ermon. The principles of diffusion models. arXiv preprint arXiv:2510.21890, 2025

[20]

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. REPA-E: Unlocking VAE for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025

[21]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In Advances in Neural Information Processing Systems, volume 36, 2023

[22]

Hangyu Liu, Jianyong Wang, and Yutao Sun. Geometric autoencoder for diffusion models. arXiv preprint arXiv:2603.10365, 2026

[23]

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR) 2023, 2023

[24]

Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024

[25]

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024

[26]

Thuan Hoang Nguyen and Anh Tran. SwiftBrush: One-step text-to-image diffusion model with variational score distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7807–7816, 2024

[27]

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008

[28]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

[29]

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

[30]

Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 55876–55905. PMLR, 2025

[31]

Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023

[32]

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023

[33]

Qwen-image technical report, 2025

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...


[34]

Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467, 2025

[35]

Tongda Xu, Mingwei He, Shady Abu-Hussein, Jose Miguel Hernandez-Lobato, Haotian Zhang, Kai Zhao, Chao Zhou, Ya-Qin Zhang, and Yan Wang. Making reconstruction FID predictive of diffusion generation FID. arXiv preprint arXiv:2603.05630, 2026

[36]

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025

[37]

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

[38]

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

[39]

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In International Conference on Learning Representations, 2025

[40]

Kaiyu Yue, Menglin Jia, Ji Hou, and Tom Goldstein. Image generation with a sphere encoder. arXiv preprint arXiv:2602.15030, 2026

[41]

Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in neural information processing systems, 32, 2019

[42]

Huijie Zhang, Aliaksandr Siarohin, Willi Menapace, Michael Vasilkovsky, Sergey Tulyakov, Qing Qu, and Ivan Skorokhodov. AlphaFlow: Understanding and improving MeanFlow models. arXiv preprint arXiv:2510.20771, 2025

[43]

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025

Appendix for Efficient Image Synthesis with Sphere Latent Encoder

A Implementation

Table 4: Configurations on different datasets.

dataset Animal-Faces [2] Oxford-Flowers [27] ImageNet-1K [3] model configurat...