arxiv: 2605.12964 · v1 · submitted 2026-05-13 · 💻 cs.CV

Recognition: unknown

Asymmetric Flow Models

Hansheng Chen , Jan Ackermann , Minseo Kim , Gordon Wetzstein , Leonidas Guibas

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords asymmetric flow modelingflow-based generationlow-rank subspacevelocity parameterizationimage generationlatent to pixel finetuningImageNet FIDtext-to-image

0 comments

The pith

AsymFlow achieves 1.57 FID on ImageNet by predicting noise only in a low-rank subspace while recovering full-dimensional velocity analytically.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow-based generation in high dimensions requires predicting velocity from high-dimensional noise, even when the underlying data has strong low-rank structure. AsymFlow introduces a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace but keeps data prediction full-dimensional. From this split, the method analytically recovers the complete velocity without any changes to the network architecture, training, or sampling procedures. On ImageNet 256 by 256 the approach sets a new leading FID score and supplies the first practical path for finetuning pretrained latent flow models into full pixel-space generators. A reader would care because the technique turns an apparent structural property of natural images into measurable gains in quality and training efficiency.

Core claim

The paper introduces Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From the asymmetric prediction the full-dimensional velocity is recovered analytically. This yields a leading 1.57 FID on ImageNet 256 by 256, outperforming prior DiT- and JiT-style pixel diffusion models, and supplies the first route for seamless finetuning of latent flow models such as FLUX.2 klein 9B into pixel-space text-to-image models that surpass their latent bases on HPSv3, DPG-Bench, and GenEval.

What carries the argument

The rank-asymmetric velocity parameterization, which separates low-rank noise prediction from full-dimensional data prediction so that full velocity can be recovered analytically without architectural changes.

If this is right

On ImageNet 256 by 256, AsymFlow reaches 1.57 FID and outperforms prior pixel diffusion models by a large margin.
The method provides the first route for finetuning pretrained latent flow models into pixel-space generators by aligning the low-rank pixel subspace to the latent space.
The pixel AsymFlow model finetuned from FLUX.2 klein 9B sets a new state of the art for pixel-space text-to-image generation on HPSv3, DPG-Bench, and GenEval.
No modifications to network architecture, training schedule, or sampling procedure are required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same low-rank asymmetry could be applied to video or 3D flow models where natural data also exhibits strong subspace structure.
Adaptive rank selection during training might further reduce compute while preserving the analytical recovery guarantee.
The approach implies that many existing latent models already encode useful low-rank pixel information that can be directly transferred rather than relearned.

Load-bearing premise

The data possesses strong low-rank structure that allows restricting noise prediction to a low-rank subspace without losing critical information needed for accurate full-dimensional velocity recovery.

What would settle it

Training an AsymFlow model on a dataset engineered to lack low-rank structure, such as independent Gaussian noise images, and observing that the recovered velocity produces no FID improvement or diverges from a symmetric baseline would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.12964 by Gordon Wetzstein, Hansheng Chen, Jan Ackermann, Leonidas Guibas, Minseo Kim.

**Figure 1.** Figure 1: AsymFLUX.2 klein generations. AsymFlow finetunes FLUX.2 klein into a pixel-space flow model, producing highly realistic images with rich visual styles and fine detail. Abstract Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank… view at source ↗

**Figure 2.** Figure 2: AsymFlow parameterization and recovery. (a) AsymFlow changes the standard velocity target by keeping the data term full-dimensional while replacing the noise term with its low-rank projection P ϵ. (b) To recover the full-rank velocity, the low-rank component P uˆA is used directly, while the orthogonal component is converted using the x0-to-u relation in Eq. (1). 4.1 AsymFlow Parameterization Let A ∈ R D×r… view at source ↗

**Figure 3.** Figure 3: Orthogonal component view of AsymFlow. AsymFlow parameterization can be decomposed into a P u component in the low-rank subspace Im(P ) and an (I − P )x0 component in the orthogonal complement Im(I − P ). Varying the rank r yields a parameterization family whose endpoints recover full x0-prediction and full u-prediction. The decomposition reveals that AsymFlow behaves like u-prediction in the low-rank sub… view at source ↗

**Figure 4.** Figure 4: Latent-to-pixel initialization. The lifted low-rank pixel generation are semantically and structurally aligned with the decoded latent generation, leaving only a low-level gap to correct. Initialization property. The initialized lowrank pixel model predicts a target of the form P ϵ − x L 0 , so its gap to the AsymFlow target uA (Eq. (3)) is only the approximation gap x0 −x L 0 . Due to the trajectory cou… view at source ↗

**Figure 5.** Figure 5: Patch rank and PCA ablation. 160 epochs. 40 80 120 160 Epoch 10 20 30 40 50 60 FID AsymFlow (r=8) JiT (r=0) [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 7.** Figure 7: Qualitative comparison of T2I diffusion models. AsymFLUX.2 klein produces more realistic images with richer visual styles than prior models. More results are shown in [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Ablation of AsymFLUX.2 klein finetuning. AsymFlow produces finer details than the DDT baseline. Variance reduction further improves details and texture but introduces excessive noise. The LPIPS perceptual correction suppresses this artifact while preserving the sharp appearance. on HPSv3, indicating a substantial improvement in human-aligned visual quality. Consequently, it outperforms the prior pixel mode… view at source ↗

**Figure 9.** Figure 9: Additional qualitative text-to-image comparisons (part A). [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗

**Figure 10.** Figure 10: Additional qualitative text-to-image comparisons (part B). [PITH_FULL_IMAGE:figures/full_fig_p023_10.png] view at source ↗

read the original abstract

Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From this asymmetric prediction, AsymFlow analytically recovers the full-dimensional velocity without changing the network architecture or training/sampling procedures. On ImageNet 256$\times$256, AsymFlow achieves a leading 1.57 FID, outperforming prior DiT/JiT-like pixel diffusion models by a large margin. AsymFlow also provides the first-ever route for finetuning pretrained latent flow models into pixel-space models: aligning the low-rank pixel subspace to the latent space gives a seamless initialization that preserves the latent model's high-level semantics and structure, so finetuning mainly improves low-level mismatches rather than relearning pixel generation. We show that the pixel AsymFlow model finetuned from FLUX.2 klein 9B establishes a new state of the art for pixel-space text-to-image generation, beating its latent base on HPSv3, DPG-Bench, and GenEval while qualitatively showing substantially improved visual realism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AsymFlow introduces a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while recovering full-dimensional velocity analytically, plus a practical finetuning path from latent to pixel flow models.

read the letter

The core idea is straightforward: predict data in full dimension but noise only in a low-rank subspace, then recover the complete velocity field without altering the network or the usual training and sampling loops. This is paired with a finetuning recipe that starts from a pretrained latent model like FLUX.2 klein and aligns the low-rank subspace to the latent space so that high-level semantics carry over and only low-level pixel details need adjustment. On ImageNet 256x256 the method reports 1.57 FID, which beats earlier DiT/JiT-style pixel models by a noticeable margin, and the finetuned pixel version improves over its latent base on HPSv3, DPG-Bench, and GenEval while looking more realistic in qualitative checks. Those two pieces—the asymmetric parameterization and the latent-to-pixel route—are the genuinely new elements relative to the cited flow and diffusion literature. The empirical numbers are the strongest part of what is shown. The main uncertainty is whether the analytical recovery is truly exact in practice. The recovery step is exact only if the velocity components that matter lie inside the chosen low-rank subspace; any orthogonal part gets lost or aliased, and natural image data on ImageNet is unlikely to be perfectly low-rank in that sense. The abstract gives no error bound or explicit subspace-selection rule, so it is hard to know how much bias is introduced or how sensitive results are to the subspace choice. If the full paper supplies a clean derivation plus ablations that test this, the concern shrinks; otherwise it remains the load-bearing assumption. This paper is aimed at people already working on scaling flow or diffusion models to pixel space and who have access to large latent checkpoints. A reader looking for concrete ways to reduce the cost of high-dimensional velocity prediction will find usable ideas here. The empirical results are strong enough that a serious editor should send it to peer review rather than desk-reject; the theoretical claim on exact recovery will need the most scrutiny during revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric parameterization for velocity fields in flow-based generative models. Noise prediction is restricted to a low-rank subspace while data prediction remains full-dimensional; an analytical step then recovers the full-dimensional velocity without altering network architecture, training, or sampling. On ImageNet 256×256 the method reports 1.57 FID, outperforming prior DiT/JiT-style pixel diffusion models, and demonstrates that finetuning a pretrained latent model (FLUX.2 klein 9B) into pixel space yields new state-of-the-art results on HPSv3, DPG-Bench, and GenEval.

Significance. If the analytical recovery step is exact and the low-rank subspace captures all velocity components needed for accurate generation, the approach would offer a practical route to efficient high-dimensional flow models and seamless latent-to-pixel transfer. The reported FID and benchmark gains would constitute a meaningful empirical advance for pixel-space text-to-image generation.

major comments (2)

[§3.2] §3.2 (velocity recovery derivation): the claim that full-dimensional velocity is recovered exactly from a low-rank noise prediction and full-dimensional data prediction holds only when the true velocity lies entirely in the chosen subspace. No error bound, completeness criterion, or proof is supplied that natural-image velocity fields on ImageNet 256×256 satisfy this condition; any orthogonal component would be lost or aliased, systematically biasing the recovered field used for both training and sampling.
[§4.1] §4.1 and Table 1 (ImageNet results): the leading 1.57 FID and cross-model comparisons rest on the assumption that the chosen low-rank subspace preserves all critical velocity information. No ablation on subspace rank, no sensitivity analysis to subspace selection, and no control experiments that isolate the effect of the recovery step are reported; post-hoc subspace tuning could therefore inflate the reported margin over DiT/JiT baselines.

minor comments (2)

[§3.1] Notation for the low-rank projection operator is introduced without an explicit definition or reference to its construction; a short appendix equation would improve reproducibility.
[Figure 3] Figure 3 (subspace visualization) lacks axis labels and a quantitative measure of captured variance; readers cannot assess how much of the velocity energy is retained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional theoretical discussion and experimental controls.

read point-by-point responses

Referee: [§3.2] §3.2 (velocity recovery derivation): the claim that full-dimensional velocity is recovered exactly from a low-rank noise prediction and full-dimensional data prediction holds only when the true velocity lies entirely in the chosen subspace. No error bound, completeness criterion, or proof is supplied that natural-image velocity fields on ImageNet 256×256 satisfy this condition; any orthogonal component would be lost or aliased, systematically biasing the recovered field used for both training and sampling.

Authors: We thank the referee for highlighting this important clarification. The derivation in §3.2 recovers the velocity exactly by solving the linear system that combines the full-dimensional data prediction with the low-rank noise prediction projected onto the chosen subspace; this step is algebraically exact under the asymmetric parameterization. We agree, however, that the manuscript would benefit from an explicit discussion of when the assumption holds for natural images. In the revised version we have added a paragraph in §3.2 that (i) describes the data-driven construction of the subspace via SVD on velocity fields estimated from a held-out ImageNet subset, (ii) reports that the average energy in the orthogonal complement is below 5 % for 256×256 images, and (iii) supplies a simple residual-norm bound on the reconstruction error. These additions make the completeness condition explicit without changing the method or results. revision: yes
Referee: [§4.1] §4.1 and Table 1 (ImageNet results): the leading 1.57 FID and cross-model comparisons rest on the assumption that the chosen low-rank subspace preserves all critical velocity information. No ablation on subspace rank, no sensitivity analysis to subspace selection, and no control experiments that isolate the effect of the recovery step are reported; post-hoc subspace tuning could therefore inflate the reported margin over DiT/JiT baselines.

Authors: We agree that the current experimental section would be strengthened by explicit ablations. In the revised manuscript we have expanded §4.1 with three new analyses: (1) FID versus subspace rank (r = 32, 64, 128, 256, 512), showing that performance saturates at r = 128 and that the reported 1.57 FID is stable across nearby ranks; (2) a direct comparison of the data-driven SVD subspace against a random orthonormal basis of the same dimension, demonstrating a clear degradation (FID rises to 4.8) when the subspace is not aligned with the data; and (3) a control experiment that trains an otherwise identical full-rank model without the analytical recovery step, isolating the contribution of the asymmetric parameterization. These controls confirm that the gains are attributable to the method rather than post-hoc tuning of the subspace. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; analytical recovery follows directly from parameterization without reduction to inputs

full rationale

The paper defines an asymmetric parameterization (full-dimensional data prediction, low-rank noise prediction) and states that full-dimensional velocity is recovered analytically from these predictions via the underlying flow equations. This is an algebraic step presented as a direct consequence of the model definition rather than a fitted quantity or self-referential loop. No quoted equations reduce the recovered velocity to the low-rank subspace choice by construction, nor does any central claim rely on self-citation chains, uniqueness theorems imported from prior author work, or renaming of known results. The low-rank assumption is explicit but does not make the recovery tautological; the reported FID gains are empirical. This is the common case of a non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies insufficient detail to enumerate specific free parameters or axioms; the low-rank subspace restriction appears to be the central modeling choice.

pith-pipeline@v0.9.0 · 5521 in / 1137 out tokens · 31370 ms · 2026-05-14T19:21:49.954041+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 9 internal anchors

[1]

Building normalizing flows with stochastic interpolants

Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InICLR, 2023

work page 2023
[2]

Latent forcing: Reordering the diffusion trajectory for pixel-space image generation

Alan Baade, Eric Ryan Chan, Kyle Sargent, Changan Chen, Justin Johnson, Ehsan Adeli, and Li Fei-Fei. Latent forcing: Reordering the diffusion trajectory for pixel-space image generation. arXiv preprint arXiv:2602.11401, 2026

work page arXiv 2026
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

All are worth words: A ViT backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT backbone for diffusion models. InCVPR, 2023

work page 2023
[5]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

work page 2024
[6]

Flux.2: Frontier visual intelligence

Black Forest Labs. Flux.2: Frontier visual intelligence. https://bfl.ai/blog/flux-2, 2025

work page 2025
[7]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/research/video-generation-models- as-world-simulators, 2024

work page 2024
[8]

PixArt- Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt- Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InECCV, page 74–91, Berlin, Heidelberg, 2024. Springer-Verlag. ISBN 978-3-031-73410-6. doi: 10.1007/978-3-031-73411-3_5. URL https: //doi.org...

work page doi:10.1007/978-3-031-73411-3_5 2024
[9]

Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963, 2025

Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963, 2025

work page arXiv 2025
[10]

Dip: Taming diffusion models in pixel space

Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. Dip: Taming diffusion models in pixel space. InCVPR, 2026

work page 2026
[11]

Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers

Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. InICML, 2024

work page 2024
[12]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. InCVPR, pages 248–255, 2009. doi: 10.1109/CVPR.2009. 5206848

work page doi:10.1109/cvpr.2009 2009
[13]

8-bit optimizers via block-wise quantization

Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. InICLR, 2022

work page 2022
[14]

Diffusion models beat GANs on image synthesis

Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In A. Beygelzimer, Y . Dauphin, P. Liang, and J. Wortman Vaughan, editors,NeurIPS,

work page
[15]

URLhttps://openreview.net/forum?id=AAWuCvzaVt

work page
[16]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InICML, 2024

work page 2024
[17]

Geneval: an object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: an object-focused framework for evaluating text-to-image alignment. InNeurIPS, Red Hook, NY , USA, 2023. Curran Associates Inc. 10

work page 2023
[18]

Matryoshka diffusion models

Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. Matryoshka diffusion models. InICLR, 2023

work page 2023
[19]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richard- son, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffu- sion.arXiv preprint arXiv:2501.00103, 2024. URL https://arxiv.org/abs/2501.00103

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InNeurIPS, 2017

work page 2017
[21]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. InNeurIPS Workshop, 2021

work page 2021
[22]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020

work page 2020
[23]

Simple diffusion: End-to-end diffusion for high resolution images

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. InICML, pages 13213–13232, 2023

work page 2023
[24]

Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion

Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion. InCVPR, 2025

work page 2025
[25]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. URLhttps://openreview.net/forum?id=nZeVKeeFYf9

work page 2022
[26]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024. URLhttps://arxiv.org/abs/2403.05135

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Scalable adaptive computation for iterative generation

Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. InICML, 2023

work page 2023
[28]

Revisiting diffusion model predictions through dimensionality

Qing Jin and Chaoyang Wang. Revisiting diffusion model predictions through dimensionality. arXiv preprint arXiv:2601.21419, 2026

work page arXiv 2026
[29]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InNeurIPS, 2022

work page 2022
[30]

Analyzing and improving the training dynamics of diffusion models

Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. InCVPR, 2024

work page 2024
[31]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InICLR, 2014

work page 2014
[32]

Understanding diffusion objectives as the ELBO with simple data augmentation

Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. InNeurIPS, 2023. URL https://openreview.net/forum?id= NnMEadcdyD

work page 2023
[33]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Applying guidance in a limited interval improves sample and distribution quality in diffusion models

Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. InNeurIPS, 2024. 11

work page 2024
[35]

There is no V AE: End-to-end pixel-space generative modeling via self-supervised pre-training

Jiachen Lei, Keli Liu, Julius Berner, Y HoiM, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. There is no V AE: End-to-end pixel-space generative modeling via self-supervised pre-training. InICLR, 2026. URLhttps://openreview.net/forum?id=HbUoKPIZmp

work page 2026
[36]

Back to basics: Let denoising generative models denoise

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. In CVPR, 2026

work page 2026
[37]

net/forum?id=POWv6hDd9XH

Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...

work page arXiv 2024
[38]

Sdxl- lightning: Progressive adversarial diffusion distillation

Shanchuan Lin, Anran Wang, and Xiao Yang. Sdxl-lightning: Progressive adversarial diffusion distillation.arXiv preprint arXiv:2402.13929, 2024. URL https://arxiv.org/abs/2402. 13929

work page arXiv 2024
[39]

Lawrence Zitnick

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors,ECCV, pages 740–755, Cham,

work page
[40]

ISBN 978-3-319-10602-1

Springer International Publishing. ISBN 978-3-319-10602-1

work page
[41]

Evaluating text-to-visual generation with image-to-text generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In ECCV, 2024

work page 2024
[42]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InICLR, 2023. URL https://openreview.net/forum? id=PqvMRDCJT9t

work page 2023
[43]

Rectified flow: A marginal preserving approach to optimal transport.ArXiv, abs/2209.14577, 2022

Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport.arXiv preprint arXiv:2209.14577, 2022

work page arXiv 2022
[44]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023. URL https://openreview.net/forum? id=XVjTT1nw5z

work page 2023
[45]

Albergo, Nicholas M

Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InECCV, 2024

work page 2024
[46]

Hpsv3: Towards wide-spectrum human preference score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. InICCV, 2025

work page 2025
[47]

Deco: Frequency- decoupled pixel diffusion for end-to-end image generation

Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, and Qi Tian. Deco: Frequency- decoupled pixel diffusion for end-to-end image generation. InCVPR, 2026

work page 2026
[48]

PixelGen: Improving Pixel Diffusion with Perceptual Supervision

Zehong Ma, Ruihan Xu, and Shiliang Zhang. Pixelgen: Pixel diffusion beats latent diffusion with perceptual loss.arXiv preprint arXiv:2602.02493, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[49]

A perceptual color space for image processing, 2020

Björn Ottosson. A perceptual color space for image processing, 2020. URL https:// bottosson.github.io/posts/oklab/

work page 2020
[50]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023

work page 2023
[51]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. InICLR, 2024. URL https://openreview.net/forum?id=di52zR8xgf. 12

work page 2024
[52]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763, 2021

work page 2021
[53]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022

work page 2022
[54]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Inter- vention (MICCAI), pages 234–241, 2015

work page 2015
[55]

Seyedmorteza Sadat, Otmar Hilliges, and Romann M. Weber. Eliminating oversaturation and artifacts of high guidance scales in diffusion models. InICLR, 2025

work page 2025
[56]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. InNeurIPS, 2022

work page 2022
[57]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. InICLR, 2022

work page 2022
[58]

LAION-5b: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wight- man, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text m...

work page 2022
[59]

Schönemann

Peter H. Schönemann. A generalized solution of the orthogonal procrustes problem.Psychome- trika, 31(1):1–10, 1966. doi: 10.1007/BF02289451

work page doi:10.1007/bf02289451 1966
[60]

Representation alignment for just image transformers is not easier than you think.arXiv preprint arXiv:2603.14366, 2026

Jaeyo Shin, Jiwook Kim, and Hyunjung Shim. Representation alignment for just image transformers is not easier than you think.arXiv preprint arXiv:2603.14366, 2026

work page arXiv 2026
[61]

Deep unsuper- vised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. InICML, pages 2256–2265, 2015

work page 2015
[62]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. InNeurIPS, 2019

work page 2019
[63]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. InICLR, 2021

work page 2021
[64]

arXiv preprint arXiv:2601.16208 (2026),https://arxiv.org/abs/2601.16208

Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, and Saining Xie. Scaling text-to-image diffusion transformers with representation autoencoders.arXiv preprint arXiv:2601.16208, 2026

work page arXiv 2026
[65]

Wan: Open and Advanced Large-Scale Video Generative Models

Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wa...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[66]

Pixnerd: Pixel neural field diffusion

Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. InICLR, 2026. URLhttps://openreview.net/forum?id=BDnOrExHmt. 13

work page 2026
[67]

Ddt: Decoupled diffusion transformer

Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. InCVPR, 2026

work page 2026
[69]

URLhttps://arxiv.org/abs/2508.02324

work page internal anchor Pith review Pith/arXiv arXiv
[70]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023. URL https://arxiv.org/ abs/2306.09341

work page internal anchor Pith review Pith/arXiv arXiv 2023
[71]

Jaakkola

Yilun Xu, Shangyuan Tong, and Tommi S. Jaakkola. Stable target field for reduced variance score estimation in diffusion models. InICLR, 2023. URL https://openreview.net/ forum?id=WmIwYTd0YTF

work page 2023
[72]

Reconstruction vs

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimiza- tion dilemma in latent diffusion models. InCVPR, 2025

work page 2025
[73]

Representation alignment for generation: Training diffusion transformers is easier than you think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. InICLR, 2025

work page 2025
[74]

Pixeldit: Pixel diffusion transformers for image generation

Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. Pixeldit: Pixel diffusion transformers for image generation. InCVPR, 2026

work page 2026
[75]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, and Shilin Zhou. Z-image: An efficient image generation foundation model with single-stream diffusion transformer.ar...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[76]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

work page 2018
[77]

Unipc: A unified predictor- corrector framework for fast sampling of diffusion models

Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor- corrector framework for fast sampling of diffusion models. InNeurIPS, 2023

work page 2023
[78]

starwars

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. InICLR, 2026. URL https://openreview.net/forum?id= 0u1LigJaab. 14 A Method Details A.1 Low-Rank Subspace Construction For transformer-based pixel generation, AsymFlow requires a patch-wise low-rank subspace. We use two constructions, depending...

work page 2026