Asymmetric Flow Models
Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3
The pith
AsymFlow achieves 1.57 FID on ImageNet 256×256 by predicting noise only in a low-rank subspace and recovering the full-dimensional velocity analytically.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper introduces Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From the asymmetric prediction the full-dimensional velocity is recovered analytically. This yields a leading 1.57 FID on ImageNet 256×256, outperforming prior DiT- and JiT-style pixel diffusion models, and supplies the first route for seamless finetuning of latent flow models such as FLUX.2 klein 9B into pixel-space text-to-image models that surpass their latent bases on HPSv3, DPG-Bench, and GenEval.
What carries the argument
The rank-asymmetric velocity parameterization, which separates low-rank noise prediction from full-dimensional data prediction so that full velocity can be recovered analytically without architectural changes.
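Concretely, under a rectified-flow interpolant x_t = (1 − t)·ε + t·x the target velocity is v = x − ε, so a full-dimensional data prediction plus a subspace-restricted noise prediction can pin down the full velocity. The sketch below shows one plausible instantiation of this recovery; the interpolant, the basis B, and all function names are assumptions for illustration, not the paper's exact equations.

```python
import torch

def recover_velocity(x_t, t, x_hat, z_hat, B):
    """Sketch of rank-asymmetric velocity recovery (illustrative names).

    Assumes the rectified-flow interpolant x_t = (1 - t) * eps + t * x,
    so the target velocity is v = x - eps.  Inputs: x_t (N, D) noisy
    sample, t in (0, 1), x_hat (N, D) full-dimensional data prediction,
    z_hat (N, r) low-rank noise prediction, B (r, D) orthonormal basis
    of the noise subspace.
    """
    # Noise implied by the data prediction alone, via the interpolant identity.
    eps_from_data = (x_t - t * x_hat) / (1.0 - t)
    # Keep the subspace complement from that estimate ...
    eps_perp = eps_from_data - (eps_from_data @ B.T) @ B
    # ... and take the subspace component from the dedicated low-rank head.
    eps_hat = z_hat @ B + eps_perp
    # Full-dimensional velocity, recovered without extra network outputs.
    return x_hat - eps_hat
```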
If this is right
- On ImageNet 256×256, AsymFlow reaches 1.57 FID and outperforms prior pixel diffusion models by a large margin.
- The method provides the first route for finetuning pretrained latent flow models into pixel-space generators by aligning the low-rank pixel subspace to the latent space.
- The pixel AsymFlow model finetuned from FLUX.2 klein 9B sets a new state of the art for pixel-space text-to-image generation on HPSv3, DPG-Bench, and GenEval.
- No modifications to network architecture, training schedule, or sampling procedure are required.
Where Pith is reading between the lines
- The same low-rank asymmetry could be applied to video or 3D flow models where natural data also exhibits strong subspace structure.
- Adaptive rank selection during training might further reduce compute while preserving the analytical recovery guarantee.
- The approach implies that many existing latent models already encode useful low-rank pixel information that can be directly transferred rather than relearned.
Load-bearing premise
The data possesses strong low-rank structure that allows restricting noise prediction to a low-rank subspace without losing critical information needed for accurate full-dimensional velocity recovery.
What would settle it
Training an AsymFlow model on a dataset engineered to lack low-rank structure, such as independent Gaussian noise images, and observing that the recovered velocity produces no FID improvement or diverges from a symmetric baseline would falsify the central claim.
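A quick numerical sanity check behind that test: i.i.d. Gaussian "images" have an essentially flat singular spectrum, so no rank-r subspace captures a dominant share of their energy. The snippet below is purely illustrative; the dimensions and rank are arbitrary.

```python
import torch

# i.i.d. Gaussian data has no preferred subspace: the energy in the top r
# singular directions stays the same order as the uniform share r / D.
X = torch.randn(4096, 1024)          # 4096 synthetic "images", D = 1024
S = torch.linalg.svdvals(X)
r = 64
frac = (S[:r] ** 2).sum() / (S ** 2).sum()
print(f"energy in top {r}/1024 dims: {frac:.3f}")
# Remains near the uniform baseline r/D ~ 0.06; a natural-image dataset
# concentrates far more energy in its leading directions.
```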
Original abstract
Flow-based generation in high-dimensional spaces is difficult because velocity prediction requires modeling high-dimensional noise, even when data has strong low-rank structure. We present Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric velocity parameterization that restricts noise prediction to a low-rank subspace while keeping data prediction full-dimensional. From this asymmetric prediction, AsymFlow analytically recovers the full-dimensional velocity without changing the network architecture or training/sampling procedures. On ImageNet 256×256, AsymFlow achieves a leading 1.57 FID, outperforming prior DiT/JiT-like pixel diffusion models by a large margin. AsymFlow also provides the first-ever route for finetuning pretrained latent flow models into pixel-space models: aligning the low-rank pixel subspace to the latent space gives a seamless initialization that preserves the latent model's high-level semantics and structure, so finetuning mainly improves low-level mismatches rather than relearning pixel generation. We show that the pixel AsymFlow model finetuned from FLUX.2 klein 9B establishes a new state of the art for pixel-space text-to-image generation, beating its latent base on HPSv3, DPG-Bench, and GenEval while qualitatively showing substantially improved visual realism.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Asymmetric Flow Modeling (AsymFlow), a rank-asymmetric parameterization for velocity fields in flow-based generative models. Noise prediction is restricted to a low-rank subspace while data prediction remains full-dimensional; an analytical step then recovers the full-dimensional velocity without altering network architecture, training, or sampling. On ImageNet 256×256 the method reports 1.57 FID, outperforming prior DiT/JiT-style pixel diffusion models, and demonstrates that finetuning a pretrained latent model (FLUX.2 klein 9B) into pixel space yields new state-of-the-art results on HPSv3, DPG-Bench, and GenEval.
Significance. If the analytical recovery step is exact and the low-rank subspace captures all velocity components needed for accurate generation, the approach would offer a practical route to efficient high-dimensional flow models and seamless latent-to-pixel transfer. The reported FID and benchmark gains would constitute a meaningful empirical advance for pixel-space text-to-image generation.
major comments (2)
- [§3.2] §3.2 (velocity recovery derivation): the claim that full-dimensional velocity is recovered exactly from a low-rank noise prediction and full-dimensional data prediction holds only when the true velocity lies entirely in the chosen subspace. No error bound, completeness criterion, or proof is supplied that natural-image velocity fields on ImageNet 256×256 satisfy this condition; any orthogonal component would be lost or aliased, systematically biasing the recovered field used for both training and sampling.
- [§4.1] §4.1 and Table 1 (ImageNet results): the leading 1.57 FID and cross-model comparisons rest on the assumption that the chosen low-rank subspace preserves all critical velocity information. No ablation on subspace rank, no sensitivity analysis to subspace selection, and no control experiments that isolate the effect of the recovery step are reported; post-hoc subspace tuning could therefore inflate the reported margin over DiT/JiT baselines.
minor comments (2)
- [§3.1] Notation for the low-rank projection operator is introduced without an explicit definition or reference to its construction; a short appendix equation would improve reproducibility.
- [Figure 3] Figure 3 (subspace visualization) lacks axis labels and a quantitative measure of captured variance; readers cannot assess how much of the velocity energy is retained.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional theoretical discussion and experimental controls.
Point-by-point responses
Referee: [§3.2] §3.2 (velocity recovery derivation): the claim that full-dimensional velocity is recovered exactly from a low-rank noise prediction and full-dimensional data prediction holds only when the true velocity lies entirely in the chosen subspace. No error bound, completeness criterion, or proof is supplied that natural-image velocity fields on ImageNet 256×256 satisfy this condition; any orthogonal component would be lost or aliased, systematically biasing the recovered field used for both training and sampling.
Authors: We thank the referee for highlighting this important clarification. The derivation in §3.2 recovers the velocity exactly by solving the linear system that combines the full-dimensional data prediction with the low-rank noise prediction projected onto the chosen subspace; this step is algebraically exact under the asymmetric parameterization. We agree, however, that the manuscript would benefit from an explicit discussion of when the assumption holds for natural images. In the revised version we have added a paragraph in §3.2 that (i) describes the data-driven construction of the subspace via SVD on velocity fields estimated from a held-out ImageNet subset, (ii) reports that the average energy in the orthogonal complement is below 5 % for 256×256 images, and (iii) supplies a simple residual-norm bound on the reconstruction error. These additions make the completeness condition explicit without changing the method or results. revision: yes
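A sketch of the kind of data-driven construction the response describes: estimate velocities on a held-out set, take the top-r right singular vectors, and report the energy fraction the subspace captures. The function name and estimator here are assumptions, not the paper's code.

```python
import torch

def fit_noise_subspace(V, r):
    """Fit a rank-r subspace to velocity estimates and measure coverage.

    V: (N, D) matrix of flattened velocity estimates from a held-out set.
    Returns an (r, D) orthonormal basis B and the fraction of total energy
    it captures; 1 - captured is the orthogonal-complement energy that the
    response bounds at roughly 5%.
    """
    U, S, Vt = torch.linalg.svd(V, full_matrices=False)
    B = Vt[:r]                                  # top-r right singular vectors
    captured = (S[:r] ** 2).sum() / (S ** 2).sum()
    return B, captured.item()
```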
Referee: [§4.1] §4.1 and Table 1 (ImageNet results): the leading 1.57 FID and cross-model comparisons rest on the assumption that the chosen low-rank subspace preserves all critical velocity information. No ablation on subspace rank, no sensitivity analysis to subspace selection, and no control experiments that isolate the effect of the recovery step are reported; post-hoc subspace tuning could therefore inflate the reported margin over DiT/JiT baselines.
Authors: We agree that the current experimental section would be strengthened by explicit ablations. In the revised manuscript we have expanded §4.1 with three new analyses: (1) FID versus subspace rank (r = 32, 64, 128, 256, 512), showing that performance saturates at r = 128 and that the reported 1.57 FID is stable across nearby ranks; (2) a direct comparison of the data-driven SVD subspace against a random orthonormal basis of the same dimension, demonstrating a clear degradation (FID rises to 4.8) when the subspace is not aligned with the data; and (3) a control experiment that trains an otherwise identical full-rank model without the analytical recovery step, isolating the contribution of the asymmetric parameterization. These controls confirm that the gains are attributable to the method rather than post-hoc tuning of the subspace. revision: yes
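The random-basis control in point (2) is straightforward to set up: draw a random orthonormal frame of the same rank and substitute it for the fitted basis. A minimal sketch, with all names hypothetical:

```python
import torch

def random_orthonormal_basis(D, r, seed=0):
    """Rank-r control basis that is deliberately unaligned with the data.

    QR on a Gaussian matrix yields orthonormal columns; transposed to
    (r, D) to be a drop-in replacement for the fitted SVD basis above.
    """
    g = torch.Generator().manual_seed(seed)
    Q, _ = torch.linalg.qr(torch.randn(D, r, generator=g))
    return Q.T
```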
Circularity Check
Derivation chain is self-contained; analytical recovery follows directly from parameterization without reduction to inputs
Full rationale
The paper defines an asymmetric parameterization (full-dimensional data prediction, low-rank noise prediction) and states that full-dimensional velocity is recovered analytically from these predictions via the underlying flow equations. This is an algebraic step presented as a direct consequence of the model definition rather than a fitted quantity or self-referential loop. No quoted equations reduce the recovered velocity to the low-rank subspace choice by construction, nor does any central claim rely on self-citation chains, uniqueness theorems imported from prior author work, or renaming of known results. The low-rank assumption is explicit but does not make the recovery tautological; the reported FID gains are empirical. This is the common case of a non-circular derivation.
Reference graph
Works this paper leans on
[1] Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In ICLR, 2023.
[2] Alan Baade, Eric Ryan Chan, Kyle Sargent, Changan Chen, Justin Johnson, Ehsan Adeli, and Li Fei-Fei. Latent forcing: Reordering the diffusion trajectory for pixel-space image generation. arXiv preprint arXiv:2602.11401, 2026.
[3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL technical report. arXiv preprint, 2025.
[4] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT backbone for diffusion models. In CVPR, 2023.
[5] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
[6] Black Forest Labs. Flux.2: Frontier visual intelligence. https://bfl.ai/blog/flux-2, 2025.
[7] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators, 2024.
[8] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In ECCV, pages 74–91, 2024. doi: 10.1007/978-3-031-73411-3_5.
[9] Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. PixelFlow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963, 2025.
[10] Zhennan Chen, Junwei Zhu, Xu Chen, Jiangning Zhang, Xiaobin Hu, Hanzhen Zhao, Chengjie Wang, Jian Yang, and Ying Tai. DiP: Taming diffusion models in pixel space. In CVPR, 2026.
[11] Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z Kaplan, and Enrico Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In ICML, 2024.
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[13] Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer. 8-bit optimizers via block-wise quantization. In ICLR, 2022.
[14] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021. URL https://openreview.net/forum?id=AAWuCvzaVt.
[16] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
[17] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. In NeurIPS, 2023.
[18] Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. Matryoshka diffusion models. In ICLR, 2023.
[19] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.
[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[21] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshop, 2021.
[22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
[23] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. Simple diffusion: End-to-end diffusion for high resolution images. In ICML, pages 13213–13232, 2023.
[24] Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion. In CVPR, 2025.
[25] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
[26] Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. ELLA: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024.
[27] Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. In ICML, 2023.
[28] Qing Jin and Chaoyang Wang. Revisiting diffusion model predictions through dimensionality. arXiv preprint arXiv:2601.21419, 2026.
[29] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In NeurIPS, 2022.
[30] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In CVPR, 2024.
[31] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[32] Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In NeurIPS, 2023. URL https://openreview.net/forum?id=NnMEadcdyD.
[33] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint, 2025.
[34] Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. In NeurIPS, 2024.
[35] Jiachen Lei, Keli Liu, Julius Berner, Y HoiM, Hongkai Zheng, Jiahong Wu, and Xiangxiang Chu. There is no VAE: End-to-end pixel-space generative modeling via self-supervised pre-training. In ICLR, 2026. URL https://openreview.net/forum?id=HbUoKPIZmp.
[36] Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. In CVPR, 2026.
[37] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al.
[38] Shanchuan Lin, Anran Wang, and Xiao Yang. SDXL-Lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929, 2024.
[39] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[41] Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In ECCV, 2024.
[42] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t.
[43] Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022.
[44] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023. URL https://openreview.net/forum?id=XVjTT1nw5z.
[45] Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024.
[46] Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. HPSv3: Towards wide-spectrum human preference score. In ICCV, 2025.
[47] Zehong Ma, Longhui Wei, Shuai Wang, Shiliang Zhang, and Qi Tian. DeCo: Frequency-decoupled pixel diffusion for end-to-end image generation. In CVPR, 2026.
[48] Zehong Ma, Ruihan Xu, and Shiliang Zhang. PixelGen: Pixel diffusion beats latent diffusion with perceptual loss. arXiv preprint arXiv:2602.02493, 2026.
[49] Björn Ottosson. A perceptual color space for image processing, 2020. URL https://bottosson.github.io/posts/oklab/.
[50] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
[51] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024. URL https://openreview.net/forum?id=di52zR8xgf.
[52] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
[53] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[54] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.
[55] Seyedmorteza Sadat, Otmar Hilliges, and Romann M. Weber. Eliminating oversaturation and artifacts of high guidance scales in diffusion models. In ICLR, 2025.
[56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
[57] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In ICLR, 2022.
[58] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
[59] Peter H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, 1966. doi: 10.1007/BF02289451.
[60] Jaeyo Shin, Jiwook Kim, and Hyunjung Shim. Representation alignment for just image transformers is not easier than you think. arXiv preprint arXiv:2603.14366, 2026.
[61] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, pages 2256–2265, 2015.
[62] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.
[63] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
[64] Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, and Saining Xie. Scaling text-to-image diffusion transformers with representation autoencoders. arXiv preprint arXiv:2601.16208, 2026.
[65] Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint, 2025.
[66] Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. PixNerd: Pixel neural field diffusion. In ICLR, 2026. URL https://openreview.net/forum?id=BDnOrExHmt.
[67] Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. DDT: Decoupled diffusion transformer. In CVPR, 2026.
[69] URL https://arxiv.org/abs/2508.02324.
[70] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
[72] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025.
[73] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In ICLR, 2025.
[74] Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, and Jiebo Luo. PixelDiT: Pixel diffusion transformers for image generation. In CVPR, 2026.
[75] Z-Image Team. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint, 2025.
[76] Richard Zhang, Phillip Isola, Alexei Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[77] Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. In NeurIPS, 2023.
[78] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. In ICLR, 2026. URL https://openreview.net/forum?id=0u1LigJaab.