Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
Pith reviewed 2026-05-22 09:30 UTC · model grok-4.3
The pith
A 3.8 billion parameter text-to-image model matches or exceeds larger models while using only about one-fifth the training compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lens, a 3.8B-parameter model, achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across benchmarks, while requiring only about 19.3% of the training compute used by Z-Image. The paper attributes this to two things: maximizing data information density (an 800M-image dataset with GPT-4.1 captions averaging about 109 words, plus mixed-resolution batches) and architectural decisions, including a semantic VAE and a strong language encoder.
What carries the argument
The Lens-800M dataset of densely captioned image-text pairs, combined with batch construction that mixes resolutions and aspect ratios, and supported by a semantic VAE (better latents) and a strong language encoder (faster optimization).
If this is right
- The model generalizes without retraining to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440 squared.
- English-only training data still enables prompt understanding in several other commonly used languages.
- Post-training RL with taxonomy-driven prompts suppresses artifacts, and a reasoner module with training-free system-prompt search improves alignment with user requests.
- Distillation yields a turbo version that produces a 1024 squared image in four steps in 0.84 seconds on a single NVIDIA H100 GPU.
Where Pith is reading between the lines
- The same data-density approach could be tested on video or 3D generation where compute demands grow even faster.
- Running the mixed-batch strategy on existing larger backbones would isolate whether the gains come mainly from data quality rather than model size.
- Exploring adaptive caption length or automatic resolution sampling during training could further reduce the compute needed for a target quality level.
Load-bearing premise
That GPT-4.1-generated captions averaging 109 words supply meaningfully richer semantic supervision than short captions, and that mixing resolutions and aspect ratios per batch enlarges visual coverage without introducing new biases or instabilities.
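The batch-construction half of this premise can be made concrete with a small sketch. The resolution set {512², 768², 1024²} is the one quoted from the paper further down this page; everything else here is an assumption made for illustration: the aspect-ratio list, the 16-pixel token stride, the token budget, and the idea of packing images until a token budget is reached. None of it is taken from the paper's actual sampler.

```python
import random

# All constants below are illustrative assumptions, not values from the paper,
# except the three base side lengths, which appear in the quoted paper passage.
BASE_RESOLUTIONS = (512, 768, 1024)
ASPECT_RATIOS = (0.5, 0.75, 1.0, 4 / 3, 2.0)  # hypothetical width/height ratios
PIXELS_PER_TOKEN_SIDE = 16                    # hypothetical VAE stride x patch size
TOKEN_BUDGET = 16384                          # hypothetical per-batch token budget


def image_shape(base, ratio, multiple=PIXELS_PER_TOKEN_SIDE):
    """Return (width, height) with about base*base pixels and width/height near ratio."""
    def snap(value):
        # Round to the nearest multiple of the token stride, never below one token.
        return max(multiple, round(value / multiple) * multiple)

    return snap(base * ratio ** 0.5), snap(base / ratio ** 0.5)


def token_count(width, height, stride=PIXELS_PER_TOKEN_SIDE):
    """Number of latent tokens an image of this size would occupy."""
    return (width // stride) * (height // stride)


def build_mixed_batch(rng, budget=TOKEN_BUDGET):
    """Add randomly shaped images until the next one would exceed the token budget."""
    batch, used = [], 0
    while True:
        shape = image_shape(rng.choice(BASE_RESOLUTIONS), rng.choice(ASPECT_RATIOS))
        cost = token_count(*shape)
        if used + cost > budget:
            break
        batch.append(shape)
        used += cost
    return batch, used


if __name__ == "__main__":
    shapes, tokens = build_mixed_batch(random.Random(0))
    print(len(shapes), "images,", tokens, "tokens")
```

Packing by token budget keeps the cost of each optimization step roughly constant while the shape mix varies, which is one way to make a mixed-versus-fixed comparison compute-matched. Whether Lens does anything similar is not stated on this page.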
What would settle it
A control experiment training an otherwise identical model on short captions and fixed-resolution batches that reaches comparable benchmark scores using similar or greater total compute would falsify the central efficiency claim.
Original abstract
We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply RL with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Lens, a 3.8B-parameter text-to-image model trained on the Lens-800M dataset of 800M image-text pairs with GPT-4.1-generated captions averaging 109 words. It claims competitive or superior performance to state-of-the-art models exceeding 6B parameters across benchmarks, while using only 19.3% of the training compute of Z-Image. Efficiency is attributed to dense semantic captions, multi-resolution and multi-aspect-ratio batch construction per step, a semantic VAE, a strong language encoder enabling multilingual generalization from English-only data, RL fine-tuning with taxonomy-driven prompts (Lens-RL-8K) and structured rewards, a reasoner module, and distillation for 4-step inference. The model supports arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440², with inference times of 3.15s for 1024² on H100 and 0.84s for the turbo variant.
Significance. If the performance and efficiency claims are substantiated with proper controls, this would be a meaningful contribution to efficient training of foundational T2I models. Demonstrating that a compact 3.8B model can match or exceed larger counterparts through data density and architectural choices could influence future scaling strategies and lower barriers to high-quality generation. The multilingual capability from English-only training and native support for diverse aspect ratios are notable practical strengths.
major comments (2)
- [§4.2] (Batch Construction and Training Procedure): The compute claim (about 19.3% of Z-Image's training compute) and the competitive performance rest on the premise that packing multiple resolutions and aspect ratios into each batch enlarges effective visual coverage without new instabilities or biases. The manuscript describes the batch construction but supplies neither an ablation comparing mixed- vs. fixed-resolution training under matched compute nor diagnostics (gradient norm histograms, loss spike frequency, or convergence curves) confirming stability. This premise is load-bearing for the central efficiency claim.
- [Evaluation section and Table 1]: The abstract asserts competitive or superior benchmark results and a precise compute figure (19.3% of Z-Image's training compute), yet the provided manuscript text supplies no numerical scores, baseline comparisons, error bars, or evaluation protocol details (e.g., exact metrics, number of samples, or statistical significance). Without these, the claim that Lens surpasses models with >6B parameters cannot be verified and may rest on post-hoc choices.
minor comments (2)
- [§3.3] The description of the semantic VAE and language encoder would be clearer with a brief comparison table against a standard VAE and CLIP-style encoders, showing the size of the convergence-speed gains.
- Dataset details for Lens-800M and Lens-RL-8K (e.g., exact filtering criteria and prompt taxonomy) are referenced but could include a short appendix table for reproducibility.
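The first major comment asks for stability diagnostics such as loss spike frequency. As a rough illustration of what such a statistic could look like, the sketch below counts the fraction of steps whose loss exceeds a multiple of a running exponential moving average. The definition of a "spike", the decay, the 1.5x threshold, and the warmup length are all assumptions chosen for the example; the manuscript does not define them.

```python
def loss_spike_frequency(losses, ema_decay=0.99, spike_ratio=1.5, warmup=10):
    """Fraction of post-warmup steps whose loss exceeds spike_ratio times the running EMA.

    All thresholds are hypothetical defaults for illustration.
    """
    if not losses:
        return 0.0
    ema = losses[0]
    spikes = 0
    counted = 0
    for step, loss in enumerate(losses[1:], start=1):
        if step >= warmup:
            counted += 1
            # Compare against the EMA of earlier steps, before this loss is folded in.
            if loss > spike_ratio * ema:
                spikes += 1
        ema = ema_decay * ema + (1.0 - ema_decay) * loss
    return spikes / counted if counted else 0.0


if __name__ == "__main__":
    smooth = [1.0] * 100
    spiky = list(smooth)
    spiky[50] = 5.0  # one injected spike
    print(loss_spike_frequency(smooth), loss_spike_frequency(spiky))
```

Reporting this number for a mixed-resolution run and a fixed-resolution run trained under the same compute would give one comparable stability figure, alongside the gradient-norm histograms and convergence curves the referee also requests.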
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the substantiation of our efficiency and performance claims without altering the core contributions.
Point-by-point responses
- Referee: [§4.2] (Batch Construction and Training Procedure): The compute claim (about 19.3% of Z-Image's training compute) and the competitive performance rest on the premise that packing multiple resolutions and aspect ratios into each batch enlarges effective visual coverage without new instabilities or biases. The manuscript describes the batch construction but supplies neither an ablation comparing mixed- vs. fixed-resolution training under matched compute nor diagnostics (gradient norm histograms, loss spike frequency, or convergence curves) confirming stability. This premise is load-bearing for the central efficiency claim.
  Authors: We agree that an explicit ablation and stability diagnostics would provide stronger support for the batch construction strategy. In the revised manuscript we will add a controlled comparison of mixed-resolution/multi-aspect-ratio batching versus fixed-resolution training under matched total compute (same number of optimization steps and equivalent FLOPs), reporting final benchmark performance and convergence behavior. We will also include gradient-norm histograms, loss curves across training, and statistics on loss-spike frequency to confirm that the mixed-batch regime introduces no additional instabilities. The reported 19.3% compute figure is obtained by summing actual per-step FLOPs over the mixed batches used; we will expand Section 4.2 to show this calculation explicitly.
  Revision: yes
- Referee: [Evaluation section and Table 1]: The abstract asserts competitive or superior benchmark results and a precise compute figure (19.3% of Z-Image's training compute), yet the provided manuscript text supplies no numerical scores, baseline comparisons, error bars, or evaluation protocol details (e.g., exact metrics, number of samples, or statistical significance). Without these, the claim that Lens surpasses models with >6B parameters cannot be verified and may rest on post-hoc choices.
  Authors: We apologize if the numerical results and protocol details were insufficiently highlighted in the main text. Table 1 already reports concrete benchmark scores (FID, CLIP similarity, human preference rates) for Lens against >6B-parameter baselines including Z-Image, together with the evaluation protocol. To eliminate any ambiguity we will (i) embed the key numerical scores and direct baseline comparisons into the Evaluation section, (ii) add error bars from repeated runs where available, and (iii) expand the protocol description to specify exact metrics, number of samples per benchmark, and any statistical significance tests. All evaluation choices were fixed prior to final training and are documented in the supplementary material; we will make this explicit in the revision.
  Revision: partial
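The first response above says the 19.3% figure comes from summing per-step FLOPs over the mixed batches. A minimal sketch of that kind of accounting follows. It assumes the common dense-transformer approximation of about 6 x parameters x tokens FLOPs per training step, a hypothetical 16-pixel token stride, and it ignores text tokens, attention's quadratic term, and the VAE and text encoder. The 3.8B parameter count is from the paper; the example batches are invented inputs, not the paper's data, and the resulting ratio is not the paper's number.

```python
PARAMS = 3.8e9  # Lens parameter count as stated in the paper


def step_flops(batch_shapes, params=PARAMS, stride=16):
    """Approximate training FLOPs for one step as 6 * params * image tokens.

    The 6*N*D rule and the stride are assumptions used for illustration.
    """
    tokens = sum((w // stride) * (h // stride) for w, h in batch_shapes)
    return 6.0 * params * tokens


def total_flops(steps):
    """Sum the per-step estimates over a whole run (a list of batches)."""
    return sum(step_flops(batch) for batch in steps)


if __name__ == "__main__":
    # Two toy runs with the same number of steps and images per step.
    mixed_run = [[(512, 512), (1024, 1024)], [(768, 768)]]
    fixed_run = [[(1024, 1024), (1024, 1024)], [(1024, 1024)]]
    ratio = total_flops(mixed_run) / total_flops(fixed_run)
    print(f"mixed / fixed compute ratio: {ratio:.3f}")
```

Because the estimate is linear in token count, a cross-model ratio of this kind is only meaningful if both models are counted with the same convention, which is one reason the referee's request to show the calculation explicitly matters.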
Circularity Check
No circularity: empirical efficiency claims rest on training runs and external benchmarks
The paper's central claims concern measured training compute and benchmark performance for a 3.8B model trained on Lens-800M captions plus mixed-resolution batches, followed by RL and distillation stages. These are presented as outcomes of concrete training procedures and comparisons to external models such as Z-Image, not as derivations or predictions that reduce by construction to fitted constants, self-defined quantities, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked whose validity depends on the target result itself. The absence of ablations for the mixed-batch strategy is a question of evidence strength rather than circularity.
Axiom & Free-Parameter Ledger
free parameters (2)
- RL reward rubric weights and taxonomy prompts
- multi-resolution batch sampling ratios
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "We increase image-side information density by constructing each training batch from images with multiple resolutions (i.e., {512², 768², 1024²}) and diverse aspect ratios ... thereby enlarging the effective visual coverage of each optimization step."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "Lens requires only about 19.3% of the training compute used by Z-Image."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.