arxiv: 2605.21573 · v1 · pith:JTT7Z4WVnew · submitted 2026-05-20 · 💻 cs.CV

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Pith reviewed 2026-05-22 09:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image generation · training efficiency · dense image captions · mixed resolution training · diffusion models · aspect ratio generalization · multilingual text-to-image
The pith

A 3.8 billion parameter text-to-image model matches or exceeds larger models while using only about one-fifth the training compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lens as evidence that foundational text-to-image models need not grow ever larger to improve performance. Instead, it shows that packing each training batch with denser information through long, detailed captions and varied image shapes and sizes can drive faster convergence and competitive results. Architectural choices like a semantic VAE and a capable language encoder further accelerate learning and enable generalization from limited language data. If correct, these choices point toward high-quality generation becoming feasible with far smaller parameter counts and compute budgets than current scaling trends assume.

Core claim

Lens, a 3.8B-parameter model, is competitive with, and in several cases surpasses, state-of-the-art models with more than 6B parameters across benchmarks, while requiring only about 19.3% of the training compute used by Z-Image. The paper attributes this to two things: maximizing data information density, via an 800M-image dataset with GPT-4.1 captions averaging about 109 words and mixed-resolution batches, and architectural decisions including a semantic VAE and a strong language encoder.
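
To make the headline ratio concrete: using about 19.3% of a baseline's compute is the same as a roughly 5.2x smaller training budget, or an 80.7% reduction. The short sketch below only restates that arithmetic. The function name is illustrative, and no absolute FLOP counts are assumed, since this page does not give them.

```python
def compute_savings(share: float) -> tuple[float, float]:
    """Convert a relative compute share into (reduction factor, fractional reduction)."""
    if not 0.0 < share <= 1.0:
        raise ValueError("share must be in (0, 1]")
    return 1.0 / share, 1.0 - share


# 0.193 is the share of Z-Image's training compute quoted in the abstract.
factor, reduction = compute_savings(0.193)
print(f"{factor:.2f}x less compute, {reduction:.1%} reduction")
```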

What carries the argument

Lens-800M dataset of densely captioned pairs combined with mixed-resolution and aspect-ratio batch construction, supported by a semantic VAE for better latents and a strong language encoder for faster optimization.

If this is right

  • The model generalizes without retraining to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440².
  • English-only training data still enables prompt understanding in several other commonly used languages.
  • Post-training RL with taxonomy-driven prompts and a training-free reasoner module suppresses artifacts and improves request alignment.
  • Distillation yields a turbo version that produces 1024² images in four steps in 0.84 seconds on a single NVIDIA H100 GPU.
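
To see what the aspect-ratio and resolution range above means in pixel terms, the sketch below converts an aspect ratio and a pixel budget into a width and height. It is illustrative only: the function name `dims_for_aspect` and the rounding to multiples of 16 are assumptions, not details taken from the paper.

```python
import math


def dims_for_aspect(ratio_w: int, ratio_h: int, side: int = 1024,
                    multiple: int = 16) -> tuple[int, int]:
    """Return (width, height) with roughly side*side pixels at the given aspect ratio.

    `multiple` is an assumed alignment constraint, not a value from the paper.
    """
    ratio = ratio_w / ratio_h
    if not 0.5 <= ratio <= 2.0:
        raise ValueError("aspect ratio outside the 1:2 to 2:1 range")
    area = side * side
    width = math.sqrt(area * ratio)
    height = width / ratio

    def snap(value: float) -> int:
        # Round to the nearest multiple, never below one multiple.
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for rw, rh in [(1, 2), (1, 1), (2, 1)]:
    print((rw, rh), dims_for_aspect(rw, rh, side=1440))
```

Keeping the pixel count roughly constant across aspect ratios is one plausible reading of "resolutions up to 1440²"; the paper may define the limit differently.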

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same data-density approach could be tested on video or 3D generation where compute demands grow even faster.
  • Running the mixed-batch strategy on existing larger backbones would isolate whether the gains come mainly from data quality rather than model size.
  • Exploring adaptive caption length or automatic resolution sampling during training could further reduce the compute needed for a target quality level.

Load-bearing premise

That GPT-4.1-generated captions averaging 109 words supply meaningfully richer semantic supervision than short captions, and that mixing resolutions and aspect ratios per batch enlarges visual coverage without introducing new biases or instabilities.
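
The mixed-resolution half of this premise can be pictured with a small bucketing sketch: images are assigned to the nearest shape bucket, and each batch draws from several buckets until a token budget is filled. This is a hedged illustration, not the paper's procedure. The bucket list, the 16-pixel patch size, the token budget, and all function names are assumptions.

```python
import math
import random
from collections import defaultdict

# Assumed shape buckets (width, height); the paper's actual set is not given here.
BUCKETS = [(512, 512), (1024, 1024), (720, 1456), (1456, 720)]
PATCH = 16  # assumed pixels per token side


def tokens(bucket):
    """Number of image tokens for one sample in this bucket."""
    width, height = bucket
    return (width // PATCH) * (height // PATCH)


def assign_bucket(width, height):
    """Pick the bucket closest in aspect ratio, breaking ties by pixel area (log space)."""
    ratio, area = math.log(width / height), math.log(width * height)
    return min(
        BUCKETS,
        key=lambda b: (round(abs(math.log(b[0] / b[1]) - ratio), 6),
                       abs(math.log(b[0] * b[1]) - area)),
    )


def mixed_batches(images, token_budget, seed=0):
    """Build batches that mix several buckets per step under a token budget.

    Each batch maps bucket -> list of image ids, so same-shape samples can be stacked.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for image_id, (width, height) in images.items():
        pools[assign_bucket(width, height)].append(image_id)
    for ids in pools.values():
        rng.shuffle(ids)

    batches = []
    while any(pools.values()):
        batch, used, progressed = defaultdict(list), 0, True
        while progressed:  # round-robin over buckets until nothing else fits
            progressed = False
            for bucket in BUCKETS:
                cost = tokens(bucket)
                if pools[bucket] and used + cost <= token_budget:
                    batch[bucket].append(pools[bucket].pop())
                    used += cost
                    progressed = True
        if not batch:
            raise ValueError("token_budget is smaller than a single image")
        batches.append(dict(batch))
    return batches


sizes = [(600, 600), (1000, 1000), (700, 1400), (1500, 750), (520, 500), (2000, 2000)]
images = {f"img{i}": size for i, size in enumerate(sizes)}
for batch in mixed_batches(images, token_budget=10000):
    print({bucket: len(ids) for bucket, ids in batch.items()})
```

Whether such mixing introduces sampling biases depends on how buckets are weighted, which is the question the premise leaves open.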

What would settle it

A control experiment that trains an otherwise identical model on short captions and fixed-resolution batches, and reaches comparable benchmark scores with similar or lower total compute, would falsify the central efficiency claim.

Original abstract

We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply RL with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Lens, a 3.8B-parameter text-to-image model trained on the Lens-800M dataset of 800M image-text pairs with GPT-4.1-generated captions averaging 109 words. It claims competitive or superior performance to state-of-the-art models exceeding 6B parameters across benchmarks, while using only 19.3% of the training compute of Z-Image. Efficiency is attributed to dense semantic captions, multi-resolution and multi-aspect-ratio batch construction per step, a semantic VAE, a strong language encoder enabling multilingual generalization from English-only data, RL fine-tuning with taxonomy-driven prompts (Lens-RL-8K) and structured rewards, a reasoner module, and distillation for 4-step inference. The model supports arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440², with inference times of 3.15s for 1024² on H100 and 0.84s for the turbo variant.

Significance. If the performance and efficiency claims are substantiated with proper controls, this would be a meaningful contribution to efficient training of foundational T2I models. Demonstrating that a compact 3.8B model can match or exceed larger counterparts through data density and architectural choices could influence future scaling strategies and lower barriers to high-quality generation. The multilingual capability from English-only training and native support for diverse aspect ratios are notable practical strengths.

major comments (2)
  1. [§4.2] Batch Construction and Training Procedure: The claim that Lens needs only about 19.3% of Z-Image's training compute, along with its competitive performance, rests on the premise that packing multiple resolutions and aspect ratios into each batch enlarges effective visual coverage without new instabilities or biases. The manuscript describes the batch construction but supplies neither an ablation comparing mixed- vs. fixed-resolution training under matched compute nor diagnostics (gradient norm histograms, loss spike frequency, or convergence curves) confirming stability. This is load-bearing for the central efficiency claim.
  2. [Evaluation section and Table 1] The abstract asserts competitive or superior benchmark results and a precise compute figure (about 19.3% of Z-Image's training compute), yet the provided manuscript text supplies no numerical scores, baseline comparisons, error bars, or evaluation protocol details (e.g., exact metrics, number of samples, or statistical significance). Without these, the claim that Lens surpasses models with >6B parameters cannot be verified and may rest on post-hoc choices.
minor comments (2)
  1. [§3.3] The description of the gains from the semantic VAE and language encoder would be clearer with a brief comparison table against standard VAE and CLIP-style encoders that quantifies the convergence speedup.
  2. Dataset details for Lens-800M and Lens-RL-8K (e.g., exact filtering criteria and prompt taxonomy) are referenced but could include a short appendix table for reproducibility.
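
One of the diagnostics requested in major comment 1, loss-spike frequency, can be computed from a logged loss curve. A minimal sketch, assuming a spike is a step whose loss exceeds a multiple of the trailing median. The window and threshold are illustrative choices, not values from the manuscript.

```python
from statistics import median


def loss_spike_rate(losses, window=50, threshold=1.5):
    """Fraction of steps whose loss exceeds `threshold` x the median of the prior `window` steps."""
    spikes = 0
    checked = 0
    for step in range(window, len(losses)):
        baseline = median(losses[step - window:step])
        checked += 1
        if losses[step] > threshold * baseline:
            spikes += 1
    # With fewer than window + 1 points there is nothing to check.
    return spikes / checked if checked else 0.0


# Synthetic curve: flat loss with a single injected spike at step 60.
curve = [1.0] * 100
curve[60] = 5.0
print(loss_spike_rate(curve))
```

Comparing this rate between a mixed-resolution run and a fixed-resolution run with the same settings would give one of the stability comparisons the referee asks for.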

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the substantiation of our efficiency and performance claims without altering the core contributions.

Point-by-point responses
  1. Referee: [§4.2] Batch Construction and Training Procedure: The claim that Lens needs only about 19.3% of Z-Image's training compute, along with its competitive performance, rests on the premise that packing multiple resolutions and aspect ratios into each batch enlarges effective visual coverage without new instabilities or biases. The manuscript describes the batch construction but supplies neither an ablation comparing mixed- vs. fixed-resolution training under matched compute nor diagnostics (gradient norm histograms, loss spike frequency, or convergence curves) confirming stability. This is load-bearing for the central efficiency claim.

    Authors: We agree that an explicit ablation and stability diagnostics would provide stronger support for the batch construction strategy. In the revised manuscript we will add a controlled comparison of mixed-resolution/multi-aspect-ratio batching versus fixed-resolution training under matched total compute (same number of optimization steps and equivalent FLOPs), reporting final benchmark performance and convergence behavior. We will also include gradient-norm histograms, loss curves across training, and statistics on loss-spike frequency to confirm that the mixed-batch regime introduces no additional instabilities. The reported 19.3% compute figure is obtained by summing actual per-step FLOPs over the mixed batches used; we will expand Section 4.2 to show this calculation explicitly. revision: yes

  2. Referee: [Evaluation section and Table 1] The abstract asserts competitive or superior benchmark results and a precise compute figure (about 19.3% of Z-Image's training compute), yet the provided manuscript text supplies no numerical scores, baseline comparisons, error bars, or evaluation protocol details (e.g., exact metrics, number of samples, or statistical significance). Without these, the claim that Lens surpasses models with >6B parameters cannot be verified and may rest on post-hoc choices.

    Authors: We apologize if the numerical results and protocol details were insufficiently highlighted in the main text. Table 1 already reports concrete benchmark scores (FID, CLIP similarity, human preference rates) for Lens against >6B-parameter baselines including Z-Image, together with the evaluation protocol. To eliminate any ambiguity we will (i) embed the key numerical scores and direct baseline comparisons into the Evaluation section, (ii) add error bars from repeated runs where available, and (iii) expand the protocol description to specify exact metrics, number of samples per benchmark, and any statistical significance tests. All evaluation choices were fixed prior to final training and are documented in the supplementary material; we will make this explicit in the revision. revision: partial
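
The FLOP accounting described in the first response, summing per-step cost over mixed batches, can be sketched as follows. This is a rough illustration, assuming the common 6 × parameters × tokens rule of thumb for transformer training cost, which ignores attention's quadratic term and the text encoder. The token counts are made up, and the authors' actual calculation may differ.

```python
def step_flops(n_params: float, tokens_per_sample: list[int]) -> float:
    """Approximate FLOPs for one optimization step using 6 * N * T."""
    return 6 * n_params * sum(tokens_per_sample)


def total_flops(n_params: float, steps: list[list[int]]) -> float:
    """Sum per-step FLOPs over a run; each step lists the token count of every sample."""
    return sum(step_flops(n_params, step) for step in steps)


def compute_share(run_flops: float, baseline_flops: float) -> float:
    """Fraction of a baseline's training compute used by a run."""
    return run_flops / baseline_flops


# Hypothetical mixed-resolution run: token counts vary within each step.
mixed_run = [[1024, 4096, 4095], [1024, 1024, 4096]]
# Hypothetical fixed-resolution run with the same number of steps.
fixed_run = [[4096, 4096, 4096], [4096, 4096, 4096]]

n_params = 3.8e9  # parameter count quoted for Lens
share = compute_share(total_flops(n_params, mixed_run), total_flops(n_params, fixed_run))
print(f"mixed run uses {share:.1%} of the fixed run's approximate FLOPs")
```

For a matched-compute ablation, the number of steps in one of the runs would be adjusted until the two totals agree.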

Circularity Check

0 steps flagged

No circularity: empirical efficiency claims rest on training runs and external benchmarks

Full rationale

The paper's central claims concern measured training compute and benchmark performance for a 3.8B model trained on Lens-800M captions plus mixed-resolution batches, followed by RL and distillation stages. These are presented as outcomes of concrete training procedures and comparisons to external models such as Z-Image, not as derivations or predictions that reduce by construction to fitted constants, self-defined quantities, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked whose validity depends on the target result itself. The absence of ablations for the mixed-batch strategy is a question of evidence strength rather than circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The work is entirely empirical. No new mathematical axioms or physical entities are introduced; performance rests on the quality of the GPT-4.1 captions, the effectiveness of the chosen architecture components, and the fairness of the undisclosed evaluation protocol.

free parameters (2)
  • RL reward rubric weights and taxonomy prompts
    These are tuned post-training to suppress artifacts and are not derived from first principles.
  • multi-resolution batch sampling ratios
    Chosen to enlarge visual coverage; exact distribution not specified in abstract.

pith-pipeline@v0.9.0 · 5943 in / 1467 out tokens · 39317 ms · 2026-05-22T09:30:23.702548+00:00 · methodology

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 23 internal anchors

  1. [1]

    Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, and Shilin Zhou. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint a...

  2. [2]

    LongCat-Image Technical Report

    Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report.arXiv preprint arXiv:2512.07584, 2025

  3. [3]

    FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

    Black Forest Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

  4. [4]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

  5. [5]

    HunyuanImage 3.0 Technical Report

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

  6. [6]

    Oneig-bench: Omni-dimensional nuanced evaluation for image generation

    Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  7. [7]

    Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  8. [8]

    FLUX: Open-weight text-to-image models

    Black Forest Labs. FLUX: Open-weight text-to-image models. https://github.com/ black-forest-labs/flux, 2024

  9. [9]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorber, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the International Conference on Machine Learning (ICML), 2024

  10. [10]

    Towards scalable pre-training of visual tokenizers for generation.arXiv preprint arXiv:2512.13687, 2025

    Jingfeng Yao, Yuda Song, Yucong Zhou, and Xinggang Wang. Towards scalable pre-training of visual tokenizers for generation.arXiv preprint arXiv:2512.13687, 2025

  11. [11]

    gpt-oss-120b & gpt-oss-20b model card, 2025

    OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URLhttps://arxiv.org/abs/2508. 10925

  12. [12]

    Eva-based fast nsfw image classifier, 2025

    Freepik Company S.L. Eva-based fast nsfw image classifier, 2025. URLhttps://huggingface. co/Freepik/nsfw_image_detector. 21

  13. [13]

    Laion- 5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion- 5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

  14. [14]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

  15. [15]

    The faiss library

    Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre- Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. 2024

  16. [16]

    Billion-scale similarity search with GPUs

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019

  17. [17]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhesikan, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. Technical report, OpenAI, 2023

  18. [18]

    ShareGPT4V: Improving large multi-modal models with better captions

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improving large multi-modal models with better captions. In Proceedings of the European Conference on Computer Vision (ECCV), 2024

  19. [19]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InProceedings of the International Conference on Learning Representations (ICLR), 2023

  20. [20]

    Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

    Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

  21. [21]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  22. [22]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  23. [23]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProceedings of the International Conference on Learning Representations (ICLR), 2019

  24. [24]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025

  25. [25]

    Rubricrl: Simple generalizable rewards for text-to-image generation.arXiv preprint arXiv:2511.20651, 2025

    Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, and Chunming Qiao. Rubricrl: Simple generalizable rewards for text-to-image generation.arXiv preprint arXiv:2511.20651, 2025

  26. [26]

    Improved distribution matching distillation for fast image synthesis

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. InNeurIPS, 2024. 22

  27. [27]

    Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield.arXiv preprint arXiv:2511.22677, 2025

    Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, and Steven Hoi. Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield.arXiv preprint arXiv:2511.22677, 2025

  28. [28]

    Senseflow: Scaling distribution matching for flow-based text-to-image distillation.arXiv preprint arXiv:2506.00523, 2025

    Xingtong Ge, Xin Zhang, Tongda Xu, Yi Zhang, Xinjie Zhang, Yan Wang, and Jun Zhang. Senseflow: Scaling distribution matching for flow-based text-to-image distillation.arXiv preprint arXiv:2506.00523, 2025

  29. [29]

    Stabilizing Training of Generative Adversarial Networks through Regularization

    Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization.arXiv preprint arXiv:1705.09367, 2017

  30. [30]

    Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316, 2025

    Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316, 2025

  31. [31]

    X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

    Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, and Jie Jiang. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

  32. [32]

    Textcrafter: Accurately rendering multiple texts in complex visual scenes

    Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025

  33. [33]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  34. [34]

    SDXL: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. InProceedings of the International Conference on Learning Representations (ICLR), 2024

  35. [35]

    Stable Diffusion 3.5

    Stability AI. Stable Diffusion 3.5. https://stability.ai/news-updates/ introducing-stable-diffusion-3-5, 2024. Official model release announcement

  36. [36]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  37. [37]

    Sana: Efficient high-resolution text-to- image synthesis with linear diffusion transformers

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. Sana: Efficient high-resolution text-to- image synthesis with linear diffusion transformers. InInternational Conference on Learning Representations, 2025

  38. [38]

    HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

  39. [39]

    Introducing our latest image generation model in the api.https://openai.com/ index/image-generation-api/, 2025

    OpenAI. Introducing our latest image generation model in the api.https://openai.com/ index/image-generation-api/, 2025. Official announcement of gpt-image-1

  40. [40]

    Introducing gemini 2.5 flash image, our state-of-the-art image model

    Google. Introducing gemini 2.5 flash image, our state-of-the-art image model. https: //developers.googleblog.com/en/introducing-gemini-2-5-flash-image/, 2025. Also known as Nano Banana. 23

  41. [41]

    Kolors 2.0.https://klingai.com/app

    Kuaishou Kolors Team. Kolors 2.0.https://klingai.com/app

  42. [42]

    Seedream 4.0: Toward Next-generation Multimodal Image Generation

    Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

  43. [43]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  44. [44]

    Transfusion: Predict the next token and diffuse images with one multi-modal model

    Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. InInternational Conference on Learning Representations, 2025

  45. [45]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  46. [46]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  47. [47]

    Using human feedback to fine-tune diffusion models without any reward model

    Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024

  48. [48]

    Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization

    Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13199–13208, 2025

  49. [49]

    Self-play fine-tuning of diffusion models for text-to-image generation.Advances in Neural Information Processing Systems, 37: 73366–73398, 2024

    Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning of diffusion models for text-to-image generation.Advances in Neural Information Processing Systems, 37: 73366–73398, 2024

  50. [50]

    A dense reward view on aligning text-to- image diffusion with preference.arXiv preprint arXiv:2402.08265, 2024

    Shentao Yang, Tianqi Chen, and Mingyuan Zhou. A dense reward view on aligning text-to- image diffusion with preference.arXiv preprint arXiv:2402.08265, 2024

  51. [51]

    Aligning diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024

  52. [52]

    Margin-aware preference optimization for aligning diffusion models without reference

    Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, and Jongheon Jeong. Margin-aware preference optimization for aligning diffusion models without reference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4744–4752, 2026

  53. [53]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025. 24

  54. [54]

    MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, and Liefeng Bo. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025

  55. [55]

    Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

    Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning.arXiv preprint arXiv:2508.20751, 2025

  56. [56]

    Advantage weighted matching: Aligning rl with pretraining in diffusion models

    Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models. arXiv preprint arXiv:2509.25050, 2025

  57. [57]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. arXiv preprint arXiv:2507.17746, 2025

  58. [58]

    Reinforcement learning with rubric anchors

    Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, et al. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790, 2025

  59. [59]

    Advancedif: Rubric-based benchmarking and reinforcement learning for advancing llm instruction following

    Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Xiaoliang Peng, Beibin Li, et al. Advancedif: Rubric-based benchmarking and reinforcement learning for advancing llm instruction following. arXiv preprint arXiv:2511.10507, 2025

  60. [60]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, October 2020. URL https://arxiv.org/abs/2010.02502

  61. [61]

    DPM-Solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022

  62. [62]

    Unipc: A unified predictor-corrector framework for fast sampling of diffusion models

    Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. NeurIPS, 2023

  63. [63]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

  64. [64]

    Consistency Models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023

  65. [65]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023

  66. [66]

    Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

    Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. In ICLR, 2024

  67. [67]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023

  68. [68]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, 2024

  69. [69]

    Improved distribution matching distillation for fast image synthesis

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

  70. [70]

    Distribution matching distillation meets reinforcement learning

    Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Liuzhuozheng Li, Hengzhuang Li, Xin Jin, David Liu, Zhen Li, Bo Zhang, et al. Distribution matching distillation meets reinforcement learning. arXiv preprint arXiv:2511.13649, 2025

  71. [71]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  72. [72]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. pages 15703–15712, 2025

  73. [73]

    Both semantics and reconstruction matter: Making representation encoders ready for text-to-image generation and editing

    Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, et al. Both semantics and reconstruction matter: Making representation encoders ready for text-to-image generation and editing. arXiv preprint arXiv:2512.17909, 2025

  74. [74]

    Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

  75. [75]

    Unified latents (ul): How to train your latents

    Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, and Tim Salimans. Unified latents (ul): How to train your latents. arXiv preprint arXiv:2602.17270, 2026

  76. [76]

    Latent forcing: Reordering the diffusion trajectory for pixel-space image generation

    Alan Baade, Eric Ryan Chan, Kyle Sargent, Changan Chen, Justin Johnson, Ehsan Adeli, and Li Fei-Fei. Latent forcing: Reordering the diffusion trajectory for pixel-space image generation. arXiv preprint arXiv:2602.11401, 2026

  77. [77]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion: Tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

  78. [78]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in Neural Information Processing Systems, 37:84839–84865, 2024

  79. [79]

    An image is worth 32 tokens for reconstruction and generation

    Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. Advances in Neural Information Processing Systems, 37:128940–128966, 2024

  80. [80]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. pages 8748–8763, 2021
