Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Baining Guo; Chong Luo; Dong Chen; Dongdong Chen; Fangyun Wei; Jianmin Bao; Jiawei Zhang; Ji Li; Jinjing Zhao; Lei Shi

arxiv: 2605.21573 · v1 · pith:JTT7Z4WVnew · submitted 2026-05-20 · 💻 cs.CV

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Dong Chen , Fangyun Wei , Ziyu Wan , Dongdong Chen , Jiawei Zhang , Jinjing Zhao , Sirui Zhang , Yang Yue

show 13 more authors

Zhiyang Liang Baining Guo Chong Luo Jianmin Bao Ji Li Lei Shi Qinhong Yang Xiuyu Wu Xuelu Feng Yan Lu Yanchen Dong Yitong Wang Yunuo Chen

This is my paper

Pith reviewed 2026-05-22 09:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords text-to-image generationtraining efficiencydense image captionsmixed resolution trainingdiffusion modelsaspect ratio generalizationmultilingual text-to-image

0 comments

The pith

A 3.8 billion parameter text-to-image model matches or exceeds larger models while using only about one-fifth the training compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lens as evidence that foundational text-to-image models need not grow ever larger to improve performance. Instead, it shows that packing each training batch with denser information through long, detailed captions and varied image shapes and sizes can drive faster convergence and competitive results. Architectural choices like a semantic VAE and a capable language encoder further accelerate learning and enable generalization from limited language data. If correct, these choices point toward high-quality generation becoming feasible with far smaller parameter counts and compute budgets than current scaling trends assume.

Core claim

Lens, a 3.8B-parameter model, achieves performance competitive with and in several cases surpassing state-of-the-art models with more than 6B parameters across benchmarks, while requiring only about 19.3% of the training compute used by Z-Image, through maximization of data information density via an 800M-image dataset with GPT-4.1 captions averaging 109 words and mixed-resolution batches, plus architectural decisions including a semantic VAE and strong language encoder.

What carries the argument

Lens-800M dataset of densely captioned pairs combined with mixed-resolution and aspect-ratio batch construction, supported by a semantic VAE for better latents and a strong language encoder for faster optimization.

If this is right

The model generalizes without retraining to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440 squared.
English-only training data still enables prompt understanding in several other commonly used languages.
Post-training RL with taxonomy-driven prompts and a training-free reasoner module suppresses artifacts and improves request alignment.
Distillation yields a version that produces 1024 squared images in four steps on a single GPU in under one second.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same data-density approach could be tested on video or 3D generation where compute demands grow even faster.
Running the mixed-batch strategy on existing larger backbones would isolate whether the gains come mainly from data quality rather than model size.
Exploring adaptive caption length or automatic resolution sampling during training could further reduce the compute needed for a target quality level.

Load-bearing premise

That GPT-4.1-generated captions averaging 109 words supply meaningfully richer semantic supervision than short captions, and that mixing resolutions and aspect ratios per batch enlarges visual coverage without introducing new biases or instabilities.

What would settle it

A control experiment training an otherwise identical model on short captions and fixed-resolution batches that reaches comparable benchmark scores using similar or greater total compute would falsify the central efficiency claim.

read the original abstract

We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply RL with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Lens gets a 3.8B T2I model to competitive levels with larger ones via dense GPT captions and mixed-resolution batches, but the 19% compute claim rests on an unablated batching strategy.

read the letter

The main takeaway is that this paper trains a 3.8B-parameter text-to-image model to reach or exceed the performance of several systems with more than 6B parameters while using roughly 19% of the compute reported for Z-Image. The efficiency comes from a concrete pipeline rather than a single trick: 800M images with long GPT-4.1 captions averaging 109 words, batches that mix resolutions and aspect ratios, a semantic VAE, and a strong language encoder that also enables multilingual output from English-only data. After pre-training they add taxonomy-driven RL to reduce artifacts, a reasoner module, and distillation to four-step inference. The model supports aspect ratios from 1:2 to 2:1 and resolutions up to 1440 squared, with reported speeds of 3.15 seconds for 1024 squared on one H100 and 0.84 seconds for the turbo version. These are practical engineering results that lower the resource bar for competitive foundational models. The mixed-resolution batch construction is presented as a key way to increase visual coverage per optimization step. Yet the paper supplies no ablation that holds total compute fixed and compares mixed versus fixed-resolution training, nor does it show gradient norms, loss spikes, or convergence curves that would confirm the regime stays stable without hidden adjustments. The benchmark claims are stated as competitive or better, but the abstract gives no numerical scores or protocol details, so the full tables and baselines need close checking. This work is aimed at groups trying to train high-quality T2I models with academic or modest commercial budgets. It assembles existing ideas into a working recipe and releases a new model, which is enough to justify sending it to a serious referee who can request the missing ablations and verify the numbers.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Lens, a 3.8B-parameter text-to-image model trained on the Lens-800M dataset of 800M image-text pairs with GPT-4.1-generated captions averaging 109 words. It claims competitive or superior performance to state-of-the-art models exceeding 6B parameters across benchmarks, while using only 19.3% of the training compute of Z-Image. Efficiency is attributed to dense semantic captions, multi-resolution and multi-aspect-ratio batch construction per step, a semantic VAE, a strong language encoder enabling multilingual generalization from English-only data, RL fine-tuning with taxonomy-driven prompts (Lens-RL-8K) and structured rewards, a reasoner module, and distillation for 4-step inference. The model supports arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440², with inference times of 3.15s for 1024² on H100 and 0.84s for the turbo variant.

Significance. If the performance and efficiency claims are substantiated with proper controls, this would be a meaningful contribution to efficient training of foundational T2I models. Demonstrating that a compact 3.8B model can match or exceed larger counterparts through data density and architectural choices could influence future scaling strategies and lower barriers to high-quality generation. The multilingual capability from English-only training and native support for diverse aspect ratios are notable practical strengths.

major comments (2)

[§4.2] §4.2 (Batch Construction and Training Procedure): The 19.3% compute reduction and competitive performance rest on the premise that packing multiple resolutions and aspect ratios into each batch enlarges effective visual coverage without new instabilities or biases. The manuscript describes the batch construction but supplies neither an ablation comparing mixed- vs. fixed-resolution training under matched compute nor diagnostics (gradient norm histograms, loss spike frequency, or convergence curves) confirming stability. This is load-bearing for the central efficiency claim.
[Evaluation section and Table 1] Evaluation section and Table 1: The abstract asserts competitive or superior benchmark results and a precise 19.3% compute reduction, yet the provided manuscript text supplies no numerical scores, baseline comparisons, error bars, or evaluation protocol details (e.g., exact metrics, number of samples, or statistical significance). Without these, the claim that Lens surpasses models with >6B parameters cannot be verified and may rest on post-hoc choices.

minor comments (2)

[§3.3] The description of the semantic VAE and language encoder benefits would benefit from a brief comparison table against standard VAE and CLIP-style encoders to clarify the convergence speed gains.
Dataset details for Lens-800M and Lens-RL-8K (e.g., exact filtering criteria and prompt taxonomy) are referenced but could include a short appendix table for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the substantiation of our efficiency and performance claims without altering the core contributions.

read point-by-point responses

Referee: [§4.2] §4.2 (Batch Construction and Training Procedure): The 19.3% compute reduction and competitive performance rest on the premise that packing multiple resolutions and aspect ratios into each batch enlarges effective visual coverage without new instabilities or biases. The manuscript describes the batch construction but supplies neither an ablation comparing mixed- vs. fixed-resolution training under matched compute nor diagnostics (gradient norm histograms, loss spike frequency, or convergence curves) confirming stability. This is load-bearing for the central efficiency claim.

Authors: We agree that an explicit ablation and stability diagnostics would provide stronger support for the batch construction strategy. In the revised manuscript we will add a controlled comparison of mixed-resolution/multi-aspect-ratio batching versus fixed-resolution training under matched total compute (same number of optimization steps and equivalent FLOPs), reporting final benchmark performance and convergence behavior. We will also include gradient-norm histograms, loss curves across training, and statistics on loss-spike frequency to confirm that the mixed-batch regime introduces no additional instabilities. The reported 19.3% compute figure is obtained by summing actual per-step FLOPs over the mixed batches used; we will expand Section 4.2 to show this calculation explicitly. revision: yes
Referee: [Evaluation section and Table 1] Evaluation section and Table 1: The abstract asserts competitive or superior benchmark results and a precise 19.3% compute reduction, yet the provided manuscript text supplies no numerical scores, baseline comparisons, error bars, or evaluation protocol details (e.g., exact metrics, number of samples, or statistical significance). Without these, the claim that Lens surpasses models with >6B parameters cannot be verified and may rest on post-hoc choices.

Authors: We apologize if the numerical results and protocol details were insufficiently highlighted in the main text. Table 1 already reports concrete benchmark scores (FID, CLIP similarity, human preference rates) for Lens against >6B-parameter baselines including Z-Image, together with the evaluation protocol. To eliminate any ambiguity we will (i) embed the key numerical scores and direct baseline comparisons into the Evaluation section, (ii) add error bars from repeated runs where available, and (iii) expand the protocol description to specify exact metrics, number of samples per benchmark, and any statistical significance tests. All evaluation choices were fixed prior to final training and are documented in the supplementary material; we will make this explicit in the revision. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical efficiency claims rest on training runs and external benchmarks

full rationale

The paper's central claims concern measured training compute and benchmark performance for a 3.8B model trained on Lens-800M captions plus mixed-resolution batches, followed by RL and distillation stages. These are presented as outcomes of concrete training procedures and comparisons to external models such as Z-Image, not as derivations or predictions that reduce by construction to fitted constants, self-defined quantities, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked whose validity depends on the target result itself. The absence of ablations for the mixed-batch strategy is a question of evidence strength rather than circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

The work is entirely empirical. No new mathematical axioms or physical entities are introduced; performance rests on the quality of the GPT-4.1 captions, the effectiveness of the chosen architecture components, and the fairness of the undisclosed evaluation protocol.

free parameters (2)

RL reward rubric weights and taxonomy prompts
These are tuned post-training to suppress artifacts and are not derived from first principles.
multi-resolution batch sampling ratios
Chosen to enlarge visual coverage; exact distribution not specified in abstract.

pith-pipeline@v0.9.0 · 5943 in / 1467 out tokens · 39317 ms · 2026-05-22T09:30:23.702548+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We increase image-side information density by constructing each training batch from images with multiple resolutions (i.e.,{5122, 7682, 10242}) and diverse aspect ratios ... thereby enlarging the effective visual coverage of each optimization step.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Lens requires only about 19.3% of the training compute used by Z-Image.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · 23 internal anchors

[1]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, and Shilin Zhou. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint a...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

LongCat-Image Technical Report

Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report.arXiv preprint arXiv:2512.07584, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

Black Forest Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

work page 2025
[4]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Oneig-bench: Omni-dimensional nuanced evaluation for image generation

Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2025

work page 2025
[7]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

work page 2023
[8]

FLUX: Open-weight text-to-image models

Black Forest Labs. FLUX: Open-weight text-to-image models. https://github.com/ black-forest-labs/flux, 2024

work page 2024
[9]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorber, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the International Conference on Machine Learning (ICML), 2024

work page 2024
[10]

Towards scalable pre-training of visual tokenizers for generation.arXiv preprint arXiv:2512.13687, 2025

Jingfeng Yao, Yuda Song, Yucong Zhou, and Xinggang Wang. Towards scalable pre-training of visual tokenizers for generation.arXiv preprint arXiv:2512.13687, 2025

work page arXiv 2025
[11]

gpt-oss-120b & gpt-oss-20b model card, 2025

OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URLhttps://arxiv.org/abs/2508. 10925

work page 2025
[12]

Eva-based fast nsfw image classifier, 2025

Freepik Company S.L. Eva-based fast nsfw image classifier, 2025. URLhttps://huggingface. co/Freepik/nsfw_image_detector. 21

work page 2025
[13]

Laion- 5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion- 5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

work page 2022
[14]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

The faiss library

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre- Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. 2024

work page 2024
[16]

Billion-scale similarity search with GPUs

Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019

work page 2019
[17]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhesikan, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. Technical report, OpenAI, 2023

work page 2023
[18]

ShareGPT4V: Improving large multi-modal models with better captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improving large multi-modal models with better captions. In Proceedings of the European Conference on Computer Vision (ECCV), 2024

work page 2024
[19]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InProceedings of the International Conference on Learning Representations (ICLR), 2023

work page 2023
[20]

Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

work page 2019
[21]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

work page 2024
[22]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProceedings of the International Conference on Learning Representations (ICLR), 2019

work page 2019
[24]

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Rubricrl: Simple generalizable rewards for text-to-image generation.arXiv preprint arXiv:2511.20651, 2025

Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, and Chunming Qiao. Rubricrl: Simple generalizable rewards for text-to-image generation.arXiv preprint arXiv:2511.20651, 2025

work page arXiv 2025
[26]

Improved distribution matching distillation for fast image synthesis

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. InNeurIPS, 2024. 22

work page 2024
[27]

Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield.arXiv preprint arXiv:2511.22677, 2025

Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, and Steven Hoi. Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield.arXiv preprint arXiv:2511.22677, 2025

work page arXiv 2025
[28]

Senseflow: Scaling distribution matching for flow-based text-to-image distillation.arXiv preprint arXiv:2506.00523, 2025

Xingtong Ge, Xin Zhang, Tongda Xu, Yi Zhang, Xinjie Zhang, Yan Wang, and Jun Zhang. Senseflow: Scaling distribution matching for flow-based text-to-image distillation.arXiv preprint arXiv:2506.00523, 2025

work page arXiv 2025
[29]

Stabilizing Training of Generative Adversarial Networks through Regularization

Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization.arXiv preprint arXiv:1705.09367, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316, 2025

Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316, 2025

work page arXiv 2025
[31]

X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, and Jie Jiang. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

work page arXiv 2025
[32]

Textcrafter: Accurately rendering multiple texts in complex visual scenes

Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025

work page arXiv 2025
[33]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022
[34]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. InProceedings of the International Conference on Learning Representations (ICLR), 2024

work page 2024
[35]

Stable Diffusion 3.5

Stability AI. Stable Diffusion 3.5. https://stability.ai/news-updates/ introducing-stable-diffusion-3-5, 2024. Official model release announcement

work page 2024
[36]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023
[37]

Sana: Efficient high-resolution text-to- image synthesis with linear diffusion transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. Sana: Efficient high-resolution text-to- image synthesis with linear diffusion transformers. InInternational Conference on Learning Representations, 2025

work page 2025
[38]

HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Introducing our latest image generation model in the api.https://openai.com/ index/image-generation-api/, 2025

OpenAI. Introducing our latest image generation model in the api.https://openai.com/ index/image-generation-api/, 2025. Official announcement of gpt-image-1

work page 2025
[40]

Introducing gemini 2.5 flash image, our state-of-the-art image model

Google. Introducing gemini 2.5 flash image, our state-of-the-art image model. https: //developers.googleblog.com/en/introducing-gemini-2-5-flash-image/, 2025. Also known as Nano Banana. 23

work page 2025
[41]

Kolors 2.0.https://klingai.com/app

Kuaishou Kolors Team. Kolors 2.0.https://klingai.com/app

work page
[42]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Transfusion: Predict the next token and diffuse images with one multi-modal model

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. InInternational Conference on Learning Representations, 2025

work page 2025
[45]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

work page 2024
[47]

Using human feedback to fine-tune diffusion models without any reward model

Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024

work page 2024
[48]

Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization

Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13199–13208, 2025

work page 2025
[49]

Self-play fine-tuning of diffusion models for text-to-image generation.Advances in Neural Information Processing Systems, 37: 73366–73398, 2024

Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning of diffusion models for text-to-image generation.Advances in Neural Information Processing Systems, 37: 73366–73398, 2024

work page 2024
[50]

A dense reward view on aligning text-to- image diffusion with preference.arXiv preprint arXiv:2402.08265, 2024

Shentao Yang, Tianqi Chen, and Mingyuan Zhou. A dense reward view on aligning text-to- image diffusion with preference.arXiv preprint arXiv:2402.08265, 2024

work page arXiv 2024
[51]

Aligning diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024

work page 2024
[52]

Margin-aware preference optimization for aligning diffusion models without reference

Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, and Jongheon Jeong. Margin-aware preference optimization for aligning diffusion models without reference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4744–4752, 2026

work page 2026
[53]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025. 24

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, and Liefeng Bo. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning.arXiv preprint arXiv:2508.20751, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025

Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025

work page arXiv 2025
[57]

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

Reinforcement learning with rubric anchors

Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, et al. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790, 2025

work page arXiv 2025
[59]

Advancedif: Rubric-based bench- marking and reinforcement learning for advancing llm instruction following.arXiv preprint arXiv:2511.10507, 2025

Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Xiaoliang Peng, Beibin Li, et al. Advancedif: Rubric-based bench- marking and reinforcement learning for advancing llm instruction following.arXiv preprint arXiv:2511.10507, 2025

work page arXiv 2025
[60]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, October 2020. URLhttps://arxiv.org/abs/2010.02502

work page internal anchor Pith review Pith/arXiv arXiv 2010
[61]

Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.arXiv preprint arXiv:2206.00927, 2022

work page arXiv 2022
[62]

Unipc: A unified predictor-corrector framework for fast sampling of diffusion models.NeurIPS, 2023

Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models.NeurIPS, 2023

work page 2023
[63]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[64]

Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models.arXiv preprint arXiv:2303.01469, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[65]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[66]

Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. InICLR, 2024

work page 2024
[67]

Adversarial diffusion distillation.arXiv preprint arXiv:2311.17042, 2023

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation.arXiv preprint arXiv:2311.17042, 2023

work page arXiv 2023
[68]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, 2024. 25

work page 2024
[69]

Improved distribution matching distillation for fast image synthesis

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

work page 2024
[70]

Distribution matching distillation meets reinforcement learning.arXiv preprint arXiv:2511.13649, 2025

Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Liuzhuozheng Li, Hengzhuang Li, Xin Jin, David Liu, Zhen Li, Bo Zhang, et al. Distribution matching distillation meets reinforcement learning.arXiv preprint arXiv:2511.13649, 2025

work page arXiv 2025
[71]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[72]

Reconstruction vs

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. pages 15703–15712, 2025

work page 2025
[73]

Both semantics and reconstruction matter: Making representation encoders ready for text-to-image generation and editing.arXiv preprint arXiv:2512.17909, 2025

Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, et al. Both semantics and reconstruction matter: Making representation encoders ready for text-to-image generation and editing.arXiv preprint arXiv:2512.17909, 2025

work page arXiv 2025
[74]

Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

work page 2025
[75]

Unified latents (ul): How to train your latents.arXiv preprint arXiv:2602.17270, 2026

Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, and Tim Salimans. Unified latents (ul): How to train your latents.arXiv preprint arXiv:2602.17270, 2026

work page arXiv 2026
[76]

Latent forcing: Reordering the diffusion trajectory for pixel-space image generation.arXiv preprint arXiv:2602.11401, 2026

Alan Baade, Eric Ryan Chan, Kyle Sargent, Changan Chen, Justin Johnson, Ehsan Adeli, and Li Fei-Fei. Latent forcing: Reordering the diffusion trajectory for pixel-space image generation.arXiv preprint arXiv:2602.11401, 2026

work page arXiv 2026
[77]

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation.arXiv preprint arXiv:2310.05737, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[78]

Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

work page 2024
[79]

An image is worth 32 tokens for reconstruction and generation.Advances in Neural Information Processing Systems, 37:128940–128966, 2024

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation.Advances in Neural Information Processing Systems, 37:128940–128966, 2024

work page 2024
[80]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. pages 8748–8763, 2021

work page 2021

Showing first 80 references.

[1] [1]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, and Shilin Zhou. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer.arXiv preprint a...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

LongCat-Image Technical Report

Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report.arXiv preprint arXiv:2512.07584, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

Black Forest Labs. FLUX.2: Frontier Visual Intelligence.https://bfl.ai/blog/flux-2, 2025

work page 2025

[4] [4]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Oneig-bench: Omni-dimensional nuanced evaluation for image generation

Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. InAdvances in Neural Information Processing Systems Datasets and Benchmarks Track, 2025

work page 2025

[7] [7]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

work page 2023

[8] [8]

FLUX: Open-weight text-to-image models

Black Forest Labs. FLUX: Open-weight text-to-image models. https://github.com/ black-forest-labs/flux, 2024

work page 2024

[9] [9]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorber, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the International Conference on Machine Learning (ICML), 2024

work page 2024

[10] [10]

Towards scalable pre-training of visual tokenizers for generation.arXiv preprint arXiv:2512.13687, 2025

Jingfeng Yao, Yuda Song, Yucong Zhou, and Xinggang Wang. Towards scalable pre-training of visual tokenizers for generation.arXiv preprint arXiv:2512.13687, 2025

work page arXiv 2025

[11] [11]

gpt-oss-120b & gpt-oss-20b model card, 2025

OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URLhttps://arxiv.org/abs/2508. 10925

work page 2025

[12] [12]

Eva-based fast nsfw image classifier, 2025

Freepik Company S.L. Eva-based fast nsfw image classifier, 2025. URLhttps://huggingface. co/Freepik/nsfw_image_detector. 21

work page 2025

[13] [13]

Laion- 5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion- 5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

work page 2022

[14] [14]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

The faiss library

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre- Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library. 2024

work page 2024

[16] [16]

Billion-scale similarity search with GPUs

Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019

work page 2019

[17] [17]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhesikan, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. Technical report, OpenAI, 2023

work page 2023

[18] [18]

ShareGPT4V: Improving large multi-modal models with better captions

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. ShareGPT4V: Improving large multi-modal models with better captions. In Proceedings of the European Conference on Computer Vision (ECCV), 2024

work page 2024

[19] [19]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InProceedings of the International Conference on Learning Representations (ICLR), 2023

work page 2023

[20] [20]

Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

Biao Zhang and Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019

work page 2019

[21] [21]

Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

work page 2024

[22] [22]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProceedings of the International Conference on Learning Representations (ICLR), 2019

work page 2019

[24] [24]

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

Rubricrl: Simple generalizable rewards for text-to-image generation.arXiv preprint arXiv:2511.20651, 2025

Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, and Chunming Qiao. Rubricrl: Simple generalizable rewards for text-to-image generation.arXiv preprint arXiv:2511.20651, 2025

work page arXiv 2025

[26] [26]

Improved distribution matching distillation for fast image synthesis

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. InNeurIPS, 2024. 22

work page 2024

[27] [27]

Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield.arXiv preprint arXiv:2511.22677, 2025

Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, and Steven Hoi. Decoupled dmd: Cfg augmentation as the spear, distribution matching as the shield.arXiv preprint arXiv:2511.22677, 2025

work page arXiv 2025

[28] [28]

Senseflow: Scaling distribution matching for flow-based text-to-image distillation.arXiv preprint arXiv:2506.00523, 2025

Xingtong Ge, Xin Zhang, Tongda Xu, Yi Zhang, Xinjie Zhang, Yan Wang, and Jun Zhang. Senseflow: Scaling distribution matching for flow-based text-to-image distillation.arXiv preprint arXiv:2506.00523, 2025

work page arXiv 2025

[29] [29]

Stabilizing Training of Generative Adversarial Networks through Regularization

Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of generative adversarial networks through regularization.arXiv preprint arXiv:1705.09367, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[30] [30]

Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316, 2025

Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316, 2025

work page arXiv 2025

[31] [31]

X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, and Jie Jiang. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

work page arXiv 2025

[32] [32]

Textcrafter: Accurately rendering multiple texts in complex visual scenes

Nikai Du, Zhennan Chen, Zhizhou Chen, Shan Gao, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes. arXiv preprint arXiv:2503.23461, 2025

work page arXiv 2025

[33] [33]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022

[34] [34]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. InProceedings of the International Conference on Learning Representations (ICLR), 2024

work page 2024

[35] [35]

Stable Diffusion 3.5

Stability AI. Stable Diffusion 3.5. https://stability.ai/news-updates/ introducing-stable-diffusion-3-5, 2024. Official model release announcement

work page 2024

[36] [36]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

work page 2023

[37] [37]

Sana: Efficient high-resolution text-to- image synthesis with linear diffusion transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. Sana: Efficient high-resolution text-to- image synthesis with linear diffusion transformers. InInternational Conference on Learning Representations, 2025

work page 2025

[38] [38]

HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

Introducing our latest image generation model in the api.https://openai.com/ index/image-generation-api/, 2025

OpenAI. Introducing our latest image generation model in the api.https://openai.com/ index/image-generation-api/, 2025. Official announcement of gpt-image-1

work page 2025

[40] [40]

Introducing gemini 2.5 flash image, our state-of-the-art image model

Google. Introducing gemini 2.5 flash image, our state-of-the-art image model. https: //developers.googleblog.com/en/introducing-gemini-2-5-flash-image/, 2025. Also known as Nano Banana. 23

work page 2025

[41] [41]

Kolors 2.0.https://klingai.com/app

Kuaishou Kolors Team. Kolors 2.0.https://klingai.com/app

work page

[42] [42]

Seedream 4.0: Toward Next-generation Multimodal Image Generation

Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [44]

Transfusion: Predict the next token and diffuse images with one multi-modal model

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. InInternational Conference on Learning Representations, 2025

work page 2025

[45] [45]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

work page 2024

[47] [47]

Using human feedback to fine-tune diffusion models without any reward model

Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024

work page 2024

[48] [48]

Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization

Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13199–13208, 2025

work page 2025

[49] [49]

Self-play fine-tuning of diffusion models for text-to-image generation.Advances in Neural Information Processing Systems, 37: 73366–73398, 2024

Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning of diffusion models for text-to-image generation.Advances in Neural Information Processing Systems, 37: 73366–73398, 2024

work page 2024

[50] [50]

A dense reward view on aligning text-to- image diffusion with preference.arXiv preprint arXiv:2402.08265, 2024

Shentao Yang, Tianqi Chen, and Mingyuan Zhou. A dense reward view on aligning text-to- image diffusion with preference.arXiv preprint arXiv:2402.08265, 2024

work page arXiv 2024

[51] [51]

Aligning diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility.Advances in Neural Information Processing Systems, 37:24897–24925, 2024

work page 2024

[52] [52]

Margin-aware preference optimization for aligning diffusion models without reference

Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, and Jongheon Jeong. Margin-aware preference optimization for aligning diffusion models without reference. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4744–4752, 2026

work page 2026

[53] [53]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025. 24

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, and Liefeng Bo. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [55]

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning.arXiv preprint arXiv:2508.20751, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [56]

Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025

Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025

work page arXiv 2025

[57] [57]

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[58] [58]

Reinforcement learning with rubric anchors

Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, et al. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790, 2025

work page arXiv 2025

[59] [59]

Advancedif: Rubric-based bench- marking and reinforcement learning for advancing llm instruction following.arXiv preprint arXiv:2511.10507, 2025

Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Xiaoliang Peng, Beibin Li, et al. Advancedif: Rubric-based bench- marking and reinforcement learning for advancing llm instruction following.arXiv preprint arXiv:2511.10507, 2025

work page arXiv 2025

[60] [60]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, October 2020. URLhttps://arxiv.org/abs/2010.02502

work page internal anchor Pith review Pith/arXiv arXiv 2010

[61] [61]

Dpm-solver++: Fast solver for guided sam- pling of diffusion probabilistic models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.arXiv preprint arXiv:2206.00927, 2022

work page arXiv 2022

[62] [62]

Unipc: A unified predictor-corrector framework for fast sampling of diffusion models.NeurIPS, 2023

Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. Unipc: A unified predictor-corrector framework for fast sampling of diffusion models.NeurIPS, 2023

work page 2023

[63] [63]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[64] [64]

Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models.arXiv preprint arXiv:2303.01469, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[65] [65]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[66] [66]

Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. InICLR, 2024

work page 2024

[67] [67]

Adversarial diffusion distillation.arXiv preprint arXiv:2311.17042, 2023

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation.arXiv preprint arXiv:2311.17042, 2023

work page arXiv 2023

[68] [68]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, 2024. 25

work page 2024

[69] [69]

Improved distribution matching distillation for fast image synthesis

Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024

work page 2024

[70] [70]

Distribution matching distillation meets reinforcement learning.arXiv preprint arXiv:2511.13649, 2025

Dengyang Jiang, Dongyang Liu, Zanyi Wang, Qilong Wu, Liuzhuozheng Li, Hengzhuang Li, Xin Jin, David Liu, Zhen Li, Bo Zhang, et al. Distribution matching distillation meets reinforcement learning.arXiv preprint arXiv:2511.13649, 2025

work page arXiv 2025

[71] [71]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes.arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[72] [72]

Reconstruction vs

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. pages 15703–15712, 2025

work page 2025

[73] [73]

Both semantics and reconstruction matter: Making representation encoders ready for text-to-image generation and editing.arXiv preprint arXiv:2512.17909, 2025

Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, et al. Both semantics and reconstruction matter: Making representation encoders ready for text-to-image generation and editing.arXiv preprint arXiv:2512.17909, 2025

work page arXiv 2025

[74] [74]

Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

work page 2025

[75] [75]

Unified latents (ul): How to train your latents.arXiv preprint arXiv:2602.17270, 2026

Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, and Tim Salimans. Unified latents (ul): How to train your latents.arXiv preprint arXiv:2602.17270, 2026

work page arXiv 2026

[76] [76]

Latent forcing: Reordering the diffusion trajectory for pixel-space image generation.arXiv preprint arXiv:2602.11401, 2026

Alan Baade, Eric Ryan Chan, Kyle Sargent, Changan Chen, Justin Johnson, Ehsan Adeli, and Li Fei-Fei. Latent forcing: Reordering the diffusion trajectory for pixel-space image generation.arXiv preprint arXiv:2602.11401, 2026

work page arXiv 2026

[77] [77]

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation.arXiv preprint arXiv:2310.05737, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[78] [78]

Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction.Advances in neural information processing systems, 37:84839–84865, 2024

work page 2024

[79] [79]

An image is worth 32 tokens for reconstruction and generation.Advances in Neural Information Processing Systems, 37:128940–128966, 2024

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation.Advances in Neural Information Processing Systems, 37:128940–128966, 2024

work page 2024

[80] [80]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. pages 8748–8763, 2021

work page 2021