pith. machine review for the scientific record.

arxiv: 2405.08748 · v1 · submitted 2024-05-14 · 💻 cs.CV

Recognition: 1 theorem link

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image generation · diffusion transformer · Chinese language understanding · multimodal dialogue · image caption refinement · multi-resolution architecture · open-source model

The pith

Hunyuan-DiT is a diffusion transformer that turns fine-grained Chinese and English prompts into images, reaching state-of-the-art quality among open-source models through custom architecture and a refined data pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hunyuan-DiT as a text-to-image diffusion transformer built to handle fine-grained prompts in both English and Chinese. It does this by designing the transformer blocks, text encoders, and positional encodings from the ground up, while creating a full data pipeline that refreshes training examples and uses a separate multimodal language model to improve image captions. The resulting model also supports ongoing conversations where users can describe changes and receive updated images. A human evaluation involving more than fifty professional raters shows it outperforming other open-source systems on Chinese prompts. This matters for anyone who needs precise visual output from natural Chinese descriptions rather than English translations.

Core claim

Hunyuan-DiT is a multi-resolution diffusion transformer whose structure, text encoder, and positional encoding are jointly designed to capture fine-grained bilingual language understanding; a supporting data pipeline and caption-refining multimodal large language model allow iterative improvement, enabling the model to conduct multi-turn multimodal dialogue and to surpass prior open-source models on Chinese-to-image tasks according to holistic human ratings.

What carries the argument

The Hunyuan-DiT diffusion transformer, whose multi-resolution blocks, bilingual text encoder, and learned positional encodings jointly process prompts to produce images while supporting dialogue-based refinement.
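
As a rough mental model of that machinery: image latent tokens self-attend, then cross-attend into the bilingual text features. Below is a minimal PyTorch sketch of such a text-conditioned block. Every class name, dimension, and the fused-encoder detail is an assumption for illustration, not the released Hunyuan-DiT code.

```python
import torch
import torch.nn as nn

class TextConditionedBlock(nn.Module):
    """One transformer block: self-attention over image latent tokens,
    cross-attention into text features, then an MLP. Illustrative only."""

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) image latent tokens at some resolution
        # text: (B, T, dim) bilingual text-encoder states (e.g. CLIP-style
        # plus T5-style features projected to a shared width -- assumed here)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))

block = TextConditionedBlock()
tokens = torch.randn(2, 256, 1024)   # e.g. a 16x16 latent grid, flattened
prompt = torch.randn(2, 77, 1024)    # fused Chinese/English text features
out = block(tokens, prompt)          # -> (2, 256, 1024)
```

Multi-resolution support in such designs typically means the same blocks process latent grids of different sizes, with position encodings that generalize across them.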

If this is right

  • The model can carry on multi-turn conversations that iteratively refine generated images based on follow-up instructions in Chinese or English.
  • A dedicated data pipeline that continuously updates and evaluates training examples supports repeated model improvements without starting from scratch.
  • Caption refinement performed by a separate multimodal large language model directly improves the model's grasp of detailed Chinese descriptions (a sketch of this step follows this list).
  • The same architecture delivers competitive results on English prompts while leading on Chinese ones among open-source systems.
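
A minimal sketch of how the data-pipeline and caption-refinement bullets above could compose in code. The threshold name, the alignment scorer, and the stand-in `refine_caption` are all assumptions for illustration; the paper's actual pipeline is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Example:
    image_path: str
    caption: str
    clip_score: float   # image-text similarity from a separate scorer (assumed)

def refine_caption(image_path: str, raw_caption: str) -> str:
    # Stand-in for a multimodal LLM call, conceptually something like
    # mllm.generate(image=..., prompt=f"Describe in detail: {raw_caption}").
    return f"[detailed rewrite of] {raw_caption}"

def pipeline_pass(examples: list[Example], min_clip: float = 0.25) -> list[Example]:
    """One iteration: drop weakly aligned pairs, then rewrite surviving captions."""
    kept = [ex for ex in examples if ex.clip_score >= min_clip]
    return [Example(ex.image_path,
                    refine_caption(ex.image_path, ex.caption),
                    ex.clip_score) for ex in kept]

batch = [Example("cat.jpg", "一只猫", 0.31), Example("noise.jpg", "image", 0.10)]
print(pipeline_pass(batch))  # only the well-aligned pair survives, recaptioned
```

Repeating such passes as the model improves is what "iterative model optimization" amounts to in the abstract.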

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar caption-refinement and data-pipeline techniques could be applied to other non-English languages to close performance gaps in text-to-image generation.
  • The multi-resolution design may reduce the need for separate models when users want both low- and high-resolution outputs from the same prompt.
  • Because the system already handles dialogue, it could be integrated into creative tools where users iteratively describe changes in their native language.

Load-bearing premise

The evaluation protocol with more than fifty professional raters measures genuine fine-grained Chinese understanding without bias from how prompts are chosen, how evaluators are selected, or how comparisons are presented.

What would settle it

A new human evaluation using the same protocol but with a larger, demographically broader pool of raters and a fresh set of Chinese prompts that shows no advantage or a reversal for Hunyuan-DiT against the same open-source baselines would falsify the state-of-the-art claim.
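
"No advantage" can be made statistically concrete. A sketch, with invented counts, of the two-sided binomial test on pairwise preferences against the 50% null that such a replication would report:

```python
from scipy.stats import binomtest

# Invented numbers: how often raters preferred Hunyuan-DiT in head-to-head
# comparisons against one open-source baseline on fresh Chinese prompts.
wins, trials = 265, 500
result = binomtest(wins, trials, p=0.5, alternative="two-sided")
print(f"win rate {wins / trials:.1%}, two-sided p = {result.pvalue:.3f}")

# A win rate statistically indistinguishable from 50% (or below it) across
# the baselines would undercut the state-of-the-art claim.
```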

read the original abstract

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Hunyuan-DiT, a multi-resolution diffusion transformer for text-to-image generation with fine-grained English and Chinese understanding. It covers the transformer architecture, text encoder, positional encoding, a from-scratch data pipeline for iterative optimization, training of a Multimodal Large Language Model for caption refinement, and multi-turn multimodal dialogue support. The central claim is that Hunyuan-DiT achieves state-of-the-art Chinese-to-image generation via a holistic human evaluation with more than 50 professional evaluators, outperforming other open-source models. Code and pretrained models are released publicly.

Significance. If the human evaluation protocol can be shown to be unbiased and reproducible, the work would represent a meaningful contribution to open-source multilingual text-to-image models by delivering strong Chinese language understanding and interactive generation capabilities.

major comments (1)
  1. [Abstract] The SOTA claim for Chinese-to-image generation depends entirely on a 'holistic human evaluation protocol with more than 50 professional human evaluators,' yet the manuscript supplies no details on test prompt distribution, scoring rubrics for fine-grained Chinese understanding, evaluator blinding, inter-rater agreement, statistical tests, or exact comparison baselines. Without these, the central performance claim cannot be verified or reproduced from the released code.
minor comments (1)
  1. [Abstract] The phrasing around the data pipeline and MLLM caption refinement is somewhat dense; a brief enumeration of key steps would improve clarity for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We appreciate the referee's thorough review and constructive comments. We acknowledge the need for greater transparency in our human evaluation protocol to substantiate the state-of-the-art claims. We will revise the manuscript accordingly to provide all necessary details for reproducibility.

read point-by-point responses
  1. Referee: [Abstract] The SOTA claim for Chinese-to-image generation depends entirely on a 'holistic human evaluation protocol with more than 50 professional human evaluators,' yet the manuscript supplies no details on test prompt distribution, scoring rubrics for fine-grained Chinese understanding, evaluator blinding, inter-rater agreement, statistical tests, or exact comparison baselines. Without these, the central performance claim cannot be verified or reproduced from the released code.

    Authors: We agree that the manuscript currently lacks sufficient details on the human evaluation protocol in both the abstract and the main body. This is a valid concern for verifying the central performance claim. In the revised manuscript, we will introduce a new subsection detailing the evaluation methodology. Specifically, we will describe the test prompt distribution (including examples and categorization for Chinese understanding), the scoring rubrics used for assessing fine-grained Chinese understanding and other criteria, the blinding procedures for evaluators, inter-rater agreement statistics, the statistical tests performed, and the exact list of comparison baselines. Additionally, we will make the evaluation prompts and rubrics publicly available alongside the code. We believe these additions will fully address the referee's concerns and allow independent verification of our results. revision: yes
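
Among the promised additions, inter-rater agreement is the most mechanical to report. A self-contained sketch of Fleiss' kappa over an invented ratings matrix (rows are prompts; columns count how many raters chose each outcome); the numbers are illustrative, not data from the paper:

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    # counts: (n_items, n_categories); each row sums to the number of raters.
    n_raters = counts.sum(axis=1)[0]
    p_item = (np.sum(counts * counts, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()                      # observed agreement
    p_cat = counts.sum(axis=0) / counts.sum()  # category marginals
    p_e = np.sum(p_cat ** 2)                   # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# e.g. 5 prompts, 50 raters each picking which of two models did better
ratings = np.array([[38, 12], [45, 5], [30, 20], [41, 9], [36, 14]])
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.3f}")
```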

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical construction of Hunyuan-DiT via transformer design choices, a data pipeline for iterative optimization, and training of a separate MLLM for caption refinement, followed by human evaluation against external open-source baselines. It describes no mathematical derivations, predictions, or first-principles results that reduce to their own inputs by construction, no fitted parameters renamed as outputs, and no load-bearing self-citations. The SOTA claim is grounded in an external human protocol rather than internal self-reference, making the overall chain self-contained with independent content.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical training of a large neural network; many architecture and training choices are free parameters chosen to fit performance on image-text data. No new physical entities are postulated.

free parameters (2)
  • model scale and hyperparameters
    Transformer depth, width, attention heads, learning rate schedule, and resolution-specific parameters chosen during development to optimize generation quality.
  • data filtering thresholds
    Criteria used in the data pipeline to select and update training images and captions.
axioms (1)
  • domain assumption: Standard diffusion model assumptions on data distribution and denoising process
    The training relies on the usual assumption that image-text pairs follow a distribution amenable to iterative denoising.
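
Stated concretely, that axiom amounts to the usual Gaussian forward process and a denoising objective. One conventional DDPM-style formulation, with notation assumed rather than taken from the paper:

```latex
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t)\, I\big),
\qquad
\mathcal{L} = \mathbb{E}_{x_0,\, c,\, \epsilon,\, t}\,
  \big\lVert \epsilon - \epsilon_\theta(x_t, t, c) \big\rVert^2
```

Here $c$ is the text conditioning; the assumption is that image-text pairs $(x_0, c)$ come from a distribution this iterative denoiser can fit.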

pith-pipeline@v0.9.0 · 5610 in / 1142 out tokens · 32075 ms · 2026-05-16T14:54:58.300525+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Asymmetric Flow Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...

  2. VACE: All-in-One Video Creation and Editing

    cs.CV 2025-03 unverdicted novelty 7.0

    VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.

  3. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  4. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.

  5. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.

  6. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  7. Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    L2P trains per-timestep linear weights on feature trajectories in about 20 seconds to enable aggressive caching in DiT models, delivering up to 4.55x FLOPs reduction with maintained visual quality.

  8. CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

    cs.MM 2026-04 unverdicted novelty 6.0

    CineAGI is a multi-agent LLM framework that generates multi-scene movies with improved character consistency, narrative coherence, and audio-visual alignment.

  9. When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.

  10. The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor

    cs.HC 2026-01 conditional novelty 6.0

    LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.

  11. HunyuanImage 3.0 Technical Report

    cs.CV 2025-09 accept novelty 6.0

    HunyuanImage 3.0 delivers an 80B-parameter MoE model unifying multimodal understanding and generation that matches prior state-of-the-art results while being fully open-sourced.

  12. SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    cs.CV 2024-10 unverdicted novelty 6.0

    Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.

  13. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  14. ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance

    cs.CV 2026-04 unverdicted novelty 5.0

    ACPO uses anchor-based regularization with NR-IQA guidance to enable stable perceptual quality improvements in diffusion model fine-tuning.

  15. Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

    cs.LG 2026-04 unverdicted novelty 5.0

    Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.

  16. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

  17. Qwen-Image-2.0 Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.

  18. OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization

    cs.CV 2026-02 unverdicted novelty 4.0

    OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.

  19. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 18 Pith papers · 6 internal anchors

  1. [1]

    https://www.midjourney.com/home

    Midjourney. https://www.midjourney.com/home

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  3. [3]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

  4. [4]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023

  5. [5]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023

  6. [6]

    Muse: Text-to-image generation via masked generative transformers

    Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Patrick Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. In International Conference on Machine Learning, pages 4055–4075. PMLR, 2023

  7. [7]

    PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations, 2023

  8. [8]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  9. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020

  10. [10]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024

  11. [11]

    Making llama see and draw with seed tokenizer

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023

  12. [12]

    Matryoshka diffusion models

    Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. Matryoshka diffusion models. In The Twelfth International Conference on Learning Representations, 2023

  13. [13]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020

  14. [14]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  15. [15]

    Dialoggen: Multi-modal interactive dialogue system for multi-turn text-to-image generation

    Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, and Wei Liu. Dialoggen: Multi-modal interactive dialogue system for multi-turn text-to-image generation. arXiv preprint arXiv:2403.08857, 2024

  16. [16]

    Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024

  17. [17]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  18. [18]

    Swinv2-imagen: Hierarchical vision transformer diffusion models for text-to-image generation

    Ruijun Li, Weihua Li, Yi Yang, Hanyu Wei, Jianhua Jiang, and Quan Bai. Swinv2-imagen: Hierarchical vision transformer diffusion models for text-to-image generation. Neural Computing and Applications, pages 1–16, 2023

  19. [19]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014

  20. [20]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

  21. [21]

    Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

    Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2023

  22. [22]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023

  23. [23]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  24. [24]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2023

  25. [25]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  26. [26]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  27. [27]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

  28. [28]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  29. [29]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

  30. [30]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2021

  31. [31]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023

  32. [32]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  33. [33]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  34. [34]

    Pai-diffusion: Constructing and serving a family of open chinese diffusion models for text-to-image synthesis on the cloud

    Chengyu Wang, Zhongjie Duan, Bingyan Liu, Xinyi Zou, Cen Chen, Kui Jia, and Jun Huang. Pai-diffusion: Constructing and serving a family of open chinese diffusion models for text-to-image synthesis on the cloud. arXiv preprint arXiv:2309.05534, 2023

  35. [35]

    Next-gpt: Any-to-any multimodal llm

    Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023

  36. [36]

    Taiyi-diffusion-xl: Advancing bilingual text-to-image generation with large vision-language model support

    Xiaojun Wu, Dixiang Zhang, Ruyi Gan, Junyu Lu, Ziwei Wu, Renliang Sun, Jiaxing Zhang, Pingjian Zhang, and Yan Song. Taiyi-diffusion-xl: Advancing bilingual text-to-image generation with large vision-language model support. arXiv preprint arXiv:2401.14688, 2024

  37. [37]

    Ufogen: You forward once large scale text-to-image generation via diffusion gans

    Yanwu Xu, Yang Zhao, Zhisheng Xiao, and Tingbo Hou. Ufogen: You forward once large scale text-to-image generation via diffusion gans. arXiv preprint arXiv:2311.09257, 2023

  38. [38]

    Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms

    Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. arXiv preprint arXiv:2401.11708, 2024

  39. [39]

    Altdiffusion: A multilingual text-to-image diffusion model

    Fulong Ye, Guang Liu, Xinya Wu, and Ledell Wu. Altdiffusion: A multilingual text-to-image diffusion model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6648–6656, 2024

  40. [40]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. arXiv preprint arXiv:2311.18828, 2023

  41. [41]

    Capsfusion: Rethinking image-text data at scale

    Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, and Jingjing Liu. Capsfusion: Rethinking image-text data at scale. arXiv preprint arXiv:2310.20550, 2023