pith. machine review for the scientific record.

arxiv: 2405.08748 · v1 · submitted 2024-05-14 · 💻 cs.CV

Recognition: 1 theorem link

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image generation · diffusion transformer · Chinese language understanding · multimodal dialogue · image caption refinement · multi-resolution architecture · open-source model

The pith

Hunyuan-DiT is a diffusion transformer that turns fine-grained Chinese and English prompts into images, reaching state-of-the-art quality among open-source models through custom architecture and a refined data pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Hunyuan-DiT as a text-to-image diffusion transformer built to handle fine-grained prompts in both English and Chinese. It does this by designing the transformer blocks, text encoders, and positional encodings from the ground up, while creating a full data pipeline that refreshes training examples and uses a separate multimodal language model to improve image captions. The resulting model also supports ongoing conversations where users can describe changes and receive updated images. A human evaluation involving more than fifty professional raters shows it outperforming other open-source systems on Chinese prompts. This matters for anyone who needs precise visual output from natural Chinese descriptions rather than English translations.

Core claim

Hunyuan-DiT is a multi-resolution diffusion transformer whose structure, text encoder, and positional encoding are jointly designed to capture fine-grained bilingual language understanding; a supporting data pipeline and caption-refining multimodal large language model allow iterative improvement, enabling the model to conduct multi-turn multimodal dialogue and to surpass prior open-source models on Chinese-to-image tasks according to holistic human ratings.

What carries the argument

The Hunyuan-DiT diffusion transformer, whose multi-resolution blocks, bilingual text encoder, and learned positional encodings jointly process prompts to produce images while supporting dialogue-based refinement.
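
As a rough mental model of that machinery: image latent tokens self-attend, then cross-attend into the bilingual text features. Below is a minimal PyTorch sketch of such a text-conditioned block. Every class name, dimension, and the fused-encoder detail is an assumption for illustration, not the released Hunyuan-DiT code.

```python
import torch
import torch.nn as nn

class TextConditionedBlock(nn.Module):
    """One transformer block: self-attention over image latent tokens,
    cross-attention into text features, then an MLP. Illustrative only."""

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) image latent tokens at some resolution
        # text: (B, T, dim) bilingual text-encoder states (e.g. CLIP-style
        # plus T5-style features projected to a shared width -- assumed here)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))

block = TextConditionedBlock()
tokens = torch.randn(2, 256, 1024)   # e.g. a 16x16 latent grid, flattened
prompt = torch.randn(2, 77, 1024)    # fused Chinese/English text features
out = block(tokens, prompt)          # -> (2, 256, 1024)
```

Multi-resolution support in such designs typically means the same blocks process latent grids of different sizes, with position encodings that generalize across them.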

If this is right

  • The model can carry on multi-turn conversations that iteratively refine generated images based on follow-up instructions in Chinese or English.
  • A dedicated data pipeline that continuously updates and evaluates training examples supports repeated model improvements without starting from scratch.
  • Caption refinement performed by a separate multimodal large language model directly improves the model's grasp of detailed Chinese descriptions (a sketch of this step follows this list).
  • The same architecture delivers competitive results on English prompts while leading on Chinese ones among open-source systems.
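
A minimal sketch of how the data-pipeline and caption-refinement bullets above could compose in code. The threshold name, the alignment scorer, and the stand-in `refine_caption` are all assumptions for illustration; the paper's actual pipeline is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class Example:
    image_path: str
    caption: str
    clip_score: float   # image-text similarity from a separate scorer (assumed)

def refine_caption(image_path: str, raw_caption: str) -> str:
    # Stand-in for a multimodal LLM call, conceptually something like
    # mllm.generate(image=..., prompt=f"Describe in detail: {raw_caption}").
    return f"[detailed rewrite of] {raw_caption}"

def pipeline_pass(examples: list[Example], min_clip: float = 0.25) -> list[Example]:
    """One iteration: drop weakly aligned pairs, then rewrite surviving captions."""
    kept = [ex for ex in examples if ex.clip_score >= min_clip]
    return [Example(ex.image_path,
                    refine_caption(ex.image_path, ex.caption),
                    ex.clip_score) for ex in kept]

batch = [Example("cat.jpg", "一只猫", 0.31), Example("noise.jpg", "image", 0.10)]
print(pipeline_pass(batch))  # only the well-aligned pair survives, recaptioned
```

Repeating such passes as the model improves is what "iterative model optimization" amounts to in the abstract.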

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar caption-refinement and data-pipeline techniques could be applied to other non-English languages to close performance gaps in text-to-image generation.
  • The multi-resolution design may reduce the need for separate models when users want both low- and high-resolution outputs from the same prompt.
  • Because the system already handles dialogue, it could be integrated into creative tools where users iteratively describe changes in their native language.

Load-bearing premise

The evaluation protocol with more than fifty professional raters measures genuine fine-grained Chinese understanding without bias from how prompts are chosen, how evaluators are selected, or how comparisons are presented.

What would settle it

A new human evaluation using the same protocol but with a larger, demographically broader pool of raters and a fresh set of Chinese prompts that shows no advantage or a reversal for Hunyuan-DiT against the same open-source baselines would falsify the state-of-the-art claim.
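
"No advantage" can be made statistically concrete. A sketch, with invented counts, of the two-sided binomial test on pairwise preferences against the 50% null that such a replication would report:

```python
from scipy.stats import binomtest

# Invented numbers: how often raters preferred Hunyuan-DiT in head-to-head
# comparisons against one open-source baseline on fresh Chinese prompts.
wins, trials = 265, 500
result = binomtest(wins, trials, p=0.5, alternative="two-sided")
print(f"win rate {wins / trials:.1%}, two-sided p = {result.pvalue:.3f}")

# A win rate statistically indistinguishable from 50% (or below it) across
# the baselines would undercut the state-of-the-art claim.
```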

read the original abstract

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Hunyuan-DiT, a multi-resolution diffusion transformer for text-to-image generation with fine-grained English and Chinese understanding. It covers the transformer architecture, text encoder, positional encoding, a from-scratch data pipeline for iterative optimization, training of a Multimodal Large Language Model for caption refinement, and multi-turn multimodal dialogue support. The central claim is that Hunyuan-DiT achieves state-of-the-art Chinese-to-image generation via a holistic human evaluation with more than 50 professional evaluators, outperforming other open-source models. Code and pretrained models are released publicly.

Significance. If the human evaluation protocol can be shown to be unbiased and reproducible, the work would represent a meaningful contribution to open-source multilingual text-to-image models by delivering strong Chinese language understanding and interactive generation capabilities.

major comments (1)
  1. [Abstract] The SOTA claim for Chinese-to-image generation depends entirely on a 'holistic human evaluation protocol with more than 50 professional human evaluators,' yet the manuscript supplies no details on test prompt distribution, scoring rubrics for fine-grained Chinese understanding, evaluator blinding, inter-rater agreement, statistical tests, or exact comparison baselines. Without these, the central performance claim cannot be verified or reproduced from the released code.
minor comments (1)
  1. [Abstract] The phrasing around the data pipeline and MLLM caption refinement is somewhat dense; a brief enumeration of key steps would improve clarity for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We appreciate the referee's thorough review and constructive comments. We acknowledge the need for greater transparency in our human evaluation protocol to substantiate the state-of-the-art claims. We will revise the manuscript accordingly to provide all necessary details for reproducibility.

read point-by-point responses
  1. Referee: [Abstract] The SOTA claim for Chinese-to-image generation depends entirely on a 'holistic human evaluation protocol with more than 50 professional human evaluators,' yet the manuscript supplies no details on test prompt distribution, scoring rubrics for fine-grained Chinese understanding, evaluator blinding, inter-rater agreement, statistical tests, or exact comparison baselines. Without these, the central performance claim cannot be verified or reproduced from the released code.

    Authors: We agree that the manuscript currently lacks sufficient details on the human evaluation protocol in both the abstract and the main body. This is a valid concern for verifying the central performance claim. In the revised manuscript, we will introduce a new subsection detailing the evaluation methodology. Specifically, we will describe the test prompt distribution (including examples and categorization for Chinese understanding), the scoring rubrics used for assessing fine-grained Chinese understanding and other criteria, the blinding procedures for evaluators, inter-rater agreement statistics, the statistical tests performed, and the exact list of comparison baselines. Additionally, we will make the evaluation prompts and rubrics publicly available alongside the code. We believe these additions will fully address the referee's concerns and allow independent verification of our results. revision: yes
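
Among the promised additions, inter-rater agreement is the most mechanical to report. A self-contained sketch of Fleiss' kappa over an invented ratings matrix (rows are prompts; columns count how many raters chose each outcome); the numbers are illustrative, not data from the paper:

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    # counts: (n_items, n_categories); each row sums to the number of raters.
    n_raters = counts.sum(axis=1)[0]
    p_item = (np.sum(counts * counts, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()                      # observed agreement
    p_cat = counts.sum(axis=0) / counts.sum()  # category marginals
    p_e = np.sum(p_cat ** 2)                   # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# e.g. 5 prompts, 50 raters each picking which of two models did better
ratings = np.array([[38, 12], [45, 5], [30, 20], [41, 9], [36, 14]])
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.3f}")
```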

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical construction of Hunyuan-DiT via transformer design choices, a data pipeline for iterative optimization, and training of a separate MLLM for caption refinement, followed by human evaluation against external open-source baselines. It describes no mathematical derivations, predictions, or first-principles results that reduce to their own inputs by construction, no fitted parameters renamed as outputs, and no load-bearing self-citations. The SOTA claim is grounded in an external human protocol rather than internal self-reference, making the overall chain self-contained with independent content.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical training of a large neural network; many architecture and training choices are free parameters chosen to fit performance on image-text data. No new physical entities are postulated.

free parameters (2)
  • model scale and hyperparameters
    Transformer depth, width, attention heads, learning rate schedule, and resolution-specific parameters chosen during development to optimize generation quality.
  • data filtering thresholds
    Criteria used in the data pipeline to select and update training images and captions.
axioms (1)
  • domain assumption: Standard diffusion model assumptions on data distribution and denoising process
    The training relies on the usual assumption that image-text pairs follow a distribution amenable to iterative denoising.
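
Stated concretely, that axiom amounts to the usual Gaussian forward process and a denoising objective. One conventional DDPM-style formulation, with notation assumed rather than taken from the paper:

```latex
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1 - \bar\alpha_t)\, I\big),
\qquad
\mathcal{L} = \mathbb{E}_{x_0,\, c,\, \epsilon,\, t}\,
  \big\lVert \epsilon - \epsilon_\theta(x_t, t, c) \big\rVert^2
```

Here $c$ is the text conditioning; the assumption is that image-text pairs $(x_0, c)$ come from a distribution this iterative denoiser can fit.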

pith-pipeline@v0.9.0 · 5610 in / 1142 out tokens · 32075 ms · 2026-05-16T14:54:58.300525+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Asymmetric Flow Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...

  2. VACE: All-in-One Video Creation and Editing

    cs.CV 2025-03 unverdicted novelty 7.0

    VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.

  3. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  4. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.

  5. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.

  6. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  7. Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    L2P trains per-timestep linear weights on feature trajectories in about 20 seconds to enable aggressive caching in DiT models, delivering up to 4.55x FLOPs reduction with maintained visual quality.

  8. CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

    cs.MM 2026-04 unverdicted novelty 6.0

    CineAGI is a multi-agent LLM framework that generates multi-scene movies with improved character consistency, narrative coherence, and audio-visual alignment.

  9. When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.

  10. The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor

    cs.HC 2026-01 conditional novelty 6.0

    LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.

  11. HunyuanImage 3.0 Technical Report

    cs.CV 2025-09 accept novelty 6.0

    HunyuanImage 3.0 delivers an 80B-parameter MoE model unifying multimodal understanding and generation that matches prior state-of-the-art results while being fully open-sourced.

  12. SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    cs.CV 2024-10 unverdicted novelty 6.0

    Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.

  13. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  14. ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance

    cs.CV 2026-04 unverdicted novelty 5.0

    ACPO uses anchor-based regularization with NR-IQA guidance to enable stable perceptual quality improvements in diffusion model fine-tuning.

  15. Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

    cs.LG 2026-04 unverdicted novelty 5.0

    Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.

  16. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

  17. Qwen-Image-2.0 Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.

  18. OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization

    cs.CV 2026-02 unverdicted novelty 4.0

    OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.

  19. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 18 Pith papers · 6 internal anchors

  1. [1]

    https://www.midjourney.com/home

    Midjourney. https://www.midjourney.com/home

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  3. [3]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

  4. [4]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023

  5. [5]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023

  6. [6]

    Muse: Text-to-image generation via masked generative transformers

    Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Patrick Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. In International Conference on Machine Learning, pages 4055–4075. PMLR, 2023

  7. [7]

    PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations, 2023

  8. [8]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  9. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020

  10. [10]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024

  11. [11]

    Making llama see and draw with seed tokenizer

    Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023

  12. [12]

    Matryoshka diffusion models

    Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. Matryoshka diffusion models. In The Twelfth International Conference on Learning Representations, 2023

  13. [13]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020

  14. [14]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017

  15. [15]

    Dialoggen: Multi-modal interactive dialogue system for multi-turn text-to-image generation

    Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, and Wei Liu. Dialoggen: Multi-modal interactive dialogue system for multi-turn text-to-image generation. arXiv preprint arXiv:2403.08857, 2024

  16. [16]

    Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024

  17. [17]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  18. [18]

    Swinv2-imagen: Hierarchical vision transformer diffusion models for text-to-image generation

    Ruijun Li, Weihua Li, Yi Yang, Hanyu Wei, Jianhua Jiang, and Quan Bai. Swinv2-imagen: Hierarchical vision transformer diffusion models for text-to-image generation. Neural Computing and Applications, pages 1–16, 2023

  19. [19]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014

  20. [20]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

  21. [21]

    Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

    Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2023

  22. [22]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023

  23. [23]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  24. [24]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2023

  25. [25]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  26. [26]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  27. [27]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

  28. [28]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  29. [29]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

  30. [30]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2021

  31. [31]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023

  32. [32]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  33. [33]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  34. [34]

    Pai-diffusion: Constructing and serving a family of open chinese diffusion models for text-to-image synthesis on the cloud

    Chengyu Wang, Zhongjie Duan, Bingyan Liu, Xinyi Zou, Cen Chen, Kui Jia, and Jun Huang. Pai-diffusion: Constructing and serving a family of open chinese diffusion models for text-to-image synthesis on the cloud. arXiv preprint arXiv:2309.05534, 2023

  35. [35]

    Next-gpt: Any-to-any multimodal llm

    Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023

  36. [36]

    Taiyi-diffusion-xl: Advancing bilingual text-to-image generation with large vision-language model support

    Xiaojun Wu, Dixiang Zhang, Ruyi Gan, Junyu Lu, Ziwei Wu, Renliang Sun, Jiaxing Zhang, Pingjian Zhang, and Yan Song. Taiyi-diffusion-xl: Advancing bilingual text-to-image generation with large vision-language model support. arXiv preprint arXiv:2401.14688, 2024

  37. [37]

    Ufogen: You forward once large scale text-to-image generation via diffusion gans

    Yanwu Xu, Yang Zhao, Zhisheng Xiao, and Tingbo Hou. Ufogen: You forward once large scale text-to-image generation via diffusion gans. arXiv preprint arXiv:2311.09257, 2023

  38. [38]

    Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms

    Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. arXiv preprint arXiv:2401.11708, 2024

  39. [39]

    Altdiffusion: A multilingual text-to-image diffusion model

    Fulong Ye, Guang Liu, Xinya Wu, and Ledell Wu. Altdiffusion: A multilingual text-to-image diffusion model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6648–6656, 2024

  40. [40]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. arXiv preprint arXiv:2311.18828, 2023

  41. [41]

    Capsfusion: Rethinking image-text data at scale

    Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, and Jingjing Liu. Capsfusion: Rethinking image-text data at scale. arXiv preprint arXiv:2310.20550, 2023