Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding
Pith reviewed 2026-05-16 14:54 UTC · model grok-4.3
The pith
Hunyuan-DiT is a diffusion transformer that generates images from Chinese and English text, reaching state-of-the-art fine-grained Chinese understanding among open-source models through a custom architecture and careful data handling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hunyuan-DiT is a multi-resolution diffusion transformer whose structure, text encoder, and positional encoding are jointly designed for fine-grained bilingual language understanding. A supporting data pipeline and a caption-refining multimodal large language model enable iterative improvement, allowing the model to conduct multi-turn multimodal dialogue and to surpass prior open-source models on Chinese-to-image generation according to holistic human ratings.
What carries the argument
The Hunyuan-DiT diffusion transformer, whose multi-resolution blocks, bilingual text encoder, and positional encodings jointly process prompts to produce images while supporting dialogue-based refinement.
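To make those moving parts concrete, below is a minimal PyTorch sketch of one text-conditioned DiT-style block: latent image tokens self-attend, cross-attend to bilingual prompt embeddings, and are modulated by the diffusion timestep. The layer sizes, the adaLN-style modulation, and the single fused text stream are illustrative assumptions for exposition, not the released Hunyuan-DiT code.

```python
# A minimal sketch, assuming adaLN-style timestep modulation and a
# single fused bilingual text-embedding stream; illustrative only.
import torch
import torch.nn as nn

class TextConditionedDiTBlock(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        # Cross-attention lets image tokens attend to prompt embeddings,
        # e.g. a concatenation of CLIP-style and T5-style text features.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.ada = nn.Linear(dim, 3 * dim)  # per-branch scale from timestep

    def forward(self, x, text, t_emb):
        # x: (B, N, D) latent image tokens; text: (B, M, D); t_emb: (B, D)
        s1, s2, s3 = self.ada(t_emb).chunk(3, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1))
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1))
        x = x + self.cross_attn(h, text, text, need_weights=False)[0]
        h = self.norm3(x) * (1 + s3.unsqueeze(1))
        return x + self.mlp(h)

block = TextConditionedDiTBlock()
out = block(torch.randn(2, 64, 512), torch.randn(2, 77, 512), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 64, 512])
```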
If this is right
- The model can carry on multi-turn conversations that iteratively refine generated images based on follow-up instructions in Chinese or English.
- A dedicated data pipeline that continuously updates and evaluates training examples supports repeated model improvements without starting from scratch.
- Caption refinement performed by a separate multimodal large language model directly improves the model's grasp of detailed Chinese descriptions (a minimal sketch of this recaptioning loop follows the list).
- The same architecture delivers competitive results on English prompts while leading on Chinese ones among open-source systems.
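The caption-refinement point above can be pictured as a corpus pass: a multimodal LLM rewrites raw alt-text into denser, more literal captions, and a crude quality gate filters failures before training. In this sketch, `mllm_caption`, the `Sample` fields, and the length gate are hypothetical stand-ins, not the paper's actual pipeline.

```python
# Hedged sketch of MLLM recaptioning; all names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    raw_caption: str
    refined_caption: str = ""

def mllm_caption(image_path: str, raw_caption: str) -> str:
    """Placeholder for an MLLM call that describes the image in detail,
    keeping factual elements of the raw caption (Chinese or English)."""
    return f"[detailed bilingual description of {image_path}]"

def refine_corpus(samples: list[Sample], min_len: int = 20) -> list[Sample]:
    kept = []
    for s in samples:
        s.refined_caption = mllm_caption(s.image_path, s.raw_caption)
        # Simple quality gate: drop captions that came back too short.
        if len(s.refined_caption) >= min_len:
            kept.append(s)
    return kept

corpus = [Sample("img_001.jpg", "a cat"), Sample("img_002.jpg", "mountain")]
for s in refine_corpus(corpus):
    print(s.image_path, "->", s.refined_caption)
```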
Where Pith is reading between the lines
- Similar caption-refinement and data-pipeline techniques could be applied to other non-English languages to close performance gaps in text-to-image generation.
- The multi-resolution design may reduce the need for separate models when users want both low- and high-resolution outputs from the same prompt (one common realization, aspect-ratio bucketing, is sketched after this list).
- Because the system already handles dialogue, it could be integrated into creative tools where users iteratively describe changes in their native language.
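One common way to train a single model across resolutions, as referenced in the second point above, is aspect-ratio bucketing: each training image is assigned to the fixed resolution whose aspect ratio it best matches. The bucket set and matching rule below are assumptions for the sketch, not the paper's configuration.

```python
# Illustrative aspect-ratio bucketing; the bucket list is hypothetical.
BUCKETS = [(1024, 1024), (1280, 768), (768, 1280), (1152, 896), (896, 1152)]

def nearest_bucket(width: int, height: int) -> tuple[int, int]:
    """Pick the training resolution whose aspect ratio best matches the image."""
    ar = width / height
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ar))

print(nearest_bucket(1920, 1080))  # (1280, 768): landscape bucket
print(nearest_bucket(800, 800))    # (1024, 1024): square bucket
```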
Load-bearing premise
The evaluation protocol with more than fifty professional raters measures genuine fine-grained Chinese understanding without bias from how prompts are chosen, how evaluators are selected, or how comparisons are presented.
What would settle it
A new human evaluation using the same protocol, but with a larger and demographically broader pool of raters and a fresh set of Chinese prompts, would falsify the state-of-the-art claim if it showed no advantage, or a reversal, for Hunyuan-DiT against the same open-source baselines.
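Such a re-run would also need to report agreement among raters before aggregating preferences. Below is a minimal sketch of Fleiss' kappa over per-prompt preference counts; the vote table is fabricated toy data, purely to make the computation concrete.

```python
# Minimal Fleiss' kappa on toy preference data (not the paper's ratings).
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of raters assigning item i to category j."""
    n = counts.sum(axis=1)[0]                # raters per item (assumed constant)
    p_j = counts.sum(axis=0) / counts.sum()  # overall category frequencies
    p_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# 5 prompts x 2 categories (prefers Hunyuan-DiT vs. baseline), 10 raters each.
votes = np.array([[9, 1], [8, 2], [2, 8], [9, 1], [1, 9]])
print(f"kappa = {fleiss_kappa(votes):.2f}")  # 0.46: moderate agreement here
```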
Original abstract
We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Hunyuan-DiT, a multi-resolution diffusion transformer for text-to-image generation with fine-grained English and Chinese understanding. It covers the transformer architecture, text encoder, positional encoding, a from-scratch data pipeline for iterative optimization, training of a Multimodal Large Language Model for caption refinement, and multi-turn multimodal dialogue support. The central claim is that Hunyuan-DiT achieves state-of-the-art Chinese-to-image generation via a holistic human evaluation with more than 50 professional evaluators, outperforming other open-source models. Code and pretrained models are released publicly.
Significance. If the human evaluation protocol can be shown to be unbiased and reproducible, the work would represent a meaningful contribution to open-source multilingual text-to-image models by delivering strong Chinese language understanding and interactive generation capabilities.
Major comments (1)
- [Abstract] The SOTA claim for Chinese-to-image generation depends entirely on a 'holistic human evaluation protocol with more than 50 professional human evaluators,' yet the manuscript supplies no details on test prompt distribution, scoring rubrics for fine-grained Chinese understanding, evaluator blinding, inter-rater agreement, statistical tests, or exact comparison baselines. Without these, the central performance claim cannot be verified or reproduced from the released code.
Minor comments (1)
- [Abstract] The phrasing around the data pipeline and MLLM caption refinement is somewhat dense; a brief enumeration of the key steps would improve clarity for readers.
Simulated Author's Rebuttal
We appreciate the referee's thorough review and constructive comments. We acknowledge the need for greater transparency in our human evaluation protocol to substantiate the state-of-the-art claims. We will revise the manuscript accordingly to provide all necessary details for reproducibility.
Point-by-point responses
- Referee: [Abstract] The SOTA claim for Chinese-to-image generation depends entirely on a 'holistic human evaluation protocol with more than 50 professional human evaluators,' yet the manuscript supplies no details on test prompt distribution, scoring rubrics for fine-grained Chinese understanding, evaluator blinding, inter-rater agreement, statistical tests, or exact comparison baselines. Without these, the central performance claim cannot be verified or reproduced from the released code.
Authors: We agree that the manuscript currently lacks sufficient details on the human evaluation protocol in both the abstract and the main body. This is a valid concern for verifying the central performance claim. In the revised manuscript, we will introduce a new subsection detailing the evaluation methodology. Specifically, we will describe the test prompt distribution (including examples and categorization for Chinese understanding), the scoring rubrics used for assessing fine-grained Chinese understanding and other criteria, the blinding procedures for evaluators, inter-rater agreement statistics, the statistical tests performed, and the exact list of comparison baselines. Additionally, we will make the evaluation prompts and rubrics publicly available alongside the code. We believe these additions will fully address the referee's concerns and allow independent verification of our results.
Revision: yes
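As one concrete example of a statistical test the promised subsection could report, the sketch below bootstraps a confidence interval on the per-prompt win rate from blinded side-by-side comparisons. The outcome vector is fabricated toy data and implies nothing about the paper's actual results.

```python
# Bootstrap CI on a toy win-rate vector; data fabricated for illustration.
import random

random.seed(0)
# 1 = majority of raters preferred Hunyuan-DiT on that prompt, 0 = baseline.
outcomes = [1] * 62 + [0] * 38

def bootstrap_ci(xs, iters=10_000, alpha=0.05):
    means = sorted(
        sum(random.choices(xs, k=len(xs))) / len(xs) for _ in range(iters)
    )
    lo = means[int(alpha / 2 * iters)]
    hi = means[int((1 - alpha / 2) * iters)]
    return lo, hi

lo, hi = bootstrap_ci(outcomes)
print(f"win rate 0.62, 95% CI [{lo:.2f}, {hi:.2f}]")  # CI excluding 0.5 -> advantage
```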
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents an empirical construction of Hunyuan-DiT via transformer design choices, a data pipeline for iterative optimization, and a separately trained MLLM for caption refinement, followed by human evaluation against external open-source baselines. It describes no mathematical derivations or first-principles results that reduce to their own inputs by construction, no fitted parameters renamed as outputs, and no load-bearing self-citations. The SOTA claim is grounded in an external human protocol rather than internal self-reference, making the overall chain self-contained with independent content.
Axiom & Free-Parameter Ledger
Free parameters (2)
- Model scale and hyperparameters (sketched as a hypothetical config object below)
- Data filtering thresholds
Axioms (1)
- Domain assumption: standard diffusion-model assumptions on the data distribution and denoising process
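Read literally, the ledger above amounts to a small configuration surface. A hypothetical sketch of it as a config object follows; every field name and value is a placeholder assumption, not a number from the paper.

```python
# Hypothetical ledger of the free parameters named above; values are
# placeholders for illustration, not figures from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class HunyuanDiTConfig:
    # Free parameter 1: model scale and hyperparameters.
    n_blocks: int = 40
    hidden_dim: int = 1408
    learning_rate: float = 1e-4
    # Free parameter 2: data filtering thresholds.
    min_aesthetic_score: float = 5.0
    min_clip_similarity: float = 0.25

print(HunyuanDiTConfig())
```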
Forward citations
Cited by 19 Pith papers
- Asymmetric Flow Models. Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
- VACE: All-in-One Video Creation and Editing. VACE unifies reference-to-video generation, video-to-video editing, and masked video-to-video editing in one Diffusion Transformer framework using a Video Condition Unit for inputs and a Context Adapter for task injection.
- HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer. A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
- Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition. Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.
- Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition. Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.
- Leveraging Verifier-Based Reinforcement Learning in Image Editing. Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
- Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models. L2P trains per-timestep linear weights on feature trajectories in about 20 seconds to enable aggressive caching in DiT models, delivering up to 4.55x FLOPs reduction with maintained visual quality.
- CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration. CineAGI is a multi-agent LLM framework that generates multi-scene movies with improved character consistency, narrative coherence, and audio-visual alignment.
- When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models. NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.
- The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor. LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.
- HunyuanImage 3.0 Technical Report. HunyuanImage 3.0 delivers an 80B-parameter MoE model unifying multimodal understanding and generation that matches prior state-of-the-art results while being fully open-sourced.
- SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers. Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.
- Emu3: Next-Token Prediction is All You Need. Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
- ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance. ACPO uses anchor-based regularization with NR-IQA guidance to enable stable perceptual quality improvements in diffusion model fine-tuning.
- Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion. Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.
- Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer. Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
- Qwen-Image-2.0 Technical Report. Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
- OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization. OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
Reference graph
Works this paper leans on
- [1]
- [2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
- [3] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
- [4] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023.
- [5] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science, https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
- [6] Huiwen Chang, Han Zhang, Jarred Barber, Aaron Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Patrick Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. In International Conference on Machine Learning, pages 4055–4075. PMLR, 2023.
- [7] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations, 2023.
- [8] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- [9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
- [10] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
- [11] Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making LLaMA see and draw with SEED tokenizer. arXiv preprint arXiv:2310.01218, 2023.
- [12] Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Joshua M Susskind, and Navdeep Jaitly. Matryoshka diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
- [13] Alex Henry, Prudhvi Raj Dachapally, Shubham Shantaram Pawar, and Yuxuan Chen. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4246–4253, 2020.
- [14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
- [15] Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, and Wei Liu. DialogGen: Multi-modal interactive dialogue system for multi-turn text-to-image generation. arXiv preprint arXiv:2403.08857, 2024.
- [16] Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024.
- [17] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742. PMLR, 2023.
- [18] Ruijun Li, Weihua Li, Yi Yang, Hanyu Wei, Jianhua Jiang, and Quan Bai. Swinv2-Imagen: Hierarchical vision transformer diffusion models for text-to-image generation. Neural Computing and Applications, pages 1–16, 2023.
- [19] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740–755. Springer, 2014.
- [20] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- [21] Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, and Qiang Liu. InstaFlow: One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2023.
- [22] Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.
- [23] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [24] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2023.
- [25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [26] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [27] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
- [28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [29] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
- [30] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2021.
- [31] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.
- [32] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [34] Chengyu Wang, Zhongjie Duan, Bingyan Liu, Xinyi Zou, Cen Chen, Kui Jia, and Jun Huang. PAI-Diffusion: Constructing and serving a family of open Chinese diffusion models for text-to-image synthesis on the cloud. arXiv preprint arXiv:2309.05534, 2023.
- [35] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExT-GPT: Any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519, 2023.
- [36] Xiaojun Wu, Dixiang Zhang, Ruyi Gan, Junyu Lu, Ziwei Wu, Renliang Sun, Jiaxing Zhang, Pingjian Zhang, and Yan Song. Taiyi-Diffusion-XL: Advancing bilingual text-to-image generation with large vision-language model support. arXiv preprint arXiv:2401.14688, 2024.
- [37] Yanwu Xu, Yang Zhao, Zhisheng Xiao, and Tingbo Hou. UFOGen: You forward once large scale text-to-image generation via diffusion GANs. arXiv preprint arXiv:2311.09257, 2023.
- [38] Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal LLMs. arXiv preprint arXiv:2401.11708, 2024.
- [39] Fulong Ye, Guang Liu, Xinya Wu, and Ledell Wu. AltDiffusion: A multilingual text-to-image diffusion model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 6648–6656, 2024.
- [40] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. arXiv preprint arXiv:2311.18828, 2023.
- [41] Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Xinlong Wang, and Jingjing Liu. CapsFusion: Rethinking image-text data at scale. arXiv preprint arXiv:2310.20550, 2023.