
arxiv: 2606.28421 · v1 · pith:ICPFKGWSnew · submitted 2026-06-25 · 💻 cs.CV · cs.AI

JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators

Pith reviewed 2026-06-30 00:58 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords text-to-image diffusion · on-device inference · edge deployment · Chinese prompt alignment · domestic AI accelerators · GenEval benchmark · model distillation · mobile text-to-image

The pith

JuZhou 1.0 is a 0.387-billion-parameter text-to-image model trained entirely on domestic Chinese accelerators that runs fully offline on smartphones; its 28-step base model scores 0.69 on GenEval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents JuZhou 1.0 as an ultra-lightweight text-to-image diffusion model built for fully offline on-device execution. It combines a 0.385B-parameter U-Net with a 1.90M-parameter distilled decoder, and uses Rectified Flow training followed by DMD2 distillation to reduce inference to 4 sampling steps. The model is trained on 9M curated Chinese image-text pairs so that it accepts Chinese prompts natively, and its entire training and distillation pipeline runs on Sugon K100 accelerators. The 28-step base version reaches an overall GenEval score of 0.69, exceeding published results for SDXL, SD3-Medium, and IF-XL. In on-device tests, the 4-step U-Net denoising branch completes in about 1.6 seconds on the Snapdragon 8 Elite Gen 5 Mobile Platform, and the full Android poetry-to-image pipeline takes 4.5 seconds on a Xiaomi 17 Pro Max.

Core claim

JuZhou 1.0 achieves an overall GenEval score of 0.69 with its 28-step base model, outperforming published baselines including SDXL (2.6B, 0.55), SD3-Medium (2B, 0.62), and IF-XL (4.3B, 0.61), while being designed for fully offline on-device execution on mobile platforms after training on domestic accelerators.
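
To make the size of the claimed gap concrete, the figures quoted in the claim can be tabulated. The sketch below uses only the parameter counts and GenEval scores stated above; the variable and function names are illustrative, and the baseline scores are the published ones, not values re-computed under a shared protocol.

```python
# Parameter counts (in billions) and overall GenEval scores as quoted in the claim.
# JuZhou total = 0.385B U-Net + 1.90M (0.0019B) distilled decoder.
JUZHOU = ("JuZhou 1.0 (28-step base)", 0.385 + 0.0019, 0.69)
BASELINES = [
    ("SDXL", 2.6, 0.55),
    ("SD3-Medium", 2.0, 0.62),
    ("IF-XL", 4.3, 0.61),
]


def compare(model, baselines):
    """Return (name, GenEval margin, parameter ratio) for each baseline."""
    _, params, score = model
    rows = []
    for name, b_params, b_score in baselines:
        margin = round(score - b_score, 2)    # how far JuZhou is ahead
        ratio = round(b_params / params, 1)   # how much larger the baseline is
        rows.append((name, margin, ratio))
    return rows


rows = compare(JUZHOU, BASELINES)
for name, margin, ratio in rows:
    print(f"{name}: JuZhou ahead by {margin:.2f} GenEval, baseline has {ratio}x the parameters")
```

The margins (0.07 to 0.14) are only as meaningful as the comparability of the underlying evaluations, which is the premise examined further down.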

What carries the argument

The compact image-generation backbone consisting of a 0.385B-parameter denoising U-Net and a 1.90M-parameter distilled decoder, combined with Rectified Flow training and DMD2 distillation that reduces inference to 4 sampling steps.
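
Why the step count matters can be illustrated with a toy Euler sampler for a rectified-flow model. This is a minimal sketch, not the paper's sampler: `toy_velocity` is a hypothetical stand-in for the trained U-Net, the convention that t = 0 is noise and t = 1 is data is an assumption, and the DMD2 distillation stage itself is not shown.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)  # one network evaluation per sampling step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for the denoiser: a perfectly straight flow toward TARGET.
TARGET = [1.0, -2.0, 0.5]


def toy_velocity(x, t):
    # Points from the current state straight at TARGET, scaled by remaining time.
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


noise = [0.0, 0.0, 0.0]
out_4 = euler_sample(toy_velocity, noise, num_steps=4)
out_28 = euler_sample(toy_velocity, noise, num_steps=28)
print(out_4, out_28)
```

On this idealized straight flow, 4 Euler steps reach the same endpoint as 28. Learned flows are generally not perfectly straight, which is one reason a distillation stage is paired with Rectified Flow training when the target is few-step sampling.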

If this is right

  • The 4-step distilled model enables practical fully offline text-to-image generation on smartphones after a single installation.
  • Direct Chinese prompting works without external translation steps at inference time.
  • The complete training and distillation pipeline can be executed on domestically developed Sugon K100 accelerators.
  • The 4-step U-Net denoising branch runs in approximately 1.6 seconds on the Snapdragon 8 Elite Gen 5 Mobile Platform, and the full Android poetry-to-image pipeline completes in 4.5 seconds on a Xiaomi 17 Pro Max.
  • The full poetry-to-image pipeline is validated on Android, and the core CLIP-U-Net-VAE generation branch is validated on iOS.
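
The reported timings imply a rough latency budget. The sketch below is back-of-the-envelope arithmetic on the two quoted numbers; it assumes that both were measured on the same device and that the four denoising steps cost about the same, neither of which this summary confirms.

```python
UNET_BRANCH_S = 1.6     # reported: 4-step U-Net denoising branch
FULL_PIPELINE_S = 4.5   # reported: full Android poetry-to-image pipeline
NUM_STEPS = 4

# Average cost of a single denoising step (assumes equal cost per step).
per_step_s = UNET_BRANCH_S / NUM_STEPS

# Everything outside the U-Net loop: prompt refinement, text encoding,
# VAE decoding, and runtime overhead (assumes same-device measurements).
other_stages_s = FULL_PIPELINE_S - UNET_BRANCH_S

unet_share = UNET_BRANCH_S / FULL_PIPELINE_S

print(f"per denoising step: {per_step_s:.2f} s")
print(f"non-U-Net stages:   {other_stages_s:.1f} s")
print(f"U-Net share:        {unet_share:.0%}")
```

Under these assumptions, roughly two thirds of the end-to-end time would fall outside the denoising loop, so further step reduction alone would have limited effect on the full pipeline latency.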

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The combination of distillation and domestic hardware training could reduce dependence on foreign cloud infrastructure for image generation tasks.
  • Native Chinese semantic alignment may improve prompt fidelity for users writing in Chinese compared with translated pipelines.
  • The reported parameter count and step reduction suggest similar compression techniques could be tested on other edge vision or multimodal tasks.

Load-bearing premise

GenEval benchmark scores can be compared directly across models of different scales and training regimes, and the reported on-device timings reflect typical performance without undisclosed hardware-specific optimizations.

What would settle it

A side-by-side GenEval evaluation of JuZhou 1.0 and the baseline models run on the identical benchmark code and data splits, or a hardware profiler trace of the mobile inference confirming the reported latencies without hidden optimizations.
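
One way to picture the first test is as a protocol check that refuses to compare scores unless every evaluation setting matches. This is an illustrative sketch only: the `EvalProtocol` fields and all values below are hypothetical placeholders, not settings reported by the paper or defined by GenEval.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical record of the settings behind one benchmark score."""
    benchmark_version: str
    prompt_language: str
    sampling_steps: int
    guidance_scale: float
    samples_per_prompt: int


def protocol_mismatches(a, b):
    """Return the names of the fields on which two protocols differ."""
    da, db = asdict(a), asdict(b)
    return sorted(key for key in da if da[key] != db[key])


# Placeholder values for illustration only.
run_a = EvalProtocol("v1", "en", 28, 7.5, 4)
run_b = EvalProtocol("v1", "en", 50, 7.5, 4)

diff = protocol_mismatches(run_a, run_b)
print("comparable" if not diff else f"not comparable, differs on: {diff}")
```

A score taken from a prior publication would usually arrive without a complete record of this kind, which is why re-running all models under one protocol is the more decisive test.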

Figures

Figures reproduced from arXiv: 2606.28421 by Bin Jiang, Ce Chen, Chen Ma, Congrui Wang, Jingjing Zhou, Junhao Xiao, Kede Ma, Long Lan, Mingyang Geng, Qilin Lu, Qilin Sun, Shanzhi Gu, Shijia Li, Tongliang Liu, Xingyu Wang, Xinwang Liu, Yaoguang Jin, Yao Wu, Yaqing Hu, Yifan Peng, Yingqi Peng, Yonglin Li, Zhaoyang Qu, Zhenchen Wan, Zhengpeng Xing, Zufeng Zhang.

Figure 1: Overview of JuZhou 1.0. An edge-native text-to-image foundation model featuring an ultra-light on-device architecture, native Chinese semantic alignment, and fully offline mobile deployment. The entire training pipeline is completed exclusively on domestic compute (Sugon K100).
Figure 2: Overview of the General Chinese Text-to-Image Curation Pipeline. Two stages construct a Chinese image-text corpus for training the base model: (1) Data Synthesis starts from 9M filtered English text-to-image prompts from DiffusionDB and synthesizes images via Stable Diffusion 3.5 Large; (2) Prompt Translation converts prompts into Chinese using Qwen3-235B-Instruct, producing training-ready Chinese prompt-im…
Figure 3
Figure 4: Overview of the JuZhou 1.0 framework. A raw poem or user prompt is first refined by Qwen3-1.7B and then encoded by CN-CLIP to obtain Chinese semantic conditioning. The conditioning signal is injected into a 0.385B-parameter denoising U-Net for efficient high-resolution generation. The generated latent representation is decoded by an ultra-compact 1.9M-parameter VAE decoder without attention layers. DMD2 d…
Figure 5: Overview of the lightweight denoising network. (a) Denoiser architecture. (b) Trans…
Figure 6: Detailed modules in the denoising network. (a) Cross-attention. (b) Self-attention. (c) …
Figure 7: Overview of the compact VAE decoder. (a) VAE decoder architecture. (b) VAE decoder …
Figure 8: Heterogeneous deployment architecture across Android and iOS platforms. On Android, the pipeline is partitioned between CPU (MNN engine for language model and text encoding) and NPU (QNN for U-Net and VAE decoding). On iOS, all diffusion modules run on Core ML in FP16 precision without additional quantization.
[Adjacent bar chart: latency (s) for Xiaomi 17 Pro Max, iQOO Neo11, Redmi K60, and iPhone 15 Pro.]
Figure 10: Screenshots of the Mojie application on Android devices, illustrating (a) the main interface, …
Figure 11: Qualitative comparison of 1024 × 1024 images generated by our 28-step base model against the Stable Diffusion family (SD 1.5, SD 2.1, SDXL, SD 3.5-L) across five diverse prompts. Despite its 0.385B parameter budget, JuZhou 1.0 consistently achieves competitive visual fidelity in lighting, texture, and compositional structure.
Figure 12: Qualitative visual comparison under different sampling step settings. Each row presents …
Figure 13: Visualization of images generated directly from classical Chinese poetry by our distilled …
Original abstract

Text-to-image (T2I) diffusion models typically require substantial computational resources and cloud infrastructure, posing significant challenges for edge deployment in terms of latency, cost, and user privacy. We present JuZhou 1.0, an ultra-lightweight T2I foundation model designed for fully offline, on-device execution. JuZhou 1.0 achieves its efficiency through four key designs: (1) a compact image-generation backbone consisting of a 0.385B-parameter denoising U-Net and a 1.90M-parameter distilled decoder, totaling approximately 0.387B parameters; (2) Rectified Flow training combined with DMD2 distillation, reducing inference to 4 sampling steps; (3) Chinese semantic alignment trained on 9M curated image-text pairs, enabling direct Chinese prompting without external translation at inference time; and (4) a training and distillation pipeline completed on domestically developed Sugon K100 AI accelerators without relying on NVIDIA GPUs for training or distillation. Despite its compact scale, the 28-step base model of JuZhou 1.0 achieves an overall GenEval score of 0.69, outperforming published baselines including SDXL (2.6B, 0.55), SD3-Medium (2B, 0.62), and IF-XL (4.3B, 0.61). We further validate the full poetry-to-image pipeline on Android and the core CLIP-U-Net-VAE generation branch on iOS. On a smartphone powered by the Snapdragon 8 Elite Gen 5 Mobile Platform, the 4-step U-Net denoising branch runs in approximately 1.6 seconds, while the full Android poetry-to-image pipeline takes 4.5 seconds with on-device prompt refinement on Xiaomi 17 Pro Max. These results position JuZhou 1.0 as a practical approach to mobile text-to-image generation and provide a concrete reference for Chinese-native generation, domestic-compute training, and fully offline on-device deployment after one-time installation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces JuZhou 1.0, an ultra-lightweight 0.387B-parameter text-to-image diffusion model (0.385B U-Net + 1.90M decoder) trained on 9M curated Chinese image-text pairs using Rectified Flow and DMD2 distillation on Sugon K100 accelerators. It claims a GenEval score of 0.69 for the 28-step base model (outperforming SDXL at 0.55, SD3-Medium at 0.62, and IF-XL at 0.61), 4-step inference, direct Chinese prompting, and on-device timings of about 1.6 s for the 4-step U-Net denoising branch on the Snapdragon 8 Elite Gen 5 Mobile Platform and 4.5 s for the full Android poetry-to-image pipeline.

Significance. If the performance claims are substantiated under equivalent conditions, the work would demonstrate practical fully-offline on-device T2I generation with a compact model trained entirely on domestic hardware and language-specific data, providing a reference point for edge deployment, privacy, and non-Western foundation model development.

major comments (2)
  1. [Abstract] The headline GenEval claim (0.69 for the 28-step base model outperforming the listed baselines) is presented without any description of the evaluation protocol, including whether the same GenEval prompt set/version, sampling steps, guidance scale, scheduler, or number of samples were used for all models, or whether baseline scores were re-computed in the authors' setup rather than taken from prior publications.
  2. [Abstract] The training description stresses 9M Chinese image-text pairs and 'direct Chinese prompting without external translation,' yet GenEval is an English prompt benchmark; the manuscript does not state whether English prompts were used for JuZhou evaluation, how any language mismatch was handled, or whether this affects score comparability.
minor comments (1)
  1. The abstract distinguishes the 28-step base model for the GenEval claim from the 4-step distilled model used for timings, but the manuscript would benefit from an explicit results subsection separating these configurations and their respective metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the evaluation protocol and language considerations for our GenEval results. We address each point below and will revise the manuscript to improve clarity and transparency on these aspects.

  1. Referee: [Abstract] The headline GenEval claim (0.69 for the 28-step base model outperforming the listed baselines) is presented without any description of the evaluation protocol, including whether the same GenEval prompt set/version, sampling steps, guidance scale, scheduler, or number of samples were used for all models, or whether baseline scores were re-computed in the authors' setup rather than taken from prior publications.

    Authors: We agree that the abstract does not describe the evaluation protocol in sufficient detail. The reported baseline scores (SDXL 0.55, SD3-Medium 0.62, IF-XL 0.61) are taken directly from the original publications rather than re-evaluated in our environment. Our JuZhou 28-step model was evaluated on the standard GenEval benchmark using its default prompt set. In the revised manuscript we will add an explicit paragraph (likely in Section 4 or a new evaluation subsection) stating the precise protocol used for JuZhou: 28 sampling steps, guidance scale 7.5, the default Euler scheduler, and the full GenEval prompt set with the standard number of samples. We will also note that direct apples-to-apples re-computation of all baselines was not performed. revision: yes

  2. Referee: [Abstract] The training description stresses 9M Chinese image-text pairs and 'direct Chinese prompting without external translation,' yet GenEval is an English prompt benchmark; the manuscript does not state whether English prompts were used for JuZhou evaluation, how any language mismatch was handled, or whether this affects score comparability.

    Authors: We acknowledge the observation. Although the model was trained exclusively on Chinese image-text pairs to support direct Chinese prompting, the GenEval score of 0.69 was obtained by running the standard English GenEval prompts through the model without any translation step. The model accepts English input because its text encoder was initialized from a multilingual checkpoint and further adapted during training. In the revision we will add a sentence clarifying that GenEval evaluation used the original English prompts and will briefly discuss the potential impact on cross-model comparability given the Chinese-centric training data. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results are self-contained

The paper is a technical report presenting an empirical model (0.387B parameters, trained on 9M pairs, evaluated at 0.69 GenEval) with performance numbers and on-device timings. No derivation chain, equations, or self-referential definitions are present in the provided text. Claims rest on external benchmarks and hardware measurements that do not reduce by construction to the paper's own fitted inputs or self-citations. This is the expected honest outcome for a performance report without mathematical self-reference.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

Performance superiority and deployment claims rest on the domain assumption that GenEval is a fair cross-model comparator and that distillation preserves quality; no free parameters are explicitly fitted in the abstract beyond design choices for edge constraints.

free parameters (3)
  • U-Net parameter count = 0.385B
    Chosen to meet edge-device memory and latency limits
  • inference steps after distillation = 4
    Selected for low-latency on-device execution
  • curated training pairs = 9M
    Size selected for Chinese semantic alignment
axioms (2)
  • domain assumption: the GenEval benchmark provides a meaningful and unbiased measure of text-to-image quality across model scales
    Invoked when claiming outperformance over larger published models
  • domain assumption: DMD2 distillation from the 28-step base preserves sufficient quality for the reported use cases
    Required to equate 4-step inference performance to the base model results

pith-pipeline@v0.9.1-grok · 6011 in / 1700 out tokens · 58040 ms · 2026-06-30T00:58:55.596187+00:00 · methodology

