JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators

Bin Jiang; Ce Chen; Chen Ma; Congrui Wang; Jingjing Zhou; Junhao Xiao; Kede Ma; Long Lan; Mingyang Geng; Qilin Lu

arxiv: 2606.28421 · v1 · pith:ICPFKGWSnew · submitted 2026-06-25 · 💻 cs.CV · cs.AI

JuZhou 1.0 Technical Report: The First Edge-Native Text-to-Image Foundation Model Trained Entirely on China-Developed AI Accelerators

Ce Chen , Congrui Wang , Yonglin Li , Zhenchen Wan , Mingyang Geng , Junhao Xiao , Zhengpeng Xing , Yaqing Hu

show 18 more authors

Yao Wu Zhaoyang Qu Long Lan Xinwang Liu Yingqi Peng Shijia Li Zufeng Zhang Chen Ma Jingjing Zhou Xingyu Wang Qilin Lu Bin Jiang Qilin Sun Shanzhi Gu Yaoguang Jin Tongliang Liu Kede Ma Yifan Peng

This is my paper

Pith reviewed 2026-06-30 00:58 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords text-to-image diffusionon-device inferenceedge deploymentChinese prompt alignmentdomestic AI acceleratorsGenEval benchmarkmodel distillationmobile text-to-image

0 comments

The pith

JuZhou 1.0 is a 0.387-billion-parameter text-to-image model trained entirely on domestic Chinese accelerators that runs fully offline on smartphones and scores 0.69 on GenEval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents JuZhou 1.0 as an ultra-lightweight text-to-image diffusion model built for fully offline on-device execution. It combines a 0.385B-parameter U-Net with a small distilled decoder, Rectified Flow training, and DMD2 distillation to reach 4 sampling steps. The model is trained on 9M Chinese image-text pairs for native prompting and completes its entire pipeline on Sugon K100 accelerators. Its 28-step base version reaches an overall GenEval score of 0.69, exceeding published results from SDXL, SD3-Medium, and IF-XL. On-device tests show the 4-step branch completing in about 1.6 seconds on Snapdragon hardware and the full Android pipeline in 4.5 seconds.

Core claim

JuZhou 1.0 achieves an overall GenEval score of 0.69 with its 28-step base model, outperforming published baselines including SDXL (2.6B, 0.55), SD3-Medium (2B, 0.62), and IF-XL (4.3B, 0.61), while being designed for fully offline on-device execution on mobile platforms after training on domestic accelerators.

What carries the argument

The compact image-generation backbone consisting of a 0.385B-parameter denoising U-Net and a 1.90M-parameter distilled decoder, combined with Rectified Flow training and DMD2 distillation that reduces inference to 4 sampling steps.

If this is right

The 4-step distilled model enables practical fully offline text-to-image generation on smartphones after a single installation.
Direct Chinese prompting works without external translation steps at inference time.
The complete training and distillation pipeline can be executed on domestically developed Sugon K100 accelerators.
The generation branch runs in approximately 1.6 seconds on Snapdragon 8 Elite hardware and the full Android poetry-to-image pipeline completes in 4.5 seconds.
The same architecture validates on both Android and iOS platforms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The combination of distillation and domestic hardware training could reduce dependence on foreign cloud infrastructure for image generation tasks.
Native Chinese semantic alignment may improve prompt fidelity for users writing in Chinese compared with translated pipelines.
The reported parameter count and step reduction suggest similar compression techniques could be tested on other edge vision or multimodal tasks.

Load-bearing premise

GenEval benchmark scores can be compared directly across models of different scales and training regimes, and the reported on-device timings reflect typical performance without undisclosed hardware-specific optimizations.

What would settle it

A side-by-side GenEval evaluation of JuZhou 1.0 and the baseline models run on the identical benchmark code and data splits, or a hardware profiler trace of the mobile inference confirming the reported latencies without hidden optimizations.

Figures

Figures reproduced from arXiv: 2606.28421 by Bin Jiang, Ce Chen, Chen Ma, Congrui Wang, Jingjing Zhou, Junhao Xiao, Kede Ma, Long Lan, Mingyang Geng, Qilin Lu, Qilin Sun, Shanzhi Gu, Shijia Li, Tongliang Liu, Xingyu Wang, Xinwang Liu, Yaoguang Jin, Yao Wu, Yaqing Hu, Yifan Peng, Yingqi Peng, Yonglin Li, Zhaoyang Qu, Zhenchen Wan, Zhengpeng Xing, Zufeng Zhang.

**Figure 1.** Figure 1: Overview of JuZhou 1.0. An edge-native text-to-image foundation model featuring an ultralight on-device architecture, native Chinese semantic alignment, and fully offline mobile deployment. The entire training pipeline is completed exclusively on domestic compute (Sugon K100). Abstract: Text-to-image (T2I) diffusion models typically require substantial computational resources and cloud infrastructure, po… view at source ↗

**Figure 2.** Figure 2: Overview of the General Chinese Text-to-Image Curation Pipeline. Two stages construct a Chinese image-text corpus for training the base model: (1) Data Synthesis starts from 9M filtered English text-to-image prompts from DiffusionDB and synthesize images via Stable Diffusion 3.5 Large; (2) Prompt Translation converts prompts into Chinese using Qwen3-235B-Instruct, producing trainingready Chinese prompt-im… view at source ↗

**Figure 3.** Figure 3 [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of the JuZhou 1.0 framework. A raw poem or user prompt is first refined by Qwen3-1.7B and then encoded by CN-CLIP1 to obtain Chinese semantic conditioning. The conditioning signal is injected into a 0.385B-parameter denoising U-Net for efficient high-resolution generation. The generated latent representation is decoded by an ultra-compact 1.9M-parameter VAE decoder without attention layers. DMD2 d… view at source ↗

**Figure 5.** Figure 5: Overview of the lightweight denoising network. (a) Denoiser architecture. (b) Trans [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Detailed modules in the denoising network. (a) Cross-attention. (b) Self-attention. (c) [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Overview of the compact VAE decoder. (a) VAE decoder architecture. (b) VAE decoder [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Heterogeneous deployment architecture across Android and iOS platforms. On Android, the pipeline is partitioned between CPU (MNN engine for language model and text encoding) and NPU (QNN for U-Net and VAE decoding). On iOS, all diffusion modules run on Core ML in FP16 precision without additional quantization. Xiaomi 17 Pro Max iQOO Neo11 Redmi K60 iPhone 15 Pro 0 2 4 6 8 10 12 14 Latency (s) 2.9 3.5 9.… view at source ↗

**Figure 10.** Figure 10: Screenshots of the Mojie application on Android devices, illustrating (a) the main interface, [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: Qualitative comparison of 1024 × 1024 images generated by our 28-step base model against the Stable Diffusion family (SD 1.5, SD 2.1, SDXL, SD 3.5-L) across five diverse prompts. Despite its 0.385B parameter budget, JuZhou 1.0 consistently achieves competitive visual fidelity in lighting, texture, and compositional structure. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Qualitative visual comparison under different sampling step settings. Each row presents [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Visualization of images generated directly from classical Chinese poetry by our distilled [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

read the original abstract

Text-to-image (T2I) diffusion models typically require substantial computational resources and cloud infrastructure, posing significant challenges for edge deployment in terms of latency, cost, and user privacy. We present JuZhou 1.0, an ultra-lightweight T2I foundation model designed for fully offline, on-device execution. JuZhou 1.0 achieves its efficiency through four key designs: (1) a compact image-generation backbone consisting of a 0.385B-parameter denoising U-Net and a 1.90M-parameter distilled decoder, totaling approximately 0.387B parameters; (2) Rectified Flow training combined with DMD2 distillation, reducing inference to 4 sampling steps; (3) Chinese semantic alignment trained on 9M curated image-text pairs, enabling direct Chinese prompting without external translation at inference time; and (4) a training and distillation pipeline completed on domestically developed Sugon K100 AI accelerators without relying on NVIDIA GPUs for training or distillation. Despite its compact scale, the 28-step base model of JuZhou 1.0 achieves an overall GenEval score of 0.69, outperforming published baselines including SDXL (2.6B, 0.55), SD3-Medium (2B, 0.62), and IF-XL (4.3B, 0.61). We further validate the full poetry-to-image pipeline on Android and the core CLIP-U-Net-VAE generation branch on iOS. On a smartphone powered by the Snapdragon 8 Elite Gen 5 Mobile Platform, the 4-step U-Net denoising branch runs in approximately 1.6 seconds, while the full Android poetry-to-image pipeline takes 4.5 seconds with on-device prompt refinement on Xiaomi 17 Pro Max. These results position JuZhou 1.0 as a practical approach to mobile text-to-image generation and provide a concrete reference for Chinese-native generation, domestic-compute training, and fully offline on-device deployment after one-time installation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

JuZhou 1.0 reports a compact 0.387B Chinese T2I model trained on domestic hardware with specific mobile timings, but the GenEval outperformance rests on unverified evaluation details.

read the letter

The main thing to know is that this technical report describes a small text-to-image model trained end-to-end on Sugon K100 accelerators, using Rectified Flow plus DMD2 distillation down to 4 steps, and gives measured runtimes on Snapdragon hardware. The 0.387B parameter count, 9M Chinese pair dataset, and direct Chinese prompting are the concrete new artifacts.

The work does a reasonable job documenting the pipeline choices and the on-device results. Reporting 1.6 seconds for the U-Net branch and 4.5 seconds for the full Android poetry-to-image flow on Xiaomi hardware supplies usable numbers for anyone thinking about offline mobile generation. The domestic compute angle is also straightforward and worth noting for hardware-independence discussions.

The soft spot is the headline GenEval result. The abstract states 0.69 for the 28-step base model, above SDXL, SD3-Medium, and IF-XL, but supplies no protocol details. It is unclear whether English prompts were used, whether the baselines were re-evaluated in the same setup, or how sampling and guidance matched. The stress-test concern about language mismatch and non-identical conditions therefore stands on the information given; the numerical edge cannot be taken as settled without that evidence.

This paper is mainly for people working on edge deployment or Chinese-language generation stacks. The deployment measurements are the part that could be cited or built on. The central feasibility claim is plausible on its face, but the benchmark comparison needs tightening before it carries weight.

I would send it to peer review so the evaluation section can be checked, but I would not cite it in its current form.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces JuZhou 1.0, an ultra-lightweight 0.387B-parameter text-to-image diffusion model (0.385B U-Net + 1.90M decoder) trained on 9M curated Chinese image-text pairs using Rectified Flow and DMD2 distillation on Sugon K100 accelerators. It claims a GenEval score of 0.69 for the 28-step base model (outperforming SDXL at 0.55, SD3-Medium at 0.62, and IF-XL at 0.61), 4-step inference, direct Chinese prompting, and on-device timings of ~1.6s for the U-Net on Snapdragon 8 Elite and 4.5s for the full Android poetry-to-image pipeline.

Significance. If the performance claims are substantiated under equivalent conditions, the work would demonstrate practical fully-offline on-device T2I generation with a compact model trained entirely on domestic hardware and language-specific data, providing a reference point for edge deployment, privacy, and non-Western foundation model development.

major comments (2)

[Abstract] Abstract: The headline GenEval claim (0.69 for the 28-step base model outperforming the listed baselines) is presented without any description of the evaluation protocol, including whether the same GenEval prompt set/version, sampling steps, guidance scale, scheduler, or number of samples were used for all models, or whether baseline scores were re-computed in the authors' setup rather than taken from prior publications.
[Abstract] Abstract: The training description stresses 9M Chinese image-text pairs and 'direct Chinese prompting without external translation,' yet GenEval is an English prompt benchmark; the manuscript does not state whether English prompts were used for JuZhou evaluation, how any language mismatch was handled, or whether this affects score comparability.

minor comments (1)

The abstract distinguishes the 28-step base model for the GenEval claim from the 4-step distilled model used for timings, but the manuscript would benefit from an explicit results subsection separating these configurations and their respective metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the evaluation protocol and language considerations for our GenEval results. We address each point below and will revise the manuscript to improve clarity and transparency on these aspects.

read point-by-point responses

Referee: [Abstract] Abstract: The headline GenEval claim (0.69 for the 28-step base model outperforming the listed baselines) is presented without any description of the evaluation protocol, including whether the same GenEval prompt set/version, sampling steps, guidance scale, scheduler, or number of samples were used for all models, or whether baseline scores were re-computed in the authors' setup rather than taken from prior publications.

Authors: We agree that the abstract does not describe the evaluation protocol in sufficient detail. The reported baseline scores (SDXL 0.55, SD3-Medium 0.62, IF-XL 0.61) are taken directly from the original publications rather than re-evaluated in our environment. Our JuZhou 28-step model was evaluated on the standard GenEval benchmark using its default prompt set. In the revised manuscript we will add an explicit paragraph (likely in Section 4 or a new evaluation subsection) stating the precise protocol used for JuZhou: 28 sampling steps, guidance scale 7.5, the default Euler scheduler, and the full GenEval prompt set with the standard number of samples. We will also note that direct apples-to-apples re-computation of all baselines was not performed. revision: yes
Referee: [Abstract] Abstract: The training description stresses 9M Chinese image-text pairs and 'direct Chinese prompting without external translation,' yet GenEval is an English prompt benchmark; the manuscript does not state whether English prompts were used for JuZhou evaluation, how any language mismatch was handled, or whether this affects score comparability.

Authors: We acknowledge the observation. Although the model was trained exclusively on Chinese image-text pairs to support direct Chinese prompting, the GenEval score of 0.69 was obtained by running the standard English GenEval prompts through the model without any translation step. The model accepts English input because its text encoder was initialized from a multilingual checkpoint and further adapted during training. In the revision we will add a sentence clarifying that GenEval evaluation used the original English prompts and will briefly discuss the potential impact on cross-model comparability given the Chinese-centric training data. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark results are self-contained

full rationale

The paper is a technical report presenting an empirical model (0.387B parameters, trained on 9M pairs, evaluated at 0.69 GenEval) with performance numbers and on-device timings. No derivation chain, equations, or self-referential definitions are present in the provided text. Claims rest on external benchmarks and hardware measurements that do not reduce by construction to the paper's own fitted inputs or self-citations. This is the expected honest outcome for a performance report without mathematical self-reference.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

Performance superiority and deployment claims rest on the domain assumption that GenEval is a fair cross-model comparator and that distillation preserves quality; no free parameters are explicitly fitted in the abstract beyond design choices for edge constraints.

free parameters (3)

U-Net parameter count = 0.385B
Chosen to meet edge-device memory and latency limits
inference steps after distillation = 4
Selected for low-latency on-device execution
curated training pairs = 9M
Size selected for Chinese semantic alignment

axioms (2)

domain assumption GenEval benchmark provides a meaningful and unbiased measure of text-to-image quality across model scales
Invoked when claiming outperformance over larger published models
domain assumption DMD2 distillation from the 28-step base preserves sufficient quality for the reported use cases
Required to equate 4-step inference performance to the base model results

pith-pipeline@v0.9.1-grok · 6011 in / 1700 out tokens · 58040 ms · 2026-06-30T00:58:55.596187+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 19 canonical work pages · 7 internal anchors

[1]

Text-to-image diffusion models in generative ai: A survey,

C. Zhang, C. Zhang, M. Zhang, I. S. Kweon, and J. Kim, “Text-to-image diffusion models in generative ai: A survey,”arXiv preprint arXiv:2303.07909, 2023

work page arXiv 2023
[2]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

C. Saharia, W. Chan, S. Saxena, J. Li, J. Whang, E. Denton, S. M. H. A. Ghasemipour,et al., “Imagen: Photorealistic text-to-image diffusion models with deep language understanding,”arXiv preprint arXiv:2205.11487, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,”arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

Sdxl: Improving latent diffusion models for high-resolution image synthesis,

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” inInternational Conference on Learning Representations, 2024. 5https://www.icswb.com/default.php?mod=live_text&live_id=914&temp=live_video 26

2024
[5]

B. F. Labs, “Flux.”https://github.com/black-forest-labs/flux, 2024

2024
[6]

Scalable Diffusion Models with Transformers

W. Peebles and S. Xie, “Scalable diffusion models with transformers,”arXiv preprint arXiv:2212.09748, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022

2022
[8]

Stable diffusion 2.1 release notes

S. AI, “Stable diffusion 2.1 release notes.”https://stability.ai/news/ stable-diffusion-v2-1-release, 2022

2022
[9]

Cogview: Pretrained transformers for image synthesis from text,

M. Ding, Z. Yang, W. Duan, J. Lu, J. Wu, and C. Zhu, “Cogview: Pretrained transformers for image synthesis from text,”arXiv preprint arXiv:2105.13290, 2021

work page arXiv 2021
[10]

Taiyi-diffusion-xl: Advancing bilingual text-to- image generation with large vision-language model support,

X. Wu, D. Zhang, R. Gan, J. Lu, Z. Wu,et al., “Taiyi-diffusion-xl: Advancing bilingual text-to- image generation with large vision-language model support,”arXiv preprint arXiv:2401.14688, 2024

work page arXiv 2024
[11]

Mobilediffusion: Instant text-to-image generation on mobile devices,

Y . Zhao, Y . Xu, Z. Xiao, H. Jia, and T. Hou, “Mobilediffusion: Instant text-to-image generation on mobile devices,” inEuropean Conference on Computer Vision, pp. 225–242, Springer, 2024

2024
[12]

Snapgen: Taming high-resolution text-to- image models for mobile devices with efficient architectures and training,

D. Hu, J. Chen, X. Huang, H. Coskun,et al., “Snapgen: Taming high-resolution text-to- image models for mobile devices with efficient architectures and training,”arXiv preprint arXiv:2412.09619, 2024

work page arXiv 2024
[13]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020

2020
[14]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,”arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

Improved distribution matching distillation for fast image synthesis,

T. Yin, M. Gharbi, T. Park, R. Zhang, E. Shechtman, F. Durand, and W. T. Freeman, “Improved distribution matching distillation for fast image synthesis,”Advances in Neural Information Pro- cessing Systems, vol. 37, 2024

2024
[16]

Mnn: A universal and efficient inference engine,

X. Jiang, H. Wang, Y . Chen, Z. Wu, L. Wang, B. Zou, Y . Yang, Z. Cui, Y . Cai, T. Yu, C. Lv, and Z. Wu, “Mnn: A universal and efficient inference engine,” inMLSys, 2020

2020
[17]

Qualcomm ai engine direct sdk reference guide

I. Qualcomm Technologies, “Qualcomm ai engine direct sdk reference guide.”https://docs. qualcomm.com/nav/home/index_QNN.html?product=1601111740009302, 2026

2026
[18]

Laion-5b: An open large-scale dataset for training next generation image-text models,

C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti,et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,”Advances in Neural Information Processing Systems, vol. 35, pp. 35276–35296, 2022

2022
[19]

Pai-diffusion: Constructing and serving a family of open chinese diffusion models for text-to-image synthesis on the cloud,

C. Wang, Z. Duan, B. Liu, X. Zou, C. Chen, K. Jia, and J. Huang, “Pai-diffusion: Constructing and serving a family of open chinese diffusion models for text-to-image synthesis on the cloud,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

2023
[20]

Accelerating dif- fusion model training under minimal budgets: A condensation-based perspective,

R. Huang, S. Shao, Z. Zhou, P. Zhao, H. Guo, T. Ye, L. Bai, S. Yang, and Z. Xie, “Accelerating dif- fusion model training under minimal budgets: A condensation-based perspective,”arXiv preprint arXiv:2507.05914, 2025. 27

work page arXiv 2025
[21]

Diffusiondb: A large-scale prompt gallery for text-to-image diffusion models,

Z. J. Wang, E. Montoya, D. Munechika, H. Yang, B. Hoover, and D. H. Chau, “Diffusiondb: A large-scale prompt gallery for text-to-image diffusion models,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023

2023
[22]

chinese-poetry: The most comprehensive database of Chinese poetry

J. Gao and contributors, “chinese-poetry: The most comprehensive database of Chinese poetry.” https://github.com/chinese-poetry/chinese-poetry, 2019. Accessed 2026- 05-27

2019
[23]

Open Chinese Convert (OpenCC)

BYV oid, “Open Chinese Convert (OpenCC).”https://github.com/byvoid/opencc,
[24]

Version 1.3.1, accessed 2026-05-27

2026
[25]

Qwen-image technical report,

C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S. ming Yin, S. Bai, X. Xu, Y . Chen, Y . Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, D. Liu, D. Li, H. Zhang, H. Meng, H. Wei, J. Ni, K. Chen, K. Cao, L. Peng, L. Qu, M. Wu, P. Wang, S. Yu, T. Wen, W. Feng, X. Xu, Y . Wang, Y . Zhang, Y . Zhu, Y . Wu, Y . Cai, and Z. Liu, “Qwen-image technica...

2025
[26]

Reproducible scaling laws for contrastive language-image learning,

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2816– 2827, 2023

2023
[27]

Long-clip: Unlocking the long-text capability of clip,

B. Zhang, P. Zhang, X. Dong, Y . Zang, and J. Wang, “Long-clip: Unlocking the long-text capability of clip,”arXiv preprint arXiv:2403.15378, 2024

work page arXiv 2024
[28]

Qwen3 technical report,

Q. Team, “Qwen3 technical report,” 2025

2025
[29]

Bridge diffusion model: Bridge chinese text-to-image diffusion model with english communities,

S. Liu, B. Cheng, Y . Ma, L. Wu, A. Ma, X. Wu, D. Leng, and Y . Yin, “Bridge diffusion model: Bridge chinese text-to-image diffusion model with english communities,” inProceedings of the AAAI Conference on Artificial Intelligence, 2023

2023
[30]

Chinese clip: Contrastive vision-language pretraining in chinese,

A. Yang, J. Pan, J. Lin, R. Men, Y . Zhang, J. Zhou, and C. Zhou, “Chinese clip: Contrastive vision-language pretraining in chinese,”arXiv preprint arXiv:2211.01335, 2022

work page arXiv 2022
[31]

LoRA: Low-Rank Adaptation of Large Language Models

E. J. Hu, Y . Shen, P. Wallis,et al., “Lora: Low-rank adaptation of large language models,”arXiv preprint arXiv:2106.09685, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[32]

Q-diffusion: Quantizing diffusion models,

X. Li, Y . Liu, L. Lian, H. Yang, Z. Dong, D. Kang, S. Zhang, and K. Keutzer, “Q-diffusion: Quantizing diffusion models,”arXiv preprint arXiv:2302.04304, 2023

work page arXiv 2023
[33]

Understanding int4 quantization for transformer models: Latency speedup, composability, and failure cases,

X. Wu, C. Li, R. Yazdani Aminabadi, Z. Yao, and Y . He, “Understanding int4 quantization for transformer models: Latency speedup, composability, and failure cases,”arXiv preprint arXiv:2301.12017, 2023

work page arXiv 2023
[34]

Low-bit model quantization for deep neural networks: A survey,

K. Liu, Q. Zheng, K. Tao, Z. Li, H. Qin, W. Li, Y . Guo, X. Liu, L. Kong, G. Chen, Y . Zhang, and X. Yang, “Low-bit model quantization for deep neural networks: A survey,”arXiv preprint arXiv:2505.05530, 2025

work page arXiv 2025
[35]

Geneval: An object-focused framework for evaluating text-to-image alignment,

D. Ghosh, H. Hajishirzi, and L. Schmidt, “Geneval: An object-focused framework for evaluating text-to-image alignment,”Advances in Neural Information Processing Systems, vol. 36, pp. 52132– 52152, 2023. 28

2023
[36]

Lumina-next: Making lumina-t2x stronger and faster with next-dit,

L. Zhuo, R. Du, H. Xiao, Y . Li, D. Liu, R. Huang, W. Liu, L. Zhao, F.-Y . Wang, Z. Ma, X. Luo, Z. Wang, K. Zhang, X. Zhu, S. Liu, X. Yue, D. Liu, W. Ouyang, Z. Liu, Y . Qiao, H. Li, and P. Gao, “Lumina-next: Making lumina-t2x stronger and faster with next-dit,” 2024

2024
[37]

Scaling rectified flow transformers for high-resolution image synthesis,

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y . Marek, and R. Rombach, “Scaling rectified flow transformers for high-resolution image synthesis,” 2024

2024
[38]

Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation,

D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi, “Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation,” 2024

2024
[39]

DeepFloyd IF: A modular cascaded pixel-space text-to-image diffusion model,

DeepFloyd Lab and Stability AI, “DeepFloyd IF: A modular cascaded pixel-space text-to-image diffusion model,” 2023. Official model repository

2023
[40]

Sana: Efficient high-resolution image synthesis with linear diffusion transformers,

E. Xie, J. Chen, J. Chen, H. Cai, H. Tang, Y . Lin, Z. Zhang, M. Li, L. Zhu, Y . Lu, and S. Han, “Sana: Efficient high-resolution image synthesis with linear diffusion transformers,” 2024

2024
[41]

Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding,

Z. Li, J. Zhang, Q. Lin, J. Xiong, Y . Long, X. Deng, Y . Zhang, X. Liu, M. Huang, Z. Xiao, D. Chen, J. He, J. Li, W. Li, C. Zhang, R. Quan, J. Lu, J. Huang, X. Yuan, X. Zheng, Y . Li, J. Zhang, C. Zhang, M. Chen, J. Liu, Z. Fang, W. Wang, J. Xue, Y . Tao, J. Zhu, K. Liu, S. Lin, Y . Sun, Y . Li, D. Wang, M. Chen, Z. Hu, X. Xiao, Y . Chen, Y . Liu, W. Liu...

2024
[42]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

S. Luo, Y . Tan, L. Huang, J. Li, and H. Zhao, “Latent consistency models: Synthesizing high- resolution images with few-step inference,”arXiv preprint arXiv:2310.04378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,

N. Ruiz, Y . Li, P. Varadharajan, D. Cohen, N. Hariri, Y . Zhang,et al., “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6840–6850, 2023

2023
[44]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

R. Gal, Y . Alaluf, Y . Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,”arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[45]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,

Y . Guo, C. Yang, A. Rao, Y . Wang, Y . Qiao, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” inInternational Conference on Learning Representations, 2024

2024
[46]

Lcm-lora: A universal stable-diffusion acceleration module,

S. Luo, Y . Tan, S. Patil, D. Gu, P. von Platen, A. Passos, L. Huang, J. Li, and H. Zhao, “Lcm-lora: A universal stable-diffusion acceleration module,”arXiv preprint arXiv:2311.05556, 2023

work page arXiv 2023
[47]

Rethinking fid: Towards a better evaluation metric for image generation,

S. Jayasumana, S. Ramalingam, A. Veit, D. Glasner, A. Chakrabarti, and S. Kumar, “Rethinking fid: Towards a better evaluation metric for image generation,”arXiv preprint arXiv:2308.14956, 2023. 29

work page arXiv 2023

[1] [1]

Text-to-image diffusion models in generative ai: A survey,

C. Zhang, C. Zhang, M. Zhang, I. S. Kweon, and J. Kim, “Text-to-image diffusion models in generative ai: A survey,”arXiv preprint arXiv:2303.07909, 2023

work page arXiv 2023

[2] [2]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

C. Saharia, W. Chan, S. Saxena, J. Li, J. Whang, E. Denton, S. M. H. A. Ghasemipour,et al., “Imagen: Photorealistic text-to-image diffusion models with deep language understanding,”arXiv preprint arXiv:2205.11487, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

Hierarchical Text-Conditional Image Generation with CLIP Latents

A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,”arXiv preprint arXiv:2204.06125, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[4] [4]

Sdxl: Improving latent diffusion models for high-resolution image synthesis,

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,” inInternational Conference on Learning Representations, 2024. 5https://www.icswb.com/default.php?mod=live_text&live_id=914&temp=live_video 26

2024

[5] [5]

B. F. Labs, “Flux.”https://github.com/black-forest-labs/flux, 2024

2024

[6] [6]

Scalable Diffusion Models with Transformers

W. Peebles and S. Xie, “Scalable diffusion models with transformers,”arXiv preprint arXiv:2212.09748, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[7] [7]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022

2022

[8] [8]

Stable diffusion 2.1 release notes

S. AI, “Stable diffusion 2.1 release notes.”https://stability.ai/news/ stable-diffusion-v2-1-release, 2022

2022

[9] [9]

Cogview: Pretrained transformers for image synthesis from text,

M. Ding, Z. Yang, W. Duan, J. Lu, J. Wu, and C. Zhu, “Cogview: Pretrained transformers for image synthesis from text,”arXiv preprint arXiv:2105.13290, 2021

work page arXiv 2021

[10] [10]

Taiyi-diffusion-xl: Advancing bilingual text-to- image generation with large vision-language model support,

X. Wu, D. Zhang, R. Gan, J. Lu, Z. Wu,et al., “Taiyi-diffusion-xl: Advancing bilingual text-to- image generation with large vision-language model support,”arXiv preprint arXiv:2401.14688, 2024

work page arXiv 2024

[11] [11]

Mobilediffusion: Instant text-to-image generation on mobile devices,

Y . Zhao, Y . Xu, Z. Xiao, H. Jia, and T. Hou, “Mobilediffusion: Instant text-to-image generation on mobile devices,” inEuropean Conference on Computer Vision, pp. 225–242, Springer, 2024

2024

[12] [12]

Snapgen: Taming high-resolution text-to- image models for mobile devices with efficient architectures and training,

D. Hu, J. Chen, X. Huang, H. Coskun,et al., “Snapgen: Taming high-resolution text-to- image models for mobile devices with efficient architectures and training,”arXiv preprint arXiv:2412.09619, 2024

work page arXiv 2024

[13] [13]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020

2020

[14] [14]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,”arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[15] [15]

Improved distribution matching distillation for fast image synthesis,

T. Yin, M. Gharbi, T. Park, R. Zhang, E. Shechtman, F. Durand, and W. T. Freeman, “Improved distribution matching distillation for fast image synthesis,”Advances in Neural Information Pro- cessing Systems, vol. 37, 2024

2024

[16] [16]

Mnn: A universal and efficient inference engine,

X. Jiang, H. Wang, Y . Chen, Z. Wu, L. Wang, B. Zou, Y . Yang, Z. Cui, Y . Cai, T. Yu, C. Lv, and Z. Wu, “Mnn: A universal and efficient inference engine,” inMLSys, 2020

2020

[17] [17]

Qualcomm ai engine direct sdk reference guide

I. Qualcomm Technologies, “Qualcomm ai engine direct sdk reference guide.”https://docs. qualcomm.com/nav/home/index_QNN.html?product=1601111740009302, 2026

2026

[18] [18]

Laion-5b: An open large-scale dataset for training next generation image-text models,

C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti,et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,”Advances in Neural Information Processing Systems, vol. 35, pp. 35276–35296, 2022

2022

[19] [19]

Pai-diffusion: Constructing and serving a family of open chinese diffusion models for text-to-image synthesis on the cloud,

C. Wang, Z. Duan, B. Liu, X. Zou, C. Chen, K. Jia, and J. Huang, “Pai-diffusion: Constructing and serving a family of open chinese diffusion models for text-to-image synthesis on the cloud,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023

2023

[20] [20]

Accelerating dif- fusion model training under minimal budgets: A condensation-based perspective,

R. Huang, S. Shao, Z. Zhou, P. Zhao, H. Guo, T. Ye, L. Bai, S. Yang, and Z. Xie, “Accelerating dif- fusion model training under minimal budgets: A condensation-based perspective,”arXiv preprint arXiv:2507.05914, 2025. 27

work page arXiv 2025

[21] [21]

Diffusiondb: A large-scale prompt gallery for text-to-image diffusion models,

Z. J. Wang, E. Montoya, D. Munechika, H. Yang, B. Hoover, and D. H. Chau, “Diffusiondb: A large-scale prompt gallery for text-to-image diffusion models,” inProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023

2023

[22] [22]

chinese-poetry: The most comprehensive database of Chinese poetry

J. Gao and contributors, “chinese-poetry: The most comprehensive database of Chinese poetry.” https://github.com/chinese-poetry/chinese-poetry, 2019. Accessed 2026- 05-27

2019

[23] [23]

Open Chinese Convert (OpenCC)

BYV oid, “Open Chinese Convert (OpenCC).”https://github.com/byvoid/opencc,

[24] [24]

Version 1.3.1, accessed 2026-05-27

2026

[25] [25]

Qwen-image technical report,

C. Wu, J. Li, J. Zhou, J. Lin, K. Gao, K. Yan, S. ming Yin, S. Bai, X. Xu, Y . Chen, Y . Chen, Z. Tang, Z. Zhang, Z. Wang, A. Yang, B. Yu, C. Cheng, D. Liu, D. Li, H. Zhang, H. Meng, H. Wei, J. Ni, K. Chen, K. Cao, L. Peng, L. Qu, M. Wu, P. Wang, S. Yu, T. Wen, W. Feng, X. Xu, Y . Wang, Y . Zhang, Y . Zhu, Y . Wu, Y . Cai, and Z. Liu, “Qwen-image technica...

2025

[26] [26]

Reproducible scaling laws for contrastive language-image learning,

M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2816– 2827, 2023

2023

[27] [27]

Long-clip: Unlocking the long-text capability of clip,

B. Zhang, P. Zhang, X. Dong, Y . Zang, and J. Wang, “Long-clip: Unlocking the long-text capability of clip,”arXiv preprint arXiv:2403.15378, 2024

work page arXiv 2024

[28] [28]

Qwen3 technical report,

Q. Team, “Qwen3 technical report,” 2025

2025

[29] [29]

Bridge diffusion model: Bridge chinese text-to-image diffusion model with english communities,

S. Liu, B. Cheng, Y . Ma, L. Wu, A. Ma, X. Wu, D. Leng, and Y . Yin, “Bridge diffusion model: Bridge chinese text-to-image diffusion model with english communities,” inProceedings of the AAAI Conference on Artificial Intelligence, 2023

2023

[30] [30]

Chinese clip: Contrastive vision-language pretraining in chinese,

A. Yang, J. Pan, J. Lin, R. Men, Y . Zhang, J. Zhou, and C. Zhou, “Chinese clip: Contrastive vision-language pretraining in chinese,”arXiv preprint arXiv:2211.01335, 2022

work page arXiv 2022

[31] [31]

LoRA: Low-Rank Adaptation of Large Language Models

E. J. Hu, Y . Shen, P. Wallis,et al., “Lora: Low-rank adaptation of large language models,”arXiv preprint arXiv:2106.09685, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[32] [32]

Q-diffusion: Quantizing diffusion models,

X. Li, Y . Liu, L. Lian, H. Yang, Z. Dong, D. Kang, S. Zhang, and K. Keutzer, “Q-diffusion: Quantizing diffusion models,”arXiv preprint arXiv:2302.04304, 2023

work page arXiv 2023

[33] [33]

Understanding int4 quantization for transformer models: Latency speedup, composability, and failure cases,

X. Wu, C. Li, R. Yazdani Aminabadi, Z. Yao, and Y . He, “Understanding int4 quantization for transformer models: Latency speedup, composability, and failure cases,”arXiv preprint arXiv:2301.12017, 2023

work page arXiv 2023

[34] [34]

Low-bit model quantization for deep neural networks: A survey,

K. Liu, Q. Zheng, K. Tao, Z. Li, H. Qin, W. Li, Y . Guo, X. Liu, L. Kong, G. Chen, Y . Zhang, and X. Yang, “Low-bit model quantization for deep neural networks: A survey,”arXiv preprint arXiv:2505.05530, 2025

work page arXiv 2025

[35] [35]

Geneval: An object-focused framework for evaluating text-to-image alignment,

D. Ghosh, H. Hajishirzi, and L. Schmidt, “Geneval: An object-focused framework for evaluating text-to-image alignment,”Advances in Neural Information Processing Systems, vol. 36, pp. 52132– 52152, 2023. 28

2023

[36] [36]

Lumina-next: Making lumina-t2x stronger and faster with next-dit,

L. Zhuo, R. Du, H. Xiao, Y . Li, D. Liu, R. Huang, W. Liu, L. Zhao, F.-Y . Wang, Z. Ma, X. Luo, Z. Wang, K. Zhang, X. Zhu, S. Liu, X. Yue, D. Liu, W. Ouyang, Z. Liu, Y . Qiao, H. Li, and P. Gao, “Lumina-next: Making lumina-t2x stronger and faster with next-dit,” 2024

2024

[37] [37]

Scaling rectified flow transformers for high-resolution image synthesis,

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y . Marek, and R. Rombach, “Scaling rectified flow transformers for high-resolution image synthesis,” 2024

2024

[38] [38]

Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation,

D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi, “Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation,” 2024

2024

[39] [39]

DeepFloyd IF: A modular cascaded pixel-space text-to-image diffusion model,

DeepFloyd Lab and Stability AI, “DeepFloyd IF: A modular cascaded pixel-space text-to-image diffusion model,” 2023. Official model repository

2023

[40] [40]

Sana: Efficient high-resolution image synthesis with linear diffusion transformers,

E. Xie, J. Chen, J. Chen, H. Cai, H. Tang, Y . Lin, Z. Zhang, M. Li, L. Zhu, Y . Lu, and S. Han, “Sana: Efficient high-resolution image synthesis with linear diffusion transformers,” 2024

2024

[41] [41]

Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding,

Z. Li, J. Zhang, Q. Lin, J. Xiong, Y . Long, X. Deng, Y . Zhang, X. Liu, M. Huang, Z. Xiao, D. Chen, J. He, J. Li, W. Li, C. Zhang, R. Quan, J. Lu, J. Huang, X. Yuan, X. Zheng, Y . Li, J. Zhang, C. Zhang, M. Chen, J. Liu, Z. Fang, W. Wang, J. Xue, Y . Tao, J. Zhu, K. Liu, S. Lin, Y . Sun, Y . Li, D. Wang, M. Chen, Z. Hu, X. Xiao, Y . Chen, Y . Liu, W. Liu...

2024

[42] [42]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

S. Luo, Y . Tan, L. Huang, J. Li, and H. Zhao, “Latent consistency models: Synthesizing high- resolution images with few-step inference,”arXiv preprint arXiv:2310.04378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,

N. Ruiz, Y . Li, P. Varadharajan, D. Cohen, N. Hariri, Y . Zhang,et al., “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6840–6850, 2023

2023

[44] [44]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

R. Gal, Y . Alaluf, Y . Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,”arXiv preprint arXiv:2208.01618, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[45] [45]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,

Y . Guo, C. Yang, A. Rao, Y . Wang, Y . Qiao, D. Lin, and B. Dai, “Animatediff: Animate your personalized text-to-image diffusion models without specific tuning,” inInternational Conference on Learning Representations, 2024

2024

[46] [46]

Lcm-lora: A universal stable-diffusion acceleration module,

S. Luo, Y . Tan, S. Patil, D. Gu, P. von Platen, A. Passos, L. Huang, J. Li, and H. Zhao, “Lcm-lora: A universal stable-diffusion acceleration module,”arXiv preprint arXiv:2311.05556, 2023

work page arXiv 2023

[47] [47]

Rethinking fid: Towards a better evaluation metric for image generation,

S. Jayasumana, S. Ramalingam, A. Veit, D. Glasner, A. Chakrabarti, and S. Kumar, “Rethinking fid: Towards a better evaluation metric for image generation,”arXiv preprint arXiv:2308.14956, 2023. 29

work page arXiv 2023