pith. sign in

arxiv: 2606.22568 · v3 · pith:TII2SODXnew · submitted 2026-06-21 · 💻 cs.CV

SeFi-Image: A Text-to-Image Foundation Model with Semantic-First Diffusion

Pith reviewed 2026-06-29 05:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image generationdiffusion modelssemantic guidancelatent diffusionmodel scalingefficient trainingfoundation models
0
0 comments X

The pith

Semantic-first diffusion lets a 5B text-to-image model match or exceed prior models after training on only 10-20 percent of the usual compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SeFi-Image, a set of text-to-image models at 1B, 2B, and 5B parameters that rely on semantic-first diffusion, a new latent diffusion approach that places semantic guidance at the core of the training process. This design allows the largest model to reach competitive or better results on GenEval, DPG, LongTextBench, OneIG, and CVTG-2K while using only 125K A800 GPU hours. The work also supplies distilled few-step turbo versions at each scale. The central demonstration is that semantic guidance, previously tested only at small scale on simple data, can be scaled to foundation-model regimes and still deliver substantial efficiency gains.

Core claim

SeFi-Image is a text-to-image foundation model built on semantic-first diffusion, a latent diffusion paradigm that prioritizes semantic guidance. Instantiated at 1B, 2B, and 5B parameters, the 5B model was trained with 125K A800 GPU hours—roughly 10-20 percent of the compute used by Z-Image—yet produces results comparable or superior to Qwen-Image and Z-Image across standard benchmarks. Distilled DMD2 few-step variants are provided for each scale to support varied deployment constraints.

What carries the argument

semantic-first diffusion, a latent diffusion modeling paradigm that structures the generation process around semantic guidance to accelerate training and improve efficiency at scale.

If this is right

  • Three model scales allow deployment choices matched to different compute budgets.
  • DMD2-distilled few-step variants reduce inference latency while preserving the base performance.
  • Public release of code and weights supplies ready-to-use options for text-to-image generation.
  • The scaling study supplies data on how semantic guidance behaves as parameter count grows from 1B to 5B.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the efficiency pattern holds, organizations with modest hardware budgets could train competitive text-to-image models without relying on the largest industrial clusters.
  • The same semantic-first ordering might be tested on video or 3D generation to check whether the compute savings transfer to other modalities.
  • Further work could measure exactly how much of the reported gain comes from the semantic ordering versus other training details such as data curation or optimizer choices.

Load-bearing premise

The efficiency and performance gains from semantic guidance will continue when the method is applied to large models on complex, high-resolution data rather than the small-scale simple datasets used in earlier tests.

What would settle it

Training an otherwise identical 5B model without the semantic-first component on the same 125K GPU-hour budget and showing that it matches or exceeds SeFi-Image on the reported benchmarks.

Figures

Figures reproduced from arXiv: 2606.22568 by SeFi-Team.

Figure 1
Figure 1. Figure 1: Images generated by SEFI-IMAGE. 2 [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Additional images generated by SEFI-IMAGE. 3 [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Examples from the two-part text-rendered synthetic data pipeline. Part 1 uses plain text [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of VLM-based scores used by the SFT hard filtering gate. Scores are discrete [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Illustration of Semantic-First Diffusion. The semantic latent is denoised ahead of the [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Construction of semantic and texture latents for SFD. The texture VAE encodes low-level [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Overall framework of SEFI-IMAGE. The DiT takes the noisy composite latent, dual timestep embeddings, and text embeddings as input, and predicts velocity for both semantic and texture streams. where ts, tz ∈ [0, 1] denote the semantic and texture timesteps, respectively. A timestep of 0 corresponds to pure noise and 1 to the clean latent. Distinct timesteps for semantics and textures. To model semantics and… view at source ↗
Figure 8
Figure 8. Figure 8: Illustration of the architecture of Semantic VAE (SemVAE). [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 11
Figure 11. Figure 11: Natural scene examples generated by SEFI-IMAGE across multiple aspect ratios. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Anime-style examples generated by SEFI-IMAGE across multiple aspect ratios. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Portrait examples generated by SEFI-IMAGE across multiple aspect ratios. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Stylized generation examples produced by S [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Text-rich examples generated by SEFI-IMAGE across multiple aspect ratios. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: LongTextBench Avg during DMD2 distillation for different model scales. The 1B curve [PITH_FULL_IMAGE:figures/full_fig_p032_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Training-free 1440px samples generated by the 5B version of S [PITH_FULL_IMAGE:figures/full_fig_p034_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Additional training-free 1440px samples generated by the 5B version of S [PITH_FULL_IMAGE:figures/full_fig_p035_18.png] view at source ↗
read the original abstract

Training image generation foundation models consumes substantial resources. Previous methods have attempted to leverage semantic guidance to accelerate the training process, yet their experiments were only conducted on simple datasets such as ImageNet, at low resolutions, and with small-scale models. In this paper, we propose SeFi-Image, a text-to-image foundation model built upon semantic-first diffusion, a novel latent diffusion modeling paradigm. We instantiate SeFi-Image at three model scales, 1B, 2B, and 5B parameters, enabling systematic study of scaling behavior and flexible deployment under varying compute budgets. Notably, our largest 5B model was trained with merely 125K A800 GPU hours, corresponding to roughly 10-20% of the training compute used by Z-Image. However, it achieves results comparable to or even superior to Qwen-Image and Z-Image. Despite this modest training compute, SeFi-Image achieves strong performance on a wide range of benchmarks, including GenEval, DPG, LongTextBench, OneIG, and CVTG-2K. Moreover, we provide DMD2-distilled few-step turbo variants for each model scale to accommodate diverse hardware constraints and latency requirements. We publicly release our code, weights and hope this work offers the community useful insights into semantic-guided diffusion modeling for T2I generation, while also providing practical and readily deployable model options.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SeFi-Image, a text-to-image foundation model based on semantic-first diffusion. It describes instantiating the approach at 1B, 2B, and 5B parameter scales, with the 5B model trained using only 125K A800 GPU hours (claimed 10-20% of Z-Image compute) while achieving comparable or superior results to Z-Image and Qwen-Image on benchmarks including GenEval, DPG, LongTextBench, OneIG, and CVTG-2K. The work also provides DMD2-distilled few-step variants and publicly releases code and weights.

Significance. If the efficiency and performance claims hold, the semantic-first diffusion paradigm could meaningfully reduce the compute required for high-quality T2I foundation models at billion-parameter scale. The explicit release of code, weights, and turbo variants is a concrete strength that supports reproducibility and deployment.

major comments (2)
  1. [Abstract] Abstract: the claim that the 5B model achieves results 'comparable to or even superior' to Z-Image and Qwen-Image with 10-20% of the training compute is load-bearing for the central contribution, yet the provided text supplies no scaling curves, ablations on semantic guidance at 5B scale, or quantitative comparison tables that would allow evaluation of whether the efficiency multiplier transfers from prior ImageNet experiments.
  2. [Abstract] Abstract: the manuscript states that prior semantic-guidance experiments were limited to ImageNet, low resolution, and small models, but presents the 5B T2I result as direct evidence of successful scaling without any description of the conditioning injection mechanism, loss formulation, or architecture diagram for the latent diffusion backbone at this scale.
minor comments (1)
  1. The abstract refers to a 'systematic study of scaling behavior' across the three model sizes but does not list the specific metrics or protocols used for that study.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments on strengthening the abstract's support for the central claims. We respond point-by-point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the 5B model achieves results 'comparable to or even superior' to Z-Image and Qwen-Image with 10-20% of the training compute is load-bearing for the central contribution, yet the provided text supplies no scaling curves, ablations on semantic guidance at 5B scale, or quantitative comparison tables that would allow evaluation of whether the efficiency multiplier transfers from prior ImageNet experiments.

    Authors: We agree the abstract is concise and will revise it to explicitly reference the benchmark results (GenEval, DPG, LongTextBench, OneIG, CVTG-2K) and scaling observations across 1B/2B/5B scales presented in Sections 4 and 5. The manuscript body contains the quantitative comparison tables. We will also add discussion clarifying how the efficiency gains observed at smaller scales transfer. However, dedicated scaling curves and 5B-scale ablations on the semantic guidance component were not performed. revision: partial

  2. Referee: [Abstract] Abstract: the manuscript states that prior semantic-guidance experiments were limited to ImageNet, low resolution, and small models, but presents the 5B T2I result as direct evidence of successful scaling without any description of the conditioning injection mechanism, loss formulation, or architecture diagram for the latent diffusion backbone at this scale.

    Authors: The abstract is high-level by design. Section 3 of the manuscript details the semantic-first diffusion paradigm, including the conditioning injection mechanism (semantic feature cross-attention), loss formulation, and the latent diffusion backbone architecture (with diagram in Figure 2), which applies at all scales including 5B. We will revise the abstract to include a concise reference to these elements. revision: yes

standing simulated objections not resolved
  • We did not perform scaling curves or ablations on semantic guidance at the 5B scale, as these were outside the scope of the original experiments and additional runs at this scale are not feasible due to compute cost.

Circularity Check

0 steps flagged

No circularity: empirical efficiency claim stands on reported training outcomes without reduction to fitted inputs or self-citation chains.

full rationale

The paper's central claim is an empirical report of training a 5B-parameter model in 125K A800 GPU hours while matching or exceeding baselines on GenEval, DPG, and other benchmarks. The abstract introduces semantic-first diffusion as a novel paradigm but supplies no equations, scaling derivations, or self-citations that would make the efficiency multiplier equivalent to its inputs by construction. Prior semantic-guidance work is cited only as motivation for the new experiments; the 5B result is presented as a direct training outcome rather than a prediction derived from small-scale fits. No load-bearing step reduces to self-definition, fitted-input renaming, or imported uniqueness theorems. The derivation chain is therefore self-contained as an empirical demonstration.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5778 in / 1111 out tokens · 34062 ms · 2026-06-29T05:17:19.823747+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

  2. [2]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the 41st Internatio...

  3. [3]

    Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022

  4. [4]

    Seedream 3.0 technical report, 2025

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report, 2025

  5. [5]

    Seedream 4.0: Toward next-generation multimodal image generation, 2025

    Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, et al. Seedream 4.0: Toward next-generation multimodal image generation, 2025

  6. [6]

    Qwen-image technical report, 2025

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report, 2025

  7. [7]

    Z-image: An efficient image generation foundation model with single-stream diffusion transformer, 2025

    Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, and Shilin Zhou. Z-image: An efficient image generation foundation model with single-stream diffusion transformer, 2025

  8. [8]

    Hunyuanimage 3.0 technical report, 2025

    Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report, 2025

  9. [9]

    Longcat-image technical report, 2025

    Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, Xunliang Cai, Yayong Guan, and Jie Hu. Longcat-image technical report, 2025

  10. [10]

    Diffusion transformers with representation autoencoders, 2025

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders, 2025

  11. [11]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. InInternational Conference on Learning Representations, 2025

  12. [12]

    Reconstruction vs

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming opti- mization dilemma in latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  13. [13]

    Boosting generative image modeling via joint image-feature synthesis.Advances in Neural Information Processing Systems, 38:16685–16714, 2026

    Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyridon Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis.Advances in Neural Information Processing Systems, 38:16685–16714, 2026

  14. [14]

    Representation entanglement for gen- eration: Training diffusion transformers is much easier than you think.Advances in Neural Information Processing Systems, 38:7714–7743, 2026

    Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Ming-Ming Cheng, et al. Representation entanglement for gen- eration: Training diffusion transformers is much easier than you think.Advances in Neural Information Processing Systems, 38:7714–7743, 2026

  15. [15]

    Semantics lead the way: Harmonizing semantic and texture modeling with asynchronous latent diffusion, 2025

    Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, and Nanning Zheng. Semantics lead the way: Harmonizing semantic and texture modeling with asynchronous latent diffusion, 2025. 24

  16. [16]

    GANs trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. InAdvances in Neural Information Processing Systems, 2017

  17. [17]

    Berg, and Li Fei-Fei

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge.International Journal of Computer Vision, 115(3):211–252, 2015

  18. [18]

    Scaling text-to-image diffusion transformers with representation autoencoders.arXiv preprint arXiv:2601.16208, 2026

    Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, and Saining Xie. Scaling text-to-image diffusion transformers with representation autoencoders.arXiv preprint arXiv:2601.16208, 2026

  19. [19]

    Svg-t2i: Scaling up text-to-image latent diffusion model without variational autoencoder.arXiv preprint arXiv:2512.11749, 2025

    Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, et al. Svg-t2i: Scaling up text-to-image latent diffusion model without variational autoencoder.arXiv preprint arXiv:2512.11749, 2025

  20. [20]

    GenEval: An object-focused framework for evaluating text-to-image alignment, 2023

    Dhruba Ghosh, Hanna Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment, 2023

  21. [21]

    ELLA: Equip diffusion models with LLM for enhanced semantic alignment, 2024

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. ELLA: Equip diffusion models with LLM for enhanced semantic alignment, 2024

  22. [22]

    X-Omni: Reinforcement learning makes discrete autoregressive image generative models great again, 2025

    Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, and Di Wang. X-Omni: Reinforcement learning makes discrete autoregressive image generative models great again, 2025

  23. [23]

    OneIG-Bench: Omni-dimensional nuanced evaluation for image generation, 2025

    Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. OneIG-Bench: Omni-dimensional nuanced evaluation for image generation, 2025

  24. [24]

    Investigating text insulation and attention mechanisms for complex visual text generation, 2025

    Ying Tai, Nikai Du, Rui Xie, Zhennan Chen, Qian Wang, Zhengkai Jiang, Kai Zhang, and Jian Yang. Investigating text insulation and attention mechanisms for complex visual text generation, 2025

  25. [25]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

  26. [26]

    Improving image generation with better captions, 2023

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions, 2023. URL https://cdn.openai.com/papers/dall-e-3.pdf

  27. [27]

    Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

    Dong Chen, Fangyun Wei, Ziyu Wan, Dongdong Chen, Jiawei Zhang, Jinjing Zhao, Sirui Zhang, Yang Yue, Zhiyang Liang, Baining Guo, et al. Lens: Rethinking training efficiency for foundational text-to-image models.arXiv preprint arXiv:2605.21573, 2026

  28. [28]

    Fine-T2I: An open, large-scale, and diverse dataset for high-quality T2I fine-tuning, 2026

    Xu Ma, Yitian Zhang, Qihua Dong, and Yun Fu. Fine-T2I: An open, large-scale, and diverse dataset for high-quality T2I fine-tuning, 2026

  29. [29]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patric...

  30. [30]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations, 2023

  31. [31]

    FLUX.2: Frontier visual intelligence

    Black Forest Labs. FLUX.2: Frontier visual intelligence. https://bfl.ai/blog/flux-2, 2025. 25

  32. [32]

    FLUX.https://github.com/black-forest-labs/flux, 2024

    Black Forest Labs. FLUX.https://github.com/black-forest-labs/flux, 2024

  33. [33]

    Efros, Eli Shechtman, and Oliver Wang

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018

  34. [34]

    Qwen3-VL technical report, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, et al. Qwen3-VL technical report, 2025

  35. [35]

    SVG: Latent diffusion model without variational autoencoder.arXiv:2510.15301, 2025

    Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder, 2025. URLhttps://arxiv.org/abs/2510.15301

  36. [36]

    Qwen-Image-V AE-2.0 technical report, 2026

    Zekai Zhang, Deqing Li, Kuan Cao, Yujia Wu, Chenfei Wu, Yu Wu, Liang Peng, Hao Meng, Jiahao Li, Jie Zhang, et al. Qwen-Image-V AE-2.0 technical report, 2026

  37. [37]

    Kodak lossless true color image suite.http://r0k.us/graphics/ kodak/, 1993

    Eastman Kodak Company. Kodak lossless true color image suite.http://r0k.us/graphics/ kodak/, 1993

  38. [38]

    Bovik, Hamid R

    Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4): 600–612, 2004

  39. [39]

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis, 2024

  40. [40]

    DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process.arXiv preprint arXiv:2509.16117, 2025

  41. [41]

    Awaking spatial intelligence in unified multimodal understanding and generation, 2026

    Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, and Nan Duan. Awaking spatial intelligence in unified multimodal understanding and generation, 2026

  42. [42]

    CLIPScore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021. 26 A Additional Related Work Foundational text-to-image models.Recent text-to-image foundation models have made substan...

  43. [43]

    Produce: - short_caption: concise and high-signal - dense_caption: complete, specific, and concise

  44. [44]

    Cover the main subjects, main action or state, main scene, key spatial relations, reliable counts, and clearly visible attributes when important

  45. [45]

    short_caption must retain the core facts from dense_caption, especially the main subject, main action or state, main scene, and key count, OCR, or abnormal details when present

  46. [46]

    Use cautious wording when details are uncertain because of blur, occlusion, crop, low resolution, overexposure, or partial visibility

  47. [47]

    Use viewer perspective consistently for spatial relations such as left, right, top, bottom, front, and behind

  48. [48]

    dense_caption should be objective and natural, without marketing language, storytelling, or aesthetic praise

  49. [49]

    short_caption

    If legible text exists in the image, put quoted visible text in double quotes. Output strict JSON only: { "short_caption": { "en": "...", "zh": "..." }, "dense_caption": { "en": "...", "zh": "..." } } B.2 SFT Metadata and Caption Prompts For supervised fine-tuning data, we use a two-stage VLM annotation workflow. The first prompt extracts structured metad...