pith. machine review for the scientific record.

arxiv: 2503.07703 · v1 · submitted 2025-03-10 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 08:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords: bilingual image generation · diffusion models · text rendering · RLHF · cultural nuances · prompt following · foundation model · image editing

The pith

Seedream 2.0 uses a self-developed bilingual LLM text encoder to generate high-fidelity images from Chinese or English prompts with accurate cultural nuances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Seedream 2.0 as a foundation model for image generation that handles prompts in Chinese and English natively. It builds custom data and caption systems alongside a self-developed bilingual large language model that serves as the text encoder. This setup lets the model acquire cultural and linguistic details straight from massive datasets. The result is stronger performance in following prompts, producing pleasing aesthetics, rendering text correctly, and maintaining structural accuracy. Additional RLHF training aligns outputs with human preferences, as measured by high ELO scores, and the model adapts readily to instruction-based editing tasks.
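
To make the architecture claim concrete, here is a minimal sketch of the pattern the summary describes: hidden states from a bilingual LLM, rather than a CLIP-style text embedding, condition a diffusion transformer through cross-attention. The paper does not publish code or layer sizes, so every module name and dimension below (BilingualLLMEncoder, CrossAttnDiTBlock, the 1024-wide states) is illustrative, not Seedream's.

```python
# Illustrative sketch only: the paper releases no code, and these module names,
# sizes, and the encoder stand-in are hypothetical. It shows the general pattern
# the review describes: one bilingual text encoder conditions the image model
# directly, with no translation layer between Chinese and English prompts.
import torch
import torch.nn as nn

class BilingualLLMEncoder(nn.Module):
    """Stand-in for a self-developed bilingual LLM used as a text encoder."""
    def __init__(self, vocab_size=65536, dim=1024, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, token_ids):                      # (B, T) -> (B, T, dim)
        return self.encoder(self.embed(token_ids))

class CrossAttnDiTBlock(nn.Module):
    """One diffusion-transformer block attending to the LLM's hidden states."""
    def __init__(self, dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, latent_tokens, text_states):
        x = latent_tokens
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.cross_attn(x, text_states, text_states)[0]  # text conditioning
        return x + self.mlp(x)

# Usage: the same forward path serves Chinese or English prompts, because the
# conditioning stream comes from one bilingual encoder rather than a translation step.
prompt_ids = torch.randint(0, 65536, (2, 32))     # fake bilingual token ids
text_states = BilingualLLMEncoder()(prompt_ids)   # (2, 32, 1024)
latents = torch.randn(2, 256, 1024)               # fake image latent tokens
print(CrossAttnDiTBlock()(latents, text_states).shape)  # torch.Size([2, 256, 1024])
```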

Core claim

Seedream 2.0 achieves state-of-the-art performance across prompt-following, aesthetics, text rendering, and structural correctness. It is a native Chinese-English bilingual image generation foundation model integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enables high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability.
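
The abstract names Scaled ROPE but does not define it, so the sketch below shows the generic idea such a component would implement: rescale token positions at an unseen resolution back into the range of rotary angles the attention layers saw during training. The 1D simplification, the base frequency, and the lengths are assumptions for illustration, not Seedream's exact variant.

```python
# Hedged sketch of position-scaled rotary embeddings (RoPE, Su et al.); the exact
# "Scaled ROPE" used by Seedream 2.0 is not specified in the abstract.
import torch

def rope_angles(positions, dim, base=10000.0):
    """Rotary angles for 1D positions; `dim` (per-head feature size) must be even."""
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)   # (dim/2,)
    return positions[:, None].float() * freqs[None, :]          # (N, dim/2)

def scaled_positions(test_len, train_len):
    """Rescale test-time positions into the position range seen during training."""
    scale = train_len / test_len        # < 1 when the test resolution is larger
    return torch.arange(test_len) * scale

def apply_rope(x, angles):
    """Rotate feature pairs of x (N, dim) by the given angles (N, dim/2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# A model trained with 64 tokens per axis sees 96 at inference; scaling keeps every
# rotary angle inside the band the attention layers were trained on.
q = torch.randn(96, 64)
angles = rope_angles(scaled_positions(96, train_len=64), dim=64)
print(apply_rope(q, angles).shape)   # torch.Size([96, 64])
```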

What carries the argument

Self-developed bilingual large language model as text encoder that learns native knowledge directly from data

If this is right

  • The model produces images that accurately reflect Chinese cultural details when given Chinese prompts.
  • Glyph-Aligned ByT5 enables precise character-level text rendering without distortion in generated images (see the byte-level sketch after this list).
  • Scaled ROPE allows the model to handle image resolutions not seen during training.
  • RLHF iterations produce outputs that align closely with human preferences in direct comparisons.
  • The same base model adapts to instruction-based editing while preserving image consistency.
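
On the Glyph-Aligned ByT5 point above: the reason a byte-level encoder can support character-level rendering is that every character maps to an explicit span of byte tokens that a glyph module can be aligned to, which subword tokenizers do not guarantee. The demonstration below is purely illustrative plain UTF-8 byte indexing, not the actual ByT5 tokenizer or Seedream's alignment mechanism.

```python
# Illustrative only: mimics byte-level tokenization (ByT5 operates on raw UTF-8 bytes)
# to show that each character, Chinese or Latin, owns an addressable token span.

def byte_tokens(text: str):
    """Return (character, byte-token span) pairs under UTF-8 byte tokenization."""
    spans, offset = [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))     # a Chinese character takes 3 bytes in UTF-8
        spans.append((ch, list(range(offset, offset + n))))
        offset += n
    return spans

for ch, span in byte_tokens("春节 Sale"):
    print(f"{ch!r}: byte tokens {span}")
# '春': byte tokens [0, 1, 2]
# '节': byte tokens [3, 4, 5]
# ' ': byte tokens [6]
# 'S': byte tokens [7]  ... every glyph to be rendered has its own token span.
```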

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Native encoder designs like this one could be replicated for other language pairs to reduce cultural bias in image generation.
  • Learning directly from data rather than through English translation layers may cut down on common multilingual errors.
  • Strong human preference alignment suggests the approach could support more personalized creative tools.

Load-bearing premise

The self-developed bilingual LLM text encoder and the custom data and caption systems allow the model to learn native Chinese knowledge directly from data without introducing new biases or requiring post-hoc fixes.

What would settle it

A blind rating study in which native Chinese speakers compare images generated from Chinese cultural prompts by Seedream 2.0 versus models such as Flux or SD3.5 and find no consistent advantage in cultural accuracy or text correctness.
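
Any such study, like the paper's own human evaluation, has to turn pairwise rater votes into a ranking; the abstract reports this as an ELO score. Below is a minimal sketch of the standard Elo update (reference [6]) that such an aggregation would use; the K-factor, starting ratings, and vote data are placeholders, not the paper's protocol.

```python
# Minimal Elo aggregation of pairwise preference votes; constants are conventional
# defaults, not values taken from the paper.

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings after one pairwise comparison (zero-sum update)."""
    delta = k * ((1.0 if a_won else 0.0) - expected(r_a, r_b))
    return r_a + delta, r_b - delta

ratings = {"seedream": 1000.0, "baseline": 1000.0}
votes = [True, True, False, True]        # hypothetical rater choices favoring Seedream
for seedream_won in votes:
    ratings["seedream"], ratings["baseline"] = update(
        ratings["seedream"], ratings["baseline"], seedream_won)
print(ratings)   # the gap after many votes is the reported "ELO score" difference
```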

read the original abstract

Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney, still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompt in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness for image description. Particularly, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enable it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Beside, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Seedream 2.0, a diffusion-based native Chinese-English bilingual image generation foundation model. It describes a custom data system, caption system balancing accuracy and richness, integration of a self-developed bilingual LLM as text encoder to acquire native knowledge directly from data, Glyph-Aligned ByT5 for character-level text rendering, Scaled ROPE for resolution generalization, and multi-phase post-training with SFT and RLHF iterations for human preference alignment. The central claims are that the model achieves state-of-the-art performance in prompt-following, aesthetics, text rendering, and structural correctness, delivers outstanding ELO scores, and can be adapted to instruction-based image editing while balancing instruction-following and consistency.

Significance. If the SOTA and native bilingual performance claims hold with rigorous evidence, the work would represent a meaningful advance in multilingual image generation by addressing cultural nuances and text rendering limitations in existing models. The combination of a custom bilingual encoder with RLHF alignment and editing adaptability could influence development of culturally inclusive foundation models, provided the contributions are isolated and independently verified.

major comments (2)
  1. [Abstract] Abstract: The assertion of state-of-the-art performance 'across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness' and 'outstanding ELO score' after 'extensive experimentation' and 'multiple RLHF iterations' provides no quantitative metrics, baseline comparisons (e.g., against Flux or SD3.5), evaluation protocols, error bars, or ablation studies. This directly undermines validation of the central SOTA and human-preference-alignment claims.
  2. [Abstract] Abstract and methodology description: No isolating ablations are reported for the self-developed bilingual LLM text encoder or the custom data/caption systems. The claim that these components enable 'learn[ing] native knowledge directly from massive data' without introducing new biases or requiring post-hoc fixes is therefore load-bearing but untested, leaving open whether gains derive from the claimed native mechanism, scale, or other factors.
minor comments (1)
  1. [Abstract] Abstract: Grammatical issues include 'This enable it' (should be 'This enables it') and 'Beside' (should be 'Besides').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to better substantiate our central claims in the abstract and methodology sections. We address each point below and commit to revisions that strengthen the evidence without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion of state-of-the-art performance 'across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness' and 'outstanding ELO score' after 'extensive experimentation' and 'multiple RLHF iterations' provides no quantitative metrics, baseline comparisons (e.g., against Flux or SD3.5), evaluation protocols, error bars, or ablation studies. This directly undermines validation of the central SOTA and human-preference-alignment claims.

    Authors: We agree that the abstract, constrained by length, omits specific quantitative metrics and direct baseline comparisons, which weakens the immediate visibility of the SOTA claims. The full manuscript details these in the Experiments and Evaluation sections, including ELO scores from human studies, comparisons against Flux and SD3.5, evaluation protocols for prompt-following and text rendering, and structural correctness metrics. To directly address this, we will revise the abstract to include key quantitative highlights (e.g., relative ELO improvements and specific benchmark scores) while preserving brevity. We will also ensure error bars and protocol summaries are more explicitly cross-referenced in the main text. revision: yes

  2. Referee: [Abstract] Abstract and methodology description: No isolating ablations are reported for the self-developed bilingual LLM text encoder or the custom data/caption systems. The claim that these components enable 'learn[ing] native knowledge directly from massive data' without introducing new biases or requiring post-hoc fixes is therefore load-bearing but untested, leaving open whether gains derive from the claimed native mechanism, scale, or other factors.

    Authors: We recognize that isolating ablations would provide stronger evidence for attributing gains specifically to the bilingual LLM text encoder and custom data/caption systems rather than scale alone. The current manuscript supports the overall performance through end-to-end results and cultural nuance evaluations but does not present component-wise ablations. We will add a dedicated ablation subsection in the revised version, including controlled experiments that isolate the contribution of the self-developed LLM and caption balancing approach, to clarify the native knowledge integration mechanism. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical model description and performance claims

full rationale

The paper presents an engineering description of Seedream 2.0, a bilingual diffusion model incorporating a self-developed LLM text encoder, custom data/caption systems, Glyph-Aligned ByT5, Scaled ROPE, and post-training via SFT and RLHF. Performance claims (SOTA on prompt-following, aesthetics, text rendering, structural correctness, and ELO alignment) are asserted via 'extensive experimentation' without any mathematical derivation chain, equations, or first-principles predictions. No step reduces by construction to its inputs: RLHF is a standard optimization procedure whose output alignment is measured by ELO, which the text does not equate to the training preference data itself. Self-developed components are described but not justified solely by overlapping self-citations. The paper is self-contained against external benchmarks in the sense that its central claims rest on reported empirical results rather than tautological re-labeling of fitted quantities.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claims rest on standard diffusion model assumptions plus several new components whose independent validation is not provided in the abstract.

free parameters (1)
  • training hyperparameters and scaling factors
    Standard ML training choices that are fitted or tuned but not enumerated.
axioms (1)
  • domain assumption: Diffusion models can be trained to high fidelity on bilingual data when paired with a native LLM encoder.
    Invoked to justify the core architecture choice.
invented entities (2)
  • Glyph-Aligned ByT5 no independent evidence
    purpose: Flexible character-level text rendering in generated images
    New module introduced for text handling.
  • Scaled ROPE no independent evidence
    purpose: Generalization to untrained image resolutions
    Modified position encoding presented as a contribution.

pith-pipeline@v0.9.0 · 5706 in / 1443 out tokens · 59905 ms · 2026-05-17T08:20:40.510849+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  2. Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

    cs.CV 2026-03 unverdicted novelty 7.0

    SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

  3. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  4. PixelGen: Improving Pixel Diffusion with Perceptual Supervision

    cs.CV 2026-02 accept novelty 6.0

    PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.

  5. Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

    cs.CV 2025-12 unverdicted novelty 6.0

    Seedance 1.5 pro is a joint audio-visual generation model achieving high synchronization via dual-branch diffusion transformer and post-training optimizations.

  6. DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

    cs.CV 2025-11 conditional novelty 6.0

    DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while clos...

  7. DanceGRPO: Unleashing GRPO on Visual Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.

  8. Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...

  9. A Systematic Post-Train Framework for Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  10. LongCat-Image Technical Report

    cs.CV 2025-12 unverdicted novelty 5.0

    LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.

  11. Qwen-Image Technical Report

    cs.CV 2025-08 unverdicted novelty 5.0

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...

  12. Qwen-Image-2.0 Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.

  13. MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

    cs.CV 2026-04 unverdicted novelty 4.0

    MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

  14. Seedance 1.0: Exploring the Boundaries of Video Generation Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.

  15. Seedream 3.0 Technical Report

    cs.CV 2025-04 unverdicted novelty 4.0

    Seedream 3.0 improves bilingual image generation through doubled defect-aware data, mixed-resolution training, cross-modality RoPE, representation alignment, aesthetic SFT, VLM reward modeling, and importance-aware ti...

  16. Seedance 2.0: Advancing Video Generation for World Complexity

    cs.CV 2026-04 unverdicted novelty 3.0

    Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.

  17. Seedream 4.0: Toward Next-generation Multimodal Image Generation

    cs.CV 2025-09 unverdicted novelty 3.0

    Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 17 Pith papers · 8 internal anchors

  1. [1]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023

  2. [2]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

  3. [3]

    Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22560–22570, October 2023

  4. [4]

    Textdiffuser-2: Unleashing the power of language models for text rendering

    Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. In European Conference on Computer Vision, pages 386–402. Springer, 2024

  5. [5]

    Altclip: Altering the language encoder in clip for extended language capabilities

    Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. Altclip: Altering the language encoder in clip for extended language capabilities. arXiv preprint arXiv:2211.06679, 2022

  6. [6]

    The proposed uscf rating system, its development, theory, and applications

    Arpad Emmerich Elo. The proposed uscf rating system, its development, theory, and applications. Chess Life, XXII(8):242–247, 1967

  7. [7]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

  8. [8]

    Evalmuse-40k: A reliable and fine-grained benchmark with comprehensive human annotations for text-to-image generation model evaluation, 2024

    Shuhao Han, Haotian Fan, Jiachen Fu, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Chunle Guo, and Chongyi Li. Evalmuse-40k: A reliable and fine-grained benchmark with comprehensive human annotations for text-to-image generation model evaluation, 2024. URL https://arxiv.org/abs/2412.18150

  9. [9]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning, 2022. URL https://arxiv.org/abs/2104.08718

  10. [10]

    Ideogram

    Ideogram. Ideogram. https://about.ideogram.ai/2.0, 2024

  11. [11]

    Rethinking fid: Towards a better evaluation metric for image generation, 2024

    Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation, 2024. URL https://arxiv.org/abs/2401.09603

  12. [12]

    Adaface: Quality adaptive margin for face recognition

    Minchul Kim, Anil K Jain, and Xiaoming Liu. Adaface: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18750–18759, 2022

  13. [13]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023

  14. [14]

    Controlnet++: Improving conditional controls with efficient consistency feedback

    Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. Controlnet++: Improving conditional controls with efficient consistency feedback. In European Conference on Computer Vision, pages 129–147. Springer, 2025

  15. [15]

    Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024

  16. [16]

    A technique for the measurement of attitudes

    Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 140:1–55, 1932

  17. [17]

    Evaluating text-to-visual generation with image-to-text generation

    Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291, 2024

  18. [18]

    Glyph-byt5: A customized text encoder for accurate visual text rendering

    Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, and Yuhui Yuan. Glyph-byt5: A customized text encoder for accurate visual text rendering. In European Conference on Computer Vision, pages 361–377. Springer, 2024

  19. [19]

    Glyph-byt5-v2: A strong aesthetic baseline for accurate multilingual visual text rendering

    Zeyu Liu, Weicong Liang, Yiming Zhao, Bohan Chen, Lin Liang, Lijuan Wang, Ji Li, and Yuhui Yuan. Glyph-byt5-v2: A strong aesthetic baseline for accurate multilingual visual text rendering. arXiv preprint arXiv:2406.10208, 2024

  20. [20]

    Meitu. Meitu. https://www.whee.com/ai/text-to-image, 2024

  21. [21]

    Midjourney v6.1

    Midjourney. Midjourney v6.1. https://www.midjourney.com/, 2024

  22. [22]

    Gpt-4o system card, 2024

    OpenAI: Aaron Hurst, Adam Lerer, et al. Gpt-4o system card, 2024. URL https://arxiv.org/abs/2410.21276

  23. [23]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  24. [24]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  25. [25]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

  26. [26]

    Recraft v3

    Recraft. Recraft v3. https://www.recraft.ai/projects, 2024

  27. [27]

    Hyper-sd: Trajectory segmented consistency model for efficient image synthesis

    Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-sd: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems, 37:117340–117362, 2025

  28. [28]

    Seededit: Align image re-generation to image editing, 2024

    Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing, 2024. URL https://arxiv.org/abs/2411.06686

  29. [29]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  30. [30]

    Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis

    Kolors Team. Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis. arXiv preprint, 2024

  31. [31]

    Tencent. Hunyuan. https://console.cloud.tencent.com/hunyuan/experience/image, 2024

  32. [32]

    Anytext: Multilingual visual text generation and editing

    Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing. arXiv preprint arXiv:2311.03054, 2023

  33. [33]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  34. [34]

    Vmix: Improving text-to-image diffusion model with cross-attention mixing control

    Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, and Qian He. Vmix: Improving text-to-image diffusion model with cross-attention mixing control. arXiv preprint arXiv:2412.20800, 2024

  35. [35]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023

  36. [36]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024

  37. [37]

    Byt5: Towards a token-free future with pre-trained byte-to-byte models

    Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. Byt5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022

  38. [38]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023

  39. [39]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789

  40. [40]

    Jedi: Joint-image diffusion models for finetuning-free personalized text-to-image generation

    Yu Zeng, Vishal M Patel, Haochen Wang, Xun Huang, Ting-Chun Wang, Ming-Yu Liu, and Yogesh Balaji. Jedi: Joint-image diffusion models for finetuning-free personalized text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6786–6795, 2024

  41. [41]

    Onlinevpo: Align video diffusion model with online video-centric preference optimization

    Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, and Kai Han. Onlinevpo: Align video diffusion model with online video-centric preference optimization. arXiv preprint arXiv:2412.15159, 2024

  42. [42]

    Unifl: Improve stable diffusion via unified feedback learning

    Jiacheng Zhang, Jie Wu, Yuxi Ren, Xin Xia, Huafeng Kuang, Pan Xie, Jiashi Li, Xuefeng Xiao, Min Zheng, Lean Fu, et al. Unifl: Improve stable diffusion via unified feedback learning. arXiv preprint arXiv:2404.05595, 2024

  43. [43]

    Learning multi-dimensional human preference for text-to-image generation

    Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8018–8027, 2024

  44. [44]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023