pith. machine review for the scientific record.

arxiv: 2503.07703 · v1 · submitted 2025-03-10 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 08:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords: bilingual image generation · diffusion models · text rendering · RLHF · cultural nuances · prompt following · foundation model · image editing

The pith

Seedream 2.0 uses a self-developed bilingual LLM text encoder to generate high-fidelity images from Chinese or English prompts with accurate cultural nuances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Seedream 2.0 as a foundation model for image generation that handles prompts in Chinese and English natively. It builds custom data and caption systems alongside a self-developed bilingual large language model that serves as the text encoder. This setup lets the model acquire cultural and linguistic details straight from massive datasets. The result is stronger performance in following prompts, producing pleasing aesthetics, rendering text correctly, and maintaining structural accuracy. Additional RLHF training aligns outputs with human preferences, as measured by high ELO scores, and the model adapts readily to instruction-based editing tasks.
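
To make the architecture claim concrete, here is a minimal sketch of the pattern the summary describes: hidden states from a bilingual LLM, rather than a CLIP-style text embedding, condition a diffusion transformer through cross-attention. The paper does not publish code or layer sizes, so every module name and dimension below (BilingualLLMEncoder, CrossAttnDiTBlock, the 1024-wide states) is illustrative, not Seedream's.

```python
# Illustrative sketch only: the paper releases no code, and these module names,
# sizes, and the encoder stand-in are hypothetical. It shows the general pattern
# the review describes: one bilingual text encoder conditions the image model
# directly, with no translation layer between Chinese and English prompts.
import torch
import torch.nn as nn

class BilingualLLMEncoder(nn.Module):
    """Stand-in for a self-developed bilingual LLM used as a text encoder."""
    def __init__(self, vocab_size=65536, dim=1024, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, token_ids):                      # (B, T) -> (B, T, dim)
        return self.encoder(self.embed(token_ids))

class CrossAttnDiTBlock(nn.Module):
    """One diffusion-transformer block attending to the LLM's hidden states."""
    def __init__(self, dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, latent_tokens, text_states):
        x = latent_tokens
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.cross_attn(x, text_states, text_states)[0]  # text conditioning
        return x + self.mlp(x)

# Usage: the same forward path serves Chinese or English prompts, because the
# conditioning stream comes from one bilingual encoder rather than a translation step.
prompt_ids = torch.randint(0, 65536, (2, 32))     # fake bilingual token ids
text_states = BilingualLLMEncoder()(prompt_ids)   # (2, 32, 1024)
latents = torch.randn(2, 256, 1024)               # fake image latent tokens
print(CrossAttnDiTBlock()(latents, text_states).shape)  # torch.Size([2, 256, 1024])
```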

Core claim

Seedream 2.0 achieves state-of-the-art performance across prompt-following, aesthetics, text rendering, and structural correctness. It is a native Chinese-English bilingual image generation foundation model integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enables high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability.
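
The abstract names Scaled ROPE but does not define it, so the sketch below shows the generic idea such a component would implement: rescale token positions at an unseen resolution back into the range of rotary angles the attention layers saw during training. The 1D simplification, the base frequency, and the lengths are assumptions for illustration, not Seedream's exact variant.

```python
# Hedged sketch of position-scaled rotary embeddings (RoPE, Su et al.); the exact
# "Scaled ROPE" used by Seedream 2.0 is not specified in the abstract.
import torch

def rope_angles(positions, dim, base=10000.0):
    """Rotary angles for 1D positions; `dim` (per-head feature size) must be even."""
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)   # (dim/2,)
    return positions[:, None].float() * freqs[None, :]          # (N, dim/2)

def scaled_positions(test_len, train_len):
    """Rescale test-time positions into the position range seen during training."""
    scale = train_len / test_len        # < 1 when the test resolution is larger
    return torch.arange(test_len) * scale

def apply_rope(x, angles):
    """Rotate feature pairs of x (N, dim) by the given angles (N, dim/2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# A model trained with 64 tokens per axis sees 96 at inference; scaling keeps every
# rotary angle inside the band the attention layers were trained on.
q = torch.randn(96, 64)
angles = rope_angles(scaled_positions(96, train_len=64), dim=64)
print(apply_rope(q, angles).shape)   # torch.Size([96, 64])
```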

What carries the argument

Self-developed bilingual large language model as text encoder that learns native knowledge directly from data

If this is right

  • The model produces images that accurately reflect Chinese cultural details when given Chinese prompts.
  • Glyph-Aligned ByT5 enables precise character-level text rendering without distortion in generated images (see the byte-level sketch after this list).
  • Scaled ROPE allows the model to handle image resolutions not seen during training.
  • RLHF iterations produce outputs that align closely with human preferences in direct comparisons.
  • The same base model adapts to instruction-based editing while preserving image consistency.
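
On the Glyph-Aligned ByT5 point above: the reason a byte-level encoder can support character-level rendering is that every character maps to an explicit span of byte tokens that a glyph module can be aligned to, which subword tokenizers do not guarantee. The demonstration below is purely illustrative plain UTF-8 byte indexing, not the actual ByT5 tokenizer or Seedream's alignment mechanism.

```python
# Illustrative only: mimics byte-level tokenization (ByT5 operates on raw UTF-8 bytes)
# to show that each character, Chinese or Latin, owns an addressable token span.

def byte_tokens(text: str):
    """Return (character, byte-token span) pairs under UTF-8 byte tokenization."""
    spans, offset = [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))     # a Chinese character takes 3 bytes in UTF-8
        spans.append((ch, list(range(offset, offset + n))))
        offset += n
    return spans

for ch, span in byte_tokens("春节 Sale"):
    print(f"{ch!r}: byte tokens {span}")
# '春': byte tokens [0, 1, 2]
# '节': byte tokens [3, 4, 5]
# ' ': byte tokens [6]
# 'S': byte tokens [7]  ... every glyph to be rendered has its own token span.
```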

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Native encoder designs like this one could be replicated for other language pairs to reduce cultural bias in image generation.
  • Learning directly from data rather than through English translation layers may cut down on common multilingual errors.
  • Strong human preference alignment suggests the approach could support more personalized creative tools.

Load-bearing premise

The self-developed bilingual LLM text encoder and the custom data and caption systems allow the model to learn native Chinese knowledge directly from data without introducing new biases or requiring post-hoc fixes.

What would settle it

A blind rating study in which native Chinese speakers compare images generated from Chinese cultural prompts by Seedream 2.0 versus models such as Flux or SD3.5 and find no consistent advantage in cultural accuracy or text correctness.
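
Any such study, like the paper's own human evaluation, has to turn pairwise rater votes into a ranking; the abstract reports this as an ELO score. Below is a minimal sketch of the standard Elo update (reference [6]) that such an aggregation would use; the K-factor, starting ratings, and vote data are placeholders, not the paper's protocol.

```python
# Minimal Elo aggregation of pairwise preference votes; constants are conventional
# defaults, not values taken from the paper.

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings after one pairwise comparison (zero-sum update)."""
    delta = k * ((1.0 if a_won else 0.0) - expected(r_a, r_b))
    return r_a + delta, r_b - delta

ratings = {"seedream": 1000.0, "baseline": 1000.0}
votes = [True, True, False, True]        # hypothetical rater choices favoring Seedream
for seedream_won in votes:
    ratings["seedream"], ratings["baseline"] = update(
        ratings["seedream"], ratings["baseline"], seedream_won)
print(ratings)   # the gap after many votes is the reported "ELO score" difference
```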

read the original abstract

Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney, still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompt in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness for image description. Particularly, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enable it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Beside, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Seedream 2.0, a diffusion-based native Chinese-English bilingual image generation foundation model. It describes a custom data system, caption system balancing accuracy and richness, integration of a self-developed bilingual LLM as text encoder to acquire native knowledge directly from data, Glyph-Aligned ByT5 for character-level text rendering, Scaled ROPE for resolution generalization, and multi-phase post-training with SFT and RLHF iterations for human preference alignment. The central claims are that the model achieves state-of-the-art performance in prompt-following, aesthetics, text rendering, and structural correctness, delivers outstanding ELO scores, and can be adapted to instruction-based image editing while balancing instruction-following and consistency.

Significance. If the SOTA and native bilingual performance claims hold with rigorous evidence, the work would represent a meaningful advance in multilingual image generation by addressing cultural nuances and text rendering limitations in existing models. The combination of a custom bilingual encoder with RLHF alignment and editing adaptability could influence development of culturally inclusive foundation models, provided the contributions are isolated and independently verified.

major comments (2)
  1. [Abstract] Abstract: The assertion of state-of-the-art performance 'across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness' and 'outstanding ELO score' after 'extensive experimentation' and 'multiple RLHF iterations' provides no quantitative metrics, baseline comparisons (e.g., against Flux or SD3.5), evaluation protocols, error bars, or ablation studies. This directly undermines validation of the central SOTA and human-preference-alignment claims.
  2. [Abstract] Abstract and methodology description: No isolating ablations are reported for the self-developed bilingual LLM text encoder or the custom data/caption systems. The claim that these components enable 'learn[ing] native knowledge directly from massive data' without introducing new biases or requiring post-hoc fixes is therefore load-bearing but untested, leaving open whether gains derive from the claimed native mechanism, scale, or other factors.
minor comments (1)
  1. [Abstract] Abstract: Grammatical issues include 'This enable it' (should be 'This enables it') and 'Beside' (should be 'Besides').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to better substantiate our central claims in the abstract and methodology sections. We address each point below and commit to revisions that strengthen the evidence without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The assertion of state-of-the-art performance 'across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness' and 'outstanding ELO score' after 'extensive experimentation' and 'multiple RLHF iterations' provides no quantitative metrics, baseline comparisons (e.g., against Flux or SD3.5), evaluation protocols, error bars, or ablation studies. This directly undermines validation of the central SOTA and human-preference-alignment claims.

    Authors: We agree that the abstract, constrained by length, omits specific quantitative metrics and direct baseline comparisons, which weakens the immediate visibility of the SOTA claims. The full manuscript details these in the Experiments and Evaluation sections, including ELO scores from human studies, comparisons against Flux and SD3.5, evaluation protocols for prompt-following and text rendering, and structural correctness metrics. To directly address this, we will revise the abstract to include key quantitative highlights (e.g., relative ELO improvements and specific benchmark scores) while preserving brevity. We will also ensure error bars and protocol summaries are more explicitly cross-referenced in the main text. revision: yes

  2. Referee: [Abstract] Abstract and methodology description: No isolating ablations are reported for the self-developed bilingual LLM text encoder or the custom data/caption systems. The claim that these components enable 'learn[ing] native knowledge directly from massive data' without introducing new biases or requiring post-hoc fixes is therefore load-bearing but untested, leaving open whether gains derive from the claimed native mechanism, scale, or other factors.

    Authors: We recognize that isolating ablations would provide stronger evidence for attributing gains specifically to the bilingual LLM text encoder and custom data/caption systems rather than scale alone. The current manuscript supports the overall performance through end-to-end results and cultural nuance evaluations but does not present component-wise ablations. We will add a dedicated ablation subsection in the revised version, including controlled experiments that isolate the contribution of the self-developed LLM and caption balancing approach, to clarify the native knowledge integration mechanism. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical model description and performance claims

full rationale

The paper presents an engineering description of Seedream 2.0, a bilingual diffusion model incorporating a self-developed LLM text encoder, custom data/caption systems, Glyph-Aligned ByT5, Scaled ROPE, and post-training via SFT and RLHF. Performance claims (SOTA on prompt-following, aesthetics, text rendering, structural correctness, and ELO alignment) are asserted via 'extensive experimentation' without any mathematical derivation chain, equations, or first-principles predictions. No step reduces by construction to its inputs: RLHF is a standard optimization procedure whose output alignment is measured by ELO, which the text does not equate to the training preference data itself. Self-developed components are described but not justified solely by overlapping self-citations. The paper is self-contained against external benchmarks in the sense that its central claims rest on reported empirical results rather than tautological re-labeling of fitted quantities.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claims rest on standard diffusion model assumptions plus several new components whose independent validation is not provided in the abstract.

free parameters (1)
  • training hyperparameters and scaling factors
    Standard ML training choices that are fitted or tuned but not enumerated.
axioms (1)
  • domain assumption: Diffusion models can be trained to high fidelity on bilingual data when paired with a native LLM encoder.
    Invoked to justify the core architecture choice.
invented entities (2)
  • Glyph-Aligned ByT5 no independent evidence
    purpose: Flexible character-level text rendering in generated images
    New module introduced for text handling.
  • Scaled ROPE no independent evidence
    purpose: Generalization to untrained image resolutions
    Modified position encoding presented as a contribution.

pith-pipeline@v0.9.0 · 5706 in / 1443 out tokens · 59905 ms · 2026-05-17T08:20:40.510849+00:00 · methodology

discussion (0)


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  2. Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

    cs.CV 2026-03 unverdicted novelty 7.0

    SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

  3. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  4. PixelGen: Improving Pixel Diffusion with Perceptual Supervision

    cs.CV 2026-02 accept novelty 6.0

    PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.

  5. Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

    cs.CV 2025-12 unverdicted novelty 6.0

    Seedance 1.5 pro is a joint audio-visual generation model achieving high synchronization via dual-branch diffusion transformer and post-training optimizations.

  6. DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

    cs.CV 2025-11 conditional novelty 6.0

    DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while clos...

  7. DanceGRPO: Unleashing GRPO on Visual Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.

  8. Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...

  9. A Systematic Post-Train Framework for Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  10. LongCat-Image Technical Report

    cs.CV 2025-12 unverdicted novelty 5.0

    LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.

  11. Qwen-Image Technical Report

    cs.CV 2025-08 unverdicted novelty 5.0

    Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...

  12. Qwen-Image-2.0 Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.

  13. MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

    cs.CV 2026-04 unverdicted novelty 4.0

    MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.

  14. Seedance 1.0: Exploring the Boundaries of Video Generation Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.

  15. Seedream 3.0 Technical Report

    cs.CV 2025-04 unverdicted novelty 4.0

    Seedream 3.0 improves bilingual image generation through doubled defect-aware data, mixed-resolution training, cross-modality RoPE, representation alignment, aesthetic SFT, VLM reward modeling, and importance-aware ti...

  16. Seedance 2.0: Advancing Video Generation for World Complexity

    cs.CV 2026-04 unverdicted novelty 3.0

    Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.

  17. Seedream 4.0: Toward Next-generation Multimodal Image Generation

    cs.CV 2025-09 unverdicted novelty 3.0

    Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · cited by 17 Pith papers · 8 internal anchors

  1. [1]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023

  2. [2]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

  3. [3]

    Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22560–22570, October 2023

  4. [4]

    Textdiffuser-2: Unleashing the power of language models for text rendering

    Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser-2: Unleashing the power of language models for text rendering. In European Conference on Computer Vision, pages 386–402. Springer, 2024

  5. [5]

    Altclip: Altering the language encoder in clip for extended language capabilities

    Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. Altclip: Altering the language encoder in clip for extended language capabilities. arXiv preprint arXiv:2211.06679, 2022

  6. [6]

    The proposed uscf rating system, its development, theory, and applications

    Arpad Emmerich Elo. The proposed uscf rating system, its development, theory, and applications. Chess Life, XXII(8):242–247, 1967

  7. [7]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

  8. [8]

    Evalmuse-40k: A reliable and fine-grained benchmark with comprehensive human annotations for text-to-image generation model evaluation, 2024

    Shuhao Han, Haotian Fan, Jiachen Fu, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Chunle Guo, and Chongyi Li. Evalmuse-40k: A reliable and fine-grained benchmark with comprehensive human annotations for text-to-image generation model evaluation, 2024. URL https://arxiv.org/abs/2412.18150

  9. [9]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning, 2022. URL https://arxiv.org/abs/2104.08718

  10. [10]

    Ideogram

    Ideogram. Ideogram. https://about.ideogram.ai/2.0, 2024

  11. [11]

    Rethinking fid: Towards a better evaluation metric for image generation, 2024

    Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation, 2024. URL https://arxiv.org/abs/2401.09603

  12. [12]

    Adaface: Quality adaptive margin for face recognition

    Minchul Kim, Anil K Jain, and Xiaoming Liu. Adaface: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18750–18759, 2022

  13. [13]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023

  14. [14]

    Controlnet++: Improving conditional controls with efficient consistency feedback

    Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. Controlnet++: Improving conditional controls with efficient consistency feedback. In European Conference on Computer Vision, pages 129–147. Springer, 2025

  15. [15]

    Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. arXiv preprint arXiv:2405.08748, 2024

  16. [16]

    A technique for the measurement of attitudes

    Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 140:1–55, 1932

  17. [17]

    Evaluating text-to-visual generation with image-to-text generation

    Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291, 2024

  18. [18]

    Glyph-byt5: A customized text encoder for accurate visual text rendering

    Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, and Yuhui Yuan. Glyph-byt5: A customized text encoder for accurate visual text rendering. In European Conference on Computer Vision, pages 361–377. Springer, 2024

  19. [19]

    Glyph-byt5-v2: A strong aesthetic baseline for accurate multilingual visual text rendering

    Zeyu Liu, Weicong Liang, Yiming Zhao, Bohan Chen, Lin Liang, Lijuan Wang, Ji Li, and Yuhui Yuan. Glyph-byt5-v2: A strong aesthetic baseline for accurate multilingual visual text rendering. arXiv preprint arXiv:2406.10208, 2024

  20. [20]

    Meitu. Meitu. https://www.whee.com/ai/text-to-image, 2024

  21. [21]

    Midjourney v6.1

    Midjourney. Midjourney v6.1. https://www.midjourney.com/, 2024

  22. [22]

    Gpt-4o system card, 2024

    OpenAI: Aaron Hurst, Adam Lerer, et al. Gpt-4o system card, 2024. URL https://arxiv.org/abs/2410.21276

  23. [23]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  24. [24]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  25. [25]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

  26. [26]

    Recraft v3

    Recraft. Recraft v3. https://www.recraft.ai/projects, 2024

  27. [27]

    Hyper-sd: Trajectory segmented consistency model for efficient image synthesis

    Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-sd: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems, 37:117340–117362, 2025

  28. [28]

    Seededit: Align image re-generation to image editing, 2024

    Yichun Shi, Peng Wang, and Weilin Huang. Seededit: Align image re-generation to image editing, 2024. URL https://arxiv.org/abs/2411.06686

  29. [29]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  30. [30]

    Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis

    Kolors Team. Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis. arXiv preprint, 2024

  31. [31]

    Tencent. Hunyuan. https://console.cloud.tencent.com/hunyuan/experience/image, 2024

  32. [32]

    Anytext: Multilingual visual text generation and editing

    Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. Anytext: Multilingual visual text generation and editing. arXiv preprint arXiv:2311.03054, 2023

  33. [33]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  34. [34]

    Vmix: Improving text-to-image diffusion model with cross-attention mixing control

    Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, and Qian He. Vmix: Improving text-to-image diffusion model with cross-attention mixing control. arXiv preprint arXiv:2412.20800, 2024

  35. [35]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023

  36. [36]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024

  37. [37]

    Byt5: Towards a token-free future with pre-trained byte-to-byte models

    Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. Byt5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022

  38. [38]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023

  39. [39]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789

  40. [40]

    Jedi: Joint-image diffusion models for finetuning-free personalized text-to-image generation

    Yu Zeng, Vishal M Patel, Haochen Wang, Xun Huang, Ting-Chun Wang, Ming-Yu Liu, and Yogesh Balaji. Jedi: Joint-image diffusion models for finetuning-free personalized text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6786–6795, 2024

  41. [41]

    Onlinevpo: Align video diffusion model with online video-centric preference optimization

    Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, and Kai Han. Onlinevpo: Align video diffusion model with online video-centric preference optimization. arXiv preprint arXiv:2412.15159, 2024

  42. [42]

    Unifl: Improve stable diffusion via unified feedback learning

    Jiacheng Zhang, Jie Wu, Yuxi Ren, Xin Xia, Huafeng Kuang, Pan Xie, Jiashi Li, Xuefeng Xiao, Min Zheng, Lean Fu, et al. Unifl: Improve stable diffusion via unified feedback learning. arXiv preprint arXiv:2404.05595, 2024

  43. [43]

    Learning multi-dimensional human preference for text-to-image generation

    Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8018–8027, 2024

  44. [44]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023