Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model
Pith reviewed 2026-05-17 08:20 UTC · model grok-4.3
The pith
Seedream 2.0 uses a self-developed bilingual LLM text encoder to generate high-fidelity images from Chinese or English prompts with accurate cultural nuances.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Seedream 2.0 achieves state-of-the-art performance across prompt-following, aesthetics, text rendering, and structural correctness. It is a native Chinese-English bilingual image generation foundation model integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability.
What carries the argument
Self-developed bilingual large language model as text encoder that learns native knowledge directly from data
If this is right
- The model produces images that accurately reflect Chinese cultural details when given Chinese prompts.
- Glyph-Aligned ByT5 enables precise character-level text rendering without distortion in generated images.
- Scaled ROPE allows the model to handle image resolutions not seen during training.
- RLHF iterations produce outputs that align closely with human preferences in direct comparisons.
- The same base model adapts to instruction-based editing while preserving image consistency.
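The character-level rendering point above rests on a known property of ByT5: it consumes raw UTF-8 bytes rather than a subword vocabulary, so every Chinese character reaches the encoder as an explicit byte sequence and nothing falls out of vocabulary. A minimal plain-Python sketch of that byte-level view (not the Glyph-Aligned ByT5 model itself):

```python
def byte_tokens(text: str) -> list[int]:
    """Return the UTF-8 byte sequence a byte-level encoder like ByT5
    would see. Illustrative only; the actual Glyph-Aligned ByT5 adds
    glyph alignment on top of this byte-level input."""
    return list(text.encode("utf-8"))

mixed = "你好 Hi"  # bilingual prompt fragment
tokens = byte_tokens(mixed)

# Each CJK character expands to 3 bytes; ASCII stays 1 byte each.
assert len(tokens) == 3 + 3 + 1 + 2  # 你(3) 好(3) space(1) "Hi"(2)
```

Because the unit is the byte, the encoder sees the internal structure of every glyph's encoding, which is one reason byte-level encoders are favored for visual text rendering.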
Where Pith is reading between the lines
- Native encoder designs like this one could be replicated for other language pairs to reduce cultural bias in image generation.
- Learning directly from data rather than through English translation layers may cut down on common multilingual errors.
- Strong human preference alignment suggests the approach could support more personalized creative tools.
Load-bearing premise
The self-developed bilingual LLM text encoder and the custom data and caption systems allow the model to learn native Chinese knowledge directly from data without introducing new biases or requiring post-hoc fixes.
What would settle it
A blind rating study in which native Chinese speakers compare images generated from Chinese cultural prompts by Seedream 2.0 versus models such as Flux or SD3.5 and find no consistent advantage in cultural accuracy or text correctness.
Original abstract
Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney, still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompt in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness for image description. Particularly, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enable it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Beside, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.
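The abstract's Scaled ROPE claim concerns extending rotary position embeddings to resolutions beyond those seen in training. The report's exact scaling rule is not given here; a common generic approach, sketched below under that assumption, is to rescale position indices so that a larger untrained resolution maps back into the trained positional range:

```python
import numpy as np

def rope_angles(positions, dim=8, base=10000.0, scale=1.0):
    """Rotary embedding angles for the given positions. `scale` shrinks
    positions so an untrained (larger) resolution maps back into the
    trained range. This is the generic position-rescaling idea, not
    necessarily Seedream 2.0's exact Scaled ROPE formulation."""
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)
    return np.outer(np.asarray(positions) / scale, inv_freq)

trained_len, new_len = 64, 128
a_trained = rope_angles(np.arange(trained_len))
a_scaled = rope_angles(np.arange(new_len), scale=new_len / trained_len)

# Rescaled angles stay within (roughly) the trained angular range,
# which is what lets the model generalize to the larger resolution.
assert a_scaled.max() < a_trained.max() + 1
```

The design intuition: attention has only ever seen rotary angles up to the trained extent, so compressing new positions into that range keeps relative-position geometry familiar at the cost of finer positional resolution.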
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Seedream 2.0, a diffusion-based native Chinese-English bilingual image generation foundation model. It describes a custom data system, caption system balancing accuracy and richness, integration of a self-developed bilingual LLM as text encoder to acquire native knowledge directly from data, Glyph-Aligned ByT5 for character-level text rendering, Scaled ROPE for resolution generalization, and multi-phase post-training with SFT and RLHF iterations for human preference alignment. The central claims are that the model achieves state-of-the-art performance in prompt-following, aesthetics, text rendering, and structural correctness, delivers outstanding ELO scores, and can be adapted to instruction-based image editing while balancing instruction-following and consistency.
Significance. If the SOTA and native bilingual performance claims hold with rigorous evidence, the work would represent a meaningful advance in multilingual image generation by addressing cultural nuances and text rendering limitations in existing models. The combination of a custom bilingual encoder with RLHF alignment and editing adaptability could influence development of culturally inclusive foundation models, provided the contributions are isolated and independently verified.
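The "outstanding ELO score" at issue in the comments below is produced by pairwise human preference votes scored with the standard Elo update (reference [6]). A minimal sketch, with the caveat that the leaderboard's K-factor and initialization are assumptions here:

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """One pairwise comparison: score_a is 1.0 if model A's image is
    preferred, 0.0 if B's, 0.5 for a tie. Standard Elo expected-score
    update; K-factor and starting ratings are illustrative choices."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A's output is preferred once.
ra, rb = elo_update(1000.0, 1000.0, 1.0)
assert (ra, rb) == (1016.0, 984.0)
```

This also clarifies the referee's objection: an Elo gap is only as trustworthy as the vote protocol behind it (rater pool, prompt distribution, blinding), none of which the abstract specifies.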
major comments (2)
- [Abstract] Abstract: The assertion of state-of-the-art performance 'across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness' and 'outstanding ELO score' after 'extensive experimentation' and 'multiple RLHF iterations' provides no quantitative metrics, baseline comparisons (e.g., against Flux or SD3.5), evaluation protocols, error bars, or ablation studies. This directly undermines validation of the central SOTA and human-preference-alignment claims.
- [Abstract] Abstract and methodology description: No isolating ablations are reported for the self-developed bilingual LLM text encoder or the custom data/caption systems. The claim that these components enable 'learn[ing] native knowledge directly from massive data' without introducing new biases or requiring post-hoc fixes is therefore load-bearing but untested, leaving open whether gains derive from the claimed native mechanism, scale, or other factors.
minor comments (1)
- [Abstract] Abstract: Grammatical issues include 'This enable it' (should be 'This enables it') and 'Beside' (should be 'Besides').
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to better substantiate our central claims in the abstract and methodology sections. We address each point below and commit to revisions that strengthen the evidence without altering the core contributions.
Point-by-point responses
- Referee: [Abstract] Abstract: The assertion of state-of-the-art performance 'across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness' and 'outstanding ELO score' after 'extensive experimentation' and 'multiple RLHF iterations' provides no quantitative metrics, baseline comparisons (e.g., against Flux or SD3.5), evaluation protocols, error bars, or ablation studies. This directly undermines validation of the central SOTA and human-preference-alignment claims.
  Authors: We agree that the abstract, constrained by length, omits specific quantitative metrics and direct baseline comparisons, which weakens the immediate visibility of the SOTA claims. The full manuscript details these in the Experiments and Evaluation sections, including ELO scores from human studies, comparisons against Flux and SD3.5, evaluation protocols for prompt-following and text rendering, and structural correctness metrics. To directly address this, we will revise the abstract to include key quantitative highlights (e.g., relative ELO improvements and specific benchmark scores) while preserving brevity. We will also ensure error bars and protocol summaries are more explicitly cross-referenced in the main text. revision: yes
- Referee: [Abstract] Abstract and methodology description: No isolating ablations are reported for the self-developed bilingual LLM text encoder or the custom data/caption systems. The claim that these components enable 'learn[ing] native knowledge directly from massive data' without introducing new biases or requiring post-hoc fixes is therefore load-bearing but untested, leaving open whether gains derive from the claimed native mechanism, scale, or other factors.
  Authors: We recognize that isolating ablations would provide stronger evidence for attributing gains specifically to the bilingual LLM text encoder and custom data/caption systems rather than scale alone. The current manuscript supports the overall performance through end-to-end results and cultural nuance evaluations but does not present component-wise ablations. We will add a dedicated ablation subsection in the revised version, including controlled experiments that isolate the contribution of the self-developed LLM and caption balancing approach, to clarify the native knowledge integration mechanism. revision: partial
Circularity Check
No significant circularity in empirical model description and performance claims
Full rationale
The paper presents an engineering description of Seedream 2.0, a bilingual diffusion model incorporating a self-developed LLM text encoder, custom data/caption systems, Glyph-Aligned ByT5, Scaled ROPE, and post-training via SFT and RLHF. Performance claims (SOTA on prompt-following, aesthetics, text rendering, structural correctness, and ELO alignment) are asserted via 'extensive experimentation' without any mathematical derivation chain, equations, or first-principles predictions. No step reduces by construction to its inputs: RLHF is a standard optimization procedure whose output alignment is measured by ELO, which the text does not equate to the training preference data itself. Self-developed components are described but not justified solely by overlapping self-citations. The paper is self-contained against external benchmarks in the sense that its central claims rest on reported empirical results rather than tautological re-labeling of fitted quantities.
Axiom & Free-Parameter Ledger
free parameters (1)
- training hyperparameters and scaling factors
axioms (1)
- Domain assumption: Diffusion models can be trained to high fidelity on bilingual data when paired with a native LLM encoder.
invented entities (2)
- Glyph-Aligned ByT5: no independent evidence
- Scaled ROPE: no independent evidence
Forward citations
Cited by 17 Pith papers
- Flow-GRPO: Training Flow Matching Models via Online RL
  Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
- Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
  SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
- Leveraging Verifier-Based Reinforcement Learning in Image Editing
  Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
- PixelGen: Improving Pixel Diffusion with Perceptual Supervision
  PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.
- Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
  Seedance 1.5 pro is a joint audio-visual generation model achieving high synchronization via dual-branch diffusion transformer and post-training optimizations.
- DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
  DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while clos...
- DanceGRPO: Unleashing GRPO on Visual Generation
  DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
- Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation
  Mogao presents a causal unified model with deep fusion, dual encoders, and interleaved position embeddings that achieves strong performance on multi-modal understanding, text-to-image generation, and coherent interlea...
- A Systematic Post-Train Framework for Video Generation
  A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
- LongCat-Image Technical Report
  LongCat-Image delivers a compact 6B-parameter bilingual image generation model that sets new standards for Chinese character rendering accuracy and photorealism while remaining efficient and fully open-source.
- Qwen-Image Technical Report
  Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...
- Qwen-Image-2.0 Technical Report
  Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
- MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
  MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
  Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.
- Seedream 3.0 Technical Report
  Seedream 3.0 improves bilingual image generation through doubled defect-aware data, mixed-resolution training, cross-modality RoPE, representation alignment, aesthetic SFT, VLM reward modeling, and importance-aware ti...
- Seedance 2.0: Advancing Video Generation for World Complexity
  Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.
- Seedream 4.0: Toward Next-generation Multimodal Image Generation
  Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition in an efficient diffusion transformer pretrained on billions of pairs and accelerated to 1.8 seconds for 2K output.
Reference graph
Works this paper leans on
- [1] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.
- [2] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- [3] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22560–22570, October 2023.
- [4] Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. TextDiffuser-2: Unleashing the power of language models for text rendering. In European Conference on Computer Vision, pages 386–402. Springer, 2024.
- [5] Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Fulong Ye, Qinghong Yang, and Ledell Wu. AltCLIP: Altering the language encoder in CLIP for extended language capabilities. arXiv preprint arXiv:2211.06679, 2022.
- [6] Arpad Emmerich Elo. The proposed USCF rating system, its development, theory, and applications. Chess Life, XXII(8):242–247, 1967.
- [7] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
- [8] Shuhao Han, Haotian Fan, Jiachen Fu, Liang Li, Tao Li, Junhui Cui, Yunqiu Wang, Yang Tai, Jingwei Sun, Chunle Guo, and Chongyi Li. EvalMuse-40K: A reliable and fine-grained benchmark with comprehensive human annotations for text-to-image generation model evaluation, 2024. URL https://arxiv.org/abs/2412.18150.
- [9] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning, 2022. URL https://arxiv.org/abs/2104.08718.
- [10]
- [11] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking FID: Towards a better evaluation metric for image generation, 2024. URL https://arxiv.org/abs/2401.09603.
- [12] Minchul Kim, Anil K. Jain, and Xiaoming Liu. AdaFace: Quality adaptive margin for face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18750–18759, 2022.
- [13] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023.
- [14] Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. ControlNet++: Improving conditional controls with efficient consistency feedback. In European Conference on Computer Vision, pages 129–147. Springer, 2025.
- [15] Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, et al. Hunyuan-DiT: A powerful multi-resolution diffusion transformer with fine-grained Chinese understanding. arXiv preprint arXiv:2405.08748, 2024.
- [16] Rensis Likert. A technique for the measurement of attitudes. Archives of Psychology, 140:1–55, 1932.
- [17] Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291, 2024.
- [18] Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, and Yuhui Yuan. Glyph-ByT5: A customized text encoder for accurate visual text rendering. In European Conference on Computer Vision, pages 361–377. Springer, 2024.
- [19] Zeyu Liu, Weicong Liang, Yiming Zhao, Bohan Chen, Lin Liang, Lijuan Wang, Ji Li, and Yuhui Yuan. Glyph-ByT5-v2: A strong aesthetic baseline for accurate multilingual visual text rendering. arXiv preprint arXiv:2406.10208, 2024.
- [20] Meitu. https://www.whee.com/ai/text-to-image, 2024.
- [21]
- [22] OpenAI, Aaron Hurst, Adam Lerer, et al. GPT-4o system card, 2024. URL https://arxiv.org/abs/2410.21276.
- [23] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [24] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [25] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [26]
- [27] Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems, 37:117340–117362, 2025.
- [28] Yichun Shi, Peng Wang, and Weilin Huang. SeedEdit: Align image re-generation to image editing, 2024. URL https://arxiv.org/abs/2411.06686.
- [29] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [30] Kolors Team. Kolors: Effective training of diffusion model for photorealistic text-to-image synthesis. arXiv preprint, 2024.
- [31] Tencent. Hunyuan. https://console.cloud.tencent.com/hunyuan/experience/image, 2024.
- [32] Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, and Xuansong Xie. AnyText: Multilingual visual text generation and editing. arXiv preprint arXiv:2311.03054, 2023.
- [33] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024.
- [34] Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, and Qian He. VMix: Improving text-to-image diffusion model with cross-attention mixing control. arXiv preprint arXiv:2412.20800, 2024.
- [35] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
- [36] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
- [37] Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022.
- [38] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
- [39] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789.
- [40] Yu Zeng, Vishal M. Patel, Haochen Wang, Xun Huang, Ting-Chun Wang, Ming-Yu Liu, and Yogesh Balaji. JeDi: Joint-image diffusion models for finetuning-free personalized text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6786–6795, 2024.
- [41] Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, and Kai Han. OnlineVPO: Align video diffusion model with online video-centric preference optimization. arXiv preprint arXiv:2412.15159, 2024.
- [42] Jiacheng Zhang, Jie Wu, Yuxi Ren, Xin Xia, Huafeng Kuang, Pan Xie, Jiashi Li, Xuefeng Xiao, Min Zheng, Lean Fu, et al. UniFL: Improve Stable Diffusion via unified feedback learning. arXiv preprint arXiv:2404.05595, 2024.
- [43] Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8018–8027, 2024.
- [44] Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
discussion (0)