Recognition: 2 theorem links · Lean Theorem
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Pith reviewed 2026-05-12 16:35 UTC · model grok-4.3
The pith
Seedream 4.0 unifies text-to-image synthesis, image editing, and multi-image composition inside one diffusion framework for fast high-resolution output.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A single efficient diffusion transformer with a reduced-token VAE, pretrained on billions of diverse text-image pairs and jointly post-trained with a VLM, delivers state-of-the-art results on both text-to-image and multimodal editing while supporting multi-image references, multiple outputs, and native high-resolution generation from 1K to 4K. The reported inference time is up to 1.8 seconds for a 2K image, measured without an LLM/VLM as the PE model.
What carries the argument
The highly efficient diffusion transformer paired with a powerful VAE that reduces image tokens, combined with VLM-based multi-modal post-training for joint T2I and editing.
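To make the token-reduction argument concrete, the following sketch counts the tokens a latent diffusion transformer would process at several resolutions. The downsampling factors (8 and 16) and the patch size (2) are illustrative assumptions; the abstract does not state the values Seedream 4.0 uses.

```python
# Illustrative token-count arithmetic for a latent diffusion transformer.
# The VAE downsampling factors and the patch size are assumptions chosen for
# illustration; they are not values reported for Seedream 4.0.

def latent_tokens(height: int, width: int, vae_factor: int, patch_size: int = 2) -> int:
    """Number of transformer tokens for one image.

    The VAE maps an H x W image to an (H / f) x (W / f) latent grid, and the
    transformer groups that grid into p x p patches, one token per patch.
    """
    stride = vae_factor * patch_size
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by vae_factor * patch_size")
    return (height // stride) * (width // stride)


if __name__ == "__main__":
    for side in (1024, 2048, 4096):
        base = latent_tokens(side, side, vae_factor=8)
        compressed = latent_tokens(side, side, vae_factor=16)
        print(f"{side}x{side}: f=8 -> {base} tokens, f=16 -> {compressed} tokens")
```

Under these assumptions, doubling the downsampling factor cuts the token count by a factor of four. Because self-attention cost grows roughly quadratically with sequence length, that is where most of the efficiency gain at 2K and 4K would come from.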
Where Pith is reading between the lines
- The unified framework could reduce the need for separate specialized tools in design workflows.
- Support for multiple outputs and multi-image references may enable new forms of iterative creative exploration.
- Further scaling to Seedream 4.5 suggests the approach could continue to improve with more data and compute.
Load-bearing premise
Internal evaluations on proprietary datasets and vertical scenarios are enough to establish state-of-the-art performance and generalization.
What would settle it
Public benchmark scores on standard T2I and editing datasets that fall below current leading models would show the claimed superiority does not hold under open evaluation.
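As a reference point for what such an open evaluation typically reports, the Fréchet Inception Distance (FID) named in the referee report compares Gaussian fits to Inception features of real and generated images. Writing $\mu_r, \Sigma_r$ for the feature mean and covariance of real images and $\mu_g, \Sigma_g$ for those of generated images (symbols introduced here for illustration), the standard definition is:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values are better, and the score is zero when the two Gaussian fits coincide. FID measures distributional similarity only, which is why it is usually reported alongside a text-alignment metric such as CLIPScore or a human preference study.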
Original abstract
We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE which also can reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to fast generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without a LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and also allows for multi-image reference, and can generate multiple output images. This extends traditional T2I systems into an more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. We further scale our model and data as Seedream 4.5. Seedream 4.0 and Seedream 4.5 are accessible on Volcano Engine https://www.volcengine.com/experience/ark?launch=seedream.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Seedream 4.0, an efficient diffusion transformer model with a powerful VAE for unified text-to-image synthesis, image editing, and multi-image composition. Pretrained on billions of text-image pairs and post-trained jointly on T2I and editing tasks with a fine-tuned VLM, it combines adversarial distillation, distribution matching, quantization, and speculative decoding to reach a reported inference time of up to 1.8 seconds for a 2K image. The authors claim state-of-the-art performance on T2I and multimodal editing, with strong capabilities in precise editing, in-context reasoning, multi-image references, and multi-output generation, and note further scaling to Seedream 4.5.
Significance. If the performance claims hold, the work would advance unified multimodal generative systems by integrating high-resolution native output, task unification, and inference efficiency in one framework, with potential applications in interactive creative tools. The emphasis on large-scale pretraining and post-training strategies could inform future model scaling, though the proprietary nature of the evaluations limits immediate reproducibility and comparison.
major comments (3)
- [Abstract] The central claim that Seedream 4.0 'achieves state-of-the-art results on both T2I and multimodal image editing' and demonstrates 'exceptional multimodal capabilities' is not supported by any quantitative metrics (FID, CLIPScore, human preference), comparison tables against baselines such as SD3/Flux/DALL-E 3, or details on public benchmarks like MS-COCO or DrawBench.
- [Introduction / Model Description] The manuscript provides no ablation studies or details on the impact of the VAE token reduction factor, joint T2I+editing post-training, or the 'optimized strategies' for the hundreds of vertical scenarios, which are load-bearing for validating the generalization and efficiency claims.
- [Inference Acceleration] No experimental section, tables, or figures present error analysis, inference hardware details, or comparisons for the reported 1.8-second 2K inference time, undermining the acceleration claims relative to existing methods.
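The kind of latency reporting the third major comment asks for can be sketched as follows. This is a minimal harness under stated assumptions: `generate` is a hypothetical stand-in for one end-to-end image generation call, not a Seedream or Volcano Engine API, and the warmup and run counts are arbitrary defaults.

```python
# Minimal latency-reporting harness. `generate` is a hypothetical stand-in for
# one end-to-end generation call; it is not a Seedream or Volcano Engine API.
import statistics
import time
from typing import Callable, Dict


def benchmark(generate: Callable[[], object], warmup: int = 3, runs: int = 20) -> Dict[str, float]:
    """Time `generate` and summarize per-call latency in seconds."""
    if runs < 2:
        raise ValueError("need at least two timed runs to estimate spread")
    for _ in range(warmup):
        generate()  # untimed: excludes one-off costs such as caching or compilation
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Placeholder workload so the script runs anywhere; swap in a real call.
    report = benchmark(lambda: time.sleep(0.01), warmup=1, runs=5)
    print({key: round(value, 4) for key, value in report.items()})
```

A single headline figure such as 1.8 seconds is hard to interpret without the spread, the hardware, the batch size, and the number of sampling steps. For GPU inference, `generate` would also need to block until the image is fully produced, since asynchronous execution otherwise makes the timings meaningless.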
minor comments (2)
- [Abstract] Typo: 'extends traditional T2I systems into an more interactive' should read 'into a more interactive'.
- [Abstract] The description of multi-image reference and multiple output generation lacks protocol details or examples, which would improve clarity even if not central to the claims.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback on our manuscript describing Seedream 4.0. We address each major comment point by point below and indicate which revisions will appear in the next version of the paper.
Point-by-point responses
- Referee: [Abstract] The central claim that Seedream 4.0 'achieves state-of-the-art results on both T2I and multimodal image editing' and demonstrates 'exceptional multimodal capabilities' is not supported by any quantitative metrics (FID, CLIPScore, human preference), comparison tables against baselines such as SD3/Flux/DALL-E 3, or details on public benchmarks like MS-COCO or DrawBench.
  Authors: We acknowledge that the abstract makes performance claims without accompanying quantitative metrics or tables in the manuscript. Seedream 4.0 is a proprietary production system, and detailed public benchmark results (including FID, CLIPScore, or comparisons on MS-COCO/DrawBench) are not released to protect intellectual property and competitive positioning. Internal evaluations and platform-based user studies support the claims, but these cannot be fully disclosed. We will revise the abstract to moderate the language, frame the contribution more as a unified system description with demonstrated capabilities, and add a note directing readers to the Volcano Engine service for practical evaluation. This change will be incorporated in the revised manuscript.
  revision: yes
- Referee: [Introduction / Model Description] The manuscript provides no ablation studies or details on the impact of the VAE token reduction factor, joint T2I+editing post-training, or the 'optimized strategies' for the hundreds of vertical scenarios, which are load-bearing for validating the generalization and efficiency claims.
  Authors: The referee is correct that the manuscript does not include ablation studies on the VAE token reduction factor, the joint T2I+editing post-training procedure, or the specific optimized strategies for vertical scenarios. These elements rely on proprietary data pipelines and internal experimentation that we cannot fully detail without disclosing sensitive information. We will expand the model description section with additional high-level discussion of the design rationale for token reduction and the benefits observed from joint post-training. However, we maintain that complete ablations are not feasible in a public manuscript of this scope and will not be added.
  revision: partial
- Referee: [Inference Acceleration] No experimental section, tables, or figures present error analysis, inference hardware details, or comparisons for the reported 1.8-second 2K inference time, undermining the acceleration claims relative to existing methods.
  Authors: We agree that the current manuscript lacks a dedicated experimental section with hardware specifications, error analysis, and comparative results for the 1.8-second 2K inference time. We will add a new subsection under inference acceleration that specifies the hardware platform used, provides basic error analysis where appropriate, and includes available comparisons to standard baselines. This addition will directly address the concern and strengthen the presentation of the acceleration techniques.
  revision: yes
Not provided in the revision:
- Detailed quantitative benchmark tables and public-dataset comparisons (FID, CLIPScore, etc.), due to the proprietary nature of the model and evaluations.
- Full ablation studies on internal data strategies and training configurations that involve proprietary vertical-scenario data.
Circularity Check
No circularity detected; paper is an empirical system description without load-bearing derivations or predictions that reduce to inputs by construction.
full rationale
The manuscript introduces Seedream 4.0 as a unified multimodal generation framework, detailing architecture choices (efficient diffusion transformer + VAE token reduction), pretraining on billions of text-image pairs, multi-modal post-training with a VLM, and inference optimizations. It asserts SOTA performance via 'comprehensive evaluations' on T2I and editing tasks. No mathematical derivation chain, equations, or first-principles results are presented that could exhibit self-definition, fitted-input-as-prediction, or self-citation load-bearing patterns. Claims rest on internal proprietary data and evaluations rather than external benchmarks, but this constitutes a transparency or reproducibility limitation, not a circular reduction of any claimed result to its own inputs. The paper contains no ansatz smuggling, uniqueness theorems, or renaming of known results in a derivation sense.
Axiom & Free-Parameter Ledger
free parameters (2)
- VAE token reduction factor
- Adversarial distillation and quantization hyperparameters
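To illustrate what a quantization hyperparameter looks like in practice, here is a small sketch of symmetric per-tensor quantization. The scheme, the bit widths, and the sample weights are all illustrative assumptions; the paper's actual quantization recipe is not described in the abstract.

```python
# Symmetric per-tensor quantization sketch. The scheme, bit widths, and sample
# weights are illustrative assumptions, not the recipe used by Seedream 4.0.
from typing import List, Tuple


def quantize(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integer codes in [-qmax, qmax], qmax = 2^(bits-1) - 1."""
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0  # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.50, -1.27, 0.03, 0.90]
    for bits in (8, 4):
        codes, scale = quantize(weights, bits)
        restored = dequantize(codes, scale)
        worst = max(abs(a - b) for a, b in zip(weights, restored))
        print(f"{bits}-bit codes={codes} max_abs_error={worst:.4f}")
```

The bit width is the kind of free parameter the ledger refers to: fewer bits reduce memory and bandwidth but increase the rounding error, which for this scheme is bounded by half the scale.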
axioms (1)
- domain assumption: Large-scale pretraining on billions of text-image pairs yields stable generalization across vertical scenarios.
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · relation: unclear
  Paper passage: We develop a highly efficient diffusion transformer with a powerful VAE which also can reduce the number of image tokens considerably... Seedream 4.0 is pretrained on billions of text-image pairs... By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for training both T2I and image editing tasks jointly.
- IndisputableMonolith.Foundation.PhiForcing · phi_forcing · relation: unclear
  Paper passage: Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 34 Pith papers
-
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.
-
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
-
MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
-
Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro
Banana100 dataset shows that none of 21 popular NR-IQA metrics consistently rate images degraded by 100 iterative edits lower than clean originals.
-
GeoR-Bench: Evaluating Geoscience Visual Reasoning
GeoR-Bench shows top multimodal models reach only 42.7% strict accuracy on geoscience visual reasoning tasks while open-source models reach 10.3%, with outputs often visually plausible yet scientifically inaccurate.
-
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
POCA combines Pareto optimization with curriculum alignment to improve multi-reward reinforcement learning for visual text generation without relying on weighted sums.
-
DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.
-
Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
-
SpatialEdit: Benchmarking Fine-Grained Image Spatial Editing
SpatialEdit provides a benchmark, large synthetic dataset, and baseline model for precise object and camera spatial manipulations in images, with the model beating priors on spatial editing.
-
Gen-Searcher: Reinforcing Agentic Search for Image Generation
Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.
-
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
-
CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers
CreatiParser decomposes raster graphic designs into editable text, background, and sticker layers via a hybrid VLM-diffusion model with ParserReward and GRPO optimization, reporting 23.7% average metric gains on Parse...
-
Self-Reasoning Agentic Framework for Narrative Product Grid-Collage Generation
A self-reasoning agentic framework constructs a Product Narrative Framework, generates constraint-aware unified grid collages, and refines outputs via failure attribution to improve narrative coherence and aesthetics ...
-
Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement
Q-DeepSight proposes a think-with-image multimodal CoT framework trained via RL with perceptual curriculum rewards and evidence gradient filtering to achieve SOTA IQA performance and enable training-free perceptual re...
-
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...
-
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
-
Qwen-Image-2.0 Technical Report
Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
-
simpleposter: a simple baseline for product poster generation
SimplePoster achieves 98.7% subject preservation and improved text accuracy in product posters via full-parameter fine-tuning of an inpainting model and zero-cost character-level position encoding, outperforming compl...
-
Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space
VAE-LFA suppresses semantic drift in multi-turn DiT image editing by low-pass filtering latent discrepancies and aligning low-frequency components to an EMA of previous rounds in VAE space.
-
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
-
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
MMCORE transfers VLM reasoning into diffusion-based image generation and editing via aligned latent embeddings from learnable queries, outperforming baselines on text-to-image and editing tasks.
-
Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Tstars-Tryon 1.0 is a deployed virtual try-on system claiming high robustness, photorealism, multi-reference flexibility, and near real-time speed for diverse fashion items.
-
Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks
Nano Banana 2 delivers competitive perceptual quality on image restoration but produces over-enhanced results that diverge from input fidelity in ways standard metrics miss.
-
Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Tstars-Tryon 1.0 is a robust, photorealistic virtual try-on system with multi-image support and near real-time speed, deployed at industrial scale on Taobao and accompanied by a released benchmark.
-
Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise ed...
-
Seedance 2.0: Advancing Video Generation for World Complexity
Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.