pith. machine review for the scientific record.

arxiv: 2506.09113 · v2 · submitted 2025-06-10 · 💻 cs.CV

Recognition: 3 theorem links

· Lean Theorem

Seedance 1.0: Exploring the Boundaries of Video Generation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 12:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · text-to-video · image-to-video · multi-shot generation · diffusion models · model distillation · RLHF · video captioning

The pith

Seedance 1.0 generates 5-second 1080p videos in 41 seconds while improving multi-subject accuracy and multi-shot coherence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Seedance 1.0 as a video foundation model that tackles the difficulty of balancing prompt following, motion plausibility, and visual quality in diffusion-based generators. It combines multi-source data with detailed captioning, an architecture that jointly handles text-to-video and image-to-video tasks while supporting multi-shot sequences, post-training that includes supervised fine-tuning and video-specific RLHF with multi-dimensional rewards, and multi-stage distillation for acceleration. These changes are presented as the reason the model can produce higher-quality output at practical speeds. A reader would care because video generation for stories or simulations needs all three qualities at once, and a model that delivers them efficiently could change how such content is created. The central result is that Seedance 1.0 achieves these gains together rather than trading one off against the others.

Core claim

Seedance 1.0 integrates multi-source data curation augmented with precision video captioning, an efficient architecture that natively supports multi-shot generation and jointly learns text-to-video and image-to-video tasks, carefully optimized post-training with fine-grained supervised fine-tuning and video-specific RLHF using multi-dimensional reward mechanisms, and multi-stage distillation with system-level optimizations for acceleration. This combination enables generation of a 5-second 1080p video in 41.4 seconds on an NVIDIA L20 GPU and produces output with superior spatiotemporal fluidity and structural stability, precise instruction adherence in complex multi-subject contexts, and native multi-shot narrative coherence with consistent subject representation.
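The core claim bundles joint text-to-video and image-to-video learning into one architecture. As a hedged illustration only (the report does not disclose Seedance's actual mechanism, and every module name below is hypothetical), a common way to realize such joint training is a single backbone that optionally receives a first-frame condition, so both task types share one set of weights and one batch mix.

```python
# Minimal sketch, not Seedance's code: one backbone serves text-to-video and
# image-to-video by optionally concatenating a broadcast first-frame latent
# plus a condition mask. All names and shapes here are illustrative.
import torch
import torch.nn as nn

class JointT2VI2VBackbone(nn.Module):
    def __init__(self, latent_ch=16, text_dim=512, hidden=256):
        super().__init__()
        # inputs: noisy latents + broadcast first-frame latents + condition mask
        self.in_proj = nn.Conv3d(latent_ch * 2 + 1, hidden, kernel_size=1)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.out_proj = nn.Conv3d(hidden, latent_ch, kernel_size=1)

    def forward(self, noisy_latents, text_emb, first_frame=None):
        B, C, T, H, W = noisy_latents.shape
        if first_frame is None:                      # text-to-video sample
            cond = torch.zeros(B, C, T, H, W)
            mask = torch.zeros(B, 1, T, H, W)
        else:                                        # image-to-video sample
            cond = first_frame.unsqueeze(2).expand(-1, -1, T, -1, -1)
            mask = torch.ones(B, 1, T, H, W)
        x = self.in_proj(torch.cat([noisy_latents, cond, mask], dim=1))
        x = x + self.text_proj(text_emb)[:, :, None, None, None]
        return self.out_proj(x)                      # predicted velocity or noise

# usage sketch: the same weights see both task types
model = JointT2VI2VBackbone()
z = torch.randn(2, 16, 8, 32, 32)                   # noisy latents (B, C, T, H, W)
txt = torch.randn(2, 512)                            # pooled text embedding
v_t2v = model(z, txt)                                 # text-to-video branch
v_i2v = model(z, txt, first_frame=torch.randn(2, 16, 32, 32))
```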

What carries the argument

Seedance 1.0's integrated pipeline of multi-source data curation, multi-task architecture for multi-shot outputs, RLHF post-training with multi-dimensional rewards, and multi-stage distillation for speedup
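The pipeline summary names RLHF post-training with multi-dimensional rewards. As a rough, hypothetical illustration (the report does not specify Seedance's reward heads or weights), "multi-dimensional" usually means several per-video scores folded into one preference signal:

```python
# Illustrative only: fold several per-video reward dimensions into one scalar
# for preference optimization. Dimension names and weights are hypothetical.
def combined_reward(scores: dict, weights: dict) -> float:
    return sum(weights[k] * scores[k] for k in weights)

reward = combined_reward(
    scores={"visual_quality": 0.82, "motion_plausibility": 0.74, "prompt_following": 0.91},
    weights={"visual_quality": 0.3, "motion_plausibility": 0.3, "prompt_following": 0.4},
)
print(reward)  # single scalar used to rank candidate videos during RLHF
```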

Load-bearing premise

The listed improvements in data, architecture, post-training, and distillation directly produce the claimed performance advantages without post-hoc selection or unstated baselines.

What would settle it

A standardized public benchmark using fixed sets of complex multi-subject and multi-shot prompts, scored on quantitative metrics for text-video alignment, motion smoothness, subject consistency across shots, and wall-clock inference time against current leading models.
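A minimal harness for that protocol could look like the sketch below. The prompts, the generate() callable, and the metric hooks are placeholders rather than any model's real API; actual runs would plug in scoring models for alignment, smoothness, and subject consistency.

```python
# Hypothetical benchmark skeleton for the protocol described above; generate()
# and the metric functions are stand-ins, not a real model interface.
import time
from statistics import mean

PROMPTS = [
    "two chefs argue over a recipe, then one storms out; cut to the empty kitchen",
    "a cat and a dog chase the same ball across three camera angles",
]

def run_benchmark(generate, metrics):
    """generate(prompt) -> frames; metrics: name -> fn(prompt, frames) -> float."""
    rows = []
    for prompt in PROMPTS:
        start = time.perf_counter()
        frames = generate(prompt)
        wall = time.perf_counter() - start           # wall-clock inference time
        row = {"wall_clock_s": wall}
        row.update({name: fn(prompt, frames) for name, fn in metrics.items()})
        rows.append(row)
    return {k: mean(r[k] for r in rows) for k in rows[0]}

# usage with stub scorers
summary = run_benchmark(
    generate=lambda p: [f"frame_{i}" for i in range(120)],
    metrics={
        "text_video_alignment": lambda p, v: 0.0,
        "motion_smoothness": lambda p, v: 0.0,
        "subject_consistency": lambda p, v: 0.0,
    },
)
print(summary)
```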

read the original abstract

Notable breakthroughs in diffusion modeling have propelled rapid improvements in video generation, yet current foundational model still face critical challenges in simultaneously balancing prompt following, motion plausibility, and visual quality. In this report, we introduce Seedance 1.0, a high-performance and inference-efficient video foundation generation model that integrates several core technical improvements: (i) multi-source data curation augmented with precision and meaningful video captioning, enabling comprehensive learning across diverse scenarios; (ii) an efficient architecture design with proposed training paradigm, which allows for natively supporting multi-shot generation and jointly learning of both text-to-video and image-to-video tasks. (iii) carefully-optimized post-training approaches leveraging fine-grained supervised fine-tuning, and video-specific RLHF with multi-dimensional reward mechanisms for comprehensive performance improvements; (iv) excellent model acceleration achieving ~10x inference speedup through multi-stage distillation strategies and system-level optimizations. Seedance 1.0 can generate a 5-second video at 1080p resolution only with 41.4 seconds (NVIDIA-L20). Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation having superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, native multi-shot narrative coherence with consistent subject representation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Seedance 1.0, a video generation foundation model. It details four technical improvements: multi-source data curation with precision captioning for diverse scenarios; an efficient architecture and training paradigm supporting native multi-shot generation along with joint text-to-video and image-to-video learning; post-training via fine-grained supervised fine-tuning and video-specific RLHF with multi-dimensional rewards; and multi-stage distillation plus system optimizations yielding ~10x inference speedup. The model is stated to generate 5-second 1080p videos in 41.4 seconds on NVIDIA L20 and is claimed to achieve superior spatiotemporal fluidity, structural stability, instruction adherence in multi-subject scenes, and multi-shot narrative coherence relative to state-of-the-art models.

Significance. The enumerated pipeline components (data curation, joint T2V/I2V architecture, RLHF, and distillation) describe practical engineering choices that could inform video diffusion model development if substantiated. The concrete inference-time figure provides one verifiable data point. However, the absence of any quantitative benchmarks, ablations, or baseline comparisons means the superiority claims cannot be assessed, limiting the manuscript's contribution to a descriptive summary rather than a validated advance.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'Seedance 1.0 stands out with ... superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, native multi-shot narrative coherence' is unsupported by any metrics (FVD, FID, CLIP similarity, user-study scores), ablation results, or named baseline comparisons (e.g., to Sora, Kling, or Luma). This directly undermines the performance assertions that constitute the paper's primary contribution.
  2. [Abstract] Abstract and technical description: The listed improvements in data curation, architecture, RLHF, and distillation are presented as directly producing the claimed advantages, yet no experimental protocol, evaluation datasets, or controlled results link these choices to measured gains. Without such linkage, the causal connection remains an unverified assumption.
minor comments (2)
  1. [Abstract] Abstract: Grammatical issue in 'current foundational model still face critical challenges' (should be 'models still face').
  2. [Abstract] Abstract: The phrase 'excellent model acceleration achieving ~10x inference speedup' would benefit from a precise baseline model and measurement conditions for the speedup factor.
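For orientation, the two numbers in the abstract imply a rough undistilled baseline, valid only under the assumption (not stated in the report) that the ~10x factor and the 41.4 s figure refer to the same clip length, resolution, and GPU:

```latex
t_{\text{undistilled}} \approx 10 \times t_{\text{distilled}}
  = 10 \times 41.4\,\text{s} \approx 414\,\text{s}
\quad \text{(5-second 1080p clip, NVIDIA L20, assumed identical conditions)}
```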

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments correctly identify the need for stronger empirical grounding of the performance claims. Below we respond point by point and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'Seedance 1.0 stands out with ... superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, native multi-shot narrative coherence' is unsupported by any metrics (FVD, FID, CLIP similarity, user-study scores), ablation results, or named baseline comparisons (e.g., to Sora, Kling, or Luma). This directly undermines the performance assertions that constitute the paper's primary contribution.

    Authors: We agree that the abstract asserts superiority without quantitative support or named baselines. The manuscript is framed as a technical report focused on the overall system and engineering pipeline rather than a full benchmark study; the only concrete, reproducible number supplied is the 41.4-second inference latency on an NVIDIA L20. The superiority language reflects internal qualitative and automated checks that guided development. In the revision we will (i) tone down the abstract claims to reflect the available evidence, (ii) add a short evaluation section that reports the inference-time comparison against publicly available models, and (iii) include any additional internal metrics (e.g., CLIP similarity on held-out prompts; a minimal sketch of such a check appears after these responses) that can be disclosed without violating proprietary constraints. revision: yes

  2. Referee: [Abstract] Abstract and technical description: The listed improvements in data curation, architecture, RLHF, and distillation are presented as directly producing the claimed advantages, yet no experimental protocol, evaluation datasets, or controlled results link these choices to measured gains. Without such linkage, the causal connection remains an unverified assumption.

    Authors: The referee is correct that the manuscript describes the four components but does not present ablations or controlled experiments that isolate their individual contributions. Because the current text is a high-level system overview, such experiments were omitted. We will revise the manuscript to include (a) a concise description of the internal evaluation protocol and datasets used for development, and (b) at least one ablation table (e.g., effect of joint T2V/I2V training and of the multi-dimensional RLHF rewards) on a small public validation set. Where full ablations are not feasible within the revision timeline, we will explicitly state the limitation. revision: yes
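Response 1 above mentions CLIP similarity on held-out prompts. As a stand-in only (the authors' internal metric and checkpoint are not disclosed), a frame-averaged text-video CLIP score can be computed with the public openai/clip-vit-base-patch32 model via Hugging Face transformers:

```python
# Hedged sketch of a frame-averaged CLIP text-video similarity score; this is
# a generic stand-in, not the authors' actual evaluation code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_text_video_score(prompt, frames):
    """Mean cosine similarity between the prompt and each sampled frame."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()

# usage sketch with dummy frames; real use would sample frames from a generated clip
frames = [Image.new("RGB", (224, 224)) for _ in range(4)]
print(clip_text_video_score("a cat jumping over a fence", frames))
```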

Circularity Check

0 steps flagged

No circularity: descriptive engineering report without derivations or self-referential predictions

full rationale

The paper is a technical report summarizing data curation, architecture choices for joint text-to-video and image-to-video training, post-training with supervised fine-tuning and video-specific RLHF, and multi-stage distillation for inference speedup. It asserts that these yield superior spatiotemporal fluidity, structural stability, instruction adherence, and multi-shot coherence versus SOTA models, with one concrete efficiency claim (5-second 1080p video in 41.4 seconds on NVIDIA-L20). No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear. The central claims are direct assertions of outcomes from the enumerated engineering steps rather than a closed logical chain that reduces to its own inputs by construction. This is a standard non-circular descriptive summary.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or new physical entities are introduced; the work is an applied systems report on model training and optimization.

pith-pipeline@v0.9.0 · 5701 in / 1065 out tokens · 54751 ms · 2026-05-11T12:03:01.611577+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Seedance 1.0 integrates four key technical improvements: (i) multi-source data curation... (ii) an efficient architecture design... (iii) carefully-optimized post-training... (iv) excellent model acceleration achieving ~10× inference speedup

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Compared to state-of-the-art video generation models, Seedance 1.0 stands out with high-quality and fast video generation having superior spatiotemporal fluidity with structural stability, precise instruction adherence in complex multi-subject contexts, native multi-shot narrative coherence

  • Foundation.DimensionForcing alexander_duality_circle_linking · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We employ the flow matching framework with velocity prediction... decoupled spatial and temporal layers... Multishot MM-RoPE
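The last quoted passage names flow matching with velocity prediction. As a generic, hedged reference point (Seedance's implementation is not public), that objective usually trains the backbone to predict the constant velocity of a straight noise-to-data path:

```python
# Generic flow-matching training step with velocity prediction (rectified-flow
# style); illustrative only, not Seedance's code.
import torch

def flow_matching_loss(model, x1, text_emb):
    """x1: clean video latents (B, C, T, H, W); the model predicts v = x1 - x0."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.shape[0], 1, 1, 1, 1)         # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                    # point on the linear path
    v_target = x1 - x0                              # constant velocity of that path
    v_pred = model(xt, text_emb)                    # hypothetical backbone call
    return torch.mean((v_pred - v_target) ** 2)

# usage sketch with a trivial stand-in backbone
backbone = torch.nn.Conv3d(16, 16, kernel_size=1)
x1 = torch.randn(2, 16, 8, 32, 32)
loss = flow_matching_loss(lambda x, c: backbone(x), x1, text_emb=None)
loss.backward()
```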

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

  2. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...

  3. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...

  4. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.

  5. TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks

    cs.CV 2026-05 unverdicted novelty 7.0

    TrajShield is a training-free defense that reduces jailbreak success rates by 52.44% on average in text-to-video models by localizing and neutralizing risks through trajectory simulation and causal intervention.

  6. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  7. Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

    cs.CV 2026-04 unverdicted novelty 7.0

    Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.

  8. HumanScore: Benchmarking Human Motions in Generated Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    HumanScore defines six metrics for kinematic plausibility, temporal stability, and biomechanical consistency to benchmark human motions in videos from thirteen state-of-the-art generation models, revealing gaps betwee...

  9. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  10. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  11. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  12. AnimationBench: Are Video Models Good at Character-Centric Animation?

    cs.CV 2026-04 unverdicted novelty 7.0

    AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.

  13. Tracking High-order Evolutions via Cascading Low-rank Fitting

    cs.LG 2026-04 unverdicted novelty 7.0

    Cascading low-rank fitting approximates successive high-order derivatives in diffusion models via a shared base function with sequentially added low-rank components, accompanied by theorems proving monotonic non-incre...

  14. OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.

  15. DWDP: Distributed Weight Data Parallelism for High-Performance LLM Inference on NVL72

    cs.DC 2026-04 unverdicted novelty 7.0

    DWDP distributes MoE weights across GPUs for independent execution without collective synchronization, improving output TPS/GPU by 8.8 percent on GB200 NVL72 for DeepSeek-R1 under 8K input and 1K output lengths.

  16. PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    PhyMotion scores generated human videos by grounding recovered 3D poses in a physics simulator across kinematic, contact, and dynamic axes, yielding stronger human correlation and larger RL post-training gains than pr...

  17. WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

    cs.CV 2026-05 unverdicted novelty 6.0

    The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.

  18. SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SocialDirector uses spatiotemporal actor masking and directional reweighting on cross-attention maps to reduce actor-action mismatches and improve target-directed interactions in generated multi-person videos.

  19. FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    FaithfulFaces introduces a pose-faithful identity aligner with a shared dictionary and invariance constraint to maintain facial identity in text-to-video generation under large pose changes and occlusions.

  20. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 conditional novelty 6.0

    MotionCache accelerates autoregressive video generation up to 6.28x by motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on SkyReels-V2 and MAGI-1.

  21. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  22. CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.

  23. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  24. DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

    cs.CV 2026-04 unverdicted novelty 6.0

    DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.

  25. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  26. Latent-Compressed Variational Autoencoder for Video Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.

  27. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  28. ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks

    cs.CV 2026-04 unverdicted novelty 6.0

    ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.

  29. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  30. Video Generation with Predictive Latents

    cs.CV 2026-05 unverdicted novelty 5.0

    PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.

  31. Motion-Aware Caching for Efficient Autoregressive Video Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.

  32. A Systematic Post-Train Framework for Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.

  33. Motif-Video 2B: Technical Report

    cs.CV 2026-04 unverdicted novelty 5.0

    Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.

  34. Not all tokens contribute equally to diffusion learning

    cs.CV 2026-04 unverdicted novelty 5.0

    DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.

  35. Advancing Reliable Synthetic Video Detection: Insights from the SAFE Challenge

    cs.CV 2026-05 unverdicted novelty 4.0

    The SAFE challenge shows measurable progress in detecting synthetic videos across different generators but persistent weaknesses against post-processing operations.

  36. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV 2026-05 unverdicted novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

  37. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

  38. Seedance 2.0: Advancing Video Generation for World Complexity

    cs.CV 2026-04 unverdicted novelty 3.0

    Seedance 2.0 is an updated multi-modal model for generating 4-15 second audio-video content at 480p/720p with support for up to 3 video, 9 image, and 3 audio references.

  39. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 35 Pith papers · 12 internal anchors

  1. [1]

    artificialanalysis

    artificialanalysis.ai. artificialanalysis. https://artificialanalysis.ai/text-to-video/arena?tab=leaderboard, 2025

  2. [2]

    ByteDance. bmf. https://babitmf.github.io/, 2024

  3. [3]

    Deep compression autoencoder for efficient high-resolution diffusion models

    Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models.arXiv preprint arXiv:2410.10733, 2024

  4. [4]

    Control-a-video: Controllable text-to-video generation with diffusion models.arXiv e-prints, pages arXiv–2305, 2023

    Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models.arXiv e-prints, pages arXiv–2305, 2023

  5. [5]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

  6. [6]

    Seedream 3.0 Technical Report

    Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025

  7. [7]

    Factorizing text-to-video generation by explicit image conditioning

    Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Factorizing text-to-video generation by explicit image conditioning. In European Conference on Computer Vision, pages 205–224. Springer, 2024

  8. [8]

    Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025

    Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native chinese-english bilingual image generation foundation model.arXiv preprint arXiv:2503.07703, 2025

  9. [9]

    Long context tuning for video generation

    Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation.arXiv preprint arXiv:2503.10589, 2025

  10. [10]

    Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

  11. [11]

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018

  12. [12]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  13. [13]

    Prompt-a-video: Prompt your video diffusion model via preference-aligned llm.arXiv preprint arXiv:2412.15156, 2024

    Yatai Ji, Jiacheng Zhang, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian GE, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, et al. Prompt-a-video: Prompt your video diffusion model via preference-aligned llm.arXiv preprint arXiv:2412.15156, 2024

  14. [14]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-encoding variational bayes.CoRR, abs/1312.6114, 2013. URL https://api.semanticscholar.org/CorpusID:216078090

  15. [15]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  16. [16]

    Diffusion Adversarial Post-Training for One-Step Video Generation

    Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation.arXiv preprint arXiv:2501.08316, 2025

  17. [17]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  18. [18]

    Improving Video Generation with Human Feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback.arXiv preprint arXiv:2501.13918, 2025

  19. [19]

    Ray: A distributed framework for emerging AI applications

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, 2018

  20. [20]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  21. [21]

    Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

  22. [22]

    Hyper-sd: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems, 37:117340–117362, 2025

    Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-sd: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems, 37:117340–117362, 2025

  23. [23]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, pages 10684–10695, 2022

  24. [24]

    Seaweed-7b: Cost-effective training of video generation foundation model

    Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al. Seaweed-7b: Cost-effective training of video generation foundation model.arXiv preprint arXiv:2504.08685, 2025

  25. [25]

    Rayflow: Instance-aware diffusion acceleration via adaptive flow trajectories.arXiv preprint arXiv:2503.07699, 2025

    Huiyang Shao, Xin Xia, Yuhong Yang, Yuxi Ren, Xing Wang, and Xuefeng Xiao. Rayflow: Instance-aware diffusion acceleration via adaptive flow trajectories.arXiv preprint arXiv:2503.07699, 2025

  26. [26]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  27. [27]

    Training-free and Adaptive Sparse Attention for Efficient Long Video Generation, February 2025

    Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, and Bin Cui. Training-free and adaptive sparse attention for efficient long video generation.arXiv preprint arXiv:2502.21079, 2025

  28. [28]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818, 2025

  29. [29]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

  30. [30]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  31. [31]

    Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023

  32. [32]

    Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding. arXiv preprint arXiv:2501.07888, 2025

    Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, and Yuan Lin. Tarsier2: Advancing large vision-language models from detailed video description to comprehensive video understanding.arXiv preprint arXiv:2501.07888, 2025

  33. [33]

    Onlinevpo: Align video diffusion model with online video-centric preference optimization,

    Jiacheng Zhang, Jie Wu, Weifeng Chen, Yatai Ji, Xuefeng Xiao, Weilin Huang, and Kai Han. Onlinevpo: Align video diffusion model with online video-centric preference optimization.arXiv preprint arXiv:2412.15159, 2024

  34. [34]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  35. [35]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023