
arxiv: 2502.10248 · v3 · pith:7EGEGHRS · submitted 2025-02-14 · 💻 cs.CV · cs.CL

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

Pith reviewed 2026-05-19 07:57 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords: text-to-video, video generation, diffusion transformer, flow matching, direct preference optimization, variational autoencoder, foundation model, benchmark

The pith

A 30 billion parameter model generates high-quality videos up to 204 frames long from text prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Step-Video-T2V, a pre-trained text-to-video model with 30 billion parameters designed to produce videos as long as 204 frames. It introduces a deep compression Video-VAE for efficient latent space representation, bilingual text encoders for English and Chinese prompts, a diffusion transformer with 3D attention trained by flow matching, and a video-specific direct preference optimization step to clean up artifacts. The model is shown to reach state-of-the-art results on the authors' new Step-Video-T2V-Eval benchmark against both open and commercial competitors. Readers would care because reliable text-to-video systems open pathways for automated video production in media, education, and design. The report also shares practical training observations and discusses current limits of diffusion models for video.

Core claim

Step-Video-T2V is a 30B-parameter text-to-video foundation model that can generate videos up to 204 frames in length. It relies on a Video-VAE providing 16x16 spatial and 8x temporal compression with high reconstruction fidelity, dual bilingual text encoders, a DiT backbone using 3D full attention and trained via flow matching to turn noise into video latents, plus a Video-DPO stage that aligns outputs to reduce visual artifacts. On the newly introduced Step-Video-T2V-Eval benchmark the system records state-of-the-art quality scores relative to existing open-source and commercial text-to-video engines.
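
The stated compression ratios can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: the 544x992 resolution, the function names, and the use of ceiling division for frame counts that do not divide evenly are assumptions, since the text here does not say how the Video-VAE pads or chunks its input.

```python
import math

def latent_shape(frames, height, width, t_ratio=8, s_ratio=16):
    """Return (latent_frames, latent_height, latent_width).

    t_ratio and s_ratio are the 8x temporal and 16x16 spatial ratios
    quoted in the report. Ceiling division is an assumption made here
    for sizes that are not exact multiples of the ratios.
    """
    return (
        math.ceil(frames / t_ratio),
        math.ceil(height / s_ratio),
        math.ceil(width / s_ratio),
    )

def compression_factor(t_ratio=8, s_ratio=16):
    """Reduction in the number of spatio-temporal positions (channels ignored)."""
    return t_ratio * s_ratio * s_ratio

# Hypothetical input: 204 frames (from the report) at an assumed 544x992 resolution.
print(latent_shape(204, 544, 992))   # (26, 34, 62)
print(compression_factor())          # 2048
```

Under these assumptions each latent position stands in for 2048 pixel positions, which is what keeps full 3D attention over a 204-frame clip tractable.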

What carries the argument

The combination of a deep-compression Video-VAE, 3D full-attention DiT trained with flow matching, and Video-DPO preference tuning that together enable long, high-fidelity video synthesis from text.
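
To make the flow-matching ingredient concrete, here is a minimal pure-Python sketch of one training-loss evaluation. It assumes the common straight-line path between Gaussian noise and a data latent (the rectified-flow form), which the text here does not confirm as the report's exact choice. The `model` callable and all function names are hypothetical stand-ins for the DiT.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path, constant in t: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(model, x1, rng):
    """One-sample estimate of the mean squared velocity error.

    model(x_t, t) is a hypothetical callable returning a predicted
    velocity with the same length as x_t.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # sample Gaussian noise
    t = rng.random()                          # sample a time in [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = model(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)

# Toy usage: a flattened "latent" and a model that always predicts zero velocity.
latent = [0.5, -1.0, 2.0]
zero_model = lambda x_t, t: [0.0] * len(x_t)
loss = flow_matching_loss(zero_model, latent, random.Random(0))
```

At sampling time a trained model would be integrated from noise toward data along the predicted velocity field; that step is omitted here.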

If this is right

  • Longer video sequences become feasible without proportional increases in compute.
  • Video-DPO can be reused to polish outputs from other generation pipelines.
  • Training insights help scale future video models beyond current diffusion limits.
  • Benchmark results guide the community toward better evaluation practices for video generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark proves representative, flow-matching plus preference optimization may become standard for video alignment.
  • Future work could test whether the same compression ratios work for even longer or higher-resolution videos.
  • Connections to image foundation models suggest hybrid training regimes might further improve consistency.

Load-bearing premise

The Step-Video-T2V-Eval benchmark fairly measures real-world video quality without favoring models that use similar training data or objectives.

What would settle it

Independent tests on public benchmarks such as VBench or Gen-AI-Bench showing that Step-Video-T2V falls behind leading commercial systems in human preference scores or automatic metrics.

Original abstract

We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Step-Video-T2V, a 30B-parameter text-to-video foundation model that generates videos up to 204 frames long. It introduces a deep-compression Video-VAE (16x16 spatial and 8x temporal), bilingual text encoders, a 3D full-attention DiT trained via Flow Matching, and a Video-DPO post-training stage to reduce artifacts. Training strategies and observations are shared, and the model is evaluated on the authors' newly proposed Step-Video-T2V-Eval benchmark, where it is reported to achieve state-of-the-art quality relative to both open-source and commercial systems. The work also discusses limitations of current diffusion paradigms and future directions, with public release of the model and benchmark.

Significance. If the performance claims are independently verified, the work would constitute a meaningful contribution by scaling video generation to 30B parameters with long-sequence outputs and by releasing both the model and a new benchmark. The practical training insights and the specific Video-VAE and Video-DPO components could aid subsequent research in video foundation models.

major comments (2)
  1. Evaluation section: The SOTA claim is supported solely by quantitative and human-preference results on the self-introduced Step-Video-T2V-Eval benchmark. No inter-rater reliability statistics, prompt-source diversity metrics, or leakage analysis against the 30B model's pre-training corpus are reported, leaving open the possibility that observed margins reflect benchmark construction rather than general capability.
  2. Model architecture and training details: While high-level components (Video-VAE compression ratios, Flow Matching objective, Video-DPO) are described, the manuscript provides no equations, pseudocode, or ablation tables quantifying the contribution of each stage (e.g., the incremental gain from Video-DPO over the base Flow-Matching DiT), which is required to substantiate the architectural choices as load-bearing for the reported quality.
minor comments (2)
  1. Abstract: The claim of 'state-of-the-art text-to-video quality' is stated without any numerical scores or table references, reducing immediate clarity for readers.
  2. Notation: The terms 'Video-VAE' and 'Video-DPO' are introduced without an initial parenthetical expansion or citation to the corresponding prior DPO literature, which would aid readability.
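
One way the inter-rater reliability statistic requested in major comment 1 could be computed is Cohen's kappa for two annotators. The sketch below is a generic pure-Python implementation. The label values and the two-rater setup are assumptions for illustration, not details of the Step-Video-T2V-Eval protocol.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical pairwise-preference labels from two annotators.
a = ["win", "win", "loss", "loss"]
b = ["win", "loss", "loss", "loss"]
print(cohens_kappa(a, b))  # 0.5
```

With more than two annotators per item, Fleiss' kappa would be the usual generalisation.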

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our technical report. We address each major comment below and outline the revisions we will make to improve the manuscript.

Point-by-point responses
  1. Referee: Evaluation section: The SOTA claim is supported solely by quantitative and human-preference results on the self-introduced Step-Video-T2V-Eval benchmark. No inter-rater reliability statistics, prompt-source diversity metrics, or leakage analysis against the 30B model's pre-training corpus are reported, leaving open the possibility that observed margins reflect benchmark construction rather than general capability.

    Authors: We agree that stronger evaluation details are needed to support the SOTA claims. In the revised version, we will expand the evaluation section with: (i) explicit metrics on prompt-source diversity and construction process for Step-Video-T2V-Eval, (ii) inter-rater reliability statistics (e.g., Cohen's or Fleiss' kappa) from the human preference study, and (iii) a dedicated paragraph discussing steps taken to minimize data leakage, including manual curation and deduplication checks against the pre-training corpus. While an exhaustive leakage audit at 30B scale is computationally prohibitive, these additions will provide greater transparency and address concerns about benchmark-specific effects. revision: partial

  2. Referee: Model architecture and training details: While high-level components (Video-VAE compression ratios, Flow Matching objective, Video-DPO) are described, the manuscript provides no equations, pseudocode, or ablation tables quantifying the contribution of each stage (e.g., the incremental gain from Video-DPO over the base Flow-Matching DiT), which is required to substantiate the architectural choices as load-bearing for the reported quality.

    Authors: We concur that additional technical specificity is warranted. We will revise the manuscript to include: the full equations for the Flow Matching objective and the Video-DPO preference loss; pseudocode outlining the 3D full-attention DiT forward pass, training loop, and Video-VAE encoding/decoding; and a new ablation table that reports incremental gains (e.g., FID, human preference scores) when adding Video-DPO on top of the base Flow-Matching DiT. These changes will make the contribution of each component explicit and reproducible. revision: yes
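
For orientation, the kind of preference loss the second response refers to can be sketched as follows. This follows the general DPO form as adapted to diffusion-style models in prior work, with prediction errors standing in for log-likelihoods. It is an assumption, not the report's actual Video-DPO loss, which the text here does not give. The function and argument names are hypothetical.

```python
import math

def dpo_style_loss(err_w_policy, err_w_ref, err_l_policy, err_l_ref, beta=1.0):
    """-log sigmoid(-beta * margin) for one preferred/rejected pair.

    err_w_* are prediction errors on the preferred ("winning") video and
    err_l_* on the rejected one, for the trained policy and a frozen
    reference model. margin < 0 means the policy improved on the preferred
    sample more than on the rejected one, relative to the reference.
    """
    margin = (err_w_policy - err_w_ref) - (err_l_policy - err_l_ref)
    z = -beta * margin
    # Numerically stable -log(sigmoid(z)) = softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# When the policy equals the reference, the loss is log(2).
print(dpo_style_loss(1.0, 1.0, 1.0, 1.0))
# Lower error on the preferred sample reduces the loss.
print(dpo_style_loss(0.5, 1.0, 1.0, 1.0))
```

An ablation of the sort the referee asks for would compare the base model against the same model after optimising a loss of this type, holding everything else fixed.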

Circularity Check

0 steps flagged

No circularity in derivation or prediction chain

Full rationale

The paper is a technical report on training and evaluating a 30B-parameter text-to-video model using Flow Matching, 3D DiT, Video-VAE, bilingual encoders, and Video-DPO. All claims rest on empirical model outputs and comparisons against baselines on the newly introduced Step-Video-T2V-Eval benchmark. No mathematical derivations, first-principles predictions, or equations are presented that reduce to fitted inputs or self-definitions by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in any reasoning chain. The central SOTA result is an empirical observation rather than a self-referential definition, making the paper self-contained against external benchmarks with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities beyond the introduction of custom Video-VAE and Video-DPO as engineering components.

invented entities (2)
  • Video-VAE (no independent evidence)
    purpose: Achieve 16x16 spatial and 8x temporal compression for video while preserving reconstruction quality
    Custom variational autoencoder designed specifically for the video generation pipeline
  • Video-DPO (no independent evidence)
    purpose: Reduce artifacts and improve visual quality post-training
    Video-adapted direct preference optimization applied to generated outputs

pith-pipeline@v0.9.0 · 6246 in / 1372 out tokens · 31418 ms · 2026-05-19T07:57:17.124855+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/DimensionForcing.lean D3_admits_circle_linking echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames.

  • Foundation/EightTick.lean eight_tick_forces_D3 echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    achieving 16x16 spatial and 8x temporal compression ratios

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

    cs.CV 2026-05 unverdicted novelty 7.0

    HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.

  2. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  3. Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

    cs.CV 2026-05 unverdicted novelty 7.0

    PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

  4. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  5. VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?

    cs.CV 2025-12 unverdicted novelty 7.0

    VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.

  6. GenHSI: Controllable Generation of Human-Scene Interaction Videos

    cs.CV 2025-06 unverdicted novelty 7.0

    GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D i...

  7. Qwen-Image-VAE-2.0 Technical Report

    cs.CV 2026-05 unverdicted novelty 6.0

    Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.

  8. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  9. DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior

    cs.CV 2026-04 unverdicted novelty 6.0

    DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.

  10. SynthForensics: Benchmarking and Evaluating People-Centric Synthetic Video Deepfakes

    cs.CV 2026-02 unverdicted novelty 6.0

    SynthForensics is a people-centric benchmark where face-based detectors lose 13-55 AUC points on modern synthetic videos compared to legacy manipulation sets.

  11. HunyuanVideo 1.5 Technical Report

    cs.CV 2025-11 unverdicted novelty 6.0

    HunyuanVideo 1.5 delivers state-of-the-art open-source text-to-video and image-to-video generation with an 8.3B parameter DiT model featuring SSTA attention, glyph-aware encoding, and progressive training.

  12. Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

    cs.CV 2025-09 unverdicted novelty 6.0

    A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.

  13. Listener-Rewarded Thinking in VLMs for Image Preferences

    cs.CV 2025-06 unverdicted novelty 6.0

    Listener-augmented GRPO uses an independent frozen VLM to provide dense confidence scores on reasoning traces, yielding 67.4% accuracy on ImageReward, up to +6% OOD gains on 1.2M-vote human data, and fewer reasoning c...

  14. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  15. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  16. Motif-Video 2B: Technical Report

    cs.CV 2026-04 unverdicted novelty 5.0

    Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.

  17. Qwen-Image-2.0 Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.

  18. EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

    cs.CV 2026-02 unverdicted novelty 4.0

    EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...

  19. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Reference graph

Works this paper leans on

296 extracted references · 296 canonical work pages · cited by 19 Pith papers · 65 internal anchors

  1. [1]

    Video generation models as world simulators

    OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators, 2024

  2. [2]

    DeepMind. Veo 2. https://deepmind.google/technologies/veo/veo-2, 2024

  3. [3]

    Kuaishou. Kling. https://klingai.kuaishou.com, 2024

  4. [4]

    MiniMax. Hailuo. https://hailuoai.com/video, 2024

  5. [5]

    Gen-3 alpha

    RunwayML. Gen-3 alpha. https://runwayml.com/research/introducing-gen-3-alpha, 2024

  6. [6]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

  7. [8]

    Open-sora: Democratizing efficient video production for all

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. March 2024. URL https://github.com/hpcaitech/Open-Sora

  8. [10]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.03206

  9. [11]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

  10. [12]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Sing...

  11. [13]

    Language model beats diffusion - tokenizer is key to visual generation

    Lijun Yu, Jose Lezama, Nitesh Bharadwaj Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A Ross, and Lu Jiang. Language model beats diffusion - tokenizer is key to visual generation. In The Twelfth International Conference on Learning Representat...

  12. [14]

    Cosmos World Foundation Model Platform for Physical AI

    Nvidia. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, 2025

  13. [15]

    Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model

    Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, and Li Yuan. Wf-vae: Enhancing video vae by wavelet-driven energy flow for latent video diffusion model. arXiv preprint arXiv:2411.17459, 2024 a

  14. [16]

    Deep compression autoencoder for efficient high-resolution diffusion models

    Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=wH8XXUOUZU

  15. [17]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. URL https://arxiv.org/abs/2210.02747

  16. [18]

    Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...

  17. [19]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409

  18. [20]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart- : Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023. URL https://arxiv.org/abs/2310.00426

  19. [21]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URL https://arxiv.org/abs/2104.09864

  20. [22]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 0 27730--27744, 2022

  21. [23]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

  22. [24]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024

  23. [25]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228--8238, 2024

  24. [26]

    Using human feedback to fine-tune diffusion models without any reward model

    Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941--8951, 2024 b

  25. [27]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. ArXiv, abs/2209.03003, 2022. URL https://api.semanticscholar.org/CorpusID:252111177

  26. [28]

    Improving the training of rectified flows

    Sangyun Lee, Zinan Lin, and Giulia Fanti. Improving the training of rectified flows. arXiv preprint arXiv:2405.20320, 2024

  27. [29]

    Efficient large-scale language model training on gpu clusters using megatron-lm

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, S...

  28. [30]

    Reducing activation recomputation in large transformer models

    Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5: 0 341--353, 2023

  29. [31]

    Ring Attention with Blockwise Transformers for Near-Infinite Context

    Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023

  30. [32]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  31. [33]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--16. IEEE, 2020

  32. [34]

    Disttrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models

    Zili Zhang, Yinmin Zhong, Ranchen Ming, Hanpeng Hu, Jianjian Sun, Zheng Ge, Yibo Zhu, and Xin Jin. Disttrain: Addressing model and data heterogeneity with disaggregated training for multimodal large language models. arXiv preprint arXiv:2408.04275, 2024

  33. [35]

    Channels Last Memory Format in PyTorch

    PyTorch . Channels Last Memory Format in PyTorch . PyTorch, https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html, 2023. Accessed: Oct 4, 2023

  34. [36]

    Jordan, and Ion Stoica

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561--577, Carlsbad, CA, October 2018. U...

  35. [37]

    Pytorch rpc: Distributed deep learning built on tensor-optimized remote procedure calls

    Pritam Damania, Shen Li, Alban Desmaison, Alisson Azzolini, Brian Vaughan, Edward Yang, Gregory Chanan, Guoqiang Jerry Chen, Hongyi Jia, Howard Huang, et al. Pytorch rpc: Distributed deep learning built on tensor-optimized remote procedure calls. Proceedings of Machine Learning and Systems, 5: 0 219--231, 2023

  36. [38]

    Mooncake: A kvcache-centric disaggregated architecture for llm serving, 2024

    Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A kvcache-centric disaggregated architecture for llm serving, 2024. URL https://arxiv.org/abs/2407.00079

  37. [39]

    The llama 3 herd of models, April 2024

    Meta LlamaTeam. The llama 3 herd of models, April 2024. URL https://ai.meta.com/research/publications/the-llama-3-herd-of-models/

  38. [40]

    PySceneDetect

    PySceneDetect Developers . PySceneDetect. PySceneDetect. https://www.scenedetect.com/

  39. [41]

    FFmpeg Developers . FFmpeg. FFmpeg. https://ffmpeg.org/

  40. [42]

    Panda-70m: Captioning 70m videos with multiple cross-modality teachers

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13320--13331, 2024 a

  41. [43]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278--25294, 2022

  42. [44]

    Clip-based nsfw detector

    LAION. Clip-based nsfw detector. https://github.com/LAION-AI/CLIP-based-NSFW-Detector, 2021

  43. [45]

    Efficientnet: Rethinking model scaling for convolutional neural networks

    Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 6105--6114. PMLR, 2019

  44. [46]

    Paddleocr

    PaddleOCR Contributors. Paddleocr. https://github.com/PaddlePaddle/PaddleOCR, 2023

  45. [47]

    OpenCV Developers. OpenCV. https://opencv.org/, 2021. Accessed: August 1, 2023

  46. [48]

    Diatom autofocusing in brightfield microscopy: a comparative study

    José Luis Pech-Pacheco, Gabriel Cristóbal, Jesús Chamorro-Martinez, and Joaquín Fernández-Valdivia. Diatom autofocusing in brightfield microscopy: a comparative study. In Proceedings 15th International Conference on Pattern Recognition. ICPR-2000, volume 3, pages 314--317. IEEE, 2000

  47. [49]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023

  48. [50]

    Some methods for classification and analysis of multivariate observations

    J MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1967

  49. [53]

    DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training

    Xin Tan, Yuetao Chen, Yimin Jiang, Xing Chen, Kun Yan, Nan Duan, Yibo Zhu, Daxin Jiang, and Hong Xu. DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training. arXiv preprint arXiv:2502.07590, 2025

  50. [54]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024

    Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024. URL https://arxiv.org/abs/2407.01392

  51. [55]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion, 2024. URL https://arxiv.org/abs/2501.00103

  52. [56]

    Taming teacher forcing for masked autoregressive video generation

    Deyu Zhou, Quan Sun, Yuang Peng, Kun Yan, Runpei Dong, Duomin Wang, Zheng Ge, Nan Duan, Xiangyu Zhang, Lionel M. Ni, and Heung-Yeung Shum. Taming teacher forcing for masked autoregressive video generation, 2025. URL https://arxiv.org/abs/2501.12389

  53. [57]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  54. [58]

    OpenAI, 2024

  55. [59]

    Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

    Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion, 2024

  56. [60]

    LTX-Video: Realtime Video Latent Diffusion

    LTX-Video: Realtime Video Latent Diffusion, 2024

  57. [61]

    Taming Teacher Forcing for Masked Autoregressive Video Generation

    Taming Teacher Forcing for Masked Autoregressive Video Generation, 2025

  58. [62]

    Improving image generation with better captions

    Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf

  59. [63]

    Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

    Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding, 2024

  60. [64]

    Flow Matching for Generative Modeling

    Flow Matching for Generative Modeling, 2023

  61. [65]

    Movie Gen: A Cast of Media Foundation Models

    Movie Gen: A Cast of Media Foundation Models, 2024

  62. [66]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, 2024

  63. [67]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. arXiv preprint arXiv:2408.06072

  64. [68]

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You, 2024

  65. [69]

    Open-Sora Plan: Open-Source Large Video Generation Model

    Open-Sora Plan: Open-Source Large Video Generation Model. arXiv preprint arXiv:2412.00131

  66. [70]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    HunyuanVideo: A Systematic Framework For Large Video Generative Models, 2025

  67. [71]

    DeepMind, 2024

  68. [72]

    Kuaishou, 2024

  69. [73]

    MiniMax, 2024

  70. [74]

    RunwayML, 2024

  71. [75]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025

  72. [76]

    OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

    OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text, 2023

  73. [77]

    Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Alexandros G. Dimakis, Gabriel Ilharco, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, and Ludwig Schmidt

  74. [78]

    GPT-4 Technical Report

    GPT-4 Technical Report, 2023

  75. [79]

    PaLM 2 Technical Report

    PaLM 2 Technical Report. arXiv preprint arXiv:2305.10403

  76. [80]

    CCNet: Extracting high quality monolingual datasets from web crawl data

    CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359

  77. [81]

    LLaMA: Open and Efficient Foundation Language Models

    LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971

  78. [82]

    Mistral 7B

    Mistral 7B. arXiv preprint arXiv:2310.06825

  79. [83]

    Language Modeling Is Compression

    Language Modeling Is Compression. In The Twelfth International Conference on Learning Representations

  80. [84]

    Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey
