Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Pith reviewed 2026-05-19 07:57 UTC · model grok-4.3
pith:7EGEGHRS
The pith
A 30-billion-parameter model generates high-quality videos of up to 204 frames from text prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Step-Video-T2V is a 30B-parameter text-to-video foundation model that can generate videos up to 204 frames in length. It relies on a Video-VAE providing 16x16 spatial and 8x temporal compression with high reconstruction fidelity, dual bilingual text encoders, a DiT backbone using 3D full attention and trained via flow matching to turn noise into video latents, plus a Video-DPO stage that aligns outputs to reduce visual artifacts. On the newly introduced Step-Video-T2V-Eval benchmark the system records state-of-the-art quality scores relative to existing open-source and commercial text-to-video engines.
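The compression ratios quoted above can be made concrete with a little arithmetic. The sketch below is only an illustration: the 544x992 frame size is an assumed example input that does not come from this page, and rounding up with ceiling division is also an assumption, since an actual VAE may pad or handle the first frame differently. The function names are hypothetical.

```python
import math

def latent_shape(frames, height, width, t_ratio=8, s_ratio=16):
    """Return (latent_frames, latent_height, latent_width).

    Assumes ceiling division; how a real Video-VAE rounds or pads
    is not stated in the text above.
    """
    return (
        math.ceil(frames / t_ratio),
        math.ceil(height / s_ratio),
        math.ceil(width / s_ratio),
    )

def position_reduction(t_ratio=8, s_ratio=16):
    """Factor by which the number of spatio-temporal positions shrinks."""
    return t_ratio * s_ratio * s_ratio

# Illustrative input: 204 frames at an assumed 544x992 resolution.
print(latent_shape(204, 544, 992))
print(position_reduction())
```

Because 3D full attention compares every latent position with every other one, the number of positions the DiT has to attend over is the quantity this compression controls.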
What carries the argument
The combination of a deep-compression Video-VAE, 3D full-attention DiT trained with flow matching, and Video-DPO preference tuning that together enable long, high-fidelity video synthesis from text.
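As a rough illustration of what "trained with flow matching" usually means, here is a minimal NumPy sketch of the common linear-path (rectified-flow style) objective. The page does not give the paper's exact parameterization, so the time convention (t = 0 is noise, t = 1 is data), the uniform time sampling, and the names `flow_matching_loss` and `velocity_fn` are all assumptions made for this example.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x1, rng):
    """Mean squared error between predicted and target velocity.

    x1          : batch of clean latents, shape (batch, ...)
    velocity_fn : callable (x_t, t) -> predicted velocity, same shape as x_t
    rng         : numpy.random.Generator used for noise and time sampling
    """
    x0 = rng.standard_normal(x1.shape)               # pure noise sample
    t_shape = (x1.shape[0],) + (1,) * (x1.ndim - 1)   # one t per batch item
    t = rng.uniform(size=t_shape)                     # t in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1                     # straight-line path
    target = x1 - x0                                  # constant velocity along it
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))
```

In training, `velocity_fn` would be the text-conditioned DiT; at sampling time the learned velocity field is integrated from noise toward a video latent.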
If this is right
- Longer video sequences become feasible without proportional increases in compute.
- Video-DPO can be reused to polish outputs from other generation pipelines.
- Training insights help scale future video models beyond current diffusion limits.
- Benchmark results guide the community toward better evaluation practices for video generation.
Where Pith is reading between the lines
- If the benchmark proves representative, flow-matching plus preference optimization may become standard for video alignment.
- Future work could test whether the same compression ratios work for even longer or higher-resolution videos.
- Connections to image foundation models suggest hybrid training regimes might further improve consistency.
Load-bearing premise
The Step-Video-T2V-Eval benchmark fairly measures real-world video quality without favoring models that use similar training data or objectives.
What would settle it
Independent tests on public benchmarks such as VBench or Gen-AI-Bench showing that Step-Video-T2V falls behind leading commercial systems in human preference scores or automatic metrics.
Original abstract
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Step-Video-T2V, a 30B-parameter text-to-video foundation model that generates videos up to 204 frames long. It introduces a deep-compression Video-VAE (16x16 spatial and 8x temporal), bilingual text encoders, a 3D full-attention DiT trained via Flow Matching, and a Video-DPO post-training stage to reduce artifacts. Training strategies and observations are shared, and the model is evaluated on the authors' newly proposed Step-Video-T2V-Eval benchmark, where it is reported to achieve state-of-the-art quality relative to both open-source and commercial systems. The work also discusses limitations of current diffusion paradigms and future directions, with public release of the model and benchmark.
Significance. If the performance claims are independently verified, the work would constitute a meaningful contribution by scaling video generation to 30B parameters with long-sequence outputs and by releasing both the model and a new benchmark. The practical training insights and the specific Video-VAE and Video-DPO components could aid subsequent research in video foundation models.
major comments (2)
- Evaluation section: The SOTA claim is supported solely by quantitative and human-preference results on the self-introduced Step-Video-T2V-Eval benchmark. No inter-rater reliability statistics, prompt-source diversity metrics, or leakage analysis against the 30B model's pre-training corpus are reported, leaving open the possibility that observed margins reflect benchmark construction rather than general capability.
- Model architecture and training details: While high-level components (Video-VAE compression ratios, Flow Matching objective, Video-DPO) are described, the manuscript provides no equations, pseudocode, or ablation tables quantifying the contribution of each stage (e.g., the incremental gain from Video-DPO over the base Flow-Matching DiT), which is required to substantiate the architectural choices as load-bearing for the reported quality.
minor comments (2)
- Abstract: The claim of 'state-of-the-art text-to-video quality' is stated without any numerical scores or table references, reducing immediate clarity for readers.
- Notation: The terms 'Video-VAE' and 'Video-DPO' are introduced without an initial parenthetical expansion or citation to the corresponding prior DPO literature, which would aid readability.
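For readers who want the prior DPO formulation that the notation comment refers to, the sketch below implements the standard Direct Preference Optimization loss for a single preference pair, as introduced for language models. It is not the paper's Video-DPO loss, whose exact form is not given on this page; the function name and the default `beta` are assumptions for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w, logp_l         : policy log-probabilities of the preferred
                             and rejected samples
    ref_logp_w, ref_logp_l : the same quantities under the frozen
                             reference model
    beta                   : strength of the implicit KL constraint
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls as the policy raises the preferred sample's likelihood relative to the reference model faster than it raises the rejected sample's.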
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our technical report. We address each major comment below and outline the revisions we will make to improve the manuscript.
Point-by-point responses
- Referee: Evaluation section: The SOTA claim is supported solely by quantitative and human-preference results on the self-introduced Step-Video-T2V-Eval benchmark. No inter-rater reliability statistics, prompt-source diversity metrics, or leakage analysis against the 30B model's pre-training corpus are reported, leaving open the possibility that observed margins reflect benchmark construction rather than general capability.
  Authors: We agree that stronger evaluation details are needed to support the SOTA claims. In the revised version, we will expand the evaluation section with: (i) explicit metrics on prompt-source diversity and the construction process for Step-Video-T2V-Eval, (ii) inter-rater reliability statistics (e.g., Cohen's or Fleiss' kappa) from the human preference study, and (iii) a dedicated paragraph discussing steps taken to minimize data leakage, including manual curation and deduplication checks against the pre-training corpus. While an exhaustive leakage audit at 30B scale is computationally prohibitive, these additions will provide greater transparency and address concerns about benchmark-specific effects.
  Revision: partial
- Referee: Model architecture and training details: While high-level components (Video-VAE compression ratios, Flow Matching objective, Video-DPO) are described, the manuscript provides no equations, pseudocode, or ablation tables quantifying the contribution of each stage (e.g., the incremental gain from Video-DPO over the base Flow-Matching DiT), which is required to substantiate the architectural choices as load-bearing for the reported quality.
  Authors: We concur that additional technical specificity is warranted. We will revise the manuscript to include: the full equations for the Flow Matching objective and the Video-DPO preference loss; pseudocode outlining the 3D full-attention DiT forward pass, training loop, and Video-VAE encoding/decoding; and a new ablation table that reports incremental gains (e.g., FID, human preference scores) when adding Video-DPO on top of the base Flow-Matching DiT. These changes will make the contribution of each component explicit and reproducible.
  Revision: yes
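The inter-rater reliability statistic named in the first response can be illustrated briefly. Below is a minimal sketch of Cohen's kappa for two raters who label the same items; the function name and the example labels are hypothetical, and a study with more than two raters would need Fleiss' kappa instead.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical preference labels ("A" or "B" wins) from two annotators.
print(cohens_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"]))
```

A value near 1 indicates agreement well beyond chance, and a value near 0 indicates agreement no better than chance.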
Circularity Check
No circularity in derivation or prediction chain
full rationale
The paper is a technical report on training and evaluating a 30B-parameter text-to-video model using Flow Matching, 3D DiT, Video-VAE, bilingual encoders, and Video-DPO. All claims rest on empirical model outputs and comparisons against baselines on the newly introduced Step-Video-T2V-Eval benchmark. No mathematical derivations, first-principles predictions, or equations are presented that reduce to fitted inputs or self-definitions by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in any reasoning chain. The central SOTA result is an empirical observation rather than a self-referential definition, making the paper self-contained against external benchmarks with no circular steps.
Axiom & Free-Parameter Ledger
invented entities (2)
- Video-VAE: no independent evidence
- Video-DPO: no independent evidence
Lean theorems connected to this paper
- Foundation/DimensionForcing.lean · D3_admits_circle_linking · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames."
- Foundation/EightTick.lean · eight_tick_forces_D3 · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "achieving 16x16 spatial and 8x temporal compression ratios"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention
  HASTE delivers up to 1.93x speedup on Wan2.1 video DiTs via head-wise adaptive sparse attention using temporal mask reuse and error-guided per-head calibration while preserving video quality.
- HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
  HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
- Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
  PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
- Efficient Video Diffusion Models: Advancements and Challenges
  A survey that groups efficient video diffusion methods into four paradigms (step distillation, efficient attention, model compression, and cache/trajectory optimization) and outlines open challenges for practical use.
- VideoASMR-Bench: Can AI-Generated ASMR Videos Fool VLMs and Humans?
  VideoASMR-Bench shows state-of-the-art VLMs fail to reliably detect AI-generated ASMR videos from real ones, though humans can still identify the fakes relatively easily.
- GenHSI: Controllable Generation of Human-Scene Interaction Videos
  GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D i...
- Qwen-Image-VAE-2.0 Technical Report
  Qwen-Image-VAE-2.0 achieves state-of-the-art high-compression image reconstruction and superior diffusability for diffusion models, with a new text-rich document benchmark.
- Leveraging Verifier-Based Reinforcement Learning in Image Editing
  Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
- DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
  DreamShot uses video diffusion priors and a role-attention consistency loss to produce coherent, personalized storyboards with better character and scene continuity than text-to-image methods.
- SynthForensics: Benchmarking and Evaluating People-Centric Synthetic Video Deepfakes
  SynthForensics is a people-centric benchmark where face-based detectors lose 13-55 AUC points on modern synthetic videos compared to legacy manipulation sets.
- HunyuanVideo 1.5 Technical Report
  HunyuanVideo 1.5 delivers state-of-the-art open-source text-to-video and image-to-video generation with an 8.3B parameter DiT model featuring SSTA attention, glyph-aware encoding, and progressive training.
- Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility
  A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.
- Listener-Rewarded Thinking in VLMs for Image Preferences
  Listener-augmented GRPO uses an independent frozen VLM to provide dense confidence scores on reasoning traces, yielding 67.4% accuracy on ImageReward, up to +6% OOD gains on 1.2M-vote human data, and fewer reasoning c...
- MAGI-1: Autoregressive Video Generation at Scale
  MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
  VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
- Motif-Video 2B: Technical Report
  Motif-Video 2B achieves 83.76% VBench score, beating a 14B-parameter baseline with 7x fewer parameters and substantially less training data through shared cross-attention and a three-part backbone.
- Qwen-Image-2.0 Technical Report
  Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
- EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
  EchoTorrent combines multi-teacher distillation, adaptive CFG calibration, hybrid long-tail forcing, and VAE decoder refinement to enable few-pass autoregressive streaming video generation with improved temporal consi...
- Evolution of Video Generative Foundations
  This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.