Motif-Video 2B: Technical Report
Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3
The pith
Separating prompt alignment, temporal consistency, and fine-detail recovery into distinct stages lets a 2B video model outperform a 14B baseline on VBench with far less data and compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Motif-Video 2B demonstrates that a 2 billion parameter text-to-video model can reach 83.76 percent on VBench, surpassing the 14 billion parameter Wan2.1 model while using seven times fewer parameters and substantially less training data. It does so through a three-part backbone that separates early fusion, joint representation learning, and detail refinement, shared cross-attention that preserves text control over long token sequences, and a training recipe of dynamic token routing plus early feature alignment to a frozen pretrained video encoder.
What carries the argument
Three-part backbone that separates early fusion, joint representation learning, and detail refinement, together with shared cross-attention to maintain text control over long video sequences.
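To make the separation concrete, below is a minimal PyTorch sketch of how a three-part backbone with a shared cross-attention module might be wired. The class names, block counts, and dimensions are illustrative assumptions rather than the report's actual architecture; the sketch shows one possible reading of "shared cross-attention", in which a single cross-attention module is reused by every block so text conditioning is re-applied uniformly as the video token sequence grows.

```python
import torch
import torch.nn as nn

class SharedCrossAttention(nn.Module):
    """One cross-attention module whose weights are reused by every block,
    so text conditioning is injected at the same strength throughout."""
    def __init__(self, dim, text_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=text_dim, vdim=text_dim,
                                          batch_first=True)

    def forward(self, video_tokens, text_tokens):
        out, _ = self.attn(video_tokens, text_tokens, text_tokens, need_weights=False)
        return video_tokens + out

class Block(nn.Module):
    """Transformer block: self-attention over video tokens, the shared text
    cross-attention, then an MLP (all residual)."""
    def __init__(self, dim, shared_xattn, heads=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.xattn = shared_xattn  # the same instance in every block
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, text):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        x = self.xattn(x, text)
        return x + self.mlp(self.norm2(x))

class ThreePartBackbone(nn.Module):
    """Hypothetical role split: fusion blocks, joint blocks, refinement blocks."""
    def __init__(self, dim=512, text_dim=768, n_fusion=2, n_joint=8, n_refine=2):
        super().__init__()
        shared = SharedCrossAttention(dim, text_dim)
        stack = lambda n: nn.ModuleList(Block(dim, shared) for _ in range(n))
        self.fusion, self.joint, self.refine = stack(n_fusion), stack(n_joint), stack(n_refine)

    def forward(self, video_tokens, text_tokens):
        for stage in (self.fusion, self.joint, self.refine):
            for blk in stage:
                video_tokens = blk(video_tokens, text_tokens)
        return video_tokens

# Example: 4,096 video tokens conditioned on 77 text tokens.
x = ThreePartBackbone()(torch.randn(1, 4096, 512), torch.randn(1, 77, 768))
```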
If this is right
- Later transformer blocks develop clearer cross-frame attention structure than single-stream baselines under the same training conditions.
- Text control remains strong even when video token sequences grow long.
- High-quality video generation becomes achievable with under 10 million training clips and fewer than 100,000 H200 GPU hours.
- Architectural specialization can narrow or reverse the quality gap usually tied to much larger parameter counts.
Where Pith is reading between the lines
- The same separation of conflicting objectives might improve efficiency in related generative tasks such as high-resolution image synthesis or long audio generation.
- If role separation is the key driver, then scaling laws for video models may need revision when architecture is allowed to specialize rather than remain uniform.
- Controlled tests on even smaller parameter budgets could reveal how far the three-part design can be pushed before quality saturates.
Load-bearing premise
The performance gains result mainly from architecturally separating prompt alignment, temporal consistency, and fine-detail recovery rather than from dynamic token routing, early feature alignment, or other details of the training process.
What would settle it
A direct ablation that trains an otherwise identical single-stream model of the same parameter count and training recipe and checks whether the VBench score drops to or below the 14B baseline level.
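A minimal way to express that control experiment, using hypothetical configuration fields rather than the report's actual training scripts:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TrainConfig:
    # Illustrative stand-ins for the report's settings, not its real config keys.
    params_billion: float = 2.0
    backbone: str = "three_part"          # the role-separated design
    dynamic_token_routing: bool = True
    early_feature_alignment: bool = True
    max_clips_millions: int = 10
    max_h200_gpu_hours: int = 100_000

# The decisive ablation: identical budget and recipe, only the backbone differs.
specialized = TrainConfig()
control = replace(specialized, backbone="single_stream")
```

If the single-stream control falls to or below the Wan2.1 14B level on VBench, the role-separation premise is supported; if it roughly matches the specialized model, the gains more likely come from the training recipe.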
read the original abstract
Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video~2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.
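The recipe named in the abstract can be read as two pieces: dynamic token routing, in which only a random subset of video tokens passes through the expensive middle blocks during training, and an early-phase alignment loss that pulls intermediate backbone features toward a frozen pretrained video encoder. A minimal sketch under those assumptions follows; the report's exact routing rule, projection head, and loss form are not given here, so the function names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def route_tokens(x, keep_ratio=0.5):
    """Dynamic token routing (assumed form): keep a random subset of tokens
    for the middle blocks; the rest bypass them and are restored afterwards.
    x: (B, N, D) video tokens."""
    B, N, D = x.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :n_keep]  # (B, n_keep)
    kept = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, idx

def unroute_tokens(x, processed, idx):
    """Scatter the processed subset back into its original positions."""
    D = x.shape[-1]
    return x.scatter(1, idx.unsqueeze(-1).expand(-1, -1, D), processed)

def early_alignment_loss(backbone_feats, frozen_encoder_feats, proj):
    """Early-phase feature alignment (assumed form): project intermediate
    backbone features and maximize cosine similarity with features from a
    frozen pretrained video encoder; down-weighted or dropped later on."""
    pred = F.normalize(proj(backbone_feats), dim=-1)
    target = F.normalize(frozen_encoder_feats.detach(), dim=-1)
    return 1.0 - (pred * target).sum(dim=-1).mean()
```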
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Motif-Video 2B, a 2B-parameter text-to-video model trained on <10M clips. It claims that separating prompt alignment, temporal consistency, and fine-detail recovery into a three-part backbone with Shared Cross-Attention, paired with dynamic token routing and early-phase alignment to a frozen video encoder, enables an 83.76% VBench score. This surpasses Wan2.1 14B while using 7× fewer parameters and far less compute (<100k H200 GPU hours). Later blocks are said to show clearer cross-frame attention structure than single-stream baselines.
Significance. If the performance holds and the architectural separation is shown to be causal, the result would be significant: it would demonstrate that targeted organization of capacity plus an efficiency recipe can close much of the gap to much larger models, shifting emphasis away from raw scale in video generation. The efficiency claims (low data, low compute) and the attention-structure observation are potentially valuable if quantified.
major comments (3)
- [Abstract] The headline claim of 83.76% VBench (surpassing Wan2.1 14B) is stated without any ablation results, statistical details, evaluation protocol, baseline implementation notes, or variance estimates. This leaves the central performance claim unsupported by visible evidence.
- [Abstract] No ablation holds the training recipe (dynamic token routing + early feature alignment) fixed while replacing the three-part backbone with a standard single-stream transformer of equal capacity. Without this comparison, it remains unclear whether the reported gains require the claimed role separation or arise from the efficiency components alone.
- [Abstract] The statement that later blocks develop 'clearer cross-frame attention structure than standard single-stream baselines' is asserted but not supported by any quantitative metric, figure, or matched-baseline comparison.
minor comments (1)
- [Abstract] The abstract contains a typesetting artifact ('Motif-Video~2B'); this should be corrected for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will make revisions to better support the claims presented in the abstract.
read point-by-point responses
- Referee: [Abstract] The headline claim of 83.76% VBench (surpassing Wan2.1 14B) is stated without any ablation results, statistical details, evaluation protocol, baseline implementation notes, or variance estimates. This leaves the central performance claim unsupported by visible evidence.
  Authors: We agree that the abstract would benefit from additional context to support the headline result. The full manuscript details the VBench evaluation protocol, baseline implementations, and ablation studies in the Experiments section. We will revise the abstract to include a brief reference to the evaluation protocol and direct readers to the relevant sections for ablations and comparisons. Variance estimates are not reported, consistent with standard practice for large-scale training runs due to compute constraints; we will add an explicit note clarifying this. revision: yes
- Referee: [Abstract] No ablation holds the training recipe (dynamic token routing + early feature alignment) fixed while replacing the three-part backbone with a standard single-stream transformer of equal capacity. Without this comparison, it remains unclear whether the reported gains require the claimed role separation or arise from the efficiency components alone.
  Authors: This is a valid criticism. The manuscript presents comparisons to single-stream models and ablations on the dynamic routing and alignment components, but does not include the exact control experiment that holds the training recipe fixed while swapping only the backbone architecture. Such an ablation would require substantial additional compute. In the revision we will expand the discussion of the role-separation motivation, drawing on preliminary observations of task interference, and explicitly note this as a limitation. revision: partial
- Referee: [Abstract] The statement that later blocks develop 'clearer cross-frame attention structure than standard single-stream baselines' is asserted but not supported by any quantitative metric, figure, or matched-baseline comparison.
  Authors: We acknowledge that the current claim is supported only by qualitative inspection. In the revised manuscript we will add quantitative metrics (such as frame-wise attention concentration scores) together with matched visualizations comparing later blocks of Motif-Video 2B against single-stream baselines, and include these in the analysis section. revision: yes
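One plausible instantiation of such a metric is the share of attention mass that each query token places on tokens from other frames; the sketch below assumes that form, since the report's actual metric is not specified in the abstract.

```python
import torch

def cross_frame_attention_mass(attn, frame_of_token):
    """Hypothetical concentration score: the average fraction of attention
    mass a query token sends to tokens from *other* frames.

    attn:           (heads, Q, K) softmaxed attention weights for one sample,
                    so each row sums to 1.
    frame_of_token: (K,) frame index of each token; queries and keys are
                    assumed to share the same token layout (Q == K).
    Returns a value in [0, 1]; higher means more cross-frame attention.
    """
    same_frame = frame_of_token[:, None] == frame_of_token[None, :]  # (Q, K)
    other_mass = attn.masked_fill(same_frame, 0.0).sum(dim=-1)       # (heads, Q)
    return other_mass.mean().item()
```

Comparing this score per block between Motif-Video 2B and a matched single-stream baseline, under the same prompts and noise levels, would turn the qualitative attention claim into a measurable one.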
Circularity Check
No circularity: empirical benchmark results with no derivations or self-referential predictions.
full rationale
The paper is a technical report on an empirical video generation model. Performance claims (e.g., 83.76% on VBench) are presented as direct benchmark outcomes from training and evaluation, not as quantities derived from equations, fitted parameters, or self-citations. No mathematical derivations, predictions, or first-principles results are described that could reduce to inputs by construction. Architectural claims about role separation and attention structure are supported by comparisons to baselines rather than self-definitional logic. The claims are checked against external benchmarks rather than being self-referential.