Motif-Video 2B: Technical Report
Pith reviewed 2026-05-21 00:08 UTC · model grok-4.3
The pith
Separating prompt alignment, temporal consistency, and detail recovery into distinct pathways lets a 2B video model surpass 14B-parameter rivals on VBench.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Motif-Video 2B reaches 83.76 percent on VBench by using shared cross-attention to improve text control over long video token sequences and a three-part backbone that separates early fusion, joint representation learning, and detail refinement. Dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder keep training efficient. The resulting 2B-parameter model exceeds the score of the 14B-parameter Wan2.1 while using seven times fewer parameters and substantially less training data.
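The summary above does not say how dynamic token routing operates. As a rough illustration only, the sketch below assumes a scheme in which a random subset of video tokens bypasses a span of blocks during training and is merged back afterwards, so those blocks process fewer tokens. The names `route_tokens`, `apply_routed_span`, and `toy_block`, and the keep ratio of 0.5, are hypothetical and not taken from the paper.

```python
import random


def route_tokens(tokens, keep_ratio, rng):
    """Pick which token positions enter the routed span and which bypass it."""
    n = len(tokens)
    n_keep = max(1, round(n * keep_ratio))
    kept = sorted(rng.sample(range(n), n_keep))
    kept_set = set(kept)
    bypassed = [i for i in range(n) if i not in kept_set]
    return kept, bypassed


def apply_routed_span(tokens, blocks, keep_ratio, rng):
    """Run only the kept tokens through `blocks`, then merge with bypassed ones."""
    kept, bypassed = route_tokens(tokens, keep_ratio, rng)
    active = [tokens[i] for i in kept]
    for block in blocks:
        active = block(active)
    out = list(tokens)  # bypassed positions keep their input values
    for idx, value in zip(kept, active):
        out[idx] = value
    return out, kept, bypassed


def toy_block(xs):
    """Stand-in for a transformer block (assumption): adds 1 to every token."""
    return [x + 1.0 for x in xs]


rng = random.Random(0)
tokens = [0.0] * 8
out, kept, bypassed = apply_routed_span(
    tokens, [toy_block, toy_block], keep_ratio=0.5, rng=rng
)
print("kept:", kept, "bypassed:", bypassed)
```

With a keep ratio of 0.5 the two toy blocks see half of the tokens, which is the source of the compute saving in this kind of scheme; whether the paper routes tokens this way is not confirmed by the text reviewed here.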
What carries the argument
Shared cross-attention paired with a three-part backbone that divides processing into early fusion, joint representation learning, and detail refinement.
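Neither component is defined in detail in the text reviewed here, so the following is one possible reading rather than the authors' design. It assumes that "shared" means a single cross-attention module reused by every block, and that the three parts are consecutive groups of blocks. The stage depths, the identity projections, and all class names are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class CrossAttention:
    """Single-head cross-attention: video tokens query text tokens.

    Learned projections are omitted (identity) to keep the sketch short.
    """

    def __init__(self, dim):
        self.scale = 1.0 / math.sqrt(dim)

    def __call__(self, video, text):
        out = []
        for q in video:
            weights = softmax([dot(q, k) * self.scale for k in text])
            ctx = [sum(w * v[d] for w, v in zip(weights, text)) for d in range(len(q))]
            out.append([qi + ci for qi, ci in zip(q, ctx)])  # residual connection
        return out


class Block:
    def __init__(self, stage, cross_attn):
        self.stage = stage
        self.cross_attn = cross_attn  # the same object in every block

    def __call__(self, video, text):
        return self.cross_attn(video, text)


def build_backbone(dim, depths):
    shared = CrossAttention(dim)  # created once, reused by all blocks
    return [Block(stage, shared) for stage, n in depths.items() for _ in range(n)]


# Hypothetical stage depths; the paper's real values are not given here.
depths = {"early_fusion": 2, "joint_representation": 4, "detail_refinement": 2}
backbone = build_backbone(dim=4, depths=depths)

rng = random.Random(0)
video = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
text = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]

hidden = video
for block in backbone:
    hidden = block(hidden, text)
print(len(backbone), "blocks,", len(hidden), "video tokens out")
```

The point of the sketch is only the wiring: every block, regardless of stage, attends to the text through the same module, so text conditioning adds no per-block parameters.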
If this is right
- Later blocks exhibit clearer cross-frame attention patterns than those in standard single-stream video models.
- Competitive text-to-video quality is reachable with fewer than 10 million training clips and under 100,000 H200 GPU hours.
- Architectural specialization can narrow or close the quality gap that usually requires much larger parameter counts.
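To put the budget figures in perspective, the arithmetic below converts the 100,000 H200 GPU-hour upper bound into wall-clock time and checks the parameter ratio. The cluster sizes are hypothetical, and the conversion assumes perfect utilization with no restarts or idle time.

```python
GPU_HOURS_BUDGET = 100_000  # upper bound stated in the report


def wall_clock_days(gpu_hours, num_gpus):
    """Idealized wall-clock days if all GPUs run in parallel at full utilization."""
    return gpu_hours / num_gpus / 24.0


# 14B (Wan2.1) versus 2B (Motif-Video) parameters.
param_ratio = 14e9 / 2e9
print("parameter ratio:", param_ratio)  # 7.0

for num_gpus in (64, 256, 1024):  # hypothetical cluster sizes
    days = wall_clock_days(GPU_HOURS_BUDGET, num_gpus)
    print(num_gpus, "GPUs ->", round(days, 1), "days")
# 64 GPUs -> 65.1 days, 256 GPUs -> 16.3 days, 1024 GPUs -> 4.1 days
```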
Where Pith is reading between the lines
- The same role-separation idea could be tested in other generative domains such as high-resolution image synthesis or audio generation to reduce task interference.
- Lower training budgets might allow repeated experimentation and faster iteration cycles for teams without access to large GPU clusters.
- Combining the three-part design with task-specific losses or additional frozen encoders could yield further efficiency gains on particular video styles.
Load-bearing premise
The reported performance under the given data and compute limits is produced by separating prompt alignment, temporal consistency, and fine-detail recovery into distinct pathways (through shared cross-attention and the three-part backbone), together with the dynamic routing and alignment recipe.
What would settle it
Train a 2B-parameter single-stream baseline without shared cross-attention or the three-part backbone on the same clips and compute budget, then check whether its VBench score remains below 83.76 percent.
Original abstract
Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Motif-Video 2B, a 2B-parameter text-to-video model that reaches 83.76% on VBench, outperforming Wan2.1 14B while using 7× fewer parameters and substantially less training data (<10M clips, <100k H200 GPU hours). The central claim is that separating prompt alignment, temporal consistency, and fine-detail recovery via a three-part backbone with shared cross-attention, combined with dynamic token routing and early-phase feature alignment to a frozen encoder, enables this efficiency; later blocks exhibit clearer cross-frame attention than single-stream baselines.
Significance. If the attribution to architectural specialization holds under controlled conditions, the result would indicate that targeted capacity organization can close the quality gap with much larger models under tight data and compute budgets, offering a practical path toward more accessible video generation. The reported attention-structure analysis supplies a modest mechanistic observation that could be developed further.
major comments (1)
- [Abstract] Abstract and Results section: The headline claim that the three-part backbone and shared cross-attention are responsible for competitive performance under the stated budget is not supported by any ablation that holds dynamic token routing and early-phase frozen-encoder alignment fixed while reverting to a single-stream backbone. Without this control, the data-efficiency result cannot be attributed to the architectural separation rather than the training recipe alone.
minor comments (2)
- [Abstract] The abstract states the VBench score but supplies no information on evaluation protocol, baseline details, number of samples, statistical significance, or error bars, making it impossible to judge whether the 83.76% figure reliably supports the central claim.
- Notation for the three-part backbone (early fusion, joint representation, detail refinement) and the dynamic routing mechanism should be defined explicitly with equations or pseudocode in the methods section for reproducibility.
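The error bars requested in the first minor comment could be produced along the following lines. This is a minimal sketch of a percentile bootstrap, assuming per-sample scores in [0, 1] are available and that the headline number is their mean; VBench aggregates several dimension scores, so a real analysis would need to resample within each dimension. The scores below are synthetic placeholders, not results from the paper.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lo, hi


# Synthetic placeholder scores (assumption), clipped to [0, 1].
rng = random.Random(1)
scores = [min(1.0, max(0.0, rng.gauss(0.84, 0.08))) for _ in range(200)]

mean, lo, hi = bootstrap_ci(scores)
print(f"mean={mean:.4f}  95% CI=[{lo:.4f}, {hi:.4f}]")
```

Reporting the interval alongside the point estimate would show whether the gap to a baseline is larger than the sampling noise of the evaluation set.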
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment below and outline the revisions we will make to strengthen the attribution of results to the proposed architecture.
- Referee: [Abstract] Abstract and Results section: The headline claim that the three-part backbone and shared cross-attention are responsible for competitive performance under the stated budget is not supported by any ablation that holds dynamic token routing and early-phase frozen-encoder alignment fixed while reverting to a single-stream backbone. Without this control, the data-efficiency result cannot be attributed to the architectural separation rather than the training recipe alone.
  Authors: We agree that the current evidence does not fully isolate the contribution of the three-part backbone and shared cross-attention from the training recipe components. Our manuscript reports comparisons to single-stream baselines that exhibit weaker cross-frame attention in later blocks, but these baselines were not trained under an identical recipe that fixes dynamic token routing and early-phase alignment to the frozen encoder. To address this directly, we will add a controlled ablation in the revised manuscript: a single-stream backbone trained with the same dynamic token routing and early-phase feature alignment, using the same data and compute budget. This will allow clearer attribution of the efficiency gains to the architectural separation of prompt alignment, temporal consistency, and detail recovery.
  Revision: yes
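The controlled ablation described in this exchange can be written down as a small run grid in which only the architecture varies while the training recipe and budget stay fixed. The sketch below is illustrative: the `RunConfig` fields and stage names are hypothetical, and the caps simply reuse the upper bounds quoted in the report.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    backbone: str                 # "three_part" or "single_stream"
    shared_cross_attention: bool
    token_routing: bool           # held fixed across the grid
    early_alignment: bool         # held fixed across the grid
    max_clips: int
    max_gpu_hours: int


def ablation_grid():
    """Vary only the architectural factors; keep recipe and budget identical."""
    runs = []
    for backbone, shared in product(("three_part", "single_stream"), (True, False)):
        runs.append(
            RunConfig(
                backbone=backbone,
                shared_cross_attention=shared,
                token_routing=True,
                early_alignment=True,
                max_clips=10_000_000,   # report: fewer than 10M clips
                max_gpu_hours=100_000,  # report: under 100,000 H200 GPU hours
            )
        )
    return runs


runs = ablation_grid()
for run in runs:
    print(run)
```

The single-stream run without shared cross-attention is the control the referee asks for; the two mixed cells would additionally separate the effect of the backbone from that of the cross-attention.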
Circularity Check
No derivation chain present; empirical report only
The paper is a technical report on an empirical video generation model. It reports a VBench score of 83.76% for Motif-Video 2B and attributes results to architectural choices (shared cross-attention, three-part backbone) paired with a training recipe (dynamic token routing, early-phase alignment to frozen encoder). No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citation chains appear in the abstract or described claims. The central performance claim is externally benchmarked and does not reduce to any input by construction, satisfying the criteria for a self-contained empirical result.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "three-part backbone separates early fusion, joint representation learning, and detail refinement"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat as forced Peano structure (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "separating these roles architecturally, rather than relying on scale alone"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.