arxiv: 2604.16503 · v2 · pith:J2YFXZY3new · submitted 2026-04-14 · 💻 cs.CV · cs.AI

Motif-Video 2B: Technical Report

Pith reviewed 2026-05-21 00:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: text-to-video, video generation, model efficiency, cross-attention, architectural design, parameter reduction, temporal consistency

The pith

Separating prompt alignment, temporal consistency, and detail recovery into distinct pathways lets a 2B video model surpass 14B-parameter rivals on VBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that strong text-to-video performance is achievable with far fewer parameters and less training data than current large models demand. It argues that interference among prompt alignment, temporal consistency, and fine-detail recovery is reduced when these roles are given separate architectural pathways instead of being forced through one shared stream. The authors combine shared cross-attention, for stronger text conditioning on long sequences, with a three-part backbone that handles early fusion, joint learning, and refinement in turn. An efficiency-focused training recipe using dynamic token routing and early alignment to a frozen encoder makes the design work within a tight budget: fewer than 10 million clips and less than 100,000 H200 GPU hours. If the claim holds, smaller-scale video generation becomes practical without waiting for ever-larger compute clusters.

Core claim

Motif-Video 2B reaches 83.76 percent on VBench by using shared cross-attention to improve text control over long video token sequences and a three-part backbone that separates early fusion, joint representation learning, and detail refinement. Dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder keep training efficient. The resulting 2B-parameter model exceeds the score of the 14B-parameter Wan2.1 while using seven times fewer parameters and substantially less training data.
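To make the cross-attention idea concrete, here is a minimal single-head sketch of a residual cross-attention branch in which video tokens query text tokens. It is not the authors' implementation: the tiny dimensions, the helper names, and the identity projections are all assumptions for illustration, and whatever is "shared" in Shared Cross-Attention is not modeled. The one behavior it demonstrates is stated in the Figure 5 caption: with the output projection set to zero, the block leaves the base model's forward pass unchanged.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cross_attention_residual(video, text, W_q, W_k, W_v, W_o):
    """Return video + W_o * Attn(Q=video, K=text, V=text), single head."""
    d = len(W_q)  # query/key dimension
    keys = [matvec(W_k, t) for t in text]
    values = [matvec(W_v, t) for t in text]
    out = []
    for h in video:
        q = matvec(W_q, h)
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(logits)
        ctx = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        delta = matvec(W_o, ctx)  # residual contribution of the text branch
        out.append([hi + di for hi, di in zip(h, delta)])
    return out

# Hypothetical toy inputs: two video tokens, three text tokens, width 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
video = [[0.5, -1.0], [2.0, 0.25]]
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Zero-initialized output projection: the branch contributes nothing.
unchanged = cross_attention_residual(video, text, identity, identity, identity, zeros)
# Non-zero output projection: text information is added to each video token.
active = cross_attention_residual(video, text, identity, identity, identity, identity)
```

Starting from a zero output projection lets such a branch be inserted into a pretrained checkpoint without disturbing it. The Figure 5 caption adds that this alone is not sufficient if the keys and values are poorly grounded.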

What carries the argument

Shared cross-attention paired with a three-part backbone that divides processing into early fusion, joint representation learning, and detail refinement.

If this is right

  • Later blocks exhibit clearer cross-frame attention patterns than those in standard single-stream video models.
  • Competitive text-to-video quality is reachable with fewer than 10 million training clips and under 100,000 H200 GPU hours.
  • Architectural specialization can narrow or close the quality gap that usually requires much larger parameter counts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same role-separation idea could be tested in other generative domains such as high-resolution image synthesis or audio generation to reduce task interference.
  • Lower training budgets might allow repeated experimentation and faster iteration cycles for teams without access to large GPU clusters.
  • Combining the three-part design with task-specific losses or additional frozen encoders could yield further efficiency gains on particular video styles.

Load-bearing premise

That the separation of prompt alignment, temporal consistency, and fine-detail recovery into distinct pathways through shared cross-attention and the three-part backbone, together with the dynamic routing and alignment recipe, is what produces the reported performance under the given data and compute limits.

What would settle it

Train a 2B-parameter single-stream baseline without shared cross-attention or the three-part backbone on the same clips and compute budget, then check whether its VBench score remains below 83.76 percent.

Figures

Figures reproduced from arXiv: 2604.16503 by Beomgyu Kim, Bokki Ryu, Changjin Kang, Dahye Choi, Dongjoo Weon, Dongpin Oh, Dongseok Kim, Eunhwan Park, Haesol Lee, Hanbin Jung, Hongjoo Lee, Hyeyeon Cho, Hyukjin Kweon, Jaeheui Her, Jaeyeon Huh, Jangwoong Kim, Jeesoo Lee, Jeongdoo Lee, Junghwan Lim, Junhyeok Lee, Minjae Kim, Minsu Ha, Sungmin Lee, Taehyun Kim, Taewhan Kim, Wai Ting Cheung, Yeongjae Park, Youngrok Kim.

Figure 1: Representative generations from Motif-Video 2B. Frames are captured from videos generated by our 2B-parameter text-to-video model across a diverse set of prompts, illustrating the combination of prompt fidelity, temporal coherence, and visual detail that we target throughout this work. The banner is intended as a qualitative teaser; later sections analyze the architectural and training choices that make th…
Figure 2: Overview of Motif-Video 2B. Text is encoded by T5Gemma2, while video frames are compressed by the Wan2.1 VAE into spatiotemporal latents and patchified into tokens. The transformer backbone follows a three-stage design that separates early modality fusion, joint text-video representation learning, and final detail reconstruction: 12 dual-stream layers preserve modality-specific processing during early fus…
Figure 3: Attention structure in dual-stream vs. single-stream vs. DDT decoder layers. Compared with dual- and single-stream layers, DDT decoder layers show stronger inter-frame attention structure, where each frame attends more to temporally adjacent frames. The blue box denotes the encoder hidden state: text tokens in the dual-stream and single-stream cases, and the video output tokens from the encoder layers in th…
Figure 4: Intermediate-layer text-attention drop in single-stream blocks. We compare attention maps from a representative intermediate layer in dual-stream and single-stream stages. Relative to dual-stream, the single-stream intermediate layer allocates substantially less attention mass to text tokens, indicating weaker text conditioning under joint-token competition.
… attending to text token j: α_ij = exp(q_i^⊤ k_j / …
Figure 5: Zero-init alone does not save a cross-attention whose K, V geometry is ungrounded. Both variants are inserted into the same pretrained 360p checkpoint with W_O^cross = 0, making both forward passes identical to the base model at step 0. After 1,000 steps of continued training under matched optimizer settings, data, and learning rate, the SkyReels-V4–style cross-attention (top, raw x_t as K, V) collapses: out…
Figure 6: Dense features from V-JEPA 2.0. The visualization highlights that, while V-JEPA 2.0 captures global motion structure well, its dense features are less spatially coherent than would be ideal for dense REPA supervision in video generation.
In practice, we align hidden states from a single intermediate encoder layer (layer 8) to the frozen teacher features. Following iREPA [34], we use a convolutional projec…
Figure 7: Overview of the training-data construction pipeline. The raw pool is split into Image Real, Image Synthetic, Video Real, and Video Synthetic branches. An initial sanitation stage removes broken files, abnormally small files, near-duplicates (SSCD-based), NSFW content, and watermarked content. Surviving clips are progressively filtered by resolution, clip length, motion, and aesthetic signals as they advanc…
Figure 8: Subject composition of the cross-attention fine-tuning corpus. The corpus was assembled iteratively by curating additional clips from underperforming categories. Left: image distribution. Right: video distribution.
… reinterpret them. Specifically, watermark, nsfw, and padded flags trigger hard removal; multi scene clips are dropped as a secondary check on scene segmentation; quality=low is excluded from 480…
Figure 9: Overview of our offline bucket-balanced sampler for WebDataset-formatted video corpora on …
Figure 10: Shared Cross-Attention contribution across single-stream encoder blocks and denoising steps (1280 × 736, 121 frames, 50 steps, σ ∈ [1.00, 0.29]). Left: Frobenius norm of the cross-attention output W_O^cross Attn(Q, K, V) per block (row) and step (column). Right: ratio of the cross-attention residual norm to the self-attention output norm ∥h_v∥. No block falls below 5.2%; the global mean is 7.6% and the maxi…
Figure 11: Selected single-frame samples from Motif-Video 2B across a range of subjects and visual styles. Each tile is a frame drawn from an independently generated text-to-video clip. The grid is intended to convey the breadth of domains the model handles, including photographic scenes, stylized and fantastical content, close-up subjects, and wide landscapes, rather than to claim uniform quality across all prompts…
Figure 12: Image-to-video generation results. The leftmost panel is the input image, and the model preserves its original appearance while generating temporally coherent video content from it.
Figure 13: Example of generated results from the arena. Prompt: “A guitarist sits on a fire escape playing at twilight, fingers moving in relaxed patterns along the neck of a scratched acoustic guitar. Shot on a 40mm lens with a slow crane-up from the street below, the brick wall beside him glows deep orange as the last sun hits it and the sky above shifts toward indigo. He wears a loose denim shirt rolled to the el…
Figure 14: Micro-scale semantic distortion. Three characteristic failures at the sub-object level: distorted hand anatomy on a close-up instrument subject (left), broken body structure under a high-motion skydiving prompt (middle), and attribute leakage between co-present animals in a multi-subject scene (right). The generations may remain category-correct (guitar, skydiver, cat and dog), leading VBench’s semantic d…
Figure 15: Temporal failure modes. Top: physically implausible liquid dynamics in a wine-splash prompt: the motion is locally smooth but violates gravity and surface tension. Middle: loss of temporal coherence under high scene complexity in a cavalry-charge prompt, where subject identities blur across frames and multi-agent spatial relationships fail to persist. Bottom: unintended mid-clip scene transition, where th…
Figure 16: Additional qualitative human-centered generations. Representative frames from videos involving human subjects, included as supplementary qualitative results.
A Additional results. This section presents additional qualitative results for both text-to-video and image-to-video generation in Figures 16 and 17.
B Sampling Configuration. We describe the sampling configuration used to produce the VBench scores rep…
Figure 17: Additional image-to-video results. The leftmost panel is the input image, and the remaining panels show representative generated video frames.
Negative prompt. Following Wan [36], we apply a fixed negative prompt at every sampling call. The full string used is: The video has text and graphic overlays burned into the frame, including watermarks, logos, subtitles, timestamps, broadcast graphics, UI elements…
Figure 18: Qualitative effect of Shared Cross-Attention. For each prompt, the top row shows generation with Shared Cross-Attention enabled; the bottom row shows the same prompt and seed with cross-attention disabled on all 16 single-stream encoder blocks (360p, 50 steps, 121 frames).
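The "joint-token competition" named in the Figure 4 caption can be illustrated with a small calculation. The sketch below is not the paper's measurement. It assumes a single query whose softmax runs over text and video keys together, with text keys placed first and all logits equal, so the text share of attention is simply the text share of the sequence. The function name and token counts are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def text_attention_mass(logits, n_text):
    """Fraction of one query's attention that lands on the first n_text keys."""
    weights = softmax(logits)
    return sum(weights[:n_text])

# Assumed toy setting: a fixed number of text tokens, growing video length.
n_text = 8
masses = {}
for n_video in (8, 64, 512):
    logits = [0.0] * (n_text + n_video)  # equal logits: no preference for either modality
    masses[n_video] = text_attention_mass(logits, n_text)
    print(n_video, round(masses[n_video], 4))
```

With equal logits the mass on text is n_text / (n_text + n_video), so it shrinks as the video sequence lengthens. Real attention logits are not uniform, which is why the figure measures the effect rather than deriving it.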
Original abstract

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.
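The abstract names dynamic token routing without describing it. As a rough illustration of the general idea behind token routing (a subset of tokens skips a span of blocks during training and is reinserted afterwards), here is a minimal sketch. The random selection rule, the fixed `keep_ratio`, and the toy block are assumptions, not the paper's recipe, and how the routing is made "dynamic" is not covered.

```python
import random

def route_tokens(tokens, blocks, keep_ratio, rng):
    """Run `blocks` on a random subset of tokens; the rest bypass them unchanged.

    Returns the reassembled sequence and the sorted indices that were processed.
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    kept = [tokens[i] for i in kept_idx]
    for block in blocks:
        kept = block(kept)  # compute scales with the kept subset only
    out = list(tokens)      # bypassed tokens keep their input values
    for i, tok in zip(kept_idx, kept):
        out[i] = tok        # processed tokens return to their original positions
    return out, kept_idx

def toy_block(tokens):
    """Hypothetical stand-in for a transformer block: adds 1.0 to each token."""
    return [t + 1.0 for t in tokens]

rng = random.Random(0)
tokens = [float(i) for i in range(10)]
routed, kept_idx = route_tokens(tokens, [toy_block, toy_block], keep_ratio=0.5, rng=rng)
```

In this toy setting the two blocks see half the sequence, which is where the training-time saving would come from. Token order is preserved so that later blocks receive a full-length sequence.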

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents Motif-Video 2B, a 2B-parameter text-to-video model that reaches 83.76% on VBench, outperforming Wan2.1 14B while using 7× fewer parameters and substantially less training data (<10M clips, <100k H200 GPU hours). The central claim is that separating prompt alignment, temporal consistency, and fine-detail recovery via a three-part backbone with shared cross-attention, combined with dynamic token routing and early-phase feature alignment to a frozen encoder, enables this efficiency; later blocks exhibit clearer cross-frame attention than single-stream baselines.

Significance. If the attribution to architectural specialization holds under controlled conditions, the result would indicate that targeted capacity organization can close the quality gap with much larger models under tight data and compute budgets, offering a practical path toward more accessible video generation. The reported attention-structure analysis supplies a modest mechanistic observation that could be developed further.

major comments (1)
  1. [Abstract] Abstract and Results section: The headline claim that the three-part backbone and shared cross-attention are responsible for competitive performance under the stated budget is not supported by any ablation that holds dynamic token routing and early-phase frozen-encoder alignment fixed while reverting to a single-stream backbone. Without this control, the data-efficiency result cannot be attributed to the architectural separation rather than the training recipe alone.
minor comments (2)
  1. [Abstract] The abstract states the VBench score but supplies no information on evaluation protocol, baseline details, number of samples, statistical significance, or error bars, making it impossible to judge whether the 83.76% figure reliably supports the central claim.
  2. Notation for the three-part backbone (early fusion, joint representation, detail refinement) and the dynamic routing mechanism should be defined explicitly with equations or pseudocode in the methods section for reproducibility.
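Minor comment 1 asks for error bars on the benchmark figure. One common way to attach an interval to a mean score is a percentile bootstrap over per-sample scores, sketched below. The scores are made-up placeholders, not results from the paper, and nothing here reflects how VBench aggregates its dimensions. The function name and defaults are likewise assumptions.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Placeholder per-prompt scores for illustration only (NOT from the paper).
placeholder_scores = [0.81, 0.86, 0.79, 0.88, 0.84, 0.83, 0.85, 0.80]
mean_score = statistics.fmean(placeholder_scores)
lo, hi = bootstrap_ci(placeholder_scores)
print(f"mean={mean_score:.4f}, 95% CI=({lo:.4f}, {hi:.4f})")
```

Reporting such an interval alongside the headline number would show whether a gap between two models is larger than the resampling noise.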

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and outline the revisions we will make to strengthen the attribution of results to the proposed architecture.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Results section: The headline claim that the three-part backbone and shared cross-attention are responsible for competitive performance under the stated budget is not supported by any ablation that holds dynamic token routing and early-phase frozen-encoder alignment fixed while reverting to a single-stream backbone. Without this control, the data-efficiency result cannot be attributed to the architectural separation rather than the training recipe alone.

    Authors: We agree that the current evidence does not fully isolate the contribution of the three-part backbone and shared cross-attention from the training recipe components. Our manuscript reports comparisons to single-stream baselines that exhibit weaker cross-frame attention in later blocks, but these baselines were not trained under an identical recipe that fixes dynamic token routing and early-phase alignment to the frozen encoder. To address this directly, we will add a controlled ablation in the revised manuscript: a single-stream backbone trained with the same dynamic token routing and early-phase feature alignment, using the same data and compute budget. This will allow clearer attribution of the efficiency gains to the architectural separation of prompt alignment, temporal consistency, and detail recovery.

    Revision: yes
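The controlled ablation proposed above can be written down as a pair of run configurations that differ only in architecture. This is a sketch under stated assumptions: the field names are hypothetical, and the control run drops shared cross-attention as well as the three-part backbone, following the "What would settle it" paragraph earlier on this page.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RunConfig:
    """Hypothetical description of one training run in the ablation."""
    backbone: str                 # "three_part" or "single_stream"
    shared_cross_attention: bool
    token_routing: bool           # dynamic token routing, part of the training recipe
    encoder_alignment: bool       # early-phase alignment to the frozen encoder

def differing_fields(a, b):
    """Names of the fields on which two run configs disagree."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])

proposed = RunConfig("three_part", True, True, True)
control = RunConfig("single_stream", False, True, True)

changed = differing_fields(proposed, control)
# The training recipe must be identical across the two runs.
assert {"token_routing", "encoder_alignment"}.isdisjoint(changed)
print(changed)
```

Checking the diff before launching makes the comparison auditable: any benchmark gap between the two runs can then be attributed to the architectural fields alone, provided data and compute are also matched.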

Circularity Check

0 steps flagged

No derivation chain present; empirical report only

full rationale

The paper is a technical report on an empirical video generation model. It reports a VBench score of 83.76% for Motif-Video 2B and attributes results to architectural choices (shared cross-attention, three-part backbone) paired with a training recipe (dynamic token routing, early-phase alignment to frozen encoder). No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citation chains appear in the abstract or described claims. The central performance claim is externally benchmarked and does not reduce to any input by construction, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; it introduces no explicit free parameters, axioms, or invented entities beyond standard deep-learning components. All concrete details on architecture, losses, and data handling are absent.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 13 internal anchors

  1. [1]

    V-jepa: Latent video prediction for visual representation learning

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual representation learning. 2023

  2. [2]

    Speedrunning imagenet diffusion

    Swayam Bhanded. Speedrunning imagenet diffusion. arXiv preprint arXiv:2512.12386, 2025

  3. [3]

    Skyreels-v4: Multi-modal video-audio generation, inpainting and editing model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Youqiang Zhang, Zhengcong Fei, Debang Li, Sheng Chen, Chaofeng Ao, Nuo Pang, Yiming Wang, et al. Skyreels-v4: Multi-modal video-audio generation, inpainting and editing model. arXiv preprint arXiv:2602.21818, 2026

  4. [4]

    Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations

  5. [5]

    Sana-video: Efficient video generation with block linear diffusion transformer

    Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, et al. Sana-video: Efficient video generation with block linear diffusion transformer. arXiv preprint arXiv:2509.24695, 2025

  6. [6]

    Enhancing motion dynamics of image-to-video models via adaptive low-pass guidance

    Sanghyeok Choi, Yuchang Song, Taegyun Jeong, Taesung Kwon, and Kihyuk Sohn. Enhancing motion dynamics of image-to-video models via adaptive low-pass guidance. arXiv preprint arXiv:2506.08456, 2025

  7. [7]

    Paddleocr-vl: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model

    Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, et al. Paddleocr-vl: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model. arXiv preprint arXiv:2510.14528, 2025

  8. [8]

    aesthetic-predictor-v2-5

    discus0434. aesthetic-predictor-v2-5. https://github.com/discus0434/aesthetic-predictor-v2-5, 2024

  9. [9]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  10. [10]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, pages 12606–12633. PMLR, 2024

  11. [11]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025

  12. [12]

    Accelerate: Training and inference at scale made simple, efficient and adaptable

    Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022

  13. [13]

    Ltx-2: Efficient joint audio-visual foundation model

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, et al. Ltx-2:...

  14. [14]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024

  15. [15]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognit...

  16. [16]

    Nemo-curator: a toolkit for data curation

    Joseph Jennings, Mostofa Patwary, et al. Nemo-curator: a toolkit for data curation, 2024. URL https://github.com/NVIDIA-NeMo/Curator

  17. [17]

    Optimization by simulated annealing

    S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983

  18. [18]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  19. [19]

    Tread: Token routing for efficient architecture-agnostic diffusion training

    Felix Krause, Timy Phan, Ming Gui, Stefan Andreas Baumann, Vincent Tao Hu, and Björn Ommer. Tread: Token routing for efficient architecture-agnostic diffusion training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15703–15713, 2025

  20. [20]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  21. [21]

    Scaling laws for diffusion transformers

    Zhengyang Liang, Hao He, Ceyuan Yang, and Bo Dai. Scaling laws for diffusion transformers. arXiv preprint arXiv:2410.08184, 2024

  22. [22]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations

  23. [23]

    Enhance-a-video: Better generated video for free

    Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, and Yang You. Enhance-a-video: Better generated video for free. arXiv preprint arXiv:2502.07508, 2025

  24. [24]

    Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248, 2025

  25. [25]

    V-JEPA 2.1: Unlocking dense features in video self-supervised learning

    Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning. arXiv preprint arXiv:2603.14482, 2026

  26. [26]

    cuVS: GPU-accelerated vector search and clustering

    NVIDIA RAPIDS Team. cuVS: GPU-accelerated vector search and clustering. GitHub repository,

  27. [27]

    Multi-GPU IVF-PQ and ANN indexes for large- scale vector search

    URL https://github.com/rapidsai/cuvs. Multi-GPU IVF-PQ and ANN indexes for large- scale vector search

  28. [28]

    Video generation models as world simulators

    OpenAI. Video generation models as world simulators. https://openai.com/index/ video-generation-models-as-world-simulators/, 2024

  29. [29]

    Prx part 3 — training a text-to-image model in 24h

    Photoroom. Prx part 3 — training a text-to-image model in 24h. https://huggingface.co/blog/ Photoroom/prx-part3, 2025

  30. [30]

    A self- supervised descriptor for image copy detection

    Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self- supervised descriptor for image copy detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  31. [31]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024. 28

  32. [32]

    Qwen3-VL technical report.arXiv preprint, 2025

    Qwen Team. Qwen3-VL technical report.arXiv preprint, 2025. Qwen3-VL-30B-A3B vision-language model

  33. [33]

    Eliminating oversaturation and artifacts of high guidance scales in diffusion models

    Seyedmorteza Sadat, Otmar Hilliges, and Romann M Weber. Eliminating oversaturation and artifacts of high guidance scales in diffusion models. InThe Thirteenth International Conference on Learning Representations, 2024

  34. [34]

    Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

    Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, et al. Seedance 1.5 pro: A native audio-visual joint generation foundation model.arXiv preprint arXiv:2512.13507, 2025

  35. [35]

    What matters for representation alignment: Global information or spatial structure?arXiv preprint arXiv:2512.10794, 2025

    Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure?arXiv preprint arXiv:2512.10794, 2025

  36. [36]

    SkyReels-V2: Infinite-length Film Generative Model

    Skywork AI SkyReels Team. SkyReels-V2: Infinite-length film generative model.arXiv preprint arXiv:2504.13074, 2025

  37. [37]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  38. [38]

    A comprehensive study of decoder-only llms for text-to-image generation

    Andrew Z Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, and Yogesh Balaji. A comprehensive study of decoder-only llms for text-to-image generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 28575–28585, 2025

  39. [39]

    Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content, 2025

    Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, and Di Zhang. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content, 2025. URL https://arxiv.org/abs/2410.08260

  40. [40]

    Ddt: Decoupled diffusion transformer.arXiv preprint arXiv:2504.05741, 2025

    Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer.arXiv preprint arXiv:2504.05741, 2025

  41. [41]

    Repa works until it doesn’t: Early-stopped, holistic alignment supercharges diffusion training

    Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, et al. Repa works until it doesn’t: Early-stopped, holistic alignment supercharges diffusion training. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  42. [42]

    Webdataset

    WebDataset Authors. Webdataset. GitHub repository, 2026. URL https://github.com/webdataset/webdataset. Tar-sharded dataset format for sequential streaming in large-scale deep learning

  43. [43]

    Video models are zero-shot learners and reasoners

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025

  44. [44]

    HunyuanVideo 1.5 Technical Report

    Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870, 2025

  45. [45]

    Exploring video quality assessment on user generated contents from aesthetic and technical perspectives

    Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives, 2023. URL https://arxiv.org/abs/2211.04894

  46. [46]

    Revisiting weak-to-strong consistency in semi-supervised semantic segmentation

    Lihe Yang, Lei Qi, Litong Feng, Wayne Zhang, and Yinghuan Shi. Revisiting weak-to-strong consistency in semi-supervised semantic segmentation, 2023. URL https://arxiv.org/abs/2208.09910

  47. [47]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations

  48. [48]

    SkyPilot: An intercloud broker for sky computing

    Zongheng Yang, Zhanghao Wu, Michael Luo, Wei-Lin Chiang, Romil Bhardwaj, Woosuk Kwon, Siyuan Zhuang, Frank Sifei Luan, Gautam Mittal, Scott Shenker, et al. SkyPilot: An intercloud broker for sky computing. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 437–455, 2023

  49. [49]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In The Thirteenth International Conference on Learning Representations

  50. [50]

    Sigmoid Loss for Language Image Pre-Training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343

  51. [51]

    T5gemma 2: Seeing, reading, and understanding longer

    Biao Zhang, Paul Suganthan, Gaël Liu, Ilya Philippov, Sahil Dua, Ben Hora, Kat Black, Gus Martins, Omar Sanseviero, Shreya Pathak, et al. T5gemma 2: Seeing, reading, and understanding longer. arXiv preprint arXiv:2512.14856, 2025

  52. [52]

    Videorepa: Learning physics for video generation through relational alignment with foundation models

    Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. Videorepa: Learning physics for video generation through relational alignment with foundation models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  53. [53]

    Waver: Wave your way to lifelike video generation

    Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Bingyue Peng, and Zehuan Yuan. Waver: Wave your way to lifelike video generation. arXiv preprint arXiv:2508.15761, 2025

  54. [54]

    Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

    Zangwei Zheng, Xiangyu Peng, Yuxuan Lou, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, et al. Open-sora 2.0: Training a commercial-level video generation model in $200k. arXiv preprint arXiv:2503.09642, 2025

Figure 16: Additional qualitative human-centered generations. Representative frames from videos involving ...