pith. machine review for the scientific record.

arxiv: 2604.16503 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI

Recognition: unknown

Motif-Video 2B: Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords text-to-video generation · video diffusion · model architecture · parameter efficiency · cross-attention · VBench evaluation · training efficiency

The pith

Separating prompt alignment, temporal consistency, and fine-detail recovery into distinct stages lets a 2B video model outperform a 14B baseline on VBench with far less data and compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that high-quality text-to-video generation does not require massive parameter counts or datasets if model capacity is organized to prevent different objectives from interfering. Prompt alignment, temporal consistency, and detail recovery are handled through separate architectural pathways rather than a single shared stream, which the authors argue reduces conflicts that hurt performance in conventional designs. If this holds, smaller models could close or reverse the quality gap with much larger ones, making strong video generation feasible on budgets of fewer than 10 million clips and 100,000 GPU hours. The approach pairs a three-part backbone with shared cross-attention and an efficiency recipe using dynamic token routing plus early alignment to a frozen encoder. Results show later blocks form clearer cross-frame attention patterns than single-stream baselines.
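The "dynamic token routing" in that recipe is only named here, not specified. As a rough illustration of the general idea behind such routing (in the spirit of the TREAD reference below: during training, a random subset of tokens bypasses the block stack and is re-inserted afterwards, cutting per-step compute), a hedged sketch follows; the function name, routing policy, and rate are assumptions, not the authors' implementation.

import torch

def routed_forward(blocks, tokens, route_rate=0.5, training=True):
    # Hedged sketch of TREAD-style token routing: a random subset of tokens
    # skips the expensive middle blocks during training and is merged back in.
    if not training or route_rate == 0.0:
        for blk in blocks:
            tokens = blk(tokens)
        return tokens

    B, N, D = tokens.shape
    keep = max(1, int(N * (1.0 - route_rate)))
    # Per-sample random choice of which tokens are processed this step.
    perm = torch.argsort(torch.rand(B, N, device=tokens.device), dim=1)
    kept_idx = perm[:, :keep].unsqueeze(-1).expand(-1, -1, D)

    kept = torch.gather(tokens, 1, kept_idx)
    for blk in blocks:
        kept = blk(kept)

    # Re-insert the processed tokens alongside the untouched (bypassed) ones.
    out = tokens.clone()
    out.scatter_(1, kept_idx, kept)
    return out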

Core claim

Motif-Video 2B demonstrates that a 2 billion parameter text-to-video model reaches 83.76 percent on VBench by using a three-part backbone to separate early fusion, joint representation learning, and detail refinement, combined with shared cross-attention for long token sequences and a training recipe of dynamic token routing plus early feature alignment to a frozen pretrained video encoder, thereby surpassing the 14 billion parameter Wan2.1 model while using seven times fewer parameters and substantially less training data.

What carries the argument

Three-part backbone that separates early fusion, joint representation learning, and detail refinement, together with shared cross-attention to maintain text control over long video sequences.
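Figures 2 and 10 describe the concrete shape of that backbone: 12 dual-stream layers, 16 single-stream encoder blocks conditioned through a Shared Cross-Attention over the text tokens, and decoder layers for detail refinement. A deliberately simplified sketch under those assumptions follows; the module names, dimensions, and the zero-initialized output projection (echoing Figure 5) are illustrative, and the real dual-stream blocks attend jointly across modalities rather than staying fully separate as they do here.

import torch
import torch.nn as nn

class SharedCrossAttention(nn.Module):
    # One cross-attention module reused by every single-stream block, keeping
    # text conditioning active as the video token sequence grows long.
    def __init__(self, dim, heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, dim)
        nn.init.zeros_(self.out.weight)  # zero-init: a no-op at step 0 (cf. Figure 5)
        nn.init.zeros_(self.out.bias)

    def forward(self, video_tokens, text_tokens):
        ctx, _ = self.attn(video_tokens, text_tokens, text_tokens)
        return video_tokens + self.out(ctx)  # residual text conditioning

class ThreeStageBackbone(nn.Module):
    # Sketch of the three-part design: dual-stream fusion, joint single-stream
    # blocks with shared cross-attention, and detail-refinement decoder layers.
    def __init__(self, dim=1536, dual=12, single=16, decoder=4):
        super().__init__()
        enc = lambda: nn.TransformerEncoderLayer(dim, 16, 4 * dim, batch_first=True)
        self.dual_video = nn.ModuleList(enc() for _ in range(dual))
        self.dual_text = nn.ModuleList(enc() for _ in range(dual))
        self.single = nn.ModuleList(enc() for _ in range(single))
        self.shared_xattn = SharedCrossAttention(dim)  # one module shared across blocks
        self.decoder = nn.ModuleList(enc() for _ in range(decoder))

    def forward(self, video_tokens, text_tokens):
        # Stage 1: modality-specific processing during early fusion.
        for v_blk, t_blk in zip(self.dual_video, self.dual_text):
            video_tokens, text_tokens = v_blk(video_tokens), t_blk(text_tokens)
        # Stage 2: joint representation learning with persistent text control.
        for blk in self.single:
            video_tokens = self.shared_xattn(blk(video_tokens), text_tokens)
        # Stage 3: detail refinement before decoding back to latents.
        for blk in self.decoder:
            video_tokens = blk(video_tokens)
        return video_tokens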

If this is right

  • Later transformer blocks develop clearer cross-frame attention structure than single-stream baselines under the same training conditions.
  • Text control remains strong even when video token sequences grow long.
  • High-quality video generation becomes achievable with under 10 million training clips and fewer than 100,000 H200 GPU hours.
  • Architectural specialization can narrow or reverse the quality gap usually tied to much larger parameter counts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of conflicting objectives might improve efficiency in related generative tasks such as high-resolution image synthesis or long audio generation.
  • If role separation is the key driver, then scaling laws for video models may need revision when architecture is allowed to specialize rather than remain uniform.
  • Controlled tests on even smaller parameter budgets could reveal how far the three-part design can be pushed before quality saturates.

Load-bearing premise

The performance gains result mainly from architecturally separating prompt alignment, temporal consistency, and fine-detail recovery rather than from dynamic token routing, early feature alignment, or other details of the training process.

What would settle it

A direct ablation that trains an otherwise identical single-stream model with the same parameter count and training recipe, then checks whether the VBench score drops to or below the 14B baseline level.
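One hedged way to pin down what "otherwise identical" means is as paired run configurations that differ only in the backbone; every field name here is hypothetical.

# Sketch of the settling ablation: same data, recipe, and parameter budget,
# differing only in whether the backbone separates roles.
base = dict(
    params="2B", data_clips="<10M", gpu_hours="<100k H200",
    token_routing=True, early_feature_alignment=True, text_encoder="T5Gemma2",
)
runs = {
    "three_part": {**base, "backbone": "dual-stream -> single-stream + shared cross-attn -> decoder"},
    "single_stream_control": {**base, "backbone": "uniform single-stream, matched depth and width"},
}
# The premise survives if "three_part" clearly beats "single_stream_control" on VBench
# and the control falls to or below the 14B baseline's score.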

Figures

Figures reproduced from arXiv: 2604.16503 by Beomgyu Kim, Bokki Ryu, Changjin Kang, Dahye Choi, Dongjoo Weon, Dongpin Oh, Dongseok Kim, Eunhwan Park, Haesol Lee, Hanbin Jung, Hongjoo Lee, Hyeyeon Cho, Hyukjin Kweon, Jaeheui Her, Jaeyeon Huh, Jangwoong Kim, Jeesoo Lee, Jeongdoo Lee, Junghwan Lim, Junhyeok Lee, Minjae Kim, Minsu Ha, Sungmin Lee, Taehyun Kim, Taewhan Kim, Wai Ting Cheung, Yeongjae Park, Youngrok Kim.

Figure 1
Figure 1: Representative generations from Motif-Video 2B. Frames are captured from videos generated by our 2B-parameter text-to-video model across a diverse set of prompts, illustrating the combination of prompt fidelity, temporal coherence, and visual detail that we target throughout this work. The banner is intended as a qualitative teaser; later sections analyze the architectural and training choices that make th… view at source ↗
Figure 2
Figure 2: Overview of Motif-Video 2B. Text is encoded by T5Gemma2, while video frames are compressed by the Wan2.1 VAE into spatiotemporal latents and patchified into tokens. The transformer backbone follows a three-stage design that separates early modality fusion, joint text-video representation learning, and final detail reconstruction: 12 dual-stream layers preserve modality-specific processing during early fus… view at source ↗
Figure 3
Figure 3: Attention structure in dual-stream vs. single-stream vs. DDT decoder layers. Compared with dual and single-stream layers, DDT decoder layers show stronger inter-frame attention structure, where each frame attends more to temporally adjacent frames. The blue box denotes the encoder hidden state: text tokens in the dual-stream and single-stream cases, and the video output tokens from the encoder layers in th… view at source ↗
Figure 4
Figure 4: Intermediate-layer text-attention drop in single-stream blocks. We compare attention maps from a representative intermediate layer in dual-stream and single-stream stages. Relative to dual-stream, the single-stream intermediate layer allocates substantially less attention mass to text tokens, indicating weaker text conditioning under joint-token competition. attending to text token j: α_ij = exp(q_i^T k_j / … view at source ↗
Figure 5
Figure 5: Zero-init alone does not save a cross-attention whose K, V geometry is ungrounded. Both variants are inserted into the same pretrained 360p checkpoint with W_O^cross = 0, making both forward passes identical to the base model at step 0. After 1,000 steps of continued training under matched optimizer settings, data, and learning rate, the SkyReels-V4–style cross-attention (top, raw x_t as K, V) collapses: out… view at source ↗
Figure 6
Figure 6: Dense features from V-JEPA 2.0. The visualization highlights that, while V-JEPA 2.0 captures global motion structure well, its dense features are less spatially coherent than would be ideal for dense REPA supervision in video generation. In practice, we align hidden states from a single intermediate encoder layer (layer 8) to the frozen teacher features. Following iREPA [34], we use a convolutional projec… view at source ↗
Figure 7
Figure 7: Overview of the training-data construction pipeline. The raw pool is split into Image Real, Image Synthetic, Video Real, and Video Synthetic branches. An initial sanitation stage removes broken files, abnormally small files, near-duplicates (SSCD-based), NSFW content, and watermarked content. Surviving clips are progressively filtered by resolution, clip length, motion, and aesthetic signals as they advanc… view at source ↗
Figure 8
Figure 8: Subject composition of the cross-attention fine-tuning corpus. The corpus was assembled iteratively by curating additional clips from underperforming categories. Left: image distribution. Right: video distribution. reinterpret them. Specifically, watermark, nsfw, and padded flags trigger hard removal; multi scene clips are dropped as a secondary check on scene segmentation; quality=low is excluded from 480… view at source ↗
Figure 9
Figure 9: Overview of our offline bucket-balanced sampler for WebDataset-formatted video corpora on… view at source ↗
Figure 10
Figure 10: Shared Cross-Attention contribution across single-stream encoder blocks and denoising steps (1280 × 736, 121 frames, 50 steps, σ ∈ [1.00, 0.29]). Left: Frobenius norm of the cross-attention output W_O^cross Attn(Q, K, V) per block (row) and step (column). Right: ratio of the cross-attention residual norm to the self-attention output norm ∥h_v∥. No block falls below 5.2%; the global mean is 7.6% and the maxi… view at source ↗
Figure 11
Figure 11: Selected single-frame samples from Motif-Video 2B across a range of subjects and visual styles. Each tile is a frame drawn from an independently generated text-to-video clip. The grid is intended to convey the breadth of domains the model handles, including photographic scenes, stylized and fantastical content, close-up subjects, and wide landscapes, rather than to claim uniform quality across all prompts… view at source ↗
Figure 12
Figure 12: Image-to-video generation results. The leftmost panel is the input image, and the model preserves its original appearance while generating temporally coherent video content from it. view at source ↗
Figure 13
Figure 13: Example of generated results from the arena. Prompt: "A guitarist sits on a fire escape playing at twilight, fingers moving in relaxed patterns along the neck of a scratched acoustic guitar. Shot on a 40mm lens with a slow crane-up from the street below, the brick wall beside him glows deep orange as the last sun hits it and the sky above shifts toward indigo. He wears a loose denim shirt rolled to the el… view at source ↗
Figure 14
Figure 14: Micro-scale semantic distortion. Three characteristic failures at the sub-object level: distorted hand anatomy on a close-up instrument subject (left), broken body structure under a high-motion skydiving prompt (middle), and attribute leakage between co-present animals in a multi-subject scene (right). The generations may remain category-correct (guitar, skydiver, cat and dog), leading VBench’s semantic d… view at source ↗
Figure 15
Figure 15: Temporal failure modes. Top: physically implausible liquid dynamics in a wine-splash prompt: the motion is locally smooth but violates gravity and surface tension. Middle: loss of temporal coherence under high scene complexity in a cavalry-charge prompt, where subject identities blur across frames and multi-agent spatial relationships fail to persist. Bottom: unintended mid-clip scene transition, where th… view at source ↗
Figure 16
Figure 16: Additional qualitative human-centered generations. Representative frames from videos involving human subjects, included as supplementary qualitative results. A Additional results This section presents additional qualitative results for both text-to-video and image-to-video generation in Figures 16 and 17. B Sampling Configuration We describe the sampling configuration used to produce the VBench scores rep… view at source ↗
Figure 17
Figure 17: Additional image-to-video results. The leftmost panel is the input image, and the remaining panels show representative generated video frames. Negative prompt. Following Wan [36], we apply a fixed negative prompt at every sampling call. The full string used is: The video has text and graphic overlays burned into the frame, including watermarks, logos, subtitles, timestamps, broadcast graphics, UI elements… view at source ↗
Figure 18
Figure 18: Qualitative effect of Shared Cross-Attention. For each prompt, the top row shows generation with Shared Cross-Attention enabled; the bottom row shows the same prompt and seed with cross-attention disabled on all 16 single-stream encoder blocks (360p, 50 steps, 121 frames). view at source ↗
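The per-block diagnostic reported in Figure 10 (the Frobenius norm of each block's cross-attention output, and its ratio to the self-attention output norm) is simple to reproduce from activations captured with forward hooks; a hedged sketch, with all names and the hook mechanics assumed:

import torch

def cross_attention_contribution(xattn_out, self_attn_out):
    # Figure 10-style diagnostic: Frobenius norm of the cross-attention residual
    # and its ratio to the self-attention output norm. Inputs are assumed to be
    # (batch, tokens, dim) activations captured per block with forward hooks.
    xattn_norm = torch.linalg.norm(xattn_out)
    self_norm = torch.linalg.norm(self_attn_out)
    return xattn_norm.item(), (xattn_norm / self_norm).item()

# Collecting these per single-stream block (rows) and denoising step (columns)
# would reproduce the heatmaps; the caption reports a minimum ratio of 5.2% and
# a global mean of 7.6%.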
Original abstract

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video~2B reaches 83.76\%, surpassing Wan2.1 14B while using 7$\times$ fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.
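The abstract's "early-phase feature alignment to a frozen pretrained video encoder" reads as a REPA-style auxiliary loss (cf. the REPA and VideoREPA entries in the reference list), and Figure 6 describes aligning a single intermediate layer (layer 8) to frozen V-JEPA features through a convolutional projector. A hedged sketch of such a loss follows; the projector, weighting, and schedule are assumptions, not the released recipe.

import torch
import torch.nn.functional as F

def repa_alignment_loss(hidden_states, teacher_features, projector):
    # Project the generator's intermediate hidden states and pull them toward
    # frozen teacher features (e.g. a V-JEPA-style encoder) with a cosine loss.
    # Shapes are illustrative: both tensors are assumed (batch, tokens, dim_teacher)
    # after projection; the teacher is detached so it receives no gradient.
    pred = F.normalize(projector(hidden_states), dim=-1)
    target = F.normalize(teacher_features.detach(), dim=-1)
    return 1.0 - (pred * target).sum(dim=-1).mean()

# Hypothetical usage, applied only early in training and then annealed off,
# consistent with "early-phase" alignment:
#   loss = diffusion_loss + 0.5 * repa_alignment_loss(h_layer8, vjepa_features, projector)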

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents Motif-Video 2B, a 2B-parameter text-to-video model trained on <10M clips. It claims that separating prompt alignment, temporal consistency, and fine-detail recovery into a three-part backbone with Shared Cross-Attention, paired with dynamic token routing and early-phase alignment to a frozen video encoder, enables an 83.76% VBench score. This surpasses Wan2.1 14B while using 7× fewer parameters and far less compute (<100k H200 GPU hours). Later blocks are said to show clearer cross-frame attention than single-stream baselines.

Significance. If the performance holds and the architectural separation is shown to be causal, the result would be significant: it would demonstrate that targeted organization of capacity plus an efficiency recipe can close much of the gap to much larger models, shifting emphasis from raw scale in video generation. The efficiency claims (low data, low compute) and the attention-structure observation are potentially valuable if quantified.

major comments (3)
  1. [Abstract] The headline claim of 83.76% VBench (surpassing Wan2.1 14B) is stated without any ablation results, statistical details, evaluation protocol, baseline implementation notes, or variance estimates. This leaves the central performance claim unsupported by visible evidence.
  2. [Abstract] No ablation holds the training recipe (dynamic token routing + early feature alignment) fixed while replacing the three-part backbone with a standard single-stream transformer of equal capacity. Without this comparison, it remains unclear whether the reported gains require the claimed role separation or arise from the efficiency components alone.
  3. [Abstract] The statement that later blocks develop 'clearer cross-frame attention structure than standard single-stream baselines' is asserted but not supported by any quantitative metric, figure, or matched-baseline comparison.
minor comments (1)
  1. [Abstract] The abstract contains a typesetting artifact ('Motif-Video~2B'); this should be corrected for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will make revisions to better support the claims presented in the abstract.

Point-by-point responses
  1. Referee: [Abstract] The headline claim of 83.76% VBench (surpassing Wan2.1 14B) is stated without any ablation results, statistical details, evaluation protocol, baseline implementation notes, or variance estimates. This leaves the central performance claim unsupported by visible evidence.

    Authors: We agree that the abstract would benefit from additional context to support the headline result. The full manuscript details the VBench evaluation protocol, baseline implementations, and ablation studies in the Experiments section. We will revise the abstract to include a brief reference to the evaluation protocol and direct readers to the relevant sections for ablations and comparisons. Variance estimates are not reported, consistent with standard practice for large-scale training runs due to compute constraints; we will add an explicit note clarifying this. revision: yes

  2. Referee: [Abstract] No ablation holds the training recipe (dynamic token routing + early feature alignment) fixed while replacing the three-part backbone with a standard single-stream transformer of equal capacity. Without this comparison, it remains unclear whether the reported gains require the claimed role separation or arise from the efficiency components alone.

    Authors: This is a valid criticism. The manuscript presents comparisons to single-stream models and ablations on the dynamic routing and alignment components, but does not include the exact control experiment that holds the training recipe fixed while swapping only the backbone architecture. Such an ablation would require substantial additional compute. In the revision we will expand the discussion of the role-separation motivation, drawing on preliminary observations of task interference, and explicitly note this as a limitation. revision: partial

  3. Referee: [Abstract] The statement that later blocks develop 'clearer cross-frame attention structure than standard single-stream baselines' is asserted but not supported by any quantitative metric, figure, or matched-baseline comparison.

    Authors: We acknowledge that the current claim is supported only by qualitative inspection. In the revised manuscript we will add quantitative metrics (such as frame-wise attention concentration scores) together with matched visualizations comparing later blocks of Motif-Video 2B against single-stream baselines, and include these in the analysis section. revision: yes
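The frame-wise attention concentration scores proposed in the response above are not defined in the paper; one hedged way such a score could be computed, with the windowing and head-averaging choices assumed:

import torch

def frame_attention_concentration(attn, frames, tokens_per_frame, window=1):
    # Fraction of each query token's attention mass that lands on tokens from
    # temporally adjacent frames (including its own frame). `attn` is assumed to
    # be (heads, N, N) softmax weights with N = frames * tokens_per_frame.
    N = frames * tokens_per_frame
    frame_of = torch.arange(N, device=attn.device) // tokens_per_frame
    mass = attn.mean(0)  # average over heads -> (N, N)
    near = (frame_of[None, :] - frame_of[:, None]).abs() <= window
    return (mass * near).sum(dim=-1).mean().item()  # in [0, 1]; higher = more temporally local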

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivations or self-referential predictions.

full rationale

The paper is a technical report on an empirical video generation model. Performance claims (e.g., 83.76% on VBench) are presented as direct benchmark outcomes from training and evaluation, not as quantities derived from equations, fitted parameters, or self-citations. No mathematical derivations, predictions, or first-principles results are described that could reduce to inputs by construction. Architectural claims about role separation and attention structure are supported by comparisons to baselines rather than self-definitional logic. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not enumerate any explicit free parameters, mathematical axioms, or newly postulated entities. The central claim implicitly rests on the unstated premise that standard transformer attention dynamics and a frozen pretrained encoder behave as expected under the described routing and alignment procedures.

pith-pipeline@v0.9.0 · 5694 in / 1246 out tokens · 38468 ms · 2026-05-10T15:46:21.654550+00:00 · methodology


Reference graph

Works this paper leans on

54 extracted references · 27 canonical work pages · 9 internal anchors

  1. [1]

    V-jepa: Latent video prediction for visual representation learning

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual representation learning. 2023

  2. [2]

    Speedrunning ImageNet diffusion

    Swayam Bhanded. Speedrunning ImageNet diffusion. arXiv preprint arXiv:2512.12386, 2025

  3. [3]

    SkyReels-V4: Multi-modal video-audio generation, inpainting and editing model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Youqiang Zhang, Zhengcong Fei, Debang Li, Sheng Chen, Chaofeng Ao, Nuo Pang, Yiming Wang, et al. SkyReels-V4: Multi-modal video-audio generation, inpainting and editing model. arXiv preprint arXiv:2602.21818, 2026

  4. [4]

    Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations

  5. [5]

    Sana-video: Efficient video generation with block linear diffusion transformer

    Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, et al. Sana-video: Efficient video generation with block linear diffusion transformer. arXiv preprint arXiv:2509.24695, 2025

  6. [6]

    Enhancing motion dynamics of image-to-video models via adaptive low-pass guidance

    Sanghyeok Choi, Yuchang Song, Taegyun Jeong, Taesung Kwon, and Kihyuk Sohn. Enhancing motion dynamics of image-to-video models via adaptive low-pass guidance. arXiv preprint arXiv:2506.08456, 2025

  7. [7]

    PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

    Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, et al. PaddleOCR-VL: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model. arXiv preprint arXiv:2510.14528, 2025

  8. [8]

    aesthetic-predictor-v2-5

    discus0434. aesthetic-predictor-v2-5. https://github.com/discus0434/aesthetic-predictor-v2-5 , 2024

  9. [9]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  10. [10]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, pages 12606–12633. PMLR, 2024

  11. [11]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025

  12. [12]

    Accelerate: Training and inference at scale made simple, efficient and adaptable

    Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022

  13. [13]

    Ltx-2: Efficient joint audio-visual foundation model

    Yoav HaCohen, Benny Brazowski Nisan Chiprut Yaki Bitterman, Andrew Kvochko Avishai Berkowitz Daniel Shalem, Daphna Lifschitz Dudu Moshe, Eitan Porat Eitan Richardson Guy Shiran, Itay Chachy Jonathan Chetboun, Michael Finkelson Michael Kupchick Nir Zabari, Nitzan Guetta Noa Kotler, Ofir Bibi Ori Gordon Poriya Panet, Roi Benita Shahar Armon, et al. Ltx-2:...

  14. [14]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024

  15. [15]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognit...

  16. [16]

    Nemo-curator: a toolkit for data curation, 2024

    Joseph Jennings, Mostofa Patwary, et al. Nemo-curator: a toolkit for data curation, 2024. URL https://github.com/NVIDIA-NeMo/Curator

  17. [17]

    Optimization by simulated annealing

    S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983

  18. [18]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  19. [19]

    Tread: Token routing for efficient architecture-agnostic diffusion training

    Felix Krause, Timy Phan, Ming Gui, Stefan Andreas Baumann, Vincent Tao Hu, and Björn Ommer. Tread: Token routing for efficient architecture-agnostic diffusion training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15703–15713, 2025

  20. [20]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  21. [21]

    Scaling laws for diffusion transformers

    Zhengyang Liang, Hao He, Ceyuan Yang, and Bo Dai. Scaling laws for diffusion transformers. arXiv preprint arXiv:2410.08184, 2024

  22. [22]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations

  23. [23]

    Enhance-a-video: Better generated video for free

    Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, and Yang You. Enhance-a-video: Better generated video for free. arXiv preprint arXiv:2502.07508, 2025

  24. [24]

    Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-video-t2v technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248, 2025

  25. [25]

    V-jepa 2.1: Unlocking dense features in video self-supervised learning

    Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-jepa 2.1: Unlocking dense features in video self-supervised learning. arXiv preprint arXiv:2603.14482, 2026

  26. [26]

    cuVS: GPU-accelerated vector search and clustering

    NVIDIA RAPIDS Team. cuVS: GPU-accelerated vector search and clustering. GitHub repository.

  27. [27]

    Multi-GPU IVF-PQ and ANN indexes for large-scale vector search

    URL https://github.com/rapidsai/cuvs. Multi-GPU IVF-PQ and ANN indexes for large-scale vector search

  28. [28]

    Video generation models as world simulators

    OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024

  29. [29]

    Prx part 3 — training a text-to-image model in 24h

    Photoroom. Prx part 3 — training a text-to-image model in 24h. https://huggingface.co/blog/Photoroom/prx-part3, 2025

  30. [30]

    A self-supervised descriptor for image copy detection

    Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  31. [31]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

  32. [32]

    Qwen3-VL technical report

    Qwen Team. Qwen3-VL technical report. arXiv preprint, 2025. Qwen3-VL-30B-A3B vision-language model

  33. [33]

    Eliminating oversaturation and artifacts of high guidance scales in diffusion models

    Seyedmorteza Sadat, Otmar Hilliges, and Romann M Weber. Eliminating oversaturation and artifacts of high guidance scales in diffusion models. In The Thirteenth International Conference on Learning Representations, 2024

  34. [34]

    Seedance 1.5 pro: A native audio-visual joint generation foundation model

    Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, et al. Seedance 1.5 pro: A native audio-visual joint generation foundation model. arXiv preprint arXiv:2512.13507, 2025

  35. [35]

    What matters for representation alignment: Global information or spatial structure?

    Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025

  36. [36]

    SkyReels-V2: Infinite-length Film Generative Model

    Skywork AI SkyReels Team. SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025

  37. [37]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  38. [38]

    A comprehensive study of decoder-only llms for text-to-image generation

    Andrew Z Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, and Yogesh Balaji. A comprehensive study of decoder-only llms for text-to-image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28575–28585, 2025

  39. [39]

    Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content

    Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, and Di Zhang. Koala-36m: A large-scale video dataset improving consistency between fine-grained conditions and video content, 2025. URL https://arxiv.org/abs/2410.08260

  40. [40]

    DDT: Decoupled Diffusion Transformer

    Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. DDT: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025

  41. [41]

    Repa works until it doesn’t: Early-stopped, holistic alignment supercharges diffusion training

    Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, et al. Repa works until it doesn’t: Early-stopped, holistic alignment supercharges diffusion training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  42. [42]

    Webdataset

    WebDataset Authors. Webdataset. GitHub repository, 2026. URL https://github.com/webdataset/webdataset. Tar-sharded dataset format for sequential streaming in large-scale deep learning

  43. [43]

    Video models are zero-shot learners and reasoners

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025

  44. [44]

    HunyuanVideo 1.5 Technical Report

    Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870, 2025

  45. [45]

    Exploring video quality assessment on user generated contents from aesthetic and technical perspectives

    Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives, 2023. URL https://arxiv.org/abs/2211.04894

  46. [46]

    Revisiting weak-to-strong consistency in semi-supervised semantic segmentation, 2023

    Lihe Yang, Lei Qi, Litong Feng, Wayne Zhang, and Yinghuan Shi. Revisiting weak-to-strong consistency in semi-supervised semantic segmentation, 2023. URL https://arxiv.org/abs/2208.09910

  47. [47]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations

  48. [48]

    SkyPilot: An intercloud broker for sky computing

    Zongheng Yang, Zhanghao Wu, Michael Luo, Wei-Lin Chiang, Romil Bhardwaj, Woosuk Kwon, Siyuan Zhuang, Frank Sifei Luan, Gautam Mittal, Scott Shenker, et al. SkyPilot: An intercloud broker for sky computing. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 437–455, 2023

  49. [49]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In The Thirteenth International Conference on Learning Representations

  50. [50]

    Sigmoid loss for language image pre-training, 2023

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343

  51. [51]

    T5gemma 2: Seeing, reading, and understanding longer

    Biao Zhang, Paul Suganthan, Gaël Liu, Ilya Philippov, Sahil Dua, Ben Hora, Kat Black, Gus Martins, Omar Sanseviero, Shreya Pathak, et al. T5gemma 2: Seeing, reading, and understanding longer. arXiv preprint arXiv:2512.14856, 2025

  52. [52]

    Videorepa: Learning physics for video generation through relational alignment with foundation models

    Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. Videorepa: Learning physics for video generation through relational alignment with foundation models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  53. [53]

    Waver: Wave your way to lifelike video generation

    Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Bingyue Peng, and Zehuan Yuan. Waver: Wave your way to lifelike video generation. arXiv preprint arXiv:2508.15761, 2025

  54. [54]

    Open-sora 2.0: Training a commercial-level video generation model in $200k

    Zangwei Zheng, Xiangyu Peng, Yuxuan Lou, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, et al. Open-sora 2.0: Training a commercial-level video generation model in $200k. arXiv preprint arXiv:2503.09642, 2025