Motif-Video 2B: Technical Report

Beomgyu Kim; Bokki Ryu; Changjin Kang; Dahye Choi; Dongjoo Weon; Dongpin Oh; Dongseok Kim; Eunhwan Park; Haesol Lee; Hanbin Jung

arxiv: 2604.16503 · v2 · pith:J2YFXZY3new · submitted 2026-04-14 · 💻 cs.CV · cs.AI

Junghwan Lim, Wai Ting Cheung, Minsu Ha, Beomgyu Kim, Taewhan Kim, Haesol Lee, Dongpin Oh, Jeesoo Lee, Taehyun Kim, Minjae Kim, Sungmin Lee, Hyeyeon Cho, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Dongseok Kim, Jangwoong Kim, Youngrok Kim, Hyukjin Kweon, Hongjoo Lee, Jeongdoo Lee, Junhyeok Lee, Eunhwan Park, Yeongjae Park, Bokki Ryu, Dongjoo Weon

Pith reviewed 2026-05-21 00:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords text-to-video · video generation · model efficiency · cross-attention · architectural design · parameter reduction · temporal consistency

The pith

Separating prompt alignment, temporal consistency, and detail recovery into distinct pathways lets a 2B video model surpass 14B-parameter rivals on VBench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that strong text-to-video performance is achievable with far fewer parameters and less training data than current large models demand. It argues that interference among prompt alignment, temporal consistency, and fine-detail recovery is reduced when these roles are given separate architectural pathways instead of being forced through one shared stream. The authors combine shared cross-attention for stronger text conditioning on long sequences with a three-part backbone that handles early fusion, joint learning, and refinement in turn. An efficiency-focused training recipe using dynamic token routing and early alignment to a frozen encoder makes the design work under a tight budget of under 10 million clips and 100,000 H200 GPU hours. If the claim holds, smaller-scale video generation becomes practical without waiting for ever-larger compute clusters.

Core claim

Motif-Video 2B reaches 83.76 percent on VBench by using shared cross-attention to improve text control over long video token sequences and a three-part backbone that separates early fusion, joint representation learning, and detail refinement. Dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder keep training efficient. The resulting 2B-parameter model exceeds the score of the 14B-parameter Wan2.1 while using seven times fewer parameters and substantially less training data.

What carries the argument

Shared cross-attention paired with a three-part backbone that divides processing into early fusion, joint representation learning, and detail refinement.
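
The stage layout can be written down as a small configuration sketch. The 12 dual-stream layers and 16 single-stream encoder blocks are taken from the captions of Figures 2 and 18 on this page; the mapping of block types to roles is inferred from those captions, the DDT decoder depth is not visible here and is left as `None`, and the class and field names are illustrative rather than the authors' code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str                  # block type, as named in the figure captions
    role: str                  # role the abstract assigns to this part (inferred mapping)
    num_layers: Optional[int]  # None where the visible text states no depth


# Hypothetical layout object; only the two integer counts come from the page.
BACKBONE = (
    Stage("dual_stream", "early fusion", 12),
    Stage("single_stream_encoder", "joint representation learning", 16),
    Stage("ddt_decoder", "detail refinement", None),  # depth not stated here
)


def known_depth(stages):
    """Sum only the layer counts that are actually stated."""
    return sum(s.num_layers for s in stages if s.num_layers is not None)


print([s.role for s in BACKBONE])
print(known_depth(BACKBONE))
```

Keeping the unknown depth explicit avoids implying a total layer count that the visible text does not support.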

If this is right

Later blocks exhibit clearer cross-frame attention patterns than those in standard single-stream video models.
Competitive text-to-video quality is reachable with fewer than 10 million training clips and under 100,000 H200 GPU hours.
Architectural specialization can narrow or close the quality gap that usually requires much larger parameter counts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same role-separation idea could be tested in other generative domains such as high-resolution image synthesis or audio generation to reduce task interference.
Lower training budgets might allow repeated experimentation and faster iteration cycles for teams without access to large GPU clusters.
Combining the three-part design with task-specific losses or additional frozen encoders could yield further efficiency gains on particular video styles.

Load-bearing premise

The reported performance under the given data and compute limits is assumed to come from separating prompt alignment, temporal consistency, and fine-detail recovery into distinct pathways (shared cross-attention plus the three-part backbone), together with the dynamic routing and alignment recipe.

What would settle it

Train a 2B-parameter single-stream baseline without shared cross-attention or the three-part backbone on the same clips and compute budget, then check whether its VBench score remains below 83.76 percent.
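
One way to pin down that control is to write both runs as configurations and check that they differ only in the architectural switches. This is a sketch under stated assumptions: the key names are hypothetical, the recipe flags are held fixed as the referee report below requests, and the decision rule ignores run-to-run noise because no error bars are visible on this page.

```python
REPORTED_VBENCH = 83.76  # percent, from the abstract

# Hypothetical config keys; values describe the setup in words only.
proposed = {
    "params": "2B",
    "backbone": "three_part",
    "shared_cross_attention": True,
    "token_routing": True,
    "early_encoder_alignment": True,
    "data": "same_clips",
    "compute": "same_budget",
}
# The control flips only the two architectural choices under test.
baseline = dict(proposed, backbone="single_stream", shared_cross_attention=False)


def differing_keys(a, b):
    """Keys whose values differ between two run configurations."""
    return sorted(k for k in a if a[k] != b[k])


def supports_claim(baseline_vbench, reported=REPORTED_VBENCH):
    """The architectural claim survives only if the control scores lower."""
    return baseline_vbench < reported


print(differing_keys(proposed, baseline))
```

If `differing_keys` ever returns more than the two architectural switches, the comparison no longer isolates the design.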

Figures

Figures reproduced from arXiv: 2604.16503 by Beomgyu Kim, Bokki Ryu, Changjin Kang, Dahye Choi, Dongjoo Weon, Dongpin Oh, Dongseok Kim, Eunhwan Park, Haesol Lee, Hanbin Jung, Hongjoo Lee, Hyeyeon Cho, Hyukjin Kweon, Jaeheui Her, Jaeyeon Huh, Jangwoong Kim, Jeesoo Lee, Jeongdoo Lee, Junghwan Lim, Junhyeok Lee, Minjae Kim, Minsu Ha, Sungmin Lee, Taehyun Kim, Taewhan Kim, Wai Ting Cheung, Yeongjae Park, Youngrok Kim.

**Figure 1.** Representative generations from Motif-Video 2B. Frames are captured from videos generated by our 2B-parameter text-to-video model across a diverse set of prompts, illustrating the combination of prompt fidelity, temporal coherence, and visual detail that we target throughout this work. The banner is intended as a qualitative teaser; later sections analyze the architectural and training choices that make th…

**Figure 2.** Overview of Motif-Video 2B. Text is encoded by T5Gemma2, while video frames are compressed by the Wan2.1 VAE into spatiotemporal latents and patchified into tokens. The transformer backbone follows a three-stage design that separates early modality fusion, joint text-video representation learning, and final detail reconstruction: 12 dual-stream layers preserve modality-specific processing during early fus…
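
To give a feel for how long the video token sequence gets after VAE compression and patchification, here is a back-of-the-envelope sketch. The 1280 × 736, 121-frame setting is the one quoted in the Figure 10 caption. The 4× temporal and 8× spatial compression factors are assumed for the Wan2.1 VAE, and the 1 × 2 × 2 patch size is a hypothetical choice; neither is stated in the text visible on this page.

```python
def latent_shape(frames, height, width, t_stride=4, s_stride=8):
    """Latent grid size, assuming a causal video VAE that keeps the first
    frame and compresses the remaining frames by `t_stride` in time."""
    return (1 + (frames - 1) // t_stride, height // s_stride, width // s_stride)


def num_video_tokens(frames, height, width, patch=(1, 2, 2)):
    """Token count after patchifying the latent grid (patch size is assumed)."""
    t, h, w = latent_shape(frames, height, width)
    pt, ph, pw = patch
    return (t // pt) * (h // ph) * (w // pw)


tokens = num_video_tokens(frames=121, height=736, width=1280)
print(tokens)  # 114080 under these assumptions
```

A text prompt of a few hundred tokens is a very small share of a sequence this long, which is the regime the paper's shared cross-attention targets.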

**Figure 3.** Attention structure in dual-stream vs. single-stream vs. DDT decoder layers. Compared with dual and single-stream layers, DDT decoder layers show stronger inter-frame attention structure, where each frame attends more to temporally adjacent frames. The blue box denotes the encoder hidden state: text tokens in the dual-stream and single-stream cases, and the video output tokens from the encoder layers in th…

**Figure 4.** Intermediate-layer text-attention drop in single-stream blocks. We compare attention maps from a representative intermediate layer in dual-stream and single-stream stages. Relative to dual-stream, the single-stream intermediate layer allocates substantially less attention mass to text tokens, indicating weaker text conditioning under joint-token competition. … attending to text token $j$: $\alpha_{ij} = \exp(q_i^\top k_j / \dots)$ …
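
The "attention mass on text tokens" quantity can be illustrated with a toy calculation. The caption's formula is cut off, so this sketch assumes standard scaled dot-product softmax attention with a $1/\sqrt{d}$ scale; the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def text_attention_mass(q, keys, is_text):
    """Share of one query's attention that lands on text-token keys."""
    scale = 1.0 / math.sqrt(len(q))  # assumed standard scaling
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    alpha = softmax(scores)
    return sum(a for a, t in zip(alpha, is_text) if t)


def uniform_case(n_text, n_video):
    """All keys equal, so attention is uniform and mass is n_text / total."""
    keys = [[0.0, 0.0]] * (n_text + n_video)
    is_text = [True] * n_text + [False] * n_video
    return text_attention_mass([1.0, 0.0], keys, is_text)


short_seq = uniform_case(n_text=4, n_video=12)
long_seq = uniform_case(n_text=4, n_video=1996)
print(short_seq, long_seq)
```

With uninformative keys, the text share falls from one quarter to one in five hundred purely because more video tokens compete in the same softmax, which is the dilution effect the caption describes for joint-token layers.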

**Figure 5.** Zero-init alone does not save a cross-attention whose $K, V$ geometry is ungrounded. Both variants are inserted into the same pretrained 360p checkpoint with $W^{\mathrm{cross}}_O = 0$, making both forward passes identical to the base model at step 0. After 1,000 steps of continued training under matched optimizer settings, data, and learning rate, the SkyReels-V4–style cross-attention (top, raw $x_t$ as $K, V$) collapses: out…
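
The "identical at step 0" property follows directly from the residual form: if the cross-attention output projection is all zeros, the added branch contributes nothing. A minimal sketch, with hypothetical function names and toy 3-dimensional vectors:

```python
def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def block_with_cross_attn(h, attn_out, W_o):
    """Residual update h + W_o @ attn_out for a newly inserted branch."""
    return [hi + di for hi, di in zip(h, matvec(W_o, attn_out))]


h = [0.5, -1.0, 2.0]             # hidden state from the pretrained model
attn_out = [3.0, 1.0, -2.0]      # whatever the new cross-attention computes
W_zero = [[0.0] * 3 for _ in range(3)]

out = block_with_cross_attn(h, attn_out, W_zero)
print(out == h)
```

The forward pass is unchanged for any `attn_out`, but the gradient with respect to the output projection is the outer product of the upstream gradient and `attn_out`. The quality of the keys and values therefore matters from the first update onward, which is consistent with the caption's point that zero-init alone is not sufficient.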

**Figure 6.** Dense features from V-JEPA 2.0. The visualization highlights that, while V-JEPA 2.0 captures global motion structure well, its dense features are less spatially coherent than would be ideal for dense REPA supervision in video generation. In practice, we align hidden states from a single intermediate encoder layer (layer 8) to the frozen teacher features. Following iREPA [34], we use a convolutional projec…
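
Representation alignment of this kind is commonly implemented as a per-token similarity loss between projected student hidden states and frozen teacher features. The sketch below assumes a negative mean cosine similarity and uses an identity projection as a stand-in; the caption mentions a convolutional projection whose details are cut off, so nothing here should be read as the authors' exact loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def alignment_loss(student_tokens, teacher_tokens, project=lambda x: x):
    """Negative mean cosine similarity; teacher features are treated as
    constants (frozen), and `project` is a placeholder projection head."""
    sims = [cosine(project(s), t) for s, t in zip(student_tokens, teacher_tokens)]
    return -sum(sims) / len(sims)


teacher = [[1.0, 2.0], [0.0, 3.0]]
aligned = alignment_loss(teacher, teacher)
opposed = alignment_loss([[-1.0, -2.0], [0.0, -3.0]], teacher)
print(aligned, opposed)
```

The loss approaches -1 when student and teacher features point the same way and +1 when they point in opposite directions, so minimizing it pulls the chosen intermediate layer toward the teacher's feature geometry.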

**Figure 7.** Overview of the training-data construction pipeline. The raw pool is split into Image Real, Image Synthetic, Video Real, and Video Synthetic branches. An initial sanitation stage removes broken files, abnormally small files, near-duplicates (SSCD-based), NSFW content, and watermarked content. Surviving clips are progressively filtered by resolution, clip length, motion, and aesthetic signals as they advanc…
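
The two-step structure in this caption (hard sanitation, then progressively stricter filters) can be sketched as follows. All field names, stage names, and threshold values are hypothetical placeholders chosen for illustration; the report's actual thresholds are not visible on this page.

```python
SANITATION_FLAGS = ("broken", "too_small", "near_duplicate", "nsfw", "watermarked")

# Hypothetical thresholds that tighten from one training stage to the next.
STAGES = [
    {"name": "stage_1", "min_height": 360, "min_frames": 49,
     "min_motion": 0.10, "min_aesthetic": 4.0},
    {"name": "stage_2", "min_height": 720, "min_frames": 97,
     "min_motion": 0.20, "min_aesthetic": 5.0},
]


def sanitize(clip):
    """Keep a clip only if no hard-removal flag is set."""
    return not any(clip.get(flag, False) for flag in SANITATION_FLAGS)


def passes(clip, stage):
    """Check resolution, length, motion, and aesthetic signals for one stage."""
    return (clip["height"] >= stage["min_height"]
            and clip["frames"] >= stage["min_frames"]
            and clip["motion"] >= stage["min_motion"]
            and clip["aesthetic"] >= stage["min_aesthetic"])


def progressive_filter(clips, stages=STAGES):
    """Return the surviving clip ids per stage; each stage filters the
    survivors of the previous one."""
    survivors = [c for c in clips if sanitize(c)]
    per_stage = {}
    for stage in stages:
        survivors = [c for c in survivors if passes(c, stage)]
        per_stage[stage["name"]] = [c["id"] for c in survivors]
    return per_stage


pool = [
    {"id": "a", "height": 1080, "frames": 121, "motion": 0.5, "aesthetic": 6.0},
    {"id": "b", "height": 480, "frames": 61, "motion": 0.3, "aesthetic": 4.5},
    {"id": "c", "height": 1080, "frames": 121, "motion": 0.5, "aesthetic": 6.0,
     "watermarked": True},
]
result = progressive_filter(pool)
print(result)
```

Clip `c` is removed by sanitation regardless of its quality scores, while clip `b` survives only the looser first stage.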

**Figure 8.** Subject composition of the cross-attention fine-tuning corpus. The corpus was assembled iteratively by curating additional clips from underperforming categories. Left: image distribution. Right: video distribution. … reinterpret them. Specifically, watermark, nsfw, and padded flags trigger hard removal; multi scene clips are dropped as a secondary check on scene segmentation; quality=low is excluded from 480…

**Figure 9.** Overview of our offline bucket-balanced sampler for WebDataset-formatted video corpora on…

**Figure 10.** Shared Cross-Attention contribution across single-stream encoder blocks and denoising steps (1280 × 736, 121 frames, 50 steps, $\sigma \in [1.00, 0.29]$). Left: Frobenius norm of the cross-attention output $W^{\mathrm{cross}}_O\,\mathrm{Attn}(Q, K, V)$ per block (row) and step (column). Right: ratio of the cross-attention residual norm to the self-attention output norm $\lVert h_v \rVert$. No block falls below 5.2%; the global mean is 7.6% and the maxi…
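
The ratio in the right-hand panel is a simple quotient of two matrix norms. A minimal sketch with toy matrices (the values are made up so the arithmetic is easy to follow; they are not measurements from the paper):

```python
import math


def frobenius(M):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in M for x in row))


def cross_to_self_ratio(cross_out, self_out):
    """Relative size of the cross-attention residual vs. the self-attention output."""
    return frobenius(cross_out) / frobenius(self_out)


cross_out = [[3.0, 0.0], [0.0, 4.0]]    # norm 5
self_out = [[30.0, 40.0], [0.0, 0.0]]   # norm 50
ratio = cross_to_self_ratio(cross_out, self_out)
print(f"{ratio:.1%}")
```

A ratio that stays well above zero across blocks and denoising steps indicates that the cross-attention branch keeps contributing to the residual stream rather than being ignored.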

**Figure 11.** Selected single-frame samples from Motif-Video 2B across a range of subjects and visual styles. Each tile is a frame drawn from an independently generated text-to-video clip. The grid is intended to convey the breadth of domains the model handles, including photographic scenes, stylized and fantastical content, close-up subjects, and wide landscapes, rather than to claim uniform quality across all prompts…

**Figure 12.** Image-to-video generation results. The leftmost panel is the input image, and the model preserves its original appearance while generating temporally coherent video content from it.

**Figure 13.** Example of generated results from the arena. Prompt: "A guitarist sits on a fire escape playing at twilight, fingers moving in relaxed patterns along the neck of a scratched acoustic guitar. Shot on a 40mm lens with a slow crane-up from the street below, the brick wall beside him glows deep orange as the last sun hits it and the sky above shifts toward indigo. He wears a loose denim shirt rolled to the el…

**Figure 14.** Micro-scale semantic distortion. Three characteristic failures at the sub-object level: distorted hand anatomy on a close-up instrument subject (left), broken body structure under a high-motion skydiving prompt (middle), and attribute leakage between co-present animals in a multi-subject scene (right). The generations may remain category-correct (guitar, skydiver, cat and dog), leading VBench's semantic d…

**Figure 15.** Temporal failure modes. Top: physically implausible liquid dynamics in a wine-splash prompt: the motion is locally smooth but violates gravity and surface tension. Middle: loss of temporal coherence under high scene complexity in a cavalry-charge prompt, where subject identities blur across frames and multi-agent spatial relationships fail to persist. Bottom: unintended mid-clip scene transition, where th…

**Figure 16.** Additional qualitative human-centered generations. Representative frames from videos involving human subjects, included as supplementary qualitative results. A. Additional results. This section presents additional qualitative results for both text-to-video and image-to-video generation in Figures 16 and 17. B. Sampling Configuration. We describe the sampling configuration used to produce the VBench scores rep…

**Figure 17.** Additional image-to-video results. The leftmost panel is the input image, and the remaining panels show representative generated video frames. Negative prompt. Following Wan [36], we apply a fixed negative prompt at every sampling call. The full string used is: The video has text and graphic overlays burned into the frame, including watermarks, logos, subtitles, timestamps, broadcast graphics, UI elements…

**Figure 18.** Qualitative effect of Shared Cross-Attention. For each prompt, the top row shows generation with Shared Cross-Attention enabled; the bottom row shows the same prompt and seed with cross-attention disabled on all 16 single-stream encoder blocks (360p, 50 steps, 121 frames).

Original abstract

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video 2B reaches 83.76%, surpassing Wan2.1 14B while using 7× fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.
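
The abstract does not spell out how dynamic token routing works. The toy sketch below follows the general idea of the TREAD paper listed in the references, in which a random subset of tokens bypasses a span of middle layers during training and rejoins afterward. Whether Motif-Video uses exactly this scheme is an assumption, as are the function name, the keep ratio, and the layer span. Real transformer layers mix information across tokens; here each layer is a per-token function to keep the example small.

```python
import random


def route_tokens(tokens, layers, start, end, keep_ratio, rng):
    """Apply `layers` to `tokens`; between `start` and `end`, only a random
    subset of tokens is processed and the rest skip ahead unchanged."""
    x = list(tokens)
    for layer in layers[:start]:
        x = [layer(t) for t in x]

    n_keep = max(1, int(len(x) * keep_ratio))
    kept = sorted(rng.sample(range(len(x)), n_keep))
    sub = [x[i] for i in kept]
    for layer in layers[start:end]:
        sub = [layer(t) for t in sub]   # cost scales with the kept subset only
    for i, t in zip(kept, sub):
        x[i] = t                        # bypassed tokens keep their earlier value

    for layer in layers[end:]:
        x = [layer(t) for t in x]
    return x, kept


# Toy setup: four "layers" that each add 1, eight scalar "tokens".
layers = [lambda t: t + 1] * 4
out, kept = route_tokens([0] * 8, layers, start=1, end=3,
                         keep_ratio=0.5, rng=random.Random(0))
print(out, kept)
```

In this toy run, tokens that were routed through every layer end at 4 and bypassed tokens end at 2, so the middle layers did half the work they would have done on the full sequence.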

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Motif-Video 2B claims 83.76 on VBench with a 2B model by splitting prompt, temporal, and detail roles across a three-part backbone plus shared cross-attention, but the data do not yet isolate that split from the training recipe.

The headline result is a 2B-parameter text-to-video model that reaches 83.76 on VBench while beating Wan2.1 14B with roughly 7 times fewer parameters and far less training data. The authors argue that prompt alignment, temporal consistency, and fine-detail recovery interfere when forced through one pathway, so they built a three-part backbone and added shared cross-attention to keep text control strong on long token sequences. They also use dynamic token routing and early alignment to a frozen video encoder to stay under 10M clips and 100k H200 hours.

That combination is the concrete new piece on offer: a specific way to organize capacity rather than just scaling it up. The later-block attention maps they show do look more structured than single-stream baselines, which is at least consistent with the design goal. The training recipe itself looks pragmatic for the stated budget.

The main gap is the missing control the stress-test note flags. The abstract gives the final score and a qualitative observation on attention, but no ablation that holds the dynamic routing and frozen-encoder alignment fixed while reverting to a single backbone. Without that, it is hard to credit the architectural separation rather than the efficiency tricks for the data-efficiency result. Evaluation details, baselines, and error bars are also thin in what is visible, so the number is hard to weigh precisely.

This report is useful for groups trying to push video generation below the 10B-parameter mark. Readers who want concrete ideas for role separation and cheap alignment tricks will find something to try, even if the attribution is still loose. It is worth sending to peer review so the authors can supply the missing controls and let referees check whether the split actually moves the needle once the training recipe is held constant.

Referee Report

1 major / 2 minor

Summary. The paper presents Motif-Video 2B, a 2B-parameter text-to-video model that reaches 83.76% on VBench, outperforming Wan2.1 14B while using 7× fewer parameters and substantially less training data (<10M clips, <100k H200 GPU hours). The central claim is that separating prompt alignment, temporal consistency, and fine-detail recovery via a three-part backbone with shared cross-attention, combined with dynamic token routing and early-phase feature alignment to a frozen encoder, enables this efficiency; later blocks exhibit clearer cross-frame attention than single-stream baselines.

Significance. If the attribution to architectural specialization holds under controlled conditions, the result would indicate that targeted capacity organization can close the quality gap with much larger models under tight data and compute budgets, offering a practical path toward more accessible video generation. The reported attention-structure analysis supplies a modest mechanistic observation that could be developed further.

major comments (1)

[Abstract] Abstract and Results section: The headline claim that the three-part backbone and shared cross-attention are responsible for competitive performance under the stated budget is not supported by any ablation that holds dynamic token routing and early-phase frozen-encoder alignment fixed while reverting to a single-stream backbone. Without this control, the data-efficiency result cannot be attributed to the architectural separation rather than the training recipe alone.

minor comments (2)

[Abstract] The abstract states the VBench score but supplies no information on evaluation protocol, baseline details, number of samples, statistical significance, or error bars, making it impossible to judge whether the 83.76% figure reliably supports the central claim.
Notation for the three-part backbone (early fusion, joint representation, detail refinement) and the dynamic routing mechanism should be defined explicitly with equations or pseudocode in the methods section for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and outline the revisions we will make to strengthen the attribution of results to the proposed architecture.

Referee: [Abstract] Abstract and Results section: The headline claim that the three-part backbone and shared cross-attention are responsible for competitive performance under the stated budget is not supported by any ablation that holds dynamic token routing and early-phase frozen-encoder alignment fixed while reverting to a single-stream backbone. Without this control, the data-efficiency result cannot be attributed to the architectural separation rather than the training recipe alone.

Authors: We agree that the current evidence does not fully isolate the contribution of the three-part backbone and shared cross-attention from the training recipe components. Our manuscript reports comparisons to single-stream baselines that exhibit weaker cross-frame attention in later blocks, but these baselines were not trained under an identical recipe that fixes dynamic token routing and early-phase alignment to the frozen encoder. To address this directly, we will add a controlled ablation in the revised manuscript: a single-stream backbone trained with the same dynamic token routing and early-phase feature alignment, using the same data and compute budget. This will allow clearer attribution of the efficiency gains to the architectural separation of prompt alignment, temporal consistency, and detail recovery.

Revision: yes

Circularity Check

0 steps flagged

No derivation chain present; empirical report only

The paper is a technical report on an empirical video generation model. It reports a VBench score of 83.76% for Motif-Video 2B and attributes results to architectural choices (shared cross-attention, three-part backbone) paired with a training recipe (dynamic token routing, early-phase alignment to frozen encoder). No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citation chains appear in the abstract or described claims. The central performance claim is externally benchmarked and does not reduce to any input by construction, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; it introduces no explicit free parameters, axioms, or invented entities beyond standard deep-learning components. All concrete details on architecture, losses, and data handling are absent.

pith-pipeline@v0.9.0 · 5925 in / 1207 out tokens · 50879 ms · 2026-05-21T00:08:25.821283+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Paper passage: three-part backbone separates early fusion, joint representation learning, and detail refinement

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat as forced Peano structure · tag: unclear
Paper passage: separating these roles architecturally, rather than relying on scale alone

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 13 internal anchors

[1] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-JEPA: Latent video prediction for visual representation learning. 2023.

[2] Swayam Bhanded. Speedrunning ImageNet diffusion. arXiv preprint arXiv:2512.12386, 2025.

[3] Guibin Chen, Dixuan Lin, Jiangping Yang, Youqiang Zhang, Zhengcong Fei, Debang Li, Sheng Chen, Chaofeng Ao, Nuo Pang, Yiming Wang, et al. SkyReels-V4: Multi-modal video-audio generation, inpainting and editing model. arXiv preprint arXiv:2602.21818, 2026.

[4] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations.

[5] Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, et al. SANA-Video: Efficient video generation with block linear diffusion transformer. arXiv preprint arXiv:2509.24695, 2025.

[6] Sanghyeok Choi, Yuchang Song, Taegyun Jeong, Taesung Kwon, and Kihyuk Sohn. Enhancing motion dynamics of image-to-video models via adaptive low-pass guidance. arXiv preprint arXiv:2506.08456, 2025.

[7] Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, et al. PaddleOCR-VL: Boosting multilingual document parsing via a 0.9B ultra-compact vision-language model. arXiv preprint arXiv:2510.14528, 2025.

[8] discus0434. aesthetic-predictor-v2-5. https://github.com/discus0434/aesthetic-predictor-v2-5, 2024.

[9] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[10] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning, pages 12606–12633. PMLR, 2024.

[11] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113, 2025.

[12] Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022.

[13] Yoav HaCohen, Benny Brazowski Nisan Chiprut Yaki Bitterman, Andrew Kvochko Avishai Berkowitz Daniel Shalem, Daphna Lifschitz Dudu Moshe, Eitan Porat Eitan Richardson Guy Shiran, Itay Chachy Jonathan Chetboun, Michael Finkelson Michael Kupchick Nir Zabari, Nitzan Guetta Noa Kotler, Ofir Bibi Ori Gordon Poriya Panet, Roi Benita Shahar Armon, et al. LTX-2: Efficient joint audio-visual foundation model...
[14] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.

[15] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognit...

[16] Joseph Jennings, Mostofa Patwary, et al. NeMo-Curator: a toolkit for data curation, 2024. URL https://github.com/NVIDIA-NeMo/Curator

[17] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

[18] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

[19] Felix Krause, Timy Phan, Ming Gui, Stefan Andreas Baumann, Vincent Tao Hu, and Björn Ommer. TREAD: Token routing for efficient architecture-agnostic diffusion training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15703–15713, 2025.

[20] Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024.

[21] Zhengyang Liang, Hao He, Ceyuan Yang, and Bo Dai. Scaling laws for diffusion transformers. arXiv preprint arXiv:2410.08184, 2024.

[22] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations.

[23] Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, and Yang You. Enhance-A-Video: Better generated video for free. arXiv preprint arXiv:2502.07508, 2025.

[24] Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, et al. Step-Video-T2V technical report: The practice, challenges, and future of video foundation model. arXiv preprint arXiv:2502.10248, 2025.

[25] Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-JEPA 2.1: Unlocking dense features in video self-supervised learning. arXiv preprint arXiv:2603.14482, 2026.
[26]

cuVS: GPU-accelerated vector search and clustering

NVIDIA RAPIDS Team. cuVS: GPU-accelerated vector search and clustering. GitHub repository,

work page
[27]

Multi-GPU IVF-PQ and ANN indexes for large- scale vector search

URL https://github.com/rapidsai/cuvs. Multi-GPU IVF-PQ and ANN indexes for large- scale vector search

work page
[28]

Video generation models as world simulators

OpenAI. Video generation models as world simulators. https://openai.com/index/ video-generation-models-as-world-simulators/, 2024

work page 2024
[29]

Prx part 3 — training a text-to-image model in 24h

Photoroom. Prx part 3 — training a text-to-image model in 24h. https://huggingface.co/blog/ Photoroom/prx-part3, 2025

work page 2025
[30]

A self- supervised descriptor for image copy detection

Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self- supervised descriptor for image copy detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

work page 2022
[31]

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024
[32]

Qwen Team. Qwen3-VL technical report. arXiv preprint, 2025. Qwen3-VL-30B-A3B vision-language model
[33]

Seyedmorteza Sadat, Otmar Hilliges, and Romann M Weber. Eliminating oversaturation and artifacts of high guidance scales in diffusion models. In The Thirteenth International Conference on Learning Representations, 2024
[34]

Team Seedance, Heyi Chen, Siyan Chen, Xin Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Tianheng Cheng, Xinqi Cheng, et al. Seedance 1.5 pro: A native audio-visual joint generation foundation model. arXiv preprint arXiv:2512.13507, 2025
[35]

Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025
[36]

Skywork AI SkyReels Team. SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025
[37]

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025
[38]

Andrew Z Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, and Yogesh Balaji. A comprehensive study of decoder-only LLMs for text-to-image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28575–28585, 2025
[39]

Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin, Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng, Xin Tao, Fei Yang, Pengfei Wan, and Di Zhang. Koala-36M: A large-scale video dataset improving consistency between fine-grained conditions and video content, 2025. URL https://arxiv.org/abs/2410.08260
[40]

Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. DDT: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025
[41]

Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, et al. REPA works until it doesn’t: Early-stopped, holistic alignment supercharges diffusion training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems
[42]

WebDataset Authors. WebDataset. GitHub repository, 2026. URL https://github.com/webdataset/webdataset. Tar-sharded dataset format for sequential streaming in large-scale deep learning
[43]

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025
[44]

Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. HunyuanVideo 1.5 technical report. arXiv preprint arXiv:2511.18870, 2025
[45]

Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives, 2023. URL https://arxiv.org/abs/2211.04894
[46]

Lihe Yang, Lei Qi, Litong Feng, Wayne Zhang, and Yinghuan Shi. Revisiting weak-to-strong consistency in semi-supervised semantic segmentation, 2023. URL https://arxiv.org/abs/2208.09910
[47]

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations
[48]

Zongheng Yang, Zhanghao Wu, Michael Luo, Wei-Lin Chiang, Romil Bhardwaj, Woosuk Kwon, Siyuan Zhuang, Frank Sifei Luan, Gautam Mittal, Scott Shenker, et al. SkyPilot: An intercloud broker for sky computing. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 437–455, 2023
[49]

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In The Thirteenth International Conference on Learning Representations
[50]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343
[51]

Biao Zhang, Paul Suganthan, Gaël Liu, Ilya Philippov, Sahil Dua, Ben Hora, Kat Black, Gus Martins, Omar Sanseviero, Shreya Pathak, et al. T5Gemma 2: Seeing, reading, and understanding longer. arXiv preprint arXiv:2512.14856, 2025
[52]

Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. VideoREPA: Learning physics for video generation through relational alignment with foundation models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems
[53]

Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Bingyue Peng, and Zehuan Yuan. Waver: Wave your way to lifelike video generation. arXiv preprint arXiv:2508.15761, 2025
[54]

Zangwei Zheng, Xiangyu Peng, Yuxuan Lou, Chenhui Shen, Tom Young, Xinying Guo, Binluo Wang, Hang Xu, Hongxin Liu, Mingyan Jiang, et al. Open-Sora 2.0: Training a commercial-level video generation model in $200k. arXiv preprint arXiv:2503.09642, 2025

Figure 16: Additional qualitative human-centered generations. Representative frames from videos involving ...
