CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Aliaksandr Siarohin; Hao Chen; Mukesh Singhal; Sergey Tulyakov; Sharath Girish; Tsai-Shien Chen; Zhikang Dong

arxiv: 2606.13768 · v2 · pith:AW5J55DDnew · submitted 2026-06-11 · 💻 cs.CV · cs.AI

Sharath Girish, Tsai-Shien Chen, Zhikang Dong, Mukesh Singhal, Hao Chen, Sergey Tulyakov, Aliaksandr Siarohin

Pith reviewed 2026-06-27 06:50 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: cinematic video generation, video diffusion model, entity-centric conditioning, rotary position embeddings, multi-shot synthesis, camera control, temporal control, shot transitions

The pith

CineOrchestra unifies control of subjects, events, cameras and shot transitions in video generation by treating each as an entity over a temporal interval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a single video diffusion model can simultaneously manage subjects, events, cameras and shot transitions instead of handling each axis with a separate specialist. It claims these elements share a common structure as entities acting over time intervals, so they can all be expressed with one set of entity-centric conditioning primitives plus reference images. This reduces the design task to one positional encoding problem. The authors solve it with two parameter-free rotary embeddings that keep attention consistent across events of different lengths and route each condition to its matching region. If correct, the approach yields better dense caption following and shot timing than prior isolated methods.

Core claim

CineOrchestra is a unified video diffusion model that controls subjects, events, cameras, and shot transitions at once. The central claim is that these elements all function as entities acting over specific temporal intervals and can therefore share one conditioning structure of primitives augmented by reference images; this reduces the architectural problem to positional encoding, which the model solves with an interval-sampled temporal RoPE for consistent attention across varying durations and a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes them to the correct spatiotemporal locations.

What carries the argument

Entity-centric conditioning primitives express every cinematic element as an entity over a temporal interval. This reduces the design to a positional encoding problem, which is solved by two coordinated parameter-free rotary embeddings.
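As a rough illustration of what such primitives might look like in code, the sketch below models the structure described in the Figure 2 caption: each entity carries a global description, an optional reference image, and a list of timed events. The class names, field names, example prompts, and the file name `chef.png` are hypothetical and not taken from the paper. Treating the camera and the transitions as ordinary entities follows the paper's framing, but the exact schema is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    """One dense description tied to a temporal interval (in seconds)."""
    start: float
    end: float
    prompt: str

    def __post_init__(self) -> None:
        if not 0.0 <= self.start < self.end:
            raise ValueError("expected 0 <= start < end")


@dataclass
class Entity:
    """A visual subject, the camera, or the shot-transition track."""
    tag: str
    global_description: str
    events: List[Event] = field(default_factory=list)
    reference_image: Optional[str] = None  # hypothetical: path to an image

    def active_at(self, t: float) -> List[str]:
        """Return prompts of events whose interval [start, end) contains t."""
        return [e.prompt for e in self.events if e.start <= t < e.end]


# Hypothetical five-second scene with one subject, a camera, and one cut.
scene = [
    Entity("chef", "a chef in a white apron",
           [Event(0.0, 2.5, "chops onions"), Event(2.5, 5.0, "tosses a pan")],
           reference_image="chef.png"),
    Entity("camera", "handheld camera", [Event(0.0, 5.0, "slow push-in")]),
    Entity("transition", "shot boundaries", [Event(2.4, 2.6, "hard cut")]),
]

for entity in scene:
    print(entity.tag, entity.active_at(2.5))
```

Because every element shares the same `Entity` shape, a single conditioning path can consume all of them, which is the simplification the paper's argument rests on.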

If this is right

The model outperforms six per-axis specialist models on dense caption following and shot-transition timing benchmarks.
It produces consistent gains over baselines in pairwise user studies.
Ablations confirm that both the interval-sampled temporal RoPE and the 2D entity-temporal RoPE contribute to the unified performance.
The same framework simultaneously supports multi-subject personalization, temporal control, multi-shot synthesis and camera control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The parameter-free RoPE design could be reused for other conditioning signals that also vary in duration without requiring new learned embeddings.
If the entity-interval abstraction holds, similar unification might extend to audio tracks or 3D object trajectories within the same diffusion backbone.
Longer videos with many overlapping entities would test whether the 2D cross-attention RoPE continues to route conditions correctly at scale.

Load-bearing premise

That subjects, events, cameras and shot transitions all share the same fundamental structure as entities acting over temporal intervals and therefore fit into one shared conditioning structure.

What would settle it

A cinematic control task whose required element cannot be expressed as an entity with a bounded temporal interval, such as a global lighting shift not tied to any subject or event. If the abstraction is incomplete, the model should produce uncontrolled output or timing errors on such a task.

Figures

Figures reproduced from arXiv: 2606.13768 by Aliaksandr Siarohin, Hao Chen, Mukesh Singhal, Sergey Tulyakov, Sharath Girish, Tsai-Shien Chen, Zhikang Dong.

**Figure 1.** CineOrchestra generates cinematic scenes from unified conditioning. Our entity-centric conditioning represents every cinematic element (visual subjects, camera, and shot transitions) as a unified, timestamped expression (top), with optional reference images (top right). It enables CineOrchestra to generate cinematic frames in a single forward pass, jointly realizing multi-subject personalization, multi-eve…

**Figure 2.** Overview of CineOrchestra. Each entity $k$ is represented by a reference image $I_k$, a global description $g_k$, and a set of event-level dense descriptions $\{(t^s_{k,j}, t^e_{k,j}, e_{k,j})\}$ pairing temporal intervals with prompts. Reference image tokens are concatenated to the video tokens for full self-attention, while all text tokens are consumed via cross-attention. Two coordinated RoPEs, interval-sampled temporal…

**Figure 3.** Two coordinated RoPE designs. (a) Similarity between a video token and an event token across event durations L. Our β(L) rescaling produces duration-invariant peaks. (b) Cross-attention similarity between video/image queries and global/dense-description keys under Tab. 1’s coordinates. Sharp peaks emerge only where query and key share the same entity and overlap in time, jointly achieving entity disambigua…

**Figure 4.** Qualitative comparison of cinematic conditioning on CineBench. Given the entity-centric conditioning (top), CineOrchestra (top video row) simultaneously preserves all four subject identities, follows the dense per-entity timeline, and lands three hard cuts, outperforming all existing methods. More comparisons on CineBenchSyn can be found in App. D.

**Figure 5.** User study on CineBench. Pairwise preference of CineOrchestra against six baselines on eight dimensions (one radar panel per baseline). Each axis reports pref = W/(W + L), the share of decisive votes that favoured CineOrchestra, where W counts wins, L counts losses, and ties are excluded. The dashed circle marks the 50% tie line: points outside favour CineOrchestra, points inside favour the baseline. CineO…

**Figure 6.** Visual ablation of two coordinated RoPEs. Only the full method (top video row) routes each entity to its annotated interval and lands all four hard cuts at the specified times.

**Figure 7.** Closed-form β(L) and its three properties. (a) shows β(L) from Eq. (2) across event durations L in log-scale. The green marker verifies P1 (β(0) = 1). The curve is monotone (P2), upper-bounded by $\sqrt{d_{\text{rope}}/2}$ (red dashed) and approaches this asymptote only past $L \approx 1/\theta_{\min}$ (purple dashed), so the cinematic range sits well within the bounded regime. (b) verifies P3: normalized peak similarity $\max_{t_v} s(t_v; e)$ (medi…

**Figure 8.** Entity-centric captioning prompt. A single structured-output query to Gemini-2.5-Pro [20] returns entity tags, per-entity global descriptions, and dense <mm:ss.ff>-stamped event timelines for visual subjects, {camera}, and {transition} in one pass.

**Figure 9.** User study interface. Side-by-side rating UI shown to in-house raters. Each video is scored on a 1-5 scale across eight axes spanning visual quality (overall quality, motion realism), identity preservation (reference ID consistency), and prompt adherence (global description, dense caption, shot structure, camera, and scene). App. C.4 shows full instructions and per-axis question.

**Figure 10.** Additional qualitative comparison on CineBench.

**Figure 11.** Additional qualitative comparison on CineBench.

**Figure 12.** Additional qualitative comparison on CineBenchSyn.

**Figure 13.** Additional qualitative comparison on CineBenchSyn.

**Figure 14.** Additional qualitative comparison on CineBenchSyn.

**Figure 15.** Qualitative comparison at 720p resolution for CineOrchestra on CineBench.

**Figure 16.** Qualitative comparison at 720p resolution for CineOrchestra on CineBenchSyn.

**Figure 17.** Long video generation (40s) from CineOrchestra on CineBench.

**Figure 18.** Long video generation (40s) from CineOrchestra on CineBench.
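The preference score defined in the Figure 5 caption, pref = W/(W + L), can be illustrated with a small calculation. The function name and the vote counts below are made up for illustration and are not results from the paper.

```python
def preference(wins: int, losses: int) -> float:
    """Share of decisive votes won; ties are excluded before calling this."""
    if wins < 0 or losses < 0:
        raise ValueError("vote counts must be non-negative")
    decisive = wins + losses
    if decisive == 0:
        raise ValueError("no decisive votes, preference is undefined")
    return wins / decisive


# Hypothetical tally: 30 wins, 10 losses, 20 ties (ties are ignored).
print(preference(wins=30, losses=10))  # 0.75, i.e. outside the 50% tie line
```

Note that excluding ties means a score near 100% can still rest on few decisive votes, so the raw counts matter when reading the radar panels.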

Original abstract

Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations. Project page: https://snap-research.github.io/CineOrchestra
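To make the idea of a 2D entity-temporal rotary embedding more concrete, here is a minimal sketch under strong assumptions: half of the feature dimensions are rotated by an entity index and the other half by a time coordinate, using the standard RoPE frequency schedule. The half-and-half split, the base of 10000, and all function names are assumptions for illustration. The paper's actual coordinate assignment and its β(L) rescaling are not reproduced here.

```python
import math
from typing import List, Sequence


def rope_rotate(x: Sequence[float], pos: float, base: float = 10000.0) -> List[float]:
    """Standard RoPE: rotate each consecutive pair of x by pos * theta_i."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out: List[float] = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # frequency for pair i // 2
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out


def rope_2d(x: Sequence[float], entity: float, time: float) -> List[float]:
    """Assumed layout: first half encodes the entity axis, second half time."""
    half = len(x) // 2
    return rope_rotate(x[:half], entity) + rope_rotate(x[half:], time)


def score(q: Sequence[float], k: Sequence[float]) -> float:
    """Unnormalized attention logit (dot product)."""
    return sum(a * b for a, b in zip(q, k))


feat = [1.0] * 8  # identical content features isolate the positional effect
query = rope_2d(feat, entity=1, time=3.0)
same = score(query, rope_2d(feat, entity=1, time=3.0))
other_entity = score(query, rope_2d(feat, entity=2, time=3.0))
other_time = score(query, rope_2d(feat, entity=1, time=5.0))
print(same, other_entity, other_time)
```

With identical content features, the score is largest when both coordinates match, which is the routing behaviour the abstract describes. Learned queries and keys would change the exact values, and no parameters are added by the rotation itself.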

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CineOrchestra unifies four control axes in video diffusion via entity-centric intervals and two parameter-free RoPE variants, but the abstract gives no numbers to judge the gains.

The main thing to know is that this paper unifies multi-subject personalization, temporal control, multi-shot synthesis, and camera control in one video diffusion model by treating all of them as entities acting over temporal intervals. It reduces the problem to positional encoding and solves it with an interval-sampled temporal RoPE for consistent attention across different event lengths plus a 2D entity-temporal cross-attention RoPE to route conditions to the right spatiotemporal spots. Both are parameter-free and build on existing rotary ideas.

What the work does cleanly is frame the four axes as sharing the same basic structure, which lets them use reference images for visual entities and avoid separate modules for each control type. Adding two new benchmarks focused on dense caption following and shot-transition timing is a practical step that future papers can use.

The soft spots are straightforward. The abstract states outperformance over six specialists plus user-study gains and component ablations, yet supplies no numbers, error bars, dataset sizes, or table excerpts. That makes it impossible to tell from the abstract alone whether the improvements are large, consistent, or fairly measured. The core assumption, that heterogeneous elements really collapse to one entity-temporal structure without losing important interactions, sounds reasonable on paper but needs the full methods and failure cases to evaluate.

This is for people working on controllable text-to-video systems who want to move beyond single-axis conditioning. Readers already familiar with RoPE in diffusion models or multi-condition generation will see the most direct value. The thinking is clear and the citation pattern looks standard, so the paper has enough substance to go through peer review even if the results section needs tightening.

Referee Report

0 major / 2 minor

Summary. The paper introduces CineOrchestra, a unified video diffusion model for cinematic video generation that simultaneously controls subjects, events, cameras, and shot transitions. It models these heterogeneous elements as entities acting over temporal intervals via a shared entity-centric conditioning structure augmented with reference images, reducing the problem to positional encoding solved by two parameter-free coordinated rotary embeddings: an interval-sampled temporal RoPE and a 2D entity-temporal cross-attention RoPE. The work claims outperformance over six per-axis specialist models on two new benchmarks for dense caption following and shot-transition timing, plus consistent gains in pairwise user studies and component ablations.

Significance. If the empirical claims hold, the unification via entity-centric temporal intervals would represent a meaningful advance in controllable video generation by replacing multiple specialized models with a single framework. The parameter-free character of the two RoPE variants and the introduction of new benchmarks for caption following and transition timing are explicit strengths that could facilitate reproducibility and future comparisons.

minor comments (2)

Abstract: the claims of outperformance and 'consistent gains' are stated without any numerical metrics, error bars, dataset sizes, or specific benchmark scores, which limits immediate assessment of effect sizes even though the full results sections presumably contain them.
The description of the two RoPE variants would benefit from an explicit equation or pseudocode block showing how interval sampling is implemented in the temporal RoPE and how the 2D cross-attention RoPE routes conditions to spatiotemporal regions.
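In the spirit of that second comment, the following is one plausible reading of interval sampling, offered as a sketch rather than the authors' method: the tokens of an event description are assigned temporal positions spread evenly over the event's interval, and standard RoPE is applied at those positions. The midpoint sampling rule, the base of 10000, and all names are assumptions. The paper's closed-form β(L) rescaling is deliberately left out because its definition is not given in this summary.

```python
import math
from typing import List, Sequence


def interval_positions(start: float, end: float, n_tokens: int) -> List[float]:
    """Spread n_tokens positions evenly over [start, end] (bin midpoints)."""
    if n_tokens <= 0 or end <= start:
        raise ValueError("need n_tokens > 0 and end > start")
    step = (end - start) / n_tokens
    return [start + (i + 0.5) * step for i in range(n_tokens)]


def rope_rotate(x: Sequence[float], pos: float, base: float = 10000.0) -> List[float]:
    """Standard RoPE: rotate each consecutive pair of x by pos * theta_i."""
    if len(x) % 2 != 0:
        raise ValueError("feature dimension must be even")
    out: List[float] = []
    for i in range(0, len(x), 2):
        theta = base ** (-i / len(x))
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out


def rope_score(q: Sequence[float], k: Sequence[float],
               q_pos: float, k_pos: float) -> float:
    """Dot product after rotating q and k to their positions."""
    rq, rk = rope_rotate(q, q_pos), rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))


# Four event tokens covering a hypothetical event from 2 s to 4 s.
positions = interval_positions(2.0, 4.0, 4)
print(positions)

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
# The score depends only on the offset between positions, not on absolutes.
print(rope_score(q, k, 3.0, 1.0), rope_score(q, k, 12.0, 10.0))
```

The second print shows the relative-position property that makes rotary embeddings attractive here: shifting both positions by the same amount leaves the score unchanged, so no learned parameters are needed to place an event on the timeline.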

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of CineOrchestra, the recognition of its unified entity-centric approach, the parameter-free RoPE contributions, and the new benchmarks. The recommendation for minor revision is noted. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents a unification of cinematic controls via an entity-centric temporal interval structure that reduces the problem to two parameter-free RoPE variants. This builds directly on standard rotary embeddings without any fitted parameters renamed as predictions, without load-bearing self-citations for uniqueness theorems, and without ansatzes smuggled through prior work. New benchmarks are introduced rather than reusing fitted numbers from the same authors, and the derivation chain remains self-contained against external positional encoding methods.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no fitted free parameters and no new physical entities. It relies on the standard assumption that rotary positional embeddings can be extended in the described parameter-free manner to handle variable-duration intervals and per-entity disambiguation.

axioms (1)

domain assumption: Rotary positional embeddings can be adapted to interval-sampled temporal and 2D entity-temporal cross-attention settings while remaining parameter-free and yielding consistent attention behavior.
Invoked to solve the single positional encoding problem after reducing all cinematic elements to entity-temporal intervals.

Reference graph

Works this paper leans on

76 extracted references · 25 canonical work pages · 19 internal anchors

[1]

Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In CVPR, 2025.

2025
[2]

VD3d: Taming large video diffusion transformers for 3d camera control

Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David B. Lindell, and Sergey Tulyakov. VD3d: Taming large video diffusion transformers for 3d camera control. In ICLR, 2025.

2025
[3]

Recammaster: Camera-controlled generative rendering from a single video

Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. In ICCV, 2025.

2025
[4]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

2023
[5]

Align your latents: High-resolution video synthesis with latent diffusion models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR,
[6]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. Technical report, OpenAI, 2024.

2024
[7]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.

2021
[8]

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. VideoCrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.

2023
[9]

VideoCrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. VideoCrafter2: Overcoming data limitations for high-quality video diffusion models. In CVPR, 2024.

2024
[10]

Motion-conditioned diffusion model for controllable video synthesis

Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023.

2023
[11]

Panda-70m: Captioning 70m videos with multiple cross-modality teachers

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In CVPR, 2024.

2024
[12]

Multi-subject open-set personalization in video generation

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. Multi-subject open-set personalization in video generation. In CVPR, 2025.

2025
[13]

Omni-attribute: Open-vocabulary attribute encoder for visual concept personalization

Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, et al. Omni-attribute: Open-vocabulary attribute encoder for visual concept personalization. In CVPR, 2026.

2026
[14]

Canvas-to-image: Compositional image generation with multimodal controls

Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, and Kuan-Chieh Jackson Wang. Canvas-to-image: Compositional image generation with multimodal controls. ACM TOG, 2026.

2026
[15]

MAGREF: Masked guidance for any-reference video generation with subject disentanglement

Yufan Deng, Yuanyang Yin, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Shenghai Yuan, Yiding Yang, Angtian Wang, Bo Liu, Haibin Huang, and Chongyang Ma. MAGREF: Masked guidance for any-reference video generation with subject disentanglement. In ICLR, 2026.

2026
[16]

VIMI: Grounding video generation through multi-modal instruction

Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chien Wang, Ivan Skorokhodov, Graham Neubig, and Sergey Tulyakov. VIMI: Grounding video generation through multi-modal instruction. In EMNLP, 2024.

2024
[17]

Skyreels-a2: Compose anything in video diffusion transformers

Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, et al. Skyreels-a2: Compose anything in video diffusion transformers. arXiv preprint arXiv:2504.02436, 2025.

2025
[18]

An image is worth one word: Personalizing text-to-image generation using textual inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit Haim Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.

2023
[19]

Alchemint: Fine-grained temporal control for multi-reference consistent video generation

Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin, and Sergey Tulyakov. Alchemint: Fine-grained temporal control for multi-reference consistent video generation. In CVPR, 2026.

2026
[20]

Google. Gemini. https://aistudio.google.com/models/gemini-2-5-flash-image, 2025.

2025
[21]

Google. Veo 3. https://deepmind.google/models/veo/, 2025.

2025
[22]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In ICLR, 2024.

2024
[23]

Photorealistic video generation with diffusion models

Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Fei-Fei Li, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. In ECCV, 2024.

2024
[24]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, and Ofir Bibi. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024.

2024
[25]

Cameractrl: Enabling camera control for video diffusion models

Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for video diffusion models. In ICLR, 2025.

2025
[26]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

2022
[27]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.

2020
[28]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS Workshop, 2021.

2021
[29]

Video diffusion models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In NeurIPS, 2022.

2022
[30]

Conceptmaster: Multi-concept video customization on diffusion transformer models without test-time tuning

Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, and Kun Gai. Conceptmaster: Multi-concept video customization on diffusion transformer models without test-time tuning. arXiv preprint arXiv:2501.04698, 2025.

2025
[31]

Vace: All-in-one video creation and editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, and Yu Liu. Vace: All-in-one video creation and editing. In ICCV, 2025.

2025
[32]

Analyzing and improving the training dynamics of diffusion models

Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In CVPR, 2024.

2024
[33]

YOLOv11: An Overview of the Key Architectural Enhancements

Rahima Khanam and Muhammad Hussain. Yolov11: An overview of the key architectural enhancements. arXiv preprint arXiv:2410.17725, 2024.

2024
[34]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,
[35]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

2024
[36]

Multi-concept customization of text-to-image diffusion

Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In CVPR, 2023.

2023
[37]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.

2023
[38]

Phantom: Subject-consistent video generation via cross-modal alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. In ICCV, 2025.

2025
[39]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, 2024.

2024
[40]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023.

2023
[41]

Decoupled Weight Decay Regularization

I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

2017
[42]

Shotstream: Streaming multi-shot video generation for interactive storytelling

Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, and Tianfan Xue. Shotstream: Streaming multi-shot video generation for interactive storytelling. arXiv preprint arXiv:2603.25746, 2026.

2026
[43]

Snap video: Scaled spatiotemporal transformers for text-to-video synthesis

Willi Menapace, Aliaksandr Siarohin, Ivan Skorokhodov, Ekaterina Deyneka, Tsai-Shien Chen, Anil Kag, Yuwei Fang, Aleksei Stoliar, Elisa Ricci, Jian Ren, et al. Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. In CVPR, 2024.

2024
[44]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

2023
[45]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

2023
[46]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.

2024
[47]

Layercomposer: Interactive personalized t2i via spatially-aware layered canvas

Guocheng Gordon Qian, Ruihang Zhang, Tsai-Shien Chen, Yusuf Dalva, Anujraaj Argo Goyal, Willi Menapace, Ivan Skorokhodov, Meng Dong, Arpit Sahni, Daniil Ostashev, et al. Layercomposer: Interactive personalized t2i via spatially-aware layered canvas. arXiv preprint arXiv:2510.20820, 2025.

2025
[48]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

2021
[49]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.

2020
[50]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024. 20

2024
[51]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 3

2022
[52]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 3

2015
[53]

Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023. 3

2023
[54]

Instantbooth: Personalized text-to-image generation without test-time finetuning

Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning. In CVPR, 2024. 3

2024
[55]

Make-a-video: Text-to-video generation without text-video data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. In ICLR, 2023. 3

2023
[56]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015. 3

2015
[57]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019. 3

2019
[58]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024. 2, 5, 6, 15

2024
[59]

Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804, 2026. 20

2026
[60]

Attention is all you need

A Vaswani. Attention is all you need. In NeurIPS, 2017. 2, 3

2017
[61]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 2, 3, 17, 20

2025
[62]

Echoshot: Multi-shot portrait video generation

Jiahao Wang, Hualian Sheng, Sijia Cai, Weizhan Zhang, Caixia Yan, Yachuang Feng, Bing Deng, and Jieping Ye. Echoshot: Multi-shot portrait video generation. In NeurIPS, 2025. 2, 3, 4, 7, 9, 11, 20, 22, 23

2025
[63]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 9, 21

2024
[64]

InstantID: Zero-shot Identity-Preserving Generation in Seconds

Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024. 3

2024
[65]

Multishotmaster: A controllable multi-shot video generation framework

Qinghe Wang, Xiaoyu Shi, Baolu Li, Weikang Bian, Quande Liu, Huchuan Lu, Xintao Wang, Pengfei Wan, Kun Gai, and Xu Jia. Multishotmaster: A controllable multi-shot video generation framework. In CVPR, 2026. 2, 3, 4, 7, 9, 11, 20, 22, 23

2026
[66]

Internvid: A large-scale video-text dataset for multimodal understanding and generation

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, and Yu Qiao. Internvid: A large-scale video-text dataset for multimodal understanding and generation. In ICLR, 2024. 8, 20

2024
[67]

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022. 8, 20

2022
[68]

Motionctrl: A unified and flexible motion controller for video generation

Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In SIGGRAPH, 2024. 2, 3, 11

2024
[69]

Qwen-Image Technical Report

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report. arXiv preprint arXiv:2508.02324, 2025. 6, 7, 17, 19

2025
[70]

Cinetrans: Learning to generate videos with cinematic transitions via masked diffusion models

Xiaoxue Wu, Bingjie Gao, Yu Qiao, Yaohui Wang, and Xinyuan Chen. Cinetrans: Learning to generate videos with cinematic transitions via masked diffusion models. In ICLR, 2026. 2, 3, 4, 7, 9, 11, 20, 22, 23

2026
[71]

Mind the time: Temporally-controlled multi-event video generation

Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, and Sergey Tulyakov. Mind the time: Temporally-controlled multi-event video generation. In CVPR, 2025. 2, 3, 5

2025
[72]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. In ICLR, 2025. 2, 3

2025
[73]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023. 3

2023
[74]

Tora2: Motion and appearance customized diffusion transformer for multi-entity video generation

Zhenghao Zhang, Junchao Liao, Xiangyu Meng, Long Qin, and Weizhi Wang. Tora2: Motion and appearance customized diffusion transformer for multi-entity video generation. In ACM MM, 2025. 3

2025
[75]

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023. 19

2023
[76]

Cami2v: Camera-controlled image-to-video diffusion model

Guangcong Zheng, Teng Li, Rui Jiang, Yehao Lu, Tao Wu, and Xi Li. Cami2v: Camera-controlled image-to-video diffusion model. arXiv preprint arXiv:2410.15957, 2024. 3, 11

2024

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation Supplementary Material

A Derivation and Properties of β(L)

This appendix derives the closed form of the d...
