EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

Chun-Hao P. Huang; Hyeonho Jeong; Jente Vandersanden; Matheus Gadelha; Yulia Gryaditskaya

An entity-indexed bank of latent patches enables training-free consistent multi-shot video generation by restricting attention to relevant tokens.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-25 04:54 UTC pith:5ZKBXQEU

Load-bearing objection: Entity-centric latent patch memory with sparse conditioning is a practical engineering response to leakage in autoregressive video, but the abstract supplies zero numbers or ablations to check if it actually works.

arXiv:2605.23610 v1 · pith:5ZKBXQEU · submitted 2026-05-22 · cs.CV, cs.AI

classification cs.CV cs.AI

keywords: multi-shot video generation, entity consistency, latent patches, sparse token conditioning, training-free, prompt adherence, memory bank

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to maintain consistent appearances of recurring entities across multiple video shots while staying faithful to shot-specific text prompts. Existing autoregressive methods store full previous frames as memory, which entangles persistent entity details with transient scene context and causes irrelevant information leakage plus high computational cost. The proposed approach replaces this with an entity-centric memory consisting of an entity-indexed bank of latent patches, supported by sparse token conditioning that limits self-attention to entity-relevant tokens, a budgeted memory update strategy, and noise injection for appearance control. A structured multi-shot script format is also introduced to guide the process. All components work with pretrained models without any additional training.

Core claim

The paper claims that an entity-centric memory in the form of an entity-indexed bank of latent patches, combined with sparse token conditioning that restricts self-attention to entity-relevant tokens, a budgeted memory update strategy, and a noise-injection mechanism, allows pretrained models to generate multi-shot videos with improved prompt adherence and efficiency while preserving subject consistency.

What carries the argument

Entity-indexed bank of latent patches paired with sparse token conditioning that restricts self-attention to entity-relevant tokens.
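A minimal sketch of what "restricting self-attention to entity-relevant tokens" can look like, assuming single-head scaled dot-product attention over plain Python lists. The function name `sparse_attention` and the boolean `keep` vector are illustrative assumptions, not the paper's implementation. Tokens that are not kept are dropped before any score is computed, which gives the same result as masking them out while also skipping their cost.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(queries, keys, values, keep):
    """Single-head attention in which only key/value tokens flagged in
    `keep` are visible to the queries (hypothetical helper)."""
    kept = [i for i, flag in enumerate(keep) if flag]
    if not kept:
        raise ValueError("at least one token must be kept")
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Scores are computed for kept tokens only; dropped tokens cost nothing.
        scores = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in kept]
        weights = softmax(scores)
        outputs.append(
            [sum(w * values[i][d] for w, i in zip(weights, kept)) for d in range(dim)]
        )
    return outputs


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0], [100.0], [3.0]]
# Token 1 is treated as non-entity context and is never attended to.
print(sparse_attention(queries, keys, values, keep=[True, False, True]))
```

Because the excluded token never enters the softmax, its value cannot influence the output. That is the leakage-prevention property the review attributes to the method, and it is also why any scene context carried only by excluded tokens is lost.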

Load-bearing premise

Restricting self-attention to only entity-relevant tokens will not remove necessary scene context or degrade overall generation quality for non-entity elements.

What would settle it

Compare generation quality and scene coherence of non-entity background elements in videos produced with the sparse conditioning versus full self-attention on the same prompts; degradation in the sparse case would falsify the claim.
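The comparison described above is a paired design: the same prompts are scored under both attention settings. A hedged sketch of the bookkeeping follows; it assumes some background-quality score per prompt is already available (higher is better). The function `paired_summary`, the `tolerance` parameter, and the example scores are hypothetical and do not come from the paper.

```python
from statistics import mean


def paired_summary(full_scores, sparse_scores, tolerance=0.0):
    """Summarise per-prompt score differences (sparse minus full).

    Both lists must be aligned so that index i refers to the same prompt.
    """
    if len(full_scores) != len(sparse_scores) or not full_scores:
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [s - f for f, s in zip(full_scores, sparse_scores)]
    return {
        "n": len(diffs),
        "mean_diff": mean(diffs),
        # Number of prompts where sparse conditioning scores worse
        # than full attention by more than `tolerance`.
        "sparse_worse": sum(d < -tolerance for d in diffs),
    }


# Illustrative numbers only.
print(paired_summary([0.80, 0.70, 0.90], [0.80, 0.60, 0.95]))
```

A consistently negative `mean_diff` across many prompts would count against the load-bearing premise; a difference near zero would support it.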

If this is right

Recurring entities maintain consistent appearance across separate shots.
Computational cost drops because full frames are not stored and attention is limited to relevant tokens.
Shot-specific text prompts are followed more closely without interference from prior scene context.
Noise injection gives fine-grained control over entity appearance without leakage of unrelated details.
Memory stays compact through budgeted updates while supporting structured multi-shot scripts.
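The efficiency item in the list above can be illustrated with a back-of-the-envelope estimate. The sketch assumes standard dense self-attention, whose cost grows quadratically with sequence length, and ignores all other layers. The symbols $N$ (tokens of the shot being generated), $M$ (tokens of a full-frame memory), $m \le M$ (entity-relevant memory tokens that are kept) and $d$ (channel width) are introduced here for illustration and are not taken from the paper.

```latex
C_{\text{full}} \propto (N + M)^2\, d,
\qquad
C_{\text{sparse}} \propto (N + m)^2\, d,
\qquad
\frac{C_{\text{sparse}}}{C_{\text{full}}} = \left( \frac{N + m}{N + M} \right)^{2}
```

As a purely hypothetical example, if $M = N$ and $m = N/4$, the ratio is $(1.25 / 2)^2 \approx 0.39$ for the attention term alone. Measured savings depend on the actual token counts and on the rest of the network.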

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could support longer video sequences on limited hardware by keeping memory usage low.
Sparse conditioning might transfer to other consistency tasks such as sequential image editing or 3D asset reuse.
Script-based input could simplify user workflows for narrative video creation beyond current frame-by-frame methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Entity-centric latent patch memory with sparse conditioning is a practical engineering response to leakage in autoregressive video, but the abstract supplies zero numbers or ablations to check if it actually works.

The main takeaway is that this paper replaces full-frame memory with an entity-indexed bank of latent patches, adds sparse token conditioning to limit attention to entity-relevant tokens, includes a budgeted update rule, and uses noise injection for appearance control. It also introduces a structured multi-shot script format. Together, and without any additional training, these pieces target the leakage and compute-cost problems of prior autoregressive approaches to multi-shot video generation.
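To make the "entity-indexed bank" and "budgeted update rule" concrete, here is a minimal sketch under stated assumptions. The class `EntityBank`, the `PatchEntry` fields, and in particular the oldest-first eviction policy are hypothetical choices made for illustration; this review does not specify how the paper decides which entries to keep.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchEntry:
    shot: int      # shot the patch was extracted from
    frame: int     # frame index within that shot
    xy: tuple      # (x, y) patch coordinates in the latent grid
    latent: tuple  # flattened latent values of the patch


class EntityBank:
    """Maps an entity identifier to at most `budget` stored patches."""

    def __init__(self, budget):
        if budget < 1:
            raise ValueError("budget must be at least 1")
        self.budget = budget
        self._entries = defaultdict(list)

    def add(self, entity_id, entry):
        entries = self._entries[entity_id]
        entries.append(entry)
        # Assumed policy: drop the oldest entries once over budget.
        del entries[:-self.budget]

    def retrieve(self, entity_ids):
        """Return stored patches for the requested entities only."""
        return {eid: list(self._entries.get(eid, [])) for eid in entity_ids}


bank = EntityBank(budget=2)
for shot in range(3):
    bank.add("fox", PatchEntry(shot=shot, frame=0, xy=(1, 2), latent=(0.1, 0.2)))
print([entry.shot for entry in bank.retrieve(["fox"])["fox"]])
```

Retrieval by identifier is what keeps unrelated scene content out of the conditioning set: an entity that is not named for the current shot contributes no tokens.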

Referee Report

2 major / 1 minor

Summary. The manuscript introduces EM-Vid, a training-free method for multi-shot video generation. It replaces full-frame memory reuse with an entity-indexed bank of latent patches, introduces sparse token conditioning that restricts self-attention to entity-relevant tokens, proposes a structured multi-shot script format, a budgeted memory update, and a noise-injection mechanism for appearance control. The central claim is that these components together improve prompt adherence and computational efficiency while preserving subject consistency across shots.
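The noise-injection component can be sketched as follows, assuming noise is applied only to boundary patches (those only partly covered by the entity mask), as the figure captions reproduced on this page suggest. The variance-preserving blend, the `strength` parameter, and the function names are assumptions made for illustration; the paper's exact formulation is not given in this review.

```python
import math
import random


def inject_noise(latent, strength, rng):
    """Blend a latent vector with Gaussian noise.

    strength = 0 returns the latent unchanged; strength = 1 returns pure noise.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    keep = math.sqrt(1.0 - strength ** 2)
    return [keep * z + strength * rng.gauss(0.0, 1.0) for z in latent]


def noise_boundary_patches(patches, coverage, strength, rng, full=1.0):
    """Noise only patches whose entity-mask coverage is below `full`."""
    return [
        inject_noise(patch, strength, rng) if cov < full else list(patch)
        for patch, cov in zip(patches, coverage)
    ]


rng = random.Random(0)
patches = [[1.0, 2.0], [3.0, 4.0]]
coverage = [1.0, 0.4]  # second patch straddles the entity boundary
print(noise_boundary_patches(patches, coverage, strength=0.5, rng=rng))
```

Interior patches pass through untouched, so the subject's appearance is preserved while background pixels caught in boundary patches are partly overwritten.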

Significance. If the empirical claims hold, the work would supply a practical, training-free route to entity-consistent multi-shot video generation that avoids the storage and leakage costs of full-frame memory. The compatibility with pretrained models and the explicit handling of entity patches are clear strengths that could be adopted by existing autoregressive pipelines.

major comments (2)

[Abstract] The abstract asserts improvements in prompt adherence, efficiency, and consistency but supplies no quantitative results, ablation data, or implementation details; without these, the central claim cannot be verified.
[Method (sparse conditioning)] Sparse token conditioning mechanism (described in abstract and method): the claim that restricting self-attention to entity-relevant tokens from the latent patch bank preserves overall generation quality rests on the untested premise that non-entity scene context (background, lighting, inter-shot relations) remains recoverable; no analysis or experiment addresses whether this restriction removes necessary context.

minor comments (1)

[Abstract] The structured multi-shot script format is introduced but receives only a one-sentence mention; a short example or formal definition would clarify how it interfaces with the memory bank.
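The kind of short example the referee asks for might look like the sketch below. It is purely hypothetical: the paper's actual script syntax is not reproduced in this review, so the `<entity_id>` tagging convention, the `Shot` fields, and the sample prompts are all invented for illustration. The point is only to show how entity identifiers extracted from a shot prompt could serve as lookup keys into the memory bank.

```python
import re
from dataclasses import dataclass

# Hypothetical convention: recurring entities are tagged as <entity_id>.
ENTITY_TAG = re.compile(r"<([a-z][a-z0-9_]*)>")


@dataclass(frozen=True)
class Shot:
    scene: str   # label shared by shots set in the same location
    prompt: str  # shot-specific text prompt containing entity tags

    def entity_ids(self):
        """Entity identifiers in order of first mention, without duplicates."""
        return list(dict.fromkeys(ENTITY_TAG.findall(self.prompt)))


script = [
    Shot("forest", "<fox> trots along a mossy path at dawn"),
    Shot("forest", "<fox> stops as <owl> lands on a branch above"),
    Shot("riverbank", "<owl> glides low over the water while <owl> hoots"),
]

for index, shot in enumerate(script, start=1):
    # These identifiers would be the keys used to query the entity bank.
    print(index, shot.scene, shot.entity_ids())
```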

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. Below we respond point by point to the major comments and indicate the planned revisions.

Referee: [Abstract] The abstract asserts improvements in prompt adherence, efficiency, and consistency but supplies no quantitative results, ablation data, or implementation details; without these, the central claim cannot be verified.

Authors: We agree that the abstract would be strengthened by including quantitative support for the stated improvements. In the revised manuscript we will update the abstract to report key metrics from our experiments, including subject consistency scores, prompt adherence rates, and efficiency gains such as reduced memory footprint and inference time.

revision: yes

Referee: [Method (sparse conditioning)] Sparse token conditioning mechanism (described in abstract and method): the claim that restricting self-attention to entity-relevant tokens from the latent patch bank preserves overall generation quality rests on the untested premise that non-entity scene context (background, lighting, inter-shot relations) remains recoverable; no analysis or experiment addresses whether this restriction removes necessary context.

Authors: The referee correctly notes the absence of a dedicated analysis on whether non-entity context is fully recoverable under sparse conditioning. While our main experiments demonstrate maintained generation quality, we did not isolate this aspect. We will add a targeted experiment or discussion in the revision that compares full versus sparse attention with respect to background and lighting consistency, using both qualitative examples and quantitative scene-coherence metrics.

revision: yes

Circularity Check

0 steps flagged

No circularity; new method construction with no self-referential reductions

The paper introduces a novel entity-centric memory bank, sparse token conditioning, structured script format, budgeted update, and noise-injection mechanism as a training-free approach. No equations, fitted parameters, or predictions are described that reduce by construction to the method's own inputs or prior self-citations. Claims of improved prompt adherence and efficiency are presented as empirical outcomes of the new architecture rather than tautological redefinitions. The derivation chain is self-contained as an original construction without load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on compatibility of pretrained models with the new conditioning scheme and on the ability to identify and index entity patches reliably; no free parameters or invented physical entities are named in the abstract.

axioms (1)

Domain assumption: Pretrained video generation models remain functional when self-attention is restricted to entity-relevant tokens.
The training-free claim depends on this compatibility being sufficient for high-quality output.

pith-pipeline@v0.9.0 · 5697 in / 1223 out tokens · 22531 ms · 2026-05-25T04:54:50.375802+00:00 · methodology

Original abstract

Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible with pretrained models, restricting self-attention to entity-relevant tokens and reducing computational cost. To support this, we introduce a structured multi-shot script format. We additionally propose a budgeted memory update strategy to maintain a compact, evolving memory. Finally, we equip the entity representation with a noise-injection mechanism that enables fine-grained appearance control, preventing leakage of irrelevant information. Our method improves prompt adherence and efficiency while preserving subject consistency.

Figures

Figures reproduced from arXiv: 2605.23610 by Chun-Hao P. Huang, Hyeonho Jeong, Jente Vandersanden, Matheus Gadelha, Yulia Gryaditskaya.

**Figure 2.** Generated shots with varying context sizes for the prompt …

**Figure 3.** Memory frames to video architecture. Conditioning is implemented through two complementary mechanisms: time-wise conditioning, where memory frames are encoded and concatenated along the temporal dimension, and channel-wise conditioning, where information is fused via concatenation of noisy latent channels $z_t$, binary mask channels $z_{\text{mask}}$, and clean conditioning channels $z_c$. During diffusion, only $z_t$ are …

**Figure 4.** Adding entry to the entity bank. We introduce an entity bank E that stores VAE-space patches for each subject or scene entity. Patches are extracted from memory frames based on overlap with the segmented entity region (pink region in the VAE latent) and stored with their spatial coordinates $(x, y)$ and frame index $f_i$. Each entry is further enriched with region-level descriptors (DINO features and CLIP regi…

**Figure 5.** Inference-time sparse memory mechanism. For each target shot, entity identifiers are extracted from the abstract prompt and used to retrieve the corresponding entries from the entity bank. Sparse latent patches are scattered back to their original spatio-temporal locations to reconstruct a dense latent tensor before patchification (1), preserving consistency with the pretrained video model’s convolution-ba…

**Figure 6.** Qualitative comparison between our method and the baselines.

**Figure 7.** Substantial efficiency gains. Efficiency comparison against the full-frame memory baseline of Zhang et al. [2025b]. We report memory-token count and per-shot inference time, averaged over the 20 multi-shot video generation experiments in Sec. 4.1. Our sparse memory representation substantially reduces both memory-token usage and inference time. Even with scene memory, our method remains efficient by loadin…

**Figure 8.** Additional qualitative results for EM-Vid on multi-shot video generation. We observe that EM-Vid maintains subject consistency while adhering to the prompt’s narrative. More examples and full playable versions can be found in the supplementary viewer.

**Figure 9.** Noise injection reduces background leakage at entity boundaries. Because patch-level selection is coarser than pixel-level segmentation, boundary patches can contain background pixels. These pixels may leak into the generated video, causing artifacts such as the white “glow” around the subject. Adding noise suppresses this leakage while preserving the subject, improving visual quality and intra-shot (VBenc…

**Figure 10.** Fine-grained entity appearance control via noise injection. We demonstrate the importance of discarding irrelevant tokens and adding noise to boundary tokens when the prompt specifies localized entity modifications, particularly when it is semantically similar to the memory frame. Appearance changes relative to the original entity are highlighted in red in the prompt. With a holistic entity representation…

**Figure 11.** How the entity bank is updated over time. Panels: entity bank content and generated shots (Shot 1, Shot 3, Shot 9).

**Figure 12.** Token sparsity, not sparse patchification, drives speedup. Performance comparison between sparse and dense patchification strategies (see Sec. E). Sparse patchification is slightly faster in most cases, but the difference between sparse and dense patchification is modest overall. The main performance gain of our method instead comes from conditioning the transformer on only a sparse subset of tokens, whic…

**Figure 13.** Preserving the pre-trained memory layout matters. Visual comparison between sparse and dense patchification strategies (see Sec. E). While both strategies produce semantically consistent results, sparse patchification yields noticeable visual artifacts. We attribute this to a train-test mismatch: the pre-trained M2V transformer was trained on patchified latent grids from full memory frames, whereas sparse…

**Figure 14.** Long-form video generation. Our memory representation can also be used to autoregressively generate a single long video. We show frames from a 50-second video generated by autoregressive rollout using the pre-trained M2V model together with our memory representation. The memory helps preserve subject consistency over long temporal gaps, even when subjects leave the frame and later reappear. For example, i…

**Figure 15.** Comparison to a keyframe-based iterative generation. We compare our method with a keyframe-based pipeline in which FLUX.2 generates the first frame of each shot, and Wan2.2-I2V animates the resulting keyframes. FLUX.2 is conditioned on the previously generated keyframe and instructed to preserve subject identity. While this approach often maintains local consistency between adjacent keyframes, subject att…
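The captions for Figures 4 and 5 describe two mechanical steps: selecting latent patches by their overlap with a segmented entity region, and scattering the stored sparse patches back to their original grid locations to rebuild a dense tensor. A minimal single-frame sketch follows. The function names, the `min_overlap` threshold, and the use of one scalar per patch in place of a latent vector are assumptions made for illustration, not the paper's implementation.

```python
def select_entity_patches(mask, patch_size, min_overlap=0.5):
    """Return (x, y) patch-grid coordinates whose mask coverage is at
    least `min_overlap`. Rows and columns that do not fill a whole
    patch are ignored."""
    height, width = len(mask), len(mask[0])
    area = patch_size * patch_size
    selected = []
    for py in range(0, height - patch_size + 1, patch_size):
        for px in range(0, width - patch_size + 1, patch_size):
            covered = sum(
                mask[py + dy][px + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            )
            if covered / area >= min_overlap:
                selected.append((px // patch_size, py // patch_size))
    return selected


def scatter_to_dense(entries, grid_w, grid_h, fill=0.0):
    """Place sparse per-patch values at their grid coordinates;
    every other cell receives `fill`."""
    dense = [[fill] * grid_w for _ in range(grid_h)]
    for (x, y), value in entries:
        dense[y][x] = value
    return dense


mask = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
coords = select_entity_patches(mask, patch_size=2)
print(coords)
print(scatter_to_dense([(xy, 7.0) for xy in coords], grid_w=2, grid_h=2))
```

Lowering `min_overlap` admits more boundary patches, which is exactly where the Figure 9 caption reports background pixels leaking in.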

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FilmWorld: Agentic Novel-to-Film Generation through Dynamic Cinematic World Modeling
cs.CV 2026-07 conditional novelty 7.0

FilmWorld generates multi-scene films from novels by materializing an explicit evolving world-state trajectory and rendering shots in parallel, beating five agents on its own FilmEval benchmark.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1] OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory. arXiv preprint arXiv:2512.07802 (2025). Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al

[2] SAM 3: Segment Anything with Concepts. arXiv preprint arXiv:2511.16719 (2025). Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. 2025b. Multi-subject open-set personalization in video generation. In Proceedings of the Computer Vision and Pattern ...

[3] Seedance 1.0: Exploring the Boundaries of Video Generation Models. arXiv preprint arXiv:2506.09113 (2025). Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang

[4] Long context tuning for video generation. arXiv preprint arXiv:2503.10589 (2025). Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al

[5] LTX-Video: Realtime Video Latent Diffusion. arXiv preprint arXiv:2501.00103 (2024). Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, and Xiaodan Liang

[6] StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration. arXiv preprint arXiv:2411.04925 (2024). Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman

[7] Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion. arXiv preprint arXiv:2506.08009 (2025). Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al

[8] MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation. arXiv preprint arXiv:2510.18692 (2025). Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, and Wangmeng Zuo

[9] LoViC: Efficient Long Video Generation with Context Compression. arXiv preprint arXiv:2507.12952 (2025). Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M Rehg, and Tobias Hinz

[10] HunyuanVideo: A Systematic Framework For Large Video Generative Models. arXiv preprint arXiv:2412.03603 (2024). Black Forest Labs

[11] Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems 36 (2023), 30146–30166. Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Gen Li, Siyu Zhou, Qian He, and Xinglong Wu

[12] Phantom: Subject-consistent video generation via cross-modal alignment. arXiv preprint arXiv:2502.11079 (2025). Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei

[13] ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation. arXiv preprint arXiv:2512.07328 (2025). Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, et al

[14] HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives. arXiv preprint arXiv:2510.20822 (2025). Team Seawead, Ceyuan Yang, Zhijie Lin, Yang Zhao, Shanchuan Lin, Zhibei Ma, Haoyuan Guo, Hao Chen, Lu Qi, Sen Wang, et al

[15] Seaweed-7b: Cost-effective training of video generation foundation model. arXiv preprint arXiv:2504.08685 (2025). Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. 2025a. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025). Team Wan, Ang Wang...

[16] MultiShotMaster: A Controllable Multi-Shot Video Generation Framework. arXiv preprint arXiv:2512.03041 (2025). Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang

[17] Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance. arXiv preprint arXiv:2406.07209 (2024). Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al

[18] InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation. arXiv preprint arXiv:2307.06942 (2023). Weijia Wu, Zeyu Zhu, and Mike Zheng Shou. 2025b. Automated movie generation via multi-agent cot planning. arXiv preprint arXiv:2503.07314 (2025). Xiaofei Wu, Guozhen Zhang, Zhiyong Xu, Yuan Zhou, Qinglin Lu, and Xuming He. 2025a. P...

[19] Captain cinema: Towards short movie generation. arXiv preprint arXiv:2507.18634 (2025). Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang

[20] IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv preprint arXiv:2308.06721 (2023). Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, and Xingang Pan. 2025b. StoryMem: Multi-shot Long Video Storytelling with Memory. arXiv preprint arXiv:2512.19539 (2025). Lvmin Z...

[21] Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXiv:2504.12626 (2025). Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, and Maneesh Agrawala. 2025a. Pretraining Frame Preservation in Autoregressive Video Memory Compression. arXiv preprint arXiv:2512.23851 (202...

[22] Moviedreamer: Hierarchical generation for coherent long visual sequence. arXiv preprint arXiv:2407.16655 (2024). Yupeng Zhou, Daquan Zhou, Ming-Ming Cheng, Jiashi Feng, and Qibin Hou

[23] Storydiffusion: Consistent self-attention for long-range image and video generation. Advances in Neural Information Processing Systems 37 (2024). Shaobin Zhuang, Kunchang Li, Xinyuan Chen, Yaohui Wang, Ziwei Liu, Yu Qiao, and Yali Wang
[24] Detailed statistics of the used story scripts.

| Statistic | Value |
| --- | --- |
| Scene statistics | |
| Indoor / outdoor shots (%) | 25.7 / 74.3 |
| Avg. scenes per script | 4.1 |
| Avg. shots per scene | 2.4 |
| Entity statistics | |
| Avg. recur. char./obj. per script | 1.3 (char.) / 1.1 (obj.) |
| Min / max recur. entities per script | 2 / 3 |
| Avg. unique entities per script | 2.4 |
| Avg. char. per shot | 1.2 |
| Avg. o... | |
