EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.
Onestory: Coherent multi-shot video generation with adaptive memory.arXiv preprint arXiv:2512.07802, 2025
6 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 6years
2026 6verdicts
UNVERDICTED 6representative citing papers
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.
FlowNar achieves bounded memory and 3x higher throughput for streaming narration on Ego4D, EgoExo4D, and EpicKitchens100 by combining dynamic historical context removal with a Cross Linear Attentive Memory module.
Introduces CineDance-1M dataset for multi-shot long-form text-to-audio-video generation along with CineBench and a model adaptation.
A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.
citing papers explorer
-
EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation
EM-Vid introduces an entity-centric latent patch memory bank with sparse token conditioning and budgeted updates for training-free consistent multi-shot video generation.
-
Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation
Echo-Forcing decouples stable anchors, compressed history, and recent dynamics in video diffusion KV caches using hierarchical memory, scene recall frames, and difference-aware decay to support interactive long video generation under bounded cache.
-
Echo-Memory: A Controlled Study of Memory in Action World Models
A controlled study finds that block-wise state-space recurrence outperforms other memory designs for open-domain scene return in action-conditioned video models, and that standard replay metrics do not adequately measure memory quality.
-
FlowNar: Scalable Streaming Narration for Long-Form Videos
FlowNar achieves bounded memory and 3x higher throughput for streaming narration on Ego4D, EgoExo4D, and EpicKitchens100 by combining dynamic historical context removal with a Cross Linear Attentive Memory module.
-
CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation
Introduces CineDance-1M dataset for multi-shot long-form text-to-audio-video generation along with CineBench and a model adaptation.
-
One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems
A hierarchical multi-agent framework converts a single sentence into a short drama using debate-based scripting, 3D-grounded first frames for spatial consistency, and multi-stage reviewer loops.