MetaWorld scales multi-agent video world models from single-view videos using monocular decomposition into ego-motion and trajectories, subject-aware generation, and cross-attention alignment for consistency.
arXiv preprint arXiv:2510.18775 (2025) 1
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 3years
2026 3verdicts
UNVERDICTED 3representative citing papers
Introduces CineDance-1M dataset for multi-shot long-form text-to-audio-video generation along with CineBench and a model adaptation.
PixelWizard decouples global structure from fine details via a spatiotemporal anchor and introduces Noise-Span Aligned Shortcut Training with biased sampling to achieve over 10x faster sampling for high-fidelity 2K/4K video generation.
citing papers explorer
-
MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data
MetaWorld scales multi-agent video world models from single-view videos using monocular decomposition into ego-motion and trajectories, subject-aware generation, and cross-attention alignment for consistency.
-
CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation
Introduces CineDance-1M dataset for multi-shot long-form text-to-audio-video generation along with CineBench and a model adaptation.
-
PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution
PixelWizard decouples global structure from fine details via a spatiotemporal anchor and introduces Noise-Span Aligned Shortcut Training with biased sampling to achieve over 10x faster sampling for high-fidelity 2K/4K video generation.