MetaWorld scales multi-agent video world models from single-view videos using monocular decomposition into ego-motion and trajectories, subject-aware generation, and cross-attention alignment for consistency.
Harmony: Harmonizing audio and video generation through cross-task synergy
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6verdicts
UNVERDICTED 6roles
background 1polarities
background 1representative citing papers
MTAVG-Bench 2.0 is a new benchmark that evaluates omni LLMs on diagnosing high-level cinematic failures in multi-talker audio-video generation using a taxonomy of acting, narrative, atmosphere, and audio-visual language.
Unison presents a unified audio-video generation model that decouples speech and sound effects while using bidirectional forcing to synchronize with motion, claiming SOTA perceptual quality and alignment.
Introduces CineDance-1M dataset for multi-shot long-form text-to-audio-video generation along with CineBench and a model adaptation.
Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.
ST-DRC proposes latent in-context injection, TASS-RoPE, appearance-invariant augmentation, and three-stream guidance to improve identity preservation in text-to-video diffusion models built on LTX-2.3.
citing papers explorer
No citing papers match the current filters.