Advances in Neural Information Processing Systems 37, 65618–65642 (2024)
2 Pith papers cite this work.
Citation-role summary: dataset 1, method 1
Fields: cs.CV 2
Years: 2026 2
citing papers explorer
- Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation
  Unison presents a unified audio-video generation model that decouples speech and sound effects while using bidirectional forcing to synchronize with motion, claiming SOTA perceptual quality and alignment.
- Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
  Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.