Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

2026 · cs.CV · arXiv 2605.08729

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.

representative citing papers

DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.

InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

cs.CV · 2026-06-22 · unverdicted · novelty 6.0

InteractiveAvatar is a real-time infinite-streaming avatar video generation system using autoregressive distillation, Long-Short Visual Memory for consistency, and a Reasoning-Reaction Module for intent-aware interactions.

citing papers explorer


DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

cs.CV · 2026-07-01 · unverdicted · none · ref 9 · internal anchor

InteractiveAvatar: Real-Time Streaming Video Generation for Consistent and Intent-Aware Avatars

cs.CV · 2026-06-22 · unverdicted · none · ref 4 · internal anchor
