A single generative model uses twin DiT backbones with blockwise cross-attention and scaled-RoPE timing exchange to synthesize synchronized audio-video directly.
Deepaudio-v1: Towards multi-modal multi-stage end-to-end video to speech and au- dio generation
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
verdicts
UNVERDICTED 2representative citing papers
Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.
citing papers explorer
-
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
A single generative model uses twin DiT backbones with blockwise cross-attention and scaled-RoPE timing exchange to synthesize synchronized audio-video directly.
-
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.