A text-conditioned video-to-music model that plans global structure autoregressively from video and text, then synthesizes coherent audio via diffusion transformers, outperforming video-only baselines with 2.21x faster inference.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.SD (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation