Video-to-audio generation with hidden alignment

Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu · 2024 · arXiv 2407.07464

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

AudioMoG: Guiding Audio Generation with Mixture-of-Guidance

cs.SD · 2025-09-28 · unverdicted · novelty 7.0

AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.

Movie Gen: A Cast of Media Foundation Models

cs.CV · 2024-10-17 · unverdicted · novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

AMAVA: Adaptive Motion-Aware Video-to-Audio Framework for Visually-Impaired Assistance

cs.CV · 2026-04-26 · unverdicted · novelty 4.0

AMAVA is an adaptive motion-aware video-to-audio framework that switches between scene descriptions and safety sound cues based on detected movement, with a user study showing increased confidence when added to a white cane.

citing papers explorer

AudioMoG: Guiding Audio Generation with Mixture-of-Guidance · cs.SD · 2025-09-28 · unverdicted · none · ref 72
Movie Gen: A Cast of Media Foundation Models · cs.CV · 2024-10-17 · unverdicted · none · ref 76
AMAVA: Adaptive Motion-Aware Video-to-Audio Framework for Visually-Impaired Assistance · cs.CV · 2026-04-26 · unverdicted · none · ref 8
