Training-Free Multimodal Guidance for Video to Audio Generation

Aurelio Uncini; Danilo Comminiello; Eleonora Grassucci; Fabio Antonacci; Giordano Cicchetti; Giuliano Galadini

arxiv: 2509.24550 · v1 · pith:PQDOAUWMnew · submitted 2025-09-29 · 💻 cs.LG · cs.SD

Training-Free Multimodal Guidance for Video to Audio Generation

Eleonora Grassucci , Giuliano Galadini , Giordano Cicchetti , Aurelio Uncini , Fabio Antonacci , Danilo Comminiello This is my paper

Pith reviewed 2026-05-21 22:14 UTC · model grok-4.3

classification 💻 cs.LG cs.SD

keywords video-to-audio generationmultimodal diffusion guidancetraining-free methoddiffusion modelsmodality embeddingsunified alignmentperceptual quality

0 comments

The pith

A training-free guidance signal from the volume spanned by video, audio and text embeddings improves alignment in any pretrained audio diffusion model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a multimodal diffusion guidance method that computes a control signal from the volume spanned by embeddings of video, audio and text. This signal enforces unified alignment across the three modalities and can be added to any existing audio diffusion model without retraining or fine-tuning. Experiments on VGGSound and AudioCaps show gains in perceptual quality and multimodal coherence over baselines that rely on pairwise similarities or full joint training. A sympathetic reader would care because current video-to-audio systems either demand large paired datasets and expensive retraining or miss global coherence. If the approach holds, it offers a lightweight way to upgrade existing diffusion models for more realistic Foley and video editing.

Core claim

The central claim is that a multimodal diffusion guidance mechanism, built by leveraging the volume spanned by modality embeddings from video, audio and text, enforces unified alignment and functions as a plug-and-play control signal that improves generation quality when applied to any pretrained audio diffusion model for video-to-audio tasks.

What carries the argument

The volume spanned by the modality embeddings, used to derive a guidance signal that aligns video, audio and text in a unified manner during diffusion sampling.

If this is right

The guidance improves perceptual quality and multimodal alignment compared with baselines on VGGSound and AudioCaps.
It functions as a lightweight plug-and-play addition without requiring retraining of the base audio diffusion model.
It captures global multimodal coherence more effectively than approaches based on pairwise similarities.
The same control signal can be attached to different pretrained audio diffusion models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The geometric volume approach might extend to other multimodal tasks such as text-conditioned video generation or image-to-audio synthesis.
It points toward using embedding-space geometry as a substitute for learned cross-modal adapters in future diffusion pipelines.
Real-time applications could test whether the added guidance computation remains lightweight enough for interactive video editing.

Load-bearing premise

The volume spanned by the modality embeddings from video, audio and text can be leveraged to enforce unified alignment across modalities in a manner that improves generation quality when used as guidance on pretrained diffusion models without any retraining or adaptation.

What would settle it

Applying the proposed guidance to a standard pretrained video-to-audio diffusion model and observing no measurable gain in perceptual quality metrics or multimodal alignment scores on the AudioCaps or VGGSound test sets would falsify the central claim.

read the original abstract

Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos, with potential applications in video editing, Foley sound design, and assistive multimedia. Although the excellent results, existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities that may fail to capture global multimodal coherence. In this work, we propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text. The proposed multimodal diffusion guidance (MDG) provides a lightweight, plug-and-play control signal that can be applied on top of any pretrained audio diffusion model without retraining. Experiments on VGGSound and AudioCaps demonstrate that our MDG consistently improves perceptual quality and multimodal alignment compared to baselines, proving the effectiveness of a joint multimodal guidance for V2A.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a training-free multimodal diffusion guidance (MDG) for video-to-audio generation. It computes a scalar volume from the parallelepiped spanned by video, audio, and text embeddings and uses this as a plug-and-play additive guidance term on top of any pretrained audio diffusion model. Experiments on VGGSound and AudioCaps are reported to show consistent gains in perceptual quality and multimodal alignment over baselines.

Significance. A verified training-free geometric guidance mechanism that improves alignment without retraining or parameter fitting would be a useful, low-cost addition to the V2A literature, especially if it generalizes across diffusion backbones and remains stable across noise schedules.

major comments (2)

[§3] §3 (Method), guidance formulation: the central claim that the scalar volume det([v-a, t-a]) (or equivalent) enforces unified multimodal alignment and supplies a useful gradient for the reverse diffusion process lacks any derivation or stability analysis. No argument is given showing why larger volume corresponds to better semantic coherence rather than to embedding-scale mismatches, noise amplification, or spurious directions when the audio embedding is taken from partially denoised latents.
[§4] §4 (Experiments): the abstract and results claim “consistent improvements” on VGGSound and AudioCaps, yet no quantitative metrics (e.g., CLAP, FAD, or human MOS), exact baseline implementations, statistical significance tests, or ablations isolating the volume term versus simple pairwise cosine guidance are supplied. Without these, the magnitude and reliability of the reported gains cannot be assessed.

minor comments (1)

[Abstract] Abstract, first sentence: “Although the excellent results,” is grammatically incomplete and should be rephrased (e.g., “Although existing approaches achieve excellent results,”).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve the manuscript.

read point-by-point responses

Referee: [§3] §3 (Method), guidance formulation: the central claim that the scalar volume det([v-a, t-a]) (or equivalent) enforces unified multimodal alignment and supplies a useful gradient for the reverse diffusion process lacks any derivation or stability analysis. No argument is given showing why larger volume corresponds to better semantic coherence rather than to embedding-scale mismatches, noise amplification, or spurious directions when the audio embedding is taken from partially denoised latents.

Authors: We acknowledge that the current manuscript relies primarily on geometric intuition for the volume term without a formal derivation or stability analysis. The volume is intended to quantify linear independence among the three centered embeddings, with the guidance designed to increase this volume to promote a more coherent joint representation rather than collapse. We agree this requires stronger justification. In the revised manuscript we will add a dedicated derivation of the guidance gradient from the determinant and include a stability analysis examining behavior under different noise schedules and embedding normalizations. revision: yes
Referee: [§4] §4 (Experiments): the abstract and results claim “consistent improvements” on VGGSound and AudioCaps, yet no quantitative metrics (e.g., CLAP, FAD, or human MOS), exact baseline implementations, statistical significance tests, or ablations isolating the volume term versus simple pairwise cosine guidance are supplied. Without these, the magnitude and reliability of the reported gains cannot be assessed.

Authors: We agree that the experimental evaluation would be substantially strengthened by the requested details. The current version reports qualitative and some alignment improvements but does not include the full set of quantitative metrics, baseline specifications, significance tests, or targeted ablations. We will revise §4 to add CLAP and FAD scores, human MOS ratings with statistical significance testing, precise baseline implementation details, and an ablation comparing the full volume guidance against pairwise cosine guidance. revision: yes

Circularity Check

0 steps flagged

MDG volume-based guidance is an independent ansatz with external validation

full rationale

The paper proposes the multimodal diffusion guidance (MDG) as a novel training-free mechanism that computes a scalar volume from video/audio/text embeddings and adds it as a plug-and-play term to any pretrained audio diffusion sampler. This geometric construction is introduced directly in the method section without reducing to a fitted parameter or self-citation chain; the volume term is not shown to be mathematically equivalent to the input embeddings by construction, nor is the claimed alignment improvement forced by the definition of the embeddings themselves. Experiments on VGGSound and AudioCaps supply independent empirical checks outside the derivation, satisfying the criteria for a self-contained result. No load-bearing step collapses to a self-referential fit or imported uniqueness theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions from diffusion models and multimodal embedding spaces; no free parameters or invented entities are introduced in the abstract description.

axioms (2)

domain assumption Pretrained audio diffusion models can accept external multimodal guidance signals without retraining while preserving generation quality.
Invoked to support the plug-and-play claim for any pretrained model.
domain assumption The geometric volume spanned by video, audio, and text embeddings provides a meaningful measure of multimodal coherence.
Central to the proposed guidance mechanism.

pith-pipeline@v0.9.0 · 5695 in / 1379 out tokens · 29864 ms · 2026-05-21T22:14:16.242868+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 2 internal anchors

[1]

Future Artificial Intelligence Research

INTRODUCTION The ability to generate realistic and semantically consistent audio from a given video has the potential to transform a wide range of applications, from automated video editing and silent film restora- tion to assistive multimedia technologies and immersive content cre- ation [1–3]. The task, known as video-to-audio (V2A) generation, requires...

work page
[2]

Training-Free Multimodal Guidance for Video to Audio Generation

and by our method. MDG (ours) better guides the generation towards a semantically meaningful synthesized audio, while See- ing&Hearing generation is noisy and semantically inconsistent. GPUs [4,11]. Furthermore, these systems usually also require large- scale paired datasets [12], which are expensive to collect and rare to find. This limits all the models...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

The audio generator is a latent dif- fusion model (LDM) operating in the audio latent space

PROPOSED METHOD We consider the task of generating audio ˆxa conditioned on a given videox v and a text promptx p. The audio generator is a latent dif- fusion model (LDM) operating in the audio latent space. By jointly leveraging the information of the clean audiox a, of the videox v (and optionally of the textx p), we provide a novel unified geo- metric ...

work page
[4]

=N √¯αt za 0,(1−¯α t)I ,(2) with¯αt =Qt i=1(1−β i). The denoiserϵ a θ(za t, t,x p)is pretrained by LLDM =E za 0,ϵ,t ∥ϵ−ϵ a θ(za t , t,x p)∥2 2 .(3) During sampling, the predicted clean latent is ˜za 0 = 1√¯αt za t − √ 1−¯αt ϵa θ(za t, t,x p) ,(4) which we will steer with the proposed guidance. The final spectro- gram is ˆxa =D( ˜za 0)that can be converted...

work page
[5]

with pairwise cosine similarity replaced by the volume measure of Eq. (7). This retains the benefits of contrastive learning while enforcing joint (tri-modal) geometry rather than pairwise alignment. 2.3. Multimodal Diffusion Guidance We propose a training-free multimodal diffusion guidance (MDG) mechanism that leverages the geometric structure learned by...

work page
[6]

to steer the audio generation process. At each denoising step, the model predicts an intermediate clean audio latent, which is then adjusted through the shared information coming from the multi- modal embedding space built with the video and text inputs. Rather than relying on pairwise similarities as [13], we compute a tri-modal similarity measure based ...

work page
[7]

EXPERIMENTS This section details the experimental framework designed to evaluate the performance of the proposed Multimodal Diffusion Guidance. 3.1. Experimental Setup To conduct a comprehensive comparison, we design a two-part ex- perimental evaluation to assess both in-domain performance and generalization capabilities under a domain shift. The main ex-...

work page
[8]

CONCLUSION In this paper, we presented a novel multimodal guidance mecha- nism (MDG) for video-to-audio (V2A) generation that leverages the volume spanned by modality embeddings to enforce joint semantic alignment across video, audio, and text. MDG introduces a training- free, plug-and-play strategy that integrates a volume-based multi- modal alignment ob...

work page
[9]

Syncfusion: Multimodal onset-synchronized video-to-audio foley synthesis,

M. Comunit `a, R. F. Gramaccioni, E. Postolache, E. Rodol `a, D. Comminiello, and J. D. Reiss, “Syncfusion: Multimodal onset-synchronized video-to-audio foley synthesis,” inIEEE Int. Conf. on Acoustics, Speech and Signal Process. (ICASSP), 2024, pp. 936–940

work page 2024
[10]

AudioLDM 2: Learning holistic audio generation with self-supervised pre- training,

H. Liu, Q. Tian, Y . Yuan, X. Liu, X. Mei, Q. Kong, Y . Wang, W. Wang, Y . Wang, and M. D. Plumbley, “AudioLDM 2: Learning holistic audio generation with self-supervised pre- training,”IEEE/ACM Trans. on Audio, Speech, and Language Process., vol. 32, pp. 2871–2883, 2023

work page 2023
[11]

StereoSync: Spatially- aware stereo audio generation from video,

C. Marinoni, R. F. Gramaccioni, K. Shimada, T. Shibuya, Y . Mitsufuji, and D. Comminiello, “StereoSync: Spatially- aware stereo audio generation from video,” inInternational Joint Conference on Neural Networks (IJCNN), Rome, Italy, June 2025

work page 2025
[12]

Mmaudio: Taming multimodal joint train- ing for high-quality video-to-audio synthesis,

H. K. Cheng, M. Ishii, A. Hayakawa, T. Shibuya, A. Schwing, and Y . Mitsufuji, “Mmaudio: Taming multimodal joint train- ing for high-quality video-to-audio synthesis,” inIEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 28901–28911

work page 2025
[13]

FOL AI: Syn- chronized foley sound generation with semantic and temporal alignment,

R. F. Gramaccioni, C. Marinoni, E. Postolache, M. Comunit `a, L. Cosmo, J. D. Reiss, and D. Comminiello, “FOL AI: Syn- chronized foley sound generation with semantic and temporal alignment,”ArXiv preprint: arXiv:2412.15023, 2024

work page arXiv 2024
[14]

Read, watch and scream! sound generation from text and video,

Y . Jeong, Y . Kim, S. Chun, and J. Lee, “Read, watch and scream! sound generation from text and video,”AAAI Conf. on Artificial Intelligence, vol. 39, no. 17, pp. 17590–17598, Apr. 2025

work page 2025
[15]

Draw an audio: Leveraging multi- instruction for video-to-audio synthesis,

Q. Yang, B. Mao, Z. Wang, X. Nie, P. Gao, Y . Guo, C. Zhen, P. Yan, and S. Xiang, “Draw an audio: Leveraging multi- instruction for video-to-audio synthesis,”ArXiv preprint: arXiv:2409.06135, 2024

work page arXiv 2024
[16]

FoleyGRAM: Video-to-audio generation with GRAM-aligned multimodal encoders,

R. F. Gramaccioni, C. Marinoni, E. Grassucci, G. Cic- chetti, A. Uncini, and D. Comminiello, “FoleyGRAM: Video-to-audio generation with GRAM-aligned multimodal encoders,” inInternational Joint Conference on Neural Net- works (IJCNN), Rome, Italy, June 2025

work page 2025
[17]

Vintage: Joint video and text conditioning for holistic audio generation,

S. S. Kushwaha and Y . Tian, “Vintage: Joint video and text conditioning for holistic audio generation,” inIEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 13529–13539

work page 2025
[18]

Tri- ergon: Fine-grained video-to-audio generation with multi- modal conditions and lufs control,

B. Li, F. Yang, Y . Mao, Q. Ye, H. Chen, and Y . Zhong, “Tri- ergon: Fine-grained video-to-audio generation with multi- modal conditions and lufs control,”AAAI Conf. on Artificial Intelligence, vol. 39, no. 5, pp. 4616–4624, Apr. 2025

work page 2025
[19]

Frieren: Efficient video-to-audio generation net- work with rectified flow matching,

Y . Wang, W. Guo, R. Huang, J. Huang, Z. Wang, F. You, R. Li, and Z. Zhao, “Frieren: Efficient video-to-audio generation net- work with rectified flow matching,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024, vol. 37, pp. 128118–128138

work page 2024
[20]

Vggsound: A large-scale audio-visual dataset,

H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “Vggsound: A large-scale audio-visual dataset,” inIEEE Int. Conf. on Acous- tics, Speech and Signal Processing (ICASSP), 2020, pp. 721– 725

work page 2020
[21]

Seeing and Hearing: Open-domain visual-audio generation with diffusion latent aligners,

Y . Xing, Y .-Y . He, Z-Tian, X. Wang, and Q. Chen, “Seeing and Hearing: Open-domain visual-audio generation with diffusion latent aligners,” inIEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 7151–7161

work page 2024
[22]

ImageBind one embedding space to bind them all,

R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V . Alwala, A. Joulin, and I. Misra, “ImageBind one embedding space to bind them all,” inIEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 15180–15190

work page 2023
[23]

Anchors aweigh! sail for optimal unified multi- modal representations,

M. Jeong, M. Namgung, Z. M. Kim, D. Kang, Y .-Y . Chiang, and A. Hero, “Anchors aweigh! sail for optimal unified multi- modal representations,”ArXiv preprint: arXiv:2410.02086, 2024

work page arXiv 2024
[24]

Gramian multimodal representation learning and alignment,

G. Cicchetti, E. Grassucci, L. Sigillo, and D. Comminiello, “Gramian multimodal representation learning and alignment,” inInt. Conf. on Learning Repr. (ICLR), 2025

work page 2025
[25]

AudioLDM: Text-to-audio generation with latent diffusion models,

H. Liu, Z. Chen, Y . Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” inInt. Conf. on Machine Learn- ing (ICML), 2023, vol. 202, pp. 21450–21474

work page 2023
[26]

Matrix theory,

F. R. Gantmacher, “Matrix theory,”Chelsea Publishing Com- pany, 1959

work page 1959
[27]

Representation Learning with Contrastive Predictive Coding

A. van den Oord, Y . Li, and O. Vinyals, “Representation learning with contrastive predictive coding,”ArXiv preprint: arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[28]

Audiocaps: Generat- ing captions for audios in the wild,

C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generat- ing captions for audios in the wild,” inNorth American Chap- ter of the Association for Computational Linguistics (NAACL), 2019

work page 2019
[29]

Taming visually guided sound gener- ation,

V . Iashin and E. Rahtu, “Taming visually guided sound gener- ation,” inBritish Machine Vision Conference (BMVC), 2021

work page 2021
[30]

Diff-Foley: Synchronized video-to-audio synthesis with la- tent diffusion models,

Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao, “Diff-Foley: Synchronized video-to-audio synthesis with la- tent diffusion models,” inAdvances in Neural Information Pro- cessing Systems (NeurIPS), 2023, vol. 36, pp. 48855–48876

work page 2023
[31]

PEA VS: Perceptual evaluation of audio-visual synchrony grounded in viewers’ opinion scores,

L. Goncalves, P. Mathur, C. Lavania, M. Cekic, M. Federico, and K. J. Han, “PEA VS: Perceptual evaluation of audio-visual synchrony grounded in viewers’ opinion scores,” inEuropean Conference on Computer Vision (ECCV), 2024

work page 2024
[32]

Fr´echet audio distance: A metric for evaluating music enhancement algorithms,

K. Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi, “Fr´echet audio distance: A metric for evaluating music enhancement algorithms,” 2019

work page 2019
[33]

Improved techniques for training gans,

T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Rad- ford, and X. Chen, “Improved techniques for training gans,” in Neural Information Processing Systems (NIPS), 2016

work page 2016

[1] [1]

Future Artificial Intelligence Research

INTRODUCTION The ability to generate realistic and semantically consistent audio from a given video has the potential to transform a wide range of applications, from automated video editing and silent film restora- tion to assistive multimedia technologies and immersive content cre- ation [1–3]. The task, known as video-to-audio (V2A) generation, requires...

work page

[2] [2]

Training-Free Multimodal Guidance for Video to Audio Generation

and by our method. MDG (ours) better guides the generation towards a semantically meaningful synthesized audio, while See- ing&Hearing generation is noisy and semantically inconsistent. GPUs [4,11]. Furthermore, these systems usually also require large- scale paired datasets [12], which are expensive to collect and rare to find. This limits all the models...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

The audio generator is a latent dif- fusion model (LDM) operating in the audio latent space

PROPOSED METHOD We consider the task of generating audio ˆxa conditioned on a given videox v and a text promptx p. The audio generator is a latent dif- fusion model (LDM) operating in the audio latent space. By jointly leveraging the information of the clean audiox a, of the videox v (and optionally of the textx p), we provide a novel unified geo- metric ...

work page

[4] [4]

=N √¯αt za 0,(1−¯α t)I ,(2) with¯αt =Qt i=1(1−β i). The denoiserϵ a θ(za t, t,x p)is pretrained by LLDM =E za 0,ϵ,t ∥ϵ−ϵ a θ(za t , t,x p)∥2 2 .(3) During sampling, the predicted clean latent is ˜za 0 = 1√¯αt za t − √ 1−¯αt ϵa θ(za t, t,x p) ,(4) which we will steer with the proposed guidance. The final spectro- gram is ˆxa =D( ˜za 0)that can be converted...

work page

[5] [5]

with pairwise cosine similarity replaced by the volume measure of Eq. (7). This retains the benefits of contrastive learning while enforcing joint (tri-modal) geometry rather than pairwise alignment. 2.3. Multimodal Diffusion Guidance We propose a training-free multimodal diffusion guidance (MDG) mechanism that leverages the geometric structure learned by...

work page

[6] [6]

to steer the audio generation process. At each denoising step, the model predicts an intermediate clean audio latent, which is then adjusted through the shared information coming from the multi- modal embedding space built with the video and text inputs. Rather than relying on pairwise similarities as [13], we compute a tri-modal similarity measure based ...

work page

[7] [7]

EXPERIMENTS This section details the experimental framework designed to evaluate the performance of the proposed Multimodal Diffusion Guidance. 3.1. Experimental Setup To conduct a comprehensive comparison, we design a two-part ex- perimental evaluation to assess both in-domain performance and generalization capabilities under a domain shift. The main ex-...

work page

[8] [8]

CONCLUSION In this paper, we presented a novel multimodal guidance mecha- nism (MDG) for video-to-audio (V2A) generation that leverages the volume spanned by modality embeddings to enforce joint semantic alignment across video, audio, and text. MDG introduces a training- free, plug-and-play strategy that integrates a volume-based multi- modal alignment ob...

work page

[9] [9]

Syncfusion: Multimodal onset-synchronized video-to-audio foley synthesis,

M. Comunit `a, R. F. Gramaccioni, E. Postolache, E. Rodol `a, D. Comminiello, and J. D. Reiss, “Syncfusion: Multimodal onset-synchronized video-to-audio foley synthesis,” inIEEE Int. Conf. on Acoustics, Speech and Signal Process. (ICASSP), 2024, pp. 936–940

work page 2024

[10] [10]

AudioLDM 2: Learning holistic audio generation with self-supervised pre- training,

H. Liu, Q. Tian, Y . Yuan, X. Liu, X. Mei, Q. Kong, Y . Wang, W. Wang, Y . Wang, and M. D. Plumbley, “AudioLDM 2: Learning holistic audio generation with self-supervised pre- training,”IEEE/ACM Trans. on Audio, Speech, and Language Process., vol. 32, pp. 2871–2883, 2023

work page 2023

[11] [11]

StereoSync: Spatially- aware stereo audio generation from video,

C. Marinoni, R. F. Gramaccioni, K. Shimada, T. Shibuya, Y . Mitsufuji, and D. Comminiello, “StereoSync: Spatially- aware stereo audio generation from video,” inInternational Joint Conference on Neural Networks (IJCNN), Rome, Italy, June 2025

work page 2025

[12] [12]

Mmaudio: Taming multimodal joint train- ing for high-quality video-to-audio synthesis,

H. K. Cheng, M. Ishii, A. Hayakawa, T. Shibuya, A. Schwing, and Y . Mitsufuji, “Mmaudio: Taming multimodal joint train- ing for high-quality video-to-audio synthesis,” inIEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 28901–28911

work page 2025

[13] [13]

FOL AI: Syn- chronized foley sound generation with semantic and temporal alignment,

R. F. Gramaccioni, C. Marinoni, E. Postolache, M. Comunit `a, L. Cosmo, J. D. Reiss, and D. Comminiello, “FOL AI: Syn- chronized foley sound generation with semantic and temporal alignment,”ArXiv preprint: arXiv:2412.15023, 2024

work page arXiv 2024

[14] [14]

Read, watch and scream! sound generation from text and video,

Y . Jeong, Y . Kim, S. Chun, and J. Lee, “Read, watch and scream! sound generation from text and video,”AAAI Conf. on Artificial Intelligence, vol. 39, no. 17, pp. 17590–17598, Apr. 2025

work page 2025

[15] [15]

Draw an audio: Leveraging multi- instruction for video-to-audio synthesis,

Q. Yang, B. Mao, Z. Wang, X. Nie, P. Gao, Y . Guo, C. Zhen, P. Yan, and S. Xiang, “Draw an audio: Leveraging multi- instruction for video-to-audio synthesis,”ArXiv preprint: arXiv:2409.06135, 2024

work page arXiv 2024

[16] [16]

FoleyGRAM: Video-to-audio generation with GRAM-aligned multimodal encoders,

R. F. Gramaccioni, C. Marinoni, E. Grassucci, G. Cic- chetti, A. Uncini, and D. Comminiello, “FoleyGRAM: Video-to-audio generation with GRAM-aligned multimodal encoders,” inInternational Joint Conference on Neural Net- works (IJCNN), Rome, Italy, June 2025

work page 2025

[17] [17]

Vintage: Joint video and text conditioning for holistic audio generation,

S. S. Kushwaha and Y . Tian, “Vintage: Joint video and text conditioning for holistic audio generation,” inIEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 13529–13539

work page 2025

[18] [18]

Tri- ergon: Fine-grained video-to-audio generation with multi- modal conditions and lufs control,

B. Li, F. Yang, Y . Mao, Q. Ye, H. Chen, and Y . Zhong, “Tri- ergon: Fine-grained video-to-audio generation with multi- modal conditions and lufs control,”AAAI Conf. on Artificial Intelligence, vol. 39, no. 5, pp. 4616–4624, Apr. 2025

work page 2025

[19] [19]

Frieren: Efficient video-to-audio generation net- work with rectified flow matching,

Y . Wang, W. Guo, R. Huang, J. Huang, Z. Wang, F. You, R. Li, and Z. Zhao, “Frieren: Efficient video-to-audio generation net- work with rectified flow matching,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024, vol. 37, pp. 128118–128138

work page 2024

[20] [20]

Vggsound: A large-scale audio-visual dataset,

H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “Vggsound: A large-scale audio-visual dataset,” inIEEE Int. Conf. on Acous- tics, Speech and Signal Processing (ICASSP), 2020, pp. 721– 725

work page 2020

[21] [21]

Seeing and Hearing: Open-domain visual-audio generation with diffusion latent aligners,

Y . Xing, Y .-Y . He, Z-Tian, X. Wang, and Q. Chen, “Seeing and Hearing: Open-domain visual-audio generation with diffusion latent aligners,” inIEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 7151–7161

work page 2024

[22] [22]

ImageBind one embedding space to bind them all,

R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V . Alwala, A. Joulin, and I. Misra, “ImageBind one embedding space to bind them all,” inIEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 15180–15190

work page 2023

[23] [23]

Anchors aweigh! sail for optimal unified multi- modal representations,

M. Jeong, M. Namgung, Z. M. Kim, D. Kang, Y .-Y . Chiang, and A. Hero, “Anchors aweigh! sail for optimal unified multi- modal representations,”ArXiv preprint: arXiv:2410.02086, 2024

work page arXiv 2024

[24] [24]

Gramian multimodal representation learning and alignment,

G. Cicchetti, E. Grassucci, L. Sigillo, and D. Comminiello, “Gramian multimodal representation learning and alignment,” inInt. Conf. on Learning Repr. (ICLR), 2025

work page 2025

[25] [25]

AudioLDM: Text-to-audio generation with latent diffusion models,

H. Liu, Z. Chen, Y . Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” inInt. Conf. on Machine Learn- ing (ICML), 2023, vol. 202, pp. 21450–21474

work page 2023

[26] [26]

Matrix theory,

F. R. Gantmacher, “Matrix theory,”Chelsea Publishing Com- pany, 1959

work page 1959

[27] [27]

Representation Learning with Contrastive Predictive Coding

A. van den Oord, Y . Li, and O. Vinyals, “Representation learning with contrastive predictive coding,”ArXiv preprint: arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[28] [28]

Audiocaps: Generat- ing captions for audios in the wild,

C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generat- ing captions for audios in the wild,” inNorth American Chap- ter of the Association for Computational Linguistics (NAACL), 2019

work page 2019

[29] [29]

Taming visually guided sound gener- ation,

V . Iashin and E. Rahtu, “Taming visually guided sound gener- ation,” inBritish Machine Vision Conference (BMVC), 2021

work page 2021

[30] [30]

Diff-Foley: Synchronized video-to-audio synthesis with la- tent diffusion models,

Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao, “Diff-Foley: Synchronized video-to-audio synthesis with la- tent diffusion models,” inAdvances in Neural Information Pro- cessing Systems (NeurIPS), 2023, vol. 36, pp. 48855–48876

work page 2023

[31] [31]

PEA VS: Perceptual evaluation of audio-visual synchrony grounded in viewers’ opinion scores,

L. Goncalves, P. Mathur, C. Lavania, M. Cekic, M. Federico, and K. J. Han, “PEA VS: Perceptual evaluation of audio-visual synchrony grounded in viewers’ opinion scores,” inEuropean Conference on Computer Vision (ECCV), 2024

work page 2024

[32] [32]

Fr´echet audio distance: A metric for evaluating music enhancement algorithms,

K. Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi, “Fr´echet audio distance: A metric for evaluating music enhancement algorithms,” 2019

work page 2019

[33] [33]

Improved techniques for training gans,

T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Rad- ford, and X. Chen, “Improved techniques for training gans,” in Neural Information Processing Systems (NIPS), 2016

work page 2016