arXiv: 2601.02731 · v3 · submitted 2026-01-06 · 💻 cs.SD · cs.CV · cs.MM

Omni2Sound: Towards Unified Video-Text-to-Audio Generation

Pith reviewed 2026-05-16 17:22 UTC · model grok-4.3

classification 💻 cs.SD · cs.CV · cs.MM
keywords video-to-audio generation · text-to-audio generation · unified multimodal model · diffusion transformer · progressive training · audio caption dataset · sound generation · off-screen audio

The pith

A single diffusion model generates audio from video alone, text alone, or both at state-of-the-art levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to create one model capable of producing sound from video, from text, or from their combination while avoiding the usual drop in quality when tasks compete during training. It first builds SoundAtlas, a dataset of 470,000 video-audio pairs with detailed, tightly aligned captions produced through an agentic pipeline that compresses visuals, hands off between agents, and filters outputs. A three-stage progressive training schedule then lets the model learn the tasks jointly on a standard DiT backbone, turning competition into shared gains and reducing bias when both video and text are supplied. If this holds, applications that need flexible soundtrack creation could use one system instead of separate specialized ones.
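
The handoff-and-filter idea in the captioning pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Caption` type, the confidence score, both thresholds, and the stub agents are hypothetical names introduced here, and the input is assumed to be a text summary already produced by the visual compression step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Caption:
    text: str
    confidence: float  # hypothetical quality score in [0, 1]


Captioner = Callable[[str], Caption]


def caption_clip(
    visual_summary: str,
    junior: Captioner,
    senior: Captioner,
    handoff_below: float = 0.7,  # assumed escalation threshold
    keep_at_least: float = 0.5,  # assumed post-hoc filter threshold
) -> Optional[Caption]:
    """Return a caption, or None if the post-hoc filter rejects it."""
    # Step 1: the cheap junior agent always drafts first.
    caption = junior(visual_summary)
    # Step 2: escalate to the expensive senior agent only for weak drafts.
    if caption.confidence < handoff_below:
        caption = senior(visual_summary)
    # Step 3: post-hoc filtering drops captions that are still unreliable.
    return caption if caption.confidence >= keep_at_least else None


def demo_junior(summary: str) -> Caption:
    # Stub agent: pretends short summaries are easy and long ones are hard.
    return Caption(f"junior: {summary}", 0.9 if len(summary) < 20 else 0.4)


def demo_senior(summary: str) -> Caption:
    # Stub agent: confident unless the summary is flagged as unclear.
    return Caption(f"senior: {summary}", 0.2 if "unclear" in summary else 0.95)
```

With the stub agents, an easy clip stays with the junior agent, a harder one is escalated, and a clip that neither agent captions confidently is filtered out. Real agents would be model calls, and the thresholds would need tuning against held-out human judgments.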

Core claim

Omni2Sound is a unified video-text-to-audio diffusion model that, trained with a three-stage multi-task progressive schedule on the SoundAtlas dataset, reaches state-of-the-art results on video-to-audio, text-to-audio, and joint video-text-to-audio generation inside a single model while preserving alignment and off-screen audio fidelity.

What carries the argument

The three-stage multi-task progressive training schedule that integrates tasks sequentially to convert cross-task competition into joint optimization and to lessen modality bias when video and text conditions are combined.
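
A staged multi-task schedule of this kind can be pictured as a lookup from training step to task-sampling weights. The sketch below is illustrative only: the stage names, step counts, task order, and weights are all placeholder assumptions, since the stage definitions are not given in the text reviewed here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stage:
    name: str
    steps: int                      # number of optimizer steps in this stage
    task_weights: Dict[str, float]  # probability of sampling each task


# Placeholder values: the real order, durations and weights are assumptions.
SCHEDULE: List[Stage] = [
    Stage("stage1", 1000, {"T2A": 1.0}),
    Stage("stage2", 1000, {"T2A": 0.5, "V2A": 0.5}),
    Stage("stage3", 1000, {"T2A": 0.3, "V2A": 0.3, "VT2A": 0.4}),
]


def stage_at(step: int, schedule: List[Stage] = SCHEDULE) -> Stage:
    """Return the stage that owns a given global training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for stage in schedule:
        boundary += stage.steps
        if step < boundary:
            return stage
    # Steps past the end of the schedule stay in the final stage.
    return schedule[-1]
```

A training loop would call `stage_at(step).task_weights` to decide which task to sample a batch from. A single joint-training baseline corresponds to a schedule with one stage that contains all three tasks from step 0.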

If this is right

  • A single model maintains or exceeds separate-task performance across video-to-audio, text-to-audio, and combined inputs.
  • The model generates both on-screen and off-screen audio faithfully when given mixed video-text conditions.
  • Generalization remains strong on benchmarks that use different input combinations without task-specific adjustments.
  • Deployment simplifies because one checkpoint handles all three generation modes.
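
One way a single checkpoint could serve all three modes is to route on which inputs are present and substitute a placeholder for the missing one. The sketch below assumes that design; the null-embedding substitution is a common practice in conditional diffusion models, not something the reviewed text confirms for Omni2Sound, and `NULL_VIDEO`, `NULL_TEXT`, and `build_conditions` are hypothetical names.

```python
from typing import List, Optional, Tuple

# Hypothetical placeholder embeddings standing in for an absent modality.
NULL_VIDEO: List[float] = [0.0, 0.0]
NULL_TEXT: List[float] = [0.0, 0.0]


def build_conditions(
    video: Optional[List[float]],
    text: Optional[List[float]],
) -> Tuple[str, List[float], List[float]]:
    """Pick the generation mode and fill in any missing condition."""
    if video is None and text is None:
        raise ValueError("at least one of video or text is required")
    if video is not None and text is not None:
        mode = "VT2A"
    elif video is not None:
        mode = "V2A"
    else:
        mode = "T2A"
    return (
        mode,
        video if video is not None else NULL_VIDEO,
        text if text is not None else NULL_TEXT,
    )
```

The caller passes the same three-tuple to the model regardless of mode, which is what lets one set of weights cover video-only, text-only, and joint prompts.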

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar staged training might reduce task interference in other multimodal generators such as text-to-video or image-to-sound models.
  • The agent-based caption pipeline could be reused to create aligned datasets for additional modalities or languages.
  • Unified models of this kind could lower the cost of building production tools that accept any mix of video and text prompts.

Load-bearing premise

The three-stage training schedule is what resolves competition and bias, without hidden data exclusions or tuning steps that would change the reported numbers.

What would settle it

Retrain the identical DiT model on the same data with a single joint training stage from the start. If that baseline shows measurable drops on any of the three tasks relative to the staged-schedule results, the premise holds; if it matches them, the schedule is not what resolves the competition.
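
The comparison this test calls for can be made mechanical. The helper below is a sketch under assumptions: the metric keys, the per-metric direction flags, and the tolerance are hypothetical inputs supplied by whoever runs the ablation, and no numbers from the paper are used.

```python
from typing import Dict, List


def regressions(
    staged: Dict[str, float],
    joint: Dict[str, float],
    higher_is_better: Dict[str, bool],
    tolerance: float = 0.0,
) -> List[str]:
    """List metrics on which the joint-training baseline is worse."""
    worse: List[str] = []
    for key in sorted(staged):
        # Positive delta means the joint baseline improved on this metric.
        delta = joint[key] - staged[key]
        if not higher_is_better[key]:
            delta = -delta
        # Only count a drop that exceeds the allowed noise tolerance.
        if delta < -tolerance:
            worse.append(key)
    return worse
```

An empty result would mean the joint baseline matched the staged model on every metric within tolerance. The tolerance should come from run-to-run variance across seeds so that noise is not read as a regression.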

Figures

Figures reproduced from arXiv: 2601.02731 by Baolong Gao, Jianfei Cai, Jun Zhu, Qiuhong Ke, Yusheng Dai, Yuxuan Jiang, Zehua Chen.

Figure 1: Challenges in scaling high-quality audio captions.
Figure 2: Data Construction Pipeline of SoundAtlas (Left). Comparison against SOTA baselines and human annotations (Right).
Figure 3: Overview of our unified VT2A framework, which …
Figure 4: Subjective Evaluation Results on VGGSound-Omni. We report Mean Opinion Scores (MOS) on a 1-5 scale across …
Figure 5: Audio Captioning Instruction for SoundAtlas.
Figure 6: User study interface for human evaluation across dif…
Original abstract

Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight V-A-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5$\times$ cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight V-A-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions.
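
The 5$\times$ figure attributed to the Junior-Senior Agent Handoff can be sanity-checked with a back-of-the-envelope cost model. The symbols are assumptions introduced for illustration and do not come from the paper: $c_J$ and $c_S$ are the per-clip costs of the junior and senior agents, and $p$ is the fraction of clips escalated to the senior agent. The sketch also assumes that the baseline sends every clip to the senior agent and that the junior agent runs on every clip.

```latex
\begin{aligned}
C_{\text{handoff}} &= c_J + p\,c_S, \qquad C_{\text{baseline}} = c_S,\\
R &= \frac{C_{\text{baseline}}}{C_{\text{handoff}}}
   = \frac{c_S}{c_J + p\,c_S}
   = \frac{1}{c_J/c_S + p},\\
R = 5 &\iff \frac{c_J}{c_S} + p = 0.2 .
\end{aligned}
```

Under these assumptions, a junior agent costing one tenth of the senior agent reaches a 5$\times$ reduction when one clip in ten is escalated, since 0.1 + 0.1 = 0.2. The paper's actual cost accounting may differ.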

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SoundAtlas, a 470k-pair video-audio-text dataset constructed via an agentic pipeline (Vision-to-Language Compression, Junior-Senior Agent Handoff, Post-hoc Filtering) to address data scarcity and alignment issues. It presents Omni2Sound, a unified DiT-based diffusion model supporting flexible V2A, T2A, and VT2A inputs, trained with a three-stage multi-task progressive schedule intended to convert cross-task competition into joint optimization and reduce modality bias. The work constructs the VGGSound-Omni benchmark (including off-screen tracks) and claims unified SOTA performance across all three tasks within a single model.

Significance. If the SOTA claims and the effectiveness of the progressive schedule are substantiated by quantitative metrics and ablations, the work would be significant for unified multimodal audio generation: it shows that a standard DiT backbone can handle heterogeneous conditioning while preserving audio-visual alignment and off-screen faithfulness. The agentic dataset pipeline (with explicit cost-reduction and bias-mitigation steps) is a concrete, reusable contribution that could improve caption quality for future V-A-T models.

major comments (2)
  1. [Abstract] The claim of 'unified SOTA performance across all three tasks within a single model' is stated without any quantitative metrics (FID, KL, CLAP, etc.), error bars, baseline comparisons, or dataset statistics, so the central performance claim cannot be evaluated from the provided text.
  2. [Abstract, three-stage schedule] The description that the schedule 'converts cross-task competition into joint optimization and mitigates modality bias' lacks stage definitions, loss-weighting schedules, per-stage curves, or direct comparison to a joint-training baseline; without these the resolution of the V2A-T2A trade-off and VT2A off-screen gap remains unverified.
minor comments (1)
  1. [Abstract] '470k pairs' should be clarified as exact or approximate and accompanied by a breakdown of video-only, text-only, and paired samples.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below and will revise the manuscript to improve clarity and evaluability of our claims while preserving the core contributions.

point-by-point responses
  1. Referee: [Abstract] The claim of 'unified SOTA performance across all three tasks within a single model' is stated without any quantitative metrics (FID, KL, CLAP, etc.), error bars, baseline comparisons, or dataset statistics, so the central performance claim cannot be evaluated from the provided text.

    Authors: We agree that the abstract would benefit from including representative quantitative metrics to support the SOTA claim. The full manuscript already contains these details (FID, KL, CLAP scores with error bars, baseline comparisons, and dataset statistics) in Section 4 and Tables 1-3. In the revision we will add a concise set of key numbers (e.g., average relative improvements) to the abstract to make the central claim evaluable without exceeding length limits.
    Revision: yes

  2. Referee: [Abstract, three-stage schedule] The description that the schedule 'converts cross-task competition into joint optimization and mitigates modality bias' lacks stage definitions, loss-weighting schedules, per-stage curves, or direct comparison to a joint-training baseline; without these the resolution of the V2A-T2A trade-off and VT2A off-screen gap remains unverified.

    Authors: The three-stage schedule, including stage definitions, loss-weighting, per-stage curves, and joint-training ablations, is fully specified in Section 3.2 and evaluated in Section 4.4. We acknowledge the abstract description is high-level. In revision we will briefly enumerate the stages and explicitly reference the ablation results to better convey how cross-task competition is resolved, while keeping the abstract concise.
    Revision: partial

Circularity Check

0 steps flagged

No circularity; claims rest on new dataset and training procedure

The paper introduces SoundAtlas via an agentic pipeline (Vision-to-Language Compression, Junior-Senior Agent Handoff, Post-hoc Filtering) and Omni2Sound via a three-stage multi-task progressive training schedule on a standard DiT backbone. These are presented as independent empirical contributions that produce the reported unified SOTA results across V2A/T2A/VT2A. No equations, derivations, or first-principles results are described that reduce to fitted parameters defined by the target outcome, self-citations that bear the central load, or ansatzes smuggled from prior author work. The performance claims are framed as outcomes of the new data and schedule rather than self-referential constructions, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the agentic captioning pipeline produces captions with tight V-A-T alignment superior to existing datasets and that the progressive training schedule resolves modality bias; both are introduced by the paper without external verification.

free parameters (1)
  • three-stage training hyperparameters
    Stage durations, loss weights, and learning rates chosen to balance tasks; values not reported in abstract.
axioms (1)
  • domain assumption: Diffusion models conditioned on video and text can generate temporally aligned audio
    Invoked in the model architecture and training description.

pith-pipeline@v0.9.0 · 5630 in / 1383 out tokens · 45515 ms · 2026-05-16T17:22:24.772588+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

    cs.SD · 2026-04 · unverdicted · novelty 7.0

    VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

  2. WavFlow: Audio Generation in Waveform Space

    cs.SD · 2026-05 · conditional · novelty 6.0

    WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.
