Omni2Sound: Towards Unified Video-Text-to-Audio Generation

Baolong Gao; Jianfei Cai; Jun Zhu; Qiuhong Ke; Yusheng Dai; Yuxuan Jiang; Zehua Chen

arXiv: 2601.02731 · v3 · submitted 2026-01-06 · 💻 cs.SD · cs.CV · cs.MM

Pith reviewed 2026-05-16 17:22 UTC · model grok-4.3

classification 💻 cs.SD · cs.CV · cs.MM

keywords video-to-audio generation · text-to-audio generation · unified multimodal model · diffusion transformer · progressive training · audio caption dataset · sound generation · off-screen audio

The pith

A single diffusion model generates audio from video alone, text alone, or both at state-of-the-art levels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to create one model capable of producing sound from video, from text, or from their combination while avoiding the usual drop in quality when tasks compete during training. It first builds SoundAtlas, a dataset of 470,000 video-audio pairs with detailed, tightly aligned captions produced through an agentic pipeline that compresses visuals, hands off between agents, and filters outputs. A three-stage progressive training schedule then lets the model learn the tasks jointly on a standard DiT backbone, turning competition into shared gains and reducing bias when both video and text are supplied. If this holds, applications that need flexible soundtrack creation could use one system instead of separate specialized ones.
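
The captioning flow described above (compress the visuals, hand off between agents, filter the outputs) can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: the source only names the stages, so the `Caption` type, the `caption_clip` function, the confidence score, and the rule "escalate to the senior agent when the junior draft scores below a threshold" are hypothetical and not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Caption:
    text: str
    confidence: float  # hypothetical self-reported score in [0, 1]
    source: str        # "junior" or "senior"


def caption_clip(
    clip_id: str,
    compress: Callable[[str], str],
    junior: Callable[[str], Caption],
    senior: Callable[[str], Caption],
    keep: Callable[[Caption], bool],
    handoff_threshold: float = 0.8,  # placeholder value, not from the paper
) -> Optional[Caption]:
    """Run one clip through a compress -> junior -> (senior) -> filter flow."""
    # Step 1: replace raw visuals with a short textual summary.
    visual_summary = compress(clip_id)
    # Step 2: the cheaper agent drafts a caption first.
    draft = junior(visual_summary)
    # Step 3 (assumed rule): escalate only low-confidence drafts,
    # which is one way a handoff could reduce cost.
    if draft.confidence < handoff_threshold:
        draft = senior(visual_summary)
    # Step 4: post-hoc filtering drops captions that fail the check.
    return draft if keep(draft) else None
```

The agents are passed in as plain callables so the control flow can be read and tested without committing to any particular model or API.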

Core claim

Omni2Sound is a unified video-text-to-audio diffusion model that, trained with a three-stage multi-task progressive schedule on the SoundAtlas dataset, reaches state-of-the-art results on video-to-audio, text-to-audio, and joint video-text-to-audio generation inside a single model while preserving alignment and off-screen audio fidelity.

What carries the argument

The three-stage multi-task progressive training schedule that integrates tasks sequentially to convert cross-task competition into joint optimization and to lessen modality bias when video and text conditions are combined.
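
One way to picture a staged multi-task schedule is as a table that maps the training step to task-sampling weights. The sketch below is hypothetical throughout: the stage boundaries, the order in which tasks are introduced, and the weights are placeholders, since the page gives no stage definitions. The names `STAGES`, `task_weights`, and `sample_task` are illustrative only.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical stage table: (exclusive end step, task sampling weights).
# Boundaries, task order, and weights are placeholders, not values from the paper.
STAGES: List[Tuple[int, Dict[str, float]]] = [
    (10_000, {"T2A": 1.0}),
    (20_000, {"T2A": 0.5, "V2A": 0.5}),
    (30_000, {"T2A": 1 / 3, "V2A": 1 / 3, "VT2A": 1 / 3}),
]


def task_weights(step: int) -> Dict[str, float]:
    """Return the sampling weights of the stage that contains `step`."""
    for end_step, weights in STAGES:
        if step < end_step:
            return weights
    # Past the last boundary, keep using the final stage's mixture.
    return STAGES[-1][1]


def sample_task(step: int, rng: random.Random) -> str:
    """Draw the task for one training batch according to the current stage."""
    weights = task_weights(step)
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]
```

A single-stage joint baseline corresponds to a table with one row that mixes all three tasks from step 0, which makes the two regimes easy to compare under the same training loop.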

If this is right

A single model maintains or exceeds separate-task performance across video-to-audio, text-to-audio, and combined inputs.
The model generates both on-screen and off-screen audio faithfully when given mixed video-text conditions.
Generalization remains strong on benchmarks that use different input combinations without task-specific adjustments.
Deployment simplifies because one checkpoint handles all three generation modes.
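
As a rough illustration of the last point, a caller of a single-checkpoint system only needs to decide which generation mode applies from the inputs that are present. The function below is a hypothetical helper written for this page, not part of any released Omni2Sound code.

```python
from typing import Optional


def infer_task(video: Optional[object], text: Optional[str]) -> str:
    """Pick the generation mode from which conditions were supplied."""
    has_video = video is not None
    # Treat an empty or whitespace-only prompt as "no text".
    has_text = bool(text and text.strip())
    if has_video and has_text:
        return "VT2A"
    if has_video:
        return "V2A"
    if has_text:
        return "T2A"
    raise ValueError("at least one of video or text must be provided")
```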

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar staged training might reduce task interference in other multimodal generators such as text-to-video or image-to-sound models.
The agent-based caption pipeline could be reused to create aligned datasets for additional modalities or languages.
Unified models of this kind could lower the cost of building production tools that accept any mix of video and text prompts.

Load-bearing premise

The three-stage training schedule is what resolves competition and bias, without hidden data exclusions or tuning steps that would change the reported numbers.

What would settle it

Retrain the identical DiT model on the same data with a single joint training stage from the start. Measurable drops on any of the three tasks relative to the staged-schedule results would support the premise; matching results would undercut it.
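
The comparison this test calls for can be written down as a small check over per-task scores. This is a sketch under stated assumptions: the function name `tasks_that_drop`, the `margin` parameter, and the idea of summarizing each task with a single scalar metric are illustrative choices, and no numbers from the paper are used.

```python
from typing import Dict, List


def tasks_that_drop(
    staged: Dict[str, float],
    joint: Dict[str, float],
    higher_is_better: bool = True,
    margin: float = 0.0,
) -> List[str]:
    """List tasks where the joint-from-scratch run is worse than the staged run.

    `staged` and `joint` map task names (e.g. "V2A") to one scalar metric each.
    `margin` is a hypothetical tolerance for what counts as a measurable drop.
    """
    dropped = []
    for task in sorted(staged):
        delta = joint[task] - staged[task]
        # For lower-is-better metrics, flip the sign so that
        # a negative delta always means "joint is worse".
        if not higher_is_better:
            delta = -delta
        if delta < -margin:
            dropped.append(task)
    return dropped
```

An empty result would mean the single-stage baseline matched the staged schedule on every task within the chosen margin.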

Figures

Figures reproduced from arXiv: 2601.02731 by Baolong Gao, Jianfei Cai, Jun Zhu, Qiuhong Ke, Yusheng Dai, Yuxuan Jiang, Zehua Chen.

**Figure 2.** Data Construction Pipeline of SoundAtlas (Left). Comparison against SOTA baselines and human annotations (Right).

**Figure 3.** Overview of our unified VT2A framework, which...

**Figure 4.** Subjective Evaluation Results on VGGSound-Omni. We report Mean Opinion Scores (MOS) on a 1-5 scale across...

**Figure 5.** Audio Captioning Instruction for SoundAtlas.

**Figure 6.** User study interface for human evaluation across dif...

Original abstract

Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight V-A-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5$\times$ cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight V-A-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Omni2Sound builds a single DiT model for V2A, T2A, and VT2A with a new 470k dataset and three-stage training, but the SOTA claims rest on unshown metrics and missing ablations for the schedule.

Colleague, the main thing to know is that this paper puts forward one diffusion model that handles video-to-audio, text-to-audio, and combined video-text inputs without switching architectures. The authors support it with SoundAtlas, a 470k-pair dataset built via an agentic pipeline that includes vision-to-language compression, a junior-senior handoff for lower cost, and post-hoc filtering, plus a new benchmark that adds off-screen audio tracks for harder testing. The three-stage progressive training is their answer to cross-task competition and modality bias in the joint case. That setup is concrete and addresses a real practical issue for media workflows that mix inputs.

What they do well is spell out the data scarcity problem and give a clear pipeline that aims for tighter V-A-T alignment than prior sets. The off-screen evaluation tracks are a useful addition because they force the model to generate faithful audio even when the source is not visible on screen. The progressive schedule itself is a reasonable engineering response to the trade-off they describe.

The soft spots are straightforward. The abstract states unified SOTA results but supplies no FID, KL, or other numbers, no error bars, and no ablation tables. The stress-test concern lands: without stage definitions, loss weight schedules, per-stage curves, or a direct joint-training baseline, it is hard to tell whether the schedule actually converts competition into joint gains or whether the dataset quality is carrying the results. If the full paper has those tables and they hold up under scrutiny, the contribution strengthens; from the given text the evidence stays thin.

This is for people working on multimodal audio generation who want flexible single-model setups rather than separate systems. A reader focused on diffusion training schedules for audio would find the approach worth examining. I would mark it a maybe for reading group, because the problem and the proposed fixes are timely. I would not cite it in the next year without seeing the numbers. It deserves peer review because the framing is honest and the components are new enough that referees can usefully ask for the missing ablations and metrics.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SoundAtlas, a 470k-pair video-audio-text dataset constructed via an agentic pipeline (Vision-to-Language Compression, Junior-Senior Agent Handoff, Post-hoc Filtering) to address data scarcity and alignment issues. It presents Omni2Sound, a unified DiT-based diffusion model supporting flexible V2A, T2A, and VT2A inputs, trained with a three-stage multi-task progressive schedule intended to convert cross-task competition into joint optimization and reduce modality bias. The work constructs the VGGSound-Omni benchmark (including off-screen tracks) and claims unified SOTA performance across all three tasks within a single model.

Significance. If the SOTA claims and the effectiveness of the progressive schedule are substantiated by quantitative metrics and ablations, the work would be significant for unified multimodal audio generation: it shows that a standard DiT backbone can handle heterogeneous conditioning while preserving audio-visual alignment and off-screen faithfulness. The agentic dataset pipeline (with explicit cost-reduction and bias-mitigation steps) is a concrete, reusable contribution that could improve caption quality for future V-A-T models.

major comments (2)

[Abstract] Abstract: the claim of 'unified SOTA performance across all three tasks within a single model' is stated without any quantitative metrics (FID, KL, CLAP, etc.), error bars, baseline comparisons, or dataset statistics, so the central performance claim cannot be evaluated from the provided text.
[Abstract] Abstract (three-stage schedule): the description that the schedule 'converts cross-task competition into joint optimization and mitigates modality bias' lacks stage definitions, loss-weighting schedules, per-stage curves, or direct comparison to a joint-training baseline; without these the resolution of the V2A-T2A trade-off and VT2A off-screen gap remains unverified.

minor comments (1)

[Abstract] Abstract: '470k pairs' should be clarified as exact or approximate and accompanied by a breakdown of video-only, text-only, and paired samples.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below and will revise the manuscript to improve clarity and evaluability of our claims while preserving the core contributions.

Referee: [Abstract] Abstract: the claim of 'unified SOTA performance across all three tasks within a single model' is stated without any quantitative metrics (FID, KL, CLAP, etc.), error bars, baseline comparisons, or dataset statistics, so the central performance claim cannot be evaluated from the provided text.

Authors: We agree that the abstract would benefit from including representative quantitative metrics to support the SOTA claim. The full manuscript already contains these details (FID, KL, CLAP scores with error bars, baseline comparisons, and dataset statistics) in Section 4 and Tables 1-3. In the revision we will add a concise set of key numbers (e.g., average relative improvements) to the abstract to make the central claim evaluable without exceeding length limits.

revision: yes

Referee: [Abstract] Abstract (three-stage schedule): the description that the schedule 'converts cross-task competition into joint optimization and mitigates modality bias' lacks stage definitions, loss-weighting schedules, per-stage curves, or direct comparison to a joint-training baseline; without these the resolution of the V2A-T2A trade-off and VT2A off-screen gap remains unverified.

Authors: The three-stage schedule, including stage definitions, loss-weighting, per-stage curves, and joint-training ablations, is fully specified in Section 3.2 and evaluated in Section 4.4. We acknowledge the abstract description is high-level. In revision we will briefly enumerate the stages and explicitly reference the ablation results to better convey how cross-task competition is resolved, while keeping the abstract concise.

revision: partial

Circularity Check

0 steps flagged

No circularity; claims rest on new dataset and training procedure

The paper introduces SoundAtlas via an agentic pipeline (Vision-to-Language Compression, Junior-Senior Agent Handoff, Post-hoc Filtering) and Omni2Sound via a three-stage multi-task progressive training schedule on a standard DiT backbone. These are presented as independent empirical contributions that produce the reported unified SOTA results across V2A/T2A/VT2A. No equations, derivations, or first-principles results are described that reduce to fitted parameters defined by the target outcome, self-citations that bear the central load, or ansatzes smuggled from prior author work. The performance claims are framed as outcomes of the new data and schedule rather than self-referential constructions, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the agentic captioning pipeline produces captions with tight V-A-T alignment superior to existing datasets and that the progressive training schedule resolves modality bias; both are introduced by the paper without external verification.

free parameters (1)

three-stage training hyperparameters
Stage durations, loss weights, and learning rates chosen to balance tasks; values not reported in abstract.

axioms (1)

domain assumption: Diffusion models conditioned on video and text can generate temporally aligned audio
Invoked in the model architecture and training description.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
cs.SD 2026-04 unverdicted novelty 7.0

VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

WavFlow: Audio Generation in Waveform Space
cs.SD 2026-05 conditional novelty 6.0

WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 2 Pith papers · 7 internal anchors

[1]

& Adi, Y

F. Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre D’efossez, et al. Audiogen: Textually guided audio generation.ArXiv, abs/2209.15352, 2022. 1

work page arXiv 2022
[2]

Audioldm: Text-to-audio generation with la- tent diffusion models

Haohe Liu, Zehua Chen, Yiitan Yuan, Xinhao Mei, Xubo Liu, et al. Audioldm: Text-to-audio generation with la- tent diffusion models. pages 21450–21474, 2023. 6, 4

work page 2023
[3]

Stable audio open.ICASSP 2025 - 2025 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages 1–5,

Zach Evans, Julian Parker, CJ Carr, Zack Zukowski, Josiah Taylor, et al. Stable audio open.ICASSP 2025 - 2025 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages 1–5,

work page 2025
[4]

Text-to-audio generation using instruction- tuned LLM and latent diffusion model,

Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction-tuned llm and latent diffusion model. ArXiv, abs/2304.13731, 2023. 1

work page arXiv 2023
[5]

Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models.ArXiv, abs/2306.17203, 2023

Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models.ArXiv, abs/2306.17203, 2023. 1

work page arXiv 2023
[6]

Frieren: Efficient video-to- audio generation with rectified flow matching.ArXiv, abs/2406.00320, 2024

Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jia-Bin Huang, Zehan Wang, et al. Frieren: Efficient video-to- audio generation with rectified flow matching.ArXiv, abs/2406.00320, 2024. 6

work page arXiv 2024
[7]

Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds.In- ternational Journal of Computer Vision, 134, 2024

Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, et al. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds.In- ternational Journal of Computer Vision, 134, 2024

work page 2024
[8]

Video-guided foley sound generation with multimodal controls.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18770–18781, 2024

Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, et al. Video-guided foley sound generation with multimodal controls.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18770–18781, 2024. 1

work page 2025
[9]

Vintage: Joint video and text conditioning for holistic audio gener- ation.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13529–13539,

Saksham Singh Kushwaha and Yapeng Tian. Vintage: Joint video and text conditioning for holistic audio gener- ation.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13529–13539,

work page 2025
[10]

Rongjie Huang, Dongchao Yang, Huadai Liu, Xixin Wu, and Helen M. Meng. Reasonaudio: Semantic reasoning and temporal synchrony in video–text-to-audio genera- tion, 2025. 1

work page 2025
[11]

Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation,

Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, et al. Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley au- dio generation.ArXiv, abs/2508.16930, 2025. 1, 2, 6, 7, 3

work page arXiv 2025
[12]

Thinksound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing,

Huadai Liu, Jialei Wang, Kaicheng Luo, Wen Wang, Qian Chen, et al. Thinksound: Chain-of-thought rea- soning in multimodal large language models for audio generation and editing.ArXiv, abs/2506.21448, 2025. 1, 6, 7

work page arXiv 2025
[13]

Schwing, et al

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander G. Schwing, et al. Mmaudio: Tam- ing multimodal joint training for high-quality video-to- audio synthesis.2025 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 28901–28911, 2024. 1, 2, 3, 5, 6, 7, 4

work page 2025
[14]

AudioX: A Unified Framework for Anything-to-Audio Generation

Zeyue Tian, Yizhu Jin, Zhaoyang Liu, Ruibin Yuan, Xu Tan, et al. Audiox: Diffusion transformer for anything-to-audio generation.ArXiv, abs/2503.10522,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Detecting and mitigating inser- tion hallucination in video-to-audio generation.ArXiv, abs/2510.08078, 2025

Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, et al. Detecting and mitigating inser- tion hallucination in video-to-audio generation.ArXiv, abs/2510.08078, 2025. 2

work page arXiv 2025
[16]

Vedaldi, and Andrew Zis- serman

Honglie Chen, Weidi Xie, A. Vedaldi, and Andrew Zis- serman. Vggsound: A large-scale audio-visual dataset. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725, 2020. 2, 3, 4, 5, 6

work page 2020
[17]

Gemmeke, D

J. Gemmeke, D. Ellis, Dylan Freedman, A. Jansen, W. Lawrence, et al. Audio set: An ontology and human- labeled dataset for audio events.2017 IEEE Interna- tional Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), pages 776–780, 2017. 2, 3, 4, 6

work page 2017
[18]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team. Gemini: A family of highly capable mul- timodal models.CoRR, abs/2312.11805, 2023. 2, 3, 4, 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

SAGE Publications Ltd, 2006

André Krouwel.Party Models, page 249–269. SAGE Publications Ltd, 2006. 2, 4

work page 2006
[20]

Vggsounder: Audio-visual evaluations for foundation models.ArXiv, abs/2508.08237, 2025

Daniil Zverev, Thaddaus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, et al. Vggsounder: Audio-visual evaluations for foundation models.ArXiv, abs/2508.08237, 2025. 2, 5, 6

work page arXiv 2025
[21]

Peebles and Saining Xie

William S. Peebles and Saining Xie. Scalable diffusion models with transformers.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172– 4182, 2022. 2, 4, 5

work page 2023
[22]

Audiocaps: Generating captions for audios in the wild

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. pages 119–132, 2019. 2, 4, 6, 7

work page 2019
[23]

Drossos, Samuel Lipping, and Tuomas Virtanen

K. Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset.ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 736–740, 2019. 2, 6, 4

work page 2020
[24]

Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, et al. Wavcaps: A chatgpt-assisted weakly- labelled audio captioning dataset for audio-language multimodal research.IEEE/ACM Transactions on Au- dio, Speech, and Language Processing, 32:3339–3354,

work page
[25]

Jisheng Bai, Haohe Liu, Mou Wang, Dongyuan Shi, Wenwu Wang, et al. Audiosetcaps: An enriched audio- caption dataset using automated generation pipeline with large audio and language models.IEEE Transactions on Audio, Speech and Language Processing, 33:2817–2829,

work page
[26]

Auto-acd: A large-scale dataset for audio-language rep- resentation learning.Proceedings of the 32nd ACM In- ternational Conference on Multimedia, 2023

Luoyi Sun, Xuenan Xu, Mengyue Wu, and Weidi Xie. Auto-acd: A large-scale dataset for audio-language rep- resentation learning.Proceedings of the 32nd ACM In- ternational Conference on Multimedia, 2023. 3, 4

work page 2023
[27]

Yiitan Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, et al. Sound-vecaps: Improving au- dio generation with visually enhanced captions.ICASSP 2025 - 2025 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages 1–5,

work page 2025
[28]

Audiogen-omni: A unified multimodal diffusion transformer for video-synchronized audio, speech, and song generation.ArXiv, abs/2508.00733,

Le Wang, Jun Wang, Chunyu Qiang, Feng Deng, Chen Zhang, et al. Audiogen-omni: A unified multimodal diffusion transformer for video-synchronized audio, speech, and song generation.ArXiv, abs/2508.00733,

work page arXiv
[29]

Uniflow-audio: Unified flow matching for audio generation from omni-modalities.ArXiv, abs/2509.24391, 2025

Xuenan Xu, Jiahao Mei, Zihao Zheng, Ye Tao, Zeyu Xie, et al. Uniflow-audio: Unified flow matching for audio generation from omni-modalities.ArXiv, abs/2509.24391, 2025. 3

work page arXiv 2025
[30]

Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,

Ziyang Ma, Yi Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, et al. Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.ArXiv, abs/2505.13032, 2025. 3

work page arXiv 2025
[31]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, et al. Qwen3-omni technical report.CoRR, abs/2509.17765, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Clap learning audio concepts from natural language supervision.ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Is- mail, and Huaming Wang. Clap learning audio concepts from natural language supervision.ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. 4, 6

work page 2023
[33]

arXiv preprint arXiv:2402.04825 , year=

Zach Evans, CJ Carr, Josiah Taylor, Scott H. Hawley, and Jordi Pons. Fast timing-conditioned latent audio diffu- sion.ArXiv, abs/2402.04825, 2024. 4

work page arXiv 2024
[34]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, S. Longpre, Barret Zoph, Yi Tay, et al. Scaling instruction-finetuned language models.ArXiv, abs/2210.11416, 2022. 5

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Ramesh, Gabriel Goh, et al

Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, et al. Learning transferable visual models from natural language supervision. pages 8748–8763, 2021. 5

work page 2021
[36]

Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman

Vladimir E. Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues.ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329, 2024. 5

work page 2024
[37]

Video-llama: An instruction-tuned audio-visual language model for video understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. pages 543–553, 2023. 6, 7

work page 2023
[38]

Iashin, and Esa Rahtu

Ilpo Viertola, Vladimir E. Iashin, and Esa Rahtu. Tem- porally aligned audio for video with autoregression. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2024. 6

work page 2025
[39]

Imagebind one embedding space to bind them all.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15180–15190, 2023

Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Man- nat Singh, Kalyan Vasudev Alwala, et al. Imagebind one embedding space to bind them all.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15180–15190, 2023. 6, 4

work page 2023
[40]

MusicLM: Generating Music From Text

A. Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, et al. Musiclm: Generating music from text.ArXiv, abs/2301.11325, 2023. 6, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

FSD50K: an open dataset of human-labeled sound events.IEEE ACM Trans

Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events.IEEE ACM Trans. Audio Speech Lang. Process., 30:829–852, 2022. 6, 4

work page 2022
[42]

Ellis, B

Thierry Bertin-Mahieux, D. Ellis, B. Whitman, and Paul Lamere. The million song dataset. pages 591–596, 2011. 6, 4

work page 2011
[43]

Vandergheynst, and X

Michaël Defferrard, Kirell Benzi, P. Vandergheynst, and X. Bresson. Fma: A dataset for music analysis. pages 316–323, 2016. 6, 4

work page 2016
[44]

Ellis, J

Shawn Hershey, Sourish Chaudhuri, D. Ellis, J. Gem- meke, A. Jansen, et al. Cnn architectures for large- scale audio classification.2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135, 2016. 6, 4

work page 2017
[45]

Khaled Koutini, Jan Schlüter, Hamid Eghbalzadeh, and G. Widmer. Efficient training of audio transformers with patchout.ArXiv, abs/2110.05069, 2021. 6, 4

work page arXiv 2021
[46]

Panns: Large-scale pretrained audio neural networks for audio pattern recognition

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, et al. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Lan- guage Processing, 28:2880–2894, 2019. 6, 4

work page 2019
[47]

Improved Techniques for Training GANs

Tim Salimans, I. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, et al. Improved techniques for training gans.ArXiv, abs/1606.03498, 2016. 6, 4

work page internal anchor Pith review Pith/arXiv arXiv 2016
[48]

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoff- man, Brian Ellis, et al. Meta audiobox aesthetics: Uni- fied automatic quality assessment for speech, music, and sound.ArXiv, abs/2502.05139, 2025. 6

work page internal anchor Pith review arXiv 2025
[49]

Chen, Tianyu Zhang, Yuchen Hui, Tay- lor Berg-Kirkpatrick, et al

Yusong Wu, K. Chen, Tianyu Zhang, Yuchen Hui, Tay- lor Berg-Kirkpatrick, et al. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation.ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2022. 6, 4

work page 2023
[50]

Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman

Vladimir E. Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues.ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329, 2024. 6, 4

work page 2024
[51]

Audioldm 2: Learning holistic audio genera- tion with self-supervised pretraining.IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, 32:2871–2883, 2023

Haohe Liu, Qiao Tian, Yiitan Yuan, Xubo Liu, Xinhao Mei, et al. Audioldm 2: Learning holistic audio genera- tion with self-supervised pretraining.IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, 32:2871–2883, 2023. 2

[52]

Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, et al. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization. Proceedings of the 32nd ACM International Conference on Multimedia, 2024.

[53]

Jia-Bin Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, et al. Make-An-Audio 2: Temporal-enhanced text-to-audio generation. ArXiv, abs/2305.18474, 2023.

[54]

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, Sergey Tulyakov, et al. Taming data and transformers for audio generation. CoRR, abs/2406.19388, 2024.

Omni2Sound: Towards Unified Video-Text-to-Audio Generation
Supplementary Material

Overview

This document provides technical details, evaluation protocols, and extended experimen...

Semantic Alignment (MOS-S, Scale 1-4). This metric assesses both Accuracy (factuality of sound events) and Detail (precision of adjectives). The scale is defined as: (1) Factually incorrect/Brief; (2) Mostly incorrect/Brief; (3) Minor errors/Detailed (but visually redundant); and (4) Error-free and Detailed (strictly audio-centric).

Temporal Alignment (MOS-T, Scale 1-3). This evaluates whether the chronological order of described events matches the audio stream. The scale ranges from (1) Disordered, (2) Partially Correct, to (3) Perfectly Ordered. Samples with constant or stationary sounds (lacking distinct temporal events) are marked as N/A and excluded from this metric. Human Ev...

leads on several metrics, this is expected given its massive 100k-hour internal dataset, which is tens of times larger than our SoundAtlas filter derived from VGGSound and AudioSet. Nevertheless, Omni2Sound consistently outperforms all other strong baselines (e.g., MMAudio, AudioX, and ThinkSound) across V2A and VT2A tasks, demonstrating strong generaliza...

Primary Sound Information
• Humans/Animals: speech (talking, shouting), movements (footsteps). Note: Do not transcribe words/lyrics; describe voice characteristics.
• Objects: traffic, office sounds, battlefield, tools.
• Characteristics: Gender/age, language, quantity (monologue/turn-taking), emotional tone, voice qualities.

Background Sounds (if present)
• Natural (wind, rain) or Artificial (city noise, crowds). Briefly specify the environment if necessary.

Music (if present)
• Style/genre, rhythmic features, emotional tone, atmosphere.
• Identifiable instruments and effects (harmonies, reverb).

Detailed Descriptors
• Changes in volume/speed/intensity. Narrative functions.
• Detailed duration, spatial distance, pitch, timbre, texture.

Important Guidelines

Avoid Redundancy: Identify sources once unless they change significantly. Keep it concise.

Prioritize the Audio: Use video description only to clarify ambiguous sounds. If a sound isn’t audible, don’t describe it.

Avoid Hallucinated Sounds: Only describe perceptible sounds. Avoid describing artifacts (e.g., "high-pitched squeal" from edits).

Output Format

Integrate elements into one or few sentences following these rules:
• Language: English.
• Structure: No lists or bullet points.
• Length: Max 40 words. Concise but detailed.
• Temporal Order: Chronological (e.g., "first", ...
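Two of the output-format rules quoted above (the 40-word cap and the ban on lists or bullet points) can be checked mechanically. The following is a minimal sketch of such a check, not part of the paper's pipeline: the function name `check_caption`, the constant `MAX_WORDS`, and the list-detection pattern are all illustrative assumptions, and the language and temporal-order rules are not covered.

```python
import re

# "Length: Max 40 words" from the output-format rules.
MAX_WORDS = 40

# Assumed heuristic: a line that starts with a bullet, dash, asterisk,
# or an enumerator such as "1." or "1)" is treated as a list item.
_LIST_ITEM = re.compile(r"^\s*(?:[•\-*]|\d+[.)])\s+")


def check_caption(caption: str) -> list[str]:
    """Return rule violations; an empty list means both checks pass."""
    problems = []
    n_words = len(caption.split())  # whitespace-delimited word count
    if n_words > MAX_WORDS:
        problems.append(f"too long: {n_words} words (max {MAX_WORDS})")
    if any(_LIST_ITEM.match(line) for line in caption.splitlines()):
        problems.append("contains list or bullet formatting")
    return problems
```

A caption such as "A dog barks twice, then a door slams shut." returns an empty list, while a 41-word caption or one written as bullet lines returns the corresponding message.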
