Omni2Sound: Towards Unified Video-Text-to-Audio Generation
Pith reviewed 2026-05-16 17:22 UTC · model grok-4.3
The pith
A single diffusion model generates audio from video alone, text alone, or both at state-of-the-art levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Omni2Sound is a unified video-text-to-audio diffusion model that, trained with a three-stage multi-task progressive schedule on the SoundAtlas dataset, reaches state-of-the-art results on video-to-audio, text-to-audio, and joint video-text-to-audio generation inside a single model while preserving alignment and off-screen audio fidelity.
What carries the argument
The three-stage multi-task progressive training schedule that integrates tasks sequentially to convert cross-task competition into joint optimization and to lessen modality bias when video and text conditions are combined.
If this is right
- A single model maintains or exceeds separate-task performance across video-to-audio, text-to-audio, and combined inputs.
- The model generates both on-screen and off-screen audio faithfully when given mixed video-text conditions.
- Generalization remains strong on benchmarks that use different input combinations without task-specific adjustments.
- Deployment simplifies because one checkpoint handles all three generation modes.
Where Pith is reading between the lines
- Similar staged training might reduce task interference in other multimodal generators such as text-to-video or image-to-sound models.
- The agent-based caption pipeline could be reused to create aligned datasets for additional modalities or languages.
- Unified models of this kind could lower the cost of building production tools that accept any mix of video and text prompts.
Load-bearing premise
The three-stage training schedule is what resolves competition and bias, without hidden data exclusions or tuning steps that would change the reported numbers.
What would settle it
Retraining the identical DiT model on the same data but with a single joint training stage from the start produces measurable drops in any of the three tasks relative to the staged schedule results.
Figures
read the original abstract
Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight V-A-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5$\times$ cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight V-A-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SoundAtlas, a 470k-pair video-audio-text dataset constructed via an agentic pipeline (Vision-to-Language Compression, Junior-Senior Agent Handoff, Post-hoc Filtering) to address data scarcity and alignment issues. It presents Omni2Sound, a unified DiT-based diffusion model supporting flexible V2A, T2A, and VT2A inputs, trained with a three-stage multi-task progressive schedule intended to convert cross-task competition into joint optimization and reduce modality bias. The work constructs the VGGSound-Omni benchmark (including off-screen tracks) and claims unified SOTA performance across all three tasks within a single model.
Significance. If the SOTA claims and the effectiveness of the progressive schedule are substantiated by quantitative metrics and ablations, the work would be significant for unified multimodal audio generation: it shows that a standard DiT backbone can handle heterogeneous conditioning while preserving audio-visual alignment and off-screen faithfulness. The agentic dataset pipeline (with explicit cost-reduction and bias-mitigation steps) is a concrete, reusable contribution that could improve caption quality for future V-A-T models.
major comments (2)
- [Abstract] Abstract: the claim of 'unified SOTA performance across all three tasks within a single model' is stated without any quantitative metrics (FID, KL, CLAP, etc.), error bars, baseline comparisons, or dataset statistics, so the central performance claim cannot be evaluated from the provided text.
- [Abstract] Abstract (three-stage schedule): the description that the schedule 'converts cross-task competition into joint optimization and mitigates modality bias' lacks stage definitions, loss-weighting schedules, per-stage curves, or direct comparison to a joint-training baseline; without these the resolution of the V2A-T2A trade-off and VT2A off-screen gap remains unverified.
minor comments (1)
- [Abstract] Abstract: '470k pairs' should be clarified as exact or approximate and accompanied by a breakdown of video-only, text-only, and paired samples.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments on the abstract below and will revise the manuscript to improve clarity and evaluability of our claims while preserving the core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of 'unified SOTA performance across all three tasks within a single model' is stated without any quantitative metrics (FID, KL, CLAP, etc.), error bars, baseline comparisons, or dataset statistics, so the central performance claim cannot be evaluated from the provided text.
Authors: We agree that the abstract would benefit from including representative quantitative metrics to support the SOTA claim. The full manuscript already contains these details (FID, KL, CLAP scores with error bars, baseline comparisons, and dataset statistics) in Section 4 and Tables 1-3. In the revision we will add a concise set of key numbers (e.g., average relative improvements) to the abstract to make the central claim evaluable without exceeding length limits. revision: yes
-
Referee: [Abstract] Abstract (three-stage schedule): the description that the schedule 'converts cross-task competition into joint optimization and mitigates modality bias' lacks stage definitions, loss-weighting schedules, per-stage curves, or direct comparison to a joint-training baseline; without these the resolution of the V2A-T2A trade-off and VT2A off-screen gap remains unverified.
Authors: The three-stage schedule, including stage definitions, loss-weighting, per-stage curves, and joint-training ablations, is fully specified in Section 3.2 and evaluated in Section 4.4. We acknowledge the abstract description is high-level. In revision we will briefly enumerate the stages and explicitly reference the ablation results to better convey how cross-task competition is resolved, while keeping the abstract concise. revision: partial
Circularity Check
No circularity; claims rest on new dataset and training procedure
full rationale
The paper introduces SoundAtlas via an agentic pipeline (Vision-to-Language Compression, Junior-Senior Agent Handoff, Post-hoc Filtering) and Omni2Sound via a three-stage multi-task progressive training schedule on a standard DiT backbone. These are presented as independent empirical contributions that produce the reported unified SOTA results across V2A/T2A/VT2A. No equations, derivations, or first-principles results are described that reduce to fitted parameters defined by the target outcome, self-citations that bear the central load, or ansatzes smuggled from prior author work. The performance claims are framed as outcomes of the new data and schedule rather than self-referential constructions, making the derivation chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- three-stage training hyperparameters
axioms (1)
- domain assumption Diffusion models conditioned on video and text can generate temporally aligned audio
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.
-
WavFlow: Audio Generation in Waveform Space
WavFlow performs direct waveform audio generation via flow matching on 2D token grids from raw patches plus amplitude lifting, matching latent-based methods on VGGSound and AudioCaps without intermediate compression.
Reference graph
Works this paper leans on
- [1]
-
[2]
Audioldm: Text-to-audio generation with la- tent diffusion models
Haohe Liu, Zehua Chen, Yiitan Yuan, Xinhao Mei, Xubo Liu, et al. Audioldm: Text-to-audio generation with la- tent diffusion models. pages 21450–21474, 2023. 6, 4
work page 2023
-
[3]
Zach Evans, Julian Parker, CJ Carr, Zack Zukowski, Josiah Taylor, et al. Stable audio open.ICASSP 2025 - 2025 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages 1–5,
work page 2025
-
[4]
Text-to-audio generation using instruction- tuned LLM and latent diffusion model,
Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction-tuned llm and latent diffusion model. ArXiv, abs/2304.13731, 2023. 1
-
[5]
Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models.ArXiv, abs/2306.17203, 2023. 1
-
[6]
Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jia-Bin Huang, Zehan Wang, et al. Frieren: Efficient video-to- audio generation with rectified flow matching.ArXiv, abs/2406.00320, 2024. 6
-
[7]
Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, et al. Foleycrafter: Bring silent videos to life with lifelike and synchronized sounds.In- ternational Journal of Computer Vision, 134, 2024
work page 2024
-
[8]
Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, et al. Video-guided foley sound generation with multimodal controls.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18770–18781, 2024. 1
work page 2025
-
[9]
Saksham Singh Kushwaha and Yapeng Tian. Vintage: Joint video and text conditioning for holistic audio gener- ation.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13529–13539,
work page 2025
-
[10]
Rongjie Huang, Dongchao Yang, Huadai Liu, Xixin Wu, and Helen M. Meng. Reasonaudio: Semantic reasoning and temporal synchrony in video–text-to-audio genera- tion, 2025. 1
work page 2025
-
[11]
Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, et al. Hunyuanvideo-foley: Multimodal diffusion with representation alignment for high-fidelity foley au- dio generation.ArXiv, abs/2508.16930, 2025. 1, 2, 6, 7, 3
-
[12]
Huadai Liu, Jialei Wang, Kaicheng Luo, Wen Wang, Qian Chen, et al. Thinksound: Chain-of-thought rea- soning in multimodal large language models for audio generation and editing.ArXiv, abs/2506.21448, 2025. 1, 6, 7
-
[13]
Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander G. Schwing, et al. Mmaudio: Tam- ing multimodal joint training for high-quality video-to- audio synthesis.2025 IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 28901–28911, 2024. 1, 2, 3, 5, 6, 7, 4
work page 2025
-
[14]
AudioX: A Unified Framework for Anything-to-Audio Generation
Zeyue Tian, Yizhu Jin, Zhaoyang Liu, Ruibin Yuan, Xu Tan, et al. Audiox: Diffusion transformer for anything-to-audio generation.ArXiv, abs/2503.10522,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, et al. Detecting and mitigating inser- tion hallucination in video-to-audio generation.ArXiv, abs/2510.08078, 2025. 2
-
[16]
Vedaldi, and Andrew Zis- serman
Honglie Chen, Weidi Xie, A. Vedaldi, and Andrew Zis- serman. Vggsound: A large-scale audio-visual dataset. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725, 2020. 2, 3, 4, 5, 6
work page 2020
-
[17]
J. Gemmeke, D. Ellis, Dylan Freedman, A. Jansen, W. Lawrence, et al. Audio set: An ontology and human- labeled dataset for audio events.2017 IEEE Interna- tional Conference on Acoustics, Speech and Signal Pro- cessing (ICASSP), pages 776–780, 2017. 2, 3, 4, 6
work page 2017
-
[18]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team. Gemini: A family of highly capable mul- timodal models.CoRR, abs/2312.11805, 2023. 2, 3, 4, 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
André Krouwel.Party Models, page 249–269. SAGE Publications Ltd, 2006. 2, 4
work page 2006
-
[20]
Vggsounder: Audio-visual evaluations for foundation models.ArXiv, abs/2508.08237, 2025
Daniil Zverev, Thaddaus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, et al. Vggsounder: Audio-visual evaluations for foundation models.ArXiv, abs/2508.08237, 2025. 2, 5, 6
-
[21]
William S. Peebles and Saining Xie. Scalable diffusion models with transformers.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172– 4182, 2022. 2, 4, 5
work page 2023
-
[22]
Audiocaps: Generating captions for audios in the wild
Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. pages 119–132, 2019. 2, 4, 6, 7
work page 2019
-
[23]
Drossos, Samuel Lipping, and Tuomas Virtanen
K. Drossos, Samuel Lipping, and Tuomas Virtanen. Clotho: an audio captioning dataset.ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 736–740, 2019. 2, 6, 4
work page 2020
-
[24]
Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, et al. Wavcaps: A chatgpt-assisted weakly- labelled audio captioning dataset for audio-language multimodal research.IEEE/ACM Transactions on Au- dio, Speech, and Language Processing, 32:3339–3354,
-
[25]
Jisheng Bai, Haohe Liu, Mou Wang, Dongyuan Shi, Wenwu Wang, et al. Audiosetcaps: An enriched audio- caption dataset using automated generation pipeline with large audio and language models.IEEE Transactions on Audio, Speech and Language Processing, 33:2817–2829,
-
[26]
Luoyi Sun, Xuenan Xu, Mengyue Wu, and Weidi Xie. Auto-acd: A large-scale dataset for audio-language rep- resentation learning.Proceedings of the 32nd ACM In- ternational Conference on Multimedia, 2023. 3, 4
work page 2023
-
[27]
Yiitan Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, et al. Sound-vecaps: Improving au- dio generation with visually enhanced captions.ICASSP 2025 - 2025 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages 1–5,
work page 2025
-
[28]
Le Wang, Jun Wang, Chunyu Qiang, Feng Deng, Chen Zhang, et al. Audiogen-omni: A unified multimodal diffusion transformer for video-synchronized audio, speech, and song generation.ArXiv, abs/2508.00733,
-
[29]
Xuenan Xu, Jiahao Mei, Zihao Zheng, Ye Tao, Zeyu Xie, et al. Uniflow-audio: Unified flow matching for audio generation from omni-modalities.ArXiv, abs/2509.24391, 2025. 3
-
[30]
Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,
Ziyang Ma, Yi Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, et al. Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix.ArXiv, abs/2505.13032, 2025. 3
-
[31]
Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, et al. Qwen3-omni technical report.CoRR, abs/2509.17765, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Is- mail, and Huaming Wang. Clap learning audio concepts from natural language supervision.ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. 4, 6
work page 2023
-
[33]
arXiv preprint arXiv:2402.04825 , year=
Zach Evans, CJ Carr, Josiah Taylor, Scott H. Hawley, and Jordi Pons. Fast timing-conditioned latent audio diffu- sion.ArXiv, abs/2402.04825, 2024. 4
-
[34]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, S. Longpre, Barret Zoph, Yi Tay, et al. Scaling instruction-finetuned language models.ArXiv, abs/2210.11416, 2022. 5
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[35]
Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, et al. Learning transferable visual models from natural language supervision. pages 8748–8763, 2021. 5
work page 2021
-
[36]
Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman
Vladimir E. Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues.ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329, 2024. 5
work page 2024
-
[37]
Video-llama: An instruction-tuned audio-visual language model for video understanding
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. pages 543–553, 2023. 6, 7
work page 2023
-
[38]
Ilpo Viertola, Vladimir E. Iashin, and Esa Rahtu. Tem- porally aligned audio for video with autoregression. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2024. 6
work page 2025
-
[39]
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Man- nat Singh, Kalyan Vasudev Alwala, et al. Imagebind one embedding space to bind them all.2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15180–15190, 2023. 6, 4
work page 2023
-
[40]
MusicLM: Generating Music From Text
A. Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, et al. Musiclm: Generating music from text.ArXiv, abs/2301.11325, 2023. 6, 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
FSD50K: an open dataset of human-labeled sound events.IEEE ACM Trans
Eduardo Fonseca, Xavier Favory, Jordi Pons, Frederic Font, and Xavier Serra. FSD50K: an open dataset of human-labeled sound events.IEEE ACM Trans. Audio Speech Lang. Process., 30:829–852, 2022. 6, 4
work page 2022
- [42]
-
[43]
Michaël Defferrard, Kirell Benzi, P. Vandergheynst, and X. Bresson. Fma: A dataset for music analysis. pages 316–323, 2016. 6, 4
work page 2016
- [44]
- [45]
-
[46]
Panns: Large-scale pretrained audio neural networks for audio pattern recognition
Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, et al. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Lan- guage Processing, 28:2880–2894, 2019. 6, 4
work page 2019
-
[47]
Improved Techniques for Training GANs
Tim Salimans, I. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, et al. Improved techniques for training gans.ArXiv, abs/1606.03498, 2016. 6, 4
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[48]
Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoff- man, Brian Ellis, et al. Meta audiobox aesthetics: Uni- fied automatic quality assessment for speech, music, and sound.ArXiv, abs/2502.05139, 2025. 6
work page internal anchor Pith review arXiv 2025
-
[49]
Chen, Tianyu Zhang, Yuchen Hui, Tay- lor Berg-Kirkpatrick, et al
Yusong Wu, K. Chen, Tianyu Zhang, Yuchen Hui, Tay- lor Berg-Kirkpatrick, et al. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation.ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2022. 6, 4
work page 2023
-
[50]
Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman
Vladimir E. Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues.ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329, 2024. 6, 4
work page 2024
-
[51]
Haohe Liu, Qiao Tian, Yiitan Yuan, Xubo Liu, Xinhao Mei, et al. Audioldm 2: Learning holistic audio genera- tion with self-supervised pretraining.IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, 32:2871–2883, 2023. 2
work page 2023
-
[52]
Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, et al. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization.Proceedings of the 32nd ACM International Conference on Multimedia, 2024. 2
work page 2024
-
[53]
Make-An-Audio 2: Temporal-enhanced text-to- audio generation,
Jia-Bin Huang, Yi Ren, Rongjie Huang, Dongchao Yang, Zhenhui Ye, et al. Make-an-audio 2: Temporal-enhanced text-to-audio generation.ArXiv, abs/2305.18474, 2023. 2
-
[54]
Taming data and transformers for audio generation
Moayed Haji-Ali, Willi Menapace, Aliaksandr Siaro- hin, Guha Balakrishnan, Sergey Tulyakov, et al. Tam- ing data and transformers for audio generation.CoRR, abs/2406.19388, 2024. 2 Omni2Sound: Towards Unified Video-Text-to-Audio Generation Supplementary Material OverviewThis document provides technical details, evaluation protocols, and extended experimen...
-
[55]
Semantic Alignment (MOS-S, Scale 1-4).This met- ric assesses bothAccuracy(factuality of sound events) andDetail(precision of adjectives). The scale is de- fined as: (1) Factually incorrect/Brief; (2) Mostly in- correct/Brief; (3) Minor errors/Detailed (but visually re- dundant); and (4) Error-free and Detailed (strictly audio- centric)
-
[56]
V”) labels, re- taining only those with Audio-Visual (“A V
Temporal Alignment (MOS-T, Scale 1-3).This evaluates whether the chronological order of described events matches the audio stream. The scale ranges from (1) Disordered, (2) Partially Correct, to (3) Perfectly Or- dered. Samples with constant or stationary sounds (lack- ing distinct temporal events) are marked asN/Aand ex- cluded from this metric. Human Ev...
work page 1915
-
[57]
leads on several metrics, this is expected given its massive 100k-hour internal dataset, which is tens of times larger than our SoundAtlas filter derived from VGGSound and AudioSet. Nevertheless, Omni2Sound consistently outperforms all other strong baselines (e.g., MMAudio, AudioX, and ThinkSound) across V2A and VT2A tasks, demonstrating strong generaliza...
-
[58]
•Objects:traffic, office sounds, battlefield, tools
Primary Sound Information •Humans/Animals:speech (talking, shouting), movements (footsteps).Note: Do not transcribe words/lyrics; describe voice characteristics. •Objects:traffic, office sounds, battlefield, tools. •Characteristics:Gender/age, language, quantity (monologue/turn-taking), emotional tone, voice quali- ties
-
[59]
Briefly specify the environment if necessary
Background Sounds (if present) •Natural (wind, rain) or Artificial (city noise, crowds). Briefly specify the environment if necessary
-
[60]
•Identifiable instruments and effects (harmonies, reverb)
Music (if present) •Style/genre, rhythmic features, emotional tone, atmosphere. •Identifiable instruments and effects (harmonies, reverb)
-
[61]
Detailed Descriptors •Changes in volume/speed/intensity. Narrative functions. •Detailed duration, spatial distance, pitch, timbre, texture. Important Guidelines
-
[62]
Avoid Redundancy:Identify sources once unless they change significantly. Keep it concise
-
[63]
If a sound isn’t audible, don’t describe it
Prioritize the Audio:Use video descriptiononlyto clarify ambiguous sounds. If a sound isn’t audible, don’t describe it
-
[64]
Avoid Hallucinated Sounds:Only describe perceptible sounds. Avoid describing artifacts (e.g., "high- pitched squeal" from edits). Output Format Integrate elements intoone or few sentencesfollowing these rules: •Language:English. •Structure:No lists or bullet points. •Length:Max 40 words. Concise but detailed. •Temporal Order:Chronological (e.g., "first", ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.