pith. machine review for the scientific record.

arxiv: 2604.04348 · v1 · submitted 2026-04-06 · 💻 cs.SD · cs.CV · cs.MM

Recognition: no theorem link

OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 20:19 UTC · model grok-4.3

classification 💻 cs.SD · cs.CV · cs.MM
keywords audio generation · video-to-audio · text-to-audio · diffusion models · flow matching · multimodal generation · speech synthesis · holistic audio

The pith

Conditioned on video and text, a diffusion model with triple cross-attention generates full audio scenes that include on-screen sounds, off-screen sounds, and speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new task, universal holistic audio generation, that requires producing complete auditory scenes containing both visible and invisible environmental sounds plus human speech. It presents OmniSonic, a flow-matching diffusion model whose TriAttn-DiT backbone runs three separate cross-attention operations, one for each condition type (on-screen environmental sound, off-screen environmental sound, and speech), while an MoE gate balances their influence at every generation step. The authors also release UniHAGen-Bench, a benchmark of more than one thousand samples covering ambient, instrumental, and spoken on/off-screen scenarios. Experiments show that this single architecture surpasses earlier video-only and non-speech joint models on standard metrics and in listener studies, indicating that unified conditioning can replace the current patchwork of specialized generators.

Core claim

OmniSonic is a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts gating mechanism that adaptively balances their contributions during generation. The model is evaluated on the new UniHAGen-Bench covering three representative on/off-screen speech-environment scenarios and consistently outperforms prior state-of-the-art approaches on both objective metrics and human evaluations.
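
For orientation, the objective behind such a framework is typically a conditional flow-matching loss. The sketch below shows the generic rectified-flow formulation with video and text conditions folded into the velocity network; this is an assumed standard form, not the paper's exact loss, which is not reproduced on this page.

```latex
% Generic rectified flow-matching objective (an assumed form; the paper's
% exact loss and conditioning interface are not reproduced here).
\[
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\mathrm{data}},\; t \sim \mathcal{U}[0,1],
\]
\[
\mathcal{L}_{\mathrm{FM}}(\theta) =
\mathbb{E}_{t,\,x_0,\,x_1}\,
\bigl\| v_\theta(x_t,\, t,\, c_{\mathrm{video}},\, c_{\mathrm{text}}) - (x_1 - x_0) \bigr\|_2^2 .
\]
```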

What carries the argument

TriAttn-DiT, a diffusion transformer that applies three distinct cross-attention layers—one each for on-screen environmental, off-screen environmental, and speech conditioning—together with an MoE gating layer that dynamically weights their contributions during audio synthesis.
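
A minimal sketch of how such a block could be wired, assuming a softmax router over the three branch outputs. Module names, dimensions, and the exact combination rule are illustrative guesses rather than the authors' released code, and timestep and video-feature conditioning are omitted for brevity.

```python
import torch
import torch.nn as nn


class TriAttnBlock(nn.Module):
    """Illustrative TriAttn-DiT-style block: self-attention over audio latents,
    three parallel cross-attention branches (on-screen environmental, off-screen
    environmental, speech), and a learned gate that mixes the branch outputs.
    The gating rule and names are assumptions, not the authors' code."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One cross-attention branch per condition stream.
        self.cross_on = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_off = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_speech = nn.MultiheadAttention(dim, heads, batch_first=True)
        # MoE-style router producing per-token weights over the three branches.
        self.router = nn.Linear(dim, 3)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, c_on, c_off, c_speech):
        # x: (B, T, dim) noisy audio latents; c_*: (B, L_*, dim) condition tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]

        h = self.norm2(x)
        branch_on = self.cross_on(h, c_on, c_on, need_weights=False)[0]
        branch_off = self.cross_off(h, c_off, c_off, need_weights=False)[0]
        branch_speech = self.cross_speech(h, c_speech, c_speech, need_weights=False)[0]

        # The gate adaptively balances the three condition streams per token.
        gate = torch.softmax(self.router(h), dim=-1)  # (B, T, 3)
        x = x + (
            gate[..., 0:1] * branch_on
            + gate[..., 1:2] * branch_off
            + gate[..., 2:3] * branch_speech
        )
        return x + self.mlp(self.norm3(x))
```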

If this is right

  • A single model can now synthesize auditory scenes that mix ambient events, musical instruments, and human speech without needing separate pipelines for speech and non-speech audio.
  • The UniHAGen-Bench supplies a standardized testbed that measures both environmental and speech fidelity in the presence of on- and off-screen sources.
  • Joint video-text conditioning with adaptive gating yields higher objective scores and listener preference than prior video-only or non-speech joint generators.
  • The architecture supports diverse domains while maintaining the ability to render sounds that are not visible on screen.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same three-way conditioning pattern could be applied to other multimodal tasks that require balancing visible, invisible, and linguistic signals, such as generating video from audio descriptions.
  • If the MoE router learns stable routing across longer sequences, the framework may scale to minute-long videos without additional architectural changes.
  • Replacing multiple task-specific audio generators with one unified model would simplify deployment in applications such as video editing tools or virtual-reality sound design.

Load-bearing premise

The three cross-attention modules together with the MoE gating can process and balance on-screen environmental, off-screen environmental, and speech conditions at the same time without destructive interference or loss of quality.
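
One way this premise could be made explicit (an editorial sketch; the paper does not publish these equations): with latent query tokens h and condition token sets C_on, C_off, C_sp, the branches and gate could combine as

```latex
% Assumed form of the gated three-branch combination; illustrative only,
% not equations taken from the paper.
\[
z_k = \mathrm{CrossAttn}_k(h,\, C_k), \quad k \in \{\mathrm{on},\, \mathrm{off},\, \mathrm{sp}\},
\qquad
g = \mathrm{softmax}(W_r\, h) \in \Delta^{2},
\qquad
z = \sum_{k} g_k\, z_k .
\]
```

The premise then amounts to the claim that no single g_k collapses toward dominating the mixture across timesteps or token positions, e.g., that the speech branch does not drown out the off-screen environmental branch.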

What would settle it

On UniHAGen-Bench samples that contain simultaneous on-screen action, off-screen events, and spoken dialogue, generate audio with the full TriAttn-DiT model and with each cross-attention module ablated in turn; if any ablated variant matches or exceeds the full model on FAD, CLAP score, and human preference for completeness, the claim that triple attention plus MoE gating is required for balanced holistic output is falsified.
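
A sketch of the objective-metric half of that protocol is shown below (human preference would still need a separate listening study). `load_model`, `generate`, `fad`, and `clap_score` are hypothetical placeholders for a per-variant checkpoint loader, the sampler, and metric implementations; none of them are APIs released with the paper.

```python
# Hypothetical ablation harness for the falsification test described above.
VARIANTS = ["full", "no_on_screen_attn", "no_off_screen_attn", "no_speech_attn"]


def run_ablation(bench_samples, load_model, generate, fad, clap_score):
    """Compare the full TriAttn-DiT model against single-branch ablations on
    samples that mix on-screen action, off-screen events, and dialogue."""
    results = {}
    for variant in VARIANTS:
        model = load_model(variant)  # hypothetical checkpoint per variant
        generated = [generate(model, s.video, s.text) for s in bench_samples]
        references = [s.reference_audio for s in bench_samples]
        results[variant] = {
            "FAD": fad(generated, references),  # lower is better
            "CLAP": sum(clap_score(g, s.text) for g, s in zip(generated, bench_samples))
            / len(bench_samples),  # higher is better
        }
    # The triple-attention claim is falsified if any ablated variant is at
    # least as good as the full model on both objective metrics.
    full = results["full"]
    falsified = any(
        results[v]["FAD"] <= full["FAD"] and results[v]["CLAP"] >= full["CLAP"]
        for v in VARIANTS
        if v != "full"
    )
    return results, falsified
```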

Figures

Figures reproduced from arXiv: 2604.04348 by Kai Wang, Saksham Singh Kushwaha, Shijian Deng, Weiguo Pian, Yapeng Tian, Yunhui Guo, Zhimin Chen.

Figure 1: Illustration of the proposed Universal Holistic Audio …
Figure 2: (A) Overview of our proposed OmniSonic, which mainly consists of an environmental text encoder (FLAN-T5), a speech …
Figure 3: Visualization of the spectrograms of generated audios and the ground truth.
Figure 4: Ablation study on the MoE Gating module using in-the-…
Figure 5: Interface for the subjective evaluation.
Original abstract

In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sounds, but they are limited to non-speech audio, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Universal Holistic Audio Generation (UniHAGen), a task for synthesizing complete auditory scenes from video and text that include on-screen environmental sounds, off-screen environmental sounds, and human speech. It introduces OmniSonic, a flow-matching diffusion model featuring a TriAttn-DiT architecture with three parallel cross-attention operations (one each for on-screen env, off-screen env, and speech) whose outputs are balanced by a Mixture-of-Experts (MoE) gating mechanism. The authors also release UniHAGen-Bench, a new benchmark of over 1,000 samples spanning three on/off-screen speech-environment scenarios, and claim that OmniSonic consistently outperforms prior state-of-the-art methods on both objective metrics and human evaluations.

Significance. If the reported superiority holds under rigorous verification, the work would advance audio generation by addressing the gap in holistic scene synthesis that jointly handles environmental sounds and speech. The new UniHAGen-Bench could become a useful community resource for standardized evaluation. The TriAttn-DiT + MoE design offers a concrete architectural proposal for multi-condition conditioning, though its practical effectiveness remains to be demonstrated.

major comments (2)
  1. [TriAttn-DiT and MoE description] The description of the TriAttn-DiT architecture provides no equations, pseudocode, or loss terms defining the MoE gating mechanism or how the three cross-attention outputs are combined and conditioned. This is load-bearing for the central outperformance claim, because the skeptic concern that speech (higher energy, clearer supervision) may dominate and suppress off-screen environmental sounds cannot be evaluated without these details.
  2. [Experiments section] No information is given on training data composition and size, exact objective metrics and their implementations, baseline reproductions, number of runs, or statistical significance tests. Since the manuscript's primary contribution is the empirical superiority on UniHAGen-Bench, these omissions prevent verification of the results and leave open the possibility of evaluation biases.
minor comments (1)
  1. [Abstract] The abstract states that OmniSonic 'establishes a strong baseline' but does not quantify the margin of improvement or report variance across seeds; adding these numbers would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We agree that greater technical specificity is required to support the central claims and enable verification. We will revise the manuscript to address both major comments fully, as detailed below.

Point-by-point responses
  1. Referee: [TriAttn-DiT and MoE description] The description of the TriAttn-DiT architecture provides no equations, pseudocode, or loss terms defining the MoE gating mechanism or how the three cross-attention outputs are combined and conditioned. This is load-bearing for the central outperformance claim, because the skeptic concern that speech (higher energy, clearer supervision) may dominate and suppress off-screen environmental sounds cannot be evaluated without these details.

    Authors: We acknowledge that the current manuscript description of the TriAttn-DiT and MoE components is insufficiently detailed. The architecture employs three parallel cross-attention branches to separately process on-screen environmental, off-screen environmental, and speech conditions, with the MoE gating mechanism intended to adaptively balance their influence and mitigate dominance by higher-energy signals such as speech. We agree this requires explicit formalization. In the revised manuscript we will add the governing equations for each cross-attention operation, the MoE router and expert combination formulas, the overall conditioning mechanism, and the flow-matching loss terms. These additions will allow direct evaluation of how the gating prevents suppression of off-screen sounds. revision: yes

  2. Referee: [Experiments section] No information is given on training data composition and size, exact objective metrics and their implementations, baseline reproductions, number of runs, or statistical significance tests. Since the manuscript's primary contribution is the empirical superiority on UniHAGen-Bench, these omissions prevent verification of the results and leave open the possibility of evaluation biases.

    Authors: We agree that the Experiments section lacks the information needed for reproducibility and independent verification. In the revised manuscript we will expand this section to specify the training data sources, composition, and total size; the precise definitions and code-level implementations of all objective metrics; the procedures used to reproduce each baseline; the number of independent runs conducted; and the statistical significance tests applied (including p-values). We will also add further analysis of results across the three scenario types in UniHAGen-Bench to address potential evaluation biases. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical architecture and benchmark proposal

full rationale

The paper proposes a new task (UniHAGen), a flow-matching diffusion model (OmniSonic with TriAttn-DiT and MoE gating), and a new benchmark (UniHAGen-Bench) with experimental comparisons. No equations, derivations, or predictions are presented that reduce to self-definitions, fitted inputs renamed as outputs, or load-bearing self-citations. Performance claims rest on end-to-end training and external evaluations rather than any internal reduction by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Relies on standard assumptions of diffusion/flow-matching generative models and learned neural network parameters; introduces new architectural modules but no new physical entities.

free parameters (1)
  • DiT and MoE hyperparameters
    Numerous learned weights, attention dimensions, expert counts, and gating parameters tuned during training.
axioms (1)
  • domain assumption: Flow-matching diffusion can produce coherent multimodal-conditioned audio when separate attention streams are balanced by MoE gating.
    Invoked to justify the TriAttn-DiT design choice.

pith-pipeline@v0.9.0 · 5569 in / 1240 out tokens · 31413 ms · 2026-05-10T20:19:50.825540+00:00 · methodology

