OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
Pith reviewed 2026-05-10 20:19 UTC · model grok-4.3
The pith
From video and text, a diffusion model with triple cross-attention generates full audio scenes that include on-screen sounds, off-screen sounds, and speech.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniSonic is a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts gating mechanism that adaptively balances their contributions during generation. The model is evaluated on the new UniHAGen-Bench covering three representative on/off-screen speech-environment scenarios and consistently outperforms prior state-of-the-art approaches on both objective metrics and human evaluations.
What carries the argument
TriAttn-DiT, a diffusion transformer that applies three distinct cross-attention layers—one each for on-screen environmental, off-screen environmental, and speech conditioning—together with an MoE gating layer that dynamically weights their contributions during audio synthesis.
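The paper's exact layer design is not reproduced in this summary, so the following is a minimal PyTorch sketch of the pattern the description implies: three parallel cross-attention branches over a shared latent sequence, mixed by a per-token softmax gate. The class name, the single-linear router, and the residual combination are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class TriCrossAttnBlock(nn.Module):
    """Hedged sketch of a TriAttn-style block: three cross-attention branches
    (on-screen env, off-screen env, speech) whose outputs are adaptively mixed
    by a softmax gate. The gating form is an assumption, not the paper's."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)
        )
        self.gate = nn.Linear(dim, 3)  # per-token router over the three branches

    def forward(self, x, cond_on, cond_off, cond_speech):
        branches = []
        for attn, c in zip(self.attn, (cond_on, cond_off, cond_speech)):
            out, _ = attn(x, c, c)  # query = latent audio tokens, key/value = condition
            branches.append(out)
        stacked = torch.stack(branches, dim=-1)        # (B, T, D, 3)
        w = torch.softmax(self.gate(x), dim=-1)        # (B, T, 3) adaptive weights
        return x + (stacked * w.unsqueeze(2)).sum(-1)  # gated residual mix
```

Because the gate is computed per token, such a block could in principle route speech-heavy frames to the speech branch while preserving off-screen ambience elsewhere, which is exactly the balancing behavior the review questions below.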
If this is right
- A single model can now synthesize auditory scenes that mix ambient events, musical instruments, and human speech without needing separate pipelines for speech and non-speech audio.
- The UniHAGen-Bench supplies a standardized testbed that measures both environmental and speech fidelity in the presence of on- and off-screen sources.
- Joint video-text conditioning with adaptive gating yields higher objective scores and listener preference than prior video-only or non-speech joint generators.
- The architecture supports diverse domains while maintaining the ability to render sounds that are not visible on screen.
Where Pith is reading between the lines
- The same three-way conditioning pattern could be applied to other multimodal tasks that require balancing visible, invisible, and linguistic signals, such as generating video from audio descriptions.
- If the MoE router learns stable routing across longer sequences, the framework may scale to minute-long videos without additional architectural changes.
- Replacing multiple task-specific audio generators with one unified model would simplify deployment in applications such as video editing tools or virtual-reality sound design.
Load-bearing premise
The three cross-attention modules together with the MoE gating can process and balance on-screen environmental, off-screen environmental, and speech conditions at the same time without destructive interference or loss of quality.
What would settle it
On UniHAGen-Bench samples that contain simultaneous on-screen action, off-screen events, and spoken dialogue, generate audio with the full TriAttn-DiT model and with each cross-attention module ablated in turn; if any ablated variant matches or exceeds the full model on FAD, CLAP score, and human preference for completeness, the claim that triple attention plus MoE gating is required for balanced holistic output is falsified.
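The settling experiment above can be sketched as a small evaluation harness. Here `ablate`, `generate`, and the metric callables (`fad`, CLAP, human preference) are hypothetical stand-ins, since no evaluation code is given in this summary:

```python
def ablation_study(model, bench_samples, metrics):
    """Score the full model and each single-branch ablation on the same
    benchmark samples; the triple-attention claim survives only if the
    full model wins on every metric. `model.ablate` and `model.generate`
    are assumed interfaces, not the paper's API."""
    variants = {
        "full": model,
        "no_onscreen": model.ablate("onscreen"),
        "no_offscreen": model.ablate("offscreen"),
        "no_speech": model.ablate("speech"),
    }
    results = {}
    for name, m in variants.items():
        # Generate audio for every benchmark clip with this variant.
        outputs = [m.generate(s.video, s.text) for s in bench_samples]
        # Score the generated set under each metric (FAD, CLAP, preference).
        results[name] = {k: fn(outputs, bench_samples) for k, fn in metrics.items()}
    return results
```

Running all four variants on identical samples is what makes the comparison falsifiable: any ablation that ties the full model undercuts the necessity claim.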
read the original abstract
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound, but they are limited to non-speech sounds and lack the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Universal Holistic Audio Generation (UniHAGen), a task for synthesizing complete auditory scenes from video and text that include on-screen environmental sounds, off-screen environmental sounds, and human speech. It introduces OmniSonic, a flow-matching diffusion model featuring a TriAttn-DiT architecture with three parallel cross-attention operations (one each for on-screen environmental sound, off-screen environmental sound, and speech) whose outputs are balanced by a Mixture-of-Experts (MoE) gating mechanism. The authors also release UniHAGen-Bench, a new benchmark of over 1,000 samples spanning three on/off-screen speech-environment scenarios, and claim that OmniSonic consistently outperforms prior state-of-the-art methods on both objective metrics and human evaluations.
Significance. If the reported superiority holds under rigorous verification, the work would advance audio generation by addressing the gap in holistic scene synthesis that jointly handles environmental sounds and speech. The new UniHAGen-Bench could become a useful community resource for standardized evaluation. The TriAttn-DiT + MoE design offers a concrete architectural proposal for multi-condition conditioning, though its practical effectiveness remains to be demonstrated.
major comments (2)
- [TriAttn-DiT and MoE description] The description of the TriAttn-DiT architecture provides no equations, pseudocode, or loss terms defining the MoE gating mechanism or how the three cross-attention outputs are combined and conditioned. This is load-bearing for the central outperformance claim, because the skeptic concern that speech (higher energy, clearer supervision) may dominate and suppress off-screen environmental sounds cannot be evaluated without these details.
- [Experiments section] No information is given on training data composition and size, exact objective metrics and their implementations, baseline reproductions, number of runs, or statistical significance tests. Since the manuscript's primary contribution is the empirical superiority on UniHAGen-Bench, these omissions prevent verification of the results and leave open the possibility of evaluation biases.
minor comments (1)
- [Abstract] The abstract states that OmniSonic 'establishes a strong baseline' but does not quantify the margin of improvement or report variance across seeds; adding these numbers would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We agree that greater technical specificity is required to support the central claims and enable verification. We will revise the manuscript to address both major comments fully, as detailed below.
read point-by-point responses
-
Referee: [TriAttn-DiT and MoE description] The description of the TriAttn-DiT architecture provides no equations, pseudocode, or loss terms defining the MoE gating mechanism or how the three cross-attention outputs are combined and conditioned. This is load-bearing for the central outperformance claim, because the skeptic concern that speech (higher energy, clearer supervision) may dominate and suppress off-screen environmental sounds cannot be evaluated without these details.
Authors: We acknowledge that the current manuscript description of the TriAttn-DiT and MoE components is insufficiently detailed. The architecture employs three parallel cross-attention branches to separately process on-screen environmental, off-screen environmental, and speech conditions, with the MoE gating mechanism intended to adaptively balance their influence and mitigate dominance by higher-energy signals such as speech. We agree this requires explicit formalization. In the revised manuscript we will add the governing equations for each cross-attention operation, the MoE router and expert combination formulas, the overall conditioning mechanism, and the flow-matching loss terms. These additions will allow direct evaluation of how the gating prevents suppression of off-screen sounds. revision: yes
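One conventional formalization the revision could adopt (an assumed sketch in standard notation, not the paper's actual equations) combines per-branch cross-attention outputs, a learned softmax router over them, and the conditional flow-matching training loss:

```latex
h_i = \mathrm{CrossAttn}_i(x, c_i), \qquad i \in \{\text{on-screen},\ \text{off-screen},\ \text{speech}\}
```
```latex
g(x) = \mathrm{softmax}(W_g x) \in \Delta^{2}, \qquad x' = x + \sum_{i} g_i(x)\, h_i
```
```latex
\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,x_1} \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2, \qquad x_t = (1-t)\,x_0 + t\,x_1
```

Under such a form, the referee's dominance concern becomes checkable directly: inspecting the learned gate weights $g_i(x)$ on mixed-scene inputs would show whether the speech branch suppresses the off-screen environmental branch.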
-
Referee: [Experiments section] No information is given on training data composition and size, exact objective metrics and their implementations, baseline reproductions, number of runs, or statistical significance tests. Since the manuscript's primary contribution is the empirical superiority on UniHAGen-Bench, these omissions prevent verification of the results and leave open the possibility of evaluation biases.
Authors: We agree that the Experiments section lacks the information needed for reproducibility and independent verification. In the revised manuscript we will expand this section to specify the training data sources, composition, and total size; the precise definitions and code-level implementations of all objective metrics; the procedures used to reproduce each baseline; the number of independent runs conducted; and the statistical significance tests applied (including p-values). We will also add further analysis of results across the three scenario types in UniHAGen-Bench to address potential evaluation biases. revision: yes
Circularity Check
No circularity in empirical architecture and benchmark proposal
full rationale
The paper proposes a new task (UniHAGen), a flow-matching diffusion model (OmniSonic with TriAttn-DiT and MoE gating), and a new benchmark (UniHAGen-Bench) with experimental comparisons. No equations, derivations, or predictions are presented that reduce to self-definitions, fitted inputs renamed as outputs, or load-bearing self-citations. Performance claims rest on end-to-end training and external evaluations rather than any internal reduction by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- DiT and MoE hyperparameters
axioms (1)
- domain assumption: Flow-matching diffusion can produce coherent multimodally conditioned audio when separate attention streams are balanced by MoE gating.
Reference graph
Works this paper leans on
- [1] Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496, 2018.
- [2] Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, et al. SpeechT5: Unified-modal encoder-decoder pre-training for spoken language processing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5723–5738, 2022.
- [3] Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. Common Voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, 2020.
- [4] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. VGGSound: A large-scale audio-visual dataset. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725, 2020.
- [5] Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, and Justin Salamon. Video-guided foley sound generation with multimodal controls. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18770–18781, 2025.
- [6] Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. MMAudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 28901–28911, 2025.
- [7] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
- [8] Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction guided latent diffusion model. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3590–3598, 2023.
- [9] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- [10] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- [11] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen Video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
- [12] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.
- [13] Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models. In International Conference on Machine Learning, pages 13916–13932. PMLR, 2023.
- [14] Vladimir Iashin and Esa Rahtu. Taming visually guided sound generation. In British Machine Vision Conference (BMVC), 2021.
- [15] Vladimir Iashin, Weidi Xie, Esa Rahtu, and Andrew Zisserman. Synchformer: Efficient synchronization from sparse cues. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5325–5329, 2024.
- [16] Jaemin Jung, Junseok Ahn, Chaeyoung Jung, Tan Dat Nguyen, Youngjoon Jang, and Joon Son Chung. VoiceDiT: Dual-condition diffusion transformer for environment-aware speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025.
- [17] Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Fréchet audio distance: A metric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466, 2018.
- [18] Heeseung Kim, Sungwon Kim, and Sungroh Yoon. Guided-TTS: A diffusion model for text-to-speech via classifier guidance. In International Conference on Machine Learning, pages 11119–11133. PMLR, 2022.
- [19] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations (ICLR), 2014.
- [20] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022–17033, 2020.
- [21] Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. AudioGen: Textually guided audio generation. In The Eleventh International Conference on Learning Representations, 2023.
- [22] Saksham Singh Kushwaha and Yapeng Tian. Vintage: Joint video and text conditioning for holistic audio generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13529–13539.
- [23] Sang-Hoon Lee, Seung-Bin Kim, Ji-Hyun Lee, Eunwoo Song, Min-Jae Hwang, and Seong-Whan Lee. HierSpeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis. Advances in Neural Information Processing Systems, 35:16624–16636, 2022.
- [24] Yeonghyeon Lee, Inmo Yeon, Juhan Nam, and Joon Son Chung. VoiceLDM: Text-to-speech with environmental context. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12566–12571, 2024.
- [25] Bingliang Li, Fengyu Yang, Yuxin Mao, Qingwen Ye, Hongkai Chen, and Yiran Zhong. Tri-Ergon: Fine-grained video-to-audio generation with multi-modal conditions and LUFS control. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4616–4624, 2025.
- [26] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.
- [27] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D. Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models. In International Conference on Machine Learning, pages 21450–21474. PMLR, 2023.
- [28] Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D. Plumbley. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:2871–2883, 2024.
- [29] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023.
- [30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- [31] Simian Luo, Chuanhao Yan, Chenxu Hu, and Hang Zhao. Diff-Foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36:48855–48876, 2023.
- [32] Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, and Soujanya Poria. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 564–572, 2024.
- [33] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
- [34] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [36] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023.
- [37] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
- [38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
- [39] Sizhe Shan, Qiulin Li, Yutao Cui, Miles Yang, Yuehai Wang, Qun Yang, Jin Zhou, and Zhao Zhong. HunyuanVideo-Foley: Multimodal diffusion with representation alignment for high-fidelity foley audio generation. arXiv preprint arXiv:2508.16930, 2025.
- [40] Roy Sheffer and Yossi Adi. I hear your true colors: Image guided audio generation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
- [41] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, 2023.
- [42] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
- [43] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [44] Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, et al. NaturalSpeech: End-to-end text-to-speech synthesis with human-level quality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6):4234–4245, 2024.
- [45] Heng Wang, Jianbo Ma, Santiago Pascual, Richard Cartwright, and Weidong Cai. V2A-Mapper: A lightweight solution for vision-to-audio generation by connecting foundation models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 15492–15501, 2024.
- [46] Yongqi Wang, Wenxiang Guo, Rongjie Huang, Jiawei Huang, Zehan Wang, Fuming You, Ruiqi Li, and Zhou Zhao. Frieren: Efficient video-to-audio generation network with rectified flow matching. Advances in Neural Information Processing Systems, 37:128118–128138, 2024.
- [47] Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello. Wav2CLIP: Learning robust audio representations from CLIP. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4563–4567, 2022.
- [48] Zhifeng Xie, Shengye Yu, Qile He, and Mengtian Li. SonicVisionLM: Playing sound with vision language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26866–26875, 2024.
- [49] Yiming Zhang, Yicheng Gu, Yanhong Zeng, Zhening Xing, Yuancheng Wang, Zhizheng Wu, and Kai Chen. FoleyCrafter: Bring silent videos to life with lifelike and synchronized sounds. arXiv preprint arXiv:2407.01494, 2024.
- [50] Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, and Tamara L. Berg. Visual to sound: Generating natural sound for videos in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3550–3558, 2018.
discussion (0)