pith. sign in

arxiv: 2605.30940 · v1 · pith:BTOKXAPGnew · submitted 2026-05-29 · 📡 eess.AS · cs.MM· cs.SD

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Pith reviewed 2026-06-28 21:15 UTC · model grok-4.3

classification 📡 eess.AS cs.MMcs.SD
keywords spatial audio generationautoregressive diffusion transformerstreaming synthesisvideo-to-audiotext-to-audiocontrastive learningpreference optimizationpanoramic video
0
0 comments X

The pith

SwanSphere generates streaming spatial audio from panoramic videos and text using a causal autoregressive diffusion transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SwanSphere to solve the quality-versus-latency tradeoff in spatial audio synthesis while better capturing spatial cues from multimodal inputs. It builds a unified streaming framework around a causal autoregressive diffusion transformer that produces high-fidelity output in real time. A Spatial Video-Audio Contrastive learning strategy aligns the video encoder to the acoustic domain, and a multi-objective online direct preference optimization scheme strengthens spatial perception during synthesis. An automated annotation pipeline supplies detailed spatial captions to enlarge training data. Experiments show the resulting model outperforms prior methods on both video-to-spatial and text-to-spatial audio tasks.

Core claim

SwanSphere is a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts, built on a causal autoregressive diffusion transformer architecture, enhanced by a Spatial Video-Audio Contrastive learning strategy and a multi-objective online direct preference optimization scheme, and supported by an automated spatial caption annotation pipeline, achieving superior performance in video-to-spatial and text-to-spatial generation tasks.

What carries the argument

The causal autoregressive diffusion transformer architecture that produces streaming output while preserving spatial fidelity, together with the SVAC contrastive alignment and multi-objective ODPO optimization that enforce domain matching and preference-driven refinement.

Load-bearing premise

The SVAC learning strategy and multi-objective ODPO scheme will produce strong spatial perception and robust multimodal synthesis.

What would settle it

A controlled experiment in which SwanSphere audio is rated no better than strong baselines on objective spatial localization accuracy or subjective synchronization scores when conditioned on the same panoramic video or text inputs.

Figures

Figures reproduced from arXiv: 2605.30940 by Changhao Pan, Ke Lei, Ruiqi Li, Wenxiang Guo, Xueyi Pu, Yu Zhang, Zhou Zhao.

Figure 1
Figure 1. Figure 1: Overview. Left: The pipeline of audio caption generation. Middle: The streaming inference diagram of SwanSphere, which simultaneously supports panoramic video and textual descriptions as inputs. Right: Example results generated by SwanSphere. As shown above, our model accurately captures the spatial audio variation as the marching band moves from the front to the right side of the scene, manifested by a gr… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the SwanSphere framework. The left side illustrates the training pipeline based on the teacher forcing strategy, which supports both video and textual modalities during training. The upper-right section details our SVAC (Spatial Video-Audio Contrastive Learning) strategy for enhancing the Video Encoder’s alignment capability. The lower-right section introduces the Multi-Objective Preference Ali… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative Comparison. The left column depicts sea waves positioned directly in front; our model generates distinct and rhythmic wave sounds. In the right column, featuring a marching band moving from the front toward the right side, the signal intensity of the X channel gradually decreases while the intensity of the Y channel increases accordingly. achieves superior performance in both semantic consisten… view at source ↗
read the original abstract

Real-time and accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis technologies are often encumbered by a tradeoff between generation quality and high inference latency, as well as difficulty in capturing precise spatial information from multimodal inputs. To address these challenges, we propose SwanSphere, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. SwanSphere mainly makes the following contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming high-quality spatial audio generation. 2) We design a Spatial Video-Audio Contrastive (SVAC) learning strategy to align the video encoder with the acoustic domain, and further employ a multi-objective online direct preference optimization (ODPO) scheme, resulting in strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we also develop an automated annotation pipeline for generating detailed spatial captions. Experimental results demonstrate that SwanSphere achieves superior performance in both video-to-spatial and text-to-spatial audio generation tasks. Demos can be found at: https://swanaigc.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes SwanSphere, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. It introduces a causal autoregressive diffusion transformer architecture, a Spatial Video-Audio Contrastive (SVAC) learning strategy to align video and acoustic domains, a multi-objective online direct preference optimization (ODPO) scheme, and an automated pipeline for generating spatial captions. The central claim is that these components enable superior performance on video-to-spatial and text-to-spatial audio generation tasks.

Significance. If the superiority claims are substantiated with rigorous experiments, the work could meaningfully advance real-time immersive audio synthesis by addressing latency-quality tradeoffs and multimodal alignment. The introduction of SVAC and ODPO for spatial perception, along with the captioning pipeline to address data scarcity, would represent practical contributions to the field if supported by evidence.

major comments (1)
  1. Abstract: The statement that 'Experimental results demonstrate that SwanSphere achieves superior performance in both video-to-spatial and text-to-spatial audio generation tasks' is presented without any accompanying metrics, baselines, ablation studies, error bars, test-set descriptions, or spatial-specific evaluation criteria (such as angular error or binaural quality scores). This absence leaves the central empirical claim without visible grounding.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below.

read point-by-point responses
  1. Referee: [—] Abstract: The statement that 'Experimental results demonstrate that SwanSphere achieves superior performance in both video-to-spatial and text-to-spatial audio generation tasks' is presented without any accompanying metrics, baselines, ablation studies, error bars, test-set descriptions, or spatial-specific evaluation criteria (such as angular error or binaural quality scores). This absence leaves the central empirical claim without visible grounding.

    Authors: We agree that the abstract claim would be stronger with explicit grounding. The full manuscript details the experimental setup, metrics (including angular error and binaural quality scores), baselines, ablations, error bars, and test-set descriptions in the Experiments section. To address the concern, we will revise the abstract to incorporate key quantitative results supporting the superiority claim. revision: yes

Circularity Check

0 steps flagged

No circularity: paper describes architecture and asserts empirical results without any derivation chain or self-referential predictions

full rationale

The manuscript introduces SwanSphere via high-level component descriptions (causal autoregressive diffusion transformer, SVAC alignment, multi-objective ODPO) and states that experimental results show superiority on video-to-spatial and text-to-spatial tasks. No equations, fitted parameters renamed as predictions, self-citations invoked as uniqueness theorems, or ansatzes smuggled via prior work appear in the supplied text. The performance claim is presented as an empirical outcome rather than a derived quantity that reduces to its own inputs by construction; therefore the derivation chain (such as it exists) is self-contained and independent of the circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5748 in / 1054 out tokens · 29933 ms · 2026-06-28T21:15:53.144782+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 31 canonical work pages · 8 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    MusicLM: Generating Music From Text

    Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023

  3. [3]

    Spatial sound—history, principle, progress and challenge

    Bosun, X. Spatial sound—history, principle, progress and challenge. Chinese Journal of Electronics, 29 0 (3): 0 397--416, 2020. doi:10.1049/cje.2020.02.016. URL https://cje.ejournal.org.cn/en/article/doi/10.1049/cje.2020.02.016

  4. [4]

    Ccstereo: Audio-visual contextual and contrastive learning for binaural audio generation

    Chen, Y., Shimada, K., Simon, C., Ikemiya, Y., Shibuya, T., and Mitsufuji, Y. Ccstereo: Audio-visual contextual and contrastive learning for binaural audio generation. In Proceedings of the 33rd ACM International Conference on Multimedia, pp.\ 7510--7518, 2025

  5. [5]

    K., Ishii, M., Hayakawa, A., Shibuya, T., Schwing, A., and Mitsufuji, Y

    Cheng, H. K., Ishii, M., Hayakawa, A., Shibuya, T., Schwing, A., and Mitsufuji, Y. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 28901--28911, 2025

  6. [6]

    W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al

    Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25 0 (70): 0 1--53, 2024

  7. [7]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  8. [8]

    Veo 3, 2025

    DeepMind, G. Veo 3, 2025. URL https://deepmind.google/technologies/veo

  9. [9]

    Scaling rectified flow transformers for high-resolution image synthesis

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., M \"u ller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  10. [10]

    D., Carr, C., Zukowski, Z., Taylor, J., and Pons, J

    Evans, Z., Parker, J. D., Carr, C., Zukowski, Z., Taylor, J., and Pons, J. Stable audio open. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 1--5. IEEE, 2025

  11. [11]

    and Grauman, K

    Gao, R. and Grauman, K. 2.5 d visual sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 324--333, 2019

  12. [12]

    Geometry-aware multi-task learning for binaural audio generation from video

    Garg, R., Gao, R., and Grauman, K. Geometry-aware multi-task learning for binaural audio generation from video. arXiv preprint arXiv:2111.10882, 2021

  13. [13]

    Mrsaudio: A large-scale multimodal recorded spatial audio dataset with refined annotations

    Guo, W., Pan, C., Zhu, Z., Hu, X., Zhang, Y., Tang, L., Yang, R., Wang, H., Zhang, Z., Wang, Y., Chen, Y., Xu, H., Xu, K., Fan, P., Chen, Z., Yu, Y., Huang, Q., Wu, F., and Zhao, Z. Mrsaudio: A large-scale multimodal recorded spatial audio dataset with refined annotations. arXiv preprint arXiv:2510.10396, 2025 a

  14. [14]

    TechSinger : Technique controllable multilingual singing voice synthesis via flow matching

    Guo, W., Zhang, Y., Pan, C., Huang, R., Tang, L., Li, R., Hong, Z., Wang, Y., and Zhao, Z. TechSinger : Technique controllable multilingual singing voice synthesis via flow matching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp.\ 23978--23986, 2025 b . doi:10.1609/aaai.v39i22.34571

  15. [15]

    STARS : A unified framework for singing transcription, alignment, and refined style annotation

    Guo, W., Zhang, Y., Pan, C., Zhu, Z., Li, R., Chen, Z., Xu, W., Wu, F., and Zhao, Z. STARS : A unified framework for singing transcription, alignment, and refined style annotation. In Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 15081--15093, Vienna, Austria, 2025 c . Association for Computational Linguistics. doi:10.18653/v1/...

  16. [16]

    Immersediffusion: A generative spatial audio latent diffusion model

    Heydari, M., Souden, M., Conejo, B., and Atkins, J. Immersediffusion: A generative spatial audio latent diffusion model. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 1--5. IEEE, 2025

  17. [17]

    D., and Yang, J

    Hu, J., Cao, Y., Wu, M., Kang, F., Yang, F., Wang, W., Plumbley, M. D., and Yang, J. Pseldnets: Pre-trained neural networks on a large-scale synthetic dataset for sound event localization and detection. IEEE Transactions on Audio, Speech and Language Processing, 2025

  18. [18]

    Impact: Iterative mask-based parallel decoding for text-to-audio generation with diffusion modeling

    Huang, K.-P., Yang, S.-w., Phan, H., Lu, B.-R., Kim, B., Macha, S., Tang, Q., Ghosh, S., Lee, H.-y., Kao, C.-C., et al. Impact: Iterative mask-based parallel decoding for text-to-audio generation with diffusion modeling. arXiv preprint arXiv:2506.00736, 2025

  19. [19]

    Masked autoencoders that listen

    Huang, P.-Y., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F., and Feichtenhofer, C. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35: 0 28708--28720, 2022

  20. [20]

    and Rahtu, E

    Iashin, V. and Rahtu, E. Taming visually guided sound generation. arXiv preprint arXiv:2110.08791, 2021

  21. [21]

    Wavtokenizer: An efficient acoustic discrete codec tokenizer for audio language modeling

    Ji, S., Jiang, Z., Wang, W., Chen, Y., Fang, M., Zuo, J., Yang, Q., Cheng, X., Wang, Z., Li, R., et al. Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. arXiv preprint arXiv:2408.16532, 2024

  22. [22]

    Stereofoley: Object-aware stereo audio generation from video

    Karchkhadze, T., Chen, K.-L., Heydari, M., Henzel, R., Toso, A., Souden, M., and Atkins, J. Stereofoley: Object-aware stereo audio generation from video. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 16027--16031. IEEE, 2026

  23. [23]

    Visage: Video-to-spatial audio generation

    Kim, J., Yun, H., and Kim, G. Visage: Video-to-spatial audio generation. arXiv preprint arXiv:2506.12199, 2025

  24. [24]

    Guiding audio editing with audio language model

    Lan, Z., Hao, Y., and Zhao, M. Guiding audio editing with audio language model. arXiv preprint arXiv:2509.21625, 2025

  25. [25]

    Binauralgrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis

    Leng, Y., Chen, Z., Guo, J., Liu, H., Chen, J., Tan, X., Mandic, D., He, L., Li, X., Qin, T., et al. Binauralgrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis. Advances in Neural Information Processing Systems, 35: 0 23689--23700, 2022

  26. [26]

    Robust singing voice transcription serves synthesis

    Li, R., Zhang, Y., Wang, Y., Hong, Z., Huang, R., and Zhao, Z. Robust singing voice transcription serves synthesis. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 9751--9766, Bangkok, Thailand, 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.acl-long.526. URL h...

  27. [27]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  28. [28]

    Liu, H., Yuan, Y., Liu, X., Mei, X., Kong, Q., Tian, Q., Wang, Y., Wang, W., Wang, Y., and Plumbley, M. D. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32: 0 2871--2883, 2024

  29. [29]

    Thinksound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing

    Liu, H., Luo, K., Wang, J., Wang, W., Chen, Q., Zhao, Z., and Xue, W. Thinksound: Chain-of-thought reasoning in multimodal large language models for audio generation and editing. arXiv preprint arXiv:2506.21448, 2025 a

  30. [30]

    Omniaudio: Generating spatial audio from 360-degree video

    Liu, H., Luo, T., Luo, K., Jiang, Q., Sun, P., Wang, J., Huang, R., Chen, Q., Wang, W., Li, X., et al. Omniaudio: Generating spatial audio from 360-degree video. arXiv preprint arXiv:2504.14906, 2025 b

  31. [31]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  32. [32]

    Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models

    Luo, S., Yan, C., Hu, C., and Zhao, H. Diff-foley: Synchronized video-to-audio synthesis with latent diffusion models. Advances in Neural Information Processing Systems, 36: 0 48855--48876, 2023

  33. [33]

    Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization

    Majumder, N., Hung, C.-Y., Ghosal, D., Hsu, W.-N., Mihalcea, R., and Poria, S. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization. In Proceedings of the 32nd ACM International Conference on Multimedia, pp.\ 564--572, 2024

  34. [34]

    Foleygen: Visually-guided audio generation

    Mei, X., Nagaraja, V., Le Lan, G., Ni, Z., Chang, E., Shi, Y., and Chandra, V. Foleygen: Visually-guided audio generation. In 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), pp.\ 1--6, 2024

  35. [35]

    Self-supervised generation of spatial audio for 360 video

    Morgado, P., Vasconcelos, N., Langlois, T., and Wang, O. Self-supervised generation of spatial audio for 360 video. Advances in neural information processing systems, 31, 2018

  36. [36]

    Sora 2: Video generation model, 2025

    OpenAI. Sora 2: Video generation model, 2025. URL https://openai.com/sora

  37. [37]

    A multimodal evaluation framework for spatial audio playback systems: From localization to listener preference

    Pan, C., Guo, W., Zhang, Y., Zhu, Z., Chen, Z., Wang, H., and Zhao, Z. A multimodal evaluation framework for spatial audio playback systems: From localization to listener preference. In Proceedings of the 33rd ACM International Conference on Multimedia, pp.\ 7006--7015, 2025. doi:10.1145/3746027.3755571

  38. [38]

    and Xie, S

    Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 4195--4205, 2023

  39. [39]

    Movie Gen: A Cast of Media Foundation Models

    Polyak, A., Zohar, A., Brown, A., Tjandra, A., Sinha, A., Lee, A., Vyas, A., Shi, B., Ma, C.-Y., Chuang, C.-Y., et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

  40. [40]

    W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PmLR, 2021

  41. [41]

    High-resolution image synthesis with latent diffusion models

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 10684--10695, 2022

  42. [42]

    Soundreactor: Frame-level online video-to-audio generation

    Saito, K., Tanke, J., Simon, C., Ishii, M., Shimada, K., Novack, Z., Zhong, Z., Hayakawa, A., Shibuya, T., and Mitsufuji, Y. Soundreactor: Frame-level online video-to-audio generation. arXiv preprint arXiv:2510.02110, 2025

  43. [43]

    DINOv3

    Sim \'e oni, O., Vo, H. V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  44. [44]

    Both ears wide open: Towards language-driven spatial audio generation

    Sun, P., Cheng, S., Li, X., Ye, Z., Liu, H., Zhang, H., Xue, W., and Guo, Y. Both ears wide open: Towards language-driven spatial audio generation. arXiv preprint arXiv:2410.10676, 2024

  45. [45]

    Codi-2: In-context interleaved and interactive any-to-any generation

    Tang, Z., Yang, Z., Khademi, M., Liu, Y., Zhu, C., and Bansal, M. Codi-2: In-context interleaved and interactive any-to-any generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 27425--27434, 2024

  46. [46]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  47. [47]

    Temporally aligned audio for video with autoregression

    Viertola, I., Iashin, V., and Rahtu, E. Temporally aligned audio for video with autoregression. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 1--5. IEEE, 2025

  48. [48]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  49. [49]

    Videomae v2: Scaling video masked autoencoders with dual masking

    Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., and Qiao, Y. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 14549--14560, 2023

  50. [50]

    Audiogen-omni: A unified multimodal diffusion transformer for video-synchronized audio, speech, and song generation

    Wang, L., Wang, J., Qiang, C., Deng, F., Zhang, C., Zhang, D., and Gai, K. Audiogen-omni: A unified multimodal diffusion transformer for video-synchronized audio, speech, and song generation. arXiv preprint arXiv:2508.00733, 2025

  51. [51]

    Frieren: Efficient video-to-audio generation network with rectified flow matching

    Wang, Y., Guo, W., Huang, R., Huang, J., Wang, Z., You, F., Li, R., and Zhao, Z. Frieren: Efficient video-to-audio generation network with rectified flow matching. Advances in Neural Information Processing Systems, 37: 0 128118--128138, 2024

  52. [52]

    arXiv preprint arXiv:2407.07464 , year=

    Xu, M., Li, C., Tu, X., Ren, Y., Chen, R., Gu, Y., Liang, W., and Yu, D. Video-to-audio generation with hidden alignment. arXiv preprint arXiv:2407.07464, 2024

  53. [53]

    Visually informed binaural audio generation without binaural audios

    Xu, X., Zhou, H., Liu, Z., Dai, B., Wang, X., and Lin, D. Visually informed binaural audio generation without binaural audios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15485--15494, 2021

  54. [54]

    Ta-v2a: Textually assisted video-to-audio generation

    You, Y., Wu, X., and Qu, T. Ta-v2a: Textually assisted video-to-audio generation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.\ 1--5, 2025

  55. [55]

    StyleSinger : Style transfer for out-of-domain singing voice synthesis

    Zhang, Y., Huang, R., Li, R., He, J., Xia, Y., Chen, F., Duan, X., Huai, B., and Zhao, Z. StyleSinger : Style transfer for out-of-domain singing voice synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.\ 19597--19605, 2024 a . doi:10.1609/aaai.v38i17.29932

  56. [56]

    Tcsinger: Zero-shot singing voice synthesis with style transfer and multi-level style control,

    Zhang, Y., Jiang, Z., Li, R., Pan, C., He, J., Huang, R., Wang, C., and Zhao, Z. TCS inger: Zero-shot singing voice synthesis with style transfer and multi-level style control. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 1960--1975, Miami, Florida, USA, 2024 b . Association for Computational Linguistics....

  57. [57]

    GTS inger: A global multi-technique singing corpus with realistic music scores for all singing tasks

    Zhang, Y., Pan, C., Guo, W., Li, R., Zhu, Z., Wang, J., Xu, W., Lu, J., Hong, Z., Wang, C., Zhang, L., He, J., Jiang, Z., Chen, Y., Yang, C., Zhou, J., Cheng, X., and Zhao, Z. GTS inger: A global multi-technique singing corpus with realistic music scores for all singing tasks. In Advances in Neural Information Processing Systems, volume 37, 2024 c

  58. [58]

    TCS inger 2: Customizable multilingual zero-shot singing voice synthesis

    Zhang, Y., Guo, W., Pan, C., Yao, D., Zhu, Z., Jiang, Z., Wang, Y., Jin, T., and Zhao, Z. TCS inger 2: Customizable multilingual zero-shot singing voice synthesis. In Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 13280--13294, Vienna, Austria, 2025 a . Association for Computational Linguistics. doi:10.18653/v1/2025.findings-acl...

  59. [59]

    Isdrama: Immersive spatial drama generation through multimodal prompting

    Zhang, Y., Guo, W., Pan, C., Zhu, Z., Jin, T., and Zhao, Z. Isdrama: Immersive spatial drama generation through multimodal prompting. In Proceedings of the 33rd ACM International Conference on Multimedia, pp.\ 9618--9627, 2025 b

  60. [60]

    Versatile framework for song generation with prompt-based control

    Zhang, Y., Guo, W., Pan, C., Zhu, Z., Li, R., Lu, J., Huang, R., Zhang, R., Hong, Z., Jiang, Z., and Zhao, Z. Versatile framework for song generation with prompt-based control. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp.\ 195--219, Suzhou, China, 2025 c . Association for Computational Linguistics. doi:10.18653/v1/2025.fin...

  61. [61]

    Conan: A chunkwise online network for zero-shot adaptive voice conversion

    Zhang, Y., Tian, B., and Duan, Z. Conan: A chunkwise online network for zero-shot adaptive voice conversion. arXiv preprint arXiv:2507.14534, 2025 d

  62. [62]

    Asaudio: A survey of advanced spatial audio research

    Zhu, Z., Zhang, Y., Guo, W., Pan, C., and Zhao, Z. Asaudio: A survey of advanced spatial audio research. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pp.\ 417--442, 2025

  63. [63]

    and Frank, M

    Zotter, F. and Frank, M. Ambisonics: A practical 3D audio theory for recording, studio production, sound reinforcement, and virtual reality. Springer Nature, 2019