arxiv: 2605.31530 · v2 · pith:RDMSKUXLnew · submitted 2026-05-29 · 📡 eess.AS · cs.SD

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

Pith reviewed 2026-06-28 20:33 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords unified audio generation · text-to-audio · text-to-speech · audio editing · latent diffusion · multimodal LLM fusion · multi-task learning · speaker cloning

The pith

A single model with 621M-732M trainable parameters unifies text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, and multiple audio editing tasks under one set of weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

UNISON is a latent diffusion model that performs text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level editing, speech-in-scene editing, and timed temporal composition. All tasks share the same weights and rely on two design choices: layer-wise injection of hidden states from multiple layers of a frozen multimodal LLM into diffusion transformer blocks, and task encoding via a channel-wise mask plus VAE channel concatenation. The model reaches performance levels competitive with or better than task-specific systems while remaining roughly four times smaller than prior unified approaches. A sympathetic reader would care because the result suggests that broad audio capabilities no longer require separate models or large task-specific modules.

Core claim

A single latent diffusion model with layer-wise deep fusion of uniformly sampled hidden states from a frozen MLLM into corresponding MM-DiT blocks, plus a unified multi-task design that encodes task identity only through a channel-wise mask and VAE concatenation, can jointly solve text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, speech-in-scene editing, and timed temporal composition while sharing one set of 621M-732M trainable parameters and matching or exceeding specialist models.

What carries the argument

Layer-wise deep LLM fusion that injects hidden states from uniformly sampled layers of a frozen MLLM into MM-DiT blocks via learned projections, paired with channel-wise mask and VAE concatenation to encode task identity.
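
To make the fusion idea concrete, here is a minimal NumPy sketch of uniformly sampling LLM layers and projecting their hidden states into the diffusion blocks' width. It is an illustration under stated assumptions, not the paper's implementation: the layer count, block count, dimensions, the rounding rule in `uniform_layer_indices`, and both function names are hypothetical, and how the projected states enter each MM-DiT block is not modeled.

```python
import numpy as np

def uniform_layer_indices(num_llm_layers: int, num_dit_blocks: int) -> list[int]:
    """Pick one LLM layer per DiT block, evenly spaced from shallow to deep."""
    # Assumption: evenly spaced positions over [0, num_llm_layers - 1], rounded.
    positions = np.linspace(0, num_llm_layers - 1, num_dit_blocks)
    return positions.round().astype(int).tolist()

def project_hidden_states(hidden_states, indices, projections):
    """Apply one block-specific linear projection to each selected layer."""
    return [hidden_states[i] @ w for i, w in zip(indices, projections)]

# Hypothetical sizes, chosen only so the example runs quickly.
num_llm_layers, num_dit_blocks = 28, 8
llm_dim, dit_dim, seq_len = 64, 32, 5

rng = np.random.default_rng(0)
# Stand-ins for the frozen LLM's per-layer hidden states (never updated).
hidden_states = [rng.standard_normal((seq_len, llm_dim)) for _ in range(num_llm_layers)]
# Stand-ins for the learned projections, one per DiT block.
projections = [0.02 * rng.standard_normal((llm_dim, dit_dim)) for _ in range(num_dit_blocks)]

indices = uniform_layer_indices(num_llm_layers, num_dit_blocks)
conditioning = project_hidden_states(hidden_states, indices, projections)
```

In this sketch, block k receives `conditioning[k]`, so shallow blocks see shallow LLM features and deep blocks see deep ones. That pairing is one reading of "depth-matched" conditioning.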

If this is right

  • One set of weights suffices for both open-ended generation and precise temporal editing of mixed speech and sound scenes.
  • Depth-matched conditioning from multiple LLM layers improves following of complex editing instructions compared with single-layer baselines.
  • Online GPU-side multi-task data synthesis with task-homogeneous batching and two-stage curriculum stabilizes joint training across seven tasks.
  • The resulting model remains roughly four times smaller than prior unified audio systems while achieving results competitive with or exceeding task-specialist models.
  • Seamless mixing of speech and environmental sound, plus zero-shot cloning, emerges from the shared architecture without extra components.
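
Task-homogeneous batching, mentioned in the list above, can be sketched as grouping samples by task before forming batches. This is a simplified offline illustration; the paper describes an online GPU-side synthesis pipeline whose details are not given here. The task labels, the sample dictionary layout, and the function name `task_homogeneous_batches` are assumptions.

```python
import random
from collections import defaultdict

def task_homogeneous_batches(samples, batch_size, seed=0):
    """Return batches in which every sample shares the same task label."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for sample in samples:
        by_task[sample["task"]].append(sample)

    batches = []
    for task_samples in by_task.values():
        rng.shuffle(task_samples)  # shuffle within a task
        for start in range(0, len(task_samples), batch_size):
            batches.append(task_samples[start:start + batch_size])

    rng.shuffle(batches)  # interleave tasks across training steps
    return batches

# Hypothetical task labels and samples for illustration only.
tasks = ["text_to_audio", "text_to_speech", "scene_edit"]
samples = [{"id": i, "task": tasks[i % len(tasks)]} for i in range(20)]
batches = task_homogeneous_batches(samples, batch_size=4)
```

Because each batch holds a single task, every sample in a step shares the same input layout (for example, source latent present or absent), which is one plausible reason such batching could stabilize joint training.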

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment of audio AI could become simpler if only one model needs to be hosted instead of separate generators and editors.
  • The frozen-LLM approach may allow future scaling by swapping in larger language models without retraining the diffusion backbone.
  • Task-homogeneous batching combined with curriculum learning could transfer to other multi-task generative settings beyond audio.
  • If the mask-and-concatenation scheme generalizes, similar lightweight task encoding might reduce parameter overhead in unified video or music models.

Load-bearing premise

Uniformly sampled hidden states from multiple layers of a frozen MLLM, when injected through learned projections, supply depth-matched semantic conditioning that improves instruction following, and a channel-wise mask plus VAE concatenation alone is enough to distinguish tasks without any task-specific modules.
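
The task-encoding half of this premise can be sketched as a channel-wise concatenation of the noisy latent, a source latent, and a mask, as described in the Figure 2 caption. The shapes, the zero-filling when no source audio is given, and the mask semantics below are assumptions for illustration; the paper's actual encoding is not specified in the available text.

```python
import numpy as np

def build_model_input(z_t, z_s=None, mask=None):
    """Concatenate noisy latent, source latent, and mask along the channel axis.

    z_t:  (C, T) noisy target latent
    z_s:  (C, T) VAE-encoded source latent, or None for pure generation
    mask: (1, T) channel-wise mask, or None
    """
    channels, frames = z_t.shape
    if z_s is None:
        # Assumption: an all-zero source latent stands for "no source audio".
        z_s = np.zeros((channels, frames), dtype=z_t.dtype)
    if mask is None:
        # Assumption: an all-zero mask marks pure generation.
        mask = np.zeros((1, frames), dtype=z_t.dtype)
    return np.concatenate([z_t, z_s, mask], axis=0)

rng = np.random.default_rng(0)
z_t = rng.standard_normal((8, 10)).astype(np.float32)

# Generation-style input: no source audio, empty mask.
x_gen = build_model_input(z_t)

# Editing-style input: source latent plus a mask over frames 3..6 (hypothetical semantics).
z_s = rng.standard_normal((8, 10)).astype(np.float32)
mask = np.zeros((1, 10), dtype=np.float32)
mask[:, 3:7] = 1.0
x_edit = build_model_input(z_t, z_s, mask)
```

Both inputs have the same channel count, so a single embedding layer can consume every task. That is what lets the backbone stay free of task-specific modules in this sketch.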

What would settle it

Two ablations, run separately so their effects are not confounded: one replaces the multi-layer LLM injection with single-layer injection, and the other removes the channel-wise mask. Each then measures whether instruction-following accuracy on speech-in-scene editing drops below that of the full UNISON model.

Figures

Figures reproduced from arXiv: 2605.31530 by Haoning Xu, Haoxuan Che, Huimeng Wang, Jiajun Deng, Jingran Su, Rui Liu, Tianzi Wang, Xunying Liu, Yaofang Liu, Zengrui Jin, Zhaoqing Li, Zhefan Rao.

Figure 1: Overview of UNISON. A single flow-matching model handles text-to-audio generation, zero-shot TTS, gender control, audio-scene editing, and timed temporal composition. All tasks share the same architecture and weights, differentiated only by a task mask channel and optional source latent concatenation.
Figure 2: UNISON Architecture. Left: Layer-wise deep LLM fusion injects per-layer Qwen hidden states into corresponding DiT blocks via learned projectors. Middle: Each double-stream block performs joint attention; text tokens are refreshed per block (✗) while audio tokens pass through the MLP. Bottom: [z_t ∥ z_s ∥ m] are channel-concatenated and embedded; the ODE solver denoises the latent, which is VAE-decoded to wav…
Figure 3: Audio editing qualitative examples from UNISON (D24, 16 kHz). Each row shows one sub-task (Add / Remove / Replace). Left: source audio. Middle: UNISON output. Right: constructed ground truth.
Figure 4: Audio editing qualitative examples from UNISON (D20, 44.1 kHz) on the same samples as …
Figure 5: Speech-in-scene editing qualitative examples from …
Figure 6: Speech-in-scene editing qualitative examples from …
Figure 7: Timed generation mel spectrograms from UNISON (D24, 16 kHz). Colored dashed lines and shading denote the time boundaries from the input prompt; segment captions are annotated above each region. (a)-(b): sequential segments. (c)-(d): overlapping segments. The model produces distinct spectral patterns that align with the specified time intervals.
Figure 8: Timed generation mel spectrograms from UNISON (D20, 44.1 kHz) on the same prompts as …

Original abstract

We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing within a single model. A single model handles text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, speech-in-scene editing, and timed temporal composition, all of which share a single set of weights. Our architecture features two core designs: (1) Layer-wise deep LLM fusion, which injects hidden states from uniformly sampled layers of a frozen MLLM into corresponding MM-DiT blocks via learned projections, providing depth-matched semantic conditioning that improves instruction following over single-layer baselines; and (2) a unified multi-task architecture where task identity is encoded solely by a channel-wise mask and source audio is provided through VAE-encoded channel concatenation. Training is stabilized by an online GPU-side multi-task data synthesis pipeline with task-homogeneous batching and a two-stage curriculum. With 621M-732M trainable parameters, UNISON achieves results competitive with or exceeding task-specialist models across evaluated domains, while being roughly 4× smaller than comparable unified systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents UNISON, a latent diffusion framework unifying speech generation, sound generation, and audio editing in a single model with 621M-732M trainable parameters. Core designs include layer-wise deep LLM fusion (injecting uniformly sampled hidden states from a frozen MLLM into corresponding MM-DiT blocks via learned projections) and a multi-task architecture using only channel-wise masks for task identity plus VAE-encoded channel concatenation for source audio. Training uses an online GPU-side multi-task data synthesis pipeline with task-homogeneous batching and two-stage curriculum. The model is claimed to handle text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level editing, speech-in-scene editing, and timed temporal composition while achieving results competitive with or exceeding task-specialist models and being ~4x smaller than comparable unified systems.

Significance. If the performance claims hold with rigorous quantitative support, the work would demonstrate a meaningful step toward efficient unified audio models by showing that depth-matched LLM conditioning and minimal task encoding can support broad task coverage without task-specific components. This could reduce model proliferation in the field and highlight the value of deep fusion over single-layer baselines.

minor comments (2)
  1. The abstract states competitive results but does not report specific metrics, baselines, datasets, or error bars; the full manuscript should include these in a dedicated results section with tables for each task.
  2. Notation for the channel-wise mask and VAE concatenation should be formalized with an equation or diagram in the methods section to clarify how task identity is encoded without additional components.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for noting the potential significance of demonstrating efficient unified audio modeling via depth-matched LLM fusion and minimal task encoding. We are pleased that the work is viewed as a possible step toward reducing model proliferation if the quantitative claims hold under rigorous evaluation. Below we respond to the points raised.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract and description contain no equations, derivations, or self-citations that could reduce any claimed prediction or result to its inputs by construction. Architectural choices such as layer-wise LLM fusion via learned projections and task encoding via channel-wise mask plus VAE concatenation are presented as design decisions without any fitted-input-called-prediction pattern or self-definitional loop. The unified model is described as achieving competitive results through standard training procedures, with no load-bearing self-citation chains or ansatz smuggling visible. This is the normal self-contained case for an empirical architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5773 in / 1135 out tokens · 18564 ms · 2026-06-28T20:33:00.883421+00:00 · methodology

