UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

Haoning Xu; Haoxuan Che; Huimeng Wang; Jiajun Deng; Jingran Su; Rui Liu; Tianzi Wang; Xunying Liu; Yaofang Liu; Zengrui Jin

arxiv: 2605.31530 · v2 · pith:RDMSKUXLnew · submitted 2026-05-29 · 📡 eess.AS · cs.SD

UNISON: A Unified Sound Generation and Editing Framework via Deep LLM Fusion

Zhaoqing Li , Haoning Xu , Jingran Su , Yaofang Liu , Zhefan Rao , Huimeng Wang , Jiajun Deng , Tianzi Wang

show 4 more authors

Zengrui Jin Rui Liu Haoxuan Che Xunying Liu

This is my paper

Pith reviewed 2026-06-28 20:33 UTC · model grok-4.3

classification 📡 eess.AS cs.SD

keywords unified audio generationtext-to-audiotext-to-speechaudio editinglatent diffusionmultimodal LLM fusionmulti-task learningspeaker cloning

0 comments

The pith

A single model with 621M-732M parameters unifies text-to-audio, text-to-speech, zero-shot cloning, mixed generation, and multiple audio editing tasks under one set of weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

UNISON is a latent diffusion model that performs text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level editing, speech-in-scene editing, and timed temporal composition. All tasks share the same weights and rely on two design choices: layer-wise injection of hidden states from multiple layers of a frozen multimodal LLM into diffusion transformer blocks, and task encoding via a channel-wise mask plus VAE channel concatenation. The model reaches performance levels competitive with or better than task-specific systems while remaining roughly four times smaller than prior unified approaches. A sympathetic reader would care because the result suggests that broad audio capabilities no longer require separate models or large task-specific modules.

Core claim

A single latent diffusion model with layer-wise deep fusion of uniformly sampled hidden states from a frozen MLLM into corresponding MM-DiT blocks, plus a unified multi-task design that encodes task identity only through a channel-wise mask and VAE concatenation, can jointly solve text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, speech-in-scene editing, and timed temporal composition while sharing one set of 621M-732M trainable parameters and matching or exceeding specialist models.

What carries the argument

Layer-wise deep LLM fusion that injects hidden states from uniformly sampled layers of a frozen MLLM into MM-DiT blocks via learned projections, paired with channel-wise mask and VAE concatenation to encode task identity.

If this is right

One set of weights suffices for both open-ended generation and precise temporal editing of mixed speech and sound scenes.
Depth-matched conditioning from multiple LLM layers improves following of complex editing instructions compared with single-layer baselines.
Online GPU-side multi-task data synthesis with task-homogeneous batching and two-stage curriculum stabilizes joint training across seven tasks.
The resulting model size remains roughly four times smaller than prior unified audio systems while matching their accuracy.
Seamless mixing of speech and environmental sound, plus zero-shot cloning, emerges from the shared architecture without extra components.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployment of audio AI could become simpler if only one model needs to be hosted instead of separate generators and editors.
The frozen-LLM approach may allow future scaling by swapping in larger language models without retraining the diffusion backbone.
Task-homogeneous batching combined with curriculum learning could transfer to other multi-task generative settings beyond audio.
If the mask-and-concatenation scheme generalizes, similar lightweight task encoding might reduce parameter overhead in unified video or music models.

Load-bearing premise

Uniformly sampled hidden states from multiple layers of a frozen MLLM, when injected through learned projections, supply depth-matched semantic conditioning that improves instruction following, and a channel-wise mask plus VAE concatenation alone is enough to distinguish tasks without any task-specific modules.

What would settle it

An ablation that replaces the multi-layer LLM injection with single-layer injection and removes the channel-wise mask, then measures whether instruction-following accuracy on speech-in-scene editing drops below the performance of the full UNISON model.

Figures

Figures reproduced from arXiv: 2605.31530 by Haoning Xu, Haoxuan Che, Huimeng Wang, Jiajun Deng, Jingran Su, Rui Liu, Tianzi Wang, Xunying Liu, Yaofang Liu, Zengrui Jin, Zhaoqing Li, Zhefan Rao.

**Figure 1.** Figure 1: Overview of UNISON. A single flowmatching model handles text-to-audio generation, zeroshot TTS, gender control, audio-scene editing, and timed temporal composition. All tasks share the same architecture and weights, differentiated only by a task mask channel and optional source latent concatenation. disparate conditioning pipelines. This fragmentation increases deployment complexity and prevents cross-… view at source ↗

**Figure 2.** Figure 2: UNISON Architecture. Left: Layer-wise deep LLM fusion injects per-layer Qwen hidden states into corresponding DiT blocks via learned projectors. Middle: Each double-stream block performs joint attention; text tokens are refreshed per block (✗) while audio tokens pass through the MLP. Bottom: [zt ∥ zs ∥ m] are channelconcatenated and embedded; the ODE solver denoises the latent, which is VAE-decoded to wav… view at source ↗

**Figure 3.** Figure 3: Audio editing qualitative examples from UNISON (D24, 16 kHz). Each row shows one sub-task (Add / Remove / Replace). Left: source audio. Middle: UNISON output. Right: constructed ground truth [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: Audio editing qualitative examples from UNISON (D20, 44.1 kHz) on the same samples as [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Speech-in-scene editing qualitative examples from [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Speech-in-scene editing qualitative examples from [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

**Figure 7.** Figure 7: Timed generation mel spectrograms from UNISON (D24, 16 kHz). Colored dashed lines and shading denote the time boundaries from the input prompt; segment captions are annotated above each region. (a)–(b): sequential segments. (c)–(d): overlapping segments. The model produces distinct spectral patterns that align with the specified time intervals [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Timed generation mel spectrograms from UNISON (D20, 44.1 kHz) on the same prompts as [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗

read the original abstract

We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing within a single model. A single model handles text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, speech-in-scene editing, and timed temporal composition, all of which share a single set of weights. Our architecture features two core designs: (1) Layer-wise deep LLM fusion, which injects hidden states from uniformly sampled layers of a frozen MLLM into corresponding MM-DiT blocks via learned projections, providing depth-matched semantic conditioning that improves instruction following over single-layer baselines; and (2) a unified multi-task architecture where task identity is encoded solely by a channel-wise mask and source audio is provided through VAE-encoded channel concatenation. Training is stabilized by an online GPU-side multi-task data synthesis pipeline with task-homogeneous batching and a two-stage curriculum. With 621M--732M trainable parameters, UNISON achieves results competitive with or exceeding task-specialist models across evaluated domains, while being roughly $4\times$ smaller than comparable unified systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UNISON claims one model handles many audio tasks via layer-wise MLLM fusion into MM-DiT plus channel-mask encoding, but the abstract gives no metrics or baselines so the performance claims cannot be checked.

read the letter

The main thing to know is that UNISON is a latent diffusion model that puts text-to-audio, text-to-speech, zero-shot cloning, mixed speech-sound generation, scene editing, speech-in-scene editing, and timed composition into a single set of weights. The two design moves are layer-wise fusion, where hidden states are taken from uniformly sampled layers of a frozen MLLM and projected into corresponding MM-DiT blocks, and task identity encoded only by a channel-wise mask with VAE concatenation for source audio.

The paper does a reasonable job spelling out the training side: an online GPU-side multi-task data synthesis pipeline, task-homogeneous batching, and a two-stage curriculum to keep multi-task training stable. The parameter count of 621M-732M and the claim of being roughly four times smaller than other unified systems are concrete numbers that can be compared later.

The soft spot is the complete absence of results. The abstract states competitive or better performance against task specialists but supplies no numbers, no baselines, no datasets, and no ablations. Without those it is impossible to tell whether the layer-wise fusion actually improves instruction following or whether the channel mask is enough to keep task performance from dropping. The assumption that uniform layer sampling gives useful depth-matched conditioning is presented as an improvement over single-layer baselines, yet nothing in the provided text shows that it holds.

This is for people working on unified audio models or on LLM conditioning inside diffusion transformers. A reader who already follows MM-DiT or multi-task audio work would get value from the architecture choices if the experiments turn out to be solid.

It deserves peer review. The unification goal is practical and the design is described clearly enough to evaluate once the results and ablations are on the table.

Referee Report

0 major / 2 minor

Summary. The manuscript presents UNISON, a latent diffusion framework unifying speech generation, sound generation, and audio editing in a single model with 621M-732M trainable parameters. Core designs include layer-wise deep LLM fusion (injecting uniformly sampled hidden states from a frozen MLLM into corresponding MM-DiT blocks via learned projections) and a multi-task architecture using only channel-wise masks for task identity plus VAE-encoded channel concatenation for source audio. Training uses an online GPU-side multi-task data synthesis pipeline with task-homogeneous batching and two-stage curriculum. The model is claimed to handle text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level editing, speech-in-scene editing, and timed temporal composition while achieving results competitive with or exceeding task-specialist models and being ~4x smaller than comparable unified systems.

Significance. If the performance claims hold with rigorous quantitative support, the work would demonstrate a meaningful step toward efficient unified audio models by showing that depth-matched LLM conditioning and minimal task encoding can support broad task coverage without task-specific components. This could reduce model proliferation in the field and highlight the value of deep fusion over single-layer baselines.

minor comments (2)

The abstract states competitive results but does not report specific metrics, baselines, datasets, or error bars; the full manuscript should include these in a dedicated results section with tables for each task.
Notation for the channel-wise mask and VAE concatenation should be formalized with an equation or diagram in the methods section to clarify how task identity is encoded without additional components.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for noting the potential significance of demonstrating efficient unified audio modeling via depth-matched LLM fusion and minimal task encoding. We are pleased that the work is viewed as a possible step toward reducing model proliferation if the quantitative claims hold under rigorous evaluation. Below we respond to the points raised.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract and description contain no equations, derivations, or self-citations that could reduce any claimed prediction or result to its inputs by construction. Architectural choices such as layer-wise LLM fusion via learned projections and task encoding via channel-wise mask plus VAE concatenation are presented as design decisions without any fitted-input-called-prediction pattern or self-definitional loop. The unified model is described as achieving competitive results through standard training procedures, with no load-bearing self-citation chains or ansatz smuggling visible. This is the normal self-contained case for an empirical architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.1-grok · 5773 in / 1135 out tokens · 18564 ms · 2026-06-28T20:33:00.883421+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

91 extracted references · 18 canonical work pages · 11 internal anchors

[1]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
[5]

Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. 2020. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721--725. IEEE

2020
[6]

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, JianZhao JianZhao, Kai Yu, and Xie Chen. 2025. F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6255--6271

2025
[7]

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. 2025. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28901--28911

2025
[8]

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What does bert look at? an analysis of bert’s attention. In Proceedings of the 2019 ACL workshop BlackboxNLP: analyzing and interpreting neural networks for NLP, pages 276--286

2019
[10]

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised cross-lingual representation learning for speech recognition. In Interspeech 2021, pages 2426--2430

2021
[11]

Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre D \'e fossez. 2023. Simple and controllable music generation. In Advances in Neural Information Processing Systems, volume 36

2023
[13]

Zhihao Du, Qian Chen, Xian Shi, Xiang Lv, Zhifu Gao, Changfeng Gao, Hui Wang, Dong Yu, Jianzong Pan, and Fan Wang. 2024 a . Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, and 1 others. 2024. E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts. In 2024 IEEE spoken language technology workshop (SLT), pages 682--689. IEEE

2024
[16]

Zach Evans, Julian D Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. 2025. Stable audio open. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

2025
[17]

Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, and Shiliang Zhang. 2023. Funasr: A fundamental end-to-end speech recognition toolkit. In Interspeech 2023, pages 1593--1597

2023
[18]

Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776--780. IEEE

2017
[19]

Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. 2023. Text-to-audio generation using instruction-tuned LLM and latent diffusion model. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3590--3598

2023
[20]

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, and Vicente Ordonez. 2026. Taming data and transformers for audio generation. International Journal of Computer Vision, 134(3):87

2026
[21]

Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, and 1 others. 2024. Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 885--890. IEEE

2024
[22]

Jonathan Ho and Tim Salimans. 2021. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications

2021
[24]

Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Amir Zadeh, Chuan Li, Rafael Valle, Bryan Catanzaro, and Soujanya Poria. 2026. Tangoflux: Super fast and faithful text to audio generation with flow matching and clap-ranked preference optimization. In International Conference on Learning Representations

2026
[25]

Ziyue Jiang, Qian Yang, Jialong Zuo, Zhenhui Ye, Rongjie Huang, Yi Ren, and Zhou Zhao. 2023. Fluentspeech: Stutter-oriented automatic speech editing with context-aware diffusion models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11655--11671

2023
[26]

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119--132

2019
[27]

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. 2020. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880--2894

2020
[28]

Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre D \'e fossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. 2023. Audiogen: Textually guided audio generation. In International Conference on Learning Representations

2023
[29]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. 2023. Flow matching for generative modeling. In International Conference on Learning Representations

2023
[30]

Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. 2023 a . Audioldm: Text-to-audio generation with latent diffusion models. In International Conference on Machine Learning, pages 21450--21474. PMLR

2023
[31]

Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D Plumbley. 2024. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:2871--2883

2024
[32]

Xingchao Liu, Chengyue Gong, and Qiang Liu. 2023 b . Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations

2023
[33]

Hila Manor and Tomer Michaeli. 2024. Zero-shot unsupervised and text-based audio editing using DDPM inversion. In International Conference on Machine Learning, pages 34603--34629. PMLR

2024
[34]

Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. 2024. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:3339--3354

2024
[35]

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. 2022. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations

2022
[36]

William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195--4205

2023
[38]

Chunyu Qiang, Kang Yin, Xiaopeng Wang, Yuzhe Liang, Jiahui Zhao, Ruibo Fu, Tianrui Wang, Cheng Gong, Chen Zhang, Longbiao Wang, and 1 others. 2026 b . Instructaudio: Unified speech and music generation with natural language instruction. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 17722--17726. IEEE

2026
[39]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR

2023
[40]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. https://arxiv.org/abs/2104.09864 Roformer: Enhanced transformer with rotary position embedding . Preprint, arXiv:2104.09864

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Jaesung Tae, Hyeongju Kim, and Taesu Kim. 2022. Editts: Score-based editing for controllable text-to-speech. In Interspeech 2022, pages 421--425

2022
[42]

Bingda Tang, Boyang Zheng, Sayak Paul, and Saining Xie. 2025. Exploring the deep fusion of large language models and diffusion transformers for text-to-image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28586--28595

2025
[44]

Silero Team. 2024. Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier. https://github.com/snakers4/silero-vad

2024
[45]

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp pipeline. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 4593--4601

2019
[46]

Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, and 1 others. 2026. Audio-omni: Extending multi-modal understanding to versatile audio generation and editing. In ACM SIGGRAPH

2026
[48]

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, and 1 others. 2023. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, and Zhizheng Wu. 2025. Maskgct: Zero-shot text-to-speech with masked generative codec transformer. In International Conference on Learning Representations, volume 2025, pages 47127--47150

2025
[50]

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2023. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

2023
[51]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, and 1 others. 2025. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Zhou Zhao, Xixin Wu, and Helen M. Meng. 2024. Uniaudio: Towards universal audio generation with large language models. In International Conference on Machine Learning, pages 56422--56447. PMLR

2024
[53]

Zhuoyuan Yao, Di Wu 0061, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. 2021. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In interspeech, volume 2021, pages 4054--4058

2021
[54]

Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. 2019. Libritts: A corpus derived from librispeech for text-to-speech. In Interspeech 2019, pages 1526--1530

2019
[55]

Han Zhu, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, and Daniel Povey. 2025. Zipvoice: Fast and high-quality zero-shot text-to-speech with flow matching. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

2025
[56]

Kim, C.D. et al. (2019). AudioCaps: Generating Captions for Audios in The Wild. NAACL-HLT 2019

2019
[57]

Vyas, A. et al. (2023). AudioBox: Unified Audio Generation with Natural Language Prompts. arXiv:2312.15821

work page arXiv 2023
[58]

Liu, H. et al. (2023). AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining. arXiv:2308.05734

work page arXiv 2023
[59]

Tian, Z. et al. (2026). Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing. arXiv:2604.10708

work page internal anchor Pith review Pith/arXiv arXiv 2026
[60]

Cai, Q. et al. (2025). HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer. arXiv:2505.22705

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

Deng, C. e¸t al. (2025). Emerging Properties in Unified Multimodal Pretraining. arXiv:2505.14683

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

Du, Z. et al. (2024). CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. arXiv:2412.10117

work page internal anchor Pith review Pith/arXiv arXiv 2024
[63]

Eskimez, S.E. et al. (2024). E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

2024
[64]

Tae, J. et al. (2021). EdiTTS: Score-based Editing for Controllable Text-to-Speech

2021
[65]

Tang, B. et al. (2025). Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis. arXiv:2505.10046

work page arXiv 2025
[66]

Tenney, I. et al. (2019). BERT Rediscovers the Classical NLP Pipeline. ACL 2019

2019
[67]

Jawahar, G. et al. (2019). What Does BERT Look At? An Analysis of BERT's Attention. BlackboxNLP, ACL 2019

2019
[68]

Chen, Y. et al. (2024). F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

2024
[69]

Jiang, Z. et al. (2022). FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models

2022
[70]

Ghosh, S. et al. (2024). Taming Data and Transformers for Scalable Audio Generation. arXiv:2406.19388

work page arXiv 2024
[71]

He, H. et al. (2024). Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

2024
[72]

Qiang, D. et al. (2025). InstructAudio

2025
[73]

Huang, R. et al. (2023). Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation. arXiv:2305.18474

work page arXiv 2023
[74]

Wang, Y. et al. (2024). MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

2024
[75]

Cheng, H.K. et al. (2025). MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis. CVPR 2025

2025
[76]

Tian, Y. et al. (2024). MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model. arXiv:2512.20339

work page arXiv 2024
[77]

Xu, Z. et al. (2025). Qwen2.5-Omni Technical Report

2025
[78]

Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding

2021
[79]

Meng, C. et al. (2021). SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

2021
[80]

Evans, Z. et al. (2024). Stable Audio Open

2024
[81]

Ghosal, D. et al. (2023). Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

2023
[82]

Hung, C.Y. et al. (2024). TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization. arXiv:2412.21037

work page arXiv 2024
[83]

Yang, D. et al. (2024). UniAudio: An Audio Foundation Model Toward Universal Audio Generation. ICML 2024

2024
[84]

Qiang, D. et al. (2025). UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions. arXiv:2604.22209

work page internal anchor Pith review Pith/arXiv arXiv 2025
[85]

Zhu, H. et al. (2025). ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching

2025
[86]

& Michaeli, T

Manor, H. & Michaeli, T. (2024). Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion

2024
[87]

Anastassiou, P. et al. (2024). Seed-TTS: A Family of High-Quality Versatile Speech Generation Models. arXiv:2406.02430

work page internal anchor Pith review Pith/arXiv arXiv 2024
[88]

Kong, Q. et al. (2020). PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Trans. Audio, Speech, Lang. Process., 28, 2880--2894

2020

Showing first 80 references.

[1] [1]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

[3] [5]

Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. 2020. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721--725. IEEE

2020

[4] [6]

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, JianZhao JianZhao, Kai Yu, and Xie Chen. 2025. F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6255--6271

2025

[5] [7]

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. 2025. Mmaudio: Taming multimodal joint training for high-quality video-to-audio synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28901--28911

2025

[6] [8]

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [9]

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What does bert look at? an analysis of bert’s attention. In Proceedings of the 2019 ACL workshop BlackboxNLP: analyzing and interpreting neural networks for NLP, pages 276--286

2019

[8] [10]

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2021. Unsupervised cross-lingual representation learning for speech recognition. In Interspeech 2021, pages 2426--2430

2021

[9] [11]

Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre D \'e fossez. 2023. Simple and controllable music generation. In Advances in Neural Information Processing Systems, volume 36

2023

[10] [13]

Zhihao Du, Qian Chen, Xian Shi, Xiang Lv, Zhifu Gao, Changfeng Gao, Hui Wang, Dong Yu, Jianzong Pan, and Fan Wang. 2024 a . Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [15]

Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, and 1 others. 2024. E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts. In 2024 IEEE spoken language technology workshop (SLT), pages 682--689. IEEE

2024

[12] [16]

Zach Evans, Julian D Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons. 2025. Stable audio open. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

2025

[13] [17]

Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, and Shiliang Zhang. 2023. Funasr: A fundamental end-to-end speech recognition toolkit. In Interspeech 2023, pages 1593--1597

2023

[14] [18]

Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776--780. IEEE

2017

[15] [19]

Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. 2023. Text-to-audio generation using instruction-tuned LLM and latent diffusion model. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3590--3598

2023

[16] [20]

Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin, Guha Balakrishnan, and Vicente Ordonez. 2026. Taming data and transformers for audio generation. International Journal of Computer Vision, 134(3):87

2026

[17] [21]

Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, and 1 others. 2024. Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation. In 2024 IEEE Spoken Language Technology Workshop (SLT), pages 885--890. IEEE

2024

[18] [22]

Jonathan Ho and Tim Salimans. 2021. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications

2021

[19] [24]

Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Amir Zadeh, Chuan Li, Rafael Valle, Bryan Catanzaro, and Soujanya Poria. 2026. Tangoflux: Super fast and faithful text to audio generation with flow matching and clap-ranked preference optimization. In International Conference on Learning Representations

2026

[20] [25]

Ziyue Jiang, Qian Yang, Jialong Zuo, Zhenhui Ye, Rongjie Huang, Yi Ren, and Zhou Zhao. 2023. Fluentspeech: Stutter-oriented automatic speech editing with context-aware diffusion models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11655--11671

2023

[21] [26]

Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. 2019. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119--132

2019

[22] [27]

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley. 2020. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880--2894

2020

[23] [28]

Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre D \'e fossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. 2023. Audiogen: Textually guided audio generation. In International Conference on Learning Representations

2023

[24] [29]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. 2023. Flow matching for generative modeling. In International Conference on Learning Representations

2023

[25] [30]

Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. 2023 a . Audioldm: Text-to-audio generation with latent diffusion models. In International Conference on Machine Learning, pages 21450--21474. PMLR

2023

[26] [31]

Haohe Liu, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Qiao Tian, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D Plumbley. 2024. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:2871--2883

2024

[27] [32]

Xingchao Liu, Chengyue Gong, and Qiang Liu. 2023 b . Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations

2023

[28] [33]

Hila Manor and Tomer Michaeli. 2024. Zero-shot unsupervised and text-based audio editing using DDPM inversion. In International Conference on Machine Learning, pages 34603--34629. PMLR

2024

[29] [34]

Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. 2024. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:3339--3354

2024

[30] [35]

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. 2022. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations

2022

[31] [36]

William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195--4205

2023

[32] [38]

Chunyu Qiang, Kang Yin, Xiaopeng Wang, Yuzhe Liang, Jiahui Zhao, Ruibo Fu, Tianrui Wang, Cheng Gong, Chen Zhang, Longbiao Wang, and 1 others. 2026 b . Instructaudio: Unified speech and music generation with natural language instruction. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 17722--17726. IEEE

2026

[33] [39]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR

2023

[34] [40]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. https://arxiv.org/abs/2104.09864 Roformer: Enhanced transformer with rotary position embedding . Preprint, arXiv:2104.09864

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [41]

Jaesung Tae, Hyeongju Kim, and Taesu Kim. 2022. Editts: Score-based editing for controllable text-to-speech. In Interspeech 2022, pages 421--425

2022

[36] [42]

Bingda Tang, Boyang Zheng, Sayak Paul, and Saining Xie. 2025. Exploring the deep fusion of large language models and diffusion transformers for text-to-image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28586--28595

2025

[37] [44]

Silero Team. 2024. Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier. https://github.com/snakers4/silero-vad

2024

[38] [45]

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp pipeline. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 4593--4601

2019

[39] [46]

Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, and 1 others. 2026. Audio-omni: Extending multi-modal understanding to versatile audio generation and editing. In ACM SIGGRAPH

2026

[40] [48]

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, and 1 others. 2023. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111

work page internal anchor Pith review Pith/arXiv arXiv 2023

[41] [49]

Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, and Zhizheng Wu. 2025. Maskgct: Zero-shot text-to-speech with masked generative codec transformer. In International Conference on Learning Representations, volume 2025, pages 47127--47150

2025

[42] [50]

Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. 2023. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

2023

[43] [51]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, and 1 others. 2025. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [52]

Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Haohan Guo, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Zhou Zhao, Xixin Wu, and Helen M. Meng. 2024. Uniaudio: Towards universal audio generation with large language models. In International Conference on Machine Learning, pages 56422--56447. PMLR

2024

[45] [53]

Zhuoyuan Yao, Di Wu 0061, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei. 2021. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In interspeech, volume 2021, pages 4054--4058

2021

[46] [54]

Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. 2019. Libritts: A corpus derived from librispeech for text-to-speech. In Interspeech 2019, pages 1526--1530

2019

[47] [55]

Han Zhu, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, and Daniel Povey. 2025. Zipvoice: Fast and high-quality zero-shot text-to-speech with flow matching. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

2025

[48] [56]

Kim, C.D. et al. (2019). AudioCaps: Generating Captions for Audios in The Wild. NAACL-HLT 2019

2019

[49] [57]

Vyas, A. et al. (2023). AudioBox: Unified Audio Generation with Natural Language Prompts. arXiv:2312.15821

work page arXiv 2023

[50] [58]

Liu, H. et al. (2023). AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining. arXiv:2308.05734

work page arXiv 2023

[51] [59]

Tian, Z. et al. (2026). Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing. arXiv:2604.10708

work page internal anchor Pith review Pith/arXiv arXiv 2026

[52] [60]

Cai, Q. et al. (2025). HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer. arXiv:2505.22705

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [61]

Deng, C. e¸t al. (2025). Emerging Properties in Unified Multimodal Pretraining. arXiv:2505.14683

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [62]

Du, Z. et al. (2024). CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. arXiv:2412.10117

work page internal anchor Pith review Pith/arXiv arXiv 2024

[55] [63]

Eskimez, S.E. et al. (2024). E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

2024

[56] [64]

Tae, J. et al. (2021). EdiTTS: Score-based Editing for Controllable Text-to-Speech

2021

[57] [65]

Tang, B. et al. (2025). Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis. arXiv:2505.10046

work page arXiv 2025

[58] [66]

Tenney, I. et al. (2019). BERT Rediscovers the Classical NLP Pipeline. ACL 2019

2019

[59] [67]

Jawahar, G. et al. (2019). What Does BERT Look At? An Analysis of BERT's Attention. BlackboxNLP, ACL 2019

2019

[60] [68]

Chen, Y. et al. (2024). F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

2024

[61] [69]

Jiang, Z. et al. (2022). FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models

2022

[62] [70]

Ghosh, S. et al. (2024). Taming Data and Transformers for Scalable Audio Generation. arXiv:2406.19388

work page arXiv 2024

[63] [71]

He, H. et al. (2024). Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

2024

[64] [72]

Qiang, D. et al. (2025). InstructAudio

2025

[65] [73]

Huang, R. et al. (2023). Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation. arXiv:2305.18474

work page arXiv 2023

[66] [74]

Wang, Y. et al. (2024). MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

2024

[67] [75]

Cheng, H.K. et al. (2025). MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis. CVPR 2025

2025

[68] [76]

Tian, Y. et al. (2024). MMEDIT: A Unified Framework for Multi-Type Audio Editing via Audio Language Model. arXiv:2512.20339

work page arXiv 2024

[69] [77]

Xu, Z. et al. (2025). Qwen2.5-Omni Technical Report

2025

[70] [78]

Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding

2021

[71] [79]

Meng, C. et al. (2021). SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

2021

[72] [80]

Evans, Z. et al. (2024). Stable Audio Open

2024

[73] [81]

Ghosal, D. et al. (2023). Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model

2023

[74] [82]

Hung, C.Y. et al. (2024). TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization. arXiv:2412.21037

work page arXiv 2024

[75] [83]

Yang, D. et al. (2024). UniAudio: An Audio Foundation Model Toward Universal Audio Generation. ICML 2024

2024

[76] [84]

Qiang, D. et al. (2025). UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions. arXiv:2604.22209

work page internal anchor Pith review Pith/arXiv arXiv 2025

[77] [85]

Zhu, H. et al. (2025). ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching

2025

[78] [86]

& Michaeli, T

Manor, H. & Michaeli, T. (2024). Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion

2024

[79] [87]

Anastassiou, P. et al. (2024). Seed-TTS: A Family of High-Quality Versatile Speech Generation Models. arXiv:2406.02430

work page internal anchor Pith review Pith/arXiv arXiv 2024

[80] [88]

Kong, Q. et al. (2020). PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition. IEEE/ACM Trans. Audio, Speech, Lang. Process., 28, 2880--2894

2020