pith. sign in

arxiv: 2605.28063 · v1 · pith:UAZ2Y4PWnew · submitted 2026-05-27 · 💻 cs.SD · cs.AI· cs.MM

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

Pith reviewed 2026-06-29 10:26 UTC · model grok-4.3

classification 💻 cs.SD cs.AIcs.MM
keywords audio generationunified synthesisfree-form text promptschain-of-thoughtLLM-based frameworkspeech and sound compositionautoregressive generationcomposite audio
0
0 comments X

The pith

PlanAudio generates unified audio with speech and sounds directly from free-form text prompts by using an LLM's reasoning and a semantic latent chain-of-thought.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new task of producing composite audio containing speech, environmental sounds, and their natural interactions straight from unconstrained natural language. It presents PlanAudio, an autoregressive LLM-based model that replaces separate text encoders and pipelines with the LLM's own planning ability. A semantic latent chain-of-thought step implicitly translates high-level meaning into low-level acoustic details without external rewriting or structured inputs. Evaluations on speech-only, sound-only, and mixed scenarios show the model beats pipeline and unified baselines while matching specialized single-task systems. The work also stresses that continuous training across multiple audio scenarios improves results over isolated training.

Core claim

PlanAudio is a unified autoregressive LLM-based framework for the Free-Form-Text-Prompt-to-Unified-Audio task that simplifies architecture by relying on the LLM's intrinsic reasoning and introduces a semantic latent chain-of-thought mechanism to bridge high-level semantic understanding with low-level acoustic synthesis, enabling direct generation of composite audio from unconstrained natural language prompts.

What carries the argument

The semantic latent chain-of-thought mechanism, an implicit planning step inside the LLM that connects high-level semantics to low-level acoustic output without external text processing.

If this is right

  • Composite audio can be produced in one forward pass without stitching outputs from separate speech and sound models.
  • Fine-grained timing and interaction details between speech and sound emerge from the same latent planning step.
  • No external text rewriting or conversion to structured formats is required for flexible prompt handling.
  • Performance on mixed scenarios improves when the model trains continuously across speech, sound, and composite data rather than in isolation.
  • Semantic latent chain-of-thought outperforms other chain-of-thought variants for this bridging task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-planning approach could apply to generating synchronized video and audio from text.
  • If the mechanism scales, it reduces the need for task-specific audio models in favor of general LLM-based generators.
  • Real-world applications such as game sound design or film post-production could shift from manual layering to direct text prompting.
  • The emphasis on continuous multi-scenario training suggests similar curricula may help other generative models handle mixed modalities.

Load-bearing premise

The LLM's built-in reasoning can reliably translate free-form text meaning into coherent acoustic details without separate text encoders, rewriters, or structured inputs.

What would settle it

A collection of free-form prompts that require precise timing and interaction between spoken words and background sounds, where the generated audio either fails to match the described scene or produces audible mismatches between speech and sound elements.

Figures

Figures reproduced from arXiv: 2605.28063 by Ruihua Song, Xihua Wang, Xin Cheng, Yijing Chen, Yuyue Wang.

Figure 1
Figure 1. Figure 1: Paradigms for Free-Form-Text-to-Unified-Audio Generation. (a) Pipeline Approach: [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: PlanAudio framework. Given a free-form text prompt input, the model first performs [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison of audio mel-spectrograms from different models. Textual keywords [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Normalized Score of data curriculum strategies across training epochs. Each sub-figure [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
read the original abstract

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the task of Free-Form-Text-Prompt-to-Unified-Audio generation and proposes PlanAudio, an autoregressive LLM-based framework that uses intrinsic LLM reasoning and a semantic latent chain-of-thought mechanism to synthesize composite audio containing speech, sounds, and their interactions directly from unconstrained natural language prompts. It also presents PlanAudio-Bench for evaluation across speech, sound, and composite scenarios, claiming that PlanAudio outperforms existing pipeline and unified baselines while remaining competitive with single-scenario models, with additional analysis showing benefits of the semantic latent CoT and multi-scenario training.

Significance. If the empirical claims hold with rigorous validation, the work would advance unified audio synthesis by removing reliance on structured inputs or external rewriting, enabling more flexible generation of naturally composited speech and environmental sounds; the introduction of a specialized benchmark and the implicit planning mechanism could influence downstream applications in multimedia and conversational AI.

major comments (2)
  1. [Abstract / Results] The abstract states that PlanAudio 'generally outperforms' baselines on PlanAudio-Bench but supplies no quantitative metrics, error bars, dataset sizes, or ablation results; without these in the results section, the central empirical claim cannot be assessed for statistical significance or fairness of comparisons.
  2. [Method / Framework description] The semantic latent chain-of-thought mechanism is presented as bridging high-level semantics to low-level acoustics without external rewriting, yet the manuscript provides no formal definition, training objective, or ablation isolating its contribution versus standard CoT or direct prompting; this is load-bearing for the architectural novelty claim.
minor comments (2)
  1. [Benchmark section] Clarify the exact composition of PlanAudio-Bench (number of prompts per scenario, annotation process, and how composite cases are constructed) to allow reproducibility.
  2. [Method] The claim of 'parameter-free' or simplified architecture via LLM reasoning should be supported by explicit comparison of parameter counts or training stages against baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point-by-point below, clarifying aspects of the manuscript and outlining planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract / Results] The abstract states that PlanAudio 'generally outperforms' baselines on PlanAudio-Bench but supplies no quantitative metrics, error bars, dataset sizes, or ablation results; without these in the results section, the central empirical claim cannot be assessed for statistical significance or fairness of comparisons.

    Authors: The results section contains quantitative tables reporting performance metrics across speech, sound, and composite scenarios on PlanAudio-Bench, with direct comparisons to pipeline and unified baselines as well as single-scenario models. The experimental setup details dataset sizes, and the analysis section includes ablations on the semantic latent CoT and multi-scenario training. We agree that the abstract would benefit from including key quantitative results to make the claims more concrete. We will revise the abstract to report specific metrics (e.g., relative improvements) and ensure clear cross-references to the results tables, error bars where applicable, and dataset statistics. revision: yes

  2. Referee: [Method / Framework description] The semantic latent chain-of-thought mechanism is presented as bridging high-level semantics to low-level acoustics without external rewriting, yet the manuscript provides no formal definition, training objective, or ablation isolating its contribution versus standard CoT or direct prompting; this is load-bearing for the architectural novelty claim.

    Authors: Section 3 describes the semantic latent chain-of-thought as an implicit planning mechanism integrated into the autoregressive LLM decoder that enables semantic composition reasoning prior to acoustic token generation. The analysis section reports comparisons demonstrating its superiority over alternative CoT mechanisms. We acknowledge that a formal definition and explicit training objective would improve rigor and clarity. We will add a mathematical formulation of the mechanism and the associated training objective in the revised method section. We will also expand the ablation studies to more explicitly compare against standard CoT and direct prompting baselines. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces a new task and an LLM-based framework (PlanAudio) with a semantic latent chain-of-thought mechanism, then reports empirical results on a new benchmark (PlanAudio-Bench) showing outperformance over baselines. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps are present in the abstract or described framework. Claims rest on external experimental comparisons rather than reducing to self-definition or imported uniqueness theorems. The central result is self-contained against the stated evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the named mechanism are detailed.

invented entities (1)
  • semantic latent chain-of-thought mechanism no independent evidence
    purpose: bridges high-level semantic understanding and low-level acoustic synthesis
    Presented as a core component of PlanAudio in the abstract.

pith-pipeline@v0.9.1-grok · 5774 in / 1061 out tokens · 35057 ms · 2026-06-29T10:26:06.951156+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 28 canonical work pages · 7 internal anchors

  1. [1]

    Cosyvoice 2: Scalable streaming speech synthesis with large language models,

    Z. Du, Y . Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao, Y . Yang, C. Gao, H. Wang, F. Yu, H. Liu, Z. Sheng, Y . Gu, C. Deng, W. Wang, S. Zhang, Z. Yan, and J. Zhou, “Cosyvoice 2: Scalable streaming speech synthesis with large language models,”CoRR, vol. abs/2412.10117,

  2. [2]
  3. [3]

    CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    Z. Du, C. Gao, Y . Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi, K. An, G. Yang, Y . Li, Y . Chen, Z. Gao, Q. Chen, Y . Gu, M. Chen, Y . Chen, S. Zhang, W. Wang, and J. Ye, “Cosyvoice 3: Towards in-the-wild speech generation via scaling-up and post-training,”CoRR, vol. abs/2505.17589, 2025. [Online]. Available: https://doi.org/10.48550/arX...

  4. [4]

    Qwen3-TTS Technical Report

    Q. Team, “Qwen3-tts technical report,”CoRR, vol. abs/2601.15621, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2601.15621

  5. [5]

    Audiogen: Textually guided audio generation,

    F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. Défossez, J. Copet, D. Parikh, Y . Taigman, and Y . Adi, “Audiogen: Textually guided audio generation,” inThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=CYK7RfcOzQ4

  6. [6]

    Audioldm 2: Learning holistic audio generation with self-supervised pretraining,

    H. Liu, Y . Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian, Y . Wang, W. Wang, Y . Wang, and M. D. Plumbley, “Audioldm 2: Learning holistic audio generation with self-supervised pretraining,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 32, pp. 2871–2883, 2024. [Online]. Available: https://doi.org/10.1109/TASLP.2024.3399607

  7. [7]

    Audiocomposer: Towards fine-grained audio generation with natural language descriptions,

    Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Stable audio open,” in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025. IEEE, 2025, pp. 1–5. [Online]. Available: https://doi.org/10.1109/ICASSP49660.2025.10888461

  8. [8]

    Prompttts++: Controlling speaker identity in prompt-based text-to-speech using natural language descriptions,

    Y . Lee, I. Yeon, J. Nam, and J. S. Chung, “V oiceldm: Text-to-speech with environmental context,” inIEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024, Seoul, Republic of Korea, April 14-19, 2024. IEEE, 2024, pp. 12 566–12 571. [Online]. Available: https://doi.org/10.1109/ICASSP48485.2024.10448268

  9. [9]

    Audiocomposer: Towards fine-grained audio generation with natural language descriptions,

    J. Jung, J. Ahn, C. Jung, T. D. Nguyen, Y . Jang, and J. S. Chung, “V oicedit: Dual-condition diffusion transformer for environment-aware speech synthesis,” in2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025. IEEE, 2025, pp. 1–5. [Online]. Available: https://doi.org/10.1109/ICAS...

  10. [10]

    ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

    Y . Jiang, Z. Chen, Z. Ju, Y . Dai, W. Dou, and J. Zhu, “Controlaudio: Tackling text-guided, timing-indicated and intelligible audio generation via progressive diffusion modeling,”CoRR, vol. abs/2510.08878, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.08878

  11. [11]

    Audiobox: Unified audio generation with natural language prompts,

    A. Vyas, B. Shi, M. Le, A. Tjandra, Y . Wu, B. Guo, J. Zhang, X. Zhang, R. Adkins, W. Ngan, J. Wang, I. Cruz, B. Akula, A. Akinyemi, B. Ellis, R. Moritz, Y . Yungster, A. Rakotoarison, L. Tan, C. Summers, C. Wood, J. Lane, M. Williamson, and W. Hsu, “Audiobox: Unified audio generation with natural language prompts,”CoRR, vol. abs/2312.15821, 2023. [Online...

  12. [12]

    Libritts: A corpus derived from librispeech for text-to-speech,

    H. Zen, V . Dang, R. Clark, Y . Zhang, R. J. Weiss, Y . Jia, Z. Chen, and Y . Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” in20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, September 15-19, 2019, G. Kubin and Z. Kacic, Eds. ISCA, 2019, pp. 1526–1530. [Online]. Availabl...

  13. [13]

    Audiocaps: Generating captions for audios in the wild,

    C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” inProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, V olume 1 (Long and Short Papers), J. Burstein, C. Doran, and...

  14. [14]

    Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,

    X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y . Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research,”IEEE ACM Trans. Audio Speech Lang. Process., vol. 32, pp. 3339–3354,

  15. [15]

    Available: https://doi.org/10.1109/TASLP.2024.3419446

    [Online]. Available: https://doi.org/10.1109/TASLP.2024.3419446

  16. [16]

    Audiocomposer: Towards fine-grained audio generation with natural language descriptions,

    Y . Wang, H. Chen, D. Yang, Z. Wu, and X. Wu, “Audiocomposer: Towards fine-grained audio generation with natural language descriptions,” in2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025. IEEE, 2025, pp. 1–5. [Online]. Available: https://doi.org/10.1109/ICASSP49660.2025.10888303

  17. [17]

    Freeaudio: Training-free timing planning for controllable long-form text-to-audio generation,

    Y . Jiang, Z. Chen, Z. Ju, C. Li, W. Dou, and J. Zhu, “Freeaudio: Training-free timing planning for controllable long-form text-to-audio generation,” inProceedings of the 33rd ACM International Conference on Multimedia, MM 2025, Dublin, Ireland, October 27-31, 2025, C. Gurrin, K. Schoeffmann, M. Zhang, L. Rossetto, S. Rudinac, D. Dang-Nguyen, W. Cheng, P....

  18. [18]

    Audiocomposer: Towards fine-grained audio generation with natural language descriptions,

    Z. Xie, X. Xu, Z. Wu, and M. Wu, “Picoaudio: Enabling precise temporal controllability in text-to-audio generation,” in2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025. IEEE, 2025, pp. 1–5. [Online]. Available: https://doi.org/10.1109/ICASSP49660.2025.10890827

  19. [19]

    V oxinstruct: Expressive human instruction-to-speech generation with unified multilingual codec language modelling,

    Y . Zhou, X. Qin, Z. Jin, S. Zhou, S. Lei, S. Zhou, Z. Wu, and J. Jia, “V oxinstruct: Expressive human instruction-to-speech generation with unified multilingual codec language modelling,” in Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024 - 1 November 2024, J. Cai, M. S. Kankanhalli,...

  20. [20]

    Flexivoice: Enabling flexible style control in zero-shot TTS with natural language instructions,

    D. Chen, X. Zhang, Y . Wang, K. Dai, L. Ma, and Z. Wu, “Flexivoice: Enabling flexible style control in zero-shot TTS with natural language instructions,”CoRR, vol. abs/2601.04656, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2601.04656

  21. [21]

    Moe-tts: Enhancing out-of-domain text understanding for description-based TTS via mixture-of-experts,

    H. Xue, X. Song, Y . Tang, J. Chen, Y . Chen, Y . Li, and Y . Zhou, “Moe-tts: Enhancing out-of-domain text understanding for description-based TTS via mixture-of-experts,”CoRR, vol. abs/2508.11326, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2508.11326

  22. [22]

    Uniaudio: An audio foundation model toward universal audio generation,

    D. Yang, J. Tian, X. Tan, R. Huang, S. Liu, X. Chang, J. Shi, S. Zhao, J. Bian, X. Wu, Z. Zhao, S. Watanabe, and H. Meng, “Uniaudio: An audio foundation model toward universal audio generation,”CoRR, vol. abs/2310.00704, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.00704

  23. [23]

    Fugatto 1: Foundational generative audio transformer opus 1,

    R. Valle, R. Badlani, Z. Kong, S. Lee, A. Goel, S. Kim, J. F. Santos, S. Dai, S. Gururani, A. Aljafari, A. H. Liu, K. J. Shih, R. Prenger, W. Ping, C. H. Yang, and B. Catanzaro, “Fugatto 1: Foundational generative audio transformer opus 1,” inThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. Open...

  24. [24]

    Available: https://openreview.net/forum?id=B2Fqu7Y2cd

    [Online]. Available: https://openreview.net/forum?id=B2Fqu7Y2cd

  25. [25]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022...

  26. [26]

    Cot-vtm: Visual-to-music generation with chain-of-thought reasoning,

    X. Guan, Z. Gu, J. Huo, T. Ding, and Y . Gao, “Cot-vtm: Visual-to-music generation with chain-of-thought reasoning,” inFindings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025, ser. Findings of ACL, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Association for Computational Linguistics, 2025...

  27. [27]

    Enhancing non-core language instruction-following in speech llms via semi-implicit cross-lingual cot reasoning,

    H. Xue, Y . Tang, H. Liu, J. Zhang, X. Geng, and L. Xie, “Enhancing non-core language instruction-following in speech llms via semi-implicit cross-lingual cot reasoning,” in Proceedings of the 33rd ACM International Conference on Multimedia, MM 2025, Dublin, Ireland, October 27-31, 2025, C. Gurrin, K. Schoeffmann, M. Zhang, L. Rossetto, S. Rudinac, D. Dan...

  28. [28]

    Ov-instructtts: Towards open-vocabulary instruct text-to-speech,

    Y . Ren, J. Yi, J. Tao, H. Sun, Z. Wen, H. Gu, L. Xu, and Y . Bai, “Ov-instructtts: Towards open-vocabulary instruct text-to-speech,”CoRR, vol. abs/2601.01459, 2026. [Online]. Available: https://doi.org/10.48550/arXiv.2601.01459

  29. [29]

    Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    J. Geiping, S. McLeish, N. Jain, J. Kirchenbauer, S. Singh, B. R. Bartoldson, B. Kailkhura, A. Bhatele, and T. Goldstein, “Scaling up test-time compute with latent reasoning: A recurrent depth approach,”CoRR, vol. abs/2502.05171, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2502.05171

  30. [30]

    Training Large Language Models to Reason in a Continuous Latent Space

    S. Hao, S. Sukhbaatar, D. Su, X. Li, Z. Hu, J. Weston, and Y . Tian, “Training large language models to reason in a continuous latent space,”CoRR, vol. abs/2412.06769, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2412.06769

  31. [31]

    Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning,

    X. Chen, A. Zhao, H. Xia, X. Lu, H. Wang, Y . Chen, W. Zhang, J. Wang, W. Li, and X. Shen, “Reasoning beyond language: A comprehensive survey on latent chain-of-thought reasoning,”CoRR, vol. abs/2505.16782, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2505.16782

  32. [32]

    Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S. Lee, C. H. Yang, R. Duraiswami, D. Manocha, R. Valle, and B. Catanzaro, “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,”CoRR, vol. abs/2507.08128, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2507.08128

  33. [33]

    Audio set: An ontology and human-labeled dataset for audio events,

    J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017. IEEE, 2017, pp. 776–780. [Online]. Available: ht...

  34. [34]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett,...

  35. [35]

    Simple and controllable music generation,

    J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y . Adi, and A. Défossez, “Simple and controllable music generation,” inAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M....

  36. [36]

    Text-to-audio generation using instruction-tuned LLM and latent diffusion model,

    D. Ghosal, N. Majumder, A. Mehrish, and S. Poria, “Text-to-audio generation using instruction-tuned LLM and latent diffusion model,”CoRR, vol. abs/2304.13731, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2304.13731

  37. [37]

    Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models,

    R. Huang, J. Huang, D. Yang, Y . Ren, L. Liu, M. Li, Z. Ye, J. Liu, X. Yin, and Z. Zhao, “Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Enge...

  38. [38]

    Prompttts++: Controlling speaker identity in prompt-based text-to-speech using natural language descriptions,

    R. Shimizu, R. Yamamoto, M. Kawamura, Y . Shirahata, H. Doi, T. Komatsu, and K. Tachibana, “Prompttts++: Controlling speaker identity in prompt-based text-to-speech using natural language descriptions,” inIEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024, Seoul, Republic of Korea, April 14-19, 2024. IEEE, 2024, pp. 12 6...