pith. sign in

arxiv: 2606.21227 · v1 · pith:CC6P7HDInew · submitted 2026-06-19 · 💻 cs.SD

Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via Rich-Caption

Pith reviewed 2026-06-26 13:15 UTC · model grok-4.3

classification 💻 cs.SD
keywords audio editingzero-shotcaption rewritingtext-guided audioopen-ended editingrich captionspeech and music processing
0
0 comments X

The pith

Audio editing is recast as rewriting rich captions to allow zero-shot open-ended changes while anchoring to the original audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that by treating rich captions as semantic representations of audio, user requests can be turned into edited captions to generate new audio using the original as context. This approach eliminates the need for paired training data and predefined templates, enabling natural language instructions for any edit in speech, music, or sound. It matters because current methods are limited to specific operations and require domain-specific pipelines, while this unifies them in zero-shot fashion. Evaluations show it maintains consistency and matches expert models in most cases.

Core claim

Bagpiper-Edit reformulates audio editing as a rich-caption rewriting task. A rich caption acts as the semantic representation of an audio clip. The user request is translated into an edited caption that guides generation of the target edited audio, with the original audio serving as a contextual acoustic anchor. This enables free-form editing without paired audio-editing training data and supports zero-shot capabilities across domains.

What carries the argument

Rich-caption rewriting, where edited captions condition audio generation anchored by the original audio's acoustic features.

If this is right

  • Supports free-form natural language instructions instead of templates.
  • Works across speech, music, and general audio without separate pipelines.
  • Achieves zero-shot editing by avoiding paired training data requirements.
  • Maintains consistency with the original audio while performing comparably to specialized models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method might generalize to other generation tasks where semantic representations can be edited textually.
  • It could inspire similar caption-based editing in video or image domains.
  • Reducing reliance on paired data could accelerate development of audio manipulation tools.

Load-bearing premise

A rich caption serves as a sufficient semantic representation of the audio clip so that editing the caption and regenerating produces the desired acoustic changes without losing un-described details.

What would settle it

Finding an audio example where important un-captioned acoustic features are altered or lost after caption rewriting and regeneration, despite the edit request not targeting those features.

Figures

Figures reproduced from arXiv: 2606.21227 by Haoran Wang, Jinchuan Tian, Shinji Watanabe, William Chen, Xun Gong, Yanmin Qian.

Figure 1
Figure 1. Figure 1: Bagpiper-Edit treats a rich caption as the semantic representation of an audio clip, and defines editing as a text￾space transformation before materializing the edited audio via anchored original audio. Bagpiper-Base [30], pre-trained on caption-to-audio and audio￾to-caption tasks. Shown in Fig.1, a rich caption is a detailed natural-language description of an audio clip’s events and at￾tributes, capturing… view at source ↗
Figure 2
Figure 2. Figure 2: Dialogue patterns for Bagpiper-Edit. (a) The Single￾Turn (ST) pattern concatenates captions and audio within a sin￾gle interaction turn. (b) The Multi-Turn (MT) pattern structures the input as a two-round sequential dialogue. U denotes user and A denotes assistant, both are special tags inherited from Bagpiper-Base. For model θ, an audio editing request therefore requires generating a target edited audio s… view at source ↗
Figure 3
Figure 3. Figure 3 [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
read the original abstract

Current text-guided audio editing methods rely on paired training data, predefined operation templates, and separate processing pipelines across speech, music, and sound. We present Bagpiper-Edit to enable open-ended audio editing via free-form natural language instructions. We reformulate audio editing as a rich-caption rewriting task by treating a rich caption as the semantic representation of an audio clip. The user request is translated into an edited caption, which then guides Bagpiper-Edit to generate the target edited audio with the original audio as contextual acoustic anchor. This unlocks the potential of free-form editing, and circumvents the need for paired audio-editing training data, enabling powerful zero-shot editing capabilities. Evaluations across speech, audio, and free-form editing show Bagpiper-Edit maintains good consistency to the original audio and achieves similar performance to other expert models in most cases. Demo: https://bagpiper-edit.github.io, Codes: https://github.com/espnet/espnet/pull/6417 & https://github.com/HsunGong/espnet

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents Bagpiper-Edit, a zero-shot method for open-ended audio editing. It reformulates editing as a rich-caption rewriting task in which a rich caption serves as the semantic representation of an audio clip; user instructions are translated into an edited caption that, together with the original audio as contextual acoustic anchor, guides generation of the target audio. The approach is claimed to support free-form natural-language instructions across speech, music, and environmental sound without requiring paired audio-editing training data. Evaluations are stated to demonstrate good consistency with the original and performance comparable to specialized expert models.

Significance. If the central claims hold, the work offers a unified, training-data-efficient route to open-ended audio editing by composing existing captioning and generation models. The explicit release of code (ESPnet PR and companion repository) and a public demo constitute a clear reproducibility strength. The core modeling choice—treating rich captions as editable semantic proxies while anchoring on the original waveform—addresses a practical bottleneck in the field, provided the caption-plus-anchor mechanism reliably preserves task-relevant acoustic detail.

major comments (2)
  1. [Abstract / Evaluation section] Abstract and Evaluation section: the statement that 'evaluations across speech, audio, and free-form editing show Bagpiper-Edit maintains good consistency … and achieves similar performance to other expert models in most cases' supplies no quantitative metrics, baseline tables, dataset descriptions, or statistical tests. Because these results are the sole empirical support for the zero-shot claim, their absence prevents assessment of whether the method actually meets the stated performance level.
  2. [Method section] Method section (rich-caption rewriting and anchor conditioning): the central claim that rewriting the caption and regenerating with the original audio as anchor yields the intended edit rests on the untested assumption that any acoustic property omitted by the captioner (timbre micro-variations, exact reverb tails, non-linguistic textures) is either irrelevant or perfectly recovered by the anchor. No ablation isolating caption-omitted features or measuring drift when the edited caption receives higher weight is reported; this directly bears on whether the zero-shot regime is robust.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the empirical presentation and the core modeling assumptions. We address each major comment below and commit to revisions that strengthen the manuscript without altering its central claims.

read point-by-point responses
  1. Referee: [Abstract / Evaluation section] Abstract and Evaluation section: the statement that 'evaluations across speech, audio, and free-form editing show Bagpiper-Edit maintains good consistency … and achieves similar performance to other expert models in most cases' supplies no quantitative metrics, baseline tables, dataset descriptions, or statistical tests. Because these results are the sole empirical support for the zero-shot claim, their absence prevents assessment of whether the method actually meets the stated performance level.

    Authors: The referee is correct that the current version presents evaluation outcomes primarily through qualitative descriptions and high-level statements rather than quantitative tables, baseline comparisons, dataset details, or statistical tests. This limits independent assessment of the zero-shot performance claims. In the revised manuscript we will expand the Evaluation section to include quantitative metrics (e.g., consistency and quality scores), comparison tables against expert models, descriptions of the test sets used for speech/music/environmental editing, and any applicable statistical analysis. These additions will be placed in both the main text and supplementary material. revision: yes

  2. Referee: [Method section] Method section (rich-caption rewriting and anchor conditioning): the central claim that rewriting the caption and regenerating with the original audio as anchor yields the intended edit rests on the untested assumption that any acoustic property omitted by the captioner (timbre micro-variations, exact reverb tails, non-linguistic textures) is either irrelevant or perfectly recovered by the anchor. No ablation isolating caption-omitted features or measuring drift when the edited caption receives higher weight is reported; this directly bears on whether the zero-shot regime is robust.

    Authors: We agree that the robustness of the caption-plus-anchor mechanism would be better supported by explicit ablations. The current manuscript relies on the design rationale and qualitative listening results to argue that the acoustic anchor recovers omitted details, but does not isolate the contribution of caption-omitted features or vary the relative weighting of the edited caption. In revision we will add an ablation subsection that (1) compares performance when selected acoustic attributes are deliberately omitted from the caption and (2) reports results under different caption-versus-anchor weighting schemes. This will directly test the assumption highlighted by the referee. revision: yes

Circularity Check

0 steps flagged

No circularity; pipeline relies on external models

full rationale

The paper's core claim is a methodological reformulation of audio editing into a rich-caption rewriting task that uses an external captioner to produce a semantic representation, rewrites the caption per user instruction, and regenerates audio via a separate generator conditioned on the original clip as anchor. This construction draws performance from pre-trained external components rather than from any fitted parameter, self-referential equation, or self-citation chain internal to the present work. No equation or derivation step reduces the claimed zero-shot result to its own inputs by construction; the zero-shot property follows directly from the absence of paired editing data in the training pipeline. The provided text contains no load-bearing self-citations or uniqueness theorems imported from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.1-grok · 5722 in / 1132 out tokens · 23425 ms · 2026-06-26T13:15:34.621420+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 11 linked inside Pith

  1. [1]

    I looked out the window and saw the rain falling on the empty street. It made me realize how long I've been waiting for you've been waiting for you to come back

    Introduction Open-ended audio editing aims to modify an original audio recording according to a natural language user request. Unlike text-to-audio generation, the model must perform targeted op- erations while keeping other content unchanged. This requires identifying which elements to edit and which to preserve, while maintaining overall realism. The ma...

  2. [2]

    ZETA [33] and Prompt-Guided-Edit [13] approximate latent inversion per- form text-guided editing but [13] requires explicit token-level temporal alignments

    Related work Audio Editing: Prior text-guided audio editing approaches of- ten intervene directly within diffusion frameworks. ZETA [33] and Prompt-Guided-Edit [13] approximate latent inversion per- form text-guided editing but [13] requires explicit token-level temporal alignments. Alternatively, AUDIT [11] binds pre- defined atomic operations (such as a...

  3. [3]

    While these pipelines improve user flexi- bility, they remain constrained by rigid atomic operations and the dependency on operation-specific paired training data

    and AudioChat [23] employ LLMs, as reasoning agents to decompose abstract requests into deterministic sequences of atomic operations. While these pipelines improve user flexi- bility, they remain constrained by rigid atomic operations and the dependency on operation-specific paired training data. In contrast,Bagpiper-Editbypasses these limitations by repl...

  4. [4]

    Preliminaries and Problem Formulation Bagpiper-Editis built upon the foundational architecture of Bagpiper-Base[30], a unified, autoregressive audio foundation model

    Bagpiper-Edit 3.1. Preliminaries and Problem Formulation Bagpiper-Editis built upon the foundational architecture of Bagpiper-Base[30], a unified, autoregressive audio foundation model. At its core, the model utilizes a decoder-only LLM to establish a bidirectional mapping between an audio signalaand its rich captionc, thereby unifying comprehensive audio...

  5. [5]

    TheBagpiper-Editmodel is built based onBagpiper-Base, and we train two distinct model variantsSTandMTbased on these two patterns in Sec.3.2

    Experimental setup Bagpiper-Baseutilizes the Qwen3-8B-Base [36] and multi- stream X-Codec [37] operating at 50Hz, as the audio prediction targets, and is pre-trained on caption-to-audio synthesis, audio- to-caption understanding, and pure text modeling data [30]. TheBagpiper-Editmodel is built based onBagpiper-Base, and we train two distinct model variant...

  6. [6]

    Speech editing We first analyze transcription editing in Tab.1

    Experiments 5.1. Speech editing We first analyze transcription editing in Tab.1. Although pro- vided with original audio,Bagpiper-Basestill causes severe speaker identity loss (SpkSIM = 0.58). Our self-supervised training resolves this: theSTpattern achieves SpkSIM of 0.86, surpassing CosyV oice-3. However, it also results in poor edit- ing accuracy (47.1...

  7. [7]

    By avoiding rigid predefined operations, it provides a unified, natural language-driven interface for audio editing across speech, sound, and music domains

    Conclusion This paper introducesBagpiper-Edit, a novel framework that reformulates open-ended audio editing as a text-space rich- caption rewriting task. By avoiding rigid predefined operations, it provides a unified, natural language-driven interface for audio editing across speech, sound, and music domains. A core con- tribution is our self-supervised t...

  8. [8]

    U25A20409, and in part by SJTU Med-X (Medicine & Engineering) Translational Research Grant (YG2025LC09)

    Acknowledgments This work was supported in part by China NSFC project under Grants No. U25A20409, and in part by SJTU Med-X (Medicine & Engineering) Translational Research Grant (YG2025LC09)

  9. [9]

    Additionally, as detailed in Sec.4, Large Lan- guage Models (LLMs) and multimodal foundational models were employed for evaluation

    Generative AI Use Disclosure Generative AI tools were utilized in the preparation of this manuscript primarily for stylistic polishing and grammatical re- finement; they were not used to author significant original tech- nical content. Additionally, as detailed in Sec.4, Large Lan- guage Models (LLMs) and multimodal foundational models were employed for e...

  10. [10]

    Qwen2.5-omni technical report,

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y . Fan, K. Danget al., “Qwen2.5-omni technical report,”arXiv preprint arXiv:2503.20215, 2025

  11. [11]

    Qwen3-omni technical report,

    J. Xu, Z. Guo, H. Hu, Y . Chu, X. Wang, J. He, Y . Wang, X. Shi, T. He, X. Zhuet al., “Qwen3-omni technical report,”arXiv preprint arXiv:2509.17765, 2025

  12. [12]

    Audio flamingo: a novel audio language model with few- shot learning and dialogue abilities,

    Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catan- zaro, “Audio flamingo: a novel audio language model with few- shot learning and dialogue abilities,” inProceedings of the 41st In- ternational Conference on Machine Learning, 2024, pp. 25 125– 25 148

  13. [13]

    Audio flamingo 2: An audio- language model with long-audio understanding and expert reason- ing abilities,

    S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim, W. Ping, R. Valle, D. Manocha, and B. Catanzaro, “Audio flamingo 2: An audio- language model with long-audio understanding and expert reason- ing abilities,” inInternational Conference on Machine Learning. PMLR, 2025, pp. 19 358–19 405

  14. [14]

    Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,

    S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, and B. Catanzaro, “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” inThe Thirty-ninth Annual Confer- ence on Neural Information Processing Systems, 2025

  15. [15]

    Music flamingo: Scaling music understanding in audio language models,

    S. Ghosh, A. Goel, L. Koroshinadze, S.-g. Lee, Z. Kong, J. F. Santos, R. Duraiswami, D. Manocha, W. Ping, M. Shoeybiet al., “Music flamingo: Scaling music understanding in audio language models,”arXiv preprint arXiv:2511.10289, 2025

  16. [16]

    Qwen2-audio technical report,

    Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Linet al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024

  17. [17]

    Owsm v3. 1: Better and faster open whisper-style speech models based on e-branchformer,

    Y . Peng, J. Tian, W. Chen, S. Arora, B. Yan, Y . Sudo, M. Shakeel, K. Choi, J. Shi, X. Changet al., “Owsm v3. 1: Better and faster open whisper-style speech models based on e-branchformer,” in Proc. Interspeech 2024, 2024, pp. 352–356

  18. [18]

    Owls: Scaling laws for multilingual speech recognition and translation models,

    W. Chen, J. Tian, Y . Peng, B. Yan, C.-H. H. Yang, and S. Watan- abe, “Owls: Scaling laws for multilingual speech recognition and translation models,” inForty-second International Conference on Machine Learning, 2025

  19. [19]

    Moshi: a speech-text foundation model for real-time dialogue,

    A. D ´efossezet al., “Moshi: a speech-text foundation model for real-time dialogue,”arXiv preprint arXiv:2410.00037, 2024

  20. [20]

    Audit: Audio editing by following instructions with latent diffusion models,

    Y . Wang, Z. Ju, X. Tan, L. He, Z. Wu, J. Bianet al., “Audit: Audio editing by following instructions with latent diffusion models,” inAdvances in Neural Information Processing Systems, vol. 36, 2023, pp. 71 340–71 357

  21. [21]

    Audioeditor: A training-free diffusion-based audio edit- ing framework,

    Y . Jia, Y . Chen, J. Zhao, S. Zhao, W. Zeng, Y . Chen, and Y . Qin, “Audioeditor: A training-free diffusion-based audio edit- ing framework,” inICASSP 2025-2025 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  22. [22]

    Prompt- guided precise audio editing with diffusion models,

    M. Xu, C. Li, D. Zhang, D. Su, W. Liang, and D. Yu, “Prompt- guided precise audio editing with diffusion models,” inProceed- ings of the 41st International Conference on Machine Learning, 2024, pp. 55 126–55 143

  23. [23]

    Zero-shot unsupervised and text- based audio editing using ddpm inversion,

    H. Manor and T. Michaeli, “Zero-shot unsupervised and text- based audio editing using ddpm inversion,” inProceedings of the 41st International Conference on Machine Learning, 2024, pp. 34 603–34 629

  24. [24]

    Stable audio open,

    Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Stable audio open,” inICASSP 2025-2025 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

  25. [25]

    Step-audio-editx technical report,

    C. Yan, B. Wuet al., “Step-audio-editx technical report,”arXiv preprint arXiv:2511.03601, 2025

  26. [26]

    Editgen: Harnessing cross-attention control for instruction-based auto- regressive audio editing,

    V . Sioros, A. Potamianos, and G. Paraskevopoulos, “Editgen: Harnessing cross-attention control for instruction-based auto- regressive audio editing,”arXiv preprint arXiv:2507.11096, 2025

  27. [27]

    Simple and controllable music gen- eration,

    J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y . Adi, and A. D ´efossez, “Simple and controllable music gen- eration,” inAdvances in Neural Information Processing Systems, vol. 36, 2024

  28. [28]

    Analyzable chain-of-musical- thought prompting for high-fidelity music generation,

    M. W. Lam, Y . Xing, W. You, J. Wu, Z. Yin, F. Jiang, H. Liu, F. Liu, X. Li, W.-T. Luet al., “Analyzable chain-of-musical- thought prompting for high-fidelity music generation,”arXiv preprint arXiv:2503.19611, 2025

  29. [29]

    Wavcraft: Audio edit- ing and generation with large language models,

    J. Liang, H. Zhang, H. Liu, Y . Cao, Q. Kong, X. Liu, W. Wang, M. D. Plumbley, H. Phan, and E. Benetos, “Wavcraft: Audio edit- ing and generation with large language models,” inICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024

  30. [30]

    V oicecraft: Zero-shot speech editing and text-to-speech in the wild,

    P. Peng, P.-Y . Huang, S.-W. Li, A. Mohamed, and D. Harwath, “V oicecraft: Zero-shot speech editing and text-to-speech in the wild,” inProceedings of the 62nd Annual Meeting of the Asso- ciation for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 12 442–12 462

  31. [31]

    Guiding audio editing with audio language model,

    Z. Lan, Y . Hao, and M. Zhao, “Guiding audio editing with audio language model,”arXiv preprint arXiv:2509.21625, 2025

  32. [32]

    Audiochat: Unified audio sto- rytelling, editing, and understanding with transfusion forcing,

    W. Chen, P. Seetharaman, R. Kumar, O. Nieto, S. Watan- abe, J. Salamon, and Z. Jin, “Audiochat: Unified audio sto- rytelling, editing, and understanding with transfusion forcing,” arXiv preprint arXiv:2602.17097, 2026

  33. [33]

    E2 tts: Embarrassingly easy fully non-autoregressive zeroshot tts,

    S. E. Eskimez, X. Wang, M. Thakker, C. Li, C.-H. Tsai, Z. Xiao, H. Yang, Z. Zhu, M. Tang, X. Tanet al., “E2 tts: Embarrassingly easy fully non-autoregressive zeroshot tts,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 682– 689

  34. [34]

    Ming-uniaudio: Speech llm for joint understanding, generation and editing with unified represen- tation,

    C. Yan, C. Jin, D. Huang, H. Yu, H. Peng, H. Zhan, J. Gao, J. Peng, J. Chen, J. Zhouet al., “Ming-uniaudio: Speech llm for joint understanding, generation and editing with unified represen- tation,”arXiv preprint arXiv:2511.05516, 2025

  35. [35]

    CosyV oice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training,

    Z. Du, C. Gao, Y . Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi, K. An, G. Yang, Y . Li, Y . Chen, Z. Gao, Q. Chen, Y . Gu, M. Chen, Y . Chen, S. Zhang, W. Wang, and J. Ye, “CosyV oice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training,”arXiv preprint arXiv:2505.17589, May 2025

  36. [36]

    Qwen3-TTS Technical Report,

    H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo, X. Zhang, P. Zhang, B. Yang, J. Xu, J. Zhou, and J. Lin, “Qwen3-TTS Technical Report,”arXiv preprint arXiv:2601.15621, Jan. 2026

  37. [37]

    Cosyedit: High-fidelity audio editing with large lan- guage models,

    e. a. Chen, “Cosyedit: High-fidelity audio editing with large lan- guage models,”arXiv preprint, 2026

  38. [38]

    Scal- ing rich style-prompted text-to-speech datasets,

    A. Diwan, Z. Zheng, D. Harwath, and E. Choi, “Scal- ing rich style-prompted text-to-speech datasets,”arXiv preprint arXiv:2503.04713, 2025

  39. [39]

    Bagpiper: Solv- ing open-ended audio tasks via rich captions,

    J. Tian, H. Wang, B.-H. Su, C.-y. Huang, Q. Wang, J. Shi, W. Chen, X. Gong, S. Arora, C.-J. Liet al., “Bagpiper: Solv- ing open-ended audio tasks via rich captions,”arXiv preprint arXiv:2602.05220, 2026

  40. [40]

    Fusionaudio-1.2 m: Towards fine-grained audio captioning with multimodal contextual fusion,

    S. Chen, X. Xie, Z. Chen, L. Zhao, O. Lee, Z. Su, Q. Sun, and B. Wang, “Fusionaudio-1.2 m: Towards fine-grained audio captioning with multimodal contextual fusion,”arXiv preprint arXiv:2506.01111, 2025

  41. [41]

    Omni-captioner: Data pipeline, mod- els, and benchmark for omni detailed perception,

    Z. Ma, R. Xu, Z. Xing, Y . Chu, Y . Wang, J. He, J. Xu, P.-A. Heng, K. Yu, J. Linet al., “Omni-captioner: Data pipeline, mod- els, and benchmark for omni detailed perception,”arXiv preprint arXiv:2510.12720, 2025

  42. [42]

    Zeta: Zero-shot expressive text-to- audio generation and editing,

    H. Manor and T. Michaeli, “Zeta: Zero-shot expressive text-to- audio generation and editing,”arXiv preprint, 2024

  43. [43]

    Cosyvoice 2: Scalable streaming speech synthesis with large language models,

    Z. Duet al., “Cosyvoice 2: Scalable streaming speech synthesis with large language models,”arXiv preprint arXiv:2412.10117, 2024

  44. [44]

    Mimo-audio: Audio language models are few-shot learners,

    D. Zhang, G. Wang, J. Xue, K. Fang, L. Zhao, R. Ma, S. Ren, S. Liu, T. Guo, W. Zhuanget al., “Mimo-audio: Audio language models are few-shot learners,”arXiv preprint arXiv:2512.23808, 2025

  45. [45]

    Qwen3 technical report,

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lvet al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025

  46. [46]

    Codec does matter: Exploring the semantic shortcoming of codec for audio language model,

    Z. Ye, P. Sun, J. Lei, H. Lin, X. Tan, Z. Dai, Q. Kong, J. Chen, J. Pan, Q. Liuet al., “Codec does matter: Exploring the semantic shortcoming of codec for audio language model,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025, pp. 25 697–25 705

  47. [47]

    Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation,

    H. He, Z. Shang, C. Wang, X. Li, Y . Gu, H. Hua, L. Liu, C. Yang, J. Li, P. Shiet al., “Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 885–890

  48. [48]

    Audio set: An ontology and human-labeled dataset for audio events,

    J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in2017 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780

  49. [49]

    Wavcaps: A chatgpt-assisted weakly- labelled audio captioning dataset for audio-language multimodal research,

    X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumb- ley, Y . Zou, and W. Wang, “Wavcaps: A chatgpt-assisted weakly- labelled audio captioning dataset for audio-language multimodal research,”IEEE/ACM Transactions on Audio, Speech, and Lan- guage Processing, 2024

  50. [50]

    Audiocaps: Generat- ing captions for audios in the wild,

    C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generat- ing captions for audios in the wild,” inProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol- ume 1 (Long and Short Papers), 2019, pp. 119–132

  51. [51]

    Classifier-free diffusion guidance,

    J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Down- stream Applications, 2021

  52. [52]

    Koel-tts: Enhanc- ing llm based speech generation with preference alignment and classifier free guidance,

    S. S. Hussain, P. Neekhara, X. Yang, E. Casanova, S. Ghosh, R. Fejgin, M. T. Desta, R. Valle, and J. Li, “Koel-tts: Enhanc- ing llm based speech generation with preference alignment and classifier free guidance,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 21 230–21 245

  53. [53]

    Stochastic beams and where to find them: The gumbel-top-k trick for sampling se- quences without replacement,

    W. Kool, H. Van Hoof, and M. Welling, “Stochastic beams and where to find them: The gumbel-top-k trick for sampling se- quences without replacement,” inInternational Conference on Machine Learning. PMLR, 2019, pp. 3499–3508

  54. [54]

    Lib- rispeech: An asr corpus based on public domain audio books,

    V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: An asr corpus based on public domain audio books,” in2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210

  55. [55]

    Robust speech recognition via large-scale weak su- pervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak su- pervision,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 28 492–28 518

  56. [56]

    Wavlm: Large-scale self- supervised pre-training for full stack speech processing,

    S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiaoet al., “Wavlm: Large-scale self- supervised pre-training for full stack speech processing,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

  57. [57]

    Seed-tts: A family of high-quality versa- tile speech generation models,

    P. Anastassiouet al., “Seed-tts: A family of high-quality versa- tile speech generation models,”arXiv preprint arXiv:2406.02430, 2024

  58. [58]

    Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,

    C. K. Reddy, V . Gopal, and R. Cutler, “Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise sup- pressors,” inICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6493–6497

  59. [59]

    emotion2vec: Self-supervised pre-training for speech emotion representation,

    Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech emotion representation,” inFindings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 15 747–15 760

  60. [60]

    Versa: A versatile eval- uation toolkit for speech, audio, and music,

    J. Shi, H.-j. Shim, J. Tian, S. Arora, H. Wu, D. Petermann, J. Q. Yip, Y . Zhang, Y . Tang, W. Zhanget al., “Versa: A versatile eval- uation toolkit for speech, audio, and music,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)...

  61. [61]

    Fr ´echet au- dio distance: A reference-free metric for evaluating music en- hancement algorithms,

    K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fr ´echet au- dio distance: A reference-free metric for evaluating music en- hancement algorithms,” inProc. InterSpeech 2019, 2019

  62. [62]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

    Y . Wu, K. Chen, T. Zhang, Y . Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  63. [63]

    Audioldm 2: Learn- ing holistic audio generation with self-supervised pretraining,

    H. Liu, Y . Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian, Y . Wang, W. Wang, Y . Wang, and M. D. Plumbley, “Audioldm 2: Learn- ing holistic audio generation with self-supervised pretraining,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 32, pp. 2871–2883, 2024