pith. machine review for the scientific record.

arxiv: 2605.04505 · v1 · submitted 2026-05-06 · 📡 eess.AS · cs.AI · cs.SD

Recognition: unknown

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:40 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.SD
keywords zero-shot audio evaluation · LLM alignment · instruction tuning · audio quality assessment · multimodal LLMs · speech evaluation · generative audio models

The pith

JASTIN aligns LLMs to audio encoders so natural language instructions alone drive zero-shot evaluation of speech, sound, and music.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JASTIN as a method to turn audio evaluation into an instruction-following reasoning task. It connects a frozen audio encoder to a fine-tuned LLM through a trainable adapter and trains on a carefully constructed dataset that mixes sources, tasks, calibration signals, and varied descriptions. The central claim is that this setup produces strong alignment with human subjective ratings across domains without any task-specific retraining. If correct, evaluators could assess new generative audio models using plain English prompts instead of fixed metrics or domain-specific fine-tuning.

Core claim

JASTIN formulates audio assessment as a self-instructed reasoning task by bridging a frozen high-performance audio encoder with a fine-tuned LLM backbone via a trainable audio adapter, then applies a Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data preparation pipeline to enable robust zero-shot generalization. The claimed result is state-of-the-art Pearson and Spearman correlations with human ratings, surpassing general multimodal LLMs on speech, sound, music, and out-of-domain tasks.
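
To make the claimed interface concrete: the paper's Figure 1 text frames evaluation as a score function s = f(T, A), where T is a natural language task description and A the input audio, computed through a frozen encoder, a trainable adapter, and an LLM backbone. Below is a minimal PyTorch-style sketch of that wiring only; every module, dimension, and name is a hypothetical placeholder, not the paper's actual component.

    # Minimal sketch of the s = f(T, A) interface described above: a frozen audio
    # encoder feeds a trainable adapter, and the adapted audio tokens are consumed
    # by the LLM together with the embedded instruction T. All names and sizes are
    # hypothetical placeholders.
    import torch
    import torch.nn as nn

    class FrozenAudioEncoder(nn.Module):
        """Stand-in for a pretrained audio encoder whose weights stay frozen."""
        def __init__(self, dim_audio: int = 512):
            super().__init__()
            self.proj = nn.Linear(80, dim_audio)   # e.g. log-mel frames -> embeddings
            for p in self.parameters():
                p.requires_grad = False            # frozen, as the claim specifies

        def forward(self, mel: torch.Tensor) -> torch.Tensor:
            return self.proj(mel)                  # (batch, frames, dim_audio)

    class AudioAdapter(nn.Module):
        """Trainable bridge projecting audio embeddings into the LLM's input space."""
        def __init__(self, dim_audio: int = 512, dim_llm: int = 1024):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim_audio, dim_llm), nn.GELU(),
                                     nn.Linear(dim_llm, dim_llm))

        def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
            return self.net(audio_emb)             # (batch, frames, dim_llm)

    def evaluate(instruction_emb, mel, encoder, adapter, llm):
        """s = f(T, A): prepend the instruction to the adapted audio, let the LLM score it."""
        audio_tokens = adapter(encoder(mel))
        sequence = torch.cat([instruction_emb, audio_tokens], dim=1)
        return llm(sequence)                       # per-token outputs; pooled below

    encoder, adapter = FrozenAudioEncoder(), AudioAdapter()
    llm_head = nn.Linear(1024, 1)                  # toy stand-in for the LLM backbone
    mel = torch.randn(2, 100, 80)                  # two 100-frame clips (A)
    instr = torch.randn(2, 16, 1024)               # already-embedded instruction (T)
    scores = evaluate(instr, mel, encoder, adapter, llm_head).mean(dim=1)
    print(scores.shape)                            # torch.Size([2, 1])

In a real system the instruction would be tokenized and embedded by the LLM itself, and only the adapter (plus whatever parts of the LLM are fine-tuned) would receive gradients; the sketch fixes only the data flow the claim depends on.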

What carries the argument

The Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description instruction-following data preparation pipeline, which supplies the LLM with varied audio inputs and natural language prompts to produce evaluation reasoning.
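
As an illustration only (not the authors' actual pipeline), one way such a four-way varied instruction set could be assembled is to sample the audio source, the evaluation task, the calibration hint, and the prompt wording independently for each rated clip. Every source name, task string, and template below is invented for the sketch.

    # Hedged sketch of assembling one Multi-Source / Multi-Task / Multi-Calibration /
    # Multi-Description training example. All sources, tasks, and templates are
    # invented placeholders, not the paper's data.
    import random

    SOURCES = ["tts_corpus", "environmental_sounds", "music_clips"]      # Multi-Source
    TASKS = {                                                            # Multi-Task
        "naturalness": "Rate how natural the speech sounds on a 1-5 scale.",
        "noisiness": "Rate how free of background noise this recording is on a 1-5 scale.",
        "musicality": "Rate the overall musical quality on a 1-5 scale.",
    }
    CALIBRATIONS = [                                                     # Multi-Calibration
        "A 5 means indistinguishable from a clean studio recording.",
        "Anchor your rating against typical crowd-sourced MOS labels.",
    ]
    TEMPLATES = [                                                        # Multi-Description
        "{task} {calibration}",
        "{calibration} With that in mind: {task}",
    ]

    def make_example(clip_id: str, human_score: float) -> dict:
        """Pair one rated clip with a freshly sampled instruction and its target score."""
        task = TASKS[random.choice(list(TASKS))]
        prompt = random.choice(TEMPLATES).format(task=task,
                                                 calibration=random.choice(CALIBRATIONS))
        return {
            "source": random.choice(SOURCES),
            "audio": clip_id,
            "instruction": prompt,
            "target": f"{human_score:.1f}",   # subjective rating the model must predict
        }

    print(make_example("clip_0001.wav", 3.5))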

If this is right

  • Evaluation of new generative audio models becomes possible with only natural language instructions and no per-task retraining.
  • A single model can handle speech quality, environmental sound, music, and out-of-domain audio using the same weights.
  • Human subjective ratings can be approximated more closely than with existing objective metrics or untuned multimodal LLMs (the correlation measures behind this claim are sketched just after this list).
  • The approach removes the need to redesign separate metrics when audio generation methods change.
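
The alignment numbers these implications rest on are ordinary Pearson and Spearman coefficients between predicted scores and human ratings. A minimal SciPy computation, with invented numbers purely to show the mechanics:

    # Pearson and Spearman correlation between hypothetical human MOS ratings and
    # hypothetical system predictions; the values are invented for illustration.
    from scipy.stats import pearsonr, spearmanr

    human_mos = [3.2, 4.1, 2.5, 4.8, 3.9, 1.7]
    model_scores = [3.0, 4.3, 2.8, 4.6, 3.5, 2.0]

    pearson_r, _ = pearsonr(human_mos, model_scores)
    spearman_rho, _ = spearmanr(human_mos, model_scores)
    print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")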

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adapter-plus-instruction recipe could be tested on video or multimodal generation evaluation without starting from scratch.
  • If the pipeline truly avoids overfitting, swapping the audio encoder for a newer one should preserve most of the zero-shot gains.
  • Developers of generative audio systems could use the model as an automated judge during iterative training loops.

Load-bearing premise

The data preparation pipeline creates genuine zero-shot generalization instead of the model simply learning patterns that happen to appear in the instruction examples.

What would settle it

A test on a held-out audio domain whose sources, tasks, and description styles are completely absent from the training pipeline: if performance there drops to the level of untuned general MLLMs, the zero-shot claim fails; if it holds up, the pipeline is doing the work.

Figures

Figures reproduced from arXiv: 2605.04505 by Bach Viet Do, Bowen Shi, Haibin Wu, Leying Zhang, Yanmin Qian.

Figure 1. Pipeline of our proposed framework JASTIN. The surrounding text defines the predicted score s as a function of the evaluation system f, the natural language task description T, and the input audio A: s = f(T, A). view at source ↗
Figure 2. Data preparation pipeline of our proposed framework. view at source ↗
Figure 3. Cross-Model Spearman Correlation Comparison on … view at source ↗
Figure 4. Cross-Metric Spearman Correlation Comparison of Our … view at source ↗
Figure 5. Training and Inference Performance Comparison with … view at source ↗
read the original abstract

The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot capabilities, and instructional flexibility. To address these bottlenecks, we propose JASTIN, a generalizable, instruction-driven audio evaluation framework that formulates audio assessment as a self-instructed reasoning task. JASTIN bridges a frozen high-performance audio encoder with a fine-tuned LLM backbone via a trainable audio adapter. To ensure robust zero-shot generalization, we introduce a comprehensive instruction following data preparation pipeline, incorporating Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data. Experimental results demonstrate that JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings. It consistently outperforms general MLLMs across speech, sound, music, and out-of-domain evaluation tasks without the need for task-specific retraining.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents JASTIN, a framework for zero-shot audio and speech evaluation that formulates assessment as an instruction-driven reasoning task. It bridges a frozen audio encoder with a fine-tuned LLM via a trainable adapter and introduces a Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data preparation pipeline to promote generalization. The central claim is that JASTIN achieves state-of-the-art Pearson and Spearman correlations with human ratings and consistently outperforms general MLLMs across speech, sound, music, and out-of-domain tasks without task-specific retraining.

Significance. If the zero-shot generalization is substantiated through rigorous verification of instruction novelty, this work could advance audio evaluation by offering a flexible, natural-language-based alternative to fixed metrics or retrained models. The instruction-following pipeline is a notable strength that, if properly validated, supports broader applicability.

major comments (2)
  1. [§3] §3 (Method), Data Preparation Pipeline subsection: The Multi-Description component is presented as enabling robust zero-shot generalization, but the manuscript provides no quantitative verification (e.g., embedding similarity, lexical overlap, or semantic distance metrics) that evaluation prompts are disjoint from the generated training instructions. This verification is load-bearing for the zero-shot claim, as overlap could explain performance gains via pattern matching rather than instruction-driven reasoning. (A minimal sketch of such checks appears after the comment list.)
  2. [§4] §4 (Experiments): The results assert SOTA Pearson and Spearman correlations and outperformance over general MLLMs, yet supply no details on baseline adaptations for zero-shot settings, data splits, or statistical significance tests (e.g., p-values or confidence intervals on correlation differences). These omissions undermine assessment of whether the reported gains are robust and attributable to the proposed pipeline.
minor comments (2)
  1. [Abstract] The abstract and §1 could more explicitly define the scope of 'out-of-domain evaluation tasks' and list the specific benchmarks used to allow readers to assess the breadth of generalization.
  2. [Figure 1] Figure 1 (architecture diagram) would benefit from explicit annotation of which components are frozen versus trainable to clarify the training setup.
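
To make major comment 1 concrete, two cheap disjointness checks of the kind it asks for are sketched below: character n-gram overlap as a lexical measure and TF-IDF cosine similarity as a crude semantic proxy (a sentence-embedding model would be stronger). The strings are invented; this is not the authors' verification.

    # Hedged sketch of verifying that evaluation prompts are disjoint from training
    # instructions: lexical n-gram overlap plus TF-IDF cosine similarity. All strings
    # are invented examples, not the paper's data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    train_instructions = [
        "Rate how natural the speech sounds on a 1-5 scale.",
        "Judge the overall musical quality of this clip.",
    ]
    eval_prompts = [
        "How would a listener score the naturalness of this utterance?",
    ]

    def char_ngrams(s: str, n: int) -> set:
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    def ngram_overlap(a: str, b: str, n: int = 3) -> float:
        """Jaccard overlap of character n-grams, a cheap lexical-similarity proxy."""
        ga, gb = char_ngrams(a.lower(), n), char_ngrams(b.lower(), n)
        return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

    # Lexical check: worst-case overlap of each evaluation prompt with the training set.
    for prompt in eval_prompts:
        worst = max(ngram_overlap(prompt, t) for t in train_instructions)
        print(f"max 3-gram overlap {worst:.2f} for {prompt!r}")

    # Semantic check: cosine similarity in a shared TF-IDF space.
    vec = TfidfVectorizer().fit(train_instructions + eval_prompts)
    sims = cosine_similarity(vec.transform(eval_prompts), vec.transform(train_instructions))
    print(f"max TF-IDF cosine similarity {sims.max():.2f}")

Low values on both measures would support the claim that evaluation prompts are genuinely novel; high values would point toward pattern matching.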

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Method), Data Preparation Pipeline subsection: The Multi-Description component is presented as enabling robust zero-shot generalization, but the manuscript provides no quantitative verification (e.g., embedding similarity, lexical overlap, or semantic distance metrics) that evaluation prompts are disjoint from the generated training instructions. This verification is load-bearing for the zero-shot claim, as overlap could explain performance gains via pattern matching rather than instruction-driven reasoning.

    Authors: We agree that quantitative verification of disjointness is necessary to rigorously support the zero-shot claim. While the Multi-Description pipeline was intentionally designed to produce diverse instructions, the manuscript does not currently report embedding similarity, lexical overlap, or semantic distance metrics. In the revised manuscript we will add a dedicated analysis subsection that computes and reports these metrics (using sentence embeddings for similarity, n-gram overlap for lexical measures, and cosine distance on semantic embeddings) between the training instruction corpus and the held-out evaluation prompts. This addition will directly address the concern and provide evidence that gains arise from instruction-driven reasoning. revision: yes

  2. Referee: [§4] §4 (Experiments): The results assert SOTA Pearson and Spearman correlations and outperformance over general MLLMs, yet supply no details on baseline adaptations for zero-shot settings, data splits, or statistical significance tests (e.g., p-values or confidence intervals on correlation differences). These omissions undermine assessment of whether the reported gains are robust and attributable to the proposed pipeline.

    Authors: We concur that additional experimental details are required for full assessment of robustness. The current manuscript omits explicit descriptions of zero-shot baseline adaptations, precise data splits, and statistical testing. In the revision we will expand §4 to include: (i) detailed prompt-engineering procedures used to adapt general MLLMs to the zero-shot instruction setting, (ii) explicit documentation of training/validation/test splits with confirmation of no data leakage, and (iii) statistical significance results (p-values and confidence intervals) for all reported Pearson and Spearman correlation differences. These changes will allow readers to evaluate the strength and attribution of the claimed improvements. revision: yes
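
For the statistical testing promised in response 2, one standard recipe is a paired bootstrap over clips: resample the test items with replacement and check whether the Spearman gap between the proposed system and a baseline keeps its sign. A minimal sketch with invented scores (nothing here comes from the paper):

    # Paired bootstrap confidence interval for a Spearman correlation difference
    # between two systems scored against the same human ratings. All numbers are
    # invented placeholders.
    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    human = np.array([3.2, 4.1, 2.5, 4.8, 3.9, 1.7, 4.4, 2.9])
    system_a = np.array([3.0, 4.3, 2.8, 4.6, 3.5, 2.0, 4.5, 3.1])   # e.g. the proposed model
    system_b = np.array([2.6, 3.9, 3.3, 4.0, 3.8, 2.8, 3.7, 3.4])   # e.g. a general MLLM baseline

    diffs = []
    for _ in range(10_000):
        idx = rng.integers(0, len(human), len(human))   # resample clips with replacement
        rho_a, _ = spearmanr(human[idx], system_a[idx])
        rho_b, _ = spearmanr(human[idx], system_b[idx])
        diffs.append(rho_a - rho_b)

    lo, hi = np.percentile(diffs, [2.5, 97.5])
    print(f"95% bootstrap CI for the Spearman difference: [{lo:.3f}, {hi:.3f}]")

If the interval excludes zero, the reported gap is unlikely to be resampling noise on that test set.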

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents an architectural description (frozen audio encoder + trainable adapter + fine-tuned LLM) and a data-preparation pipeline (Multi-Source/Multi-Task/Multi-Calibration/Multi-Description) whose outputs are then evaluated on held-out tasks. No equations, uniqueness theorems, or self-citations are invoked that reduce the reported Pearson/Spearman correlations to quantities defined by construction within the same work. The zero-shot generalization claim is supported by the explicit separation of training instructions from evaluation prompts rather than by any self-referential fit or renaming of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5478 in / 1087 out tokens · 54007 ms · 2026-05-08T16:40:27.629219+00:00 · methodology

discussion (0)

