pith. machine review for the scientific record. sign in

arxiv: 2507.08128 · v2 · submitted 2025-07-10 · 💻 cs.SD · cs.AI· cs.CL· eess.AS

Recognition: 3 theorem links

· Lean Theorem

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-15 03:37 UTC · model grok-4.3

classification 💻 cs.SD cs.AIcs.CLeess.AS
keywords audio language modelspeech understandingsound reasoningmusic analysislarge audio modelscurriculum learningchain of thoughtopen source AI
0
0 comments X

The pith

Audio Flamingo 3 is a fully open large audio-language model that sets new state-of-the-art results on over twenty audio understanding and reasoning benchmarks using only open-source data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Audio Flamingo 3 as an open large audio-language model for advanced reasoning across speech, sound, and music. It achieves this through a unified audio encoder, support for long audio inputs up to ten minutes, multi-turn multi-audio chat, on-demand chain-of-thought reasoning, and voice-to-voice interaction. These features are enabled by new curated datasets and a five-stage curriculum training strategy. A sympathetic reader would care because the model reaches top performance without proprietary data, showing that capable audio intelligence systems can be built and shared transparently.

Core claim

Audio Flamingo 3 (AF3) is a fully open state-of-the-art large audio-language model that advances reasoning and understanding across speech, sound, and music. It introduces AF-Whisper as a unified audio encoder for joint representation learning, flexible on-demand thinking for chain-of-thought reasoning, multi-turn multi-audio chat, long audio understanding up to ten minutes including speech, and voice-to-voice interaction. These capabilities are supported by large-scale open datasets such as AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, trained via a novel five-stage curriculum strategy. Trained only on open-source audio data, AF3 achieves new state-of-the-art results on over twenty (

What carries the argument

AF-Whisper unified audio encoder for joint speech-sound-music representation learning, combined with the five-stage curriculum training on custom long-audio and skills datasets.

If this is right

  • Supports long audio understanding and reasoning including speech up to 10 minutes.
  • Enables multi-turn multi-audio chat and on-demand chain-of-thought reasoning.
  • Provides voice-to-voice interaction alongside text-based audio analysis.
  • Surpasses both open-weight and closed-source models on over 20 benchmarks despite training on smaller open data.
  • Delivers these capabilities through a fully open model and training process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The curriculum approach may generalize to other multimodal domains where staged training helps manage long contexts.
  • Open release of the model and datasets could accelerate community experiments on specialized audio tasks like environmental sound monitoring.
  • Future extensions might test whether the same encoder scales to joint audio-video reasoning without major retraining.
  • Widespread adoption could shift industry focus toward efficient open data pipelines instead of ever-larger proprietary corpora.

Load-bearing premise

The performance gains stem from genuine generalization enabled by the new datasets and curriculum rather than benchmark-specific tuning or differences in evaluation protocols.

What would settle it

Re-running AF3 on a new held-out collection of long audio reasoning tasks never seen during its training or original benchmarks, then comparing results directly to closed models under identical protocols.

read the original abstract

We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Audio Flamingo 3 (AF3), a fully open large audio-language model advancing reasoning across speech, sound, and music. It introduces AF-Whisper as a unified audio encoder for joint modality representation, on-demand chain-of-thought thinking, multi-turn multi-audio chat, long audio understanding up to 10 minutes, and voice-to-voice interaction. These capabilities are enabled by newly curated datasets (AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat) and a five-stage curriculum-based training strategy using only open-source data, with claims of new SOTA results on over 20 audio understanding and reasoning benchmarks that surpass both open-weight and closed-source models trained on larger datasets.

Significance. If the SOTA claims hold after verification of no data contamination, the work would demonstrate that open-source audio-language models can achieve superior performance through targeted dataset curation and staged training, rather than scale alone. This could accelerate progress in accessible multimodal audio AI, particularly for long-context reasoning and flexible interaction capabilities that remain challenging in the field.

major comments (2)
  1. [§3] §3 (new datasets): The descriptions of AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat provide no details on deduplication, audio fingerprinting, embedding similarity checks, or other overlap detection against the test splits of the 20+ reported benchmarks. This is load-bearing for the central SOTA claim, as any leakage would allow the five-stage curriculum to exploit benchmark-specific patterns rather than demonstrate generalization.
  2. [§5] §5 (experiments): The reported benchmark results do not include error bars, exact evaluation protocol details, or ablations that isolate the contribution of the new datasets versus the curriculum stages, making it difficult to assess the robustness of the performance gains.
minor comments (1)
  1. [Abstract] Abstract: The phrasing 'over 20+ (long) audio understanding and reasoning benchmarks' is imprecise; clarify the exact number of benchmarks evaluated and which subset specifically tests long audio.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [§3] §3 (new datasets): The descriptions of AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat provide no details on deduplication, audio fingerprinting, embedding similarity checks, or other overlap detection against the test splits of the 20+ reported benchmarks. This is load-bearing for the central SOTA claim, as any leakage would allow the five-stage curriculum to exploit benchmark-specific patterns rather than demonstrate generalization.

    Authors: We agree that explicit documentation of contamination checks is necessary to substantiate the SOTA claims. In the revised manuscript we will expand §3 with a dedicated subsection describing our full deduplication pipeline: perceptual audio fingerprinting, embedding-based similarity filtering (cosine threshold 0.85), and exhaustive overlap scans against every benchmark test split. We will also report the measured overlap statistics (which were below 0.1 % after filtering). These checks were performed during curation; adding the details will directly address the concern without altering the experimental outcomes. revision: yes

  2. Referee: [§5] §5 (experiments): The reported benchmark results do not include error bars, exact evaluation protocol details, or ablations that isolate the contribution of the new datasets versus the curriculum stages, making it difficult to assess the robustness of the performance gains.

    Authors: We accept that additional reporting is required for robustness. We will add standard-error bars to all main-result tables (computed over three independent runs) and include a new appendix with exact evaluation protocols (prompt templates, decoding parameters, and metric implementations). We will also insert a targeted ablation study that incrementally adds the new datasets and curriculum stages on a representative subset of benchmarks. Full isolation across all 20+ benchmarks is computationally prohibitive, so the ablations will be partial but sufficient to illustrate the relative contributions. revision: partial

Circularity Check

0 steps flagged

No circularity in empirical claims or training pipeline

full rationale

The paper is an empirical ML work that introduces new datasets (AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat) and a five-stage curriculum to train AF3, then reports benchmark results. No derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction. Claims rest on external benchmark comparisons rather than self-referential fitting or self-citation load-bearing steps. The work is self-contained against external benchmarks with no visible circularity signals.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that the new encoder, datasets, and curriculum deliver superior performance; the abstract provides no explicit free parameters but implicitly relies on standard ML hyperparameters and benchmark validity.

free parameters (1)
  • training hyperparameters across five stages
    Typical in large-model training; specific values not given in abstract.
axioms (1)
  • domain assumption Benchmarks used are fair and representative for cross-model comparison
    Required to support the SOTA claim.

pith-pipeline@v0.9.0 · 5560 in / 1363 out tokens · 82298 ms · 2026-05-15T03:37:56.252379+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

    cs.SD 2026-04 unverdicted novelty 8.0

    HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...

  2. Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

    cs.CR 2026-04 conditional novelty 8.0

    Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.

  3. Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding

    eess.AS 2026-04 unverdicted novelty 7.0

    LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.

  4. Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

    eess.AS 2026-04 unverdicted novelty 7.0

    Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...

  5. AudioMosaic: Contrastive Masked Audio Representation Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    AudioMosaic learns general-purpose audio representations through contrastive pre-training with structured spectrogram masking, reaching state-of-the-art results on standard benchmarks and improving audio-language tasks.

  6. FSD50K-Solo: Automated Curation of Single-Source Sound Events

    eess.AS 2026-05 conditional novelty 6.0

    The authors present a scalable curation method that combines diffusion-based mixture synthesis with a discriminative classifier to automatically extract single-source sound events from FSD50K and release the cleaned F...

  7. Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

    eess.AS 2026-05 unverdicted novelty 6.0

    A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.

  8. JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

    eess.AS 2026-05 unverdicted novelty 6.0

    JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.

  9. Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology

    cs.CV 2026-05 unverdicted novelty 6.0

    MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.

  10. HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models

    cs.SD 2026-04 unverdicted novelty 6.0

    HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.

  11. Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

    cs.SD 2026-04 unverdicted novelty 6.0

    Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.

  12. Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use

    cs.SD 2026-04 unverdicted novelty 6.0

    Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.

  13. Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models

    cs.AI 2026-04 unverdicted novelty 6.0

    A generative reward model supplies separate semantic and turn-taking scores for spoken dialogues to enable more reliable reinforcement learning.

  14. SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

    cs.SD 2026-04 unverdicted novelty 6.0

    SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.

  15. Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs

    cs.SD 2026-04 unverdicted novelty 6.0

    NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.

  16. Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

    eess.AS 2026-04 unverdicted novelty 6.0

    A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

  17. Qwen3-Omni Technical Report

    cs.CL 2025-09 unverdicted novelty 6.0

    Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...

  18. GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

    cs.SD 2026-05 unverdicted novelty 5.0

    GaMMA unifies global and temporal music understanding in a single LMM via MoE audio encoders and progressive training, achieving new state-of-the-art accuracies on music benchmarks including 79.1% on MuchoMusic.

  19. Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

    cs.SD 2026-04 unverdicted novelty 5.0

    A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.

  20. Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

    eess.AS 2026-04 unverdicted novelty 5.0

    Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.

  21. Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

    cs.CL 2026-04 unverdicted novelty 4.0

    A cross-modal attention refinement module plus hybrid loss improves robustness of audio-text retrieval on noisy and long-form audio.

Reference graph

Works this paper leans on

208 extracted references · 208 canonical work pages · cited by 21 Pith papers · 15 internal anchors

  1. [1]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V . Chaudhary, C. Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743, 2025

  2. [2]

    YouTube-8M: A Large-Scale Video Classification Benchmark

    S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijaya- narasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016

  3. [3]

    MusicLM: Generating Music From Text

    A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023

  4. [4]

    Seed-tts: A family of high-quality versatile speech generation models,

    P. Anastassiou, J. Chen, J. Chen, Y . Chen, Z. Chen, Z. Chen, J. Cong, L. Deng, C. Ding, L. Gao, et al. Seed-tts: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430, 2024

  5. [5]

    Ardila, M

    R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4211–4215, 2020

  6. [6]

    J. Bai, H. Liu, M. Wang, D. Shi, W. Wang, M. D. Plumbley, W.-S. Gan, and J. Chen. Audioset- caps: Enriched audio captioning dataset generation using large audio language models. In Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, 2024

  7. [7]

    Barros, N

    P. Barros, N. Churamani, E. Lakomkin, H. Siqueira, A. Sutherland, and S. Wermter. The omg-emotion behavior dataset. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2018

  8. [8]

    Bertin-Mahieux, D

    T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011

  9. [9]

    Bertin-Mahieux, D

    T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. InIsmir, volume 2, page 10, 2011

  10. [10]

    R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello. Medleydb: A multitrack dataset for annotation-intensive mir research. In Ismir, volume 14, pages 155–160, 2014

  11. [11]

    Busso, M

    C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42:335–359, 2008

  12. [12]

    Cartwright, J

    M. Cartwright, J. Cramer, A. E. M. Mendez, Y . Wang, H.-H. Wu, V . Lostanlen, M. Fuentes, G. Dove, C. Mydlarz, J. Salamon, et al. Sonyc-ust-v2: An urban sound tagging dataset with spatiotemporal context. arXiv preprint arXiv:2009.05188, 2020

  13. [13]

    C. Chen, P. Peng, A. Baid, Z. Xue, W.-N. Hsu, D. Harwath, and K. Grauman. Action2sound: Ambient-aware generation of action sounds from egocentric videos. In European Conference on Computer Vision, pages 277–295. Springer, 2024

  14. [14]

    G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang, et al. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv preprint arXiv:2106.06909, 2021

  15. [15]

    H. Chen, W. Xie, A. Vedaldi, and A. Zisserman. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020. 11

  16. [16]

    S. Chen, Y . Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, and F. Wei. Beats: Audio pre-training with acoustic tokenizers, 2022

  17. [17]

    Y . Chen, F. Xue, D. Li, Q. Hu, L. Zhu, X. Li, Y . Fang, H. Tang, S. Yang, Z. Liu, et al. Longvila: Scaling long-context visual language models for long videos.arXiv preprint arXiv:2408.10188, 2024

  18. [18]

    Y . Chen, X. Yue, C. Zhang, X. Gao, R. T. Tan, and H. Li. V oicebench: Benchmarking llm-based voice assistants. arXiv preprint arXiv:2410.17196, 2024

  19. [19]

    Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Lin, C. Zhou, and J. Zhou. Qwen2-audio technical report, 2024

  20. [20]

    Y . Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models, 2023

  21. [21]

    J. S. Chung, A. Nagrani, and A. Zisserman. V oxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018

  22. [22]

    Cieri, D

    C. Cieri, D. Miller, and K. Walker. The fisher corpus: A resource for the next generations of speech-to-text. In LREC, volume 4, pages 69–71, 2004

  23. [23]

    Clifton, S

    A. Clifton, S. Reddy, Y . Yu, A. Pappu, R. Rezapour, H. Bonab, M. Eskevich, G. Jones, J. Karlgren, B. Carterette, and R. Jones. 100,000 podcasts: A spoken English document corpus. In Proceedings of the 28th International Conference on Computational Linguistics , pages 5903–5917, Barcelona, Spain (Online), Dec. 2020. International Committee on Computationa...

  24. [24]

    Daniel, M

    F. Daniel, M. Matera, V . Zaccaria, and A. Dell’Orto. Toward truly personal chatbots: on the development of custom conversational assistants. In Proceedings of the 1st international workshop on software engineering for cognitive services, pages 31–36, 2018

  25. [25]

    FMA: A Dataset For Music Analysis

    M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson. Fma: A dataset for music analysis. arXiv preprint arXiv:1612.01840, 2016

  26. [26]

    Z. Deng, Y . Ma, Y . Liu, R. Guo, G. Zhang, W. Chen, W. Huang, and E. Benetos. Musilingo: Bridging music and text with pre-trained language models for music captioning and query response. arXiv preprint arXiv:2309.08730, 2023

  27. [27]

    Deshmukh, B

    S. Deshmukh, B. Elizalde, R. Singh, and H. Wang. Pengi: An audio language model for audio tasks, 2023

  28. [28]

    Deshmukh, B

    S. Deshmukh, B. Elizalde, and H. Wang. Audio retrieval with wavtext5k and clap training. arXiv preprint arXiv:2209.14275, 2022

  29. [29]

    Deshmukh, S

    S. Deshmukh, S. Han, H. Bukhari, B. Elizalde, H. Gamper, R. Singh, and B. Raj. Audio entailment: Assessing deductive reasoning for audio understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23769–23777, 2025

  30. [30]

    S. Doh, K. Choi, J. Lee, and J. Nam. Lp-musiccaps: Llm-based pseudo music captioning. arXiv preprint arXiv:2307.16372, 2023

  31. [31]

    The Faiss library

    M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou. The faiss library. arXiv preprint arXiv:2401.08281, 2024

  32. [32]

    Drossos, S

    K. Drossos, S. Lipping, and T. Virtanen. Clotho: An audio captioning dataset. InICASSP 2020- 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 736–740. IEEE, 2020

  33. [33]

    Elizalde, S

    B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang. Clap: Learning audio concepts from natural language supervision, 2022

  34. [34]

    Engel, C

    J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan. Neural audio synthesis of musical notes with wavenet autoencoders. In International conference on machine learning, pages 1068–1077. PMLR, 2017. 12

  35. [35]

    J. Feng, Q. Sun, C. Xu, P. Zhao, Y . Yang, C. Tao, D. Zhao, and Q. Lin. Mmdialog: A large-scale multi-turn dialogue dataset towards multi-modal open-domain conversation. arXiv preprint arXiv:2211.05719, 2022

  36. [36]

    Fonseca, X

    E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra. Fsd50k: an open dataset of human- labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:829–852, 2021

  37. [37]

    Fonseca, J

    E. Fonseca, J. Pons, X. Favory, F. Font, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter, and X. Serra. Freesound datasets: A platform for the creation of open audio datasets. In ISMIR, pages 486–493, 2017

  38. [38]

    Foster, S

    P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. D. Plumbley. Chime-home: A dataset for sound source recognition in a domestic environment. In 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 1–5. IEEE, 2015

  39. [39]

    J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776–780. IEEE, 2017

  40. [40]

    Ghosh, Z

    S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim, W. Ping, R. Valle, D. Manocha, and B. Catanzaro. Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities, 2025

  41. [41]

    Ghosh, S

    S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, Sakshi, O. Nieto, R. Duraiswami, and D. Manocha. Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities, 2024

  42. [42]

    Ghosh, A

    S. Ghosh, A. Seth, S. Kumar, U. Tyagi, C. K. R. Evuru, S. Ramaneswaran, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha. Compa: Addressing the gap in compositional reasoning in audio-language models. In The Twelfth International Conference on Learning Representations

  43. [43]

    J. J. Godfrey, E. C. Holliman, and J. McDaniel. Switchboard: Telephone speech corpus for research and development. In Acoustics, speech, and signal processing, ieee international conference on, volume 1, pages 517–520. IEEE Computer Society, 1992

  44. [44]

    A. Goel, Z. Kong, R. Valle, and B. Catanzaro. Audio dialogues: Dialogues dataset for audio and music understanding. arXiv preprint arXiv:2404.07616, 2024

  45. [45]

    Y . Gong, A. H. Liu, H. Luo, L. Karlinsky, and J. Glass. Joint audio and speech understanding, 2023

  46. [46]

    Y . Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. Glass. Listen, think, and understand, 2023

  47. [47]

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  48. [48]

    Guzhov, F

    A. Guzhov, F. Raue, J. Hees, and A. Dengel. Audioclip: Extending clip to image, text and audio, 2021

  49. [49]

    Hernandez, V

    F. Hernandez, V . Nguyen, S. Ghannay, N. Tomashenko, and Y . Esteve. Ted-lium 3: Twice as much data and corpus repartition for experiments on speaker adaptation. In Speech and Computer: 20th International Conference, SPECOM 2018, Leipzig, Germany, September 18–22, 2018, Proceedings 20, pages 198–208. Springer, 2018

  50. [50]

    Hershey, D

    S. Hershey, D. P. Ellis, E. Fonseca, A. Jansen, C. Liu, R. C. Moore, and M. Plakal. The benefit of temporally-strong labels in audio event classification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 366–

  51. [51]

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022. 13

  52. [52]

    Step-audio: Unified understanding and generation in intelligent speech interaction, 2025

    A. Huang, B. Wu, B. Wang, C. Yan, C. Hu, C. Feng, F. Tian, F. Shen, J. Li, M. Chen, et al. Step-audio: Unified understanding and generation in intelligent speech interaction. arXiv preprint arXiv:2502.11946, 2025

  53. [53]

    Huang, M

    R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y . Wu, Z. Hong, J. Huang, J. Liu, Y . Ren, Z. Zhao, and S. Watanabe. Audiogpt: Understanding and generating speech, music, sound, and talking head, 2023

  54. [54]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  55. [55]

    M. M. Islam, N. Ho, X. Yang, T. Nagarajan, L. Torresani, and G. Bertasius. Video recap: Recursive captioning of hour-long videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18198–18208, 2024

  56. [56]

    James, L

    J. James, L. Tian, and C. I. Watson. An open source emotional speech corpus for human robot interaction applications. In Interspeech, pages 2768–2772, 2018

  57. [57]

    Jeong and J

    I.-Y . Jeong and J. Park. Cochlscene: Acquisition of acoustic scene data using crowdsourcing. In 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 17–21. IEEE, 2022

  58. [58]

    X. Ju, Y . Gao, Z. Zhang, Z. Yuan, X. Wang, A. Zeng, Y . Xiong, Q. Xu, and Y . Shan. Miradata: A large-scale video dataset with long durations and structured captions. Advances in Neural Information Processing Systems, 37:48955–48970, 2024

  59. [59]

    W. Kang, X. Yang, Z. Yao, F. Kuang, Y . Yang, L. Guo, L. Lin, and D. Povey. Libriheavy: A 50,000 hours asr corpus with punctuation casing and context. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 10991–10995. IEEE, 2024

  60. [60]

    C. D. Kim, B. Kim, H. Lee, and G. Kim. Audiocaps: Generating captions for audios in the wild. In NAACL-HLT, 2019

  61. [61]

    J. Kim, T. Moon, K. Lee, and J. Cho. Efficient generative modeling with residual vector quantization-based tokens. arXiv preprint arXiv:2412.10208, 2024

  62. [62]

    P. Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of machine translation summit x: papers, pages 79–86, 2005

  63. [63]

    A. S. Koepke, A.-M. Oncescu, J. F. Henriques, Z. Akata, and S. Albanie. Audio retrieval with natural language queries: A benchmark study. IEEE Transactions on Multimedia, 25:2675– 2685, 2022

  64. [64]

    Koizumi, H

    Y . Koizumi, H. Zen, S. Karita, Y . Ding, K. Yatabe, N. Morioka, M. Bacchiani, Y . Zhang, W. Han, and A. Bapna. Libritts-r: A restored multi-speaker text-to-speech corpus. INTER- SPEECH 2023, 2023

  65. [65]

    Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catanzaro. Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities, 2024

  66. [66]

    Kumar, P

    R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar. High-fidelity audio compression with improved rvqgan. In Thirty-seventh Conference on Neural Information Processing Systems

  67. [67]

    E. Law, K. West, M. Mandel, M. Bay, and J. Downie. Evaluation of algorithms using games: the case of music annotation. In Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR). Utrecht, the Netherlands, 2010

  68. [68]

    C. Lee, R. Roy, M. Xu, J. Raiman, M. Shoeybi, B. Catanzaro, and W. Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024. 14

  69. [69]

    D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022

  70. [70]

    K. Lee, D. W. Kim, J. Kim, S. Chung, and J. Cho. DiTTo-TTS: Diffusion transformers for scalable text-to-speech without domain-specific factors. In The Thirteenth International Conference on Learning Representations, 2025

  71. [71]

    K. Lee, K. Park, and D. Kim. Dailytalk: Spoken dialogue dataset for conversational text- to-speech. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

  72. [72]

    S. Leng, Y . Xing, Z. Cheng, Y . Zhou, H. Zhang, X. Li, D. Zhao, S. Lu, C. Miao, and L. Bing. The curse of multi-modalities: Evaluating hallucinations of large multimodal models across language, visual, and audio. arXiv preprint arXiv:2410.12787, 2024

  73. [73]

    G. Li, J. Liu, H. Dinkel, Y . Niu, J. Zhang, and J. Luan. Reinforcement learning outper- forms supervised fine-tuning: A case study on audio question answering. arXiv preprint arXiv:2503.11197, 2025

  74. [74]

    G. Li, Y . Wei, Y . Tian, C. Xu, J.-R. Wen, and D. Hu. Learning to answer questions in dynamic audio-visual scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19108–19118, 2022

  75. [75]

    T. Li, J. Liu, T. Zhang, Y . Fang, D. Pan, M. Wang, Z. Liang, Z. Li, M. Lin, G. Dong, et al. Baichuan-audio: A unified framework for end-to-end speech interaction. arXiv preprint arXiv:2502.17239, 2025

  76. [76]

    Lipping, P

    S. Lipping, P. Sudarsanam, K. Drossos, and T. Virtanen. Clotho-aqa: A crowdsourced dataset for audio question answering. In 2022 30th European Signal Processing Conference (EUSIPCO), pages 1140–1144. IEEE, 2022

  77. [77]

    S. Liu, H. J. Cho, M. Freedman, X. Ma, and J. May. Recap: retrieval-enhanced context-aware prefix encoder for personalized dialogue response generation.arXiv preprint arXiv:2306.07206, 2023

  78. [78]

    S. Liu, A. S. Hussain, C. Sun, and Y . Shan. Music understanding llama: Advancing text- to-music generation with question answering and captioning. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 286–

  79. [79]

    Z. Liu, H. Mao, C.-Y . Wu, C. Feichtenhofer, T. Darrell, and S. Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022

  80. [80]

    Z. Ma, Z. Chen, Y . Wang, E. S. Chng, and X. Chen. Audio-cot: Exploring chain-of-thought reasoning in large audio language model. arXiv preprint arXiv:2501.07246, 2025

Showing first 80 references.