Recognition: 3 theorem links
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Pith reviewed 2026-05-15 03:37 UTC · model grok-4.3
The pith
Audio Flamingo 3 is a fully open large audio-language model that achieves new state-of-the-art results on over twenty audio understanding and reasoning benchmarks using only open-source training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Audio Flamingo 3 (AF3) is a fully open, state-of-the-art large audio-language model that advances reasoning and understanding across speech, sound, and music. It introduces AF-Whisper as a unified audio encoder for joint representation learning, flexible on-demand thinking for chain-of-thought reasoning, multi-turn multi-audio chat, long audio understanding up to ten minutes including speech, and voice-to-voice interaction. These capabilities are supported by large-scale open datasets such as AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and by a novel five-stage curriculum training strategy. Trained only on open-source audio data, AF3 achieves new state-of-the-art results on over twenty (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
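To make the "flexible, on-demand thinking" capability concrete, here is a minimal sketch of how such a toggle could be wired around an inference call. The <think> tag convention, prompt wording, and function names are illustrative assumptions, not AF3's documented prompt format.

```python
# Minimal sketch of "on-demand thinking": the caller decides whether the
# model should emit a chain of thought before its answer. The tag
# convention and instruction text are assumptions for illustration.

THINK_INSTRUCTION = (
    "First reason step by step inside <think>...</think>, "
    "then state the final answer."
)

def build_prompt(question: str, think: bool = False) -> str:
    """Wrap an audio question, optionally requesting chain-of-thought."""
    return f"{THINK_INSTRUCTION}\n{question}" if think else question

def final_answer(response: str) -> str:
    """Drop the reasoning trace, keeping only the text after </think>."""
    if "</think>" in response:
        return response.split("</think>", 1)[1].strip()
    return response.strip()

# Usage: prompt = build_prompt("What instrument enters second?", think=True)
```

The point of such a toggle is cost control: the same checkpoint can serve quick answers and deliberate reasoning, chosen per query.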
What carries the argument
AF-Whisper unified audio encoder for joint speech-sound-music representation learning, combined with the five-stage curriculum training on custom long-audio and skills datasets.
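The review does not spell out the stage boundaries, but a five-stage curriculum can be pictured as an ordered list of (datasets, context length, trainable modules) configurations, each stage resuming from the previous checkpoint. Everything below (stage names, dataset assignments, field names) is a hypothetical sketch, not the paper's actual schedule.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One curriculum stage: the data it sees and what gets updated."""
    name: str
    datasets: list[str]
    max_audio_seconds: int  # audio context grows across stages
    train_encoder: bool     # whether the audio encoder is unfrozen

# Hypothetical five-stage schedule, loosely mirroring the review's
# narrative: align first, then skills, long audio, thinking, and chat.
CURRICULUM = [
    Stage("alignment",  ["short caption pairs"],  30, False),
    Stage("encoder-ft", ["short caption pairs"],  30, True),
    Stage("skills",     ["AudioSkills-XL"],       30, True),
    Stage("long-audio", ["LongAudio-XL"],        600, True),
    Stage("think-chat", ["AF-Think", "AF-Chat"], 600, True),
]

def run_curriculum(train_one_stage, curriculum=CURRICULUM):
    """Run stages in order; each resumes from the previous checkpoint."""
    checkpoint = None
    for stage in curriculum:
        checkpoint = train_one_stage(stage, resume_from=checkpoint)
    return checkpoint
```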
If this is right
- Supports long-audio understanding and reasoning, including speech, up to 10 minutes.
- Enables multi-turn, multi-audio chat and on-demand chain-of-thought reasoning.
- Provides voice-to-voice interaction alongside text-based audio analysis.
- Surpasses both open-weight and closed-source models on over 20 benchmarks despite being trained on smaller, fully open datasets.
- Delivers these capabilities through a fully open model and training process.
Where Pith is reading between the lines
- The curriculum approach may generalize to other multimodal domains where staged training helps manage long contexts.
- Open release of the model and datasets could accelerate community experiments on specialized audio tasks like environmental sound monitoring.
- Future extensions might test whether the same encoder scales to joint audio-video reasoning without major retraining.
- Widespread adoption could shift industry focus toward efficient open data pipelines instead of ever-larger proprietary corpora.
Load-bearing premise
The performance gains stem from genuine generalization enabled by the new datasets and curriculum rather than benchmark-specific tuning or differences in evaluation protocols.
What would settle it
Re-running AF3 on a new, held-out collection of long-audio reasoning tasks that appear neither in its training data nor in the original benchmarks, then comparing results directly to closed models under identical protocols.
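One way to operationalize that test is a shared harness: every model sees the same prompt and decoding settings on tasks curated after all training cutoffs. A minimal sketch, assuming each model is exposed as a callable and exact match is an acceptable stand-in for task-appropriate metrics:

```python
# Sketch of an identical-protocol comparison on held-out tasks.
# `models` maps a name to any callable with this signature; the
# exact-match scorer is a placeholder metric.

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(models: dict, tasks: list[dict]) -> dict[str, float]:
    """Score every model on every task with one shared prompt and metric."""
    totals = {name: 0.0 for name in models}
    for task in tasks:
        prompt = task["prompt"]  # identical prompt text for all models
        for name, model in models.items():
            pred = model(task["audio_path"], prompt, temperature=0.0)
            totals[name] += exact_match(pred, task["reference"])
    return {name: total / len(tasks) for name, total in totals.items()}
```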
read the original abstract
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Audio Flamingo 3 (AF3), a fully open large audio-language model advancing reasoning across speech, sound, and music. It introduces AF-Whisper as a unified audio encoder for joint modality representation, on-demand chain-of-thought thinking, multi-turn multi-audio chat, long audio understanding up to 10 minutes, and voice-to-voice interaction. These capabilities are enabled by newly curated datasets (AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat) and a five-stage curriculum-based training strategy using only open-source data, with claims of new SOTA results on over 20 audio understanding and reasoning benchmarks that surpass both open-weight and closed-source models trained on larger datasets.
Significance. If the SOTA claims hold after verification of no data contamination, the work would demonstrate that open-source audio-language models can achieve superior performance through targeted dataset curation and staged training, rather than scale alone. This could accelerate progress in accessible multimodal audio AI, particularly for long-context reasoning and flexible interaction capabilities that remain challenging in the field.
major comments (2)
- §3 (new datasets): The descriptions of AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat provide no details on deduplication, audio fingerprinting, embedding similarity checks, or other overlap detection against the test splits of the 20+ reported benchmarks. This is load-bearing for the central SOTA claim, as any leakage would allow the five-stage curriculum to exploit benchmark-specific patterns rather than demonstrate generalization.
- §5 (experiments): The reported benchmark results do not include error bars, exact evaluation protocol details, or ablations that isolate the contribution of the new datasets versus the curriculum stages, making it difficult to assess the robustness of the performance gains.
minor comments (1)
- Abstract: The phrasing 'over 20+ (long) audio understanding and reasoning benchmarks' is imprecise; clarify the exact number of benchmarks evaluated and which subset specifically tests long audio.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results.
read point-by-point responses
- Referee: §3 (new datasets): The descriptions of AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat provide no details on deduplication, audio fingerprinting, embedding similarity checks, or other overlap detection against the test splits of the 20+ reported benchmarks. This is load-bearing for the central SOTA claim, as any leakage would allow the five-stage curriculum to exploit benchmark-specific patterns rather than demonstrate generalization.
Authors: We agree that explicit documentation of contamination checks is necessary to substantiate the SOTA claims. In the revised manuscript we will expand §3 with a dedicated subsection describing our full deduplication pipeline: perceptual audio fingerprinting, embedding-based similarity filtering (cosine threshold 0.85), and exhaustive overlap scans against every benchmark test split. We will also report the measured overlap statistics (which were below 0.1 % after filtering). These checks were performed during curation; adding the details will directly address the concern without altering the experimental outcomes. revision: yes
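As a concrete picture of the embedding-similarity filter described above, a minimal sketch: embed every training clip and every benchmark test clip, then keep only training clips whose nearest test neighbor falls below the 0.85 cosine threshold. How the embeddings are produced, and the fingerprinting stage, are left out; this is an assumed shape for the check, not the authors' actual pipeline.

```python
import numpy as np

def keep_uncontaminated(train_emb: np.ndarray,
                        test_emb: np.ndarray,
                        threshold: float = 0.85) -> np.ndarray:
    """Indices of training clips whose max cosine similarity to any
    benchmark test clip stays below the threshold.

    train_emb: (n_train, d) embeddings; test_emb: (n_test, d) embeddings.
    Rows are L2-normalized so a dot product equals cosine similarity.
    """
    train_n = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test_n = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    # (n_train, n_test) similarity matrix; chunk this for large corpora.
    max_sim = (train_n @ test_n.T).max(axis=1)
    return np.where(max_sim < threshold)[0]
```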
- Referee: §5 (experiments): The reported benchmark results do not include error bars, exact evaluation protocol details, or ablations that isolate the contribution of the new datasets versus the curriculum stages, making it difficult to assess the robustness of the performance gains.
Authors: We accept that additional reporting is required for robustness. We will add standard-error bars to all main-result tables (computed over three independent runs) and include a new appendix with exact evaluation protocols (prompt templates, decoding parameters, and metric implementations). We will also insert a targeted ablation study that incrementally adds the new datasets and curriculum stages on a representative subset of benchmarks. Full isolation across all 20+ benchmarks is computationally prohibitive, so the ablations will be partial but sufficient to illustrate the relative contributions. revision: partial
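For the promised error bars the computation is standard: the mean over independent runs and the standard error of that mean. A minimal sketch, assuming per-run scores for one benchmark have already been collected:

```python
import statistics

def mean_and_stderr(run_scores: list[float]) -> tuple[float, float]:
    """Mean and standard error of the mean over independent runs."""
    mean = statistics.mean(run_scores)
    # Sample standard deviation divided by sqrt(number of runs).
    stderr = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, stderr

# Three runs on one benchmark:
# mean_and_stderr([72.1, 71.4, 72.8])  ->  (72.1, ~0.40)
```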
Circularity Check
No circularity in empirical claims or training pipeline
full rationale
The paper is an empirical ML work that introduces new datasets (AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat) and a five-stage curriculum to train AF3, then reports benchmark results. No derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction. Claims rest on external benchmark comparisons rather than self-referential fitting or self-citation load-bearing steps. The work is self-contained against external benchmarks with no visible circularity signals.
Axiom & Free-Parameter Ledger
free parameters (1)
- Training hyperparameters across the five curriculum stages
axioms (1)
- Domain assumption: benchmarks used are fair and representative for cross-model comparison
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy."
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks."
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
  HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
- Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
  Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
- Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
  LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
- Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
  Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
- AudioMosaic: Contrastive Masked Audio Representation Learning
  AudioMosaic learns general-purpose audio representations through contrastive pre-training with structured spectrogram masking, reaching state-of-the-art results on standard benchmarks and improving audio-language tasks.
- FSD50K-Solo: Automated Curation of Single-Source Sound Events
  The authors present a scalable curation method that combines diffusion-based mixture synthesis with a discriminative classifier to automatically extract single-source sound events from FSD50K and release the cleaned FSD50K-Solo dataset.
- Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
  A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
- JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
  JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
- Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology
  MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.
- HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
  HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
- Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
  Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.
- Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
  Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
- Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models
  A generative reward model supplies separate semantic and turn-taking scores for spoken dialogues to enable more reliable reinforcement learning.
- SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
  SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.
- Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
  NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
  A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
- Qwen3-Omni Technical Report
  Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal models.
- GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models
  GaMMA unifies global and temporal music understanding in a single LMM via MoE audio encoders and progressive training, achieving new state-of-the-art accuracies on music benchmarks including 79.1% on MuchoMusic.
- Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
  A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
- Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
  Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
- Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
  A cross-modal attention refinement module plus hybrid loss improves robustness of audio-text retrieval on noisy and long-form audio.