Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

arXiv: 2604.10905 · v1 · submitted 2026-04-13 · cs.SD · cs.AI · cs.CL · eess.AS

Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar, Lasha Koroshinadze, Nishit Anand, Zhifeng Kong, Siddharth Gururani, Sang-gil Lee, Jaehyeon Kim, Aya Aljafari, Chao-Han Huck Yang, Sungwon Kim, Ramani Duraiswami, Dinesh Manocha, Mohammad Shoeybi, Bryan Catanzaro, Ming-Yu Liu, Wei Ping

Pith reviewed 2026-05-10 16:19 UTC · model grok-4.3

classification: cs.SD · cs.AI · cs.CL · eess.AS

keywords: audio-language models · long-context audio · temporal chain-of-thought · speech understanding · sound and music reasoning · multimodal training

The pith

AF-Next advances open audio-language models by handling thirty-minute inputs with timestamp-grounded reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AF-Next as an updated audio-language model for speech, environmental sounds, and music. It establishes that curating over one million hours of new training data, running a curriculum across pre-training, mid-training, and post-training stages, extending context length to thirty minutes, and adding a Temporal Audio Chain-of-Thought method produces higher accuracy on understanding and reasoning tasks. A sympathetic reader would care because this points to practical ways open models can process complex, extended audio without closed-source resources. If the claim holds, it would supply accessible tools for analyzing long recordings in areas such as podcast search, music structure detection, and environmental monitoring.

Core claim

AF-Next creates a stronger base audio-language model, scales data construction beyond existing benchmarks to over one million hours, supports long and complex audio up to thirty minutes, and introduces Temporal Audio Chain-of-Thought that grounds each intermediate reasoning step to explicit timestamps in the input. Experiments across twenty benchmarks show outperformance over similarly sized open models and competitiveness with or superiority to much larger open-weight and closed models, plus strong transfer to unseen tasks.

What carries the argument

Temporal Audio Chain-of-Thought, a reasoning method that aligns each step of the model's intermediate thinking to specific timestamps within long audio inputs.
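
One way to picture timestamp-grounded reasoning is as a list of steps that each carry a time span, which can then be checked against the length of the audio. The Python sketch below is illustrative only: `GroundedStep`, `format_ts`, `ungrounded_steps`, `render`, and the `[mm:ss-mm:ss]` text format are hypothetical names and conventions, not the representation used by AF-Next, which this page does not describe.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GroundedStep:
    """One intermediate reasoning step tied to a span of the audio (hypothetical)."""
    start_s: float
    end_s: float
    thought: str


def format_ts(seconds: float) -> str:
    """Render seconds as mm:ss (assumed display format, truncating fractions)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def ungrounded_steps(steps: List[GroundedStep], audio_duration_s: float) -> List[int]:
    """Return indices of steps whose span is reversed or falls outside the audio."""
    bad = []
    for i, step in enumerate(steps):
        if not (0.0 <= step.start_s <= step.end_s <= audio_duration_s):
            bad.append(i)
    return bad


def render(steps: List[GroundedStep]) -> str:
    """Serialize steps as one '[start-end] thought' line each."""
    return "\n".join(
        f"[{format_ts(s.start_s)}-{format_ts(s.end_s)}] {s.thought}" for s in steps
    )


if __name__ == "__main__":
    steps = [
        GroundedStep(65.0, 90.0, "A second speaker joins the conversation."),
        GroundedStep(1700.0, 1900.0, "Music fades in under the speech."),
    ]
    print(render(steps))
    # For a 30-minute (1800 s) clip, the second step runs past the end.
    print(ungrounded_steps(steps, audio_duration_s=1800.0))
```

A check of this kind only confirms that a cited span exists in the input; it says nothing about whether the reasoning attached to that span is correct.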

If this is right

Performance on long-audio tasks improves without requiring larger model sizes.
Timestamp grounding increases interpretability of reasoning steps on extended inputs.
The model transfers effectively to tasks not seen during training.
Three open-sourced variants support instruction following, thinking, and captioning uses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data curation at this scale may matter more than further architectural changes for closing gaps between open and closed audio models.
Timestamp alignment methods could extend to other sequential domains such as video event detection.
Curriculum training across multiple stages may apply to other long-context multimodal tasks.

Load-bearing premise

The newly curated datasets totaling over one million hours and the curriculum training strategy produce genuine generalization to real-world long audio rather than benchmark-specific gains.

What would settle it

A large drop in accuracy on a new long-audio reasoning benchmark containing temporal dependencies absent from the training data would show that the claimed generalization does not hold.

Figures

Figures reproduced from arXiv: 2604.10905 by Arushi Goel, Aya Aljafari, Bryan Catanzaro, Chao-Han Huck Yang, Dinesh Manocha, Jaehyeon Kim, Kaousheik Jayakumar, Lasha Koroshinadze, Ming-Yu Liu, Mohammad Shoeybi, Nishit Anand, Ramani Duraiswami, Sang-gil Lee, Siddharth Gururani, Sreyan Ghosh, Sungwon Kim, Wei Ping, Zhifeng Kong.

**Figure 1.** Performance comparison of AF-Next against prior SOTA LALMs across key audio understanding and reasoning benchmarks. A key barrier is that much of open LALM development has been either closed or tightly coupled to a small set of academic benchmarks. While benchmarks are valuable, they encode biases and incomplete coverage (Kumar et al., 2025b), and audio benchmarks in particular are still emerging. As a re…

**Figure 2.** Examples of new data types introduced to scale AF-Next training. More examples are shown in Figures 12–15, and details are provided in Section 3.2.1. Feature Extraction. Given an audio input A, we first resample it to 16 kHz mono and convert the waveform into a 128-channel log mel-spectrogram using a 25 ms window and 10 ms hop size. The spectrogram is then passed through AF-Whisper to obtain hidden repres…

**Figure 3.** Training pipeline for AF-Next, curriculum learning stages, and illustration of sequence-parallel setup for long-context training. Example shown for 32 attention heads (H0–H31) and batch size 2 (seq_0–seq_1) across 2 GPUs. Before All-to-All: each GPU holds the full sequence shard with all attention heads. All-to-All (scatter heads, gather sequence): heads are distributed across GPUs while sequence chunks ar…

**Figure 4.** Prompt used for generating multi-turn chat QA pairs from long audio.

**Figure 5.** Prompt used for generating counting QA pairs from long audio.

**Figure 6.** Prompt used for generating detailed audio captions from long audio.

**Figure 7.** Prompt used for generating needle-in-the-haystack QA pairs from long audio.

**Figure 8.** Prompt used for generating subscene captioning QA pairs from long audio.

**Figure 9.** Prompt used for generating temporal understanding QA pairs from long audio.

**Figure 10.** Prompt used for generating time-grounded chain-of-thought reasoning and timestamped caption…

**Figure 11.** Prompt used for generating instruction-following QA pairs.

**Figure 12.** Example of AF-Next training data for fine-grained timestamped audio captioning. The model…

**Figure 13.** Example of AF-Next training data for multi-speaker automatic speech recognition. Training…

**Figure 14.** Examples of AF-Next multi-turn audio chat training data across diverse audio clips, spanning…

**Figure 15.** Examples of AF-Next safety fine-tuning training data. Samples include both benign queries (e.g., …
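
The feature-extraction settings quoted alongside Figure 2 (16 kHz mono, 25 ms window, 10 ms hop) fix how many spectrogram frames a clip produces, which gives a rough sense of the sequence lengths behind 30-minute inputs. The arithmetic can be sketched in Python as follows. The function name `num_frames` is hypothetical, and the sketch assumes no padding at the clip edges; the paper's actual framing convention and any downsampling inside AF-Whisper are not stated on this page.

```python
def num_frames(duration_s: float,
               sample_rate: int = 16_000,
               window_ms: float = 25.0,
               hop_ms: float = 10.0) -> int:
    """Count analysis frames for a clip, assuming no edge padding.

    With a window of W samples and a hop of H samples, a signal of N samples
    yields 1 + floor((N - W) / H) full frames (0 if the clip is shorter than
    one window).
    """
    n_samples = int(round(duration_s * sample_rate))
    window = int(round(sample_rate * window_ms / 1000))  # 400 samples at 16 kHz
    hop = int(round(sample_rate * hop_ms / 1000))        # 160 samples at 16 kHz
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // hop


if __name__ == "__main__":
    for label, seconds in [("1 second", 1.0), ("30 seconds", 30.0), ("30 minutes", 1800.0)]:
        print(f"{label}: {num_frames(seconds)} frames")
```

Under these assumptions a 30-minute clip yields close to 180,000 frames at the spectrogram level, roughly 100 frames per second of audio.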

Original abstract

We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. In addition to all data, code and methods, we open-source 3 variants of AF-Next, including AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AF-Next scales audio-language models to 30-minute inputs with a timestamp-grounded reasoning method and releases the weights plus over a million hours of new data, but the reported gains rest on unverified assumptions about dataset cleanliness.

The paper's core advance is extending the Audio Flamingo line to handle long audio clips and introducing Temporal Audio Chain-of-Thought, which ties reasoning steps to specific timestamps. The authors also describe a systematic gap analysis of the prior model, followed by curriculum training across pre-, mid-, and post-training stages on expanded datasets totaling more than one million hours. The open release of three model variants, code, and data is the most immediately useful part for anyone who wants to build on this work rather than just read about it.

Referee Report

3 major / 2 minor

Summary. The manuscript presents Audio Flamingo Next (AF-Next), an advancement in the Audio Flamingo series of audio-language models. It describes improvements including a stronger foundational model, curation of large-scale datasets exceeding 1 million hours by expanding AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, a curriculum-based training strategy with pre-training, mid-training, and post-training stages, support for long audio inputs up to 30 minutes, and the introduction of Temporal Audio Chain-of-Thought (CoT) for timestamp-grounded reasoning. The paper reports results from extensive experiments on 20 audio understanding and reasoning benchmarks, claiming that AF-Next outperforms similarly sized open models by large margins and is competitive with or surpasses larger models, while also demonstrating real-world utility and generalization. The authors commit to open-sourcing three model variants (AF-Next-Instruct, AF-Next-Think, AF-Next-Captioner), along with data, code, and methods.

Significance. If the reported performance improvements are robust and not due to data contamination, this work would make a substantial contribution to the field of audio-language modeling by providing scalable data construction methods, enhanced long-context capabilities, and a novel reasoning paradigm that improves temporal alignment and interpretability. The open release of models, data, and code is a particular strength, as it facilitates reproducibility and community-driven extensions in speech, sound, and music understanding tasks.

major comments (3)

[Experiments] The reported benchmark results lack error bars, standard deviations across multiple runs, or detailed ablation studies on the contributions of individual components such as the new datasets, curriculum stages, and Temporal Audio CoT. This makes it challenging to verify the statistical reliability of the claimed large margins over similarly sized models.
[Dataset Curation] In the section describing the curation of the new large-scale datasets (AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat expansions totaling over 1 million hours), there is no mention of decontamination protocols, overlap statistics, or filtering procedures to ensure no leakage with the 20 evaluation benchmarks. Given that the central performance claims depend on genuine generalization, this omission is load-bearing and requires explicit addressing to support the outperformance assertions.
[Temporal Audio Chain-of-Thought] The description of the Temporal Audio Chain-of-Thought paradigm lacks quantitative evaluation of its impact on interpretability and potential new failure modes on unseen long-audio tasks, as the abstract claims improved interpretability but provides no specific metrics or analysis beyond qualitative description.

minor comments (2)

[Abstract] The abstract contains a grammatical issue in the sentence: 'and sometimes surpasses, much larger open-weight and closed models' – the comma appears misplaced and should be revised for clarity (e.g., 'and sometimes surpasses much larger open-weight and closed models').
Some dataset names like AF-Think and AF-Chat are introduced without prior definition in the abstract, which could be clarified for readers unfamiliar with the series.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment point by point below, committing to revisions that strengthen the experimental reporting, data transparency, and evaluation of Temporal Audio CoT while maintaining the integrity of our claims.

Referee: [Experiments] The reported benchmark results lack error bars, standard deviations across multiple runs, or detailed ablation studies on the contributions of individual components such as the new datasets, curriculum stages, and Temporal Audio CoT. This makes it challenging to verify the statistical reliability of the claimed large margins over similarly sized models.

Authors: We acknowledge that the absence of error bars and multiple-run statistics limits the ability to assess statistical significance. The main results were obtained from single training and evaluation runs owing to the high computational cost of scaling to over 1 million hours of data and long-context training. However, we have performed additional ablation experiments isolating the effects of the expanded datasets, curriculum stages, and Temporal Audio CoT. These ablations, along with standard deviations from repeated inference runs on key benchmarks (where feasible without full retraining), will be incorporated into the revised manuscript. We will also note the single-run limitation explicitly where full multi-run statistics cannot be provided.
revision: partial
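
When only a single training run exists, resampling the evaluation items is one common way to attach an uncertainty interval to a benchmark accuracy. The Python sketch below uses a percentile bootstrap. This is a swapped-in technique for illustration, not the repeated-inference standard deviations the authors describe, and it captures only evaluation-set sampling noise, not run-to-run training variance. The function name `bootstrap_accuracy_ci` and its defaults are hypothetical.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(correct: Sequence[int],
                          n_resamples: int = 1000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) for per-item 0/1 correctness flags.

    The interval is a percentile bootstrap: resample items with replacement,
    recompute accuracy each time, and read off the alpha/2 and 1 - alpha/2
    quantiles of the resampled accuracies.
    """
    if not correct:
        raise ValueError("need at least one evaluated item")
    rng = random.Random(seed)
    n = len(correct)
    accuracy = sum(correct) / n
    resampled = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return accuracy, resampled[lo_idx], resampled[hi_idx]


if __name__ == "__main__":
    # Toy data: 70 correct answers out of 100 benchmark items.
    flags = [1] * 70 + [0] * 30
    acc, low, high = bootstrap_accuracy_ci(flags)
    print(f"accuracy={acc:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If the intervals for two models on the same benchmark overlap heavily, a claimed margin between them deserves more scrutiny than the point estimates alone suggest.
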
Referee: [Dataset Curation] In the section describing the curation of the new large-scale datasets (AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat expansions totaling over 1 million hours), there is no mention of decontamination protocols, overlap statistics, or filtering procedures to ensure no leakage with the 20 evaluation benchmarks. Given that the central performance claims depend on genuine generalization, this omission is load-bearing and requires explicit addressing to support the outperformance assertions.

Authors: We appreciate the referee's emphasis on this essential detail for validating generalization. Although the manuscript emphasized the construction methodology, decontamination was performed during curation: we applied embedding similarity thresholds and exact n-gram overlap detection to remove any samples matching the 20 evaluation benchmarks. Quantitative statistics on pre- and post-filtering overlap will be added in a dedicated subsection of the Dataset Curation section in the revision, including the fraction of data removed and the methods used to ensure no leakage.
revision: yes
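
The exact n-gram overlap check mentioned in this response can be sketched in a few lines of Python. This is a minimal illustration over text fields such as questions or captions, assuming whitespace tokenization and lowercasing; the function names, the choice of `n`, and the tokenization are hypothetical and are not taken from the paper. The embedding-similarity filter the authors also mention is not shown.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Lowercase, split on whitespace, and collect all contiguous n-grams."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_texts: Iterable[str], n: int) -> Set[NGram]:
    """Union of n-grams over every benchmark text."""
    index: Set[NGram] = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index


def decontaminate(train_texts: List[str],
                  benchmark_texts: Iterable[str],
                  n: int = 8) -> Tuple[List[str], float]:
    """Drop training samples sharing any exact n-gram with a benchmark text.

    Returns the kept samples and the fraction of samples removed.
    """
    index = build_benchmark_index(benchmark_texts, n)
    kept = [t for t in train_texts if ngrams(t, n).isdisjoint(index)]
    removed = 1.0 - len(kept) / len(train_texts) if train_texts else 0.0
    return kept, removed


if __name__ == "__main__":
    bench = ["what instrument plays at the start"]
    train = ["Q: what instrument plays here", "a dog barks twice"]
    kept, removed = decontaminate(train, bench, n=3)
    print(kept, removed)
```

Exact matching of this kind misses paraphrases and cannot detect overlap in the audio itself, which is presumably why the authors pair it with an embedding-based threshold.
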
Referee: [Temporal Audio Chain-of-Thought] The description of the Temporal Audio Chain-of-Thought paradigm lacks quantitative evaluation of its impact on interpretability and potential new failure modes on unseen long-audio tasks, as the abstract claims improved interpretability but provides no specific metrics or analysis beyond qualitative description.

Authors: We concur that quantitative metrics would provide stronger evidence for the interpretability benefits. We have run follow-up experiments measuring timestamp alignment precision, reasoning step consistency, and error rates on long-audio tasks with and without Temporal Audio CoT. We will also include an analysis of failure modes, such as over-generation of timestamps or propagation of early errors on unseen tasks. These results and a balanced discussion will be added to the revised manuscript.
revision: yes
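
The response does not define "timestamp alignment precision". One plausible reading, sketched below in Python, counts a predicted span as aligned when its temporal intersection-over-union with some reference span reaches a threshold. The metric definition, the 0.5 threshold, and the function names are assumptions made for illustration, not the authors' protocol.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_s, end_s)


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two time spans, in [0, 1]."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def alignment_precision(predicted: List[Span],
                        reference: List[Span],
                        iou_threshold: float = 0.5) -> float:
    """Fraction of predicted spans matching any reference span at the threshold."""
    if not predicted:
        return 0.0
    hits = sum(
        1 for p in predicted
        if any(temporal_iou(p, r) >= iou_threshold for r in reference)
    )
    return hits / len(predicted)


if __name__ == "__main__":
    predicted = [(0.0, 10.0), (100.0, 110.0)]
    reference = [(0.0, 8.0)]
    print(alignment_precision(predicted, reference))
```

Because this is a precision-style measure, a model that over-generates timestamps (one of the failure modes the authors name) would be penalized, while one that cites too few spans would not; a recall counterpart would be needed to catch the latter.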

Circularity Check

0 steps flagged

No circularity: empirical scaling and benchmark evaluation remain self-contained

The paper advances an audio-language model via dataset curation (>1M hours), curriculum training stages, and evaluation on 20 external benchmarks. No mathematical derivations, equations, or first-principles predictions are present that could reduce to fitted parameters or self-defined quantities by construction. Claims of outperformance rest on reported benchmark scores rather than internal fits or self-citation chains that bear the central result. While data overlap with evaluation sets is a potential validity concern, it does not create circularity in the reported derivation or argument chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Relies on standard transformer training assumptions and new empirical data curation; no new physical entities or ungrounded mathematical axioms beyond typical LLM scaling.

free parameters (1)

Curriculum training stages and data mixture ratios
Pre-training, mid-training, and post-training stages with specific data scaling choices not detailed in abstract but required for the claimed performance.

axioms (1)

[standard math] Standard transformer-based audio-language architecture supports long-context processing up to 30 minutes
Invoked implicitly as the foundation for extending prior Audio Flamingo models.

invented entities (1)

Temporal Audio Chain-of-Thought (no independent evidence)
purpose: Grounds intermediate reasoning steps to timestamps in long audio for fine-grained alignment
New paradigm introduced to improve interpretability; no independent evidence provided beyond the model's own outputs.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
eess.AS · 2026-05 · unverdicted · novelty 2.0

A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.
