Recognition: 3 theorem links
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Pith reviewed 2026-05-15 03:37 UTC · model grok-4.3
The pith
Audio Flamingo 3 is a fully open large audio-language model that achieves new state-of-the-art results on over twenty audio understanding and reasoning benchmarks using only open-source training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Audio Flamingo 3 (AF3) is a fully open, state-of-the-art large audio-language model that advances reasoning and understanding across speech, sound, and music. It introduces AF-Whisper as a unified audio encoder for joint representation learning, flexible on-demand thinking for chain-of-thought reasoning, multi-turn multi-audio chat, long audio understanding up to ten minutes including speech, and voice-to-voice interaction. These capabilities are supported by large-scale open datasets such as AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and by a novel five-stage curriculum training strategy. Trained only on open-source audio data, AF3 achieves new state-of-the-art results on over twenty (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
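To make the "flexible, on-demand thinking" capability concrete, here is a minimal sketch of how such a toggle could be wired around an inference call. The <think> tag convention, prompt wording, and function names are illustrative assumptions, not AF3's documented prompt format.

```python
# Minimal sketch of "on-demand thinking": the caller decides whether the
# model should emit a chain of thought before its answer. The tag
# convention and instruction text are assumptions for illustration.

THINK_INSTRUCTION = (
    "First reason step by step inside <think>...</think>, "
    "then state the final answer."
)

def build_prompt(question: str, think: bool = False) -> str:
    """Wrap an audio question, optionally requesting chain-of-thought."""
    return f"{THINK_INSTRUCTION}\n{question}" if think else question

def final_answer(response: str) -> str:
    """Drop the reasoning trace, keeping only the text after </think>."""
    if "</think>" in response:
        return response.split("</think>", 1)[1].strip()
    return response.strip()

# Usage: prompt = build_prompt("What instrument enters second?", think=True)
```

The point of such a toggle is cost control: the same checkpoint can serve quick answers and deliberate reasoning, chosen per query.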
What carries the argument
AF-Whisper unified audio encoder for joint speech-sound-music representation learning, combined with the five-stage curriculum training on custom long-audio and skills datasets.
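The review does not spell out the stage boundaries, but a five-stage curriculum can be pictured as an ordered list of (datasets, context length, trainable modules) configurations, each stage resuming from the previous checkpoint. Everything below (stage names, dataset assignments, field names) is a hypothetical sketch, not the paper's actual schedule.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One curriculum stage: the data it sees and what gets updated."""
    name: str
    datasets: list[str]
    max_audio_seconds: int  # audio context grows across stages
    train_encoder: bool     # whether the audio encoder is unfrozen

# Hypothetical five-stage schedule, loosely mirroring the review's
# narrative: align first, then skills, long audio, thinking, and chat.
CURRICULUM = [
    Stage("alignment",  ["short caption pairs"],  30, False),
    Stage("encoder-ft", ["short caption pairs"],  30, True),
    Stage("skills",     ["AudioSkills-XL"],       30, True),
    Stage("long-audio", ["LongAudio-XL"],        600, True),
    Stage("think-chat", ["AF-Think", "AF-Chat"], 600, True),
]

def run_curriculum(train_one_stage, curriculum=CURRICULUM):
    """Run stages in order; each resumes from the previous checkpoint."""
    checkpoint = None
    for stage in curriculum:
        checkpoint = train_one_stage(stage, resume_from=checkpoint)
    return checkpoint
```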
If this is right
- Supports long-audio understanding and reasoning, including speech, up to 10 minutes.
- Enables multi-turn, multi-audio chat and on-demand chain-of-thought reasoning.
- Provides voice-to-voice interaction alongside text-based audio analysis.
- Surpasses both open-weight and closed-source models on over 20 benchmarks despite being trained on smaller, fully open datasets.
- Delivers these capabilities through a fully open model and training process.
Where Pith is reading between the lines
- The curriculum approach may generalize to other multimodal domains where staged training helps manage long contexts.
- Open release of the model and datasets could accelerate community experiments on specialized audio tasks like environmental sound monitoring.
- Future extensions might test whether the same encoder scales to joint audio-video reasoning without major retraining.
- Widespread adoption could shift industry focus toward efficient open data pipelines instead of ever-larger proprietary corpora.
Load-bearing premise
The performance gains stem from genuine generalization enabled by the new datasets and curriculum rather than benchmark-specific tuning or differences in evaluation protocols.
What would settle it
Re-running AF3 on a new, held-out collection of long-audio reasoning tasks that appear neither in its training data nor in the original benchmarks, then comparing results directly to closed models under identical protocols.
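One way to operationalize that test is a shared harness: every model sees the same prompt and decoding settings on tasks curated after all training cutoffs. A minimal sketch, assuming each model is exposed as a callable and exact match is an acceptable stand-in for task-appropriate metrics:

```python
# Sketch of an identical-protocol comparison on held-out tasks.
# `models` maps a name to any callable with this signature; the
# exact-match scorer is a placeholder metric.

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(models: dict, tasks: list[dict]) -> dict[str, float]:
    """Score every model on every task with one shared prompt and metric."""
    totals = {name: 0.0 for name in models}
    for task in tasks:
        prompt = task["prompt"]  # identical prompt text for all models
        for name, model in models.items():
            pred = model(task["audio_path"], prompt, temperature=0.0)
            totals[name] += exact_match(pred, task["reference"])
    return {name: total / len(tasks) for name, total in totals.items()}
```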
read the original abstract
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Audio Flamingo 3 (AF3), a fully open large audio-language model advancing reasoning across speech, sound, and music. It introduces AF-Whisper as a unified audio encoder for joint modality representation, on-demand chain-of-thought thinking, multi-turn multi-audio chat, long audio understanding up to 10 minutes, and voice-to-voice interaction. These capabilities are enabled by newly curated datasets (AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat) and a five-stage curriculum-based training strategy using only open-source data, with claims of new SOTA results on over 20 audio understanding and reasoning benchmarks that surpass both open-weight and closed-source models trained on larger datasets.
Significance. If the SOTA claims hold after verification of no data contamination, the work would demonstrate that open-source audio-language models can achieve superior performance through targeted dataset curation and staged training, rather than scale alone. This could accelerate progress in accessible multimodal audio AI, particularly for long-context reasoning and flexible interaction capabilities that remain challenging in the field.
major comments (2)
- §3 (new datasets): The descriptions of AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat provide no details on deduplication, audio fingerprinting, embedding similarity checks, or other overlap detection against the test splits of the 20+ reported benchmarks. This is load-bearing for the central SOTA claim, as any leakage would allow the five-stage curriculum to exploit benchmark-specific patterns rather than demonstrate generalization.
- §5 (experiments): The reported benchmark results do not include error bars, exact evaluation protocol details, or ablations that isolate the contribution of the new datasets versus the curriculum stages, making it difficult to assess the robustness of the performance gains.
minor comments (1)
- Abstract: The phrasing 'over 20+ (long) audio understanding and reasoning benchmarks' is imprecise; clarify the exact number of benchmarks evaluated and which subset specifically tests long audio.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results.
read point-by-point responses
- Referee: §3 (new datasets): The descriptions of AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat provide no details on deduplication, audio fingerprinting, embedding similarity checks, or other overlap detection against the test splits of the 20+ reported benchmarks. This is load-bearing for the central SOTA claim, as any leakage would allow the five-stage curriculum to exploit benchmark-specific patterns rather than demonstrate generalization.
Authors: We agree that explicit documentation of contamination checks is necessary to substantiate the SOTA claims. In the revised manuscript we will expand §3 with a dedicated subsection describing our full deduplication pipeline: perceptual audio fingerprinting, embedding-based similarity filtering (cosine threshold 0.85), and exhaustive overlap scans against every benchmark test split. We will also report the measured overlap statistics (which were below 0.1 % after filtering). These checks were performed during curation; adding the details will directly address the concern without altering the experimental outcomes. revision: yes
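As a concrete picture of the embedding-similarity filter described above, a minimal sketch: embed every training clip and every benchmark test clip, then keep only training clips whose nearest test neighbor falls below the 0.85 cosine threshold. How the embeddings are produced, and the fingerprinting stage, are left out; this is an assumed shape for the check, not the authors' actual pipeline.

```python
import numpy as np

def keep_uncontaminated(train_emb: np.ndarray,
                        test_emb: np.ndarray,
                        threshold: float = 0.85) -> np.ndarray:
    """Indices of training clips whose max cosine similarity to any
    benchmark test clip stays below the threshold.

    train_emb: (n_train, d) embeddings; test_emb: (n_test, d) embeddings.
    Rows are L2-normalized so a dot product equals cosine similarity.
    """
    train_n = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test_n = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    # (n_train, n_test) similarity matrix; chunk this for large corpora.
    max_sim = (train_n @ test_n.T).max(axis=1)
    return np.where(max_sim < threshold)[0]
```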
- Referee: §5 (experiments): The reported benchmark results do not include error bars, exact evaluation protocol details, or ablations that isolate the contribution of the new datasets versus the curriculum stages, making it difficult to assess the robustness of the performance gains.
Authors: We accept that additional reporting is required for robustness. We will add standard-error bars to all main-result tables (computed over three independent runs) and include a new appendix with exact evaluation protocols (prompt templates, decoding parameters, and metric implementations). We will also insert a targeted ablation study that incrementally adds the new datasets and curriculum stages on a representative subset of benchmarks. Full isolation across all 20+ benchmarks is computationally prohibitive, so the ablations will be partial but sufficient to illustrate the relative contributions. revision: partial
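For the promised error bars the computation is standard: the mean over independent runs and the standard error of that mean. A minimal sketch, assuming per-run scores for one benchmark have already been collected:

```python
import statistics

def mean_and_stderr(run_scores: list[float]) -> tuple[float, float]:
    """Mean and standard error of the mean over independent runs."""
    mean = statistics.mean(run_scores)
    # Sample standard deviation divided by sqrt(number of runs).
    stderr = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, stderr

# Three runs on one benchmark:
# mean_and_stderr([72.1, 71.4, 72.8])  ->  (72.1, ~0.40)
```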
Circularity Check
No circularity in empirical claims or training pipeline
full rationale
The paper is an empirical ML work that introduces new datasets (AudioSkills-XL, LongAudio-XL, AF-Think, AF-Chat) and a five-stage curriculum to train AF3, then reports benchmark results. No derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction. Claims rest on external benchmark comparisons rather than self-referential fitting or self-citation load-bearing steps. The work is self-contained against external benchmarks with no visible circularity signals.
Axiom & Free-Parameter Ledger
free parameters (1)
- Training hyperparameters across the five curriculum stages
axioms (1)
- Domain assumption: benchmarks used are fair and representative for cross-model comparison
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation.washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy."
- IndisputableMonolith.Foundation.DimensionForcing.dimension_forced (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks."
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
  HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-language models.
- Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
  Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
- Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
  LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
- Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
  Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
- AudioMosaic: Contrastive Masked Audio Representation Learning
  AudioMosaic learns general-purpose audio representations through contrastive pre-training with structured spectrogram masking, reaching state-of-the-art results on standard benchmarks and improving audio-language tasks.
- FSD50K-Solo: Automated Curation of Single-Source Sound Events
  The authors present a scalable curation method that combines diffusion-based mixture synthesis with a discriminative classifier to automatically extract single-source sound events from FSD50K and release the cleaned FSD50K-Solo dataset.
- Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
  A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
- JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
  JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
- Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology
  MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.
- HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models
  HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.
- Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval
  Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.
- Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
  Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
- Dual-Axis Generative Reward Model Toward Semantic and Turn-taking Robustness in Interactive Spoken Dialogue Models
  A generative reward model supplies separate semantic and turn-taking scores for spoken dialogues to enable more reliable reinforcement learning.
- SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
  SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.
- Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
  NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
  A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
- Qwen3-Omni Technical Report
  Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal models.
- GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models
  GaMMA unifies global and temporal music understanding in a single LMM via MoE audio encoders and progressive training, achieving new state-of-the-art accuracies on music benchmarks including 79.1% on MuchoMusic.
- Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
  A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
- Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
  Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
- Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss
  A cross-modal attention refinement module plus hybrid loss improves robustness of audio-text retrieval on noisy and long-form audio.