{"paper":{"title":"Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Audio Flamingo 3 is a fully open large audio-language model that sets new state-of-the-art results on over twenty audio understanding and reasoning benchmarks using only open-source data.","cross_cats":["cs.AI","cs.CL","eess.AS"],"primary_cat":"cs.SD","authors_text":"Arushi Goel, Bryan Catanzaro, Chao-Han Huck Yang, Dinesh Manocha, Jaehyeon Kim, Rafael Valle, Ramani Duraiswami, Sang-gil Lee, Sonal Kumar, Sreyan Ghosh, Zhifeng Kong","submitted_at":"2025-07-10T19:40:21Z","abstract_excerpt":"We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice in"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the newly introduced datasets and five-stage curriculum produce genuine generalization rather than benchmark-specific gains, and that all comparisons use identical evaluation protocols without undisclosed advantages.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Audio Flamingo 3 is a fully open large audio-language model that sets new state-of-the-art results on over twenty audio understanding and reasoning benchmarks using only open-source data.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"61739c1547a299dafec5f569bdc620cb33833c7a962f0e396ae7603607c47617"},"source":{"id":"2507.08128","kind":"arxiv","version":2},"verdict":{"id":"697c620a-a579-4359-8139-070b17ff1d58","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T03:37:56.252379Z","strongest_claim":"AF3 achieves new SOTA results on over 20+ (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.","one_line_summary":"Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the newly introduced datasets and five-stage curriculum produce genuine generalization rather than benchmark-specific gains, and that all comparisons use identical evaluation protocols without undisclosed advantages.","pith_extraction_headline":"Audio Flamingo 3 is a fully open large audio-language model that sets new state-of-the-art results on over twenty audio understanding and reasoning benchmarks using only open-source data."},"references":{"count":208,"sample":[{"doi":"","year":2025,"title":"Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs","work_id":"83956045-536a-41ff-af02-b80e2a614eab","ref_index":1,"cited_arxiv_id":"2503.01743","is_internal_anchor":true},{"doi":"","year":2016,"title":"YouTube-8M: A Large-Scale Video Classification Benchmark","work_id":"6b543bd8-75e8-4c53-9718-b4545e4bc424","ref_index":2,"cited_arxiv_id":"1609.08675","is_internal_anchor":true},{"doi":"","year":2023,"title":"MusicLM: Generating Music From Text","work_id":"15e6566e-1c36-468f-966e-823248cbf87f","ref_index":3,"cited_arxiv_id":"2301.11325","is_internal_anchor":true},{"doi":"","year":2024,"title":"Seed-TTS: A Family of High-Quality Versatile Speech Generation Models","work_id":"6e88ee95-1133-4302-a142-cdf8f9456a8d","ref_index":4,"cited_arxiv_id":"2406.02430","is_internal_anchor":true},{"doi":"","year":2020,"title":"R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber. Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th C","work_id":"c0ea9007-1463-4192-bef0-5bcd366eaa01","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":208,"snapshot_sha256":"859abea44efcf4fc6cc8c0c9aa68713d9203c13f325b5c5ee3ec29c643cccefd","internal_anchors":21},"formal_canon":{"evidence_count":3,"snapshot_sha256":"c802c9fa3d64325deb6b1497a258471f351502a086ee9eecbcfd3e496bff1d3b"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}