ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
Pith reviewed 2026-05-08 18:24 UTC · model grok-4.3
The pith
Existing text-audio retrieval models fail on reasoning tasks like negation and duration, as shown by the ReasonAudio benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReasonAudio shows that all tested models struggle with reasoning-intensive audio retrieval, performing particularly poorly on Negation and Duration while showing relatively better results on Overlap and Order. Multimodal large language model-based embedding models fail to inherit the reasoning capabilities of their backbones through contrastive fine-tuning, suggesting that current training paradigms are insufficient to preserve reasoning capacity in retrieval settings.
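For readers unfamiliar with the training step at issue: contrastive fine-tuning pulls paired text and audio embeddings together while pushing unpaired ones apart. Below is a minimal sketch of the standard symmetric InfoNCE objective used in CLIP-style retrieval training; the function name and temperature value are illustrative, not taken from the paper.

```python
import numpy as np

def info_nce_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/audio embeddings.

    text_emb, audio_emb: (batch, dim) arrays of L2-normalized vectors.
    Matching pairs share a row index; all other rows act as in-batch
    negatives. This objective rewards semantic matching, which is the
    suspected reason reasoning ability is lost during fine-tuning.
    """
    logits = text_emb @ audio_emb.T / temperature  # (batch, batch) similarities
    labels = np.arange(len(logits))                # positives sit on the diagonal

    def row_xent(l):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()   # cross-entropy to the diagonal

    # Average the text->audio and audio->text directions.
    return 0.5 * (row_xent(logits) + row_xent(logits.T))
```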
What carries the argument
The ReasonAudio benchmark, built from composite audio clips across five reasoning tasks that go beyond semantic matching.
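The abstract does not describe the construction pipeline, but composite clips of this kind can be assembled from single-event waveforms. A minimal sketch, under assumed conventions (16 kHz mono float32 audio, a chosen gap length, peak normalization), of how Order, Overlap, and Duration items could be composed:

```python
import numpy as np

SR = 16_000  # assumed sample rate (Hz)

def compose_order(event_a, event_b, gap_s=0.5):
    """'Order' composite: event_a strictly precedes event_b, with silence between."""
    gap = np.zeros(int(gap_s * SR), dtype=np.float32)
    return np.concatenate([event_a, gap, event_b])

def compose_overlap(event_a, event_b):
    """'Overlap' composite: the two events play concurrently (summed)."""
    n = max(len(event_a), len(event_b))
    mix = np.zeros(n, dtype=np.float32)
    mix[:len(event_a)] += event_a
    mix[:len(event_b)] += event_b
    return mix / max(1.0, np.abs(mix).max())  # peak-normalize to avoid clipping

def compose_duration(event, target_s):
    """'Duration' composite: loop and trim an event to a target length."""
    target = int(target_s * SR)
    reps = int(np.ceil(target / len(event)))
    return np.tile(event, reps)[:target]
```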
Load-bearing premise
The five tasks isolate genuine reasoning requirements that cannot be solved by semantic matching, pattern recognition, or unintended cues in the audio clips.
What would settle it
A model achieving high accuracy on Negation and Duration by relying only on keyword presence or simple pattern matching without logical or temporal understanding would show the tasks do not require the intended reasoning.
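One concrete form such a falsifying probe could take is a bag-of-keywords scorer that ranks clips purely by event-tag overlap, deliberately discarding negation words and temporal connectives. A sketch, with a hypothetical tag interface and stopword list:

```python
def keyword_baseline_score(query: str, clip_event_tags: set[str]) -> int:
    """Score a clip by keyword overlap alone, ignoring negation ("no",
    "without") and temporal connectives ("then", "while", "before").
    High Negation or Duration accuracy from this scorer would mean the
    tasks leak through surface matching rather than requiring reasoning."""
    ignore = {"a", "an", "the", "of", "and", "is", "no", "not",
              "without", "then", "while", "before", "after"}
    keywords = {w for w in query.lower().split() if w not in ignore}
    return len(keywords & clip_event_tags)
```

Ranking the 10,000 clips per query by this score and comparing against the reported model results would quantify how much of each task is solvable without the intended reasoning.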
Original abstract
As multimodal content continues to expand at a rapid pace, audio retrieval has emerged as a key enabling technology for media search, content organization, and intelligent assistants. However, most existing benchmarks concentrate on semantic matching and fail to capture the fact that real-world queries often demand advanced reasoning abilities, including negation understanding, temporal ordering, concurrent event recognition, and duration discrimination. To address this gap, we introduce ReasonAudio, the first reasoning-intensive benchmark for Text-Audio Retrieval, comprising 1,000 queries and 10,000 composite audio clips across five fundamental reasoning tasks: Negation, Order, Overlap, Duration, and Mix. Despite their intuitive nature for humans and straightforward construction, these tasks pose significant challenges to current models. Our evaluation of ten state-of-the-art models reveals the following findings: All models struggle with reasoning-intensive audio retrieval, performing particularly poorly on Negation and Duration while showing relatively better results on Overlap and Order. Moreover, Multimodal Large Language Model-based embedding models fail to inherit the reasoning capabilities of their backbones through contrastive fine-tuning, suggesting that current training paradigms are insufficient to preserve reasoning capacity in retrieval settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ReasonAudio, the first reasoning-intensive benchmark for text-audio retrieval, with 1,000 queries and 10,000 composite audio clips spanning five tasks (Negation, Order, Overlap, Duration, Mix). Evaluation of ten state-of-the-art models shows uniformly poor performance on reasoning demands, with particularly low results on Negation and Duration and relatively better results on Overlap and Order; additionally, MLLM-based embedding models do not retain the reasoning abilities of their backbones after contrastive fine-tuning, indicating limitations of current training paradigms for retrieval.
Significance. If the benchmark tasks are shown to isolate genuine reasoning without acoustic or pattern-based shortcuts, the work would provide a valuable new evaluation resource and empirical evidence of a key limitation in multimodal retrieval models. The focus on negation, temporal, and compositional reasoning addresses a clear gap in existing semantic-matching benchmarks and could guide development of more capable audio-text systems.
major comments (3)
- [Task construction / benchmark design] Task construction (implied in abstract and methods): The paper asserts that the five tasks require reasoning beyond matching, yet provides no ablations, shortcut analyses, human baselines, or controls for low-level cues (e.g., energy/spectral differences in Negation or Duration clips). This directly undermines the central claim that observed poor performance reflects reasoning deficits rather than exploitable patterns.
- [Experiments / results] Evaluation and results: No statistical significance tests, confidence intervals, or error analysis are reported for the performance gaps (e.g., Negation vs. Overlap), making it impossible to assess whether differences are reliable or driven by specific failure modes.
- [Model evaluation / discussion] MLLM embedding claim: The assertion that contrastive fine-tuning fails to preserve backbone reasoning lacks direct comparison of the MLLM backbones on equivalent reasoning probes (in text or other modalities) or probing experiments on the embeddings themselves.
minor comments (2)
- [Abstract / Introduction] Abstract and introduction would benefit from explicit citation of prior audio retrieval benchmarks to better position the novelty.
- [Benchmark description] Notation for the five tasks and composite clip generation should be standardized with a clear table or diagram for reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each of the major comments below and describe the changes we will implement in the revised manuscript.
Point-by-point responses
- Referee: [Task construction / benchmark design] Task construction (implied in abstract and methods): The paper asserts that the five tasks require reasoning beyond matching, yet provides no ablations, shortcut analyses, human baselines, or controls for low-level cues (e.g., energy/spectral differences in Negation or Duration clips). This directly undermines the central claim that observed poor performance reflects reasoning deficits rather than exploitable patterns.
Authors: We appreciate this observation, as validating the reasoning nature of the tasks is crucial. Although the tasks were constructed to necessitate understanding of logical operations and temporal relations (e.g., negation requires recognizing the absence of an event, and duration involves comparing lengths), we acknowledge the absence of explicit controls. In the revised manuscript, we will incorporate human baseline results on all tasks to confirm that they are intuitive for humans. We will also conduct shortcut analyses by generating variant audio clips with normalized acoustic features (such as equalizing energy and spectral profiles for Negation and Duration) and report model performance on these controls. A new subsection will discuss potential low-level cues and argue that the multi-event composition makes simple pattern matching insufficient. revision: yes
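A minimal sketch of the loudness control the rebuttal describes, assuming mono float32 waveforms (the target RMS level is an arbitrary choice):

```python
import numpy as np

def normalize_rms(waveform: np.ndarray, target_rms: float = 0.05) -> np.ndarray:
    """Rescale a mono waveform to a fixed RMS energy so that overall
    loudness differences (e.g., a quieter 'absent event' in Negation
    clips, or longer clips carrying more energy in Duration clips)
    cannot serve as a low-level shortcut cue."""
    rms = np.sqrt(np.mean(waveform.astype(np.float64) ** 2))
    return (waveform * (target_rms / max(rms, 1e-8))).astype(np.float32)
```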
- Referee: [Experiments / results] Evaluation and results: No statistical significance tests, confidence intervals, or error analysis are reported for the performance gaps (e.g., Negation vs. Overlap), making it impossible to assess whether differences are reliable or driven by specific failure modes.
Authors: We agree that this would enhance the interpretability of our results. We will update the evaluation section to include bootstrap-derived 95% confidence intervals for all reported metrics. Statistical significance tests, such as Wilcoxon signed-rank tests for comparing model performances across tasks, will be added to assess the reliability of observed gaps. Additionally, we will provide an error analysis that examines common failure cases, including qualitative examples of model errors on Negation and Duration tasks. revision: yes
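A sketch of both proposed additions, assuming per-query metric vectors (e.g., Recall@10 over a task's 1,000 queries) are available; function and variable names are illustrative:

```python
import numpy as np
from scipy.stats import wilcoxon

def bootstrap_mean_ci(per_query_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of a
    per-query retrieval metric. Resamples queries with replacement."""
    scores = np.asarray(per_query_scores, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)

# Paired significance test: each of the ten models yields one score per
# task, so per-model scores on two tasks form matched pairs, e.g.:
# stat, p = wilcoxon(model_scores_negation, model_scores_overlap)
```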
- Referee: [Model evaluation / discussion] MLLM embedding claim: The assertion that contrastive fine-tuning fails to preserve backbone reasoning lacks direct comparison of the MLLM backbones on equivalent reasoning probes (in text or other modalities) or probing experiments on the embeddings themselves.
Authors: This point is well-taken. Our discussion relies on the general knowledge that MLLM backbones possess strong reasoning abilities in their native settings, contrasted with the poor performance of their fine-tuned retrieval versions on ReasonAudio. To address the lack of direct comparison, we will add experiments evaluating the unfine-tuned MLLM backbones on text-based versions of the reasoning tasks (e.g., text queries involving negation and duration). We will also include a brief analysis of the embeddings, such as measuring similarity structures for reasoning-related attributes. These additions will more rigorously support our conclusion about the limitations of contrastive fine-tuning paradigms. revision: yes
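A minimal sketch of such an embedding probe, where `embed` stands in for an assumed text-encoder callable and the "no" prefix is a deliberately simplified negation template:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def negation_collapse(embed, caption: str) -> float:
    """Probe whether an embedding space distinguishes a caption from
    its negation. `embed` is an assumed callable returning a 1-D vector.
    A score near 1.0 means the space treats 'a dog barking' and
    'no dog barking' as near-duplicates, i.e., negation has collapsed
    to bag-of-words similarity."""
    return cosine(embed(caption), embed("no " + caption))
```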
Circularity Check
No circularity: empirical benchmark with direct model evaluation
Full rationale
The paper constructs a benchmark dataset of 1,000 queries and 10,000 composite clips across five tasks and reports empirical performance of ten models on them. No equations, derivations, fitted parameters, or predictions appear in the abstract or described content. Claims about model struggles on Negation/Duration and failure of contrastive fine-tuning to transfer reasoning are direct observations from testing, not reductions to self-defined inputs or self-citations. The work is self-contained as dataset creation plus external baseline evaluation with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The tasks of negation understanding, temporal ordering, concurrent event recognition, and duration discrimination in audio require advanced reasoning that cannot be reduced to semantic matching.