Jamendo-MT-QA: A Benchmark for Multi-Track Comparative Music Question Answering
Pith reviewed 2026-05-10 17:13 UTC · model grok-4.3
The pith
Jamendo-MT-QA supplies a benchmark dataset of 36,519 comparative questions drawn from 12,173 pairs of music tracks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper constructs Jamendo-MT-QA by selecting pairs of tracks and applying an LLM-assisted pipeline to produce 36,519 comparative QA items across yes/no, short-answer, and sentence-level questions, thereby creating the first systematic resource for testing models on multi-track music reasoning.
What carries the argument
The LLM-assisted pipeline that generates and filters comparative questions from track pairs to produce three distinct question types requiring cross-track audio reasoning.
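To make this mechanism concrete, here is a minimal sketch of a pair-wise generation step, assuming each track record carries a text caption and that a generator LLM call returns JSON. The function names, prompt wording, and output schema are illustrative placeholders, not the authors' actual pipeline.

```python
import json
import random
from itertools import combinations

QUESTION_TYPES = ["yes_no", "short_answer", "sentence"]

def sample_track_pairs(tracks, n_pairs, seed=0):
    """Sample distinct track pairs from a list of track records."""
    rng = random.Random(seed)
    all_pairs = list(combinations(range(len(tracks)), 2))
    return [(tracks[i], tracks[j]) for i, j in rng.sample(all_pairs, n_pairs)]

def build_prompt(track_a, track_b, qtype):
    """Ask the generator LLM for one comparative QA item of the given type."""
    return (
        f"Track A: {track_a['caption']}\n"
        f"Track B: {track_b['caption']}\n"
        f"Write one {qtype} question comparing an attribute of the two tracks "
        "that can only be answered by listening to both, plus its answer. "
        "Return JSON with keys 'question' and 'answer'."
    )

def generate_pair_items(track_a, track_b, call_generator_llm):
    """Produce the three QA items (yes/no, short-answer, sentence) for one pair."""
    items = []
    for qtype in QUESTION_TYPES:
        raw = call_generator_llm(build_prompt(track_a, track_b, qtype))
        qa = json.loads(raw)  # expected: {"question": ..., "answer": ...}
        qa["type"] = qtype
        items.append(qa)
    return items
```

A filtering pass (rule-based checks plus an LLM filter, as the paper describes) would then discard items that are not genuinely comparative.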
If this is right
- Audio-language models can be measured specifically on comparative music understanding instead of single-track description.
- The three question formats allow separate assessment of binary judgment, concise recall, and descriptive comparison skills.
- The dataset supplies training material for improving models that must handle listener-style comparisons.
- Combined automatic metrics and LLM-as-a-judge evaluation offer a scalable protocol for future multi-track benchmarks (a minimal sketch of such a protocol follows this list).
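The sketch below illustrates the last point under stated assumptions: exact match for the closed-form question types, and an LLM judge returning an integer score for sentence-level answers. The 0-5 scale mirrors the rubric quoted in the paper's appendix, but `call_judge_llm` and the prompt text are placeholders, not the paper's exact judge.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation for exact-match comparison."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def exact_match(prediction, reference):
    """Automatic metric for yes/no and short-answer questions."""
    return float(normalize(prediction) == normalize(reference))

JUDGE_PROMPT = (
    "You are grading a comparative answer about Track A and Track B.\n"
    "Question: {question}\nReference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Return only an integer score from 0 (wrong) to 5 (fully correct)."
)

def judge_sentence_answer(question, reference, prediction, call_judge_llm):
    """LLM-as-a-judge score in [0, 5] for sentence-level questions."""
    reply = call_judge_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction))
    match = re.search(r"\d+", reply)
    return min(5, int(match.group())) if match else 0
```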
Where Pith is reading between the lines
- The same pair-wise construction method could be applied to other audio domains such as speech or environmental sound comparisons.
- Models trained on this data may improve performance in music recommendation systems that rely on relative rather than absolute descriptions.
- The benchmark highlights the need for audio encoders that maintain distinct representations of multiple simultaneous tracks.
Load-bearing premise
The questions produced by the LLM pipeline genuinely require listening to and comparing both tracks rather than being answerable from text metadata or a single track alone.
What would settle it
Human reviewers finding either that a large fraction of the questions can be answered correctly without hearing both tracks, or that the generation process introduced systematic biases into the items.
Original abstract
Recent work on music question answering (Music-QA) has primarily focused on single-track understanding, where models answer questions about an individual audio clip using its tags, captions, or metadata. However, listeners often describe music in comparative terms, and existing benchmarks do not systematically evaluate reasoning across multiple tracks. Building on the Jamendo-QA dataset, we introduce Jamendo-MT-QA, a dataset and benchmark for multi-track comparative question answering. From Creative Commons-licensed tracks on Jamendo, we construct 36,519 comparative QA items over 12,173 track pairs, with each pair yielding three question types: yes/no, short-answer, and sentence-level questions. We describe an LLM-assisted pipeline for generating and filtering comparative questions, and benchmark representative audio-language models using both automatic metrics and LLM-as-a-Judge evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Jamendo-MT-QA, a benchmark dataset for multi-track comparative music question answering built on the existing Jamendo-QA collection. From Creative Commons tracks on Jamendo it derives 36,519 QA items across 12,173 track pairs, with each pair producing three question types (yes/no, short-answer, sentence-level) via an LLM-assisted generation and filtering pipeline; the authors then benchmark representative audio-language models using both automatic metrics and LLM-as-a-Judge evaluation.
Significance. If the generated questions demonstrably require cross-track audio reasoning rather than metadata or single-track cues, the dataset would address a clear gap in music QA by moving beyond single-track understanding to comparative reasoning that aligns with how listeners often describe music. The scale (over 36k items) and the three question formats would make it a useful resource for evaluating audio-language models on multi-track tasks.
Major comments (2)
- [§3] §3 (Dataset Construction) and the pipeline description: the LLM-assisted generation and filtering procedure is presented without any human validation step, inter-annotator agreement statistics, or adversarial tests (e.g., metadata-only or single-track baselines) that would confirm items cannot be solved without listening to both tracks. This directly affects the central claim that the benchmark evaluates multi-track comparative audio reasoning.
- [Evaluation] Evaluation section (and Table 1 or equivalent results table): reported model scores are given without ablations that isolate the contribution of audio features versus textual metadata or captions from Jamendo; without such controls it is impossible to verify that performance differences reflect comparative audio understanding.
Minor comments (2)
- [Abstract] The abstract states the dataset scale but does not preview any quantitative validation metrics; adding a brief sentence on human review or agreement would improve clarity.
- [§3] Notation for the three question types is introduced without an explicit example table; a small illustrative table showing one pair and its three generated questions would aid readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on Jamendo-MT-QA. The comments highlight important aspects of validation and evaluation that we address below. We have revised the manuscript accordingly to strengthen the evidence for multi-track comparative reasoning.
Point-by-point responses
Referee: [§3] §3 (Dataset Construction) and the pipeline description: the LLM-assisted generation and filtering procedure is presented without any human validation step, inter-annotator agreement statistics, or adversarial tests (e.g., metadata-only or single-track baselines) that would confirm items cannot be solved without listening to both tracks. This directly affects the central claim that the benchmark evaluates multi-track comparative audio reasoning.
Authors: We acknowledge that the submitted manuscript did not report human validation, inter-annotator agreement, or explicit adversarial baselines. The LLM-assisted pipeline includes rule-based and LLM filtering steps to remove inconsistent or non-comparative items, but this does not fully substitute for human checks. In the revised version, we add a human validation study on a 500-item subset where two annotators independently judge whether each question requires audio from both tracks and cannot be answered from metadata alone; we report agreement statistics (Cohen's kappa). We also include new adversarial experiments evaluating models on metadata-only and single-track inputs, which show substantially lower performance and support the need for cross-track audio reasoning. revision: yes
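As a concrete illustration of the proposed agreement statistic (not the authors' code), here is a minimal sketch of Cohen's kappa over two annotators' judgments, assuming each judgment is encoded as 1 when a question requires hearing both tracks and 0 otherwise; the toy labels are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

# Toy example (1 = question requires hearing both tracks):
annotator_1 = [1, 1, 0, 1, 0, 1, 1, 1]
annotator_2 = [1, 1, 0, 1, 1, 1, 0, 1]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.3f}")
```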
Referee: [Evaluation] Evaluation section (and Table 1 or equivalent results table): reported model scores are given without ablations that isolate the contribution of audio features versus textual metadata or captions from Jamendo; without such controls it is impossible to verify that performance differences reflect comparative audio understanding.
Authors: We agree that isolating the contribution of audio is essential. The original evaluation focused on full audio-language model inputs but lacked explicit controls for metadata and captions. The revised manuscript adds ablation results to the evaluation section and Table 1: models are re-evaluated using only Jamendo metadata/tags, only captions, and single-track audio. These controls demonstrate that performance on the comparative QA tasks drops markedly without both tracks' audio, confirming that the benchmark measures multi-track audio reasoning rather than textual shortcuts. revision: yes
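A hedged sketch of what such input-condition controls could look like follows. The condition names mirror the rebuttal, while `model.answer`, the item fields, and the scoring callback are hypothetical interfaces rather than a released evaluation harness.

```python
def build_inputs(item, condition):
    """Select which parts of a QA item the model is allowed to see."""
    if condition == "full":            # both tracks' audio
        return {"audio": [item["audio_a"], item["audio_b"]]}
    if condition == "single_track":    # only Track A's audio
        return {"audio": [item["audio_a"]]}
    if condition == "metadata_only":   # Jamendo tags/metadata, no audio
        return {"text": item["metadata_a"] + " | " + item["metadata_b"]}
    if condition == "captions_only":   # track captions, no audio
        return {"text": item["caption_a"] + " | " + item["caption_b"]}
    raise ValueError(condition)

def run_ablation(model, items, score_fn):
    """Average score per input condition; a large drop relative to 'full'
    supports the claim that items need both tracks' audio."""
    conditions = ["full", "single_track", "metadata_only", "captions_only"]
    results = {}
    for cond in conditions:
        scores = [
            score_fn(model.answer(item["question"], **build_inputs(item, cond)),
                     item["answer"])
            for item in items
        ]
        results[cond] = sum(scores) / len(scores)
    return results
```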
Circularity Check
No circularity: dataset construction and empirical benchmarking with no derivations or self-referential predictions
Full rationale
The paper describes construction of the Jamendo-MT-QA dataset from existing Jamendo tracks via an LLM-assisted generation and filtering pipeline, followed by benchmarking of audio-language models. No equations, fitted parameters, predictions, or first-principles derivations are present. The central claims rest on the empirical items produced and on the evaluation results, not on any reduction of outputs to inputs by construction. Self-citations, if any, are not load-bearing. This is a standard dataset/benchmark paper whose value is independently falsifiable via human validation or metadata-only ablations.