Towards Explainable Music Emotion Recognition: The Route via Mid-level Features
Pith reviewed 2026-05-25 00:53 UTC · model grok-4.3
The pith
A neural network predicts music emotions from mid-level perceptual features with only a small average loss in accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a VGG-style deep neural network that learns to predict emotional characteristics of a musical piece together with (and based on) human-interpretable, mid-level perceptual features. We compare this to predicting emotion directly with an identical network that does not take into account the mid-level features and observe that the loss in predictive performance of going through the mid-level features is surprisingly low, on average. The design of our network allows us to visualize the effects of perceptual features on individual emotion predictions, and we argue that the small loss in performance in going through the mid-level features is justified by the gain in explainability of the predictions.
What carries the argument
VGG-style deep neural network that jointly predicts mid-level perceptual features and emotion labels from audio, enabling feature-effect visualizations.
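To make the joint-prediction idea concrete, here is a minimal sketch of how a combined training objective could look. It assumes a linear map from mid-level features to emotion ratings (the paper's excerpt below describes that mapping as linear) and a simple sum of two mean-squared-error terms. The function names, the `emo_weight` balancing factor, and all numbers are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch only: a linear "emotion head" on top of mid-level
# predictions, plus a joint loss. All names and values are hypothetical.

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def emotion_from_mid(mid, weights, bias):
    """Linear map from mid-level feature values to emotion ratings."""
    return [sum(w * m for w, m in zip(row, mid)) + b
            for row, b in zip(weights, bias)]

def joint_loss(mid_pred, mid_true, emo_true, weights, bias, emo_weight=1.0):
    """Mid-level loss plus a weighted emotion loss (assumed form)."""
    emo_pred = emotion_from_mid(mid_pred, weights, bias)
    return mse(mid_pred, mid_true) + emo_weight * mse(emo_pred, emo_true)

# Toy values: three mid-level features, two emotion outputs.
mid_pred = [0.6, 0.2, 0.8]    # stand-in for the CNN backbone's output
mid_true = [0.5, 0.2, 0.9]    # stand-in for human mid-level annotations
weights = [[1.0, 0.0, -0.5],  # row for emotion output 1
           [0.0, 2.0, 0.5]]   # row for emotion output 2
bias = [0.1, 0.0]
emo_true = [0.3, 0.8]         # stand-in for human emotion ratings

loss = joint_loss(mid_pred, mid_true, emo_true, weights, bias)
```

Because the emotion output depends on the audio only through the mid-level values, any error in the mid-level predictions propagates to the emotion predictions, which is the source of the accuracy cost the paper measures.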
If this is right
- Visualizations become available that show the contribution of each mid-level feature to individual emotion predictions.
- The small average performance loss makes the explainable route competitive with direct prediction for many uses.
- Predictions gain musically meaningful and intuitive explanations rather than remaining abstract.
- The same joint-prediction architecture can be applied to other subjective tasks in music information retrieval.
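The visualizations in the first point rest on the linearity of the final layer: each feature's effect on one prediction is its weight times its value. The following is a rough sketch of that calculation, assuming a single emotion output. The feature names, weights, and values are illustrative placeholders and are not taken from the paper.

```python
# Illustrative sketch: per-feature "effects" in a linear emotion model.
# effect_i = weight_i * feature_value_i; the prediction is bias + sum(effects).

def feature_effects(weights, features):
    """Return each feature's additive contribution to one prediction."""
    return {name: weights[name] * value for name, value in features.items()}

# Hypothetical weights of the linear layer for one emotion output.
weights = {"melodiousness": 0.5, "dissonance": -0.75, "rhythm_stability": 0.25}
# Hypothetical mid-level feature values predicted for one audio clip.
features = {"melodiousness": 0.8, "dissonance": 0.4, "rhythm_stability": 0.2}
bias = 0.1

effects = feature_effects(weights, features)
prediction = bias + sum(effects.values())

# Sort by absolute effect so the most influential features come first,
# which is the order a bar-chart explanation would typically use.
ranked = sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

Since the effects sum (with the bias) to the prediction exactly, the explanation is faithful to the linear layer by construction; whether it is meaningful to a listener still depends on the mid-level features themselves.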
Where Pith is reading between the lines
- Music recommendation systems could incorporate these visualizations to give users transparent reasons for suggested tracks.
- Testing the same mid-level route on larger and culturally diverse datasets would check whether the low performance loss generalizes.
- The approach could be extended to allow users to interactively adjust mid-level features and observe resulting emotion shifts.
- Similar bridging via interpretable intermediate features might address explainability in other audio classification problems.
Load-bearing premise
The selected mid-level perceptual features are both sufficiently predictive of emotion and genuinely understandable to humans so that the visualizations become meaningful.
What would settle it
A replication on new music corpora that shows either a large accuracy drop when routing through mid-level features or user studies where the visualizations fail to improve understanding of predictions.
Original abstract
Emotional aspects play an important part in our interaction with music. However, modelling these aspects in MIR systems has been notoriously challenging since emotion is an inherently abstract and subjective experience, thus making it difficult to quantify or predict in the first place, and to make sense of the predictions in the next. In an attempt to create a model that can give a musically meaningful and intuitive explanation for its predictions, we propose a VGG-style deep neural network that learns to predict emotional characteristics of a musical piece together with (and based on) human-interpretable, mid-level perceptual features. We compare this to predicting emotion directly with an identical network that does not take into account the mid-level features and observe that the loss in predictive performance of going through the mid-level features is surprisingly low, on average. The design of our network allows us to visualize the effects of perceptual features on individual emotion predictions, and we argue that the small loss in performance in going through the mid-level features is justified by the gain in explainability of the predictions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a VGG-style deep neural network for music emotion recognition that jointly learns mid-level perceptual features and emotional characteristics in a multi-task setup. It compares this to an otherwise identical network that predicts emotion directly and reports that the average performance drop incurred by routing through the mid-level features is surprisingly low; the architecture further enables visualizations that attribute emotion predictions to the perceptual features, which the authors argue justifies the modest accuracy cost via improved explainability.
Significance. If the reported metrics hold, the work is significant because it supplies a concrete, falsifiable empirical demonstration that explainability via human-interpretable mid-level features can be obtained in music emotion modeling at low predictive cost. The multi-task design and visualization approach directly address a recognized limitation of black-box deep models in MIR; the fact that the comparison is performed with matched base architectures makes the central trade-off claim testable from the numbers alone.
major comments (2)
- §4 (results): the claim that the performance loss is 'surprisingly low on average' requires the actual per-metric deltas, standard deviations across folds or runs, and a statistical test (paired t-test or equivalent) to establish that the difference is distinguishable from training noise; without these the central empirical observation cannot be evaluated.
- §3.1 (mid-level features): the assertion that the chosen perceptual features are 'human-interpretable' and therefore confer explainability is load-bearing for the justification of the approach, yet the manuscript supplies no external validation (listener study, correlation with established perceptual scales, or citation to perceptual literature) that these features are meaningful to users beyond the authors' selection.
minor comments (3)
- Abstract: quantitative performance numbers (e.g., mean R² or accuracy drop and the direct vs. routed values) should be stated so readers can assess the 'surprisingly low' claim without reading the full results section.
- Figure 2 or 3 (visualizations): the attribution maps would be clearer if accompanied by a quantitative measure (e.g., feature-emotion correlation or ablation delta) rather than relying solely on qualitative inspection.
- §2 (related work): a brief comparison table of prior emotion-prediction accuracies on the same datasets would help situate the absolute performance levels achieved here.
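The feature-emotion correlation suggested in the second minor comment could be computed with Pearson's correlation coefficient, the same metric the paper's experiment excerpt names for evaluating emotion predictions. Here is a small self-contained sketch; the function name is our own and the inputs in the usage note are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered cross-products and squares; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

For example, `pearson_r(feature_values, emotion_ratings)` over a set of clips would give one number per feature-emotion pair, which could be reported next to each attribution plot.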
Simulated Author's Rebuttal
We thank the referee for the constructive review and positive assessment of the work's significance. We address each major comment below and will revise the manuscript to incorporate the requested details and clarifications.
Point-by-point responses
-
Referee: §4 (results): the claim that the performance loss is 'surprisingly low on average' requires the actual per-metric deltas, standard deviations across folds or runs, and a statistical test (paired t-test or equivalent) to establish that the difference is distinguishable from training noise; without these the central empirical observation cannot be evaluated.
Authors: We agree that the claim would be strengthened by explicit per-metric deltas, standard deviations, and a statistical test. In the revised version we will add a table showing the exact performance difference for each metric (arousal, valence, and the four quadrant categories), report standard deviations across the 10 cross-validation folds for both models, and include the results of a paired t-test on the per-fold differences to confirm that the observed gap is distinguishable from training variability.
Revision: yes
-
Referee: §3.1 (mid-level features): the assertion that the chosen perceptual features are 'human-interpretable' and therefore confer explainability is load-bearing for the justification of the approach, yet the manuscript supplies no external validation (listener study, correlation with established perceptual scales, or citation to perceptual literature) that these features are meaningful to users beyond the authors' selection.
Authors: The mid-level features are taken from prior MIR literature that has already performed listener validation and acoustic correlation studies. We will revise §3.1 to include explicit citations to those perceptual validation studies (e.g., the original papers introducing the feature set and subsequent listener experiments), thereby grounding the claim of human interpretability in existing evidence rather than author selection alone.
Revision: yes
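The paired t-test on per-fold differences mentioned in the first response can be sketched as follows. This is only an illustration of the statistic, assuming one score per cross-validation fold for each model, with the folds aligned between the two models. The function name is hypothetical, and no numbers from the paper are used.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-fold scores.

    scores_a[i] and scores_b[i] must come from the same fold i.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two aligned score lists of length >= 2")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

Turning `t` into a p-value needs the CDF of Student's t distribution with `n - 1` degrees of freedom; in practice a statistics library routine such as `scipy.stats.ttest_rel` does both steps in one call.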
Circularity Check
No significant circularity; empirical comparison only
Full rationale
The paper presents an empirical study comparing two VGG-style networks for music emotion recognition: one that routes predictions through jointly learned mid-level perceptual features and one that predicts emotion directly. The central claim (small average performance loss when using mid-level features) is an observation from trained model metrics on held-out data, not a mathematical derivation or prediction derived from fitted parameters. No equations, ansatzes, uniqueness theorems, or self-citation chains are invoked to force results by construction. The multi-task architecture is internally consistent for measuring the described trade-off, and the claim remains directly falsifiable from the reported performance numbers without reducing to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- network weights and biases
axioms (1)
- domain assumption: Mid-level perceptual features can be reliably annotated by humans and are predictive of emotional response
Reference graph
Works this paper leans on
-
[1]
Towards Explainable Music Emotion Recognition: The Route via Mid-level Features
INTRODUCTION Emotions – portrayed, perceived, or induced – are an important aspect of music. MIR systems can benefit from leveraging this aspect because of its direct impact on human perception of music, but doing so has been challenging due to the inherently abstract and subjective quality of this feature. Moreover, it is difficult to interpret emo- ...
arXiv 2019
-
[2]
RELATED WORK In the MIR field, audio-based music emotion recognition (MER) has traditionally been done by extracting selected features from the audio and predicting emotion based on subsequent processing of these features [7]. Methods such as linear regression, regression trees, support vector regression, and variants have been used for prediction as me...
-
[3]
MID-LEVEL PERCEPTUAL FEATURES The notion of (‘mid-level’) perceptual features for characterizing music recordings has been put forward by several authors, as an alternative to purely sound-based or statistical low-level features (e.g., MFCCs, ZCR, spectral centroid) or more abstract music-theoretic concepts (e.g., meter, harmony). The idea is tha...
-
[4]
DATASETS For our experiments, we need music recordings annotated both with mid-level perceptual features, and with human ratings along some well-defined emotion categories. Our starting point is Aljanaki & Soleymani’s Mid-level Perceptual Features dataset [1], which provides mid-level feature annotations. For the actual emotion prediction experiments...
-
[5]
The annotators were required to have some musical education and were selected based on passing a musical test. The ratings range from 1 to 10 and were scaled by a factor of 0.1 before being used for our experiments. 4.2 Emotion Ratings: The Soundtracks Dataset The Soundtracks (Stimulus Set 1) dataset, published by Eerola and Vuoskoski [4], consists of 3...
-
[6]
AUDIO-TO-EMOTION MODELS In the following, we describe three different approaches to modeling emotion from audio, all based on VGG-style convolutional neural networks (CNNs). The architectures are summarized in Figure 2. For all models, we use an Adam optimizer with a learning rate of 0.0005 and a batch size of 8, and employ early stopping with a patience ...
-
[7]
EXPERIMENTS The audio clips are preprocessed as described in Section 5 to obtain the input spectrograms. During training, one random 10-second snippet from each spectrogram is taken as input. We optimize the mean squared error, and use Pearson’s correlation coefficient as the evaluation metric for emotion rating prediction. Each of the paths (A2E, Valenc...
-
[8]
OBTAINING EXPLANATIONS Since the mapping between mid-level features and emotions is linear in both proposed schemes (A2Mid2E, A2Mid2E-Joint), it is now straightforward to create human-understandable explanations. Linear models can be interpreted by analyzing their weights: increasing a numerical feature by one unit changes the prediction by its weigh...
-
[9]
DISCUSSION AND CONCLUSION Model interpretability and the possibility to obtain explanations for a given prediction are not ends in themselves. There are many scenarios where one may need to understand why a piece of music was recommended or placed 1 https://www.jyu.fi/hytk/fi/laitokset/mutku/en/research/projects2/past- projects/coe/materials/emotion/sou...
-
[10]
ACKNOWLEDGMENTS This research has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 670035 (project “Con Espressione”)
-
[11]
A data-driven approach to mid-level perceptual musical feature modeling
Anna Aljanaki and Mohammad Soleymani. A data-driven approach to mid-level perceptual musical feature modeling. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, pages 615–621, 2018
-
[12]
Developing a benchmark for emotional analysis of music
Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. Developing a benchmark for emotional analysis of music. PLOS ONE, 12(3):1–22, 03 2017
-
[13]
Eduardo Coutinho, George Trigeorgis, Stefanos Zafeiriou, and Björn W Schuller. Automatically estimating emotion in music with deep long-short term memory recurrent neural networks. In MediaEval, 2015
- [14]
-
[15]
Anders Friberg, Erwin Schoonderwaldt, Anton Hedblad, Marco Fabiani, and Anders Elowsson. Using listener-based perceptual features as intermediate representations in music information retrieval. The Journal of the Acoustical Society of America, 136(4):1951–1963, 2014
-
[16]
Automated music emotion recognition: A systematic evaluation
Arefin Huq, Juan Pablo Bello, and Robert Rowe. Automated music emotion recognition: A systematic evaluation. Journal of New Music Research, 39(3):227–244, 2010
-
[17]
Music emotion recognition: A state of the art review
Youngmoo E Kim, Erik M Schmidt, Raymond Migneco, Brandon G Morton, Patrick Richardson, Jeffrey Scott, Jacquelin A Speck, and Douglas Turnbull. Music emotion recognition: A state of the art review. In Proc. ISMIR, pages 255–266. Citeseer, 2010
-
[18]
Neuron activation profiles for interpreting convolutional speech recognition models
Andreas Krug, René Knaebel, and Sebastian Stober. Neuron activation profiles for interpreting convolutional speech recognition models. In NIPS 2018 Interpretability and Robustness for Audio, Speech and Language Workshop (IRASL’18), 2018. to appear
-
[19]
Music emotion recognition based on two-level support vector classification
Chingshun Lin, Mingyu Liu, Weiwei Hsiung, and Jhih-siang Jhang. Music emotion recognition based on two-level support vector classification. In 2016 International Conference on Machine Learning and Cybernetics (ICMLC), volume 1, pages 375–389. IEEE, 2016
-
[20]
Interpretable Machine Learning
Christoph Molnar. Interpretable Machine Learning. 2019. https://christophm.github.io/interpretable-ml-book/
-
[22]
Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis
Renato Panda, Ricardo Malheiro, Bruno Rocha, António Oliveira, and Rui Pedro Paiva. Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis. In International Symposium on Computer Music Multidisciplinary Research, 2013
-
[23]
Speaker recognition from raw waveform with sincnet
Mirco Ravanelli and Yoshua Bengio. Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop, SLT 2018, Athens, Greece, December 18-21, 2018, pages 1021–1028, 2018
-
[24]
Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice
Cynthia Rudin and Berk Ustun. Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice. Interfaces, 48(5):449–466, 2018
-
[25]
James A Russell. A circumplex model of affect. Journal of personality and social psychology, 39(6):1161, 1980
-
[26]
Very deep convolutional networks for large-scale image recognition
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015
-
[27]
Getting closer to the essence of music: The Con Espressione Manifesto
Gerhard Widmer. Getting closer to the essence of music: The Con Espressione Manifesto. ACM Transactions on Intelligent Systems and Technology (TIST), 8(2):19, 2017
-
[28]
Mingxing Xu, Xinxing Li, Haishu Xianyu, Jiashen Tian, Fanhang Meng, and Wenxiao Chen. Multi-scale approaches to the MediaEval 2015 "Emotion in Music" task. In MediaEval, 2015
-
[29]
JiangLong Zhang, XiangLin Huang, Lifang Yang, and Liqiang Nie. Bridge the semantic gap between pop music acoustic feature and emotion: Build an interpretable model. Neurocomputing, 208:333–341, 2016. SI: BridgingSemantic