Towards Explainable Music Emotion Recognition: The Route via Mid-level Features

Andreu Vall; Gerhard Widmer; Shreyan Chowdhury; Verena Haunschmid

arXiv: 1907.03572 · v1 · pith:U3PEXLSUnew · submitted 2019-07-08 · 💻 cs.SD · cs.LG · stat.ML

Shreyan Chowdhury, Andreu Vall, Verena Haunschmid, Gerhard Widmer

Pith reviewed 2026-05-25 00:53 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · stat.ML

keywords music emotion recognition · mid-level features · explainable AI · deep neural networks · music information retrieval · perceptual features · VGG network

The pith

A neural network predicts music emotions from mid-level perceptual features with only a small average loss in accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a VGG-style deep neural network that predicts both human-interpretable mid-level perceptual features and the emotional qualities of music pieces. It compares this route to an identical network that predicts emotions directly from audio, finding the performance difference surprisingly small on average. This setup enables visualizations showing how individual perceptual features influence specific emotion predictions, trading a minor accuracy cost for greater explainability in modeling subjective musical experience.

Core claim

We propose a VGG-style deep neural network that learns to predict emotional characteristics of a musical piece together with (and based on) human-interpretable, mid-level perceptual features. We compare this to predicting emotion directly with an identical network that does not take into account the mid-level features and observe that the loss in predictive performance of going through the mid-level features is surprisingly low, on average. The design of our network allows us to visualize the effects of perceptual features on individual emotion predictions, and we argue that the small loss in performance in going through the mid-level features is justified by the gain in explainability of the predictions.

What carries the argument

VGG-style deep neural network that jointly predicts mid-level perceptual features and emotion labels from audio, enabling feature-effect visualizations.
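
As a rough illustration of the joint setup described above, the following Python sketch routes an emotion prediction through a mid-level bottleneck and sums two mean-squared-error terms. The names `mse`, `linear`, and `joint_loss`, the `mid_weight` parameter, the equal default weighting, and all numbers are assumptions made for illustration; the actual loss weighting and layer sizes are not stated in this summary.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def linear(weights, bias, x):
    """Apply a linear layer: one output per row of `weights`."""
    return [b + sum(w * xi for w, xi in zip(row, x))
            for row, b in zip(weights, bias)]


def joint_loss(mid_pred, mid_true, emo_pred, emo_true, mid_weight=1.0):
    """Emotion loss plus a weighted mid-level loss (the weighting is an assumption)."""
    return mse(emo_pred, emo_true) + mid_weight * mse(mid_pred, mid_true)


# Hypothetical values: two mid-level features feeding one emotion output.
mid_pred = [0.5, 0.5]        # stand-in for the CNN's mid-level outputs
emo_weights = [[1.0, -1.0]]  # linear map from mid-level features to emotion
emo_bias = [0.1]
emo_pred = linear(emo_weights, emo_bias, mid_pred)
loss = joint_loss(mid_pred, [0.5, 0.5], emo_pred, [0.3])
```

Because the emotion output is computed only from the mid-level predictions, every emotion prediction can be traced back to those interpretable values; the direct baseline would skip the `linear` step over mid-level features entirely.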

If this is right

Visualizations become available that show the contribution of each mid-level feature to individual emotion predictions.
The small average performance loss makes the explainable route competitive with direct prediction for many uses.
Predictions gain musically meaningful and intuitive explanations rather than remaining abstract.
The same joint-prediction architecture can be applied to other subjective tasks in music information retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Music recommendation systems could incorporate these visualizations to give users transparent reasons for suggested tracks.
Testing the same mid-level route on larger and culturally diverse datasets would check whether the low performance loss generalizes.
The approach could be extended to allow users to interactively adjust mid-level features and observe resulting emotion shifts.
Similar bridging via interpretable intermediate features might address explainability in other audio classification problems.

Load-bearing premise

The selected mid-level perceptual features are both sufficiently predictive of emotion and genuinely understandable to humans so that the visualizations become meaningful.

What would settle it

A replication on new music corpora that shows either a large accuracy drop when routing through mid-level features or user studies where the visualizations fail to improve understanding of predictions.

Figures

Figures reproduced from arXiv: 1907.03572 by Andreu Vall, Gerhard Widmer, Shreyan Chowdhury, Verena Haunschmid.

Original abstract

Emotional aspects play an important part in our interaction with music. However, modelling these aspects in MIR systems has been notoriously challenging since emotion is an inherently abstract and subjective experience, thus making it difficult to quantify or predict in the first place, and to make sense of the predictions in the next. In an attempt to create a model that can give a musically meaningful and intuitive explanation for its predictions, we propose a VGG-style deep neural network that learns to predict emotional characteristics of a musical piece together with (and based on) human-interpretable, mid-level perceptual features. We compare this to predicting emotion directly with an identical network that does not take into account the mid-level features and observe that the loss in predictive performance of going through the mid-level features is surprisingly low, on average. The design of our network allows us to visualize the effects of perceptual features on individual emotion predictions, and we argue that the small loss in performance in going through the mid-level features is justified by the gain in explainability of the predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper shows routing emotion prediction through mid-level features costs only a small average performance drop while enabling visualizations.

The main takeaway is that forcing the emotion predictions to go through a mid-level feature layer costs only a small average drop in performance while making the model more explainable. The paper builds a VGG-style network that learns mid-level perceptual features jointly with the emotion task. It then compares this to a direct prediction network with the same base layers. The design lets the authors visualize how changes in the perceptual features affect the emotion outputs. This is a direct extension of prior work on mid-level features in MIR, but the explicit routing for explainability is the concrete addition.

The setup holds up well. The multi-task training avoids obvious mismatches in capacity or training procedure, so the performance comparison is meaningful. The claim about the low loss is something you can check against the reported metrics.

The soft spots are limited. One is that the mid-level features are treated as human-interpretable without much validation that users actually find the visualizations helpful. That assumption might need user studies to confirm. Another is that the abstract leaves out the specific numbers and statistical details, so the 'surprisingly low' part depends on seeing the full results. Nothing load-bearing seems broken, though.

This kind of work is aimed at MIR people who care about making models for creative or subjective tasks more transparent. A reader interested in practical explainability techniques for audio will find a usable example here. It is worth sending to a serious referee because the experiment is straightforward and the results can be reproduced or challenged on their own terms. I would recommend putting it through peer review.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes a VGG-style deep neural network for music emotion recognition that jointly learns mid-level perceptual features and emotional characteristics in a multi-task setup. It compares this to an otherwise identical network that predicts emotion directly and reports that the average performance drop incurred by routing through the mid-level features is surprisingly low; the architecture further enables visualizations that attribute emotion predictions to the perceptual features, which the authors argue justifies the modest accuracy cost via improved explainability.

Significance. If the reported metrics hold, the work is significant because it supplies a concrete, falsifiable empirical demonstration that explainability via human-interpretable mid-level features can be obtained in music emotion modeling at low predictive cost. The multi-task design and visualization approach directly address a recognized limitation of black-box deep models in MIR; the fact that the comparison is performed with matched base architectures makes the central trade-off claim testable from the numbers alone.

major comments (2)

[§4] §4 (results): the claim that the performance loss is 'surprisingly low on average' requires the actual per-metric deltas, standard deviations across folds or runs, and a statistical test (paired t-test or equivalent) to establish that the difference is distinguishable from training noise; without these the central empirical observation cannot be evaluated.
[§3.1] §3.1 (mid-level features): the assertion that the chosen perceptual features are 'human-interpretable' and therefore confer explainability is load-bearing for the justification of the approach, yet the manuscript supplies no external validation (listener study, correlation with established perceptual scales, or citation to perceptual literature) that these features are meaningful to users beyond the authors' selection.
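
The statistical check requested in the first major comment could look roughly like the sketch below: a paired t-statistic computed on per-fold score differences between the direct and the routed model. The function name `paired_t_statistic` and the fold scores are hypothetical placeholders, not results from the paper. Turning the statistic into a p-value needs the t-distribution, for example via `scipy.stats.ttest_rel`, which is not used here so that the sketch stays within the standard library.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples, e.g. per-fold scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need one score per fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical per-fold correlations for the direct and the mid-level route.
direct_scores = [0.71, 0.69, 0.74, 0.70, 0.72]
routed_scores = [0.70, 0.66, 0.73, 0.69, 0.70]
t_value, dof = paired_t_statistic(direct_scores, routed_scores)
```

A paired test fits this comparison because both models are evaluated on the same folds, so fold-to-fold difficulty cancels out in the differences.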

minor comments (3)

[Abstract] Abstract: quantitative performance numbers (e.g., mean R² or accuracy drop and the direct vs. routed values) should be stated so readers can assess the 'surprisingly low' claim without reading the full results section.
[Figure 2] Figure 2 or 3 (visualizations): the attribution maps would be clearer if accompanied by a quantitative measure (e.g., feature-emotion correlation or ablation delta) rather than relying solely on qualitative inspection.
[§2] §2 (related work): a brief comparison table of prior emotion-prediction accuracies on the same datasets would help situate the absolute performance levels achieved here.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the work's significance. We address each major comment below and will revise the manuscript to incorporate the requested details and clarifications.

Referee: [§4] §4 (results): the claim that the performance loss is 'surprisingly low on average' requires the actual per-metric deltas, standard deviations across folds or runs, and a statistical test (paired t-test or equivalent) to establish that the difference is distinguishable from training noise; without these the central empirical observation cannot be evaluated.

Authors: We agree that the claim would be strengthened by explicit per-metric deltas, standard deviations, and a statistical test. In the revised version we will add a table showing the exact performance difference for each metric (arousal, valence, and the four quadrant categories), report standard deviations across the 10 cross-validation folds for both models, and include the results of a paired t-test on the per-fold differences to confirm that the observed gap is distinguishable from training variability. revision: yes

Referee: [§3.1] §3.1 (mid-level features): the assertion that the chosen perceptual features are 'human-interpretable' and therefore confer explainability is load-bearing for the justification of the approach, yet the manuscript supplies no external validation (listener study, correlation with established perceptual scales, or citation to perceptual literature) that these features are meaningful to users beyond the authors' selection.

Authors: The mid-level features are taken from prior MIR literature that has already performed listener validation and acoustic correlation studies. We will revise §3.1 to include explicit citations to those perceptual validation studies (e.g., the original papers introducing the feature set and subsequent listener experiments), thereby grounding the claim of human interpretability in existing evidence rather than author selection alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical comparison only

The paper presents an empirical study comparing two VGG-style networks for music emotion recognition: one that routes predictions through jointly learned mid-level perceptual features and one that predicts emotion directly. The central claim (small average performance loss when using mid-level features) is an observation from trained model metrics on held-out data, not a mathematical derivation or prediction derived from fitted parameters. No equations, ansatzes, uniqueness theorems, or self-citation chains are invoked to force results by construction. The multi-task architecture is internally consistent for measuring the described trade-off, and the claim remains directly falsifiable from the reported performance numbers without reducing to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical performance of two supervised DNNs trained on music audio; the mid-level features are treated as given human-interpretable quantities whose predictive power is measured rather than derived.

free parameters (1)

network weights and biases
All parameters of the VGG-style CNNs are fitted to the training data for both the direct and mid-level routes.

axioms (1)

domain assumption: Mid-level perceptual features can be reliably annotated by humans and are predictive of emotional response
Invoked when the authors state that routing through these features yields interpretable explanations (abstract).

pith-pipeline@v0.9.0 · 5722 in / 1246 out tokens · 19569 ms · 2026-05-25T00:53:38.425510+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

[1]

Towards Explainable Music Emotion Recognition: The Route via Mid-level Features

INTRODUCTION Emotions – portrayed, perceived, or induced – are an important aspect of music. MIR systems can benefit from leveraging this aspect because of its direct impact on human perception of music, but doing so has been challenging due to the inherently abstract and subjective quality of this feature. Moreover, it is difficult to interpret emo- ...

[2]

RELATED WORK In the MIR field, audio-based music emotion recognition (MER) has traditionally been done by extracting selected features from the audio and predicting emotion based on subsequent processing of these features [7]. Methods such as linear regression, regression trees, support vector regression, and variants have been used for prediction as me...

[3]

The idea is that they should represent musical characteristics that are easily perceived and recognized by most listeners, without any music-theoretical training

MID-LEVEL PERCEPTUAL FEATURES The notion of (‘mid-level’) perceptual features for characterizing music recordings has been put forward by several authors, as an alternative to purely sound-based or statistical low-level features (e.g., MFCCs, ZCR, spectral centroid) or more abstract music-theoretic concepts (e.g., meter, harmony). The idea is tha...

[4]

Our starting point is Aljanaki & Soleymani’s Mid-level Perceptual Features dataset [1], which provides mid-level feature annotations

DATASETS For our experiments, we need music recordings annotated both with mid-level perceptual features, and with human ratings along some well-defined emotion categories. Our starting point is Aljanaki & Soleymani’s Mid-level Perceptual Features dataset [1], which provides mid-level feature annotations. For the actual emotion prediction experiments...

[5]

The ratings range from 1 to 10 and were scaled by a factor of 0.1 before being used for our experiments

The annotators were required to have some musical education and were selected based on passing a musical test. The ratings range from 1 to 10 and were scaled by a factor of 0.1 before being used for our experiments. 4.2 Emotion Ratings: The Soundtracks Dataset The Soundtracks (Stimulus Set 1) dataset, published by Eerola and Vuoskoski [4], consists of 3...

[6]

A2Mid2E-Joint

AUDIO-TO-EMOTION MODELS In the following, we describe three different approaches to modeling emotion from audio, all based on VGG-style convolutional neural networks (CNNs). The architectures are summarized in Figure 2. For all models, we use an Adam optimizer with a learning rate of 0.0005 and a batch size of 8, and employ early stopping with a patience ...

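
The training setup quoted in this excerpt (Adam, learning rate 0.0005, batch size 8, early stopping) can be illustrated with a small early-stopping helper. This is a minimal sketch: the class name `EarlyStopping`, the choice of validation loss as the monitored quantity, and the example loss values are assumptions, and the patience value is left as a parameter because the excerpt is cut off before stating it.

```python
# Values quoted in the excerpt; the dictionary itself is only an illustration.
TRAIN_CONFIG = {"optimizer": "adam", "learning_rate": 0.0005, "batch_size": 8}


class EarlyStopping:
    """Stop training once the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        if patience < 1:
            raise ValueError("patience must be at least 1")
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses; patience=2 is an arbitrary example value.
stopper = EarlyStopping(patience=2)
decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.97]]
```

In a real training loop the check would run once per epoch after validation, and the weights from the best epoch would typically be restored afterwards.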
[7]

cost of explainability

EXPERIMENTS The audio clips are preprocessed as described in Section 5 to obtain the input spectrograms. During training, one random 10-second snippet from each spectrogram is taken as input. We optimize the mean squared error, and use Pearson’s correlation coefficient as the evaluation metric for emotion rating prediction. Each of the paths (A2E, Valenc...

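
For readers who want to reproduce the evaluation metric named in this excerpt, Pearson's correlation coefficient between predicted and annotated ratings can be computed as in the sketch below. The function name `pearson_r` and the sample ratings are illustrative assumptions, not values from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical annotated and predicted ratings for one emotion dimension.
annotated = [0.2, 0.5, 0.7, 0.9]
predicted = [0.25, 0.45, 0.75, 0.85]
score = pearson_r(annotated, predicted)
```

Note that the correlation is insensitive to a constant offset or scale in the predictions, which is one reason it is reported alongside, rather than instead of, the mean squared error used for training.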
[8]

minorness

OBTAINING EXPLANATIONS Since the mapping between mid-level features and emotions is linear in both proposed schemes (A2Mid2E, A2Mid2E-Joint), it is now straightforward to create human-understandable explanations. Linear models can be interpreted by analyzing their weights: increasing a numerical feature by one unit changes the prediction by its weigh...

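
The linear interpretation described in this excerpt can be made concrete with a short sketch. It uses the common convention for linear models that a feature's effect on one prediction is its weight multiplied by its value; whether the paper computes effects in exactly this way is not visible in the truncated excerpt. The function names, the feature names, and all numbers are placeholders chosen for illustration.

```python
def feature_effects(weights, features):
    """Effect of each feature on a single prediction: weight times feature value."""
    return {name: weights[name] * value for name, value in features.items()}


def predict(weights, bias, features):
    """Linear prediction: bias plus the sum of all feature effects."""
    return bias + sum(feature_effects(weights, features).values())


# Hypothetical linear weights for one emotion and mid-level values for one clip.
weights = {"minorness": -0.5, "melodiousness": 0.25}
features = {"minorness": 0.8, "melodiousness": 0.4}

effects = feature_effects(weights, features)
# Rank features by the magnitude of their contribution to this prediction.
ranked = sorted(effects.items(), key=lambda item: abs(item[1]), reverse=True)
prediction = predict(weights, bias=0.5, features=features)
```

Since the effects sum (together with the bias) to the prediction itself, a bar chart of `effects` gives a complete per-clip explanation, which is what a linear final layer makes possible.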
[9]

DISCUSSION AND CONCLUSION Model interpretability and the possibility to obtain explanations for a given prediction are not ends in themselves. There are many scenarios where one may need to understand why a piece of music was recommended or placed 1 https://www.jyu.fi/hytk/fi/laitokset/mutku/en/research/projects2/past-projects/coe/materials/emotion/sou...

[10]

Con Espressione

ACKNOWLEDGMENTS This research has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 670035 (project “Con Espressione”)

[11]

A data-driven approach to mid-level perceptual musical feature modeling

Anna Aljanaki and Mohammad Soleymani. A data-driven approach to mid-level perceptual musical feature modeling. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, pages 615–621, 2018

[12]

Developing a benchmark for emotional analysis of music

Anna Aljanaki, Yi-Hsuan Yang, and Mohammad Soleymani. Developing a benchmark for emotional analysis of music. PLOS ONE, 12(3):1–22, 03 2017

[13]

Automatically estimating emotion in music with deep long-short term memory recurrent neural networks

Eduardo Coutinho, George Trigeorgis, Stefanos Zafeiriou, and Björn W Schuller. Automatically estimating emotion in music with deep long-short term memory recurrent neural networks. In MediaEval, 2015

[14]

A comparison of the discrete and dimensional models of emotion in music

Tuomas Eerola and Jonna K. Vuoskoski. A comparison of the discrete and dimensional models of emotion in music. Psychology of Music, 39(1):18–49, 2011

[15]

Using listener-based perceptual features as intermediate representations in music information retrieval

Anders Friberg, Erwin Schoonderwaldt, Anton Hedblad, Marco Fabiani, and Anders Elowsson. Using listener-based perceptual features as intermediate representations in music information retrieval. The Journal of the Acoustical Society of America, 136(4):1951–1963, 2014

[16]

Automated music emotion recognition: A systematic evaluation

Arefin Huq, Juan Pablo Bello, and Robert Rowe. Automated music emotion recognition: A systematic evaluation. Journal of New Music Research, 39(3):227–244, 2010

[17]

Music emotion recognition: A state of the art review

Youngmoo E Kim, Erik M Schmidt, Raymond Migneco, Brandon G Morton, Patrick Richardson, Jeffrey Scott, Jacquelin A Speck, and Douglas Turnbull. Music emotion recognition: A state of the art review. In Proc. ISMIR, pages 255–266. Citeseer, 2010

[18]

Neuron activation profiles for interpreting convolutional speech recognition models

Andreas Krug, René Knaebel, and Sebastian Stober. Neuron activation profiles for interpreting convolutional speech recognition models. In NIPS 2018 Interpretability and Robustness for Audio, Speech and Language Workshop (IRASL’18), 2018. to appear

[19]

Music emotion recognition based on two-level support vector classification

Chingshun Lin, Mingyu Liu, Weiwei Hsiung, and Jhih-siang Jhang. Music emotion recognition based on two-level support vector classification. In 2016 International Conference on Machine Learning and Cybernetics (ICMLC), volume 1, pages 375–389. IEEE, 2016

[20]

Interpretable Machine Learning

Christoph Molnar. Interpretable Machine Learning. 2019. https://christophm.github.io/interpretable-ml-book/

[21]

Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis

Renato Panda, Ricardo Malheiro, Bruno Rocha, António Oliveira, and Rui Pedro Paiva. Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis

[22]

Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis

Renato Panda, Ricardo Malheiro, Bruno Rocha, António Oliveira, and Rui Pedro Paiva. Multi-modal music emotion recognition: A new dataset, methodology and comparative analysis. In International Symposium on Computer Music Multidisciplinary Research, 2013

[23]

Speaker recognition from raw waveform with sincnet

Mirco Ravanelli and Yoshua Bengio. Speaker recognition from raw waveform with sincnet. In 2018 IEEE Spoken Language Technology Workshop, SLT 2018, Athens, Greece, December 18-21, 2018, pages 1021–1028, 2018

[24]

Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice

Cynthia Rudin and Ustun Berk. Optimized scoring systems: Toward trust in machine learning for healthcare and criminal justice. Interfaces, 48(5):449–466, 2018

[25]

A circumplex model of affect

James A Russell. A circumplex model of affect. Journal of personality and social psychology, 39(6):1161, 1980

[26]

Very deep convolutional networks for large-scale image recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015

[27]

Getting closer to the essence of music: The Con Espressione Manifesto

Gerhard Widmer. Getting closer to the essence of music: The Con Espressione Manifesto. ACM Transactions on Intelligent Systems and Technology (TIST), 8(2):19, 2017

[28]

Multi-scale approaches to the mediaeval 2015 "emotion in music" task

Mingxing Xu, Xinxing Li, Haishu Xianyu, Jiashen Tian, Fanhang Meng, and Wenxiao Chen. Multi-scale approaches to the mediaeval 2015 "emotion in music" task. In MediaEval, 2015

[29]

Bridge the semantic gap between pop music acoustic feature and emotion: Build an interpretable model

JiangLong Zhang, XiangLin Huang, Lifang Yang, and Liqiang Nie. Bridge the semantic gap between pop music acoustic feature and emotion: Build an interpretable model. Neurocomputing, 208:333–341, 2016. SI: BridgingSemantic
