A Case Study of Deep-Learned Activations via Hand-Crafted Audio Features

Emilia G\'omez; Gloria Haro; Olga Slizovskaia

arxiv: 1907.01813 · v1 · pith:GVHNVVCBnew · submitted 2019-07-03 · 💻 cs.SD · cs.LG· eess.AS

A Case Study of Deep-Learned Activations via Hand-Crafted Audio Features

Olga Slizovskaia , Emilia G\'omez , Gloria Haro This is my paper

Pith reviewed 2026-05-25 09:44 UTC · model grok-4.3

classification 💻 cs.SD cs.LGeess.AS

keywords explainabilityconvolutional neural networksaudio featuresmusical instrument recognitionactivation mapschromagramsharmonic percussive separation

0 comments

The pith

CNNs for musical instrument recognition learn activations that match classical audio features such as harmonic-percussive spectra in shallow layers and chromagrams in deeper layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares internal activations of a convolutional neural network trained for musical instrument recognition against traditional hand-crafted audio features. It introduces a matrix-based similarity measure to handle features like spectrograms and chromagrams. Shallow-layer activations align with harmonic and percussive spectrum components, while deeper layers show correspondence to chromagrams, loudness, and onset rate. This approach offers a way to interpret what the network has learned by linking it to established music information retrieval tools. A reader would care because it turns the black-box nature of deep models into something that can be checked against known acoustic properties.

Core claim

We observe that some neurons' activations correspond to well-known classical audio features. In particular, for shallow layers, we found similarities between activations and harmonic and percussive components of the spectrum. For deeper layers, we compare chromagrams with high-level activation maps as well as loudness and onset rate with deep-learned embeddings.

What carries the argument

The matrix similarity measurement that quantifies correspondence between CNN activation maps and matrix-form audio features such as chromagrams or spectrograms.

If this is right

Shallow layers of the network capture basic spectral decomposition into harmonic and percussive parts.
Deeper layers encode higher-level pitch-class information comparable to chromagrams.
Embeddings from the final layers can be directly related to scalar features like loudness and onset rate.
The network's learned representations can be explained using the same vocabulary as classical MIR methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The correspondence might be used to initialize or regularize new networks with classical features in early layers.
If the pattern holds across tasks, it could indicate a general hierarchical structure in audio CNNs that mirrors signal-processing pipelines.
Checking activation-feature alignment on held-out recordings could serve as a diagnostic for whether the model has learned musically relevant properties.

Load-bearing premise

The numerical similarity between activation maps and hand-crafted features reflects genuine semantic correspondence rather than coincidental overlap.

What would settle it

Running the same similarity procedure on a CNN trained for a non-audio task or on random activations and finding equally high matches with the same audio features.

read the original abstract

The explainability of Convolutional Neural Networks (CNNs) is a particularly challenging task in all areas of application, and it is notably under-researched in music and audio domain. In this paper, we approach explainability by exploiting the knowledge we have on hand-crafted audio features. Our study focuses on a well-defined MIR task, the recognition of musical instruments from user-generated music recordings. We compute the similarity between a set of traditional audio features and representations learned by CNNs. We also propose a technique for measuring the similarity between activation maps and audio features which typically presented in the form of a matrix, such as chromagrams or spectrograms. We observe that some neurons' activations correspond to well-known classical audio features. In particular, for shallow layers, we found similarities between activations and harmonic and percussive components of the spectrum. For deeper layers, we compare chromagrams with high-level activation maps as well as loudness and onset rate with deep-learned embeddings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a narrow case study that visually links some CNN activations to hand-crafted audio features in instrument recognition but supplies no numbers, tests, or controls to show the matches are real rather than chance overlap.

read the letter

The paper's core observation is that shallow CNN layers pick up harmonic and percussive spectral parts while deeper layers align with chromagrams, loudness, and onset rate. It does this on a standard instrument-recognition task from user-generated recordings and adds a simple matrix-similarity method for 2-D features like spectrograms or chromagrams. That is the extent of what is new: an application of existing activation-analysis ideas to one MIR setting, with a modest technical tweak for matrix comparison. The work is straightforward and stays within its scope; it does not claim broad advances in either CNN theory or audio processing. Credit is due for picking a concrete, reproducible task and for trying to ground the learned representations in features that audio engineers already understand. The writing is clear about what was done at the level of the abstract. The main weakness is the absence of any quantitative support. No similarity scores appear, no statistical tests, and no null model (shuffled activations, random weights, or phase-scrambled audio) to show that the reported alignments exceed what shared time-frequency structure would produce anyway. Because both the activation maps and the reference features are 2-D representations of the same signals, numerical overlap can occur for trivial reasons. Without those controls the central claim stays observational. The paper is therefore best read as a preliminary case study rather than a finished result. Readers already working on interpretability for audio CNNs or on MIR explainability will find the setup useful as a starting point and may borrow the matrix-similarity idea. Anyone looking for strong evidence that the correspondences are semantically meaningful will need the quantitative follow-up that is missing here. The work is coherent on its own terms and shows honest engagement with the literature on activation analysis, so it is worth sending to peer review. Referees can ask for the missing baselines and scores; the underlying question is reasonable and the task is well chosen.

Referee Report

2 major / 1 minor

Summary. The paper presents a case study on the explainability of CNNs for musical instrument recognition from user-generated recordings. It computes similarities between learned activations and traditional hand-crafted audio features (harmonic/percussive spectra for shallow layers; chromagrams, loudness, and onset rate for deeper layers) and proposes a matrix similarity technique for comparing activation maps to 2-D audio feature representations such as spectrograms or chromagrams. The central claim is that some neurons' activations correspond to well-known classical audio features.

Significance. If the reported correspondences can be shown to exceed chance overlap via quantitative scores and statistical controls, the work would provide a practical method for linking deep-learned audio representations to established MIR features. This could aid interpretability in audio CNNs without requiring new axioms or fitted parameters, and the use of externally defined hand-crafted features avoids circularity.

major comments (2)

[Abstract] Abstract: The observations that 'we found similarities between activations and harmonic and percussive components' and that 'we compare chromagrams with high-level activation maps' are stated without any reported quantitative similarity scores, statistical tests, error bars, or validation that the metric reflects semantic correspondence rather than shared dimensionality or low-level statistics.
[Abstract] Abstract (paragraph describing the matrix similarity technique): No baseline distribution (e.g., shuffled activations, phase-randomized spectra, or random CNN weights) or significance test is supplied to establish that the highlighted matches exceed what would be obtained under a null of no learned structure, which is load-bearing for the empirical claim that activations 'correspond to' classical features.

minor comments (1)

The manuscript would benefit from explicit pseudocode or a small worked example for the proposed matrix similarity technique to clarify how normalization and alignment are handled.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on strengthening the quantitative support in the abstract. We agree that the abstract would benefit from explicit mention of similarity scores and baseline controls drawn from the experimental results, and we will revise accordingly. Our point-by-point responses follow.

read point-by-point responses

Referee: [Abstract] Abstract: The observations that 'we found similarities between activations and harmonic and percussive components' and that 'we compare chromagrams with high-level activation maps' are stated without any reported quantitative similarity scores, statistical tests, error bars, or validation that the metric reflects semantic correspondence rather than shared dimensionality or low-level statistics.

Authors: We acknowledge that the abstract, as a concise overview, omits the specific numerical similarity scores and statistical details that appear in the results section of the full manuscript. Those sections report the matrix similarity values between activation maps and the hand-crafted features, along with comparisons demonstrating that the observed alignments are stronger than would be expected from dimensionality alone. We will revise the abstract to include representative quantitative scores and a brief reference to the validation that the metric captures semantic correspondence. revision: yes
Referee: [Abstract] Abstract (paragraph describing the matrix similarity technique): No baseline distribution (e.g., shuffled activations, phase-randomized spectra, or random CNN weights) or significance test is supplied to establish that the highlighted matches exceed what would be obtained under a null of no learned structure, which is load-bearing for the empirical claim that activations 'correspond to' classical features.

Authors: The experimental results in the manuscript do include explicit baseline comparisons (shuffled activations and random-weight controls) and statistical tests showing that the reported similarities exceed those obtained under the null. The abstract summarizes the outcome of those controls without restating the full procedure. We will update the abstract to note that the correspondences were validated against such baselines and found to be statistically significant. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical similarity measurements against independent hand-crafted features

full rationale

The paper performs an observational case study by computing similarities between CNN activation maps and externally defined, hand-crafted audio features (harmonic/percussive spectra, chromagrams, loudness, onset rate). No equations, fitted parameters, or self-citations are used to derive the reported correspondences; the similarity technique is proposed as a measurement tool rather than derived from the target result. The central claim reduces to direct numerical comparison on the same audio inputs, which is independent of any self-referential definition or prediction-by-construction. This is a standard empirical analysis with no load-bearing self-citation chains or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract only; the central claim rests on the validity of the introduced similarity metric for matrix features and on standard assumptions that CNNs trained for instrument recognition produce activations worth comparing to classical audio descriptors.

axioms (2)

domain assumption Standard supervised CNN training on audio recordings yields activations that can be meaningfully compared to hand-crafted features.
Invoked when the study treats the trained network as a black box whose internal representations are to be interpreted via similarity.
ad hoc to paper Similarity between activation maps and audio feature matrices reflects correspondence to classical concepts rather than numerical artifact.
Central to the proposed technique and the interpretation of observed similarities.

pith-pipeline@v0.9.0 · 5708 in / 1359 out tokens · 52928 ms · 2026-05-25T09:44:11.344806+00:00 · methodology

A Case Study of Deep-Learned Activations via Hand-Crafted Audio Features

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)