Evaluating the Temporal Detection Capability of Integrated Gradients Applied on Sound Classifier

Martynas Dumpis; Tuomas Virtanen

arxiv: 2605.23293 · v1 · pith:GD34IMTVnew · submitted 2026-05-22 · 📡 eess.AS · cs.SD · eess.SP

Pith reviewed 2026-05-25 03:01 UTC · model grok-4.3

classification 📡 eess.AS · cs.SD · eess.SP

keywords integrated gradients · sound event detection · temporal localization · audio classification · attribution methods · polyphonic audio · weak supervision · domestic sounds

The pith

Integrated gradients localize sound events temporally at 0.39 mean IoU without frame labels

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper checks whether integrated gradients (IG) can find the start and end times of sound events inside audio clips. The classifier saw only whole-clip labels during training, with no timing information. The authors build synthetic mixtures of ten domestic sound classes with exactly known timestamps, then measure how well the IG importance scores match those timestamps. IG reaches a mean IoU of 0.39 and a frame-level F1 of 0.52, close to the 0.42 IoU and 0.55 F1 of a framewise CNN trained with weak (clip-level) supervision.

Core claim

Integrated gradients can be used to detect the temporal activity of sound events when applied to a classifier that has no access to frame-level labels during training. On a dataset of synthetic polyphonic domestic sound mixtures, IG attributions achieve a mean Intersection over Union of 0.39 with ground-truth event boundaries, a frame-level F1 score of 0.52, and Pointing Game accuracy of 82.6%. The IoU and F1 figures come close to those of a framewise CNN trained with weak supervision (0.42 and 0.55), while Pointing Game accuracy trails by a wider margin (82.6% versus 97.3%). IG exceeds the random and energy-based baselines but remains below a strongly supervised framewise model.
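The two overlap metrics named above can be illustrated on a single pair of binary frame masks. This is a minimal sketch, not the paper's evaluation code: the function name `iou_and_f1`, the example masks, and the convention that two empty masks count as perfect agreement are all assumptions made for illustration.

```python
def iou_and_f1(pred, truth):
    """Return (IoU, F1) for two equal-length binary frame sequences."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of frames")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)      # active in both
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)  # predicted only
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)  # missed frames
    union = tp + fp + fn
    # Assumption: two empty masks are treated as perfect agreement.
    iou = tp / union if union else 1.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 1.0
    return iou, f1


# Hypothetical 10-frame example: the prediction is shifted one frame early.
truth = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
pred = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
iou, f1 = iou_and_f1(pred, truth)
print(iou, f1)
```

For one pair of masks the two metrics are tied together by F1 = 2·IoU / (1 + IoU). Means taken over many clips and classes need not satisfy that identity, so the reported 0.39 and 0.52 are not expected to line up exactly.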

What carries the argument

Integrated gradients attributions computed on the output of a CNN classifier trained only on clip-level labels, used to produce time-resolved importance scores for sound events.
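As a rough sketch of the attribution step, IG assigns feature i the value (x_i − x'_i) multiplied by the gradient averaged along the straight path from a baseline x' to the input x. The example below approximates that path integral with a midpoint Riemann sum. The toy logistic model, its weights, the all-zero baseline, and the step count are illustrative assumptions standing in for the paper's CNN and its unreported IG settings.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, bias):
    """Toy stand-in for a clip-level classifier output (assumption)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)


def model_grad(x, w, bias):
    """Analytic gradient of the toy model with respect to the input."""
    p = model(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, bias, steps=200):
    """Midpoint Riemann approximation of IG along the baseline-to-input path."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        grad = model_grad(point, w, bias)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, total)]


# Hypothetical per-frame input: frames 2 and 3 carry the "event".
x = [0.0, 0.1, 2.0, 1.8, 0.0, 0.1]
w = [0.5] * len(x)
bias = -1.0
baseline = [0.0] * len(x)
attr = integrated_gradients(x, baseline, w, bias)
print(attr)
```

The attributions sum, up to numerical error, to the difference between the model output at the input and at the baseline, which is the completeness property of IG. For a spectrogram input, one would additionally need to collapse the attribution over frequency to obtain a per-frame score; how the paper does this is not stated in the material above.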

If this is right

IG can provide temporal localization as a side effect of standard clip-level classification training.
Post-hoc attribution reaches localization performance near that of weakly supervised framewise models.
Attribution scores capture event activity patterns beyond random guessing or simple energy thresholds.
A remaining gap exists between post-hoc IG and models trained with explicit frame-level labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluating IG on real recorded audio with human-annotated timestamps would test whether synthetic results generalize.
The approach could lower the need for expensive frame-level annotations when building sound event detectors.
The same evaluation protocol could compare other attribution methods for their temporal capabilities in audio.

Load-bearing premise

The synthetic polyphonic mixtures with perfect ground-truth timestamps are representative enough of real acoustic conditions that alignment between IG attributions and event boundaries measures true temporal detection capability.
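To make the role of the exact timestamps concrete, the sketch below turns a list of (onset, offset) pairs for one class into a binary frame mask of the kind the alignment metrics consume. The function name, the hop size, and the rule that a frame counts as active if it overlaps an event at all are assumptions, not details taken from the paper.

```python
import math


def events_to_frame_mask(events, clip_duration, hop):
    """Convert (onset, offset) times in seconds to a binary mask.

    Frame i covers the interval [i * hop, (i + 1) * hop).
    Assumption: a frame is active if it overlaps any event.
    """
    n_frames = int(round(clip_duration / hop))
    mask = [0] * n_frames
    for onset, offset in events:
        first = max(0, int(math.floor(onset / hop)))
        last = min(n_frames, int(math.ceil(offset / hop)))
        for i in range(first, last):
            mask[i] = 1
    return mask


# Hypothetical 5-second clip with two events of the same class.
mask = events_to_frame_mask([(1.0, 2.5), (4.2, 4.9)], clip_duration=5.0, hop=0.5)
print(mask)
```

With synthetic mixtures the onsets and offsets are known by construction, so the mask carries no annotation error. On real recordings the same conversion would inherit whatever uncertainty the human annotations contain.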

What would settle it

Applying the same IG analysis to real domestic audio recordings that have independently verified event timestamps and finding substantially lower IoU than 0.39 would falsify the temporal detection claim.

Figures

Figures reproduced from arXiv: 2605.23293 by Martynas Dumpis, Tuomas Virtanen.

**Figure 2.** IG attribution magnitudes for a polyphonic test sample. Top: waveform

**Figure 3.** Threshold sensitivity of temporal detection. IoU (solid) and frame
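A threshold sensitivity analysis of the kind Figure 3 refers to can be sketched as follows: attribution magnitudes are normalized by their peak, binarized at several thresholds, and scored against the ground-truth mask. The normalization, the threshold values, and the example scores are hypothetical; the paper's actual procedure is not described in the material above.

```python
def threshold_sweep(scores, truth, thresholds):
    """Return a list of (threshold, IoU) pairs for peak-normalized scores."""
    peak = max(scores)
    norm = [s / peak if peak > 0 else 0.0 for s in scores]
    results = []
    for tau in thresholds:
        pred = [1 if s >= tau else 0 for s in norm]
        inter = sum(1 for p, t in zip(pred, truth) if p and t)
        union = sum(1 for p, t in zip(pred, truth) if p or t)
        # Assumption: an empty union counts as perfect agreement.
        results.append((tau, inter / union if union else 1.0))
    return results


# Hypothetical non-negative attribution magnitudes for 8 frames.
scores = [0.1, 0.2, 0.8, 1.0, 0.6, 0.3, 0.1, 0.0]
truth = [0, 0, 1, 1, 1, 0, 0, 0]
sweep = threshold_sweep(scores, truth, [0.05, 0.5, 0.9])
for tau, iou in sweep:
    print(tau, round(iou, 3))
```

In this toy case a low threshold marks nearly every frame as active and a high threshold keeps only the peak, so IoU is highest at the intermediate value.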

read the original abstract

Gradient-based attribution methods can highlight input regions important for neural network predictions, but their effectiveness for temporal sound event detection in audio classification has not been systematically evaluated. This paper assesses whether integrated gradients (IG) can temporally detect sound events when applied to a classifier trained without temporal supervision. We use synthetic polyphonic audio with ground truth timestamps to measure alignment between IG attributions and event boundaries. On a 10-class domestic sound dataset, IG achieves mean Intersection over Union (IoU) of 0.39, frame-level F1 of 0.52, and Pointing Game accuracy of 82.6%. For comparison, a framewise CNN trained with weak supervision (FW-WS, clip-level training labels) achieves 0.42 IoU, 0.55 F1, and 97.3% PG, while a strongly supervised variant (FW-SS, frame-level training labels) reaches 0.45 IoU, 0.58 F1, and 97.9% PG. Overall, these results suggest that post-hoc IG captures meaningful temporal activity patterns of sound events, with localization performance approaching models that explicitly produce frame-level predictions. All methods substantially outperform random and energy-based baselines.
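The abstract does not say how the energy-based baseline is constructed. One plausible version, sketched here purely as an assumption, marks a frame as active when its mean squared amplitude exceeds a fraction of the clip's peak frame energy. The function names, frame length, and relative threshold are hypothetical.

```python
def frame_energy(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    n_frames = len(samples) // frame_len
    return [
        sum(s * s for s in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def energy_baseline(samples, frame_len, rel_threshold=0.5):
    """Binary activity mask from frame energy relative to the clip peak."""
    energy = frame_energy(samples, frame_len)
    peak = max(energy) if energy else 0.0
    if peak == 0.0:
        return [0] * len(energy)
    return [1 if e >= rel_threshold * peak else 0 for e in energy]


# Hypothetical waveform: silence, a quiet event, a loud event, near-silence.
samples = (
    [0.0] * 4
    + [0.5, -0.5, 0.5, -0.5]
    + [1.0, -1.0, 1.0, -1.0]
    + [0.01] * 4
)
energy_mask = energy_baseline(samples, frame_len=4, rel_threshold=0.2)
print(energy_mask)
```

A detector of this kind is class-agnostic: it can flag that something is sounding but not which class, which is one reason it would be expected to struggle on polyphonic mixtures.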

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

IG recovers moderate temporal alignment on synthetic polyphonic mixtures but the perfect-label setup leaves real-world transfer untested.

The paper checks whether integrated gradients can extract usable timing from a clip-level sound classifier. On 10-class synthetic domestic mixtures they report IG at 0.39 mean IoU, 0.52 frame F1 and 82.6% pointing-game accuracy. A weakly-supervised framewise CNN reaches 0.42/0.55/97.3 and the strongly-supervised version 0.45/0.58/97.9; both beat random and energy baselines comfortably. That is the concrete result they deliver. The comparison is direct and the metrics are standard, so the numbers themselves are easy to interpret. The work is honest about using post-hoc attribution rather than claiming a new method.

The main limitation is the data. Every mixture is constructed with exact, noise-free timestamps and no reverberation or natural co-occurrence statistics. Alignment on that set does not automatically show that the attributions track events under real recording conditions. The paper does not add any acoustic distortion tests or real recordings, so the transfer claim stays open. The relative ordering versus the framewise models is also measured on the same synthetic set, which does not close the gap.

This is useful reading for people already working on attribution in audio event detection. It is a clean incremental measurement rather than a new framework. I would send it to review because the question is well-posed and the experiments are reproducible on the reported data, even though the conclusions would need stronger caveats about the synthetic regime.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates whether Integrated Gradients (IG) applied post-hoc to a sound event classifier trained only on clip-level labels can recover temporal event boundaries. Experiments use synthetic polyphonic mixtures from a 10-class domestic sound dataset with perfect ground-truth timestamps; IG is reported to achieve mean IoU 0.39, frame-level F1 0.52, and Pointing Game accuracy 82.6 %, outperforming random and energy baselines while approaching weakly-supervised (FW-WS) and strongly-supervised (FW-SS) framewise CNNs (0.42/0.55/97.3 % and 0.45/0.58/97.9 % respectively). The authors conclude that IG captures meaningful temporal activity patterns.

Significance. If the evaluation is accepted as representative, the work supplies a concrete empirical benchmark showing that a standard post-hoc attribution method can extract usable temporal localization from a classifier that never saw frame labels. The direct comparison against random, energy, and both weak- and strong-supervision baselines on identical data is a clear strength and allows readers to gauge the practical gap. The result is of moderate significance for weakly-supervised sound event detection research.
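The size of the practical gap can be read directly off the reported numbers. The short calculation below uses only the figures quoted in the abstract; the dictionary layout and the helper name `gap` are illustrative.

```python
# Reported results (IoU and F1 as fractions, Pointing Game in percent).
results = {
    "IG": {"IoU": 0.39, "F1": 0.52, "PG": 82.6},
    "FW-WS": {"IoU": 0.42, "F1": 0.55, "PG": 97.3},
    "FW-SS": {"IoU": 0.45, "F1": 0.58, "PG": 97.9},
}


def gap(reference, method="IG"):
    """Absolute difference (reference minus method) for each metric."""
    return {
        m: round(results[reference][m] - results[method][m], 2)
        for m in ("IoU", "F1", "PG")
    }


gap_ws = gap("FW-WS")
gap_ss = gap("FW-SS")
print(gap_ws)
print(gap_ss)
```

IG trails the weakly supervised model by 0.03 in both IoU and F1 and the strongly supervised model by 0.06 in both. The Pointing Game differences are larger, at 14.7 and 15.3 percentage points respectively.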

major comments (2)

§4 (Evaluation on synthetic mixtures): The reported IoU, F1, and Pointing Game numbers are obtained exclusively on synthetic polyphonic mixtures that supply noise-free, perfectly aligned timestamps by construction. No experiment evaluates the same metrics on real recordings that contain reverberation, variable SNR, or natural co-occurrence statistics. Because the central claim is that IG “captures meaningful temporal activity patterns of sound events” and “approaches” framewise models, the absence of any transfer test to genuine acoustic conditions is load-bearing for the abstract conclusion.
§4.3 (Baseline comparisons): FW-WS and FW-SS are also evaluated on the identical synthetic set, so the relative ordering does not mitigate the representativeness gap. A minimal additional experiment on a real, manually annotated test set would be required to support the claim that IG localization performance is practically useful.

minor comments (2)

[Abstract / §3] The abstract states concrete numbers but the methods section should explicitly list the exact IG hyperparameters (number of steps, baseline choice) and the precise definition of the Pointing Game used.
[Figures] Figure captions should state the number of mixtures and the exact train/validation/test split sizes so that the reported means can be reproduced.
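Since the exact Pointing Game definition is not given, the sketch below shows one common formulation as an assumption: a (clip, class) pair counts as a hit when the frame with the highest attribution falls inside a ground-truth event, optionally within a tolerance of a few frames. The function names, the tolerance parameter, and the example data are hypothetical.

```python
def pointing_game_hit(scores, truth, tolerance=0):
    """True if the peak-attribution frame lies within `tolerance` frames of an active frame."""
    peak = max(range(len(scores)), key=lambda i: scores[i])
    lo = max(0, peak - tolerance)
    hi = min(len(truth), peak + tolerance + 1)
    return any(truth[lo:hi])


def pointing_game_accuracy(pairs, tolerance=0):
    """Fraction of (scores, truth) pairs whose attribution peak is a hit."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    hits = sum(1 for s, t in pairs if pointing_game_hit(s, t, tolerance))
    return hits / len(pairs)


# Hypothetical (attribution, ground-truth mask) pairs.
pairs = [
    ([0.1, 0.9, 0.3, 0.0], [0, 1, 1, 0]),  # peak inside the event
    ([0.7, 0.2, 0.1, 0.0], [0, 0, 1, 1]),  # peak two frames before the event
    ([0.0, 0.1, 0.2, 0.8], [0, 0, 0, 1]),  # peak inside the event
    ([0.3, 0.1, 0.6, 0.2], [0, 1, 0, 0]),  # peak one frame after the event
]
acc_strict = pointing_game_accuracy(pairs)
acc_tolerant = pointing_game_accuracy(pairs, tolerance=1)
print(acc_strict, acc_tolerant)
```

The tolerance alone changes the score on the same data, which is why a precise definition matters for reproducing the reported 82.6%.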

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments regarding the evaluation setup. We address each major comment below.

Referee: §4 (Evaluation on synthetic mixtures): The reported IoU, F1, and Pointing Game numbers are obtained exclusively on synthetic polyphonic mixtures that supply noise-free, perfectly aligned timestamps by construction. No experiment evaluates the same metrics on real recordings that contain reverberation, variable SNR, or natural co-occurrence statistics. Because the central claim is that IG “captures meaningful temporal activity patterns of sound events” and “approaches” framewise models, the absence of any transfer test to genuine acoustic conditions is load-bearing for the abstract conclusion.

Authors: The synthetic polyphonic mixtures were selected specifically to furnish noise-free, perfectly aligned ground-truth timestamps. This controlled setting enables an unambiguous measurement of how closely IG attributions recover event boundaries, free from annotation noise or acoustic distortions that would complicate interpretation on real data. We agree that the current results do not demonstrate transfer to reverberant or variable-SNR conditions and that this constrains the strength of claims about practical utility. In the revised manuscript we will add an explicit limitations paragraph, moderate the abstract wording, and outline the need for future real-data validation.
revision: partial

Referee: §4.3 (Baseline comparisons): FW-WS and FW-SS are also evaluated on the identical synthetic set, so the relative ordering does not mitigate the representativeness gap. A minimal additional experiment on a real, manually annotated test set would be required to support the claim that IG localization performance is practically useful.

Authors: Performing all methods on the same synthetic data permits a direct, apples-to-apples comparison of post-hoc attribution against weakly and strongly supervised framewise models. The ordering therefore quantifies the gap that remains when temporal supervision is removed. We nevertheless accept that this comparison does not address generalization to real recordings. As stated in the response to the preceding comment, the revised manuscript will clarify the synthetic scope of the results and temper the associated claims; we are not in a position to add new experiments on manually annotated real data.
revision: partial

standing simulated objections not resolved

Conducting additional experiments on a real, manually annotated test set

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on synthetic data

full rationale

The paper reports experimental results: a classifier is trained on synthetic polyphonic mixtures, IG attributions are computed, and alignment with provided ground-truth timestamps is measured via IoU, frame-level F1, and Pointing Game accuracy. Direct comparisons are made to framewise CNN baselines (FW-WS, FW-SS) and random/energy baselines on the identical dataset. No equations, derivations, or first-principles claims appear; no parameters are fitted and then relabeled as predictions; no self-citations are invoked as load-bearing uniqueness theorems. The evaluation chain is self-contained and externally falsifiable against the stated synthetic ground truth.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. The evaluation implicitly assumes that synthetic mixtures and the chosen metrics (IoU, F1, Pointing Game) are appropriate proxies for temporal detection capability.
