Explaining Object Detectors via Collective Contribution of Pixels

Hiroshi Kera; Kazuhiko Kawamoto; Toshinori Yamauchi

arxiv: 2412.00666 · v4 · pith:GDR2PIIHnew · submitted 2024-12-01 · 💻 cs.CV

Pith reviewed 2026-05-23 08:17 UTC · model grok-4.3

classification 💻 cs.CV

keywords: object detection · explainable AI · Shapley values · visual explanations · pixel contributions · game theory · cooperative game

The pith

A game-theoretic method uses Shapley values and interactions to explain object detectors by capturing collective pixel contributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method to explain object detectors that accounts for how groups of pixels jointly influence detections, rather than examining pixels in isolation. Existing approaches overlook these collective effects and can miss compositional cues or highlight spurious correlations instead. The new technique models the detector's output as a cooperative game and applies Shapley values together with interaction terms to quantify both single-pixel and multi-pixel importance. It generates explanations for both bounding-box localization and class prediction. Experiments indicate that the resulting importance maps align more closely with truly influential regions than those from prior methods.

Core claim

The central claim is that a game-theoretic approach based on Shapley values and interactions explicitly captures both individual and collective pixel contributions, thereby providing explanations for bounding box localization and class determination that identify important regions more accurately than state-of-the-art methods.

What carries the argument

A cooperative game whose value function is taken directly from the detector output, with Shapley values measuring individual pixel contributions and interaction terms measuring collective contributions of pixel groups.
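
As a minimal sketch of the machinery described above, the following computes exact Shapley values and the pairwise Shapley interaction index on a three-patch toy game. The patch names and `toy_value` are hypothetical stand-ins for a detector score on a masked image; they are not taken from the paper, and exact enumeration like this is only feasible for a handful of players.

```python
from itertools import combinations
from math import factorial

def toy_value(S):
    """Hypothetical detector score when only the patches in S are visible."""
    score = 0.0
    if "wheel" in S:
        score += 0.1
    if "frame" in S:
        score += 0.1
    if "wheel" in S and "frame" in S:
        score += 0.6  # joint evidence that neither patch carries alone
    return score

def shapley_values(players, v):
    """Exact Shapley value of every player by enumerating all coalitions."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for T in combinations(others, k):
                T = frozenset(T)
                total += weight * (v(T | {i}) - v(T))
        phi[i] = total
    return phi

def pair_interaction(players, v, i, j):
    """Shapley interaction index of the pair {i, j}."""
    n = len(players)
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for k in range(n - 1):
        weight = factorial(k) * factorial(n - k - 2) / factorial(n - 1)
        for T in combinations(others, k):
            T = frozenset(T)
            total += weight * (v(T | {i, j}) - v(T | {i}) - v(T | {j}) + v(T))
    return total

PATCHES = ["wheel", "frame", "sky"]
phi = shapley_values(PATCHES, toy_value)
joint = pair_interaction(PATCHES, toy_value, "wheel", "frame")
```

In this toy game most of the score comes from the two patches appearing together, which the interaction term isolates and the individual Shapley values only split evenly between them.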

If this is right

Explanations now cover both localization accuracy and class assignment within the same framework.
Important regions are identified more accurately than with methods that consider only individual pixel contributions.
Spurious correlations arising from isolated pixel analysis can be reduced.
The same machinery applies to both bounding-box regression and classification heads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be used to audit detectors for reliance on unintended contextual cues in real-world scenes.
It might guide data-augmentation strategies that strengthen or weaken collective feature interactions.
Similar game-theoretic accounting could be tested on other dense prediction tasks such as instance segmentation.

Load-bearing premise

The collective contribution of pixels can be faithfully represented by a cooperative game whose value function is defined directly from the detector's output without introducing approximation artifacts that alter the ranking of regions.

What would settle it

An experiment that masks known critical pixel groups and checks whether the detector output change contradicts the importance ranking produced by the Shapley-plus-interaction method.
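
One way such a check could be set up is a deletion test: remove patches in the order an explanation ranks them and watch how fast the score falls. The sketch below is an assumption about how to run that check, not the paper's evaluation protocol; `toy_value` and the patch names are hypothetical.

```python
def toy_value(S):
    """Hypothetical detector score when only the patches in S are visible."""
    score = 0.0
    if "wheel" in S:
        score += 0.1
    if "frame" in S:
        score += 0.1
    if "wheel" in S and "frame" in S:
        score += 0.6
    return score

def deletion_curve(ranking, value_fn, all_patches):
    """Score after removing patches one by one in ranked order."""
    remaining = set(all_patches)
    scores = [value_fn(frozenset(remaining))]
    for patch in ranking:
        remaining.discard(patch)
        scores.append(value_fn(frozenset(remaining)))
    return scores

def area_under(scores):
    """Trapezoidal area with unit steps; a smaller area means a faster drop."""
    return sum((a + b) / 2 for a, b in zip(scores, scores[1:]))

PATCHES = ["wheel", "frame", "sky"]
good = deletion_curve(["wheel", "frame", "sky"], toy_value, PATCHES)
bad = deletion_curve(["sky", "wheel", "frame"], toy_value, PATCHES)
```

A ranking that places truly influential patches first should give the smaller area; if a method's ranking loses to a random order on this measure, its importance claims are in doubt.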

Figures

Figures reproduced from arXiv: 2412.00666 by Hiroshi Kera, Kazuhiko Kawamoto, Toshinori Yamauchi.

**Figure 2.** Overview of VX-CODE. The input image is divided …

**Figure 3.** Comparison of visual explanations generated by ea…

**Figure 4.** Comparison of identified patches and generated hea…

**Figure 5.** Comparison of visualizations generated by Grad…

**Figure 6.** Comparison of visualizations for bounding box gen…

Original abstract

Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code is available at https://github.com/tttt-0814/VX-CODE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a game-theoretic explanation method for object detectors that applies Shapley values and interaction indices to capture both individual pixel contributions and collective (interaction) contributions to bounding-box localization and class prediction. It argues that existing methods overlook collective influences and presents the new approach as addressing this by defining a cooperative game directly on the detector output. The central claim is that extensive experiments show the method identifies important regions more accurately than state-of-the-art methods, with code released at the provided GitHub link.

Significance. If the experiments are shown to be rigorous and the value-function definition avoids ranking-altering artifacts, the work would usefully extend pixel-attribution techniques to object detection by incorporating established interaction indices, potentially improving faithfulness for compositional detections. The public code release is a positive contribution for reproducibility.

major comments (1)

[Abstract] Abstract (method paragraph): the central claim that collective pixel contributions are faithfully captured rests on the assumption that the cooperative-game value function (defined directly from detector output) introduces no approximation artifacts that alter region rankings; this assumption is load-bearing for the superiority claim but receives no explicit verification or sensitivity analysis in the provided description.

minor comments (2)

[Abstract] The abstract states that 'extensive experiments' demonstrate superior accuracy but provides no details on baselines, metrics, statistical controls, error bars, or ablation studies; these should be summarized with quantitative results and controls in the main text or a dedicated experiments section.
Notation for the interaction indices and the precise definition of the value function v(S) should be introduced with an equation early in the method section to allow readers to assess how collective contributions are operationalized.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment and the recommendation for minor revision. We address the point below.

Referee: [Abstract] Abstract (method paragraph): the central claim that collective pixel contributions are faithfully captured rests on the assumption that the cooperative-game value function (defined directly from detector output) introduces no approximation artifacts that alter region rankings; this assumption is load-bearing for the superiority claim but receives no explicit verification or sensitivity analysis in the provided description.

Authors: We appreciate the referee's observation. The value function is defined exactly as the detector's raw output (localization and classification scores) on the masked input, with no additional modeling or approximation at that stage; any numerical approximations arise solely from the standard Monte-Carlo sampling used to estimate Shapley values and interaction indices, which is common practice. Nevertheless, we acknowledge that an explicit sensitivity analysis of the value-function definition would strengthen the superiority claim. In the revised manuscript we will add a dedicated paragraph and supplementary experiments that vary the masking strategy and perturbation level, confirming that the resulting region rankings remain stable and do not alter the comparative conclusions.

Revision: yes
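
To make the Monte-Carlo step mentioned in the response concrete, here is a minimal sketch of the standard permutation-sampling estimator for Shapley values. It is a generic estimator, not necessarily the sampling scheme used in the paper; `toy_value`, the patch names, and the default sample count are hypothetical.

```python
import random

def toy_value(S):
    """Hypothetical detector score when only the patches in S are visible."""
    score = 0.0
    if "wheel" in S:
        score += 0.1
    if "frame" in S:
        score += 0.1
    if "wheel" in S and "frame" in S:
        score += 0.6
    return score

def mc_shapley(players, v, num_permutations=2000, seed=0):
    """Average each player's marginal contribution over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = frozenset()
        previous = v(coalition)
        for p in order:
            coalition = coalition | {p}
            current = v(coalition)
            totals[p] += current - previous
            previous = current
    return {p: t / num_permutations for p, t in totals.items()}

PATCHES = ["wheel", "frame", "sky"]
estimate = mc_shapley(PATCHES, toy_value)
```

Each sampled ordering telescopes to v(N) minus v(∅), so the estimates always sum to the full score; the sampling error only affects how that total is split among players, which is the part a sensitivity analysis would need to probe.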

Circularity Check

0 steps flagged

No significant circularity

The paper frames its contribution as a direct application of established Shapley values and interaction indices to define pixel contributions for object detector explanations. No equations or claims in the provided abstract reduce a result to a fitted parameter, self-citation chain, or definitional equivalence with the inputs. The method's value function is defined from the detector output using standard cooperative game concepts, and performance claims rest on experiments rather than any internal renaming or forced prediction. This is the common case of an independent application of prior mathematical tools.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method inherits standard Shapley axioms (efficiency, symmetry, dummy player, additivity) and the modeling choice that detector output can serve as the characteristic function.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
cs.CV 2026-04 unverdicted novelty 6.0

H-Sets detects higher-order feature interactions in image classifiers via Hessian-guided pair merging and attributes them with IDG-Vis to generate more interpretable saliency maps than existing marginal or coarse methods.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
[2] Ammar Ahmed, Ali Shariq Imran, Abdul Manaf, Zenun Kastrati, and Sher Muhammad Daudpota. Enhancing wrist abnormality detection with yolo: Analysis of state-of-the-art single-stage detection models. Biomedical Signal Processing and Control, 93:106144, 2024.
[3] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 2015.
[4] Stefan Blücher, Johanna Vielhaben, and Nils Strodthoff. Preddiff: Explanations and interactions from conditional expectations. Artificial Intelligence, 312:103774, 2022.
[5] Alexey Bochkovskiy, Chien-Yao Wang, and H. Liao. Yolov4: Optimal speed and accuracy of object detection. ArXiv, abs/2004.10934, 2020.
[6] Revoti Prasad Bora, Philipp Terhörst, Raymond Veldhuis, Raghavendra Ramachandra, and Kiran Raja. Slice: Stabilized lime for consistent explanations for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10988–10996, 2024.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229, 2020.
[8] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847, 2018.
[9] Saurabh Desai and Harish G. Ramaswamy. Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 972–980,
[10] Maximilian Dreyer, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. Revealing hidden context bias in segmentation and object detection through concept-specific explanations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3829–3839, 2023.
[11] Maximilian Dreyer, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. Revealing hidden context bias in segmentation and object detection through concept-specific explanations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3829–3839, 2023.
[12] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis., 88(2):303–338, 2010.
[13] Ruigang Fu, Qingyong Hu, Xiaohu Dong, Yulan Guo, Yinghui Gao, and Biao Li. Axiom-based grad-cam: Towards accurate visualization and explanation of cnns. In British Machine Vision Conference (BMVC), 2020.
[14] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
[15] Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28:547–565, 1999.
[16] Jindong Gu, Yinchong Yang, and Volker Tresp. Understanding individual decisions of cnns via contrastive backpropagation. In Asian Conference on Computer Vision (ACCV), pages 119–134, 2018.
[17] Chul Gwon and Steven C. Howell. Odsmoothgrad: Generating saliency maps for object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3685–3689, 2023.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[19] Neil Jethani, Mukund Sudarshan, Ian Connick Covert, Su-In Lee, and Rajesh Ranganath. FastSHAP: Real-time shapley value estimation. In International Conference on Learning Representations (ICLR), 2022.
[20] Mingqi Jiang, Saeed Khorram, and Li Fuxin. Comparing the decision-making mechanisms by transformers and cnns via explanation methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9546–9555, 2024.
[21] Hyungsik Jung and Youngrock Oh. Towards better explanations of class activation mapping. In IEEE International Conference on Computer Vision (ICCV), pages 1316–1324,
[22] Abdul Hannan Khan, Mohammed Shariq Nawaz, and Andreas Dengel. Localized semantic feature mixers for efficient pedestrian detection in autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5476–5485, 2023.
[23] Michihiro Kuroki and Toshihiko Yamasaki. Bsed: Baseline shapley-based explainable detector. IEEE Access, 12:57959–57973, 2024.
[24] Michihiro Kuroki and Toshihiko Yamasaki. Fast explanation using shapley value for object detection. IEEE Access, 12:31047–31054, 2024.
[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2014.
[26] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.
[27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, 2017.
[28] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37, 2016.
[29] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS), pages 4768–4777,
[30] Ethan H. Nguyen, Haichun Yang, Ruining Deng, Yuzhe Lu, Zheyu Zhu, Joseph T. Roland, Le Lu, Bennett A. Landman, Agnes B. Fogo, and Yuankai Huo. Circle representation for medical object detection. IEEE Transactions on Medical Imaging, 41(3):746–754, 2022.
[31] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In Advances in Neural Information Processing Systems (NeurIPS) Workshop on Autodiff, 2017.
[32] Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. In British Machine Vision Conference (BMVC), page 151, 2018.
[33] Vitali Petsiuk, Rajiv Jain, Varun Manjunatha, Vlad I. Morariu, Ashutosh Mehra, Vicente Ordonez, and Kate Saenko. Black-box explanation of object detectors via saliency maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11443–11452, 2021.
[34] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? In Advances in Neural Information Processing Systems (NeurIPS), pages 12116–12128, 2021.
[35] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, 2017.
[36] Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. arXiv.org, pages 1–6, 2018.
[37] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
[38] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 91–99, 2015.
[39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
[40] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
[41] Lloyd S Shapley. A value for n-person games. In Contributions to the Theory of Games II, pages 307–317. 1953.
[42] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning (ICML), pages 3145–3153, 2017.
[43] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
[44] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B. Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. CoRR, abs/1706.03825, 2017.
[45] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations (ICLR) workshop track, 2015.
[46] Kosuke Sumiyasu, Kazuhiko Kawamoto, and Hiroshi Kera. Identifying important group of pixels using interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6017–6026, 2024.
[47] Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
[48] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, 2017.
[49] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9626–9635, 2019.
[50] Van Binh Truong, Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Quoc Khanh Nguyen, and Quoc Hung Cao. Towards better explanations for object detection. In Asian Conference on Computer Vision (ACCV), pages 1385–1400, 2024.
[51] Shikhar Tuli, Ishita Dasgupta, Erin Grant, and Thomas L. Griffiths. Are convolutional neural networks or transformers more like human vision? ArXiv, abs/2105.07197, 2021.
[52] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 111–119, 2020.
[53] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[54] Toshinori Yamauchi. Spatial sensitive grad-cam++: Improved visual explanation for object detectors via weighted combination of gradient map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 8164–8168, 2024.
[55] Toshinori Yamauchi and Masayoshi Ishikawa. Spatial sensitive grad-cam: Visual explanations for object detection by incorporating spatial sensitivity. In IEEE International Conference on Image Processing (ICIP), pages 256–260, 2022.
[56] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901,
[57] Jianming Zhang, Zhe L. Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126:1084–1102, 2018.
[58] Qinglong Zhang, Lu Rao, and Yubin Yang. Group-cam: Group score-weighted visual explanations for deep convolutional networks, 2021.
[59] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[60] Yi Zhang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, and Wentao Liu. When pedestrian detection meets multi-modal learning: Generalist model and benchmark dataset, 2024.
[61] C. Zhao, J. H. Hsiao, and A. B. Chan. Gradient-based instance-specific visual explanations for object specification and object discrimination. IEEE Transactions on Pattern Analysis & Machine Intelligence, 46(09):5967–5985, 2024.
[62] Quan Zheng, Ziwei Wang, Jie Zhou, and Jiwen Lu. Shap-cam: Visual explanations for convolutional neural networks based on shapley value. In Proceedings of the European Conference on Computer Vision (ECCV), pages 459–474,
[63] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[64] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: deformable transformers for end-to-end object detection. In International Conference on Learning Representations (ICLR), 2021.
Supplementary Material. A. Visual explanations for classification. In t...
[65] … computes pixel-wise feature attributions by propagating relevance from the output layer to the input layer. CPR [16] extends LRP by emphasizing the relevance originating from the target class, calculating the disparity between the relevance from the target class and the average relevance from other classes. CAM-based methods calculate the weights repr...
[66] … and (8), this is formulated as follows:

$$
\begin{aligned}
\{b_1, b_2\} &= \arg\max_{\{b'_1, b'_2\} \subset N} f(\{b'_1, b'_2\}) - f(\emptyset) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} \phi^{sc}(\{b'_1, b'_2\}) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} I^{sc}_2(\{b'_1, b'_2\}) + \sum_{r'=1}^{1} \Bigg[ \sum_{c \in P_{r'}(\{b'_1, b'_2\})} \phi^{sc}(c) \Bigg] \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} I^{sc}_2(\{b'_1, b'_2\}) + \phi^{sc}(\{b'_1\}) + \phi^{sc}(\{b'_2\}).
\end{aligned}
\tag{15}
$$

Therefore, this selection considers interactions between $b_1$ and $b_2$, as well as their Shapley values in the self-context.
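
The decomposition in Eq. (15) can be checked numerically on a toy reward. The sketch below assumes, reading off Eq. (15), that the self-context Shapley value of a patch is f({b}) − f(∅) and that the pairwise self-context interaction is f({b1, b2}) − f({b1}) − f({b2}) + f(∅); `toy_reward` and the patch names are hypothetical.

```python
from itertools import combinations

def toy_reward(S):
    """Hypothetical reward f(S) when only the patches in S are visible."""
    score = 0.0
    if "wheel" in S:
        score += 0.1
    if "frame" in S:
        score += 0.1
    if "wheel" in S and "frame" in S:
        score += 0.6
    return score

def pair_terms(f, b1, b2):
    """Interaction and individual terms of Eq. (15) under the stated assumptions."""
    empty = frozenset()
    phi1 = f(frozenset({b1})) - f(empty)
    phi2 = f(frozenset({b2})) - f(empty)
    interaction = (f(frozenset({b1, b2})) - f(frozenset({b1}))
                   - f(frozenset({b2})) + f(empty))
    return interaction, phi1, phi2

def best_pair(f, patches):
    """Pair maximizing f({b1, b2}) - f(empty), the first line of Eq. (15)."""
    return max(combinations(patches, 2),
               key=lambda pair: f(frozenset(pair)) - f(frozenset()))

PATCHES = ["wheel", "frame", "sky"]
pair = best_pair(toy_reward, PATCHES)
interaction, phi1, phi2 = pair_terms(toy_reward, *pair)
```

Selecting on f directly therefore rewards pairs whose value comes from their combination, not only pairs of individually strong patches.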
[67] … is minimized.

$$
B^r_1 = \arg\min_{B^r \subseteq N} f(N \setminus B^r) - f(N) = \arg\max_{B^r \subseteq N} f(N) - f(N \setminus B^r) = \arg\max_{B^r \subseteq N} \phi^{fc}_d(B^r \mid N), \tag{20}
$$

where

$$
\phi^{fc}_d(B^r \mid N) \overset{\mathrm{def}}{=} P_d(N \setminus B^r \mid N)\,[f(N) - f(N \setminus B^r)]. \tag{21}
$$

In the last equality, we omit $P_d(N \setminus B^r_1 \mid N)$, as it remains constant regardless of the patch selection. The quantity $\phi^{fc}_d$ is known as the Shapley value in full-context ...
[68]

(23) Therefore, at step k = 1 , the selection considers both in- teractions of the r patches (the ﬁrst term) and Shapley val- ues of each patch combination (the second term)

and (22), we get Br 1 = arg max Br ⊆ N φfc d (Br|N ) = arg max Br ⊆ N I fc d,r(Br)    Interaction + r− 1∑ r′=1   ∑ c∈P r′(Br) φfc d (c|N\{ Br\ c})      Shapley values . (23) Therefore, at step k = 1 , the selection considers both in- teractions of the r patches (the ﬁrst term) and Shapley val- ues of each patch combination (the second term). I...

In the next section, we describe specific examples of patch deletion with $r = 1$ and $r = 2$.

B.4. Specific example of patch deletion

We provide specific examples for the cases $r = 1$ and $r = 2$ of the patch deletion described in Appendix B.3.

B.5. Case with r = 1

For step $k = 1$, from Eq. (20), only the highest Shapley value $\phi^{\mathrm{fc}}_d(\{b_1\} \mid N) = n^{-1}[f(N) - f(N \setminus \{b_1\})]$ bec...

and (23), this is formulated as follows:

\[
\begin{aligned}
\{b_1, b_2\} &= \arg\min_{\{b'_1, b'_2\} \subset N} f(N \setminus \{b'_1, b'_2\}) - f(N) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} f(N) - f(N \setminus \{b'_1, b'_2\}) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} \phi^{\mathrm{fc}}_d(\{b'_1, b'_2\} \mid N) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} I^{\mathrm{fc}}_{d,2}(\{b'_1, b'_2\}) + \sum_{r'=1}^{1} \left[ \sum_{c \in P_{r'}(\{b'_1, b'_2\})} \phi^{\mathrm{fc}}_d(c \mid N \setminus \{\{b'_1, b'_2\} \setminus c\}) \right] \\
&= \dots
\end{aligned}
\tag{26}
\]

Therefore, this selection considers interactions between $b_1$ and $b_2$, as well as their Shapley values in the full-context.

In Algorithm 3, get_position($b_i$) returns the diagonal corners $(x^b_1, y^b_1)$ and $(x^b_2, y^b_2)$ of the patch $b_i$ in the image. This process generates a heat map that effectively highlights regions with significant changes in the reward function values.

Algorithm 3 Generation of a heat map based on identified patches

Input: Set of identified patches{b1, . . ....
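Because the body of Algorithm 3 is cut off in this text, the following is only a rough sketch of how a heat map could be built from identified patches, not the authors' procedure. It assumes that each patch rectangle is incremented by the absolute change in reward observed when that patch was identified. The grid-based `get_position` signature, the `initial_reward` argument, and the accumulation rule are all hypothetical.

```python
def get_position(patch_index, grid_cols, patch_w, patch_h):
    """Hypothetical helper: diagonal corners (x1, y1), (x2, y2) of a patch
    on a regular grid, indexed row by row from the top-left."""
    row, col = divmod(patch_index, grid_cols)
    x1, y1 = col * patch_w, row * patch_h
    return (x1, y1), (x1 + patch_w, y1 + patch_h)


def build_heat_map(patches, rewards, initial_reward,
                   height, width, grid_cols, patch_w, patch_h):
    """Accumulate, per pixel, the reward change attributed to each patch.

    patches: patch indices in the order they were identified.
    rewards: reward value observed after each identification step.
    """
    heat = [[0.0] * width for _ in range(height)]
    previous = initial_reward
    for patch, reward in zip(patches, rewards):
        change = abs(reward - previous)  # assumed importance of this step
        (x1, y1), (x2, y2) = get_position(patch, grid_cols, patch_w, patch_h)
        for y in range(y1, min(y2, height)):
            for x in range(x1, min(x2, width)):
                heat[y][x] += change
        previous = reward
    return heat


# Toy 4x4 image split into a 2x2 grid of 2x2-pixel patches.
heat = build_heat_map(
    patches=[3, 0], rewards=[0.6, 0.9], initial_reward=0.0,
    height=4, width=4, grid_cols=2, patch_w=2, patch_h=2,
)
for row in heat:
    print(row)
```

Under this assumed rule, the patch identified first (index 3, bottom-right) receives the larger value because it produced the larger jump in reward.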

with $\alpha \in [0, 1]$ as follows:

\[
f(D(x); (B_t, P_t)) = \max_{(B, P) \in D(x)} \left\{ \mathrm{IoU}(B_t, B) \right\}^{1-\alpha} \cdot \left\{ \frac{P_t \cdot P}{\|P_t\| \, \|P\|} \right\}^{\alpha}.
\tag{30}
\]

As shown in this equation, if $\alpha$ is close to 0, it identifies patches that focus more on the prediction of bounding boxes. If $\alpha$ is close to 1, it identifies patches that focus more on the prediction of class scores. If $\alpha = 0.5$, it identifies p...
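The weighted reward of Eq. (30) can be sketched in a few lines. The box format `(x1, y1, x2, y2)`, the function names, the clamping of negative cosine values to zero, and the value 0.0 returned when the detector outputs nothing are assumptions made for this illustration and are not stated in the text.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(p, q):
    """Cosine similarity of two class-score vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def reward(detections, target_box, target_scores, alpha=0.5):
    """Eq. (30): max over detections of IoU^(1 - alpha) * cosine^alpha.

    detections: iterable of (box, class_scores) pairs, i.e. D(x).
    Returns 0.0 for an empty detection list (an assumption).
    Note that Python evaluates 0.0 ** 0.0 as 1.0, so a factor with
    exponent 0 is ignored even when its base is 0.
    """
    return max(
        (
            iou(target_box, box) ** (1.0 - alpha)
            # Clamp so a fractional power never sees a negative base.
            * max(0.0, cosine_similarity(target_scores, scores)) ** alpha
            for box, scores in detections
        ),
        default=0.0,
    )


target_box, target_scores = (0, 0, 10, 10), [0.0, 1.0]
half_box_wrong_class = [((0, 0, 10, 5), [1.0, 0.0])]
print(reward(half_box_wrong_class, target_box, target_scores, alpha=0.0))
print(reward(half_box_wrong_class, target_box, target_scores, alpha=1.0))
```

In the usage lines, the detection covers half of the target box but predicts an orthogonal class vector. With `alpha=0.0` only localization counts, and with `alpha=1.0` only the class scores count, matching the two limiting cases described above.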

reaches 0.91 after four patches are identified. In contrast, for Faster R-CNN, patches in specific regions are preferentially identified, and the reward value reaches 0.94 after 14 patches are identified. These results indicate that DETR, a transformer-based architecture, recognizes instances in a more compositional manner and with less information com- ...

[1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

[2] Ammar Ahmed, Ali Shariq Imran, Abdul Manaf, Zenun Kastrati, and Sher Muhammad Daudpota. Enhancing wrist abnormality detection with yolo: Analysis of state-of-the-art single-stage detection models. Biomedical Signal Processing and Control, 93:106144, 2024.

[3] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 2015.

[4] Stefan Blücher, Johanna Vielhaben, and Nils Strodthoff. Preddiff: Explanations and interactions from conditional expectations. Artificial Intelligence, 312:103774, 2022.

[5] Alexey Bochkovskiy, Chien-Yao Wang, and H. Liao. Yolov4: Optimal speed and accuracy of object detection. ArXiv, abs/2004.10934, 2020.

[6] Revoti Prasad Bora, Philipp Terhörst, Raymond Veldhuis, Raghavendra Ramachandra, and Kiran Raja. Slice: Stabilized lime for consistent explanations for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10988–10996, 2024.

[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229, 2020.

[8] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847, 2018.

[9] Saurabh Desai and Harish G. Ramaswamy. Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 972–980,

[10] Maximilian Dreyer, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. Revealing hidden context bias in segmentation and object detection through concept-specific explanations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3829–3839, 2023.

[11] Maximilian Dreyer, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. Revealing hidden context bias in segmentation and object detection through concept-specific explanations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3829–3839, 2023.

[12] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis., 88(2):303–338, 2010.

[13] Ruigang Fu, Qingyong Hu, Xiaohu Dong, Yulan Guo, Yinghui Gao, and Biao Li. Axiom-based grad-cam: Towards accurate visualization and explanation of cnns. In British Machine Vision Conference (BMVC), 2020.

[14] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.

[15] Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28:547–565, 1999.

[16] Jindong Gu, Yinchong Yang, and Volker Tresp. Understanding individual decisions of cnns via contrastive backpropagation. In Asian Conference on Computer Vision (ACCV), pages 119–134, 2018.

[17] Chul Gwon and Steven C. Howell. Odsmoothgrad: Generating saliency maps for object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3685–3689, 2023.

[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[19] Neil Jethani, Mukund Sudarshan, Ian Connick Covert, Su-In Lee, and Rajesh Ranganath. FastSHAP: Real-time shapley value estimation. In International Conference on Learning Representations (ICLR), 2022.

[20] Mingqi Jiang, Saeed Khorram, and Li Fuxin. Comparing the decision-making mechanisms by transformers and cnns via explanation methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9546–9555, 2024.

[21] Hyungsik Jung and Youngrock Oh. Towards better explanations of class activation mapping. In IEEE International Conference on Computer Vision (ICCV), pages 1316–1324,

[22] Abdul Hannan Khan, Mohammed Shariq Nawaz, and Andreas Dengel. Localized semantic feature mixers for efficient pedestrian detection in autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5476–5485, 2023.

[23] Michihiro Kuroki and Toshihiko Yamasaki. Bsed: Baseline shapley-based explainable detector. IEEE Access, 12:57959–57973, 2024.

[24] Michihiro Kuroki and Toshihiko Yamasaki. Fast explanation using shapley value for object detection. IEEE Access, 12:31047–31054, 2024.

[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2014.

[26] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.

[27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, 2017.

[28] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37, 2016.

[29] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS), pages 4768–4777,

[30] Ethan H. Nguyen, Haichun Yang, Ruining Deng, Yuzhe Lu, Zheyu Zhu, Joseph T. Roland, Le Lu, Bennett A. Landman, Agnes B. Fogo, and Yuankai Huo. Circle representation for medical object detection. IEEE Transactions on Medical Imaging, 41(3):746–754, 2022.

[31] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In Advances in Neural Information Processing Systems (NeurIPS) Workshop on Autodiff, 2017.

[32] Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. In British Machine Vision Conference (BMVC), page 151, 2018.

[33] Vitali Petsiuk, Rajiv Jain, Varun Manjunatha, Vlad I. Morariu, Ashutosh Mehra, Vicente Ordonez, and Kate Saenko. Black-box explanation of object detectors via saliency maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11443–11452, 2021.

[34] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? In Advances in Neural Information Processing Systems (NeurIPS), pages 12116–12128, 2021.

[35] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, 2017.

[36] Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. arXiv.org, pages 1–6, 2018.

[37] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.

[38] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 91–99, 2015.

[39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.

[40] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.

[41] Lloyd S Shapley. A value for n-person games. In Contributions to the Theory of Games II, pages 307–317. 1953.

[42] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning (ICML), pages 3145–3153, 2017.

[43] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.

[44] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B. Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. CoRR, abs/1706.03825, 2017.

[45] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations (ICLR) workshop track, 2015.

[46] Kosuke Sumiyasu, Kazuhiko Kawamoto, and Hiroshi Kera. Identifying important group of pixels using interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6017–6026, 2024.

[47] Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In Proceedings of the International Conference on Machine Learning (ICML), 2020.

[48] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, 2017.

[49] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9626–9635, 2019.

[50] Van Binh Truong, Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Quoc Khanh Nguyen, and Quoc Hung Cao. Towards better explanations for object detection. In Asian Conference on Computer Vision (ACCV), pages 1385–1400, 2024.

[51] Shikhar Tuli, Ishita Dasgupta, Erin Grant, and Thomas L. Griffiths. Are convolutional neural networks or transformers more like human vision? ArXiv, abs/2105.07197, 2021.

[52] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 111–119, 2020.

[53] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[54] Toshinori Yamauchi. Spatial sensitive grad-cam++: Improved visual explanation for object detectors via weighted combination of gradient map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 8164–8168, 2024.

[55] Toshinori Yamauchi and Masayoshi Ishikawa. Spatial sensitive grad-cam: Visual explanations for object detection by incorporating spatial sensitivity. In IEEE International Conference on Image Processing (ICIP), pages 256–260, 2022.

[56] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901,

[57] Jianming Zhang, Zhe L. Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126:1084–1102, 2018.

[58] Qinglong Zhang, Lu Rao, and Yubin Yang. Group-cam: Group score-weighted visual explanations for deep convolutional networks, 2021.

[59] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[60] Yi Zhang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, and Wentao Liu. When pedestrian detection meets multi-modal learning: Generalist model and benchmark dataset, 2024.

[61] C. Zhao, J. H. Hsiao, and A. B. Chan. Gradient-based instance-specific visual explanations for object specification and object discrimination. IEEE Transactions on Pattern Analysis & Machine Intelligence, 46(09):5967–5985, 2024.

[62] Quan Zheng, Ziwei Wang, Jie Zhou, and Jiwen Lu. Shap-cam: Visual explanations for convolutional neural networks based on shapley value. In Proceedings of the European Conference on Computer Vision (ECCV), pages 459–474,

[63] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[64] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: deformable transformers for end-to-end object detection. In International Conference on Learning Representations (ICLR), 2021.
