arxiv: 2412.00666 · v4 · pith:GDR2PIIHnew · submitted 2024-12-01 · 💻 cs.CV

Explaining Object Detectors via Collective Contribution of Pixels

Pith reviewed 2026-05-23 08:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords object detection · explainable AI · Shapley values · visual explanations · pixel contributions · game theory · cooperative game

The pith

A game-theoretic method uses Shapley values and interactions to explain object detectors by capturing collective pixel contributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a method to explain object detectors that accounts for how groups of pixels jointly influence detections, rather than examining pixels in isolation. Existing approaches overlook these collective effects and can miss compositional cues or highlight spurious correlations instead. The new technique models the detector's output as a cooperative game and applies Shapley values together with interaction terms to quantify both single-pixel and multi-pixel importance. It generates explanations for both bounding-box localization and class prediction. Experiments indicate that the resulting importance maps align more closely with truly influential regions than those from prior methods.

Core claim

The central claim is that a game-theoretic approach based on Shapley values and interactions explicitly captures both individual and collective pixel contributions, thereby providing explanations for bounding box localization and class determination that identify important regions more accurately than state-of-the-art methods.

What carries the argument

A cooperative game whose value function is taken directly from the detector output, with Shapley values measuring individual pixel contributions and interaction terms measuring collective contributions of pixel groups.
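To make the cooperative-game framing concrete, here is a minimal sketch on a toy game. Everything in it is an assumption made for illustration: the three "patches", the hand-written value function `toy_value` (a stand-in for a detector score on a masked image), and the use of the pairwise Shapley interaction index of Grabisch and Roubens. The paper's actual value function and interaction definition may differ.

```python
from itertools import combinations
from math import factorial

# Hypothetical toy game: three image patches, not a real detector.
PLAYERS = (0, 1, 2)
SOLO = {0: 0.2, 1: 0.1, 2: 0.05}  # assumed stand-alone contributions
SYNERGY = 0.4                     # assumed bonus when patches 0 and 1 are both visible


def toy_value(coalition):
    """Stand-in for a detector score when only `coalition` patches are unmasked."""
    s = frozenset(coalition)
    bonus = SYNERGY if {0, 1} <= s else 0.0
    return sum(SOLO[i] for i in s) + bonus


def subsets(items):
    """Yield every subset of `items` as a tuple."""
    items = tuple(items)
    for r in range(len(items) + 1):
        yield from combinations(items, r)


def shapley_value(i, value=toy_value, players=PLAYERS):
    """Exact Shapley value of player i: weighted average of its marginal gains."""
    n = len(players)
    others = [p for p in players if p != i]
    total = 0.0
    for s in subsets(others):
        weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
        total += weight * (value(s + (i,)) - value(s))
    return total


def pair_interaction(i, j, value=toy_value, players=PLAYERS):
    """Exact pairwise Shapley interaction index for players i and j."""
    n = len(players)
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for s in subsets(others):
        weight = factorial(len(s)) * factorial(n - len(s) - 2) / factorial(n - 1)
        # Joint gain of adding both players minus the gains of adding each alone.
        delta = value(s + (i, j)) - value(s + (i,)) - value(s + (j,)) + value(s)
        total += weight * delta
    return total


if __name__ == "__main__":
    for p in PLAYERS:
        print("shapley", p, round(shapley_value(p), 4))
    print("interaction(0, 1)", round(pair_interaction(0, 1), 4))
```

In this toy game the Shapley values split the synergy equally between patches 0 and 1, while the pairwise index attributes it to the pair as a unit. That difference is the kind of collective effect that an individual-pixel attribution cannot express. Exact enumeration is exponential in the number of players, so it only serves as an illustration at this size.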

If this is right

  • Explanations cover both bounding-box localization and class determination within the same framework.
  • Important regions are identified more accurately than with methods that consider only individual pixel contributions.
  • Spurious correlations arising from isolated pixel analysis can be reduced.
  • The same machinery applies to both bounding-box regression and classification heads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be used to audit detectors for reliance on unintended contextual cues in real-world scenes.
  • It might guide data-augmentation strategies that strengthen or weaken collective feature interactions.
  • Similar game-theoretic accounting could be tested on other dense prediction tasks such as instance segmentation.

Load-bearing premise

The collective contribution of pixels can be faithfully represented by a cooperative game whose value function is defined directly from the detector's output without introducing approximation artifacts that alter the ranking of regions.

What would settle it

An experiment that masks known critical pixel groups and checks whether the detector output change contradicts the importance ranking produced by the Shapley-plus-interaction method.
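A check of this kind can be sketched as a deletion test. The code below is an illustrative assumption, not the paper's evaluation protocol: `toy_value` is a hand-written stand-in for a detector score on a masked image, and the two rankings are made up. The idea is to mask patches in ranked order and watch how quickly the score falls.

```python
# Hypothetical toy game standing in for a detector score on a masked image.
SOLO = {0: 0.2, 1: 0.1, 2: 0.05, 3: 0.0}  # assumed stand-alone contributions
SYNERGY = 0.4                             # assumed bonus when patches 0 and 1 are both visible


def toy_value(visible):
    """Score when only the `visible` patches are left unmasked."""
    s = set(visible)
    return sum(SOLO[i] for i in s) + (SYNERGY if {0, 1} <= s else 0.0)


def deletion_curve(ranking, value=toy_value):
    """Scores after masking the top-k ranked patches, for k = 0..len(ranking)."""
    remaining = list(ranking)
    curve = [value(remaining)]
    for patch in ranking:
        remaining.remove(patch)
        curve.append(value(remaining))
    return curve


def mean_score(curve):
    """Average height of the curve; lower means the score collapsed sooner."""
    return sum(curve) / len(curve)


claimed = [0, 1, 2, 3]          # ranking an explanation method might output
reversed_order = claimed[::-1]  # deliberately poor ranking for comparison

if __name__ == "__main__":
    print("claimed :", deletion_curve(claimed))
    print("reversed:", deletion_curve(reversed_order))
```

If the claimed ranking is faithful, its deletion curve should fall faster than that of a reversed or random ordering. A claimed ranking whose curve does not fall faster would be the kind of contradiction described above.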

Figures

Figures reproduced from arXiv: 2412.00666 by Hiroshi Kera, Kazuhiko Kawamoto, Toshinori Yamauchi.

Figure 1: (a) Generated heat maps are shown for ODAM, SSGrad- …
Figure 2: Overview of VX-CODE. The input image is divided …
Figure 3: Comparison of visual explanations generated by ea…
Figure 4: Comparison of identified patches and generated hea…
Figure 5: Comparison of visualizations generated by Grad …
Figure 6: Comparison of visualizations for bounding box gen…
Original abstract

Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code is available at https://github.com/tttt-0814/VX-CODE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a game-theoretic explanation method for object detectors that applies Shapley values and interaction indices to capture both individual pixel contributions and collective (interaction) contributions to bounding-box localization and class prediction. It argues that existing methods overlook collective influences and presents the new approach as addressing this by defining a cooperative game directly on the detector output. The central claim is that extensive experiments show the method identifies important regions more accurately than state-of-the-art methods, with code released at the provided GitHub link.

Significance. If the experiments are shown to be rigorous and the value-function definition avoids ranking-altering artifacts, the work would usefully extend pixel-attribution techniques to object detection by incorporating established interaction indices, potentially improving faithfulness for compositional detections. The public code release is a positive contribution for reproducibility.

major comments (1)
  1. [Abstract] Abstract (method paragraph): the central claim that collective pixel contributions are faithfully captured rests on the assumption that the cooperative-game value function (defined directly from detector output) introduces no approximation artifacts that alter region rankings; this assumption is load-bearing for the superiority claim but receives no explicit verification or sensitivity analysis in the provided description.
minor comments (2)
  1. [Abstract] The abstract states that 'extensive experiments' demonstrate superior accuracy but provides no details on baselines, metrics, statistical controls, error bars, or ablation studies; these should be summarized with quantitative results and controls in the main text or a dedicated experiments section.
  2. Notation for the interaction indices and the precise definition of the value function v(S) should be introduced with an equation early in the method section to allow readers to assess how collective contributions are operationalized.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment and the recommendation for minor revision. We address the point below.

  1. Referee: [Abstract] Abstract (method paragraph): the central claim that collective pixel contributions are faithfully captured rests on the assumption that the cooperative-game value function (defined directly from detector output) introduces no approximation artifacts that alter region rankings; this assumption is load-bearing for the superiority claim but receives no explicit verification or sensitivity analysis in the provided description.

    Authors: We appreciate the referee's observation. The value function is defined exactly as the detector's raw output (localization and classification scores) on the masked input, with no additional modeling or approximation at that stage; any numerical approximations arise solely from the standard Monte-Carlo sampling used to estimate Shapley values and interaction indices, which is common practice. Nevertheless, we acknowledge that an explicit sensitivity analysis of the value-function definition would strengthen the superiority claim. In the revised manuscript we will add a dedicated paragraph and supplementary experiments that vary the masking strategy and perturbation level, confirming that the resulting region rankings remain stable and do not alter the comparative conclusions.

    revision: yes
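The Monte-Carlo sampling mentioned in the response can be illustrated with generic permutation sampling. This is a minimal sketch under stated assumptions: `toy_value` is a hand-written stand-in for a detector score on a masked input, and `shapley_permutation_estimate` is a hypothetical helper name. It is not claimed to match the estimator used in the paper or its released code.

```python
import random

# Hypothetical toy game: three image patches, not a real detector.
PLAYERS = (0, 1, 2)
SOLO = {0: 0.2, 1: 0.1, 2: 0.05}  # assumed stand-alone contributions
SYNERGY = 0.4                     # assumed bonus when patches 0 and 1 are both visible


def toy_value(visible):
    """Stand-in for a detector score when only `visible` patches are unmasked."""
    s = set(visible)
    return sum(SOLO[i] for i in s) + (SYNERGY if {0, 1} <= s else 0.0)


def shapley_permutation_estimate(value, players, num_samples, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_samples):
        order = list(players)
        rng.shuffle(order)
        visible = []
        previous = value(visible)
        for p in order:
            # Reveal patch p and credit it with the resulting change in score.
            visible.append(p)
            current = value(visible)
            totals[p] += current - previous
            previous = current
    return {p: totals[p] / num_samples for p in players}


estimates = shapley_permutation_estimate(toy_value, PLAYERS, num_samples=4000)

if __name__ == "__main__":
    print(estimates)
```

Each sampled ordering telescopes to the full score, so the estimates always sum to the value of the full image minus the value of the empty one. What varies with the sample count is how that total is split across patches, which is where ranking instability could enter and why a sensitivity analysis is a reasonable request.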

Circularity Check

0 steps flagged

No significant circularity

The paper frames its contribution as a direct application of established Shapley values and interaction indices to define pixel contributions for object detector explanations. No equations or claims in the provided abstract reduce a result to a fitted parameter, self-citation chain, or definitional equivalence with the inputs. The method's value function is defined from the detector output using standard cooperative game concepts, and performance claims rest on experiments rather than any internal renaming or forced prediction. This is the common case of an independent application of prior mathematical tools.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The method inherits standard Shapley axioms (efficiency, symmetry, dummy player, additivity) and the modeling choice that detector output can serve as the characteristic function.
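For reference, the inherited axioms can be written out in their standard textbook form. The notation is an assumption of this sketch and may not match the paper's: N is the set of players (pixels or patches), n = |N|, and v is the characteristic function, assumed here to be the detector output on the input with only the players in S unmasked. The dummy-player axiom is given in its null-player form.

```latex
% Shapley value of player i
\[
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,(n - |S| - 1)!}{n!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr]
\]

% Efficiency: contributions sum to the total gain
\[
\sum_{i \in N} \phi_i(v) = v(N) - v(\emptyset)
\]

% Symmetry: interchangeable players receive equal value
\[
v(S \cup \{i\}) = v(S \cup \{j\}) \;\; \forall S \subseteq N \setminus \{i, j\}
\;\Longrightarrow\; \phi_i(v) = \phi_j(v)
\]

% Dummy (null) player: a player that never changes the value receives zero
\[
v(S \cup \{i\}) = v(S) \;\; \forall S \subseteq N \setminus \{i\}
\;\Longrightarrow\; \phi_i(v) = 0
\]

% Additivity: the value of a sum of games is the sum of the values
\[
\phi_i(v + w) = \phi_i(v) + \phi_i(w)
\]
```

These properties hold for any choice of v. Whether a detector score on masked inputs is a meaningful v is the separate modeling choice flagged above.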

pith-pipeline@v0.9.0 · 5655 in / 1009 out tokens · 21756 ms · 2026-05-23T08:17:39.307079+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    H-Sets detects higher-order feature interactions in image classifiers via Hessian-guided pair merging and attributes them with IDG-Vis to generate more interpretable saliency maps than existing marginal or coarse methods.
