Explaining Object Detectors via Collective Contribution of Pixels
Pith reviewed 2026-05-23 08:17 UTC · model grok-4.3
The pith
A game-theoretic method uses Shapley values and interactions to explain object detectors by capturing collective pixel contributions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a game-theoretic approach based on Shapley values and interactions explicitly captures both individual and collective pixel contributions, thereby providing explanations for bounding box localization and class determination that identify important regions more accurately than state-of-the-art methods.
What carries the argument
A cooperative game whose value function is taken directly from the detector output, with Shapley values measuring individual pixel contributions and interaction terms measuring collective contributions of pixel groups.
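To make the two quantities concrete, the following is a minimal sketch that computes exact Shapley values and the pairwise Shapley interaction index for a tiny cooperative game. It uses the standard textbook definitions, not necessarily the paper's exact formulation, and `toy_value` is a hypothetical stand-in for a detector score over three image patches.

```python
from itertools import combinations
from math import factorial


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def shapley_value(v, players, i):
    """Exact Shapley value of player i under value function v."""
    n = len(players)
    others = [p for p in players if p != i]
    total = 0.0
    for S in subsets(others):
        weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
        total += weight * (v(S | {i}) - v(S))
    return total


def pair_interaction(v, players, i, j):
    """Exact Shapley interaction index for the pair {i, j}."""
    n = len(players)
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for S in subsets(others):
        weight = factorial(len(S)) * factorial(n - len(S) - 2) / factorial(n - 1)
        delta = v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
        total += weight * delta
    return total


# Hypothetical toy score: patches 0 and 1 only help together, patch 2 helps alone.
def toy_value(S):
    return (1.0 if {0, 1} <= S else 0.0) + (0.2 if 2 in S else 0.0)


players = [0, 1, 2]
phi = [shapley_value(toy_value, players, p) for p in players]
i01 = pair_interaction(toy_value, players, 0, 1)
i02 = pair_interaction(toy_value, players, 0, 2)
```

In this toy game the pair (0, 1) has a positive interaction while the pair (0, 2) has none, which is the kind of collective effect that per-pixel attribution alone cannot separate. Exact enumeration is exponential in the number of players, so it is only practical for a handful of patches.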
If this is right
- Explanations now cover both localization accuracy and class assignment within the same framework.
- Important regions are identified more accurately than with methods that consider only individual pixel contributions.
- Spurious correlations arising from isolated pixel analysis can be reduced.
- The same machinery applies to both bounding-box regression and classification heads.
Where Pith is reading between the lines
- The approach could be used to audit detectors for reliance on unintended contextual cues in real-world scenes.
- It might guide data-augmentation strategies that strengthen or weaken collective feature interactions.
- Similar game-theoretic accounting could be tested on other dense prediction tasks such as instance segmentation.
Load-bearing premise
The collective contribution of pixels can be faithfully represented by a cooperative game whose value function is defined directly from the detector's output without introducing approximation artifacts that alter the ranking of regions.
What would settle it
An experiment that masks known critical pixel groups and checks whether the detector output change contradicts the importance ranking produced by the Shapley-plus-interaction method.
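One way to run such a check is a deletion curve: remove patches in the order given by an importance ranking and track how quickly the score falls. The sketch below assumes a hypothetical `toy_value` function in place of a real detector and compares two hand-written rankings; it is an illustration of the test, not the paper's evaluation protocol.

```python
def deletion_curve(value_fn, all_patches, ranking):
    """Score after removing the top-k ranked patches, for k = 0..len(ranking)."""
    remaining = set(all_patches)
    curve = [value_fn(frozenset(remaining))]
    for patch in ranking:
        remaining.discard(patch)
        curve.append(value_fn(frozenset(remaining)))
    return curve


def area_under(curve):
    """Trapezoidal area under a curve sampled at unit steps."""
    return sum((a + b) / 2.0 for a, b in zip(curve, curve[1:]))


# Hypothetical toy score: patches 0 and 1 matter jointly, patch 2 alone, patch 3 not at all.
def toy_value(S):
    return (1.0 if {0, 1} <= S else 0.0) + (0.2 if 2 in S else 0.0)


patches = [0, 1, 2, 3]
good_ranking = [0, 1, 2, 3]   # critical patches first
bad_ranking = [3, 2, 1, 0]    # irrelevant patch first

good_area = area_under(deletion_curve(toy_value, patches, good_ranking))
bad_area = area_under(deletion_curve(toy_value, patches, bad_ranking))
```

A ranking that places the truly critical patches first should give the smaller area. If a method's ranking gives a larger area than a baseline on groups known to be critical, that would count against the method.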
Original abstract
Visual explanations for object detectors are crucial for enhancing their reliability. Object detectors identify and localize instances by assessing multiple visual features collectively. When generating explanations, overlooking these collective influences in detections may lead to missing compositional cues or capturing spurious correlations. However, existing methods typically focus solely on individual pixel contributions, neglecting the collective contribution of multiple pixels. To address this limitation, we propose a game-theoretic method based on Shapley values and interactions to explicitly capture both individual and collective pixel contributions. Our method provides explanations for both bounding box localization and class determination, highlighting regions crucial for detection. Extensive experiments demonstrate that the proposed method identifies important regions more accurately than state-of-the-art methods. The code is available at https://github.com/tttt-0814/VX-CODE
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a game-theoretic explanation method for object detectors that applies Shapley values and interaction indices to capture both individual pixel contributions and collective (interaction) contributions to bounding-box localization and class prediction. It argues that existing methods overlook collective influences and presents the new approach as addressing this by defining a cooperative game directly on the detector output. The central claim is that extensive experiments show the method identifies important regions more accurately than state-of-the-art methods, with code released at the provided GitHub link.
Significance. If the experiments are shown to be rigorous and the value-function definition avoids ranking-altering artifacts, the work would usefully extend pixel-attribution techniques to object detection by incorporating established interaction indices, potentially improving faithfulness for compositional detections. The public code release is a positive contribution for reproducibility.
major comments (1)
- [Abstract] Abstract (method paragraph): the central claim that collective pixel contributions are faithfully captured rests on the assumption that the cooperative-game value function (defined directly from detector output) introduces no approximation artifacts that alter region rankings; this assumption is load-bearing for the superiority claim but receives no explicit verification or sensitivity analysis in the provided description.
minor comments (2)
- [Abstract] The abstract states that 'extensive experiments' demonstrate superior accuracy but provides no details on baselines, metrics, statistical controls, error bars, or ablation studies; these should be summarized with quantitative results and controls in the main text or a dedicated experiments section.
- Notation for the interaction indices and the precise definition of the value function v(S) should be introduced with an equation early in the method section to allow readers to assess how collective contributions are operationalized.
Simulated Author's Rebuttal
We thank the referee for the constructive comment and the recommendation for minor revision. We address the point below.
-
Referee: [Abstract] Abstract (method paragraph): the central claim that collective pixel contributions are faithfully captured rests on the assumption that the cooperative-game value function (defined directly from detector output) introduces no approximation artifacts that alter region rankings; this assumption is load-bearing for the superiority claim but receives no explicit verification or sensitivity analysis in the provided description.
Authors: We appreciate the referee's observation. The value function is defined exactly as the detector's raw output (localization and classification scores) on the masked input, with no additional modeling or approximation at that stage; any numerical approximations arise solely from the standard Monte-Carlo sampling used to estimate Shapley values and interaction indices, which is common practice. Nevertheless, we acknowledge that an explicit sensitivity analysis of the value-function definition would strengthen the superiority claim. In the revised manuscript we will add a dedicated paragraph and supplementary experiments that vary the masking strategy and perturbation level, confirming that the resulting region rankings remain stable and do not alter the comparative conclusions. revision: yes
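The Monte-Carlo estimation mentioned in the response is commonly done by permutation sampling. The sketch below is a generic estimator under that assumption, not the authors' implementation; `toy_value`, `num_permutations`, and `seed` are illustrative names.

```python
import random


def sampled_shapley(v, players, num_permutations=2000, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    players = list(players)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = players[:]
        rng.shuffle(order)
        coalition = set()
        previous = v(frozenset(coalition))
        for p in order:
            coalition.add(p)
            current = v(frozenset(coalition))
            totals[p] += current - previous  # marginal gain of p in this ordering
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}


# Hypothetical toy score standing in for the detector output on a masked input.
def toy_value(S):
    return (1.0 if {0, 1} <= S else 0.0) + (0.2 if 2 in S else 0.0)


estimates = sampled_shapley(toy_value, [0, 1, 2])
```

Each sampled ordering distributes the full score among the players, so the estimates always sum to the total value, but the individual estimates carry sampling noise. That noise is one source of the ranking instability a sensitivity analysis would need to quantify.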
Circularity Check
No significant circularity
The paper frames its contribution as a direct application of established Shapley values and interaction indices to define pixel contributions for object detector explanations. No equations or claims in the provided abstract reduce a result to a fitted parameter, self-citation chain, or definitional equivalence with the inputs. The method's value function is defined from the detector output using standard cooperative game concepts, and performance claims rest on experiments rather than any internal renaming or forced prediction. This is the common case of an independent application of prior mathematical tools.
Forward citations
Cited by 1 Pith paper
-
H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers
H-Sets detects higher-order feature interactions in image classifiers via Hessian-guided pair merging and attributes them with IDG-Vis to generate more interpretable saliency maps than existing marginal or coarse methods.
Reference graph
Works this paper leans on
- [1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
- [2] Ammar Ahmed, Ali Shariq Imran, Abdul Manaf, Zenun Kastrati, and Sher Muhammad Daudpota. Enhancing wrist abnormality detection with yolo: Analysis of state-of-the-art single-stage detection models. Biomedical Signal Processing and Control, 93:106144, 2024.
- [3] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 2015.
- [4] Stefan Blücher, Johanna Vielhaben, and Nils Strodthoff. Preddiff: Explanations and interactions from conditional expectations. Artificial Intelligence, 312:103774, 2022.
- [5] Alexey Bochkovskiy, Chien-Yao Wang, and H. Liao. Yolov4: Optimal speed and accuracy of object detection. ArXiv, abs/2004.10934, 2020.
- [6] Revoti Prasad Bora, Philipp Terhörst, Raymond Veldhuis, Raghavendra Ramachandra, and Kiran Raja. Slice: Stabilized lime for consistent explanations for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10988–10996, 2024.
- [7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 213–229, 2020.
- [8] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847, 2018.
- [9]
- [10] Maximilian Dreyer, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. Revealing hidden context bias in segmentation and object detection through concept-specific explanations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3829–3839, 2023.
- [11] Maximilian Dreyer, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. Revealing hidden context bias in segmentation and object detection through concept-specific explanations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3829–3839, 2023.
- [12] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John M. Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis., 88(2):303–338, 2010.
- [13] Ruigang Fu, Qingyong Hu, Xiaohu Dong, Yulan Guo, Yinghui Gao, and Biao Li. Axiom-based grad-cam: Towards accurate visualization and explanation of cnns. In British Machine Vision Conference (BMVC), 2020.
- [14] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.
- [15] Michel Grabisch and Marc Roubens. An axiomatic approach to the concept of interaction among players in cooperative games. International Journal of Game Theory, 28:547–565, 1999.
- [16] Jindong Gu, Yinchong Yang, and Volker Tresp. Understanding individual decisions of cnns via contrastive backpropagation. In Asian Conference on Computer Vision (ACCV), pages 119–134, 2018.
- [17] Chul Gwon and Steven C. Howell. Odsmoothgrad: Generating saliency maps for object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3685–3689, 2023.
- [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
- [19] Neil Jethani, Mukund Sudarshan, Ian Connick Covert, Su-In Lee, and Rajesh Ranganath. FastSHAP: Real-time shapley value estimation. In International Conference on Learning Representations (ICLR), 2022.
- [20] Mingqi Jiang, Saeed Khorram, and Li Fuxin. Comparing the decision-making mechanisms by transformers and cnns via explanation methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9546–9555, 2024.
- [21] Hyungsik Jung and Youngrock Oh. Towards better explanations of class activation mapping. In IEEE International Conference on Computer Vision (ICCV), pages 1316–1324,
- [22] Abdul Hannan Khan, Mohammed Shariq Nawaz, and Andreas Dengel. Localized semantic feature mixers for efficient pedestrian detection in autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5476–5485, 2023.
- [23] Michihiro Kuroki and Toshihiko Yamasaki. Bsed: Baseline shapley-based explainable detector. IEEE Access, 12:57959–57973, 2024.
- [24] Michihiro Kuroki and Toshihiko Yamasaki. Fast explanation using shapley value for object detection. IEEE Access, 12:31047–31054, 2024.
- [25] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2014.
- [26] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.
- [27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, 2017.
- [28] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37, 2016.
- [29] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS), pages 4768–4777,
- [30] Ethan H. Nguyen, Haichun Yang, Ruining Deng, Yuzhe Lu, Zheyu Zhu, Joseph T. Roland, Le Lu, Bennett A. Landman, Agnes B. Fogo, and Yuankai Huo. Circle representation for medical object detection. IEEE Transactions on Medical Imaging, 41(3):746–754, 2022.
- [31] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In Advances in Neural Information Processing Systems (NeurIPS) Workshop on Autodiff, 2017.
- [32] Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. In British Machine Vision Conference (BMVC), page 151, 2018.
- [33] Vitali Petsiuk, Rajiv Jain, Varun Manjunatha, Vlad I. Morariu, Ashutosh Mehra, Vicente Ordonez, and Kate Saenko. Black-box explanation of object detectors via saliency maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11443–11452, 2021.
- [34] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? In Advances in Neural Information Processing Systems (NeurIPS), pages 12116–12128, 2021.
- [35] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, 2017.
- [36] Joseph Redmon and Ali Farhadi. YOLOv3: An Incremental Improvement. arXiv.org, pages 1–6, 2018.
- [37] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.
- [38] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 91–99, 2015.
- [39] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
- [40] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 618–626, 2017.
- [41] Lloyd S Shapley. A value for n-person games. In Contributions to the Theory of Games II, pages 307–317. 1953.
- [42] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning (ICML), pages 3145–3153, 2017.
- [43] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
- [44] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda B. Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise. CoRR, abs/1706.03825, 2017.
- [45] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations (ICLR) workshop track, 2015.
- [46] Kosuke Sumiyasu, Kazuhiko Kawamoto, and Hiroshi Kera. Identifying important group of pixels using interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6017–6026, 2024.
- [47] Mukund Sundararajan and Amir Najmi. The many shapley values for model explanation. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
- [48] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, 2017.
- [49] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 9626–9635, 2019.
- [50] Van Binh Truong, Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Quoc Khanh Nguyen, and Quoc Hung Cao. Towards better explanations for object detection. In Asian Conference on Computer Vision (ACCV), pages 1385–1400, 2024.
- [51]
- [52] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 111–119, 2020.
- [53] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
- [54] Toshinori Yamauchi. Spatial sensitive grad-cam++: Improved visual explanation for object detectors via weighted combination of gradient map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 8164–8168, 2024.
- [55] Toshinori Yamauchi and Masayoshi Ishikawa. Spatial sensitive grad-cam: Visual explanations for object detection by incorporating spatial sensitivity. In IEEE International Conference on Image Processing (ICIP), pages 256–260, 2022.
- [56] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901,
- [57] Jianming Zhang, Zhe L. Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126:1084–1102, 2018.
- [58] Qinglong Zhang, Lu Rao, and Yubin Yang. Group-cam: Group score-weighted visual explanations for deep convolutional networks, 2021.
- [59] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [60] Yi Zhang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, and Wentao Liu. When pedestrian detection meets multi-modal learning: Generalist model and benchmark dataset, 2024.
- [61] C. Zhao, J. H. Hsiao, and A. B. Chan. Gradient-based instance-specific visual explanations for object specification and object discrimination. IEEE Transactions on Pattern Analysis & Machine Intelligence, 46(09):5967–5985, 2024.
- [62] Quan Zheng, Ziwei Wang, Jie Zhou, and Jiwen Lu. Shap-cam: Visual explanations for convolutional neural networks based on shapley value. In Proceedings of the European Conference on Computer Vision (ECCV), pages 459–474,
- [63] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [64] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: deformable transformers for end-to-end object detection. In International Conference on Learning Representations (ICLR), 2021.
Explaining Object Detectors via Collective Contribution of Pixels: Supplementary Material. A. Visual explanations for classification. In t...
- [65] computes pixel-wise feature attributions by propagating relevance from the output layer to the input layer. CPR [16] extends LRP by emphasizing the relevance originating from the target class, calculating the disparity between the relevance from the target class and the average relevance from other classes. CAM-based methods calculate the weights repr...
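- [66] and (8), this is formulated as follows:
\begin{equation}
\begin{aligned}
\{b_1, b_2\} &= \arg\max_{\{b'_1, b'_2\} \subset N} f(\{b'_1, b'_2\}) - f(\emptyset) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} \phi^{\mathrm{sc}}(\{b'_1, b'_2\}) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} I^{\mathrm{sc}}_2(\{b'_1, b'_2\}) + \sum_{r'=1}^{1} \sum_{c \in P_{r'}(\{b'_1, b'_2\})} \phi^{\mathrm{sc}}(c) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} I^{\mathrm{sc}}_2(\{b'_1, b'_2\}) + \phi^{\mathrm{sc}}(\{b'_1\}) + \phi^{\mathrm{sc}}(\{b'_2\}).
\end{aligned}
\tag{15}
\end{equation}
Therefore, this selection considers inte...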
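- [67] is minimized.
\begin{equation}
\begin{aligned}
B^r_1 &= \arg\min_{B^r \subseteq N} f(N \setminus B^r) - f(N) \\
&= \arg\max_{B^r \subseteq N} f(N) - f(N \setminus B^r) \\
&= \arg\max_{B^r \subseteq N} \phi^{\mathrm{fc}}_d(B^r \mid N),
\end{aligned}
\tag{20}
\end{equation}
where
\begin{equation}
\phi^{\mathrm{fc}}_d(B^r \mid N) \overset{\mathrm{def}}{=} P_d(N \setminus B^r \mid N)\,[f(N) - f(N \setminus B^r)]. \tag{21}
\end{equation}
In the last equality, we omit $P_d(N \setminus B^r_1 \mid N)$, as it remains constant regardless of the patch selection. The quantity $\phi^{\mathrm{fc}}_d$ is known as the Shapley value in full-context ...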
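- [68] and (22), we get
\begin{equation}
\begin{aligned}
B^r_1 &= \arg\max_{B^r \subseteq N} \phi^{\mathrm{fc}}_d(B^r \mid N) \\
&= \arg\max_{B^r \subseteq N} \underbrace{I^{\mathrm{fc}}_{d,r}(B^r)}_{\text{Interaction}} + \underbrace{\sum_{r'=1}^{r-1} \sum_{c \in P_{r'}(B^r)} \phi^{\mathrm{fc}}_d(c \mid N \setminus \{B^r \setminus c\})}_{\text{Shapley values}}.
\end{aligned}
\tag{23}
\end{equation}
Therefore, at step k = 1, the selection considers both interactions of the r patches (the first term) and Shapley values of each patch combination (the second term). I...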
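The selection rule in Eq. (20) can be sketched as a greedy search that, at each step, removes the group of r patches whose deletion lowers the reward most. This is a minimal illustration under stated assumptions: the source only spells out step k = 1, so repeating the same rule on the remaining patches is an assumption, and `greedy_patch_deletion` and `toy_value` are hypothetical names rather than the paper's code.

```python
from itertools import combinations


def greedy_patch_deletion(f, patches, r=1, steps=None):
    """Repeatedly remove the r-patch group that minimizes f on what remains."""
    remaining = set(patches)
    removed = []
    while len(remaining) >= r and (steps is None or len(removed) < steps):
        # Patch ids are assumed sortable so that ties break deterministically.
        best = min(
            combinations(sorted(remaining), r),
            key=lambda group: f(frozenset(remaining - set(group))),
        )
        removed.append(best)
        remaining -= set(best)
    return removed


# Hypothetical toy reward: patches 0 and 1 matter jointly, patch 2 alone.
def toy_value(S):
    return (1.0 if {0, 1} <= S else 0.0) + (0.2 if 2 in S else 0.0)


single = greedy_patch_deletion(toy_value, [0, 1, 2, 3], r=1, steps=2)
paired = greedy_patch_deletion(toy_value, [0, 1, 2, 3], r=2, steps=1)
```

With r = 2 the search evaluates pairs jointly, so it can pick a pair whose combined removal is most damaging even when neither patch ranks first alone. The cost grows with the number of r-subsets, which is why small r is the practical regime.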
- [69] In the next section, we describe specific examples in patch deletion with r = 1 and r = 2.
B.4. Specific example of patch deletion
We provide specific examples for cases with r = 1 and r = 2 in patch deletion described in Appendix B.3.
B.5. Case with r = 1
For step k = 1, from Eq. (20), only the highest Shapley value $\phi^{\mathrm{fc}}_d(\{b_1\} \mid N) = n^{-1}[f(N) - f(N \setminus \{b_1\})]$ bec...
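- [70] and (23), this is formulated as follows:
\begin{equation*}
\begin{aligned}
\{b_1, b_2\} &= \arg\min_{\{b'_1, b'_2\} \subset N} f(N \setminus \{b'_1, b'_2\}) - f(N) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} f(N) - f(N \setminus \{b'_1, b'_2\}) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} \phi^{\mathrm{fc}}_d(\{b'_1, b'_2\} \mid N) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} I^{\mathrm{fc}}_{d,2}(\{b'_1, b'_2\}) + \sum_{r'=1}^{1} \sum_{c \in P_{r'}(\{b'_1, b'_2\})} \phi^{\mathrm{fc}}_d(c \mid N \setminus \{\{b'_1, b'_2\} \setminus c\}) \\
&= \arg\max_{\{b'_1, b'_2\} \subset N} I^{\mathrm{fc}}_{d} \ldots
\end{aligned}
\end{equation*}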
- [71] In Algorithm 3, get_position($b_i$) returns the diagonal corners $(x^b_1, y^b_1)$ and $(x^b_2, y^b_2)$ of the patch $b_i$ in the image. This process generates a heat map that effectively highlights regions with significant changes in the reward function values.
Algorithm 3: Generation of a heat map based on identified patches
Input: Set of identified patches {b_1, ...
- [72] with $\alpha \in [0, 1]$ as follows:
\begin{equation}
f(D(x); (B_t, P_t)) = \max_{(B, P) \in D(x)} \left\{ \mathrm{IoU}(B_t, B) \right\}^{1-\alpha} \cdot \left\{ \frac{P_t \cdot P}{\|P_t\| \|P\|} \right\}^{\alpha}. \tag{30}
\end{equation}
As shown in this equation, if α is close to 0, it identifies patches that focus more on the prediction of bounding boxes. If α is close to 1, it identifies patches that focus more on the prediction of class scores. If α = 0.5, it identifies p...
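The reward in Eq. (30) can be sketched directly in code. The sketch below assumes boxes are given as (x1, y1, x2, y2) tuples and class scores as plain lists. The names `iou`, `cosine`, and `reward` are illustrative and not taken from the released code.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(p, q):
    """Cosine similarity of two class-score vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def reward(detections, target_box, target_probs, alpha=0.5):
    """Best IoU^(1 - alpha) * cosine^alpha over all (box, probs) detections."""
    best = 0.0
    for box, probs in detections:
        score = iou(target_box, box) ** (1.0 - alpha) * cosine(target_probs, probs) ** alpha
        best = max(best, score)
    return best


target_box, target_probs = (0, 0, 10, 10), [0.9, 0.1]
shifted = ((5, 5, 15, 15), [0.9, 0.1])   # same class scores, shifted box
exact = ((0, 0, 10, 10), [0.9, 0.1])     # identical to the target

box_only = reward([shifted], target_box, target_probs, alpha=0.0)
class_only = reward([shifted], target_box, target_probs, alpha=1.0)
matched = reward([shifted, exact], target_box, target_probs, alpha=0.5)
```

Setting `alpha=0.0` reduces the reward to the best IoU, and `alpha=1.0` reduces it to the best class-score similarity, which matches the interpretation given in the text. Note that Python evaluates `0.0 ** 0.0` as 1.0, so a zero-IoU detection still scores on class similarity when `alpha=1.0`.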
- [73] reaches 0.91 after four patches are identified. In contrast, for Faster R-CNN, patches in specific regions are preferentially identified, and the reward value reaches 0.94 after 14 patches are identified. These results indicate that DETR, a transformer-based architecture, recognizes instances in a more compositional manner and with less information com- ...