i-WiViG: Interpretable Window Vision GNN

Adrian Höhl; Dario Oliveira; Dmitry Kangin; Ivica Obadic; Plamen P Angelov; Xiao Xiang Zhu

arxiv: 2503.08321 · v2 · submitted 2025-03-11 · 💻 cs.CV

Pith reviewed 2026-05-23 00:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords interpretable vision GNN · subgraph explanations · sparse attention bottleneck · disjoint local windows · scene classification · remote sensing imagery · texture bias

The pith

Constraining vision GNN nodes to disjoint local windows plus a learnable sparse attention bottleneck produces semantic subgraph explanations while matching black-box accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces i-WiViG to give vision graph neural networks an inherent way to explain their predictions. It rests on two postulates: each node sees only a disjoint local window of the image, and a sparse attention bottleneck selects the relevant interactions among those windows. The resulting subgraphs are claimed to be semantic, intuitive, and faithful to the model's reasoning. Experiments on scene classification and regression, including remote-sensing imagery, show competitive performance even when datasets have strong texture bias.

Core claim

By constraining graph nodes' receptive fields to disjoint local windows and inserting an inherently interpretable graph bottleneck with learnable sparse attention, the model identifies the relevant interactions among local image windows; the identified subgraphs therefore deliver semantic, intuitive, and faithful explanations, and the overall accuracy remains competitive with black-box vision GNNs on both natural and remote-sensing imagery even under strong texture bias.

What carries the argument

The inherently interpretable graph bottleneck with learnable sparse attention that selects relevant interactions among nodes whose receptive fields are restricted to disjoint local image windows.
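
The two ingredients named above can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation (that lives in the repository linked in the abstract). The function names, the scalar node features, and the sigmoid gating of edge logits are all assumptions chosen for illustration.

```python
import math


def partition_into_windows(height, width, window):
    """Assign every pixel to exactly one non-overlapping window.

    Returns a dict mapping a window index (row, col) to its pixel list.
    """
    if height % window or width % window:
        raise ValueError("image size must be divisible by the window size")
    windows = {}
    for r in range(height):
        for c in range(width):
            windows.setdefault((r // window, c // window), []).append((r, c))
    return windows


def edge_attention(edge_logits):
    """Squash per-edge logits into (0, 1) weights (assumed sigmoid gate)."""
    return {edge: 1.0 / (1.0 + math.exp(-logit))
            for edge, logit in edge_logits.items()}


def aggregate(node_features, attention):
    """Sum neighbour features into each target node, scaled by edge weight."""
    out = {node: 0.0 for node in node_features}
    for (src, dst), weight in attention.items():
        out[dst] += weight * node_features[src]
    return out


# Toy example: an 8x8 image split into four 4x4 windows (the graph nodes).
windows = partition_into_windows(8, 8, 4)
features = {node: float(i + 1) for i, node in enumerate(sorted(windows))}

# Hypothetical logits: one strongly weighted edge, one nearly switched off.
logits = {((0, 0), (1, 1)): 4.0, ((0, 1), (1, 1)): -4.0}
attention = edge_attention(logits)
messages = aggregate(features, attention)
```

Because each pixel belongs to exactly one window, an edge in this toy graph connects two non-overlapping image regions. Edges whose weight is close to zero contribute almost nothing to `messages`, which is the intuition behind reading the high-weight edges as the explanation.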

If this is right

The identified subgraphs provide semantic, intuitive, and faithful explanations of the model's reasoning.
Accuracy remains competitive with black-box counterparts on scene classification and regression tasks.
Performance holds on datasets that exhibit strong texture bias.
The same architecture works for both natural imagery and remote-sensing imagery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The window-plus-bottleneck design could be tested on tasks that require explicit spatial reasoning, such as object counting or layout verification.
Because explanations are produced by the same sparse attention used for prediction, the subgraphs offer a direct route to auditing whether the model relies on expected spatial relations.
The method's restriction to local windows suggests a natural way to compare how different window sizes affect the balance between local texture and global context in the learned explanations.

Load-bearing premise

Limiting each node's receptive field to a disjoint local window and routing interactions through a learnable sparse attention bottleneck will surface interactions that are both faithful to the model's internal computation and semantically meaningful to humans.

What would settle it

The faithfulness claim would be refuted by a controlled test in which the subgraphs highlighted by the bottleneck do not align with the image regions that actually drive the output of an otherwise identical black-box GNN on the same input.
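
One common way to operationalise such a test is an edge insertion curve, similar in spirit to the evaluation that the Figure 5 caption describes: edges are added back in order of importance and the model score is tracked. The sketch below is a minimal illustration under stated assumptions, not the paper's evaluation code. `toy_predict`, `true_effect`, and the importance values are hypothetical stand-ins for a real model and its attention weights.

```python
def insertion_curve(importance, predict, most_important_first=True):
    """Add edges one at a time in importance order and record the score."""
    order = sorted(importance, key=importance.get,
                   reverse=most_important_first)
    kept, scores = [], [predict([])]
    for edge in order:
        kept.append(edge)
        scores.append(predict(list(kept)))
    return scores


def mean_curve_height(scores):
    """Trapezoidal area under the curve, normalised by the number of steps."""
    if len(scores) < 2:
        raise ValueError("need at least two points on the curve")
    area = sum((a + b) / 2.0 for a, b in zip(scores, scores[1:]))
    return area / (len(scores) - 1)


# Hypothetical edge importances, e.g. read off an attention bottleneck.
importance = {"a": 0.9, "b": 0.5, "c": 0.1}

# Stand-in model whose score really is driven mostly by edges "a" and "b".
true_effect = {"a": 0.6, "b": 0.3, "c": 0.1}


def toy_predict(edges):
    return sum(true_effect[e] for e in edges)


top_first = insertion_curve(importance, toy_predict, True)
bottom_first = insertion_curve(importance, toy_predict, False)
gap = mean_curve_height(top_first) - mean_curve_height(bottom_first)
```

A clearly positive `gap` means that adding the highest-ranked edges first recovers the score faster than adding the lowest-ranked edges first, which is what a faithful ranking should produce. A gap near zero would suggest that the ranking carries little information about what drives the prediction.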

Figures

Figures reproduced from arXiv: 2503.08321 by Adrian Höhl, Dario Oliveira, Dmitry Kangin, Ivica Obadic, Plamen P Angelov, Xiao Xiang Zhu.

**Figure 1.** VisionGNN approaches visualized in the top row yield large and overlapping receptive fields for the graph nodes, which limits …

**Figure 2.** i-WiViG is split into several steps: (1) per-patch rep…

**Figure 3.** The relevant subgraphs within the top-5 edge importance percentile for the predictions of the i-WiViG model on examples of the classes Bridge (left), Airplane (middle), and Medium Residential Area (right) in the NWPU-RESISC45 dataset.

**Figure 4.** Examples of identified subgraphs containing edges within the top-5 importance percentile for the predictions of the i-WiViG model on examples of high (left), medium (centre), and low (right) liveability areas from the Liveability dataset.

**Figure 5.** RESISC45 edge attribution evaluation after an incremental addition of the edges with highest importance (blue curve) and the edges with lowest importance (orange curve).

**Figure 7.** Quantitative evaluation of the explanation quality of our method against the other Vision GNN benchmarks on the NWPU-RESISC45 scene classification dataset. The left plot visualizes the explanation infidelity, while the right plot depicts the explanation sparsity for the post-hoc attributions computed with the Integrated Gradients (IG) and Occlusion methods. The arrows indicate the preferred directions for th…

**Figure 8.** Liveability edge attribution evaluation when using r = 0.5 for model training after an incremental addition of the edges with highest importance (blue curve) and the edges with lowest importance (orange curve).

**Figure 9.** Quantitative evaluation of the explanation quality of our method against the other Vision GNN benchmarks on the Liveability regression dataset. The left plot visualizes the explanation infidelity, while the right plot depicts the explanation sparsity for the post-hoc attributions computed with the Integrated Gradients (IG) and Occlusion methods. The arrows indicate the preferred directions for the metrics, i…
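
The captions for Figures 3 and 4 refer to subgraphs made of edges "within the top-5 importance percentile". A minimal sketch of one plausible reading, keeping the 5% of edges with the highest importance scores, is given below. That reading, the function name `top_percentile_edges`, and the toy scores are assumptions and are not taken from the paper.

```python
import math


def top_percentile_edges(importance, percent=5.0):
    """Return the edges whose importance ranks in the top `percent` percent."""
    if not 0.0 < percent <= 100.0:
        raise ValueError("percent must lie in (0, 100]")
    ranked = sorted(importance, key=importance.get, reverse=True)
    # Always keep at least one edge so the explanation is never empty.
    keep = max(1, math.ceil(len(ranked) * percent / 100.0))
    return ranked[:keep]


# Hypothetical scores for 40 edges between window nodes.
scores = {(i, i + 1): i / 40.0 for i in range(40)}
subgraph = top_percentile_edges(scores, percent=5.0)
```

With 40 edges and a 5% cut-off this keeps the two highest-scoring edges. Rounding up and keeping at least one edge is a design choice made here so that small graphs still yield a non-empty subgraph.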

Original abstract

Vision graph neural networks have emerged as a popular approach for modeling the global and spatial context for image recognition. However, a significant drawback of these methods is that they do not offer an inherent interpretation of the relevant spatial interactions for their prediction. We address this problem by introducing i-WiViG, an approach that enables interpretable model reasoning based on a sparse subgraph in the image. i-WiViG is based on two key postulates: 1) constraining the graph nodes' receptive field to disjoint local windows in the image, and 2) an inherently interpretable graph bottleneck with learnable sparse attention that identifies the relevant interactions among the local image windows. We evaluate our approach on both scene classification and regression tasks using natural and remote sensing imagery. Our results, supported by quantitative and qualitative evidence, demonstrate that the method delivers semantic, intuitive, and faithful explanations through the identified subgraphs. Furthermore, extensive experiments confirm that it achieves competitive performance to its black-box counterparts, even on datasets exhibiting strong texture bias. The implementation is available on https://github.com/zhu-xlab/i-WiViG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

i-WiViG adds disjoint windows and a learnable sparse attention bottleneck to vision GNNs for subgraph explanations, but the faithfulness of those subgraphs to the model's actual reasoning lacks direct quantitative checks.

The main takeaway is that the paper builds a vision GNN where nodes are restricted to disjoint local windows and cross-window interactions are forced through a sparse, learnable attention layer that surfaces a subgraph for explanation. The second point is that the evidence tying those subgraphs to faithful model reasoning stays mostly at the level of visualizations and competitive accuracy numbers rather than targeted tests.

Referee Report

3 major / 2 minor

Summary. The paper introduces i-WiViG, a vision GNN for image recognition that enforces interpretability via two architectural postulates: (1) restricting each node's receptive field to disjoint local windows and (2) routing interactions through a learnable sparse-attention bottleneck that extracts a sparse subgraph. The central claims are that the resulting subgraphs yield semantic, intuitive, and faithful explanations of the model's reasoning and that the model attains competitive accuracy with black-box counterparts on scene classification and regression tasks (natural and remote-sensing imagery), even under strong texture bias.

Significance. If the faithfulness claim holds, the work would supply a concrete architectural route to inherently interpretable vision GNNs without post-hoc explanation modules. The public code release is a clear strength. At present, however, the interpretability argument rests on qualitative visualizations and accuracy parity rather than quantitative faithfulness tests, so the advance over existing windowed or attention-based GNNs remains provisional.

major comments (3)

[Abstract] Abstract: the assertion that the subgraphs are 'faithful' to the model's internal computation is not accompanied by any quantitative faithfulness metric (subgraph deletion AUC, gradient correlation, or comparison against a non-bottleneck ablation on the identical architecture). Only competitive accuracy and qualitative evidence are referenced.
[Abstract] Abstract (postulate 2): the claim that the learnable sparse attention bottleneck 'identifies the relevant interactions' is load-bearing for the interpretability contribution, yet no ablation is described that isolates the bottleneck's contribution to either accuracy or explanation quality versus a dense or non-learnable attention baseline.
[Abstract] Abstract: the statement of 'quantitative and qualitative evidence' and 'extensive experiments' is unsupported by any reported error bars, dataset statistics, baseline tables, or ablation details in the provided summary, preventing verification that performance remains competitive under the stated texture-bias condition.

minor comments (2)

[Abstract] Abstract: 'competitive performance to its black-box counterparts' should be accompanied by the specific baselines and metrics used.
[Abstract] Abstract: the GitHub link is welcome, but the manuscript should state the exact commit or release tag used for the reported results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the presentation of our interpretability claims. We respond to each major comment below.

Referee: [Abstract] Abstract: the assertion that the subgraphs are 'faithful' to the model's internal computation is not accompanied by any quantitative faithfulness metric (subgraph deletion AUC, gradient correlation, or comparison against a non-bottleneck ablation on the identical architecture). Only competitive accuracy and qualitative evidence are referenced.

Authors: The faithfulness claim follows directly from the architecture: the sparse attention bottleneck is the sole pathway from input windows to the classifier, so the extracted subgraph is the model's computation by design (unlike post-hoc methods). The manuscript supports this with qualitative semantic analysis. We will revise the abstract to qualify the claim as 'architecturally faithful' and reference the relevant sections, without adding new quantitative experiments at this stage.

revision: partial

Referee: [Abstract] Abstract (postulate 2): the claim that the learnable sparse attention bottleneck 'identifies the relevant interactions' is load-bearing for the interpretability contribution, yet no ablation is described that isolates the bottleneck's contribution to either accuracy or explanation quality versus a dense or non-learnable attention baseline.

Authors: Ablation studies isolating the learnable sparse attention (versus dense and non-learnable baselines) appear in the experiments section and demonstrate its impact on both accuracy and subgraph quality. The abstract condenses these results. We will update the abstract to explicitly reference the ablation studies supporting postulate 2.

revision: yes

Referee: [Abstract] Abstract: the statement of 'quantitative and qualitative evidence' and 'extensive experiments' is unsupported by any reported error bars, dataset statistics, baseline tables, or ablation details in the provided summary, preventing verification that performance remains competitive under the stated texture-bias condition.

Authors: The abstract is a concise overview; the full manuscript contains the requested tables (with error bars), dataset statistics, baselines, and texture-bias ablations. We will revise the abstract to more precisely describe the evidence types presented in the paper.

revision: yes

Circularity Check

0 steps flagged

No circularity; architecture and claims are self-contained

The paper defines i-WiViG via two explicit architectural postulates (disjoint windows + sparse attention bottleneck) that are design choices, not derived quantities. Performance is shown via direct empirical comparison to black-box baselines on multiple datasets; explanations rest on qualitative subgraph visualizations rather than any equation that reduces a result to a fitted parameter defined in terms of itself. No self-citation chain, uniqueness theorem, or ansatz smuggling is used to justify the central claims. The derivation chain consists of standard GNN operations plus the stated constraints, with no step that is equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions stated as key postulates; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: Constraining the graph nodes' receptive field to disjoint local windows preserves the spatial interactions needed for accurate prediction.
First key postulate listed in the abstract.
domain assumption: An inherently interpretable graph bottleneck with learnable sparse attention identifies the relevant interactions among the local image windows.
Second key postulate listed in the abstract.

[1] [1]

Sanity checks for saliency maps

Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfel- low, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. Advances in neural information processing systems, 31, 2018. 1, 3

work page 2018

[2] [2]

Approximating cnns with bag-of-local-features models works surprisingly well on imagenet

Wieland Brendel and Matthias Bethge. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. International Conference on Learning Representa- tions, 2019. 5

work page 2019

[3] [3]

Hierarchical gnn framework for earth’s surface anomaly detection in single satellite im- agery

Boan Chen, Zhi Gao, Ziyao Li, Siqi Liu, Aohan Hu, Weiwei Song, Yu Zhang, and Qiao Wang. Hierarchical gnn framework for earth’s surface anomaly detection in single satellite im- agery. IEEE Transactions on Geoscience and Remote Sensing,

work page

[4] [4]

How interpretable are interpretable graph neural networks? In Forty-first International Conference on Machine Learning,

Yongqiang Chen, Yatao Bian, Bo Han, and James Cheng. How interpretable are interpretable graph neural networks? In Forty-first International Conference on Machine Learning,

work page

[5] [5]

Remote sensing image scene classification: Benchmark and state of the art

Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017. 5

work page 2017

[6] [6]

Pruning deep neural networks from a sparsity perspective

Enmao Diao, Ganghua Wang, Jiawei Zhan, Yuhong Yang, Jie Ding, and Vahid Tarokh. Pruning deep neural networks from a sparsity perspective. arXiv preprint arXiv:2302.05601,

work page arXiv

[7] [7]

Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. InInternational Conference on Learning Representations, 2018. 1

work page 2018

[8] [8]

Vision gnn: An image is worth graph of nodes

Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and En- hua Wu. Vision gnn: An image is worth graph of nodes. Advances in neural information processing systems, 35:8291– 8303, 2022. 1, 2, 3, 4, 5

work page 2022

[9] [9]

Vision hgnn: An image is more than a graph of nodes

Yan Han, Peihao Wang, Souvik Kundu, Ying Ding, and Zhangyang Wang. Vision hgnn: An image is more than a graph of nodes. In Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 19878–19888,

work page

[10] [10]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 4, 5

work page 2016

[11] Alex Levering, Diego Marcos, Jasper van Vliet, and Devis Tuia. Predicting the liveability of dutch cities with aerial images and semantic intermediate concepts. Remote Sensing of Environment, 287:113454, 2023.

[12] Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In Proceedings of the IEEE/CVF international conference on computer vision, pages 9267–9276, 2019.

[13] Wanyu Lin, Hao Lan, Hao Wang, and Baochun Li. Orphicx: A causality-inspired latent variable model for interpreting graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13729–13738, 2022.

[14] Qinghui Liu, Michael Kampffmeyer, Robert Jenssen, and Arnt-Børre Salberg. Self-constructing graph neural networks to model long-range pixel dependencies for semantic segmentation of remote sensing images. International Journal of Remote Sensing, 42(16):6184–6208, 2021.

[15] Siqi Miao, Mia Liu, and Pan Li. Interpretable and generalizable graph learning via stochastic attention mechanism. In International Conference on Machine Learning, pages 15524–15543. PMLR, 2022.

[16] Mustafa Munir, William Avery, and Radu Marculescu. Mobilevig: Graph-based sparse attention for mobile vision applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2211–2219,

[17] Mustafa Munir, William Avery, Md Mostafijur Rahman, and Radu Marculescu. Greedyvig: Dynamic axial graph construction for efficient vision gnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6118–6127, 2024.

[18] Meike Nauta, Jan Trienes, Shreyasi Pathak, Elisa Nguyen, Michelle Peters, Yasmin Schmitt, Jörg Schlötterer, Maurice Van Keulen, and Christin Seifert. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai. ACM Computing Surveys, 55(13s):1–42, 2023.

[19] Maxim Neumann, Andre Susano Pinto, Xiaohua Zhai, and Neil Houlsby. In-domain representation learning for remote sensing. arXiv preprint arXiv:1911.06721, 2019.

[20] Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. In British Machine Vision Conference (BMVC), 2018.

[21] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. Advances in neural information processing systems, 30, 2017.

[22] Gregory Scafarto, Madalina Ciortan, Simon Tihon, and Quentin Ferre. Augment to interpret: Unsupervised and inherently interpretable graph embeddings. In Asian Conference on Machine Learning, pages 1183–1198. PMLR, 2024.

[23] Yuntao Shou, Wei Ai, Tao Meng, and Nan Yin. Graph information bottleneck for remote sensing segmentation. arXiv preprint arXiv:2312.02545, 2023.

[24] Xinyang Song, Zhen Hua, and Jinjiang Li. Context spatial awareness remote sensing image change detection network based on graph and convolution interaction. IEEE Transactions on Geoscience and Remote Sensing, 2024.

[25] Gabriele Spadaro, Marco Grangetto, Attilio Fiandrotti, Enzo Tartaglione, and Jhony H Giraldo. Wignet: Windowed vision graph neural network. arXiv preprint arXiv:2410.00807, 2024.

[26] Shuyang Sun, Jiangmiao Pang, Jianping Shi, Shuai Yi, and Wanli Ouyang. Fishnet: A versatile backbone for image, region, and pixel level prediction. Advances in neural information processing systems, 31, 2018.

[27] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International conference on machine learning, pages 3319–3328. PMLR, 2017.

[28] Tailin Wu, Hongyu Ren, Pan Li, and Jure Leskovec. Graph information bottleneck. Advances in Neural Information Processing Systems, 33:20437–20448, 2020.

[29] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018.

[30] Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Suggala, David I Inouye, and Pradeep K Ravikumar. On the (in)fidelity and sensitivity of explanations. Advances in neural information processing systems, 32, 2019.

[31] Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec. Gnnexplainer: Generating explanations for graph neural networks. Advances in neural information processing systems, 32, 2019.

[32] Zhi-Hui You, Jia-Xin Wang, Si-Bao Chen, Chris HQ Ding, Gui-Zhou Wang, Jin Tang, and Bin Luo. Crossed siamese vision graph neural network for remote sensing image change detection. IEEE Transactions on Geoscience and Remote Sensing, 2023.

[33] Hao Yuan, Jiliang Tang, Xia Hu, and Shuiwang Ji. Xgnn: Towards model-level explanations of graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 430–438,

[34] Hao Yuan, Haiyang Yu, Shurui Gui, and Shuiwang Ji. Explainability in graph neural networks: A taxonomic survey. IEEE transactions on pattern analysis and machine intelligence, 45(5):5782–5799, 2022.

[35] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.

[36] Cui Zhang, Liejun Wang, and Shuli Cheng. Hcgnet: A hybrid change detection network based on cnn and gnn. IEEE Transactions on Geoscience and Remote Sensing, 2024.

[37] Hongyi Zhang. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

i-WiViG: Interpretable Window Vision GNN
Supplementary Material

A. i-WiViG Architecture

Table 4 presents in detail the layer composition and the hyperparameters used in our proposed i-WiViG model. Similar to the WiGNet and the pyramid ViG models, we perf...

Further, to obtain an inherently interpretable model that reveals the relevant subgraph for its prediction, we insert the GSAT graph in the final stage before the prediction head to learn the global long-range relations between the windows in the image by ranking the importance of the edges in the graph. Similar to the standard grapher layers in the ViG model, in the GSAT block we perform the following operations on the graph nodes, in this order:

1. Linear transformation of the node embeddings
2. GIN graph convolution
3. FFN layer processing

Finally, the GSAT block is followed by a prediction head consisting of pooling and MLP layers, as illustrated in Table 4.

B. i-WiViG Training Procedure

For model training, we performed image transformations in the following order:

1. Resizing the images to a size of 256 × 256
2. Min-max image normalization

Further, for the scene classification task, we used the Cutmix [35] and Mixup [37] augmentations during training. Regarding the benchmark ViG models, we used the default hyperparameter setting of the tiny versions. For our i-WiViG model, we used the hyperparameters of the WiGNet grapher blocks illustrated in Table 4 settin...
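The training-time transformations described in Section B can be illustrated with a small, dependency-free sketch. It is not the authors' pipeline: the nearest-neighbour interpolation, the per-image min-max statistics, the single-channel nested-list image format, and the explicit mixing weight `lam` are all assumptions introduced here for illustration, and the function names are hypothetical. In practice one would more likely use an image library for resizing and draw the Mixup weight from a Beta distribution.

```python
# Sketch only: nearest-neighbour resizing, per-image min-max statistics and
# the explicit mixing weight `lam` are assumptions, not details from the paper.

def resize_nearest(image, out_h=256, out_w=256):
    """Resize a 2-D image (list of rows) with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def min_max_normalize(image, eps=1e-12):
    """Rescale pixel values linearly so that they lie in [0, 1]."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    scale = max(hi - lo, eps)  # guard against constant images
    return [[(v - lo) / scale for v in row] for row in image]


def preprocess(image):
    """Apply the two steps in the listed order: resize, then normalize."""
    return min_max_normalize(resize_nearest(image))


def mixup(x_a, y_a, x_b, y_b, lam):
    """Mixup: convex combination of two images and their one-hot labels."""
    x = [[lam * a + (1 - lam) * b for a, b in zip(ra, rb)]
         for ra, rb in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


# Hypothetical 4 x 4 single-channel image with pixel values 0..15.
toy = [[4 * r + c for c in range(4)] for r in range(4)]
out = preprocess(toy)
mixed_x, mixed_y = mixup([[0.0, 1.0]], [1.0, 0.0], [[1.0, 0.0]], [0.0, 1.0], lam=0.5)
print(len(out), len(out[0]), out[0][0], out[-1][-1], mixed_y)
```

Normalizing after resizing means the minimum and maximum are taken over the resized image; with an interpolating resize the reverse order could give slightly different values.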

for graph processing.

Stage | Output Size | Hyperparameters
Stem | H/4 × W/4 | Conv ×2
Stage 1 WiGNet block | H/4 × W/4 | [D = 48, E = 4, k = 9, W = 4] × 2
Downsample | H/8 × W/8 | Conv ×2
Stage 2 WiGNet block | H/8 × W/8 | [D = 96, E = 4, k = 9, W = 4] × 2
Downsample | H/16 × W/16 | Conv ×2
Stage 3 WiGNet block | H/16 × W/16 | [D = 240, E = 4, k = 9, W = 4] × 4
Downs...
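The operation order given for the GSAT block in Section A (linear transformation, GIN graph convolution, FFN) can be sketched as follows. This is a minimal illustration under stated assumptions, not the i-WiViG implementation. The ReLU activations, the single-layer MLP inside the GIN update, the plain-list tensors, and the names `gsat_block_sketch` and `params` are hypothetical. The learnable sparse edge attention that ranks edge importance is deliberately left out.

```python
# Sketch only: shapes, ReLU activations and the single-layer GIN MLP are
# assumptions made for illustration, not the settings used in the paper.

def linear(x, weight, bias):
    """Apply y = x @ weight + bias to every node embedding (row of x)."""
    cols = list(zip(*weight))  # columns of the (d_in, d_out) weight matrix
    return [
        [sum(v * w for v, w in zip(row, col)) + b for col, b in zip(cols, bias)]
        for row in x
    ]


def relu(x):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in x]


def gin_conv(x, edges, weight, bias, eps=0.0):
    """GIN update: MLP((1 + eps) * h_i + sum of h_j over edges j -> i)."""
    agg = [[(1.0 + eps) * v for v in row] for row in x]
    for src, dst in edges:
        for d, v in enumerate(x[src]):
            agg[dst][d] += v
    return relu(linear(agg, weight, bias))


def ffn(x, w1, b1, w2, b2):
    """Two-layer feed-forward network applied to each node independently."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


def gsat_block_sketch(x, edges, params):
    """Run the three steps in the listed order: linear, GIN convolution, FFN."""
    h = linear(x, *params["proj"])
    h = gin_conv(h, edges, *params["gin"])
    return ffn(h, *params["ffn"])


# Toy graph: 3 window nodes, 2-dimensional embeddings, identity weights.
eye, zero = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
params = {"proj": (eye, zero), "gin": (eye, zero), "ffn": (eye, zero, eye, zero)}
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2), (2, 0)]  # directed edges src -> dst
out = gsat_block_sketch(nodes, edges, params)
print(out)
```

With identity weights and zero biases, each node's output is its own embedding plus the sum of its incoming neighbours' embeddings, which makes the aggregation step easy to check by hand.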