
arxiv: 2604.14232 · v1 · submitted 2026-04-14 · 💻 cs.LG · cs.AI

Explainable Graph Neural Networks for Interbank Contagion Surveillance: A Regulatory-Aligned Framework for the U.S. Banking Sector

Pith reviewed 2026-05-10 16:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords graph neural networks · interbank contagion · bank distress prediction · explainable AI · macro-prudential surveillance · systemic risk · FDIC call reports

The pith

ST-GAT predicts U.S. bank distress at 0.939 AUPRC using reconstructed interbank graphs from public FDIC data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the Spatial-Temporal Graph Attention Network (ST-GAT) to detect early signs of distress in U.S. banks and to surveil the interbank system for macro-prudential purposes. It builds a dynamic directed weighted graph of 8,103 institutions across 58 quarters by estimating bilateral exposures via maximum entropy from aggregated FDIC Call Reports. The framework achieves the highest AUPRC among GNN models, 0.939 ± 0.010, nearly matching XGBoost at 0.944; the temporal BiLSTM component contributes 0.020 AUPRC. Permutation importance highlights ROA and NPL ratio as the top predictors, consistent with post-mortem analyses of the 2023 regional banking crisis. A reader would care because this gives regulators an explainable, reproducible tool for contagion monitoring based only on public data.
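
For readers less familiar with the headline metric, the following is a minimal sketch of AUPRC computed as step-wise average precision. The function name and the toy labels and scores are illustrative assumptions, not taken from the paper, and the paper may use a different interpolation of the precision-recall curve.

```python
import numpy as np

def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision).

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    y = np.asarray(y_true)
    # Rank institutions from highest to lowest predicted distress score.
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y = y[order]
    true_positives = np.cumsum(y)
    precision_at_k = true_positives / np.arange(1, len(y) + 1)
    # Average the precision observed at the rank of each true positive.
    return float((precision_at_k * y).sum() / y.sum())

# Toy example: two distressed banks (label 1) among four.
y_toy = [1, 0, 1, 0]
scores_toy = [0.9, 0.8, 0.7, 0.1]
ap_toy = average_precision(y_toy, scores_toy)
```

A random ranking scores roughly the positive-class prevalence on this metric, which is why AUPRC is usually preferred over AUROC when distress events are rare.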

Core claim

The ST-GAT framework models interbank contagion risk by applying graph attention to spatial bank linkages and bidirectional LSTM with temporal attention to quarterly sequences on a maximum-entropy reconstructed directed weighted graph, yielding strong distress prediction and interpretable weights that emphasize long-run structural vulnerabilities.

What carries the argument

The Spatial-Temporal Graph Attention Network (ST-GAT) that fuses graph attention layers for interbank connections with BiLSTM and temporal attention for time dynamics on the reconstructed exposure graph.
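
The two attention steps named above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the paper's implementation: it uses a single attention head, ignores edge weights, replaces the BiLSTM with the raw per-quarter embeddings, and all parameter names (`W`, `a_src`, `a_dst`, `v`) and the random toy data are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(h, adj, W, a_src, a_dst, slope=0.2):
    """Single-head GAT-style layer.

    h: (N, F) node features; adj[i, j] > 0 means node i attends to node j.
    """
    z = h @ W                                              # (N, D) projected features
    scores = (z @ a_dst)[:, None] + (z @ a_src)[None, :]   # e_ij before masking
    scores = np.where(scores > 0, scores, slope * scores)  # LeakyReLU
    mask = (adj > 0) | np.eye(len(h), dtype=bool)          # neighbours plus self-loop
    scores = np.where(mask, scores, -np.inf)
    alpha = softmax(scores, axis=1)                        # attention over neighbours
    return alpha @ z, alpha

def temporal_attention(H, v):
    """H: (T, D) per-quarter states for one bank; v: (D,) scoring vector."""
    beta = softmax(np.tanh(H) @ v)                         # one weight per quarter
    return beta @ H, beta

rng = np.random.default_rng(0)
N, F, D, T = 4, 3, 2, 5
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]], dtype=float)
W = rng.normal(size=(F, D))
a_src, a_dst, v = rng.normal(size=D), rng.normal(size=D), rng.normal(size=D)

embeddings, alpha = [], None
for _ in range(T):
    x_t = rng.normal(size=(N, F))          # toy node features for one quarter
    z_t, alpha = graph_attention(x_t, adj, W, a_src, a_dst)
    embeddings.append(z_t)
H = np.stack(embeddings)                   # (T, N, D)
bank0_embedding, beta = temporal_attention(H[:, 0, :], v)
```

The weights `beta` are the kind of per-quarter attention scores whose shape over time the paper interprets; in a full model the rows of `H` would be BiLSTM outputs rather than raw graph embeddings.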

Load-bearing premise

Maximum entropy estimation from aggregated FDIC Call Reports produces a sufficiently accurate reconstruction of true bilateral interbank exposures for the downstream prediction task.
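
To make the premise concrete, here is a minimal sketch of the textbook maximum-entropy reconstruction solved by iterative proportional fitting (the RAS algorithm), assuming only each bank's total interbank assets and liabilities are known. The paper's exact procedure is not shown on this page, so the function name, the toy totals, and the stopping rule are assumptions.

```python
import numpy as np

def max_entropy_exposures(assets, liabilities, n_iter=500, tol=1e-10):
    """Estimate X, where X[i, j] is lending from bank i to bank j.

    Row sums are matched to `assets`, column sums to `liabilities`.
    Illustrative sketch; not the paper's code.
    """
    a = np.asarray(assets, dtype=float)
    liab = np.asarray(liabilities, dtype=float)
    if not np.isclose(a.sum(), liab.sum()):
        raise ValueError("total interbank assets must equal total liabilities")
    # Maximum-entropy prior: exposures proportional to a_i * l_j, no self-lending.
    x = np.outer(a, liab)
    np.fill_diagonal(x, 0.0)
    for _ in range(n_iter):
        # Rescale rows to match assets, then columns to match liabilities.
        row = x.sum(axis=1)
        x *= np.divide(a, row, out=np.zeros_like(a), where=row > 0)[:, None]
        col = x.sum(axis=0)
        x *= np.divide(liab, col, out=np.zeros_like(liab), where=col > 0)[None, :]
        if np.abs(x.sum(axis=1) - a).max() < tol:
            break
    return x

# Toy aggregates for three hypothetical banks.
assets = [50.0, 30.0, 20.0]
liabilities = [40.0, 35.0, 25.0]
X = max_entropy_exposures(assets, liabilities)
```

Because this estimator spreads each bank's lending across all counterparties in proportion to their size, it says nothing about which links actually exist, which is the gap the premise asks the reader to accept.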

What would settle it

Validating the maximum-entropy reconstruction against confidential bilateral exposure data from the same period, and re-running the model with the true links in place of the estimates, would show whether prediction performance holds.

Original abstract

The Spatial-Temporal Graph Attention Network (ST-GAT) framework was created to serve as an explainable GNN-based solution for detecting bank distress early warning signs and for conducting macro-prudential surveillance of the interbank system in the United States. The ST-GAT framework models 8,103 FDIC insured institutions across 58 quarterly snapshots (2010Q1-2024Q2). Bilateral exposures were reconstructed from publicly available FDIC Call Reports using maximum entropy estimation to produce a dynamic directed weighted graph. The framework achieves the highest AUPRC among all GNN architectures (0.939 +/- 0.010), trailing only XGBoost (0.944). Ablation analysis confirms the BiLSTM temporal component contributes +0.020 AUPRC; temporal attention weights exhibit a monotonically decreasing pattern consistent with long-run structural vulnerability weighting. Permutation importance identifies ROA (0.309) and NPL Ratio (0.252) as dominant predictors, consistent with post-mortem analyses of the 2023 regional banking crisis. All data are publicly available FDIC Call Reports and FRED series; all code and results are released.
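
The permutation importance mentioned in the abstract can be sketched as the drop in AUROC after shuffling one feature at a time. The sketch below is an assumption-laden illustration: the data are synthetic, the linear `predict` function stands in for a trained model, and the column order (a ROA-like feature, an NPL-like feature, a noise feature) is hypothetical.

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the Mann-Whitney rank statistic (assumes no tied scores)."""
    y = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = y.sum(), (~y).sum()
    return (ranks[y].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in AUROC when each column of X is shuffled independently."""
    rng = np.random.default_rng(seed)
    baseline = auroc(y, predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += baseline - auroc(y, predict(X_perm))
    return drops / n_repeats

# Synthetic data: column 0 behaves like ROA (low values raise risk),
# column 1 like an NPL ratio (high values raise risk), column 2 is noise.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
risk = -2.0 * X[:, 0] + 1.0 * X[:, 1]
y = (risk + 0.5 * rng.normal(size=n) > 1.5).astype(int)
predict = lambda features: -2.0 * features[:, 0] + 1.0 * features[:, 1]
importance = permutation_importance(predict, X, y)
```

On a graph model, shuffling a node feature also changes what neighbours receive through message passing, so the resulting scores mix direct and network-mediated effects.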

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Spatial-Temporal Graph Attention Network (ST-GAT) as an explainable GNN framework for early detection of bank distress and macro-prudential surveillance of the U.S. interbank system. It processes data on 8,103 FDIC-insured institutions across 58 quarterly snapshots (2010Q1–2024Q2), reconstructs bilateral exposures from aggregated FDIC Call Reports via maximum-entropy estimation to form a dynamic directed weighted graph, and reports an AUPRC of 0.939 ± 0.010 (highest among GNN variants, second only to XGBoost at 0.944). Ablation studies attribute +0.020 AUPRC to the BiLSTM temporal module, temporal attention weights show a monotonically decreasing pattern, and permutation importance ranks ROA (0.309) and NPL Ratio (0.252) as top predictors, consistent with 2023 crisis analyses. All data, code, and results are released publicly.

Significance. If the maximum-entropy reconstruction is sufficiently faithful to actual bilateral exposures, the work supplies a reproducible, interpretable early-warning system for contagion surveillance that aligns feature importances with post-crisis evidence and benefits from full public release of data and code. The reported ablation results and temporal attention patterns provide concrete support for the claimed contribution of the spatial-temporal architecture.

major comments (2)
  1. §2.2 (Graph Construction): The central performance claim (AUPRC 0.939) rests on a directed weighted graph whose edges are imputed exclusively via maximum-entropy estimation from aggregated FDIC Call Report totals. No validation against any ground-truth bilateral data, no sensitivity checks under alternative reconstruction methods (gravity, minimum entropy), and no reported diagnostics on network statistics (degree distribution, sparsity, core-periphery structure) are provided. Because the GNN and all downstream feature importances operate directly on this imputed structure, the absence of such checks leaves open the possibility that the reported metrics are artifacts of the reconstruction rather than evidence of genuine contagion dynamics.
  2. §4.3 (Experimental Setup): The manuscript reports AUPRC on held-out quarterly snapshots but does not specify the exact train/validation/test partitioning scheme across the 58 quarters, the handling of severe class imbalance in distress labels, or any explicit out-of-sample temporal generalization tests. These details are load-bearing for interpreting whether the 0.939 AUPRC reflects robust predictive power or leakage from the reconstruction procedure.
minor comments (2)
  1. Table 2: The caption should explicitly state whether the reported standard deviations are across random seeds or across quarterly folds.
  2. Figure 3 (temporal attention weights): The monotonically decreasing pattern is visually clear, but the x-axis labeling of quarters could be clarified to indicate whether attention is computed per snapshot or aggregated.
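
The network diagnostics requested in the first major comment could be computed along the following lines. This is a sketch under assumptions: the function name, the returned fields, and the toy matrix are illustrative, and whether the paper thresholds or sparsifies its graph is not stated on this page.

```python
import numpy as np

def network_diagnostics(X, threshold=0.0):
    """Basic structure statistics for a weighted directed exposure matrix X."""
    A = np.asarray(X, dtype=float) > threshold   # binary edge indicator
    np.fill_diagonal(A, False)
    n = A.shape[0]
    n_edges = A.sum()
    return {
        "density": n_edges / (n * (n - 1)),
        "out_degree": A.sum(axis=1),
        "in_degree": A.sum(axis=0),
        # Share of edges i -> j that also have a reverse edge j -> i.
        "reciprocity": (A & A.T).sum() / max(n_edges, 1),
    }

# Toy exposure matrix: a_i * l_j prior with no self-lending.
X_toy = np.outer([50.0, 30.0, 20.0], [40.0, 35.0, 25.0])
np.fill_diagonal(X_toy, 0.0)

dense_stats = network_diagnostics(X_toy)                  # every off-diagonal edge present
sparse_stats = network_diagnostics(X_toy, threshold=800)  # keep only larger exposures
```

An unthresholded maximum-entropy matrix with positive totals is fully connected, so its density is 1 and its degree distribution is flat; diagnostics like these only become informative once a sparsification rule is specified.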

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, proposing revisions where feasible and noting the inherent limitations of the data.

point-by-point responses
  1. Referee: §2.2 (Graph Construction): The central performance claim (AUPRC 0.939) rests on a directed weighted graph whose edges are imputed exclusively via maximum-entropy estimation from aggregated FDIC Call Report totals. No validation against any ground-truth bilateral data, no sensitivity checks under alternative reconstruction methods (gravity, minimum entropy), and no reported diagnostics on network statistics (degree distribution, sparsity, core-periphery structure) are provided. Because the GNN and all downstream feature importances operate directly on this imputed structure, the absence of such checks leaves open the possibility that the reported metrics are artifacts of the reconstruction rather than evidence of genuine contagion dynamics.

    Authors: We agree that additional robustness checks would strengthen the claims. Maximum-entropy reconstruction is a standard approach in the interbank network literature when only aggregate exposures are available. However, direct validation against ground-truth bilateral data is impossible because such granular interbank exposure information is confidential and not released by the FDIC. In revision, we will add sensitivity analyses comparing maximum-entropy results to a gravity-model reconstruction and include network-level diagnostics (degree distributions, sparsity, and core-periphery statistics) in a new appendix. These additions will show whether performance depends on the choice of reconstruction method.
    revision: partial

  2. Referee: §4.3 (Experimental Setup): The manuscript reports AUPRC on held-out quarterly snapshots but does not specify the exact train/validation/test partitioning scheme across the 58 quarters, the handling of severe class imbalance in distress labels, or any explicit out-of-sample temporal generalization tests. These details are load-bearing for interpreting whether the 0.939 AUPRC reflects robust predictive power or leakage from the reconstruction procedure.

    Authors: We accept that these implementation details must be stated explicitly. The original experiments used a strict chronological split: quarters 1–40 for training, 41–50 for validation, and 51–58 for testing. Class imbalance was addressed with a weighted cross-entropy loss (weights set inversely to class frequencies). Because the test quarters are strictly later than the training data, the evaluation already constitutes temporal out-of-sample generalization with no forward leakage from the reconstruction. We will expand §4.3 with these exact specifications and add a short paragraph confirming the temporal ordering prevents leakage.
    revision: yes
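
The split and loss described in the second response can be sketched as follows. The quarter boundaries (1–40, 41–50, 51–58) come from the simulated rebuttal; the function names, the particular inverse-frequency formula, the normalisation of the loss, and the toy labels are assumptions for illustration.

```python
import numpy as np

def chronological_split(quarter_index, train_end=40, val_end=50):
    """Boolean masks for train / validation / test by quarter number (1-based)."""
    q = np.asarray(quarter_index)
    return q <= train_end, (q > train_end) & (q <= val_end), q > val_end

def inverse_frequency_weights(labels):
    """Class weights proportional to 1 / class frequency (one common convention)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def weighted_cross_entropy(y, p, weights, eps=1e-12):
    """Binary cross-entropy with per-class weights, normalised by total weight."""
    y = np.asarray(y)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    w = np.where(y == 1, weights[1], weights[0])
    losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float((w * losses).sum() / w.sum())

quarters = np.arange(1, 59)                       # 58 quarterly snapshots
train_mask, val_mask, test_mask = chronological_split(quarters)

# Toy training labels with 5% distressed institutions (hypothetical prevalence).
toy_labels = np.array([0] * 95 + [1] * 5)
class_weights = inverse_frequency_weights(toy_labels)
toy_loss = weighted_cross_entropy(toy_labels, np.full(100, 0.05), class_weights)
```

Computing the class weights from the training quarters only keeps label information from the validation and test periods out of the loss.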

standing simulated objections not resolved
  • Direct validation of the reconstructed graph against actual bilateral interbank exposures is not possible, as such data remains confidential and unavailable from public FDIC sources.

Circularity Check

0 steps flagged

No circularity: performance from standard held-out evaluation on reconstructed graph

full rationale

The paper reconstructs a directed weighted interbank graph via maximum-entropy estimation from aggregated FDIC Call Report totals, then trains ST-GAT (with BiLSTM and attention) to predict bank distress labels across 58 quarterly snapshots. Reported AUPRC (0.939) is obtained by standard train/test split on held-out quarters, with ablation and permutation importance as post-hoc analysis. No equations or claims reduce the target metric to a fitted parameter or self-citation by construction; the graph construction is an external preprocessing step whose output is treated as input to an independent supervised model. All data and code are stated to be public, making the evaluation externally reproducible rather than tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claim rests on the domain assumption that maximum-entropy graph reconstruction faithfully represents interbank exposures and on standard supervised learning assumptions for the temporal graph model.

free parameters (1)
  • ST-GAT architecture hyperparameters
    Layer sizes, attention heads, learning rate, and temporal window length are tuned to achieve the reported AUPRC.
axioms (1)
  • domain assumption: Maximum entropy estimation from aggregated Call Reports yields an accurate proxy for bilateral exposures
    Invoked to construct the dynamic directed weighted graph used as model input.


