Inductive inference of gradient-boosted decision trees on graphs for insurance fraud detection

Bruno Deprez; Félix Vandervorst; Tim Verdonck; Wouter Verbeke

arxiv: 2510.05676 · v2 · pith:V5EWBZ3Xnew · submitted 2025-10-07 · 💻 cs.LG · cs.SI


Pith reviewed 2026-05-21 20:48 UTC · model grok-4.3


keywords insurance fraud detection · gradient boosting · heterogeneous graphs · graph machine learning · SHAP explanations · class imbalance · inductive learning · tabular feature integration


The pith

Gradient-boosted trees that use concatenated path features from heterogeneous graphs detect insurance fraud as well as or better than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces G-GBM, an inductive graph gradient boosting machine for supervised learning on heterogeneous and dynamic graphs in insurance fraud detection. It encodes relational information by concatenating path-level features onto the original tabular data, adding neighborhood context without discarding standard features. This setup retains gradient boosting's strength against severe class imbalance while supporting SHAP explanations tied to specific metapaths and features. Experiments on one public and one proprietary dataset show performance that matches or exceeds existing approaches. The authors also release an insurance fraud dataset to aid reproducibility and further research.

Core claim

We present a novel inductive graph gradient boosting machine (G-GBM) for supervised learning on heterogeneous and dynamic graphs. G-GBM combines the class-imbalance robustness of gradient boosting with heterogeneous graph information encoded through interpretable path-level feature concatenations, while preserving access to the original tabular feature space. In addition, the explicit representation of neighbourhood information enables transparent SHAP-based explanations at the metapath and feature level. We demonstrate G-GBM for insurance fraud detection on an open-source and a real-world, proprietary dataset, and find that G-GBM performs on par or better than the state-of-the-art.
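The metapath-level explanations mentioned above can be illustrated with a minimal sketch. It assumes per-feature SHAP values for one claim are already available, and that augmented columns are named `metapath::feature`. That naming convention, the metapath names, and all numbers are hypothetical and do not come from the paper. Because SHAP values are additive, summing them within a group gives that group's total contribution to the prediction.

```python
from collections import defaultdict

# Hypothetical per-feature SHAP values for a single claim (made-up numbers).
# Column names follow an assumed "metapath::feature" convention; "own" marks
# the claim's original tabular features.
shap_values = {
    "own::claim_amount": 0.25,
    "own::days_since_policy_start": -0.125,
    "claim-policy-claim::claim_amount": 0.125,
    "claim-policy-claim::prior_fraud_flag": 0.5,
    "claim-garage-claim::claim_amount": -0.25,
}


def group_by_metapath(values):
    """Sum additive SHAP contributions over all columns of each metapath."""
    totals = defaultdict(float)
    for column, value in values.items():
        metapath, _, _feature = column.partition("::")
        totals[metapath] += value
    return dict(totals)


by_metapath = group_by_metapath(shap_values)
for metapath, contribution in sorted(by_metapath.items()):
    print(f"{metapath}: {contribution:+.3f}")
```

The per-feature values stay available for a finer view, so the same attribution can be read at either level.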

What carries the argument

Path-level feature concatenations from the heterogeneous graph that supply neighborhood information to gradient-boosted decision trees.
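As a rough illustration of what such a concatenation could look like, the sketch below appends neighbour features to each claim's own tabular row, one block of columns per metapath. The toy graph, the node and edge types, the choice of mean aggregation, and the zero fill for claims without neighbours are all assumptions made for this example. The paper's actual construction may differ.

```python
from statistics import mean

# Hypothetical toy data: every name and value here is illustrative only.
claim_features = {
    "c1": [1200.0, 0.0],
    "c2": [300.0, 1.0],
    "c3": [9800.0, 1.0],
}
# Typed edges of a small heterogeneous graph (claim -> policy, claim -> garage).
claim_to_policy = {"c1": ["p1"], "c2": ["p1"], "c3": ["p2"]}
claim_to_garage = {"c1": ["g1"], "c2": ["g2"], "c3": ["g1"]}


def invert(edges):
    """Turn claim -> [entity] edges into entity -> [claim] edges."""
    inverse = {}
    for src, dsts in edges.items():
        for dst in dsts:
            inverse.setdefault(dst, []).append(src)
    return inverse


def metapath_neighbours(claim, edges):
    """Claims reachable via claim -> entity -> claim, excluding the start claim."""
    inverse = invert(edges)
    found = []
    for entity in edges.get(claim, []):
        for other in inverse.get(entity, []):
            if other != claim and other not in found:
                found.append(other)
    return found


def augment(claim, metapaths, missing=0.0):
    """Concatenate mean neighbour features per metapath onto the claim's own row."""
    row = list(claim_features[claim])
    width = len(row)
    for edges in metapaths:
        neighbours = metapath_neighbours(claim, edges)
        if neighbours:
            columns = zip(*(claim_features[n] for n in neighbours))
            row.extend(mean(col) for col in columns)
        else:
            # Assumed placeholder for "no neighbour along this metapath".
            row.extend([missing] * width)
    return row


metapaths = [claim_to_policy, claim_to_garage]
augmented = {c: augment(c, metapaths) for c in claim_features}
```

The original columns stay at the front of every row, so a tree model trained on `augmented` still sees the plain tabular features. Since the extra columns are computed from a claim's current neighbourhood, a newly added claim can be featurised the same way without retraining.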

Load-bearing premise

Concatenating path-level features from the heterogeneous graph supplies non-redundant, non-noisy signal that improves or at least does not degrade the gradient-boosted tree's ability to separate the rare fraud class, even under high class imbalance and changing graph structure.

What would settle it

A test on the same datasets showing that a standard gradient-boosted model without any path features achieves equal or higher fraud detection performance would falsify the claimed contribution of the graph component.
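A minimal sketch of how such a comparison could be scored follows. It assumes average precision as the metric, which is a common choice under class imbalance but is not confirmed as the paper's metric. The labels and both score lists are made up, so the numbers illustrate the computation only and say nothing about the paper's results.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example.

    Ties in the scores are broken arbitrarily, which is good enough for a sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_score, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


# Hypothetical held-out labels (1 = fraud) and model scores.
labels = [0, 0, 1, 0, 0, 0, 1, 0]
tabular_only_scores = [0.1, 0.7, 0.6, 0.2, 0.3, 0.1, 0.4, 0.5]
with_path_features_scores = [0.1, 0.5, 0.9, 0.2, 0.3, 0.1, 0.6, 0.4]

ap_tabular = average_precision(labels, tabular_only_scores)
ap_graph = average_precision(labels, with_path_features_scores)

# The graph component's claimed contribution is falsified if the baseline
# without path features scores at least as high on the same test set.
graph_component_falsified = ap_tabular >= ap_graph
```

In practice both models would be trained with the same hyperparameter budget and evaluated on the same split, so that the path features are the only difference.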

Original abstract

Graph-based methods are becoming increasingly popular in machine learning due to their ability to model complex data and relations. Insurance fraud is a prime use case, since fraudulent claims are often the result of organised criminals that stage accidents or the same persons filing erroneous claims on multiple policies. One challenge is that graph-based approaches struggle to find meaningful representations of the data because of the high class imbalance present in fraud data. In addition, insurance graphs are heterogeneous and dynamic, given the changing relations among people, companies and policies. As a result, gradient-boosted tree approaches on tabular data still dominate the field. Therefore, we present a novel inductive graph gradient boosting machine (G-GBM) for supervised learning on heterogeneous and dynamic graphs. G-GBM combines the class-imbalance robustness of gradient boosting with heterogeneous graph information encoded through interpretable path-level feature concatenations, while preserving access to the original tabular feature space. In addition, the explicit representation of neighbourhood information enables transparent SHAP-based explanations at the metapath and feature level. We demonstrate G-GBM for insurance fraud detection on an open-source and a real-world, proprietary dataset, and find that G-GBM performs on par or better than the state-of-the-art. The associated insurance fraud dataset is publicly released to facilitate reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces G-GBM, an inductive graph gradient boosting machine for supervised learning on heterogeneous and dynamic graphs applied to insurance fraud detection. It augments standard GBM with interpretable path-level feature concatenations extracted from metapaths, preserves the original tabular feature space, and enables metapath-level SHAP explanations. The central empirical claim is that G-GBM performs on par or better than state-of-the-art methods on both a publicly released open-source dataset and a proprietary real-world dataset, while handling high class imbalance and graph dynamics via temporal splits and inductive feature extraction without full graph retraining.

Significance. If the reported results hold, the work is significant for bridging dominant tabular GBM approaches in fraud detection with graph-based methods without sacrificing interpretability or tabular access. Strengths include the public dataset release to support reproducibility, ablation studies isolating the contribution of concatenated path features, explicit tests of inductive inference under simulated node/edge additions, and temporal train/test splits that respect graph dynamics. These elements provide a practical, explainable alternative for heterogeneous dynamic graphs under imbalance.
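The temporal split mentioned among the strengths can be sketched as follows. The record layout, field names, dates, and cutoff are hypothetical. The `visible_edges` helper reflects an assumption about how leakage could be avoided (only using relations that already existed at a given date), not a documented step of the paper.

```python
from datetime import date

# Hypothetical claim records and timestamped edges; all values are made up.
claims = [
    {"id": "c1", "filed": date(2021, 3, 1)},
    {"id": "c2", "filed": date(2021, 9, 15)},
    {"id": "c3", "filed": date(2022, 1, 10)},
    {"id": "c4", "filed": date(2022, 6, 30)},
]
edges = [
    {"claim": "c1", "entity": "p1", "created": date(2021, 3, 1)},
    {"claim": "c2", "entity": "p1", "created": date(2021, 9, 15)},
    {"claim": "c3", "entity": "p1", "created": date(2022, 1, 10)},
]


def temporal_split(records, cutoff):
    """Train on claims filed before the cutoff, test on claims filed on or after it."""
    train = [r for r in records if r["filed"] < cutoff]
    test = [r for r in records if r["filed"] >= cutoff]
    return train, test


def visible_edges(all_edges, as_of):
    """Keep only edges that already existed on the given date."""
    return [e for e in all_edges if e["created"] <= as_of]


cutoff = date(2022, 1, 1)
train_claims, test_claims = temporal_split(claims, cutoff)
# Graph as it looked the day c2 was filed: the later c3 edge is hidden.
edges_for_c2 = visible_edges(edges, date(2021, 9, 15))
```

Splitting by time rather than at random keeps future claims and relations out of the training features, which matters when the graph itself changes.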

minor comments (3)

[Abstract] The statement that G-GBM 'performs on par or better than the state-of-the-art' would be strengthened by a brief quantitative reference (e.g., AUC or F1 improvement) or a pointer to the main results table.
[§3, Method] The exact procedure for selecting and concatenating metapath features (including any filtering or aggregation steps) should be stated in sufficient detail for independent reproduction of the feature construction pipeline.
[Results] Table 2 or the equivalent results section: confirm that all reported metrics include standard deviations across multiple runs or seeds, to allow assessment of variability under the high class imbalance.
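The variability reporting requested in the last comment could look like the sketch below. The per-seed metric values are placeholders chosen for easy arithmetic, not results from the paper, and the function name is hypothetical.

```python
from statistics import mean, stdev


def summarise(metric_by_seed):
    """Return the mean and sample standard deviation of a metric over seeds."""
    values = list(metric_by_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(values), stdev(values)


# Placeholder metric values keyed by random seed (not real results).
runs = {0: 0.5, 1: 0.75, 2: 0.625}

avg, spread = summarise(runs)
print(f"{avg:.3f} ± {spread:.3f} over {len(runs)} seeds")
```

With very few fraud cases in the test set, a single run can swing widely, so the spread is as informative as the mean.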

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and constructive review, as well as the recommendation for minor revision. We are pleased that the referee recognizes the practical value of G-GBM in combining the robustness of gradient boosting with interpretable graph neighborhood information for heterogeneous, dynamic insurance graphs, while preserving tabular features and enabling metapath-level explanations. The highlighted strengths—public dataset release, ablation studies on concatenated path features, inductive inference tests under node/edge additions, and temporal splits—are core to our contribution and we are grateful for this acknowledgment.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript presents G-GBM as an empirical method that concatenates interpretable path-level features from heterogeneous graphs into a gradient-boosted tree while retaining the original tabular space and enabling metapath SHAP. All central claims rest on external performance comparisons, ablation studies, temporal train/test splits, and node-addition simulations against published baselines on both public and proprietary data. No equation, prediction, or uniqueness claim reduces by construction to a fitted parameter or self-citation; the inductive construction is explicitly validated rather than assumed. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that path features add useful signal and on standard GBM hyperparameters; no new physical entities are postulated.

free parameters (1)

GBM hyperparameters (learning rate, tree depth, etc.)
Standard tuning parameters required for any gradient-boosted tree model; their specific values are not reported in the abstract.

axioms (1)

domain assumption: Gradient boosting is robust to severe class imbalance when appropriate loss and sampling strategies are used.
Invoked to justify why the tree component remains effective on fraud data.
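One common way to adapt the loss to imbalance is to up-weight the rare class by the ratio of negatives to positives. The sketch below shows only that heuristic. Whether the paper uses this weighting, another loss, or a sampling scheme is not stated here, so treat it as an assumption. The label counts are made up.

```python
def positive_class_weight(labels):
    """Ratio of negative to positive examples, used to up-weight the rare class."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples to weight")
    return n_neg / n_pos


def sample_weights(labels):
    """Per-example weights: positives get the class ratio, negatives get 1.0."""
    weight = positive_class_weight(labels)
    return [weight if y == 1 else 1.0 for y in labels]


# Hypothetical training labels with 2% fraud.
labels = [0] * 98 + [1] * 2
pos_weight = positive_class_weight(labels)
weights = sample_weights(labels)
```

Most gradient-boosting libraries accept either per-example weights or a single positive-class weight, so either output can be passed on depending on the library in use.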

