arxiv: 2605.17201 · v1 · pith:O4QZEDFFnew · submitted 2026-05-17 · 💻 cs.CR · cs.LG

Filter-then-Verify: A Multiphase GNN and ModernBERT Framework for Social Engineering Detection in Email Networks

Pith reviewed 2026-05-20 13:39 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords social engineering detection, graph neural networks, ModernBERT, email security, anomaly detection, content verification, cybersecurity, Enron dataset

The pith

A two-stage filter-then-verify framework uses graph neural networks to spot anomalous email patterns and ModernBERT to check message content, reaching 86% recall at the structural filtering stage and over 92% precision after content verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that social engineering attacks, which exploit human trust rather than technical flaws, can be caught at scale by first screening email networks for unusual sender-receiver structures and then examining the actual message content. It introduces a pipeline in which inductive GNNs perform the initial structural filter and a co-attention ModernBERT model performs the verification step to cut down false positives. The method is evaluated on the Enron email collection after realistic synthetic attack campaigns are added, producing 86% recall during filtering and more than 92% precision once content is checked. A reader would care because conventional filters miss many of these attacks, and a practical system that handles both outside threats and insider misuse could strengthen everyday email security.

Core claim

The central claim is that combining inductive Graph Neural Networks for structural anomaly detection with a co-attention ModernBERT model for content verification creates a practical, scalable two-stage filter-then-verify system that identifies multi-stage social engineering attacks in email networks, as shown by 86% recall in the GNN filtering stage and over 92% precision after BERT refinement on the Enron dataset augmented with synthetic campaigns.

What carries the argument

The filter-then-verify pipeline, where GNN-based structural filtering first identifies anomalous sender-receiver patterns and ModernBERT then verifies message context to reduce false positives.
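
As a rough illustration of the control flow only, the two stages can be sketched in Python. The `Email` record, `filter_then_verify`, the thresholds, and both toy scorers are hypothetical names introduced here. In the paper the first scorer would be the inductive GNN and the second the co-attention ModernBERT model, neither of which is reproduced.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Email:
    sender: str
    receiver: str
    body: str


def filter_then_verify(
    emails: List[Email],
    structural_score: Callable[[Email], float],
    content_score: Callable[[Email], float],
    filter_threshold: float = 0.5,
    verify_threshold: float = 0.5,
) -> List[Email]:
    """Return emails flagged by both stages (thresholds are illustrative)."""
    # Stage 1: cheap structural filter applied to every email, tuned for recall.
    candidates = [e for e in emails if structural_score(e) >= filter_threshold]
    # Stage 2: costly content check applied only to the surviving candidates.
    return [e for e in candidates if content_score(e) >= verify_threshold]


# Toy stand-ins (assumptions): a real system would call trained models here.
KNOWN_PAIRS = {("alice", "bob")}
SUSPICIOUS_WORDS = ("urgent", "wire", "password")


def toy_structural_score(email: Email) -> float:
    # An unseen sender-receiver pair counts as structurally anomalous.
    return 0.0 if (email.sender, email.receiver) in KNOWN_PAIRS else 1.0


def toy_content_score(email: Email) -> float:
    # Keyword matching stands in for the language model's judgement.
    text = email.body.lower()
    return 1.0 if any(word in text for word in SUSPICIOUS_WORDS) else 0.0


emails = [
    Email("alice", "bob", "lunch at noon?"),
    Email("mallory", "bob", "urgent: wire the payment today"),
    Email("carol", "bob", "meeting notes attached"),
]
flagged = filter_then_verify(emails, toy_structural_score, toy_content_score)
```

In this toy run the message from `carol` passes the structural filter because the pair is new, and the content stage then discards it. That is the false-positive reduction the second stage is meant to provide.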

If this is right

  • The approach detects both external attacks and insider threats within the same email network.
  • Structural filtering provides high recall while content verification keeps precision high enough for operational use.
  • The framework scales to large email networks because the GNN stage is inductive and the BERT stage is applied only to filtered candidates.
  • Multi-stage campaigns that unfold over sequences of messages can be caught by the combined structural and content signals.
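
To make "inductive" concrete, here is a minimal sketch of neighbor-mean aggregation in the spirit of GraphSAGE. It is an assumption that the paper's GNN resembles this; the function names, the identity weight matrices, and the tiny graph are all hypothetical. The point is only that a node added after training can be embedded from its neighbors' features with the same shared weights, without retraining.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def embed_node(
    node: str,
    features: Dict[str, Vector],
    neighbors: Dict[str, List[str]],
    w_self: Matrix,
    w_neigh: Matrix,
) -> Vector:
    """One aggregation layer: ReLU(W_self x + W_neigh mean(neighbor features))."""
    x = features[node]
    neigh_feats = [features[n] for n in neighbors.get(node, [])]
    aggregated = mean_vector(neigh_feats) if neigh_feats else [0.0] * len(x)
    combined = [a + b for a, b in zip(matvec(w_self, x), matvec(w_neigh, aggregated))]
    return [max(0.0, value) for value in combined]


# Hypothetical weights; in practice these are learned once and then reused.
identity = [[1.0, 0.0], [0.0, 1.0]]
features = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
neighbors = {"a": ["b"], "b": ["a"]}

# A new account "c" appears after training and emails both "a" and "b".
features["c"] = [0.5, 0.5]
neighbors["c"] = ["a", "b"]
embedding_c = embed_node("c", features, neighbors, identity, identity)
```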

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the reported numbers hold on live traffic, organizations could insert the pipeline as a lightweight pre-filter before traditional spam or phishing gateways.
  • The same structural-plus-content split might transfer to other trust-exploitation settings such as messaging platforms or enterprise collaboration tools.
  • Extending the GNN to incorporate timing or attachment metadata could further improve detection of campaigns that rely on gradual relationship building.

Load-bearing premise

The synthetic social engineering campaigns added to the Enron dataset are realistic enough that performance measured on this mixture will generalize to real-world email traffic and attacks.

What would settle it

Running the trained framework on a separate collection of verified real social engineering emails collected from actual incidents and checking whether recall stays near 86% and precision stays above 92%.
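
The check described above reduces to simple counting. The sketch below shows the arithmetic with made-up counts, which are not results from the paper. The helper names and the two-point tolerance used to interpret "near 86%" are assumptions.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of real attacks that were caught."""
    total = true_positives + false_negatives
    return true_positives / total if total else 0.0


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of flagged emails that were real attacks."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


def holds_up(
    stage1_recall: float,
    final_precision: float,
    target_recall: float = 0.86,
    target_precision: float = 0.92,
    recall_tolerance: float = 0.02,  # assumed reading of "near 86%"
) -> bool:
    """True if recall is within tolerance of its target and precision exceeds its target."""
    recall_ok = stage1_recall >= target_recall - recall_tolerance
    return recall_ok and final_precision > target_precision


# Hypothetical real-incident test set: 200 verified attacks.
# Stage 1 keeps 170 of them; stage 2 flags 160 attacks plus 10 benign emails.
stage1_recall = recall(true_positives=170, false_negatives=30)
final_precision = precision(true_positives=160, false_positives=10)
verdict = holds_up(stage1_recall, final_precision)
```

Note that the two headline numbers come from different stages, so the counts feeding `recall` and `precision` are taken at different points in the pipeline.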

Figures

Figures reproduced from arXiv: 2605.17201 by Barsat Khadka, Kshitiz Neupane, Nick Rahimi, Prasant Koirala.

Figure 1. Training Phase: Raw emails.csv is preprocessed into time-series features and graph…
Figure 2. Detection phase with synthetic attacks. Structural anomaly detection is performed…

Original abstract

Social engineering attacks exploit human trust rather than software vulnerabilities, making them difficult to detect using conventional filters. We propose a two-stage filter-then-verify framework combining inductive Graph Neural Networks (GNNs) for structural anomaly detection with a co-attention ModernBERT model for content verification. The GNN identifies anomalous sender-receiver patterns, while BERT analyzes message context to reduce false positives. Using the Enron dataset augmented with realistic synthetic campaigns, we show that the framework achieves 86% recall in structural filtering and over 92% precision after BERT refinement, effectively detecting both external attacks and insider threats. Our results demonstrate that combining structural and content analysis allows practical, scalable detection of multi-stage social engineering attacks in email networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a two-stage filter-then-verify framework for detecting social engineering attacks in email networks. Inductive Graph Neural Networks (GNNs) perform structural anomaly detection on sender-receiver patterns, followed by a co-attention ModernBERT model for content verification to reduce false positives. The system is evaluated on the Enron dataset augmented with realistic synthetic social engineering campaigns and reports 86% recall after structural filtering and over 92% precision after BERT refinement, claiming effective detection of both external attacks and insider threats via combined structural and content analysis.

Significance. If the results hold under proper validation, the multiphase GNN-then-ModernBERT pipeline could offer a practical, scalable method for identifying multi-stage social engineering attacks that exploit both network anomalies and linguistic cues. The approach addresses limitations of single-modality detectors and demonstrates how structural filtering can be paired with modern language models for improved precision. The significance is limited by the absence of external validation for the synthetic data and missing experimental controls.

major comments (2)
  1. [Dataset and Evaluation] The manuscript augments the Enron dataset with 'realistic synthetic campaigns' but supplies no external anchoring such as comparison to documented real-world incidents, expert labeling of the synthetics, or distribution-shift metrics between synthetic and authentic attack traces. This assumption is load-bearing for the central claim because the headline metrics (86% structural recall and >92% final precision) are measured exclusively on the self-constructed mixture.
  2. [Experimental Results] The abstract and results report concrete performance figures (86% recall, >92% precision) yet supply no experimental protocol, baseline comparisons, error bars, or ablation results on the GNN filtering stage versus the BERT refinement stage. Without these controls it is impossible to determine whether the reported gains arise from the proposed pipeline or from dataset construction choices.
minor comments (2)
  1. [Abstract] The abstract uses the vague phrase 'over 92% precision'; reporting the exact value and confidence interval would improve clarity.
  2. [Model Architecture] Clarify how the GNN node embeddings are passed as input or conditioning to the co-attention ModernBERT model.
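
One way to act on the first minor comment is to report precision with a Wilson score interval. This is a sketch of a standard method, not something the paper is known to use. The function name and the example counts are hypothetical, and the counts are not drawn from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical example: 92 of 100 flagged emails are true attacks.
low, high = wilson_interval(successes=92, trials=100)
```

With only 100 flagged emails the interval spans roughly ten percentage points, which is why an exact value with an interval says more than "over 92%".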

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. The feedback highlights important aspects of validation and experimental rigor that we will address to strengthen the paper. We respond to each major comment below and indicate planned revisions.

  1. Referee: [Dataset and Evaluation] The manuscript augments the Enron dataset with 'realistic synthetic campaigns' but supplies no external anchoring such as comparison to documented real-world incidents, expert labeling of the synthetics, or distribution-shift metrics between synthetic and authentic attack traces. This assumption is load-bearing for the central claim because the headline metrics (86% structural recall and >92% final precision) are measured exclusively on the self-constructed mixture.

    Authors: We agree that stronger validation of the synthetic data would increase confidence in the results. The synthetic campaigns were generated by overlaying documented social engineering tactics (drawn from public reports such as the Verizon DBIR and known phishing/impersonation patterns) onto the real Enron email graph while preserving temporal and structural properties. In the revised manuscript we will add an expanded dataset section that details the exact generation procedure, provides quantitative distribution-shift metrics (e.g., KL divergence on degree, temporal, and linguistic features between synthetic attacks and authentic Enron messages), and cites the specific real-world case studies used to calibrate the synthetics. Direct expert labeling of the full synthetic set or matching to particular confidential incidents is not feasible within academic constraints; however, the added methodological transparency and statistical comparisons will better anchor the evaluation.

    revision: partial

  2. Referee: [Experimental Results] The abstract and results report concrete performance figures (86% recall, >92% precision) yet supply no experimental protocol, baseline comparisons, error bars, or ablation results on the GNN filtering stage versus the BERT refinement stage. Without these controls it is impossible to determine whether the reported gains arise from the proposed pipeline or from dataset construction choices.

    Authors: We acknowledge that the presentation of experimental controls was insufficient. The full manuscript already describes the data splits, hyper-parameters, and evaluation protocol in Section 4, but we agree these details should be more prominent and supplemented with additional analyses. In the revision we will insert a dedicated experimental protocol subsection, add baseline comparisons (standalone inductive GNN, standalone ModernBERT, traditional ML classifiers on hand-crafted features), include ablation tables that isolate the contribution of the structural filter versus the content verifier, and report all headline metrics as mean ± standard deviation across five random seeds. These additions will allow readers to attribute performance gains to the proposed multiphase design rather than dataset artifacts.

    revision: yes
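
The promised "mean ± standard deviation across five random seeds" reporting could look like the sketch below. The per-seed precision values are invented for illustration and the helper name is hypothetical. The sample standard deviation (`statistics.stdev`) is assumed, since the rebuttal does not say which estimator is meant.

```python
import statistics
from typing import Sequence


def summarize_runs(values: Sequence[float]) -> str:
    """Format a metric as 'mean ± sample standard deviation'."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(values)
    spread = statistics.stdev(values)  # sample (n-1) estimator, an assumption
    return f"{mean:.3f} ± {spread:.3f}"


# Hypothetical final precision from five seeds (not results from the paper).
precision_by_seed = [0.91, 0.93, 0.92, 0.94, 0.90]
summary = summarize_runs(precision_by_seed)
```

Applying the same helper to each ablation row (structural filter alone, content verifier alone, both) would give comparable entries for the tables the authors describe.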

standing simulated objections not resolved
  • Direct expert labeling or one-to-one matching of synthetic attacks to specific real-world incidents, which would require access to proprietary or sensitive operational data unavailable for this study.
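
Short of expert labeling, the distribution-shift comparison promised in the first response can be approximated on any discrete feature. The sketch below computes a smoothed KL divergence between two samples of node degrees. The function name, the additive smoothing, and the sample values are assumptions made for illustration, not the authors' procedure.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def kl_divergence(
    p_samples: Sequence[Hashable],
    q_samples: Sequence[Hashable],
    alpha: float = 1.0,
) -> float:
    """KL(P || Q) in nats between empirical distributions with additive smoothing."""
    if not p_samples or not q_samples:
        raise ValueError("both samples must be non-empty")
    support = set(p_samples) | set(q_samples)
    p_counts, q_counts = Counter(p_samples), Counter(q_samples)
    # Smoothing keeps every probability positive so the log is always defined.
    p_total = len(p_samples) + alpha * len(support)
    q_total = len(q_samples) + alpha * len(support)
    divergence = 0.0
    for value in support:
        p = (p_counts[value] + alpha) / p_total
        q = (q_counts[value] + alpha) / q_total
        divergence += p * math.log(p / q)
    return divergence


# Hypothetical out-degrees of sender nodes (invented values).
synthetic_degrees = [1, 1, 2, 2, 3, 8, 9]
authentic_degrees = [1, 2, 2, 3, 3, 3, 4]

shift = kl_divergence(synthetic_degrees, authentic_degrees)
baseline = kl_divergence(authentic_degrees, authentic_degrees)
```

A value of `shift` well above `baseline` would suggest the synthetic campaigns are structurally easy to separate from authentic traffic, which is the concern behind the referee's first major comment.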

Circularity Check

0 steps flagged

No circularity: empirical results on held-out test set with no derivations or self-referential definitions

The manuscript presents an applied two-stage detection pipeline (inductive GNN for structural anomalies followed by co-attention ModernBERT for content verification) and reports performance numbers (86% structural recall, >92% final precision) as measured outcomes on a held-out portion of the Enron dataset after augmentation with synthetic campaigns. No equations, parameter-fitting steps that are then re-labeled as predictions, or self-citation chains appear in the derivation chain. The central claims rest on standard empirical evaluation rather than quantities defined in terms of themselves or reductions by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claims rest on the representativeness of the Enron-plus-synthetic dataset and on standard but unstated machine-learning assumptions about generalization from the test distribution.

free parameters (1)
  • GNN and BERT training hyperparameters
    Learning rates, layer sizes, attention heads, and regularization coefficients are necessarily chosen or tuned during model development.
axioms (1)
  • domain assumption: The mixture of real Enron emails and researcher-generated synthetic attack campaigns is statistically representative of live organizational email traffic and real social-engineering attempts.
    All reported recall and precision figures are measured on this augmented collection.


