Validating Threat Modeling Results with the Help of Vulnerable Test Applications

Davide Fucci; Felix Viktor Jedrzejewski; Nishrith Saini; Oleksandr Adamov; Ricardo Britto

arxiv: 2605.23695 · v1 · pith:4X7A7RDYnew · submitted 2026-05-22 · 💻 cs.CR

Oleksandr Adamov, Davide Fucci, Felix Viktor Jedrzejewski, Ricardo Britto, Nishrith Saini

Pith reviewed 2026-05-25 04:05 UTC · model grok-4.3

classification 💻 cs.CR

keywords threat modeling · vulnerability coverage · validation benchmarks · LLM-assisted tools · test applications · security analysis · data flow diagrams

The pith

An LLM-assisted threat modeling approach discovers more of the known vulnerabilities than a conventional tool when tested on applications with documented security issues.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Validating threat modeling results is hard because there is usually no complete external check on whether all relevant issues have been found. This paper tests a complementary method that feeds architecture diagrams and data flow descriptions into modeling tools and then counts how many of the known vulnerabilities in intentionally vulnerable applications each tool surfaces. The evaluation shows the LLM-assisted solution recovers a larger share of those documented issues than the standard tool across both systems examined. The approach supplies a repeatable, vulnerability-grounded benchmark that can be used alongside expert reference models.

Core claim

Vulnerable test applications equipped with documented vulnerability sets can serve as practical oracles for measuring threat modeling coverage, and an LLM-assisted solution achieves higher coverage than a conventional tool when both receive only architecture and data flow inputs.

What carries the argument

Intentionally vulnerable applications with documented vulnerability sets that function as ground truth for calculating the fraction of issues surfaced by threat modeling tools.

If this is right

Threat modeling outputs can be scored quantitatively by the proportion of documented vulnerabilities they recover.
LLM assistance can increase the number of relevant vulnerabilities identified from the same diagram inputs.
Validation becomes possible with standard architecture descriptions rather than full source code or expert consensus.
The benchmark method reduces dependence on potentially incomplete human reference models.
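
The scoring idea in the points above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `coverage` and the identifiers `V1`..`V4` and `X9` are placeholders invented for the example, and the sets do not reproduce any of the paper's data.

```python
from typing import Iterable


def coverage(documented: Iterable[str], surfaced: Iterable[str]) -> float:
    """Fraction of documented vulnerabilities that a tool's threat model surfaced."""
    documented_set = set(documented)
    if not documented_set:
        raise ValueError("the documented vulnerability set must not be empty")
    # Only hits against the documented set count; threats outside it are ignored.
    return len(documented_set & set(surfaced)) / len(documented_set)


# Placeholder identifiers (assumption): these are not the paper's results.
oracle = {"V1", "V2", "V3", "V4"}
tool_a = {"V1", "V2", "V3", "X9"}  # X9 is a threat with no match in the oracle
tool_b = {"V1"}

print(coverage(oracle, tool_a))  # 0.75
print(coverage(oracle, tool_b))  # 0.25
```

Note that this score says nothing about threats a tool reports that fall outside the documented set; how such extra findings should be handled is a separate design decision.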

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same test applications could be reused to compare additional threat modeling tools in a consistent way.
Expanding the documented vulnerability lists in the test applications would make the coverage metric more demanding.
Explicit tracing from each surfaced threat back to specific vulnerabilities could expose which classes of issues the tools still miss.
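
One way the tracing idea could look in practice is sketched below, assuming each surfaced threat has already been mapped by hand to the documented vulnerabilities it corresponds to. The function `missed_by_class`, the vulnerability classes, and all identifiers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def missed_by_class(oracle: dict[str, str], traces: dict[str, set[str]]) -> dict[str, list[str]]:
    """Group documented vulnerabilities that no threat traces to by their class.

    oracle: vulnerability id -> vulnerability class
    traces: threat id -> ids of the documented vulnerabilities it maps to
    """
    covered: set[str] = set()
    for vuln_ids in traces.values():
        covered |= set(vuln_ids)

    missed: dict[str, list[str]] = defaultdict(list)
    for vuln_id, vuln_class in oracle.items():
        if vuln_id not in covered:
            missed[vuln_class].append(vuln_id)
    return {vuln_class: sorted(ids) for vuln_class, ids in missed.items()}


# Hypothetical oracle and trace data (assumption), not taken from the paper.
oracle = {
    "V1": "injection",
    "V2": "injection",
    "V3": "access control",
    "V4": "misconfiguration",
}
traces = {"T1": {"V1"}, "T2": {"V3"}, "T3": set()}  # T3 maps to no documented issue

print(missed_by_class(oracle, traces))
```

A threat such as `T3` that maps to nothing is kept in the trace table on purpose, so unmatched threats stay visible instead of silently disappearing.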

Load-bearing premise

The documented vulnerability sets in the test applications are complete and can be accurately mapped to threats that the modeling tools should identify from the supplied architecture and data flow diagrams.

What would settle it

Discovery of additional exploitable vulnerabilities in the test applications that lie outside the documented sets, or demonstration that some listed vulnerabilities have no corresponding threat derivable from the input diagrams.
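
Both scenarios can be written as changes to a simple coverage ratio. The notation below is introduced here for illustration and is not taken from the paper: D is the documented vulnerability set, S is the set of vulnerabilities matched by a tool's threats, and k, j, m, h are hypothetical counts.

```latex
% Baseline coverage of a tool against the documented set D.
C = \frac{|D \cap S|}{|D|}

% Scenario 1: k further vulnerabilities are found outside D,
% of which the tool's threats match j (0 \le j \le k).
C' = \frac{|D \cap S| + j}{|D| + k}

% Scenario 2: m documented vulnerabilities turn out not to be derivable
% from the input diagrams, h of which had been counted as hits
% (0 \le h \le m < |D|).
C'' = \frac{|D \cap S| - h}{|D| - m}
```

Under these definitions, C' < C whenever j/k < C, and C'' > C whenever h/m < C. With made-up numbers, a tool with 6 hits out of 10 documented vulnerabilities drops from 0.6 to 0.4 if 5 new vulnerabilities are found and it matches none of them.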

Figures

Figures reproduced from arXiv: 2605.23695 by Davide Fucci, Felix Viktor Jedrzejewski, Nishrith Saini, Oleksandr Adamov, Ricardo Britto.

**Figure 2.** Discovered vulnerabilities in VulnBank by ThreMo

Original abstract

Validating threat modeling results remains difficult because completeness is hard to judge without an external oracle. Existing studies often rely on expert-produced reference models and other human baselines, but these can contain omissions or disagreements. This paper evaluates a complementary, vulnerability-grounded validation approach. We apply threat modeling to intentionally vulnerable applications with a known vulnerability set to measure the number of related vulnerabilities that can be discovered. We compare ThreMoLIA, an LLM-assisted threat modeling solution developed by our team, with the Microsoft Threat Modeling Tool (MTMT) across two vulnerable applications: AzureGoat and the Vulnerable Bank Application (VulnBank). The inputs to both tools are limited to architecture, data flow diagrams, and their descriptions. The results show that ThreMoLIA achieved higher vulnerability coverage on both systems. We show that vulnerable test applications provide a practical benchmark for assessing threat coverage and complement expert-based validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a workable coverage metric for threat models by running them on two known-vulnerable apps and counting hits against documented issues, with their tool ahead of MTMT, but the oracle needs explicit checks that every counted vuln is visible in the diagrams.

The main thing to know is that the authors test threat modeling by feeding architecture and data-flow diagrams from AzureGoat and VulnBank into both ThreMoLIA and the Microsoft tool, then count how many of the apps' documented vulnerabilities each one surfaces. ThreMoLIA covers more in both cases. This gives a number instead of another expert opinion, which is the practical step they add. The comparison itself is new relative to the expert-baseline papers they cite. It is a clean way to get repeatable results on real applications without inventing new test cases. The method is limited to two apps and stays within the inputs the tools actually receive, which keeps the setup honest.

The soft spot is the oracle. The stress-test note is on target: if some of the counted vulnerabilities require code inspection or details not present in the supplied diagrams and descriptions, the coverage numbers do not measure what the tools can actually derive from the stated inputs. The abstract gives no mapping procedure or filter that confirms every vulnerability is inferable from the high-level material alone. That needs to be shown in the full methods. Two applications also leave the result suggestive rather than robust.

This is for people who build or evaluate threat-modeling tools and want an empirical complement to expert review. A reader working on validation methods or LLM-assisted modeling would get a usable idea from it. The work is coherent enough on its own terms to go to peer review; the referee would mainly press on the oracle construction and the mapping details.

Referee Report

1 major / 1 minor

Summary. The paper proposes validating threat modeling outputs via an oracle of known vulnerabilities in intentionally vulnerable applications (AzureGoat and VulnBank). Inputs to both ThreMoLIA (the authors' LLM-assisted tool) and the Microsoft Threat Modeling Tool (MTMT) are restricted to architecture diagrams, data-flow diagrams, and descriptions. The central claims are that ThreMoLIA achieves higher vulnerability coverage than MTMT on both systems and that vulnerable test applications constitute a practical, objective benchmark that complements expert reference models.

Significance. If the oracle vulnerabilities can be shown to be inferable from the supplied high-level inputs, the approach supplies a reproducible, falsifiable metric for threat coverage that reduces dependence on subjective expert baselines. The use of pre-existing vulnerable applications is a concrete strength, as it avoids the need to construct new oracles from scratch and enables direct comparison across tools.

major comments (1)

[Evaluation / Results] The headline coverage comparison treats the documented vulnerability sets of AzureGoat and VulnBank as ground truth for what threats should be surfaced from the architecture and DFD inputs. No mapping procedure, inclusion/exclusion criteria, or verification that each counted vulnerability is derivable from those inputs (rather than from code-level details invisible in the diagrams) is described; without this, the reported coverage numbers do not establish tool performance on the stated inputs.

minor comments (1)

[Abstract and methods] Quantitative details on coverage (exact counts, false-positive handling, statistical comparison) and the precise procedure for associating vulnerabilities with modeled threats are absent; these should be supplied to allow readers to assess the strength of the comparison.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive critique of our evaluation methodology. The point raised is valid and we will revise the manuscript to address it directly.

Referee: The headline coverage comparison treats the documented vulnerability sets of AzureGoat and VulnBank as ground truth for what threats should be surfaced from the architecture and DFD inputs. No mapping procedure, inclusion/exclusion criteria, or verification that each counted vulnerability is derivable from those inputs (rather than from code-level details invisible in the diagrams) is described; without this, the reported coverage numbers do not establish tool performance on the stated inputs.

Authors: We agree that the current manuscript does not provide an explicit mapping procedure or verification step showing that each counted vulnerability is inferable from the supplied architecture diagrams, DFDs, and descriptions alone. In the revised version we will add a dedicated subsection (likely 4.3 or equivalent) that, for every vulnerability in the AzureGoat and VulnBank oracle sets, lists the specific diagram elements or textual descriptions from which the threat could reasonably be identified. We will also state the inclusion criterion (vulnerability must be detectable from high-level design artifacts without code inspection) and exclusion criterion (vulnerabilities requiring implementation details are noted but not counted toward coverage). This addition will make the coverage metric reproducible and directly tied to the inputs used by both tools. revision: yes
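
The inclusion and exclusion criteria described in this response could be applied mechanically once each oracle entry is annotated with its design-level evidence. The sketch below assumes such annotations exist; `OracleEntry`, `design_evidence`, `filtered_coverage`, and all sample values are hypothetical names and data, not the authors' procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OracleEntry:
    vuln_id: str
    # Diagram elements or description passages from which the threat could be
    # identified; left empty when code inspection would be required.
    design_evidence: tuple[str, ...] = ()


def filtered_coverage(entries: list[OracleEntry], surfaced: set[str]) -> tuple[float, list[str]]:
    """Return coverage over design-derivable entries and the ids excluded from scoring."""
    included = [entry for entry in entries if entry.design_evidence]
    excluded = [entry.vuln_id for entry in entries if not entry.design_evidence]
    if not included:
        raise ValueError("no oracle entry satisfies the inclusion criterion")
    hits = sum(1 for entry in included if entry.vuln_id in surfaced)
    return hits / len(included), excluded


# Hypothetical annotations (assumption), for illustration only.
entries = [
    OracleEntry("V1", ("DFD flow: browser -> API",)),
    OracleEntry("V2", ("description: storage container is public",)),
    OracleEntry("V3"),  # would need implementation details, so noted but not counted
]
surfaced = {"V1", "V3"}

print(filtered_coverage(entries, surfaced))  # (0.5, ['V3'])
```

Returning the excluded ids alongside the score keeps the vulnerabilities that require implementation details visible, matching the stated intent that they are noted but not counted.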

Circularity Check

0 steps flagged

No circularity; evaluation uses external pre-existing oracles

The paper performs an empirical comparison of ThreMoLIA (authors' LLM-assisted tool) versus MTMT on two external vulnerable applications (AzureGoat, VulnBank) whose documented vulnerability sets serve as the oracle. Inputs to both tools are restricted to architecture diagrams, data-flow diagrams, and descriptions. No equations, fitted parameters, or derivations appear in the provided text. The coverage metric is computed directly from matches to the external oracle rather than from any quantity defined in terms of the authors' prior models or self-citations. The central result therefore does not reduce to its own inputs by construction; the benchmark is independent of the present work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The validation approach depends on the assumption that the test applications supply an external, complete oracle of vulnerabilities; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: The documented vulnerability sets of AzureGoat and VulnBank constitute a complete ground truth against which threat-model outputs can be scored. The coverage metric is only meaningful if this oracle is accepted as exhaustive and correctly aligned with the modeling inputs.

Reference graph

Works this paper leans on

13 extracted references

[1] A. Shostack, Threat Modeling: Designing for Security. Wiley, 2014.

[2] W. Xiong and R. Lagerström, “Threat modeling – a systematic literature review,” Computers & Security, vol. 84, pp. 53–69, 2019.

[3] K. Yskout, T. Heyman, D. V. Landuyt, L. Sion, K. Wuyts, and W. Joosen, “Threat modeling: From infancy to maturity,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results, 2020, pp. 9–12.

[4] F. V. Jedrzejewski, D. Fucci, and O. Adamov, “ThreMoLIA: Threat modeling of large language model-integrated applications,” pp. 505–507, 2025.

[5] R. Scandariato, K. Wuyts, and W. Joosen, “A descriptive study of Microsoft’s threat modeling technique,” Requirements Engineering, vol. 20, no. 2, pp. 163–180, 2015.

[6] Microsoft, “Microsoft threat modeling tool overview,” Microsoft Learn, 2026, accessed: 2026-03-30. [Online]. Available: https://learn.microsoft.com/en-us/azure/security/develop/threat-modeling-tool

[7] INE Labs, “AzureGoat: A damn vulnerable azure infrastructure,” GitHub repository, 2026, accessed: 2026-03-30. [Online]. Available: https://github.com/ine-labs/AzureGoat

[8] Commando-X, “Vulnerable bank application,” GitHub repository, 2026, accessed: 2026-03-30. [Online]. Available: https://github.com/Commando-X/vuln-bank

[9] K. Wuyts, R. Scandariato, and W. Joosen, “Empirical evaluation of a privacy-focused threat modeling methodology,” Journal of Systems and Software, vol. 96, pp. 122–138, 2014.

[10] D. V. Landuyt, M. Mollaeefar, M. Raciti, S. Verreydt, A. Kalash, A. Bissoli, D. Preuveneers, G. Bella, and S. Ranise, “A comparative benchmark study of LLM-based threat elicitation tools,” Future Generation Computer Systems, vol. 177, p. 108243, 2026.

[11] D. Granata and M. Rak, “Systematic analysis of automated threat modelling techniques: Comparison of open-source tools,” Software Quality Journal, vol. 32, pp. 125–161, 2024.

[12] OWASP, “OWASP Threat Dragon Documentation,” Project documentation, 2026, accessed: 2026-03-30. [Online]. Available: https://www.threatdragon.com/docs/

[13] K. Tuma, C. Sandberg, U. Thorsson, M. Widman, T. Herpel, and R. Scandariato, “Finding security threats that matter: Two industrial case studies,” Journal of Systems and Software, vol. 179, p. 111003, 2021.

Appendix A: Detailed vulnerability coverage tables

Table II: AzureGoat detailed threat coverage.
Vulnerability | ThreMoLIA | MTMT
XSS | Discovered | Missed
...
