Evaluating Assurance Cases as Text-Attributed Graphs for Structure and Provenance Analysis
Pith reviewed 2026-05-09 23:48 UTC · model grok-4.3
The pith
Assurance cases modeled as text-attributed graphs let graph neural networks predict argument links and detect LLM authorship.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Assurance cases are turned into text-attributed graphs so that graph neural networks can perform link prediction at ROC-AUC 0.760 on real cases while generalizing across domains and semi-supervised regimes. The same models classify human-authored versus LLM-generated cases at F1 0.94 and expose distinct hierarchical linking patterns in the LLM outputs. Existing GNN explanation methods align only moderately with the true argument structure.
What carries the argument
Text-attributed graphs of assurance cases, with nodes carrying text descriptions of argument elements and edges encoding relationships, fed to graph neural networks for link prediction and provenance classification.
If this is right
- Link prediction models can help complete or audit the logical connections inside assurance cases.
- Provenance classification can flag LLM-generated cases for extra human review in regulated safety work.
- The observed difference in hierarchical patterns indicates current LLMs produce more uniform argument structures.
- Moderate faithfulness of GNN explanations shows a remaining gap between model reasoning and actual argument flow.
Where Pith is reading between the lines
- The same graph treatment could be tried on other regulated documents such as legal or medical arguments to spot AI assistance.
- Prompt designers might use the linking-pattern difference to make future LLM outputs closer to human variety.
- If adopted at scale the method could support standards that require visible human oversight of AI-drafted safety cases.
Load-bearing premise
Turning assurance cases into graphs with text on nodes keeps the essential semantic links and origin signals intact.
What would settle it
A new collection of LLM-generated assurance cases deliberately prompted to copy human hierarchical linking statistics, tested to see whether classification F1 drops well below 0.94.
Figures
read the original abstract
An assurance case is a structured argument document that justifies claims about a system's requirements or properties, which are supported by evidence. In regulated domains, these are crucial for meeting compliance and safety requirements to industry standards. We propose a graph diagnostic framework for analysing the structure and provenance of assurance cases. We focus on two main tasks: (1) link prediction, to learn and identify connections between argument elements, and (2) graph classification, to differentiate between assurance cases created by a state-of-the-art large language model and those created by humans, aiming to detect bias. We compiled a publicly available dataset of assurance cases, represented as graphs with nodes and edges, supporting both link prediction and provenance analysis. Experiments show that graph neural networks (GNNs) achieve strong link prediction performance (ROC-AUC 0.760) on real assurance cases and generalise well across domains and semi-supervised settings. For provenance detection, GNNs effectively distinguish human-authored from LLM-generated cases (F1 0.94). We observed that LLM-generated assurance cases have different hierarchical linking patterns compared to human-authored cases. Furthermore, existing GNN explanation methods show only moderate faithfulness, revealing a gap between predicted reasoning and the true argument structure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes representing assurance cases as text-attributed graphs to support structure and provenance analysis via graph neural networks. It introduces a public dataset and evaluates two tasks: link prediction on real assurance cases (ROC-AUC 0.76, with generalization across domains and semi-supervised settings) and binary graph classification to distinguish human-authored from LLM-generated cases (F1 0.94). The authors report that LLM-generated cases exhibit different hierarchical linking patterns and that existing GNN explanation methods show only moderate faithfulness to the underlying argument structure.
Significance. If the central empirical claims hold after addressing methodological gaps, the work would be significant for regulated software engineering domains where assurance cases are mandatory. It provides an automated, graph-based approach to detect potential provenance issues and structural differences, along with a publicly available dataset that could enable further research. The reported generalization and the observation of linking pattern differences are strengths, though their reliability hinges on unbiased dataset construction.
major comments (3)
- [Dataset compilation] Dataset construction (Section on dataset compilation): The process for creating the LLM-generated assurance cases is described at too high a level. Specifics on the LLM model, prompt templates, temperature, sampling strategy, and any post-processing or filtering steps are absent. This is load-bearing for the provenance classification result (F1 0.94), because the GNN could be learning synthesis artifacts (e.g., shallower or more uniform trees) rather than genuine human vs. LLM structural differences.
- [Representation as text-attributed graphs] Graph construction details (Section on representation as text-attributed graphs): The conversion of assurance cases into nodes, edges, and text attributes is not specified in sufficient detail (e.g., how implicit links are encoded, what text is attached to nodes/edges, embedding method, or handling of hierarchical vs. cross-reference edges). Without this, it is impossible to assess whether the reported link-prediction ROC-AUC of 0.76 and classification performance truly reflect preserved semantic and provenance signals.
- [Experiments] Experimental evaluation (Experiments section): The manuscript reports concrete performance numbers but omits baselines, statistical significance tests, error analysis, dataset size/split details, and ablation studies. These omissions make it difficult to verify the claims of strong performance and cross-domain generalization for both tasks.
minor comments (2)
- [Abstract] The abstract states that GNNs 'generalise well across domains' but provides no quantitative breakdown by domain or explicit list of domains represented in the dataset.
- Notation for graph components (nodes, edges, attributes) and the precise definition of 'hierarchical linking patterns' could be introduced earlier and used consistently to improve readability for readers outside the GNN community.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We believe the suggested clarifications will improve the paper's reproducibility and clarity. We address each major comment below.
read point-by-point responses
-
Referee: [Dataset compilation] Dataset construction (Section on dataset compilation): The process for creating the LLM-generated assurance cases is described at too high a level. Specifics on the LLM model, prompt templates, temperature, sampling strategy, and any post-processing or filtering steps are absent. This is load-bearing for the provenance classification result (F1 0.94), because the GNN could be learning synthesis artifacts (e.g., shallower or more uniform trees) rather than genuine human vs. LLM structural differences.
Authors: We concur that greater specificity is required here to substantiate the provenance classification results and mitigate concerns over potential artifacts. Accordingly, the revised manuscript will provide comprehensive details on the LLM generation process, including the model (GPT-4), complete prompt templates (to be included in an appendix), temperature (set to 0.7), sampling method (top-p sampling with p=0.9), and post-processing (automatic filtering for valid JSON structure followed by manual review for argument coherence). We will also include comparative statistics on graph properties such as average depth and branching factor to demonstrate that observed differences reflect substantive variations rather than superficial synthesis traits. revision: yes
-
Referee: [Representation as text-attributed graphs] Graph construction details (Section on representation as text-attributed graphs): The conversion of assurance cases into nodes, edges, and text attributes is not specified in sufficient detail (e.g., how implicit links are encoded, what text is attached to nodes/edges, embedding method, or handling of hierarchical vs. cross-reference edges). Without this, it is impossible to assess whether the reported link-prediction ROC-AUC of 0.76 and classification performance truly reflect preserved semantic and provenance signals.
Authors: We appreciate this observation and will enhance the representation section with precise specifications. Nodes correspond to individual argument elements (e.g., claims, strategies, evidence), each attributed with its original textual content. Edges encode 'supportedBy' relations for the hierarchical argument structure and 'inContextOf' for cross-references. Implicit links are derived directly from the assurance case's documented relationships. Text attributes are embedded using the all-MiniLM-L6-v2 sentence transformer model. The revised text will include a step-by-step description and pseudocode for the graph construction pipeline, distinguishing hierarchical from cross-reference edges. revision: yes
-
Referee: [Experiments] Experimental evaluation (Experiments section): The manuscript reports concrete performance numbers but omits baselines, statistical significance tests, error analysis, dataset size/split details, and ablation studies. These omissions make it difficult to verify the claims of strong performance and cross-domain generalization for both tasks.
Authors: We agree that these methodological details are important for validating our empirical claims. In the updated Experiments section, we will incorporate: baseline methods such as random guessing, feature-based classifiers without graph structure, and traditional ML models; statistical significance testing using bootstrap resampling or t-tests with reported p-values; qualitative error analysis highlighting representative failure cases for both tasks; explicit dataset statistics including the number of graphs (120 human-authored and 120 LLM-generated for classification, with domain-specific counts), and the train/validation/test splits (70%/15%/15%); and ablation experiments varying the use of text embeddings and edge types. New tables and figures will present these results to support the reported performance and generalization. revision: yes
Circularity Check
No circularity: empirical ML evaluation on compiled dataset
full rationale
The paper describes compiling a dataset of assurance cases represented as text-attributed graphs, then applying GNNs to perform link prediction (ROC-AUC 0.760) and binary classification between human and LLM-generated cases (F1 0.94). All reported results are standard experimental metrics obtained by training models on external data splits and evaluating on held-out examples. No equations, self-definitional relations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or description. The derivation chain consists of data construction followed by independent model training and evaluation, with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Assurance cases can be losslessly represented as text-attributed graphs where nodes are argument elements and edges capture logical connections.
- domain assumption GNNs trained on these graphs can generalize across domains and in semi-supervised settings.
Reference graph
Works this paper leans on
-
[1]
Adelard. 2024. Claims-Arguments-Evidence (CAE). https://www.adelard.com/ asce/cae. Accessed July 2025
work page 2024
-
[2]
Chirag Agarwal, Owen Queen, Himabindu Lakkaraju, and Marinka Zitnik. 2023. Evaluating explainability for graph neural networks.Scientific Data10, 1 (2023), 144
work page 2023
-
[3]
Ankit Agrawal, Seyedehzahra Khoshmanesh, Michael Vierhauser, Mona Rahimi, Jane Cleland-Huang, and Robyn Lutz. 2019. Leveraging artifact trees to evolve and reuse safety cases. InProceedings of the 41st International Conference on Software Engineering(Montreal, Quebec, Canada)(ICSE ’19). IEEE Press, New York, NY, USA, 1222–1233. doi:10.1109/ICSE.2019.00124...
-
[4]
Alexander Ahlbrecht, Jasper Sprockhoff, and Umut Durak. 2024. A system- theoretic assurance framework for safety-driven systems engineering: A System- Theoretic Assurance Framework...Softw. Syst. Model.24, 1 (Sept. 2024), 253–270. doi:10.1007/s10270-024-01209-6
- [5]
-
[6]
Ewen Denney, Ganesh Pai, and Ibrahim Habli. 2015. Dynamic safety cases for through-life safety assurance. InProceedings of the 37th International Conference on Software Engineering - Volume 2(Florence, Italy)(ICSE ’15). IEEE Press, New York, NY, USA, 587–590
work page 2015
-
[7]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy...
work page 2019
- [8]
-
[9]
Matthias Fey, Jinu Sunil, Akihiro Nitta, Rishi Puri, Manan Shah, Blaž Stojanovič, Ramona Bendias, Alexandria Barghi, Vid Kocijan, Zecheng Zhang, et al . 2025. Pyg 2.0: Scalable learning on real world graphs, In Temporal Graph Learning Workshop @ KDD.arXiv e-prints, arXiv–2507
work page 2025
-
[10]
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for Quantum chemistry. InProceedings of the 34th International Conference on Machine Learning - Volume 70(Sydney, NSW, Australia)(ICML’17). JMLR.org, 1263–1272
work page 2017
-
[11]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models.arXiv preprint arXiv:2407.21783 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Provider of Services for Urban Air Mobility (PSU) Prototype Simulation (X5) Final Report,
Mallory S. Graydon and Sarah M. Lehman. 2025.Examining Proposed Uses of LLMs to Produce or Assess Assurance Arguments. Technical Memorandum (TM) 20250001849. Langley Research Center, NASA. https://ntrs.nasa.gov/api/ citations/20250001849/downloads/NASA-TM-20250001849.pdf Public release; work of the U.S. Government
-
[13]
Hamilton, Rex Ying, and Jure Leskovec
William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. InProceedings of the 31st International Conference on Neural Information Processing Systems(Long Beach, California, USA)(NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 1025–1035
work page 2017
-
[14]
Yufei He, Yuan Sui, Xiaoxin He, and Bryan Hooi. 2025. UniGraph: Learning a Uni- fied Cross-Domain Foundation Model for Text-Attributed Graphs. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 (Toronto ON, Canada)(KDD ’25). Association for Computing Machinery, New York, NY, USA, 448–459. doi:10.1145/3690624.3709277
-
[15]
Yuntong Hu, Zhihan Lei, Zheng Zhang, Bo Pan, Chen Ling, and Liang Zhao. 2025. GRAG: Graph Retrieval-Augmented Generation. InFindings of the Association for Computational Linguistics: NAACL 2025, Luis Chiruzzo, Alan Ritter, and Lu Wang (Eds.). Association for Computational Linguistics, Albuquerque, New Mexico, 4145–4157. https://aclanthology.org/2025.findi...
work page 2025
-
[16]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [17]
-
[18]
Bowen Jin, Gang Liu, Chi Han, Meng Jiang, Heng Ji, and Jiawei Han. 2024. Large Language Models on Graphs: A Comprehensive Survey.IEEE Transactions on Knowledge and Data Engineering36, 12 (2024), 8622–8642. doi:10.1109/TKDE. 2024.3469578
-
[19]
Tim Kelly and Rob Weaver. 2004. The goal structuring notation–a safety argument notation. InProceedings of the dependable systems and networks 2004 workshop on assurance cases, Vol. 6. Citeseer Princeton, NJ
work page 2004
-
[20]
1999.Arguing safety: a systematic approach to managing safety cases
Timothy Patrick Kelly et al . 1999.Arguing safety: a systematic approach to managing safety cases. Ph. D. Dissertation. University of York York, UK
work page 1999
-
[21]
Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. https://openreview.net/forum?id=SJU4ayYgl
work page 2017
-
[22]
Zemin Liu, Xingtong Yu, Yuan Fang, and Xinming Zhang. 2023. GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks. In Proceedings of the ACM Web Conference 2023(Austin, TX, USA)(WWW ’23). Association for Computing Machinery, New York, NY, USA, 417–428. doi:10. 1145/3543507.3583386
-
[23]
Yihan Ma, Zhikun Zhang, Ning Yu, Xinlei He, Michael Backes, Yun Shen, and Yang Zhang. 2023. Generated graph detection. InInternational Conference on Machine Learning. PMLR, 23412–23428
work page 2023
-
[24]
Mazen Mohamad, Jan-Philipp Steghöfer, and Riccardo Scandariato. 2021. Security assurance cases—state of the art of an emerging approach.Empirical software engineering26, 4 (2021), 70
work page 2021
-
[25]
Anitha Murugesan, Isaac Wong, Joaquín Arias, Robert Stroud, Srivatsan Varadara- jan, Elmer Salazar, Gopal Gupta, Robin Bloomfield, and John Rushby. 2024. Au- tomating semantic analysis of system assurance cases using goal-directed ASP. Theory and Practice of Logic Programming24, 4 (2024), 805–824
work page 2024
-
[26]
Joakim Nivre. 2010. Dependency parsing.Language and Linguistics Compass4, 3 (2010), 138–152
work page 2010
-
[27]
Belle, Song Wang, Segla Kpodjedo, Timothy C
Oluwafemi Odu, Alvine B. Belle, Song Wang, Segla Kpodjedo, Timothy C. Leth- bridge, and Hadi Hemmati. 2025. Automatic instantiation of assurance cases from patterns using large language models.Journal of Systems and Software222 (2025), 112353. doi:10.1016/j.jss.2025.112353
-
[28]
Ross, Mark Winstead, and Michael McEvilley
Ronald S. Ross, Mark Winstead, and Michael McEvilley. 2022. Engineering Trustworthy Secure Systems. doi:10.6028/NIST.SP.800-160v1r1
-
[29]
2015.Un- derstanding and evaluating assurance cases
John Rushby, Xidong Xu, Murali Rangarajan, and Thomas L Weaver. 2015.Un- derstanding and evaluating assurance cases. Technical Report. Langley Research Center, National Aeronautics and Space Administration
work page 2015
- [30]
-
[31]
Mithila Sivakumar, Alvine B Belle, Jinjun Shan, and Kimya Khakzad Shahandashti
-
[32]
Prompting GPT–4 to support automatic safety case generation.Expert Systems with Applications255 (2024), 124653
work page 2024
-
[33]
Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https://openreview. net/forum?id=rJXMpikCZ
work page 2018
-
[34]
Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019. KGAT: Knowledge Graph Attention Network for Recommendation. InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining(Anchorage, AK, USA)(KDD ’19). Association for Computing Machinery, New York, NY, USA, 950–958. doi:10.1145/3292500.3330989
-
[35]
Francis Rhys Ward and Ibrahim Habli. 2020. An assurance case pattern for the interpretability of machine learning in safety-critical systems. InInternational Conference on Computer Safety, Reliability, and Security. Springer, Springer, 395– 407
work page 2020
-
[36]
Charles B. Weinstock, Howard F. Lipson, and John B. Goodenough. 2007.Arguing Security: Creating Security Assurance Cases. Technical Report. Software Engi- neering Institute, Carnegie Mellon University. https://www.sei.cmu.edu/library/ arguing-security-creating-security-assurance-cases/ Accessed: 2025-10-09
work page 2007
-
[37]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, et al. 2020. Transformers: State-of-the-Art Natural Language Processing. InPro- ceedings of the 2020 Conference on Empirical Methods in Natural Language Process- ing: System Demonstrations, Qun Liu and David Schlangen (Eds.). Association for Computational Linguistics, Online, 38–...
-
[38]
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. 2020. A comprehensive survey on graph neural networks.IEEE transactions on neural networks and learning systems32, 1 (2020), 4–24
work page 2020
-
[39]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Cheng- peng Li, Chengyuan Li, Dayiheng Liu, et al. 2024. Qwen2 Technical Report.arXiv preprint arXiv:2407.10671(2024)
work page internal anchor Pith review arXiv 2024
-
[40]
Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. InProceedings of the Thirty-Third AAAI Conference on Arti- ficial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence(Honolulu, Hawaii, USA)...
-
[41]
Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. InProceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. Association for Computing Machinery, New York, NY, USA, 974–983
work page 2018
-
[42]
Zhitao Ying, Dylan Bourgeois, Jiaxuan You, Marinka Zitnik, and Jure Leskovec
-
[43]
Gnnexplainer: Generating explanations for graph neural networks.Ad- vances in neural information processing systems32 (2019)
work page 2019
-
[44]
Marinka Zitnik, Monica Agrawal, and Jure Leskovec. 2018. Modeling polyphar- macy side effects with graph convolutional networks.Bioinformatics34, 13 (06 2018), i457–i466. doi:10.1093/bioinformatics/bty294
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.