GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation
Pith reviewed 2026-06-30 20:21 UTC · model grok-4.3
The pith
Workflow diagrams serve as executable specifications whose contracts can be formally proof-checked before runtime use in agentic AI systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GraphFlow treats workflow diagrams as the executable specification, a single artifact defining data scope, execution semantics, and monitoring. At compile time, a restricted class of diagrams is specified to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library. At runtime, a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit. Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions.
What carries the argument
The restricted class of diagrams compiled into reusable automations with proof-checkable contracts that encode preconditions, postconditions, and composition obligations.
If this is right
- Reusable automations with verified contracts can be admitted to a shared library for safe composition.
- Runtime contract enforcement at system boundaries limits error propagation in multi-step processes.
- Append-only event logs enable replay, retries, and full audits of execution history.
- Swimlanes separate verified logic from external integrations, human judgment, and AI decisions.
Where Pith is reading between the lines
- The separation of verified and unverified regions via swimlanes could support incremental rollout where only parts of a workflow are formally checked.
- Shifting correctness from inference-time planning to compile-time contract checking might reduce sensitivity to prompt changes in agentic components.
- The model of diagram-as-specification could be tested for transfer to non-clinical domains with similar needs for auditability.
Load-bearing premise
A restricted class of diagrams exists for which contracts can be formally proof-checked in practice.
What would settle it
Implementation of the formal semantics and proof checker on at least one non-trivial diagram from the restricted class, followed by an attempt to admit or reject it based on contract verification results.
Figures
read the original abstract
GraphFlow is a visual workflow system designed to improve the reliability of agentic AI automation in multi-step, mission-critical processes. In these workflows, small errors compound rapidly: under an idealized model of independent steps, a ten-step process with 90% per-step reliability completes successfully only 35% of the time. Existing workflow platforms provide durable execution and observability but offer few semantic correctness guarantees, while agentic systems plan at inference time, making behavior sensitive to prompt variation and difficult to audit. GraphFlow is designed to address this gap by treating workflow diagrams as the executable specification, a single artifact defining data scope, execution semantics, and monitoring. At compile time, a restricted class of diagrams is specified to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library. At runtime, a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit. Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions. A year-long pilot across three clinical sites executed 8,728 cohort-enrolled workflow runs with a 97.08% completion rate under an early prototype without the verified-core subsystem; observed failures were localized primarily to external integrations. The formal semantics and proof-checked admission model described here are specified and under active development. Evaluation of the verified core is reserved for future work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GraphFlow, a visual workflow architecture for reliable agentic AI automation in which diagrams serve as the executable specification defining data scope, execution semantics, and monitoring. A restricted class of diagrams is intended to yield reusable automations whose contracts (preconditions, postconditions, composition obligations) are proof-checked at compile time before library admission; runtime enforcement occurs via an append-only event log. Swimlanes separate trust boundaries. A year-long pilot across three clinical sites executed 8,728 runs with 97.08% completion under an early prototype lacking the verified-core subsystem; the formal semantics and proof-checked admission model are specified but under active development, with evaluation of the verified core reserved for future work.
Significance. If the compile-time proof-checking model can be realized and evaluated, the architecture could provide stronger semantic correctness guarantees than existing durable-execution platforms or inference-time agentic planners, particularly for mission-critical multi-step processes where error compounding is severe. The pilot data on the non-verified prototype offers preliminary evidence of practical deployment in clinical settings, but the formal-verification contribution remains unevaluated.
major comments (1)
- [Abstract] Abstract: The central claim that contracts are 'intended to be proof-checked before admission to a shared library' and that the system enables 'formally verifiable' workflows is not supported by any evaluation. The reported pilot (97.08% completion on 8,728 runs) explicitly used an early prototype without the verified-core subsystem, and the abstract states that 'evaluation of the verified core is reserved for future work.' This renders the load-bearing formal-verification contribution unevaluated.
Simulated Author's Rebuttal
We thank the referee for the detailed reading and for highlighting the need to clarify the distinction between the proposed architecture and the status of its evaluated components. We address the single major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that contracts are 'intended to be proof-checked before admission to a shared library' and that the system enables 'formally verifiable' workflows is not supported by any evaluation. The reported pilot (97.08% completion on 8,728 runs) explicitly used an early prototype without the verified-core subsystem, and the abstract states that 'evaluation of the verified core is reserved for future work.' This renders the load-bearing formal-verification contribution unevaluated.
Authors: We agree that the abstract must avoid any implication that formal verification has been implemented or evaluated. The manuscript already states that the verified-core subsystem was absent from the pilot and that its evaluation is reserved for future work; the formal semantics are specified but remain under active development. To eliminate ambiguity, we will revise the abstract to (1) explicitly note that the reported pilot used the non-verified prototype, (2) frame the proof-checking model as a specified design goal rather than a completed feature, and (3) remove any phrasing that could be read as claiming current formal verifiability. These changes will be incorporated in the revised manuscript. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper presents an architecture for verifiable workflows but explicitly states that the formal semantics, proof-checked admission model, and verified core are under active development with evaluation reserved for future work. The reported 97.08% pilot completion rate is from an early prototype without the verified-core subsystem and is presented as an independent observation on external integrations. No equations, fitted parameters, self-citations, or derivation steps are described that reduce the central claims to inputs by construction; the work is self-contained as a design proposal with prospective formal components.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
M. D. Anderson Breaks With IBM Watson, Raising Questions About Artificial Intelligence in Oncology,
C. Schmidt, “M. D. Anderson Breaks With IBM Watson, Raising Questions About Artificial Intelligence in Oncology,” JNCI: Journal of the National Cancer Institute, vol. 109, no. 5, p. djx113, 2017, doi: 10.1093/jnci/djx113
-
[2]
How IBM Watson Overpromised and Underdelivered on AI Health Care
E. Strickland, “How IBM Watson Overpromised and Underdelivered on AI Health Care.” [On - line]. Available: https://spectrum.ieee.org/how-ibm-watson-overpromised-and-underdelivered-on- ai-health-care
-
[3]
The Fallacy of AI Functionality,
I. D. Raji, I. E. Kumar, A. Horowitz, and A. D. Selbst, “The Fallacy of AI Functionality,” arXiv preprint arXiv:2206.09511, 2022, [Online]. Available: https://arxiv.org/abs/2206.09511
-
[4]
Trust and Medical AI: The challenges we face and the expertise needed to overcome them,
T. P. Quinn, M. Senadeera, S. Jacobs, S. Coghlan, and V. Le, “Trust and Medical AI: The challenges we face and the expertise needed to overcome them,” arXiv preprint arXiv:2008.07734, 2020, [Online]. Available: https://arxiv.org/abs/2008.07734
-
[5]
Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions,
W. B. Zhu et al., “Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions,” arXiv preprint arXiv:2504.11373, 2025, [Online]. Available: https://arxiv. org/abs/2504.11373
-
[6]
Chatbots as conversational healthcare services,
M. Jovanović, M. Baez, and F. Casati, “Chatbots as conversational healthcare services,” arXiv preprint arXiv:2011.03969, 2020, [Online]. Available: https://arxiv.org/abs/2011.03969
-
[7]
Moffatt v. Air Canada
C. C. Rivers, “Moffatt v. Air Canada.” [Online]. Available: https://canlii.ca/t/k2spq
-
[8]
Mitchell, Artificial Intelligence: A Guide for Thinking Humans
M. Mitchell, Artificial Intelligence: A Guide for Thinking Humans. Farrar, Straus, Giroux, 2019
2019
-
[9]
On the Measure of Intelligence
F. Chollet, “On the Measure of Intelligence,” arXiv preprint arXiv:1911.01547 , 2019, [Online]. Available: https://arxiv.org/abs/1911.01547
work page internal anchor Pith review Pith/arXiv arXiv 1911
-
[10]
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Apple Machine Learning Research, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” [Online]. Available: https:// machinelearning.apple.com/research/illusion-of-thinking
-
[11]
Humans and Automation: Use, Misuse, Disuse, Abuse,
R. Parasuraman and V. Riley, “Humans and Automation: Use, Misuse, Disuse, Abuse,” Human Factors, vol. 39, no. 2, pp. 230–253, 1997, doi: 10.1518/001872097778543886
-
[12]
Guidelines for Human-AI Interaction,
S. Amershi et al. , “Guidelines for Human-AI Interaction,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems , ACM, 2019, pp. 1–13. doi: 10.1145/3290605.3300233
-
[13]
W. E. Deming, The New Economics for Industry, Government, Education. MIT Press, 1993
1993
-
[14]
Temporal: Durable Execution
“Temporal: Durable Execution.” [Online]. Available: https://docs.temporal.io/temporal
-
[15]
Workflow Management Coalition: The Workflow Reference Model,
D. Hollingsworth, “Workflow Management Coalition: The Workflow Reference Model,” technical report TC0–3, 1995. [Online]. Available: https://wfmc.org/wp-content/uploads/2022/09/tc003v 11.pdf
1995
-
[16]
Workflow Patterns,
W. M. P. van der Aalst, A. H. M. ter Hofstede, B. Kiepuszewski, and A. P. Barros, “Workflow Patterns,” Distributed and Parallel Databases , vol. 14, no. 1, pp. 5–51, 2003, doi: 10.1023/ A:1022883727209
2003
-
[17]
AWS Step Functions
“AWS Step Functions.” [Online]. Available: https://aws.amazon.com/step-functions/
-
[18]
Apache Airflow
“Apache Airflow.” [Online]. Available: https://airflow.apache.org/
-
[19]
Business Process Model and Notation (BPMN) Version 2.0
“Business Process Model and Notation (BPMN) Version 2.0.” [Online]. Available: https://www. omg.org/spec/BPMN/2.0/
-
[20]
Camunda Platform
“Camunda Platform.” [Online]. Available: https://camunda.com/
-
[21]
An Axiomatic Basis for Computer Programming,
C. A. R. Hoare, “An Axiomatic Basis for Computer Programming,” Communications of the ACM, vol. 12, no. 10, pp. 576–583, 1969, doi: 10.1145/363235.363259
-
[22]
G. C. Necula, “Proof-Carrying Code,” in Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL) , ACM, 1997, pp. 106–119. doi: 10.1145/263699.263712
-
[23]
Prediction: AI will make formal verification go mainstream
M. Kleppmann, “Prediction: AI will make formal verification go mainstream.” [Online]. Available: https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html
2025
-
[24]
QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs,
K. Claessen and J. Hughes, “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs,” in Proceedings of the 2000 ACM SIGPLAN International Conference on Functional Programming (ICFP), ACM, 2000, pp. 268–279. doi: 10.1145/351240.351266
-
[25]
Security and Privacy Controls for Information Systems and Organizations,
National Institute of Standards and Technology, “Security and Privacy Controls for Information Systems and Organizations,” technical report NIST Special Publication 800-53 Revision 5, 2020. doi: 10.6028/NIST.SP.800-53r5. Appendix A: GraphFlow Language Example The following is a GFL example demonstrating formal verification with mathematical properties: # ...
-
[26]
Square a
[task] "Square a" @system --> 3: action: calls: (:multiply { .a: $.a .b: $.a }) assigns: $.aSquared ensures: (:gte $.aSquared 0)
-
[27]
Square b
[task] "Square b" @system --> 3: action: calls: (:multiply { .a: $.b .b: $.b }) assigns: $.bSquared ensures: (:gte $.bSquared 0)
-
[28]
Sum Squares
[milestone] "Sum Squares" @system --> 4: requires: (:and (:ne $.aSquared null) (:ne $.bSquared null)) action: calls: (:add { .a: $.aSquared .b: $.bSquared }) assigns: $.sum ensures: (:gte $.sum 0)
-
[29]
Is sum within bound?
[decision] "Is sum within bound?" @system :yes--> 5a :no--> 5b: action: calls: (:condition { .yes: (:lte $.sum 1000) }) 5a. [milestone] "Return Result" @system: action: calls: (:return { .sum: $.sum }) ensures: (:lte $.return.sum 1000) 5b. [milestone] "Reject Unbounded Sum" @system: action: calls: (:throw "Sum exceeds allowed bound") # Explicit assumption...
2025
-
[30]
Order Cognitive Screening
[task] "Order Cognitive Screening" @medflow --> 2: assigned: [:staff, :ehr] action: calls: (:create-lab-order { .orderType: "OTHER" .labOrder: "Cognitive Screening" .patientId: $.swimlanes.patient.contact.ext-id .orderingProviderId: $.swimlanes.provider.contact.ext-id }) assigns: $.order ensures: (:ne $.order null)
-
[31]
Complete Cognitive Screening
[task] "Complete Cognitive Screening" @patient --> 3: description: "Patient is able to complete this at home before their appointment" action: calls: (:send-text { .template: "cognitive-screening" .orderId: $.order.id .contact: $.swimlanes.patient.contact })
-
[32]
Cognitive Screening Result
[wait] "Cognitive Screening Result" @medflow --> 4: assigned: [:ehr] action: calls: (:await-with-tag { .resource: $.swimlanes.patient.contact .filters: - with: :cognitive-screening-completed })
-
[33]
Further testing recommended?
[decision] "Further testing recommended?" @provider :yes --> 5: action: calls: (:condition { .yes: (:with-tag $.swimlanes.patient.contact :cognitive-screening-positive) })
-
[34]
Order Cognitive Assessment
[task] "Order Cognitive Assessment" @medflow --> 6: assigned: [:staff, :ehr] action: calls: (:create-lab-order { .orderType: "OTHER" .labOrder: "Cognitive Assessment" .patientId: $.swimlanes.patient.contact.ext-id .orderingProviderId: $.swimlanes.provider.contact.ext-id })
-
[35]
Proctor Cognitive Assessment
[meeting] "Proctor Cognitive Assessment" @staff --> 7: assigned: [:patient] action: calls: :next
-
[36]
Cognitive Assessment Result
[wait] "Cognitive Assessment Result" @medflow --> 8: assigned: [:staff, :ehr] action: calls: (:await-with-tag { .resource: $.swimlanes.patient.contact .filters: - with: :cognitive-assessment-completed })
-
[37]
Care Plan recommended?
[decision] "Care Plan recommended?" @provider :yes --> 9: action: calls: (:condition { .yes: (:with-tag $.swimlanes.patient.contact :cognitive-assessment-positive) })
-
[38]
Order Cognitive Care Plan
[task] "Order Cognitive Care Plan" @medflow: action: calls: (:create-lab-order { .orderType: "OTHER" .labOrder: "Cognitive Care Plan" .patientId: $.swimlanes.patient.contact.ext-id .orderingProviderId: $.swimlanes.provider.contact.ext-id }) # Pillar III - Operational Dashboards query "Cognitive Screening Ordered": resource-type: :contact ext-type: "Patien...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.