pith. sign in

arxiv: 2605.14968 · v1 · pith:77G4TBMVnew · submitted 2026-05-14 · 💻 cs.AI

GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation

Pith reviewed 2026-06-30 20:21 UTC · model grok-4.3

classification 💻 cs.AI
keywords visual workflowsformal verificationagentic AIexecutable specificationscontract checkingworkflow automationtrust boundariesdurable execution
0
0 comments X

The pith

Workflow diagrams serve as executable specifications whose contracts can be formally proof-checked before runtime use in agentic AI systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GraphFlow as a system that makes visual workflow diagrams the single artifact defining data scope, execution semantics, and monitoring for multi-step AI automations. By restricting the class of diagrams that can be compiled, the approach aims to produce reusable components whose preconditions, postconditions, and composition rules are verified before they enter a shared library. Runtime execution relies on a durable engine that logs events in append-only fashion and enforces contracts at boundaries while swimlanes mark explicit trust separations between verified logic, external systems, humans, and AI. A year-long clinical pilot achieved over 97 percent completion even without the verified core in place. If the verification layer works as specified, libraries of composable, auditable workflows become feasible without the compounding failure rates typical of unverified agentic processes.

Core claim

GraphFlow treats workflow diagrams as the executable specification, a single artifact defining data scope, execution semantics, and monitoring. At compile time, a restricted class of diagrams is specified to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library. At runtime, a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit. Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions.

What carries the argument

The restricted class of diagrams compiled into reusable automations with proof-checkable contracts that encode preconditions, postconditions, and composition obligations.

If this is right

  • Reusable automations with verified contracts can be admitted to a shared library for safe composition.
  • Runtime contract enforcement at system boundaries limits error propagation in multi-step processes.
  • Append-only event logs enable replay, retries, and full audits of execution history.
  • Swimlanes separate verified logic from external integrations, human judgment, and AI decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of verified and unverified regions via swimlanes could support incremental rollout where only parts of a workflow are formally checked.
  • Shifting correctness from inference-time planning to compile-time contract checking might reduce sensitivity to prompt changes in agentic components.
  • The model of diagram-as-specification could be tested for transfer to non-clinical domains with similar needs for auditability.

Load-bearing premise

A restricted class of diagrams exists for which contracts can be formally proof-checked in practice.

What would settle it

Implementation of the formal semantics and proof checker on at least one non-trivial diagram from the restricted class, followed by an attempt to admit or reject it based on contract verification results.

Figures

Figures reproduced from arXiv: 2605.14968 by Drewry H. Morris V, Inc.), Luis Valles, Reza Hosseini Ghomi (MedFlow.

Figure 1
Figure 1. Figure 1: Cognitive Testing Care Pathway with swimlanes for Patient, Staff, Provider, MedFlow, and EHR. [PITH_FULL_IMAGE:figures/full_fig_p023_1.png] view at source ↗
read the original abstract

GraphFlow is a visual workflow system designed to improve the reliability of agentic AI automation in multi-step, mission-critical processes. In these workflows, small errors compound rapidly: under an idealized model of independent steps, a ten-step process with 90% per-step reliability completes successfully only 35% of the time. Existing workflow platforms provide durable execution and observability but offer few semantic correctness guarantees, while agentic systems plan at inference time, making behavior sensitive to prompt variation and difficult to audit. GraphFlow is designed to address this gap by treating workflow diagrams as the executable specification, a single artifact defining data scope, execution semantics, and monitoring. At compile time, a restricted class of diagrams is specified to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library. At runtime, a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit. Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions. A year-long pilot across three clinical sites executed 8,728 cohort-enrolled workflow runs with a 97.08% completion rate under an early prototype without the verified-core subsystem; observed failures were localized primarily to external integrations. The formal semantics and proof-checked admission model described here are specified and under active development. Evaluation of the verified core is reserved for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents GraphFlow, a visual workflow architecture for reliable agentic AI automation in which diagrams serve as the executable specification defining data scope, execution semantics, and monitoring. A restricted class of diagrams is intended to yield reusable automations whose contracts (preconditions, postconditions, composition obligations) are proof-checked at compile time before library admission; runtime enforcement occurs via an append-only event log. Swimlanes separate trust boundaries. A year-long pilot across three clinical sites executed 8,728 runs with 97.08% completion under an early prototype lacking the verified-core subsystem; the formal semantics and proof-checked admission model are specified but under active development, with evaluation of the verified core reserved for future work.

Significance. If the compile-time proof-checking model can be realized and evaluated, the architecture could provide stronger semantic correctness guarantees than existing durable-execution platforms or inference-time agentic planners, particularly for mission-critical multi-step processes where error compounding is severe. The pilot data on the non-verified prototype offers preliminary evidence of practical deployment in clinical settings, but the formal-verification contribution remains unevaluated.

major comments (1)
  1. [Abstract] Abstract: The central claim that contracts are 'intended to be proof-checked before admission to a shared library' and that the system enables 'formally verifiable' workflows is not supported by any evaluation. The reported pilot (97.08% completion on 8,728 runs) explicitly used an early prototype without the verified-core subsystem, and the abstract states that 'evaluation of the verified core is reserved for future work.' This renders the load-bearing formal-verification contribution unevaluated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed reading and for highlighting the need to clarify the distinction between the proposed architecture and the status of its evaluated components. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that contracts are 'intended to be proof-checked before admission to a shared library' and that the system enables 'formally verifiable' workflows is not supported by any evaluation. The reported pilot (97.08% completion on 8,728 runs) explicitly used an early prototype without the verified-core subsystem, and the abstract states that 'evaluation of the verified core is reserved for future work.' This renders the load-bearing formal-verification contribution unevaluated.

    Authors: We agree that the abstract must avoid any implication that formal verification has been implemented or evaluated. The manuscript already states that the verified-core subsystem was absent from the pilot and that its evaluation is reserved for future work; the formal semantics are specified but remain under active development. To eliminate ambiguity, we will revise the abstract to (1) explicitly note that the reported pilot used the non-verified prototype, (2) frame the proof-checking model as a specified design goal rather than a completed feature, and (3) remove any phrasing that could be read as claiming current formal verifiability. These changes will be incorporated in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an architecture for verifiable workflows but explicitly states that the formal semantics, proof-checked admission model, and verified core are under active development with evaluation reserved for future work. The reported 97.08% pilot completion rate is from an early prototype without the verified-core subsystem and is presented as an independent observation on external integrations. No equations, fitted parameters, self-citations, or derivation steps are described that reduce the central claims to inputs by construction; the work is self-contained as a design proposal with prospective formal components.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a system-architecture description with no mathematical derivations, fitted parameters, or new physical entities. The formal semantics are referenced but not supplied in the abstract.

pith-pipeline@v0.9.1-grok · 5817 in / 1322 out tokens · 27745 ms · 2026-06-30T20:21:20.765115+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    M. D. Anderson Breaks With IBM Watson, Raising Questions About Artificial Intelligence in Oncology,

    C. Schmidt, “M. D. Anderson Breaks With IBM Watson, Raising Questions About Artificial Intelligence in Oncology,” JNCI: Journal of the National Cancer Institute, vol. 109, no. 5, p. djx113, 2017, doi: 10.1093/jnci/djx113

  2. [2]

    How IBM Watson Overpromised and Underdelivered on AI Health Care

    E. Strickland, “How IBM Watson Overpromised and Underdelivered on AI Health Care.” [On - line]. Available: https://spectrum.ieee.org/how-ibm-watson-overpromised-and-underdelivered-on- ai-health-care

  3. [3]

    The Fallacy of AI Functionality,

    I. D. Raji, I. E. Kumar, A. Horowitz, and A. D. Selbst, “The Fallacy of AI Functionality,” arXiv preprint arXiv:2206.09511, 2022, [Online]. Available: https://arxiv.org/abs/2206.09511

  4. [4]

    Trust and Medical AI: The challenges we face and the expertise needed to overcome them,

    T. P. Quinn, M. Senadeera, S. Jacobs, S. Coghlan, and V. Le, “Trust and Medical AI: The challenges we face and the expertise needed to overcome them,” arXiv preprint arXiv:2008.07734, 2020, [Online]. Available: https://arxiv.org/abs/2008.07734

  5. [5]

    Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions,

    W. B. Zhu et al., “Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions,” arXiv preprint arXiv:2504.11373, 2025, [Online]. Available: https://arxiv. org/abs/2504.11373

  6. [6]

    Chatbots as conversational healthcare services,

    M. Jovanović, M. Baez, and F. Casati, “Chatbots as conversational healthcare services,” arXiv preprint arXiv:2011.03969, 2020, [Online]. Available: https://arxiv.org/abs/2011.03969

  7. [7]

    Moffatt v. Air Canada

    C. C. Rivers, “Moffatt v. Air Canada.” [Online]. Available: https://canlii.ca/t/k2spq

  8. [8]

    Mitchell, Artificial Intelligence: A Guide for Thinking Humans

    M. Mitchell, Artificial Intelligence: A Guide for Thinking Humans. Farrar, Straus, Giroux, 2019

  9. [9]

    On the Measure of Intelligence

    F. Chollet, “On the Measure of Intelligence,” arXiv preprint arXiv:1911.01547 , 2019, [Online]. Available: https://arxiv.org/abs/1911.01547

  10. [10]

    The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

    Apple Machine Learning Research, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” [Online]. Available: https:// machinelearning.apple.com/research/illusion-of-thinking

  11. [11]

    Humans and Automation: Use, Misuse, Disuse, Abuse,

    R. Parasuraman and V. Riley, “Humans and Automation: Use, Misuse, Disuse, Abuse,” Human Factors, vol. 39, no. 2, pp. 230–253, 1997, doi: 10.1518/001872097778543886

  12. [12]

    Guidelines for Human-AI Interaction,

    S. Amershi et al. , “Guidelines for Human-AI Interaction,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems , ACM, 2019, pp. 1–13. doi: 10.1145/3290605.3300233

  13. [13]

    W. E. Deming, The New Economics for Industry, Government, Education. MIT Press, 1993

  14. [14]

    Temporal: Durable Execution

    “Temporal: Durable Execution.” [Online]. Available: https://docs.temporal.io/temporal

  15. [15]

    Workflow Management Coalition: The Workflow Reference Model,

    D. Hollingsworth, “Workflow Management Coalition: The Workflow Reference Model,” technical report TC0–3, 1995. [Online]. Available: https://wfmc.org/wp-content/uploads/2022/09/tc003v 11.pdf

  16. [16]

    Workflow Patterns,

    W. M. P. van der Aalst, A. H. M. ter Hofstede, B. Kiepuszewski, and A. P. Barros, “Workflow Patterns,” Distributed and Parallel Databases , vol. 14, no. 1, pp. 5–51, 2003, doi: 10.1023/ A:1022883727209

  17. [17]

    AWS Step Functions

    “AWS Step Functions.” [Online]. Available: https://aws.amazon.com/step-functions/

  18. [18]

    Apache Airflow

    “Apache Airflow.” [Online]. Available: https://airflow.apache.org/

  19. [19]

    Business Process Model and Notation (BPMN) Version 2.0

    “Business Process Model and Notation (BPMN) Version 2.0.” [Online]. Available: https://www. omg.org/spec/BPMN/2.0/

  20. [20]

    Camunda Platform

    “Camunda Platform.” [Online]. Available: https://camunda.com/

  21. [21]

    An Axiomatic Basis for Computer Programming,

    C. A. R. Hoare, “An Axiomatic Basis for Computer Programming,” Communications of the ACM, vol. 12, no. 10, pp. 576–583, 1969, doi: 10.1145/363235.363259

  22. [22]

    Proof-Carrying Code,

    G. C. Necula, “Proof-Carrying Code,” in Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL) , ACM, 1997, pp. 106–119. doi: 10.1145/263699.263712

  23. [23]

    Prediction: AI will make formal verification go mainstream

    M. Kleppmann, “Prediction: AI will make formal verification go mainstream.” [Online]. Available: https://martin.kleppmann.com/2025/12/08/ai-formal-verification.html

  24. [24]

    QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs,

    K. Claessen and J. Hughes, “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs,” in Proceedings of the 2000 ACM SIGPLAN International Conference on Functional Programming (ICFP), ACM, 2000, pp. 268–279. doi: 10.1145/351240.351266

  25. [25]

    Security and Privacy Controls for Information Systems and Organizations,

    National Institute of Standards and Technology, “Security and Privacy Controls for Information Systems and Organizations,” technical report NIST Special Publication 800-53 Revision 5, 2020. doi: 10.6028/NIST.SP.800-53r5. Appendix A: GraphFlow Language Example The following is a GFL example demonstrating formal verification with mathematical properties: # ...

  26. [26]

    Square a

    [task] "Square a" @system --> 3: action: calls: (:multiply { .a: $.a .b: $.a }) assigns: $.aSquared ensures: (:gte $.aSquared 0)

  27. [27]

    Square b

    [task] "Square b" @system --> 3: action: calls: (:multiply { .a: $.b .b: $.b }) assigns: $.bSquared ensures: (:gte $.bSquared 0)

  28. [28]

    Sum Squares

    [milestone] "Sum Squares" @system --> 4: requires: (:and (:ne $.aSquared null) (:ne $.bSquared null)) action: calls: (:add { .a: $.aSquared .b: $.bSquared }) assigns: $.sum ensures: (:gte $.sum 0)

  29. [29]

    Is sum within bound?

    [decision] "Is sum within bound?" @system :yes--> 5a :no--> 5b: action: calls: (:condition { .yes: (:lte $.sum 1000) }) 5a. [milestone] "Return Result" @system: action: calls: (:return { .sum: $.sum }) ensures: (:lte $.return.sum 1000) 5b. [milestone] "Reject Unbounded Sum" @system: action: calls: (:throw "Sum exceeds allowed bound") # Explicit assumption...

  30. [30]

    Order Cognitive Screening

    [task] "Order Cognitive Screening" @medflow --> 2: assigned: [:staff, :ehr] action: calls: (:create-lab-order { .orderType: "OTHER" .labOrder: "Cognitive Screening" .patientId: $.swimlanes.patient.contact.ext-id .orderingProviderId: $.swimlanes.provider.contact.ext-id }) assigns: $.order ensures: (:ne $.order null)

  31. [31]

    Complete Cognitive Screening

    [task] "Complete Cognitive Screening" @patient --> 3: description: "Patient is able to complete this at home before their appointment" action: calls: (:send-text { .template: "cognitive-screening" .orderId: $.order.id .contact: $.swimlanes.patient.contact })

  32. [32]

    Cognitive Screening Result

    [wait] "Cognitive Screening Result" @medflow --> 4: assigned: [:ehr] action: calls: (:await-with-tag { .resource: $.swimlanes.patient.contact .filters: - with: :cognitive-screening-completed })

  33. [33]

    Further testing recommended?

    [decision] "Further testing recommended?" @provider :yes --> 5: action: calls: (:condition { .yes: (:with-tag $.swimlanes.patient.contact :cognitive-screening-positive) })

  34. [34]

    Order Cognitive Assessment

    [task] "Order Cognitive Assessment" @medflow --> 6: assigned: [:staff, :ehr] action: calls: (:create-lab-order { .orderType: "OTHER" .labOrder: "Cognitive Assessment" .patientId: $.swimlanes.patient.contact.ext-id .orderingProviderId: $.swimlanes.provider.contact.ext-id })

  35. [35]

    Proctor Cognitive Assessment

    [meeting] "Proctor Cognitive Assessment" @staff --> 7: assigned: [:patient] action: calls: :next

  36. [36]

    Cognitive Assessment Result

    [wait] "Cognitive Assessment Result" @medflow --> 8: assigned: [:staff, :ehr] action: calls: (:await-with-tag { .resource: $.swimlanes.patient.contact .filters: - with: :cognitive-assessment-completed })

  37. [37]

    Care Plan recommended?

    [decision] "Care Plan recommended?" @provider :yes --> 9: action: calls: (:condition { .yes: (:with-tag $.swimlanes.patient.contact :cognitive-assessment-positive) })

  38. [38]

    Order Cognitive Care Plan

    [task] "Order Cognitive Care Plan" @medflow: action: calls: (:create-lab-order { .orderType: "OTHER" .labOrder: "Cognitive Care Plan" .patientId: $.swimlanes.patient.contact.ext-id .orderingProviderId: $.swimlanes.provider.contact.ext-id }) # Pillar III - Operational Dashboards query "Cognitive Screening Ordered": resource-type: :contact ext-type: "Patien...