
arxiv: 2606.09489 · v1 · pith:YIPRKAR6new · submitted 2026-06-08 · 💻 cs.AI

LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines

Pith reviewed 2026-06-27 16:12 UTC · model grok-4.3

classification 💻 cs.AI
keywords conformance checking · large language models · stroke care · clinical guidelines · patient traces · healthcare process analysis · trace conformance indicator

The pith

Orchestrated LLMs extract patient traces and rules from raw texts to measure stroke care conformance without formal guidelines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a modular framework that uses multiple large language models to pull patient event sequences from clinical discharge letters, pull normative rules from plain-language guidelines, turn those rules into runnable code, and score overall compliance with a single indicator. This matters because most real hospitals lack the computer-interpretable guideline models that traditional conformance tools require, so the method removes a major practical barrier. The authors ran the system on hundreds of real stroke cases from one Italian neurological ward against fifty rules and found more than 86 percent of traces met the guideline. The work therefore claims both technical feasibility and evidence of strong local adherence to stroke protocols.

Core claim

A modular architecture coordinates several LLMs and helper components to extract patient traces directly from unstructured discharge letters, derive normative rules from textual clinical guidelines, translate the rules into executable scripts, and compute a Trace Conformance Indicator that quantifies how many traces satisfy the rules; when applied to stroke care data from Alessandria Hospital the system processed hundreds of traces against fifty derived rules and reported more than 86 percent conformance.

What carries the argument

The modular LLM-orchestration pipeline that sequentially extracts traces, identifies rules, generates executable scripts, and produces a Trace Conformance Indicator.
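The stage sequence can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Pipeline` class, the stage names, and the `fake_*` functions are hypothetical stand-ins for the paper's LLM calls, which this page does not show.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types: the paper's real data model is not shown on this page.
Event = Dict[str, str]          # e.g. {"activity": "CT scan"}
Trace = List[Event]
Rule = Callable[[Trace], bool]  # an executable rule returns True when the trace conforms


@dataclass
class Pipeline:
    """Chains the stages; each one is injected so a real LLM client could be swapped in."""
    extract_trace: Callable[[str], Trace]      # discharge letter -> patient trace
    extract_rules: Callable[[str], List[str]]  # guideline text -> rule statements
    compile_rule: Callable[[str], Rule]        # rule statement -> executable check

    def run(self, letters: List[str], guideline: str) -> List[List[bool]]:
        rules = [self.compile_rule(text) for text in self.extract_rules(guideline)]
        traces = [self.extract_trace(letter) for letter in letters]
        # One row per trace, one column per rule.
        return [[rule(trace) for rule in rules] for trace in traces]


# Deterministic stand-ins for the LLM stages, for illustration only.
def fake_extract_trace(letter: str) -> Trace:
    return [{"activity": part.strip()} for part in letter.split(";") if part.strip()]


def fake_extract_rules(guideline: str) -> List[str]:
    return [line.strip() for line in guideline.splitlines() if line.strip()]


def fake_compile_rule(statement: str) -> Rule:
    required = statement.replace("must include ", "", 1).strip()
    return lambda trace: any(event["activity"] == required for event in trace)


pipeline = Pipeline(fake_extract_trace, fake_extract_rules, fake_compile_rule)
results = pipeline.run(
    letters=["CT scan; thrombolysis", "CT scan"],
    guideline="must include CT scan\nmust include thrombolysis",
)
print(results)  # [[True, True], [True, False]]
```

Injecting the stages as plain callables keeps the orchestration testable without any model access, which is also where a human check of each stage's output could be slotted in.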

If this is right

  • Conformance checking becomes possible in hospitals that have only ordinary text guidelines.
  • Hundreds of patient records can be assessed automatically against dozens of rules derived from a single guideline document.
  • A single numeric Trace Conformance Indicator summarizes overall guideline adherence for an entire event log.
  • The same pipeline can be reused on new domains once suitable text sources are supplied.
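How a single indicator could summarize an event log is sketched below. The page does not give the paper's formula, so the definition used here (share of traces that satisfy every rule) is an assumption, and the toy values are made up.

```python
from typing import List


def trace_conformance_indicator(results: List[List[bool]]) -> float:
    """Share of traces that satisfy every rule (assumed definition, not the paper's formula)."""
    if not results:
        raise ValueError("event log is empty")
    conformant = sum(1 for row in results if all(row))
    return conformant / len(results)


def per_rule_pass_rate(results: List[List[bool]]) -> List[float]:
    """Share of traces passing each rule, to locate the rules that fail most often."""
    n_traces = len(results)
    return [sum(column) / n_traces for column in zip(*results)]


# Toy log: 4 traces checked against 3 rules (made-up values).
toy = [
    [True, True, True],
    [True, False, True],
    [True, True, True],
    [True, True, True],
]
print(trace_conformance_indicator(toy))  # 0.75
print(per_rule_pass_rate(toy))           # [1.0, 0.75, 1.0]
```

Reporting the per-rule rates next to the single number shows whether a low indicator comes from many rules or from one rule that most traces miss.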

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hospitals could run periodic automated audits without first investing in formal guideline encoding.
  • Non-conformant traces flagged by the indicator could be routed to clinicians for targeted review.
  • The approach might lower the cost of maintaining compliance monitoring across multiple clinical pathways.

Load-bearing premise

Large language models can reliably extract accurate patient traces and guideline rules from unstructured clinical text without introducing significant errors or hallucinations.

What would settle it

Side-by-side comparison of LLM outputs against independent expert manual annotation on the same set of discharge letters and guideline text, showing extraction accuracy below 80 percent or rule sets that differ on more than 10 percent of conditions.
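A rough sketch of that check follows. The two thresholds come from the sentence above; the metric definitions (letter-level exact match, and symmetric difference over union for rule sets) are assumptions, since the page does not fix them, and the sample data is invented.

```python
from typing import Dict, List, Set

ACCURACY_FLOOR = 0.80      # threshold quoted in the text above
DIVERGENCE_CEILING = 0.10  # threshold quoted in the text above


def exact_match_accuracy(llm: Dict[str, List[str]], expert: Dict[str, List[str]]) -> float:
    """Share of letters whose LLM trace equals the expert trace, event for event."""
    if not expert:
        raise ValueError("no expert annotations supplied")
    matches = sum(1 for letter_id, trace in expert.items() if llm.get(letter_id) == trace)
    return matches / len(expert)


def rule_divergence(llm_rules: Set[str], expert_rules: Set[str]) -> float:
    """Share of all conditions that appear in only one of the two rule sets."""
    union = llm_rules | expert_rules
    if not union:
        return 0.0
    return len(llm_rules ^ expert_rules) / len(union)


def premise_fails(accuracy: float, divergence: float) -> bool:
    """True when either threshold is crossed."""
    return accuracy < ACCURACY_FLOOR or divergence > DIVERGENCE_CEILING


# Invented annotations: the LLM misses one event in letter L5.
expert = {f"L{i}": ["CT scan", "thrombolysis"] for i in range(1, 6)}
llm = dict(expert, L5=["CT scan"])
expert_rules = {f"r{i}" for i in range(1, 21)}
llm_rules = expert_rules - {"r20"}

accuracy = exact_match_accuracy(llm, expert)
divergence = rule_divergence(llm_rules, expert_rules)
print(accuracy, divergence, premise_fails(accuracy, divergence))
```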

Figures

Figures reproduced from arXiv: 2606.09489 by Alessandro Canessa, Delfina Ferrandi, Giorgio Leonardi, Manuel Striani, Stefania Montani.

Figure 1
Figure 1: Orchestrated LLM pipeline for guideline-based conformance checking. Adjacent paper text captured with the caption: "… but in a non-correct way with respect to the guideline directives (e.g., too late); 5. Rule Refinement: by means of a fifth LLM (currently, Gemini 3 Pro-Preview (Google DeepMind, 2025b)) we improve the Python rules, by fixing possible coding bugs, and eliminating redundancy, noise and artifacts; indeed, Gemini 3 Pro-Preview exceeds the ca…"
Figure 2
Figure 2: Prompt template with color-coded prompting strategies. The original prompt (in Italian) has been translated into English for the sake of paper clarity. First Author et al.: Preprint submitted to Elsevier, Page 5 of 10.
Figure 3
Figure 3: Part of the Python code generated by Gemini for an example rule. In the figure, only some sections of the code are shown, for the sake of simplicity, and terms in the activity synonym reference list have been translated into English. Adjacent paper text captured with the caption: "On the 50 rules, more than 86% were conformant to the clinical guideline, testifying a robust quality assessment performance for Alessandria. Nevertheless, the (few) non confor…"
Figure 4
Figure 4: Conformance analysis results.

Original abstract

Objective: Conformance checking in healthcare seeks to assess whether patient care pathways adhere to clinical guidelines. However, its practical application often depends on the availability of formal, machine-interpretable representations of guidelines, such as Computer-Interpretable Guidelines (CIGs), which are seldom available in real-world clinical settings. Methods: This work introduces a modular framework based on the orchestration of Large Language Models (LLMs) to support medical conformance checking directly from unstructured clinical and guideline texts, without requiring predefined CIGs. The proposed architecture integrates multiple LLMs and supporting components to extract patient traces from clinical discharge letters, identify normative rules from textual clinical guidelines, translate these rules into executable scripts, and compute a Trace Conformance Indicator to quantify compliance within the event log. Results: The framework was implemented and evaluated in the stroke care domain at the neurological ward of Alessandria Hospital. Hundreds of patient traces were automatically extracted from hospital data and assessed against 50 rules derived from the reference guideline. The analysis showed that more than 86% of the available traces were conformant. Conclusion: The results demonstrate the feasibility of using orchestrated LLMs for practical healthcare conformance analysis. At the same time, the study provides evidence of a high level of adherence to stroke care guidelines at Alessandria Hospital.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces a modular framework that orchestrates multiple LLMs to perform conformance checking in healthcare directly from unstructured texts, without Computer-Interpretable Guidelines. It extracts patient traces from clinical discharge letters, derives normative rules from guideline texts, translates rules into executable scripts, and computes a Trace Conformance Indicator. The framework is implemented and evaluated in the stroke care domain using data from Alessandria Hospital, where hundreds of traces were assessed against 50 LLM-derived rules, yielding a reported conformance rate exceeding 86%.

Significance. If the LLM-based extractions prove reliable, the approach would enable conformance analysis in real-world settings lacking formal CIGs and provide evidence of high guideline adherence at the studied hospital. The real-hospital deployment and use of actual patient data constitute a practical strength that could support broader adoption of LLM-orchestrated process mining in clinical informatics.

major comments (1)
  1. [Results] Results: The headline finding that >86% of hundreds of traces are conformant depends entirely on the accuracy of the LLM extraction stages for patient traces (from discharge letters) and normative rules (from the guideline). No ground-truth validation, error rates, precision/recall metrics, or human review of the extracted traces and rules is reported, rendering the conformance percentage uninterpretable as evidence of actual guideline adherence rather than possible LLM artifacts.
minor comments (2)
  1. [Methods] The description of the orchestration architecture would benefit from explicit pseudocode or a diagram showing the sequence of LLM calls and data flows between components.
  2. [Abstract] The abstract and results paragraph should state the exact number of traces analyzed rather than 'hundreds' to allow readers to assess statistical power.
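The precision/recall figures requested in the major comment could be computed along the following lines. This is a sketch under assumptions: events are matched by exact activity label within each trace, scores are micro-averaged over the whole log, and the sample traces are invented.

```python
from typing import Dict, List, Tuple


def micro_precision_recall(
    predicted: Dict[str, List[str]], gold: Dict[str, List[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over (trace id, activity) pairs."""
    pred_pairs = {(tid, act) for tid, acts in predicted.items() for act in acts}
    gold_pairs = {(tid, act) for tid, acts in gold.items() for act in acts}
    true_positives = len(pred_pairs & gold_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: one hallucinated event in t1, one missed event in t2.
gold = {"t1": ["CT scan", "thrombolysis"], "t2": ["CT scan", "aspirin"]}
predicted = {"t1": ["CT scan", "thrombolysis", "MRI"], "t2": ["CT scan"]}

precision, recall, f1 = micro_precision_recall(predicted, gold)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Precision penalizes hallucinated events and recall penalizes missed ones, so reporting both separates the two failure modes the referee is worried about.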

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The major comment highlights an important limitation in the current presentation of results, which we address below with a commitment to revision.

Point-by-point responses
  1. Referee: [Results] Results: The headline finding that >86% of hundreds of traces are conformant depends entirely on the accuracy of the LLM extraction stages for patient traces (from discharge letters) and normative rules (from the guideline). No ground-truth validation, error rates, precision/recall metrics, or human review of the extracted traces and rules is reported, rendering the conformance percentage uninterpretable as evidence of actual guideline adherence rather than possible LLM artifacts.

    Authors: We agree that the lack of reported validation for the LLM extraction stages is a substantive limitation that affects the strength of the conformance claims. The manuscript presents the framework as a feasibility demonstration in a real clinical setting and reports the observed rate from the hospital data, but does not include ground-truth checks or quantitative error metrics on the trace and rule extractions. In the revised manuscript we will add a dedicated validation subsection that includes: (i) human review of a random sample of extracted patient traces against the original discharge letters, (ii) human review of the 50 derived rules against the source guideline text, and (iii) reported agreement rates together with any observed error categories. This addition will allow readers to assess the reliability of the extraction pipeline and thereby interpret the >86% conformance figure more confidently. revision: yes
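The sampled human review promised in items (i) to (iii) might be organized as in the sketch below. The sample size, the seed, the letter identifiers, and the review outcome are all made up, and the Wilson score interval is an addition chosen here to express uncertainty, not something the authors mention.

```python
import math
import random
from typing import List, Sequence, Tuple


def draw_review_sample(item_ids: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Pick k distinct items for human review; the fixed seed makes the draw reproducible."""
    return random.Random(seed).sample(list(item_ids), k)


def agreement_with_wilson(
    n_agree: int, n_reviewed: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Observed agreement rate and its Wilson score interval (z = 1.96 for roughly 95%)."""
    if n_reviewed <= 0 or not 0 <= n_agree <= n_reviewed:
        raise ValueError("counts must satisfy 0 <= n_agree <= n_reviewed and n_reviewed > 0")
    p = n_agree / n_reviewed
    denom = 1 + z**2 / n_reviewed
    centre = (p + z**2 / (2 * n_reviewed)) / denom
    half = z * math.sqrt(p * (1 - p) / n_reviewed + z**2 / (4 * n_reviewed**2)) / denom
    return p, centre - half, centre + half


letter_ids = [f"letter-{i:03d}" for i in range(300)]  # made-up identifiers
sample = draw_review_sample(letter_ids, k=50)
rate, low, high = agreement_with_wilson(n_agree=45, n_reviewed=50)  # made-up outcome
print(len(sample), round(rate, 2))  # 50 0.9
print(round(low, 3), round(high, 3))
```

An interval alongside the agreement rate makes clear how much a small review sample can actually say about the full set of extracted traces.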

Circularity Check

0 steps flagged

No circularity; empirical result derived from external hospital traces and guideline text

full rationale

The paper presents an LLM-orchestrated pipeline for extracting traces and rules from real hospital discharge letters and guideline documents, then computes conformance on those extracted artifacts. No equations, fitted parameters, or self-citations are used to derive the 86% figure; it is reported as a direct count from the Alessandria Hospital data set. The central claim therefore rests on the accuracy of the LLM extractions rather than on any definitional or self-referential reduction. Absence of ground-truth validation for the extractions is a correctness concern, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical application of LLMs; no free parameters, axioms, or invented entities are mentioned in the abstract.

pith-pipeline@v0.9.1-grok · 5774 in / 1043 out tokens · 21527 ms · 2026-06-27T16:12:38.293826+00:00 · methodology

