Clinical Harness for Governable Medical AI Skill Ecosystems
Pith reviewed 2026-07-01 07:18 UTC · model grok-4.3
The pith
A Clinical Harness registers, orchestrates, constrains and monitors clinical AI skills so that care rests on accountable, persistent capabilities rather than on isolated models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors define clinical AI skills and claim that the Clinical Harness, as a runtime governance architecture, registers, orchestrates, constrains and monitors these skills so that knowledge-driven, data-driven and physics-enhanced skills together can deliver lifecycle care for conditions such as osteoporosis and serve as a controlled substrate for medical agents.
What carries the argument
The Clinical Harness, a runtime governance architecture that registers, orchestrates, constrains and monitors clinical AI skills.
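The four verbs in that description can be made concrete with a small sketch. The paper gives no specification, so everything below is an assumption for illustration: the names `Skill`, `ClinicalHarness` and `Constraint`, the shape of the audit log, and the choice to return `None` for a blocked call are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Hypothetical type: a constraint returns True when a skill output is acceptable.
Constraint = Callable[[dict], bool]


@dataclass
class Skill:
    name: str
    run: Callable[[dict], dict]          # maps patient data to a skill output
    constraints: list[Constraint] = field(default_factory=list)


class ClinicalHarness:
    """Illustrative harness: register, orchestrate, constrain, monitor."""

    def __init__(self) -> None:
        self._registry: dict[str, Skill] = {}
        self.audit_log: list[dict] = []  # monitoring record of every call

    def register(self, skill: Skill) -> None:
        if skill.name in self._registry:
            raise ValueError(f"skill already registered: {skill.name}")
        self._registry[skill.name] = skill

    def call(self, name: str, patient: dict) -> dict | None:
        """Run one skill; return None when the call is blocked."""
        skill = self._registry.get(name)
        if skill is None:
            self._log(name, "blocked", "unregistered skill")
            return None
        output = skill.run(patient)
        if all(check(output) for check in skill.constraints):
            self._log(name, "ok", None)
            return output
        self._log(name, "blocked", "constraint violated")
        return None

    def _log(self, name: str, status: str, reason: str | None) -> None:
        self.audit_log.append({"skill": name, "status": status, "reason": reason})
```

In this sketch every call leaves an audit entry whether or not it is released, which is the minimum a reviewer would need to check that constraints were actually applied.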
If this is right
- Skills become persistent capabilities that span multiple care episodes rather than one-time model outputs.
- Different skill types can be combined under shared constraints for a single condition.
- Future medical agents operate on a governed substrate with built-in registration and monitoring.
- Constraints and monitoring apply uniformly across knowledge, data and physics-based skills.
Where Pith is reading between the lines
- Hospitals might maintain shared skill registries that allow skills developed at one site to run under the same constraints at another.
- Runtime monitoring could surface skill drift before it affects patient decisions in ongoing conditions.
- Physics-enhanced skills open a route to embed simulation outputs directly into monitored care pathways.
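The drift point above can be illustrated with a rolling calibration check. This is a minimal sketch, not a method from the paper: the Brier score is a standard calibration measure, but the `DriftMonitor` class, the window size and the tolerance are hypothetical choices.

```python
from collections import deque


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


class DriftMonitor:
    """Flags drift when the rolling Brier score exceeds baseline + tolerance.

    baseline, tolerance and window are assumed values a deployment team
    would have to set; none of them come from the paper.
    """

    def __init__(self, baseline, tolerance=0.05, window=100):
        self.baseline = baseline
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # oldest pairs fall out automatically

    def observe(self, prob, outcome):
        self.recent.append((prob, outcome))

    def drifted(self):
        if len(self.recent) < self.recent.maxlen:
            return False  # window not full yet, so no verdict
        probs, outcomes = zip(*self.recent)
        return brier_score(probs, outcomes) > self.baseline + self.tolerance
```

A harness could consult `drifted()` before releasing a skill output and escalate to a clinician when it returns `True`; whether that is the right response is a governance decision the sketch does not settle.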
Load-bearing premise
That wrapping skills in a runtime harness will automatically produce enforceable accountability and persistence in actual clinical use rather than remaining a high-level description.
What would settle it
An implemented Clinical Harness that fails to block or log an unmonitored change to an osteoporosis treatment plan during a controlled simulation of lifecycle care.
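That falsification condition can be phrased as an executable check. The sketch below assumes a hypothetical `GovernedPlan` object through which all plan changes must pass; the treatment labels and source names are placeholders, not clinical content.

```python
from dataclasses import dataclass, field


@dataclass
class GovernedPlan:
    """Treatment plan whose changes must pass a harness gate (hypothetical)."""

    treatment: str
    registered_skills: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)

    def propose_change(self, new_treatment, source):
        """Apply the change only if it comes from a registered skill; always log."""
        allowed = source in self.registered_skills
        self.audit_log.append(
            {"source": source, "from": self.treatment,
             "to": new_treatment, "applied": allowed}
        )
        if allowed:
            self.treatment = new_treatment
        return allowed


def harness_survives(plan, unmonitored_source, new_treatment):
    """True when an unmonitored change is both blocked and logged."""
    before = plan.treatment
    log_len = len(plan.audit_log)
    applied = plan.propose_change(new_treatment, unmonitored_source)
    blocked = (not applied) and plan.treatment == before
    logged = (len(plan.audit_log) == log_len + 1
              and plan.audit_log[-1]["applied"] is False)
    return blocked and logged
```

A real harness would fail the test described above if `harness_survives` returned `False` for any unmonitored source during the simulated care episode.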
Original abstract
Medical AI remains organized around isolated models, whereas care requires accountable capabilities that persist across time. We define clinical AI skills and propose the Clinical Harness, a runtime governance architecture that registers, orchestrates, constrains and monitors them. Using osteoporosis as an exemplar, we show how knowledge-driven, data-driven and physics-enhanced skills can support lifecycle care and provide a governed substrate for future medical agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines clinical AI skills and proposes the Clinical Harness, a runtime governance architecture that registers, orchestrates, constrains and monitors them. Using osteoporosis as an exemplar, it illustrates how knowledge-driven, data-driven and physics-enhanced skills can support lifecycle care and serve as a governed substrate for future medical agents.
Significance. If realized with enforceable runtime mechanisms, the framework could advance accountable, persistent medical AI capabilities beyond isolated models. As presented, however, the contribution is a high-level architectural vision without demonstrated enforcement, validation or specifications, so its significance lies in conceptual framing rather than in delivered governance.
major comments (2)
- [Clinical Harness] Clinical Harness description: the architecture is stated to register, orchestrate, constrain and monitor skills, yet no specification is given of the constraint engine, monitoring predicates, blocking logic for invalid outputs, or logging of violations. This absence is load-bearing for the central claim of runtime governance.
- [Osteoporosis exemplar] Osteoporosis exemplar: the mapping of knowledge-driven, data-driven and physics-enhanced skills to lifecycle care is presented as a conceptual illustration only, with no pseudocode, constraint predicates or execution-time enforcement details that would allow verification that the harness actually applies constraints.
minor comments (1)
- [Abstract] The abstract and main text use 'show how' for the exemplar; rephrasing to 'illustrate conceptually' would better reflect the descriptive nature of the content.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the conceptual framing of the Clinical Harness. We agree that additional illustrative specifications would strengthen the manuscript and will revise to address the points on detailed mechanisms and the exemplar.
Point-by-point responses
- Referee: [Clinical Harness] Clinical Harness description: the architecture is stated to register, orchestrate, constrain and monitor skills, yet no specification is given of the constraint engine, monitoring predicates, blocking logic for invalid outputs, or logging of violations. This absence is load-bearing for the central claim of runtime governance.
  Authors: We acknowledge that the manuscript presents the Clinical Harness at an architectural level and does not include concrete specifications for the constraint engine, monitoring predicates, blocking logic, or violation logging. This reflects the paper's focus on defining the overall governance substrate rather than delivering an implemented system. In revision we will add a new subsection with high-level pseudocode for the orchestration and monitoring loop, example monitoring predicates for skill outputs, a description of blocking logic for invalid results, and a logging approach for violations. These additions will illustrate how runtime enforcement could operate while preserving the conceptual scope.
  Revision: yes
- Referee: [Osteoporosis exemplar] Osteoporosis exemplar: the mapping of knowledge-driven, data-driven and physics-enhanced skills to lifecycle care is presented as a conceptual illustration only, with no pseudocode, constraint predicates or execution-time enforcement details that would allow verification that the harness actually applies constraints.
  Authors: The osteoporosis section is explicitly framed as a conceptual illustration of skill mapping to lifecycle care. We agree that the absence of pseudocode and enforcement details limits verifiability of constraint application. We will revise this section to include example constraint predicates and pseudocode for at least one skill type (e.g., the knowledge-driven skill), showing how the harness would evaluate outputs against clinical rules at runtime and trigger blocking or logging as needed.
  Revision: yes
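To show what such constraint predicates for a knowledge-driven skill might look like, here is a minimal sketch. It is not the authors' promised pseudocode: the `Predicate` type, the three example rules and the output field names (`guideline_ref`, `drug`, `followup_months`) are hypothetical and structural only, and they encode no actual clinical guidance.

```python
from typing import Callable, NamedTuple


class Predicate(NamedTuple):
    name: str
    check: Callable[[dict, dict], bool]  # (patient, output) -> True if satisfied
    action: str                          # "block" withholds output, "log" only records


# Hypothetical rules for a guideline-based recommendation skill.
PREDICATES = [
    Predicate("cites_guideline",
              lambda patient, out: bool(out.get("guideline_ref")), "block"),
    Predicate("no_contraindicated_drug",
              lambda patient, out: out.get("drug") not in patient.get("contraindicated", ()),
              "block"),
    Predicate("has_followup_interval",
              lambda patient, out: "followup_months" in out, "log"),
]


def evaluate(patient, output, predicates=PREDICATES):
    """Return (released, violations).

    Any failed predicate is reported; the output is released only when
    no failed predicate carries the "block" action.
    """
    violations = [(p.name, p.action) for p in predicates
                  if not p.check(patient, output)]
    released = all(action != "block" for _, action in violations)
    return released, violations
```

Separating the action from the check lets the same predicate list drive both blocking and logging, which are the two responses the rebuttal mentions.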
Circularity Check
No circularity: purely definitional architecture proposal with no derivations or fitted results
Full rationale
The paper consists of definitions of clinical AI skills and a proposed runtime governance architecture (the Clinical Harness) illustrated via an osteoporosis exemplar. No equations, predictions, fitted parameters, or derivation chains exist that could reduce a claimed result to its inputs by construction. Claims are presented as conceptual mappings rather than derived outputs, making the work self-contained without any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Clinical AI skills can be defined as persistent, orchestratable units that remain accountable across time.
- ad hoc to paper: A runtime governance layer can register, orchestrate, constrain and monitor skills without introducing new failure modes.
invented entities (2)
- Clinical Harness: no independent evidence
- clinical AI skills: no independent evidence
Supplementary Information
Supplementary Table 1 | Clinical AI Skill Cards for S1-S9. Registration requires each skill card to specify the clinical objective, intended-use population, inputs, outputs, operating boundary, exclusions or contr...
- Skill-level validation: Validate each S1-S9 skill before registration. Input quality, accuracy, calibration, subgroup robustness, safety, failure modes and fallback rules. Guideline fidelity; calibration; segmentation or surrogate-model accuracy.
- Silent deployment: Run Clinical Harness in the background without affecting care. Workflow fit, artifact completeness, guardrail activation, latency, audit-log completeness and clinician-disagreement patterns. Skill-call rate; safety-gate trigger rate; escalation frequency.
- Prospective workflow evaluation: Test clinical utility with clinicians in control. Care-process improvement, safety, usability, clinician trust and patient-centred benefit. Screening yield; treatment initiation; clinician override; patient-reported outcomes.
- Post-deployment monitoring: Maintain performance and governance after deployment. Drift, bias, safety events, version changes, guideline updates, revalidation and retirement triggers. Calibration drift; subgroup gaps; adverse events; revalidation events.
Supplementary Figure 1 | Exemplar clinical AI skills for osteoporosis lifecycle management. S1-S9 map t...
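The skill-card registration requirement in Supplementary Table 1 lends itself to a simple completeness check. The sketch below covers only the fields that are legible in the extracted caption (the caption is truncated after "exclusions"), and the `SkillCard` class, `skill_id` field and `register` function are hypothetical names, not the paper's schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SkillCard:
    skill_id: str                 # hypothetical identifier such as "S1"
    clinical_objective: str
    intended_use_population: str
    inputs: tuple
    outputs: tuple
    operating_boundary: str
    exclusions: str               # the source caption is truncated after this field


def missing_fields(card):
    """Names of card fields that are empty."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


def register(registry, card):
    """Add the card to the registry, refusing incomplete cards."""
    gaps = missing_fields(card)
    if gaps:
        raise ValueError(f"incomplete skill card {card.skill_id!r}: {gaps}")
    registry[card.skill_id] = card
```

Refusing incomplete cards at registration time is one way to make the card a precondition for orchestration rather than documentation written after the fact.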