Clinical Harness for Governable Medical AI Skill Ecosystems

Lei Bao; Tianhan Xu; Tian Shen; Yongxiang Wang; Zhe Hu

arxiv: 2606.26494 · v2 · pith:BX4Q6SEKnew · submitted 2026-06-25 · 💻 cs.AI

Tianhan Xu, Lei Bao, Zhe Hu, Tian Shen, Yongxiang Wang

Pith reviewed 2026-07-01 07:18 UTC · model grok-4.3

classification 💻 cs.AI

keywords: clinical AI skills · runtime governance · Clinical Harness · osteoporosis · lifecycle care · medical agents · accountable AI

The pith

A Clinical Harness registers, orchestrates, constrains and monitors clinical AI skills to support accountable lifecycle care instead of isolated models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical AI today consists of separate models that lack continuity across a patient's full care timeline. The paper defines clinical AI skills as accountable capabilities and proposes the Clinical Harness as a runtime architecture to register, orchestrate, constrain and monitor them. Using osteoporosis as the running example, it shows how knowledge-driven, data-driven and physics-enhanced skills can be combined for different stages of care. The result is presented as a governed base layer for future medical agents that can persist and remain accountable over time.

Core claim

The authors define clinical AI skills and claim that the Clinical Harness, as a runtime governance architecture, registers, orchestrates, constrains and monitors these skills so that knowledge-driven, data-driven and physics-enhanced skills together can deliver lifecycle care for conditions such as osteoporosis and serve as a controlled substrate for medical agents.

What carries the argument

The Clinical Harness, a runtime governance architecture that registers, orchestrates, constrains and monitors clinical AI skills.
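
The paper gives no implementation of the harness, so the following is only a minimal sketch of what "register, orchestrate, constrain and monitor" could look like in code. Every name in it (`Skill`, `AuditEvent`, `ClinicalHarness`, `has_t_score`, the skill names) is a hypothetical chosen for illustration, not something taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]


@dataclass
class Skill:
    """Hypothetical registered capability; not the paper's definition."""
    name: str
    kind: str  # "knowledge", "data" or "physics"
    run: Callable[[Record], Record]
    constraints: List[Callable[[Record], bool]] = field(default_factory=list)


@dataclass
class AuditEvent:
    skill: str
    status: str  # "ok", "blocked" or "unregistered"
    detail: str = ""


class ClinicalHarness:
    """Illustrative harness: registry + constraint gate + audit log."""

    def __init__(self) -> None:
        self._registry: Dict[str, Skill] = {}
        self.audit_log: List[AuditEvent] = []

    def register(self, skill: Skill) -> None:
        self._registry[skill.name] = skill

    def call(self, name: str, patient: Record) -> Optional[Record]:
        skill = self._registry.get(name)
        if skill is None:
            # Monitor: calls to unknown skills are logged, never executed.
            self.audit_log.append(AuditEvent(name, "unregistered"))
            return None
        output = skill.run(patient)
        # Constrain: every predicate must hold before the output is released.
        failed = [c.__name__ for c in skill.constraints if not c(output)]
        if failed:
            self.audit_log.append(AuditEvent(name, "blocked", ", ".join(failed)))
            return None
        self.audit_log.append(AuditEvent(name, "ok"))
        return output

    def orchestrate(self, names: List[str], patient: Record) -> Dict[str, Optional[Record]]:
        # Orchestrate: run the named skills in order through the same gate.
        return {name: self.call(name, patient) for name in names}


def has_t_score(output: Record) -> bool:
    return "t_score" in output


harness = ClinicalHarness()
harness.register(Skill("bmd_estimate", "data", lambda p: {"t_score": p["t_score"]}, [has_t_score]))
harness.register(Skill("faulty_skill", "data", lambda p: {}, [has_t_score]))
results = harness.orchestrate(["bmd_estimate", "faulty_skill", "not_registered"], {"t_score": -2.7})
print(results)
print([(e.skill, e.status) for e in harness.audit_log])
```

In this sketch a blocked or unregistered call returns `None` and always leaves an audit entry, which is one possible reading of "constrain and monitor"; a real harness would also need persistence, versioning and escalation paths that are not shown here.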

If this is right

Skills become persistent capabilities that span multiple care episodes rather than one-time model outputs.
Different skill types can be combined under shared constraints for a single condition.
Future medical agents operate on a governed substrate with built-in registration and monitoring.
Constraints and monitoring apply uniformly across knowledge, data and physics-based skills.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hospitals might maintain shared skill registries that allow skills developed at one site to run under the same constraints at another.
Runtime monitoring could surface skill drift before it affects patient decisions in ongoing conditions.
Physics-enhanced skills open a route to embed simulation outputs directly into monitored care pathways.

Load-bearing premise

That wrapping skills in a runtime harness will automatically produce enforceable accountability and persistence in actual clinical use rather than remaining a high-level description.

What would settle it

An implemented Clinical Harness that fails to block or log an unmonitored change to an osteoporosis treatment plan during a controlled simulation of lifecycle care.
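
As a rough illustration of that falsification test, the sketch below simulates a treatment-plan store that accepts changes only from registered skills and then checks whether an unmonitored change either alters the plan or leaves no log entry. `GovernedPlan`, `settles_against_harness` and the placeholder treatment names are assumptions made for this example; the paper does not describe such a component.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class GovernedPlan:
    """Hypothetical plan store that accepts changes only from registered skills."""
    treatment: str
    registered_skills: frozenset
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def propose_change(self, source: str, new_treatment: str) -> bool:
        if source not in self.registered_skills:
            # Unmonitored source: block the change and record the attempt.
            self.audit_log.append(("blocked", source, new_treatment))
            return False
        self.audit_log.append(("applied", source, new_treatment))
        self.treatment = new_treatment
        return True


def settles_against_harness(plan: GovernedPlan, source: str, new_treatment: str) -> bool:
    """Return True if the change slipped through or left no audit entry."""
    before = plan.treatment
    logged_before = len(plan.audit_log)
    plan.propose_change(source, new_treatment)
    changed = plan.treatment != before
    logged = len(plan.audit_log) > logged_before
    return changed or not logged


plan = GovernedPlan("treatment_A", frozenset({"guideline_skill"}))
slipped_through = settles_against_harness(plan, "unregistered_script", "treatment_B")
print(slipped_through, plan.treatment, plan.audit_log)
```

A result of `True` from `settles_against_harness` would be the failure the test is looking for; this toy store passes only because blocking and logging are hard-wired into `propose_change`.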

Original abstract

Medical AI remains organized around isolated models, whereas care requires accountable capabilities that persist across time. We define clinical AI skills and propose the Clinical Harness, a runtime governance architecture that registers, orchestrates, constrains and monitors them. Using osteoporosis as an exemplar, we show how knowledge-driven, data-driven and physics-enhanced skills can support lifecycle care and provide a governed substrate for future medical agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a high-level conceptual proposal for a Clinical Harness that defines skills and governance but shows no runtime enforcement or validation.

This paper proposes defining clinical AI skills and wrapping them in the Clinical Harness, a runtime architecture meant to register, orchestrate, constrain, and monitor them for accountable medical use. The osteoporosis exemplar breaks skills into knowledge-driven, data-driven, and physics-enhanced types to support lifecycle care.

The framing is useful. It correctly identifies that isolated models fall short for persistent clinical needs and tries to organize capabilities around governance instead.

The limitation is that the work stays descriptive. The abstract and description claim the harness constrains and monitors skills, yet no constraint logic, monitoring predicates, blocking rules, or persistence mechanisms are specified. The exemplar maps skills conceptually without any execution trace, simulation, or test showing that governance actually applies at runtime.

No data, derivations, error analysis, or implemented components appear. The central claim about delivering a governed substrate therefore rests on architecture text alone.

Readers interested in high-level AI safety frameworks for healthcare might find the skill categorization worth noting. For peer review, the paper lacks the concrete mechanisms or evidence needed to assess whether the harness works as stated. I would not send it to referees in this form.

Referee Report

2 major / 1 minor

Summary. The paper defines clinical AI skills and proposes the Clinical Harness, a runtime governance architecture that registers, orchestrates, constrains and monitors them. Using osteoporosis as an exemplar, it illustrates how knowledge-driven, data-driven and physics-enhanced skills can support lifecycle care and serve as a governed substrate for future medical agents.

Significance. If realized with enforceable runtime mechanisms, the framework could advance accountable, persistent medical AI capabilities beyond isolated models. As presented, however, the contribution is a high-level architectural vision without demonstrated enforcement, validation or specifications, so its significance lies in conceptual framing rather than in delivered governance.

major comments (2)

[Clinical Harness] Clinical Harness description: the architecture is stated to register, orchestrate, constrain and monitor skills, yet no specification is given of the constraint engine, monitoring predicates, blocking logic for invalid outputs, or logging of violations. This absence is load-bearing for the central claim of runtime governance.
[Osteoporosis exemplar] Osteoporosis exemplar: the mapping of knowledge-driven, data-driven and physics-enhanced skills to lifecycle care is presented as a conceptual illustration only, with no pseudocode, constraint predicates or execution-time enforcement details that would allow verification that the harness actually applies constraints.

minor comments (1)

[Abstract] The abstract and main text use 'show how' for the exemplar; rephrasing to 'illustrate conceptually' would better reflect the descriptive nature of the content.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the conceptual framing of the Clinical Harness. We agree that additional illustrative specifications would strengthen the manuscript and will revise to address the points on detailed mechanisms and the exemplar.

Point-by-point responses

Referee: [Clinical Harness] Clinical Harness description: the architecture is stated to register, orchestrate, constrain and monitor skills, yet no specification is given of the constraint engine, monitoring predicates, blocking logic for invalid outputs, or logging of violations. This absence is load-bearing for the central claim of runtime governance.

Authors: We acknowledge that the manuscript presents the Clinical Harness at an architectural level and does not include concrete specifications for the constraint engine, monitoring predicates, blocking logic, or violation logging. This reflects the paper's focus on defining the overall governance substrate rather than delivering an implemented system. In revision we will add a new subsection with high-level pseudocode for the orchestration and monitoring loop, example monitoring predicates for skill outputs, a description of blocking logic for invalid results, and a logging approach for violations. These additions will illustrate how runtime enforcement could operate while preserving the conceptual scope. Revision: yes.
Referee: [Osteoporosis exemplar] Osteoporosis exemplar: the mapping of knowledge-driven, data-driven and physics-enhanced skills to lifecycle care is presented as a conceptual illustration only, with no pseudocode, constraint predicates or execution-time enforcement details that would allow verification that the harness actually applies constraints.

Authors: The osteoporosis section is explicitly framed as a conceptual illustration of skill mapping to lifecycle care. We agree that the absence of pseudocode and enforcement details limits verifiability of constraint application. We will revise this section to include example constraint predicates and pseudocode for at least one skill type (e.g., the knowledge-driven skill), showing how the harness would evaluate outputs against clinical rules at runtime and trigger blocking or logging as needed. Revision: yes.
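
To make the promised "example constraint predicates" concrete, here is a minimal sketch of how a knowledge-driven skill's output might be checked at runtime. It is not the authors' pseudocode. The function names, output keys and the `guideline_ref` value are hypothetical; the only external fact used is the widely cited WHO densitometric cut-off of a T-score at or below -2.5 for osteoporosis, and its use here is purely illustrative, not clinical guidance.

```python
from typing import Callable, Dict, List, Tuple

Output = Dict[str, object]
Predicate = Callable[[Output], bool]

# WHO densitometric cut-off for osteoporosis; illustrative use only.
OSTEOPOROSIS_T_SCORE = -2.5


def label_matches_t_score(output: Output) -> bool:
    """The skill's label must agree with the T-score it reports."""
    t = output.get("t_score")
    if not isinstance(t, (int, float)):
        return False
    expected = "osteoporosis" if t <= OSTEOPOROSIS_T_SCORE else "not_osteoporosis"
    return output.get("label") == expected


def cites_guideline(output: Output) -> bool:
    """A knowledge-driven output must name the rule source it relied on."""
    return bool(output.get("guideline_ref"))


def evaluate(output: Output, predicates: List[Predicate]) -> Tuple[str, List[str]]:
    """Return ("released", []) or ("blocked", [names of failed predicates])."""
    violations = [p.__name__ for p in predicates if not p(output)]
    return ("blocked" if violations else "released", violations)


rules = [label_matches_t_score, cites_guideline]
good = evaluate({"t_score": -2.8, "label": "osteoporosis", "guideline_ref": "local-guideline-v1"}, rules)
bad = evaluate({"t_score": -1.0, "label": "osteoporosis"}, rules)
print(good)
print(bad)
```

Returning the names of the failed predicates, rather than a bare boolean, gives a logging layer something specific to record for each blocked output.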

Circularity Check

0 steps flagged

No circularity: purely definitional architecture proposal with no derivations or fitted results

The paper consists of definitions of clinical AI skills and a proposed runtime governance architecture (the Clinical Harness) illustrated via an osteoporosis exemplar. No equations, predictions, fitted parameters, or derivation chains exist that could reduce a claimed result to its inputs by construction. Claims are presented as conceptual mappings rather than derived outputs, making the work self-contained without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The proposal introduces new terminology and an architecture without external benchmarks or derivations; the central claim therefore rests on the untested premise that the harness can be realized and will deliver governance.

axioms (2)

domain assumption: Clinical AI skills can be defined as persistent, orchestratable units that remain accountable across time. (Stated in the abstract as the foundational definition enabling the harness.)
ad hoc to paper: A runtime governance layer can register, orchestrate, constrain and monitor skills without introducing new failure modes. (Implicit in the proposal that the harness provides a governed substrate.)

invented entities (2)

Clinical Harness (no independent evidence). Purpose: runtime governance architecture for medical AI skills. New system proposed to solve isolation of models; no independent evidence of functionality supplied.
clinical AI skills (no independent evidence). Purpose: reusable accountable capabilities that persist across time. New conceptual unit introduced to replace isolated models; no external validation given.

pith-pipeline@v0.9.1-grok · 5580 in / 1447 out tokens · 25074 ms · 2026-07-01T07:18:07.467868+00:00 · methodology

Reference graph

Works this paper leans on

47 extracted references

[1] Rajpurkar, P., Chen, E., Banerjee, O. & Topol, E. J. AI in health and medicine. Nat. Med. 28, 31–38 (2022)
[2] Shad, R., Cunningham, J. P., Ashley, E. A., Langlotz, C. P. & Hiesinger, W. Designing clinically translatable artificial intelligence systems for high-dimensional medical imaging. Nat. Mach. Intell. 3, 929–935 (2021)
[3] Pickhardt, P. J. et al. Improved CT-based osteoporosis assessment with a fully automated deep learning tool. Radiol. Artif. Intell. 4, e220042 (2022)
[4] Kolanu, N. et al. Clinical utility of computer-aided diagnosis of vertebral fractures from computed tomography images. J. Bone Miner. Res. 35, 2307–2312 (2020)
[5] Hsieh, C.-I. et al. Automated bone mineral density prediction and fracture risk assessment using plain radiographs via deep learning. Nat. Commun. 12, 5472 (2021)
[6] Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023)
[7] Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023)
[8] Kresevic, S. et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. npj Digit. Med. 7, 102 (2024)
[9] Wu, E. et al. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nat. Med. 27, 582–584 (2021)
[10] Wong, A. et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern. Med. 181, 1065–1070 (2021)
[11] Feng, J. et al. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. npj Digit. Med. 5, 66 (2022)
[12] Sendak, M. P., Gao, M., Brajer, N. & Balu, S. Presenting machine learning model information to clinical end users with model facts labels. npj Digit. Med. 3, 41 (2020)
[13] Mitchell, M. et al. Model cards for model reporting. In Proc. Conference on Fairness, Accountability, and Transparency 220–229 (ACM, 2019)
[14] Krishnamoorthy, M., Sjoding, M. W. & Wiens, J. Off-label use of artificial intelligence models in healthcare. Nat. Med. 30, 1525–1527 (2024)
[15] Vasey, B. et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat. Med. 28, 924–933 (2022)
[16] Collins, G. S. et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385, e078378 (2024)
[17] Lekadir, K. et al. FUTURE-AI: international consensus guideline for trustworthy and deployable artificial intelligence in healthcare. BMJ 388, e081554 (2025)
[18] Alderman, J. E. et al. Tackling algorithmic bias and promoting transparency in health datasets: the STANDING Together consensus recommendations. Lancet Digit. Health 7, e64–e88 (2025)
[19] Park, H. et al. Automated deep learning-based bone mineral density assessment for opportunistic osteoporosis screening using various CT protocols with multi-vendor scanners. Sci. Rep. 14, 25014 (2024)
[20] Suri, A. et al. Vertebral deformity measurements at MRI, CT, and radiography using deep learning. Radiol. Artif. Intell. 4, e210015 (2022)
[21] Hong, N. et al. Deep-learning-based detection of vertebral fracture and osteoporosis using lateral spine X-ray radiography. J. Bone Miner. Res. 38, 887–895 (2023)
[22] Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 642, 442–450 (2025)
[23] Liu, X. et al. A generalist medical language model for disease diagnosis assistance. Nat. Med. 31, 932–942 (2025)
[24] LeBoff, M. S. et al. The clinician's guide to prevention and treatment of osteoporosis. Osteoporos. Int. 33, 2049–2102 (2022)
[25] Gregson, C. L. et al. The 2024 UK clinical guideline for the prevention and treatment of osteoporosis. Arch. Osteoporos. 20, 119 (2025)
[26] Ma, J. et al. Segment anything in medical images. Nat. Commun. 15, 654 (2024)
[27] Chen, J. et al. TransUNet: rethinking the U-Net architecture design for medical image segmentation through the lens of transformers. Med. Image Anal. 97, 103280 (2024)
[28] Liao, J.-C., Chen, M. J.-W., Lin, T.-Y. & Chen, W.-P. Biomechanical comparison of vertebroplasty, kyphoplasty, vertebrae stent for osteoporotic vertebral compression fractures: a finite element analysis. Appl. Sci. 11, 5764 (2021)
[29] Li, W. et al. Machine learning applications for the prediction of bone cement leakage in percutaneous vertebroplasty. Front. Public Health 9, 812023 (2021)
[30] Wang, X. et al. Predicting secondary vertebral compression fracture after vertebral augmentation via CT-based machine learning radiomics-clinical model. Acad. Radiol. 32, 298–310 (2025)
[31] Ong, C. S., Obey, N. T., Zheng, Y., Cohan, A. & Schneider, E. B. SurgeryLLM: a retrieval-augmented generation large language model framework for surgical decision support and workflow enhancement. npj Digit. Med. 7, 364 (2024)
[32] de Hond, A. et al. From text to treatment: the crucial role of validation for generative large language models in health care. Lancet Digit. Health 6, e441–e443 (2024)
[33] Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In Proc. 11th International Conference on Learning Representations (ICLR, 2023)
[34] Schick, T. et al. Toolformer: language models can teach themselves to use tools. In Advances in Neural Information Processing Systems 36 (eds Oh, A. et al.) (Curran Associates, 2023)
[35] Shinn, N. et al. Reflexion: language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems 36 (eds Oh, A. et al.) (Curran Associates, 2023)
[36] Habib, A. R. & Gross, C. P. FDA regulations of AI-driven clinical decision support devices fall short. JAMA Intern. Med. 183, 1401–1402 (2023)
[37] Khera, R., Simon, M. A. & Ross, J. S. Automation bias and assistive AI. JAMA 330, 2255–2257 (2023)
[38] van Genderen, M. E., Kant, I. M. J., Tacchetti, C. & Jovinge, S. Moving toward implementation of responsible artificial intelligence in health care. JAMA 333, 1483–1484 (2025)
[39] Gilbert, S. & Kather, J. N. Guardrails for the use of generalist AI in cancer care. Nat. Rev. Cancer 24, 357–358 (2024)
[40] Hernandez-Boussard, T., Lee, A. Y., Stoyanovich, J. & Biven, L. Promoting transparency in AI for biomedical and behavioral research. Nat. Med. 31, 1733–1734 (2025)
[41] Karargyris, A. et al. Federated benchmarking of medical artificial intelligence with MedPerf. Nat. Mach. Intell. 5, 799–810 (2023)
[42] Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023)
[43] Acosta, J. N., Falcone, G. J., Rajpurkar, P. & Topol, E. J. Multimodal biomedical AI. Nat. Med. 28, 1773–1784 (2022)

Supplementary Information

Supplementary Table 1 | Clinical AI Skill Cards for S1-S9. Registration requires each skill card to specify the clinical objective, intended-use population, inputs, outputs, operating boundary, exclusions or contr...

[44] Skill-level validation: Validate each S1-S9 skill before registration. Input quality, accuracy, calibration, subgroup robustness, safety, failure modes and fallback rules. Guideline fidelity; calibration; segmentation or surrogate-model accuracy
[45] Silent deployment: Run Clinical Harness in the background without affecting care. Workflow fit, artifact completeness, guardrail activation, latency, audit-log completeness and clinician-disagreement patterns. Skill-call rate; safety-gate trigger rate; escalation frequency
[46] Prospective workflow evaluation: Test clinical utility with clinicians in control. Care-process improvement, safety, usability, clinician trust and patient-centred benefit. Screening yield; treatment initiation; clinician override; patient-reported outcomes
[47] Post-deployment monitoring: Maintain performance and governance after deployment. Drift, bias, safety events, version changes, guideline updates, revalidation and retirement triggers. Calibration drift; subgroup gaps; adverse events; revalidation events

Supplementary Figure 1 | Exemplar clinical AI skills for osteoporosis lifecycle management. S1-S9 map t...
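
The skill-card requirement quoted from Supplementary Table 1 can be pictured as a record type with a completeness check at registration time. The sketch below is an assumption-laden illustration: it covers only the fields that survive in the truncated extract (the list is cut off after "exclusions"), while `skill_id`, the field names, the example values and the rule that every field must be non-empty are hypothetical simplifications.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class SkillCard:
    """Hypothetical skill card; fields mirror only those named in the extract."""
    skill_id: str  # e.g. "S1"; the identifier format is an assumption
    clinical_objective: str
    intended_use_population: str
    inputs: List[str]
    outputs: List[str]
    operating_boundary: str
    exclusions: List[str]


def missing_fields(card: SkillCard) -> List[str]:
    """Names of fields left empty; simplification: every field must be non-empty."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


def can_register(card: SkillCard) -> bool:
    return not missing_fields(card)


complete = SkillCard(
    skill_id="S1",
    clinical_objective="placeholder objective",
    intended_use_population="placeholder population",
    inputs=["placeholder input"],
    outputs=["placeholder output"],
    operating_boundary="placeholder boundary",
    exclusions=["placeholder exclusion"],
)
incomplete = SkillCard("S2", "placeholder objective", "placeholder population",
                       ["placeholder input"], ["placeholder output"], "", ["placeholder exclusion"])
print(can_register(complete), missing_fields(incomplete))
```

Rejecting incomplete cards at registration is one way a harness could make the card a precondition for running a skill rather than after-the-fact documentation.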
