Enhancing Clinician Decision-Making via Uncertainty-Aware Multi-Expert Fusion for Stroke Rehabilitation

Tamim Ahmed; Thanassis Rikakis

arxiv: 2606.24960 · v1 · pith:CRS4AL74new · submitted 2026-06-23 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 00:39 UTC · model grok-4.3

keywords: stroke rehabilitation · uncertainty quantification · multi-expert fusion · ARAT assessment · dynamic Bayesian network · movement quality analysis · clinical decision support · multimodal models

The pith

xAARA fuses 692 multimodal models via a Dynamic Bayesian Network to deliver uncertainty-calibrated ARAT assessments that reduce predictive uncertainty by 96.1% versus single-clinician scoring in stroke survivors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces xAARA, a system that augments rather than replaces clinician judgment when assessing how stroke survivors organize movements during the Action Research Arm Test (ARAT). It treats scoring as an ill-posed inference task and composes 692 calibrated multimodal models into a Dynamic Bayesian Network that outputs task-level, phase-level, and quality-level results together with calibrated uncertainty and explanations. The system checks outputs against clinical validity rules and defers low-confidence cases. In 105 survivors completing 788 exercises, the method reached 94.2 percent task accuracy and 81.3 percent movement-phase accuracy, cut predictive uncertainty by 96.1 percent relative to single-clinician scoring, matched at least one human rater on every subjective case, and never produced an out-of-range score. Four independent clinicians reviewed the outputs and indicated they would adopt the tool.

Core claim

xAARA composes 692 calibrated multimodal models via a Dynamic Bayesian Network with entropy-based gating, qualifies outputs against clinical validity rules, and defers low-confidence cases. In 105 stroke survivors performing 788 exercises it achieved 94.2 percent task accuracy (kappa 0.934) and 81.3 percent movement-phase accuracy (kappa 0.727), reduced predictive uncertainty by 96.1 percent relative to single-clinician scoring, matched at least one rater 100 percent of the time on subjective cases, and never returned out-of-range scores. Four clinicians validated the assessments and expressed willingness to adopt the system.
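
The abstract does not say how predictive uncertainty is measured. One plausible reading, assumed here purely for illustration, is the Shannon entropy of the distribution over item scores, with the reduction expressed relative to the single-clinician baseline. The symbols H_clinician and H_xAARA are hypothetical names, not notation from the paper.

```latex
\[
  \text{reduction}
  = \frac{H_{\text{clinician}} - H_{\text{xAARA}}}{H_{\text{clinician}}},
  \qquad
  H(p) = -\sum_{s} p(s)\,\log_2 p(s).
\]
% Worked example with a hypothetical baseline of H_clinician = 1.00 bit:
% a 96.1% reduction leaves H_xAARA = (1 - 0.961) * 1.00 = 0.039 bits.
```

Under this reading the figure describes how concentrated the output distribution is, not whether it is correct, which is why calibration evidence matters for interpreting it.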

What carries the argument

Dynamic Bayesian Network with entropy-based gating that fuses 692 calibrated multimodal models and qualifies results by clinical validity rules before deferral.
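
To make the gating-and-deferral idea concrete, here is a minimal sketch. It is not the paper's Dynamic Bayesian Network: it swaps in plain averaging of categorical distributions, and the function names and both thresholds (`gate_bits`, `defer_bits`) are hypothetical choices for illustration only.

```python
import math

ARAT_SCORES = (0, 1, 2, 3)  # ARAT items are scored on an ordinal 0-3 scale


def entropy(p):
    """Shannon entropy, in bits, of a categorical distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def fuse_with_gating(model_probs, gate_bits=1.5, defer_bits=1.0):
    """Average the distributions of confident models; defer if still uncertain.

    gate_bits and defer_bits are illustrative thresholds, not values from the paper.
    """
    # Gate: drop any model whose own prediction is too diffuse.
    kept = [p for p in model_probs if entropy(p) <= gate_bits]
    if not kept:
        return {"decision": "defer", "score": None, "entropy": None}

    # Fuse: simple mean of the surviving distributions (stand-in for the DBN).
    fused = [sum(p[k] for p in kept) / len(kept) for k in range(len(ARAT_SCORES))]
    h = entropy(fused)

    # Defer: hand the case to a clinician when the fused result is still unsure.
    if h > defer_bits:
        return {"decision": "defer", "score": None, "entropy": h}

    # The score is drawn from ARAT_SCORES, so it cannot fall out of range.
    best = max(range(len(fused)), key=fused.__getitem__)
    return {"decision": "accept", "score": ARAT_SCORES[best], "entropy": h}


if __name__ == "__main__":
    models = [[0.02, 0.03, 0.90, 0.05], [0.01, 0.04, 0.90, 0.05], [0.25] * 4]
    print(fuse_with_gating(models))
```

In the usage example the uniform third model is gated out and the two confident models agree, so the case is accepted; two confident models that disagree would push the fused entropy above `defer_bits` and trigger deferral.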

If this is right

94.2 percent task accuracy and 81.3 percent phase accuracy with corresponding kappa values
96.1 percent reduction in predictive uncertainty compared with single-clinician scoring
100 percent match with at least one rater on subjective cases and zero out-of-range scores
Four independent clinicians validated outputs and indicated willingness to adopt the system
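
The kappa values quoted alongside the accuracies correct raw agreement for agreement expected by chance. The sketch below computes unweighted Cohen's kappa from two label sequences. Whether the paper uses the unweighted or a weighted variant is not stated, so the unweighted form is an assumption, as are the function and variable names.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)

    if p_chance == 1.0:
        return math.nan  # kappa is undefined when both raters use one label only
    return (p_observed - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    system = [0, 1, 2, 3, 3, 2, 1, 0]
    clinician = [0, 1, 2, 3, 3, 2, 1, 1]
    print(round(cohens_kappa(system, clinician), 4))
```

In the usage example, seven of eight labels agree (0.875) and chance agreement is 0.25, so kappa is (0.875 - 0.25) / 0.75.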

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion-plus-deferral pattern could be applied to other standardized movement assessments that currently rely on single-observer ordinal scores.
Automatic deferral of uncertain cases could reduce overall clinician time spent on routine scoring while preserving human oversight for ambiguous movements.
If the entropy-gating mechanism generalizes, it may support home-based or tele-rehabilitation settings where live clinician review is unavailable.
The approach supplies a concrete template for embedding calibrated uncertainty into other medical AI systems that must interface with expert judgment.

Load-bearing premise

The 692 multimodal models are sufficiently diverse and independently calibrated that their composition produces uncertainty estimates aligned with human raters and that the clinical validity rules catch all problematic outputs.

What would settle it

A replication study on a fresh cohort of stroke survivors in which uncertainty reduction drops below 50 percent or more than 20 percent of outputs are rejected by clinicians would falsify the claim of clinical utility.

Original abstract

Tailoring stroke rehabilitation requires assessing how movements are organized, not merely if they succeed. Currently, this assessment is a rate-limiting bottleneck. Instruments like the Action Research Arm Test (ARAT) compress rich behavioral observations into single ordinal endpoints, discarding the movement-quality details that distinguish recovery from compensation. Automated alternatives typically chase accuracy on noisy, single-observer labels to output opaque scores - a technology-centric approach that rarely reaches clinical practice. To address this, we present xAARA: an engine designed to augment rather than replace clinical judgment. From multi-view video, xAARA returns ARAT assessments with calibrated uncertainty and explanations across task, movement-phase, and movement-quality levels. Treating clinical scoring as an ill-posed inference problem, xAARA composes 692 calibrated multimodal models via a Dynamic Bayesian Network with entropy-based gating. It qualifies results against clinical validity rules and defers low-confidence cases. In 105 stroke survivors (788 exercises), xAARA achieved 94.2% task accuracy (Cohen's kappa=0.934) and 81.3% movement-phase accuracy (kappa=0.727), reducing predictive uncertainty by 96.1% compared to single-clinician scoring. For subjective cases, it matched at least one rater 100% of the time and never returned out-of-range scores. Four independent clinicians validated the assessments and indicated willingness to adopt the system. We argue that principled uncertainty quantification and clinician-aligned explainability are the critical bridges moving automated assessment from technical demonstration to a deployable clinical tool.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

xAARA reports high ARAT accuracy and 96% uncertainty reduction via 692-model DBN fusion, but the uncertainty claim lacks any calibration or diversity evidence.

The main thing to know is that this paper describes xAARA, which fuses 692 multimodal models through a Dynamic Bayesian Network and entropy gating to output ARAT scores plus uncertainty for stroke rehab videos. It claims 94.2% task accuracy, 81.3% phase accuracy, and a 96.1% drop in predictive uncertainty versus single-clinician scoring on 788 exercises from 105 patients, plus 100% match to at least one rater on subjective cases and no out-of-range outputs.

The work does target a genuine clinical issue: ARAT and similar tests lose movement-quality information that separates recovery from compensation. The multi-level outputs and the rule-based deferral for low-confidence cases are reasonable attempts to make the output more usable in practice. The reported clinician review and stated willingness to adopt also give it a bit more grounding than pure technical demos.

The soft spots are clear from the abstract. No details appear on how the 692 models were trained, calibrated, or checked for diversity, and there are no reliability diagrams, expected calibration error numbers, or pairwise disagreement metrics. The entropy gating and uncertainty reduction therefore rest on untested assumptions. Data handling is also opaque: no splits are described, there is no mention of whether thresholds were tuned on the evaluation set, and no statistical tests are reported on the kappas. These gaps make the 96.1% figure hard to trust as a real clinical improvement rather than an artifact of the fusion rule.

This is for researchers building automated rehab assessment tools and for clinicians who want uncertainty-aware outputs. It deserves a serious referee to check whether the methods section supplies the missing calibration and validation evidence. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper presents xAARA, a system for ARAT-based stroke rehabilitation assessment that fuses 692 multimodal models via a Dynamic Bayesian Network with entropy-based gating, qualifies outputs against clinical validity rules, and defers low-confidence cases. On 788 exercises from 105 patients, it reports 94.2% task accuracy (kappa=0.934), 81.3% movement-phase accuracy (kappa=0.727), a 96.1% reduction in predictive uncertainty versus single-clinician scoring, 100% match to at least one rater on subjective cases, no out-of-range scores, and positive validation by four clinicians willing to adopt the tool.

Significance. If the uncertainty estimates are shown to be well-calibrated and the fusion produces rater-aligned outputs, the emphasis on uncertainty quantification and clinician-aligned explainability could help bridge automated assessment to clinical use, addressing limitations of prior accuracy-focused approaches.

major comments (2)

[Abstract] The headline claim of 96.1% predictive uncertainty reduction (relative to single-clinician scoring) is load-bearing for the central argument that xAARA augments rather than replaces judgment, yet the manuscript supplies no reliability diagrams, expected calibration error, pairwise model disagreement, or mutual information metrics to demonstrate that the 692 models are diverse and independently calibrated or that the DBN+entropy gating yields well-calibrated posteriors.
[Abstract / Results] The reported accuracies and uncertainty reduction rest on the composition step, but no information is given on training procedures, data splits, how the 692 models were calibrated, or whether entropy thresholds and clinical validity rules were tuned on the same 788-exercise dataset, creating a circularity risk that undermines interpretation of the 96.1% figure.
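
As an illustration of the calibration diagnostic requested in the first major comment, the following sketch computes expected calibration error (ECE) with equal-width confidence bins. The bin count and function name are assumptions; nothing here reflects how the authors would compute it.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted mean gap between confidence and accuracy across bins.

    confidences: top-class probabilities in [0, 1]
    correct:     booleans, True where the top-class prediction was right
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    # Assign each prediction to an equal-width bin; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Overconfident toy model: claims 95% but is right half the time.
    print(expected_calibration_error([0.95] * 4, [True, True, False, False]))
```

A well-calibrated model that reports 0.9 confidence and is right nine times in ten scores close to zero; the overconfident toy case in the usage example scores 0.45.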

minor comments (2)

The description of the clinical validity rules used for qualification is too brief to allow assessment of their comprehensiveness.
Add inter-rater agreement statistics from the four validating clinicians and details on the validation protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to strengthen the presentation of calibration evidence and methodological details.

Referee: [Abstract] The headline claim of 96.1% predictive uncertainty reduction (relative to single-clinician scoring) is load-bearing for the central argument that xAARA augments rather than replaces judgment, yet the manuscript supplies no reliability diagrams, expected calibration error, pairwise model disagreement, or mutual information metrics to demonstrate that the 692 models are diverse and independently calibrated or that the DBN+entropy gating yields well-calibrated posteriors.

Authors: We agree that the 96.1% uncertainty reduction claim requires supporting calibration diagnostics. The current manuscript does not include reliability diagrams, ECE, pairwise disagreement, or mutual information analyses. In revision we will add these metrics, computed on the held-out test set, to demonstrate diversity among the 692 models and calibration of the DBN posteriors. revision: yes

Referee: [Abstract / Results] The reported accuracies and uncertainty reduction rest on the composition step, but no information is given on training procedures, data splits, how the 692 models were calibrated, or whether entropy thresholds and clinical validity rules were tuned on the same 788-exercise dataset, creating a circularity risk that undermines interpretation of the 96.1% figure.

Authors: We acknowledge the circularity concern. The manuscript as submitted does not detail training procedures, data splits, individual model calibration, or the tuning of entropy thresholds and validity rules. We will revise the Methods section to specify patient-wise cross-validation splits, the calibration protocol applied to each of the 692 models, and confirmation that thresholds were tuned on a separate validation partition so that the reported test metrics remain unbiased. revision: yes
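
The patient-wise splits promised in the simulated rebuttal can be sketched as follows. The point is that all exercises from one patient land in the same fold, so no patient appears in both training and test data. The function name, the greedy balancing rule, and the fold count are assumptions for illustration, not the authors' protocol.

```python
from collections import defaultdict


def patient_wise_folds(patient_ids, n_folds=5):
    """Yield (train_indices, test_indices) with each patient in exactly one fold."""
    by_patient = defaultdict(list)
    for index, pid in enumerate(patient_ids):
        by_patient[pid].append(index)
    if n_folds < 2 or n_folds > len(by_patient):
        raise ValueError("n_folds must be between 2 and the number of patients")

    # Greedy balancing: place patients with the most exercises first,
    # always into the fold that currently holds the fewest exercises.
    folds = [[] for _ in range(n_folds)]
    ordered = sorted(by_patient, key=lambda p: (-len(by_patient[p]), str(p)))
    for pid in ordered:
        min(folds, key=len).extend(by_patient[pid])

    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


if __name__ == "__main__":
    ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4"]
    for train, test in patient_wise_folds(ids, n_folds=2):
        print(train, test)
```

Splitting by exercise instead of by patient would let the same person's movement style leak across the split, which is one way reported test metrics can become optimistic.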

Circularity Check

0 steps flagged

No circularity detected in derivation chain

The provided abstract and description report empirical performance metrics (94.2% task accuracy, 81.3% phase accuracy, 96.1% uncertainty reduction) on 788 exercises from 105 patients; the abstract does not state whether this set was held out. No equations, self-citations, or steps are quoted that reduce the claimed outputs to fitted parameters or prior self-referential results by construction. The composition of 692 models via DBN and entropy gating is presented as a method whose outputs are then validated against clinical rules and human raters; the text does not exhibit self-definitional fitting, renaming of known results, or load-bearing self-citation that forces the headline numbers. The derivation is therefore not circular by construction. This check covers derivational circularity only; the separate risk that thresholds and validity rules were tuned on the evaluation data, raised in the referee report, cannot be assessed from the abstract.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The abstract relies on a large ensemble of pre-calibrated models whose internal parameters and training data are unspecified; the fusion method assumes model diversity and calibration quality without independent verification in the provided text.

free parameters (2)

Number of multimodal models = 692
The system composes exactly 692 models; this count is presented as given and likely selected or tuned to reach the reported accuracy and uncertainty levels.
Entropy threshold for gating
Entropy-based gating decides when to defer; the threshold value is not stated but is central to the uncertainty handling and deferral behavior.
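
Because the threshold is unstated, it helps to see how strongly it can shape reported numbers. The sketch below sweeps a deferral threshold and reports coverage (the fraction of cases not deferred) and accuracy on the accepted cases. The function name and the toy data are hypothetical; no values come from the paper.

```python
def coverage_accuracy_curve(entropies, correct, thresholds):
    """For each threshold, return (threshold, coverage, accuracy_on_accepted).

    A case is accepted when its predictive entropy is at or below the
    threshold; otherwise it is deferred to a clinician.
    """
    if len(entropies) != len(correct) or not entropies:
        raise ValueError("inputs must be non-empty and of equal length")
    rows = []
    for threshold in thresholds:
        accepted = [ok for h, ok in zip(entropies, correct) if h <= threshold]
        coverage = len(accepted) / len(entropies)
        # Accuracy is undefined (None) when every case is deferred.
        accuracy = sum(accepted) / len(accepted) if accepted else None
        rows.append((threshold, coverage, accuracy))
    return rows


if __name__ == "__main__":
    entropies = [0.1, 0.2, 0.9, 1.5]       # toy predictive entropies, in bits
    correct = [True, True, False, True]    # toy correctness flags
    for row in coverage_accuracy_curve(entropies, correct, [0.5, 1.0, 2.0]):
        print(row)
```

A strict threshold can make accuracy on accepted cases look high simply by deferring the hard ones, so accuracy figures are easiest to interpret when the deferral rate is reported next to them.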

axioms (2)

domain assumption: Clinical scoring is an ill-posed inference problem best addressed by multi-model fusion
Abstract explicitly frames the approach as treating scoring this way.
domain assumption: The Dynamic Bayesian Network produces calibrated uncertainty estimates that align with clinical validity
Core premise underlying the 96.1% uncertainty reduction claim and deferral mechanism.

pith-pipeline@v0.9.1-grok · 5812 in / 1749 out tokens · 31431 ms · 2026-06-26T00:39:26.273982+00:00 · methodology
