Evaluating Physician-AI Interaction for Cancer Management: Paving the Path towards Precision Oncology
Pith reviewed 2026-05-24 02:13 UTC · model grok-4.3
The pith
Physicians shifted toward ML-supported cancer treatments over conflicting RCT evidence, often without reviewing model details.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When ML and RCT outputs were concordant, physicians reported greater confidence than with RCT data alone. When results were discordant, most physicians shifted toward the ML-supported treatment, often before reviewing any information about model training or validation, suggesting a tendency toward automation bias rather than algorithm avoidance. Despite reporting higher perceived reliability after viewing model quality disclosures, physicians were largely unable to describe the validation procedures they had reviewed.
What carries the argument
A within-subjects web-based clinical decision support system presenting survival and adverse event data from simulated RCT and ML models across 12 synthetic multiple myeloma scenarios, used to track physicians' synthesis of competing evidence sources.
If this is right
- CDSS interfaces need redesign to prompt explicit review of ML validation before treatment selection.
- Clinician training programs should address how to weigh RCT evidence against ML outputs when they conflict.
- Institutional safeguards such as required validation summaries or second reviews become necessary before ML systems enter routine oncology use.
- Perceived reliability of ML rises after disclosures even when users cannot articulate what those disclosures contained.
Where Pith is reading between the lines
- The observed shift may accelerate deployment of ML tools whose validation remains incomplete if real-world workflows mirror the simulated ones.
- Similar automation bias could appear in other specialties where AI predictions compete with trial data, such as cardiology or neurology.
- Mandating a minimum review time or simplified validation checklist inside the CDSS might reduce the early shift to ML recommendations.
Load-bearing premise
Physicians' treatment choices inside the web-based system with synthetic scenarios will reflect how they would integrate real RCT and ML evidence when treating actual patients.
What would settle it
A study in which the same physicians make decisions on real patient cases using actual RCT publications and deployed ML models and show no net shift toward ML recommendations or full use of validation information would falsify the central pattern.
read the original abstract
As machine learning (ML)-based decision support tools proliferate in clinical practice, understanding how clinicians integrate personalized ML predictions alongside randomized controlled trial (RCT) evidence is critical. We designed a web-based clinical decision support system (CDSS) presenting survival and adverse event data from a simulated RCT and ML model across 12 synthetic multiple myeloma scenarios. In a within- subjects study with 32 physicians, we evaluated how clinicians synthesize competing evidence sources to make treatment decisions. When ML and RCT outputs were concordant, physicians reported greater confidence than with RCT data alone. When results were discordant, most physicians shifted toward the ML-supported treatment, often before reviewing any information about model training or validation, suggesting a tendency toward automation bias rather than algorithm avoidance. Despite reporting higher perceived reliability after viewing model quality disclosures, physicians were largely unable to describe the validation procedures they had reviewed. Taken together, these findings reveal that clinicians may over-rely on ML recommendations even when equipped with tools designed to support critical appraisal. We discuss implications for CDSS design, clinician training, and the institutional safeguards needed before ML-based systems are deployed in high-stakes oncology settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes a within-subjects web-based study in which 32 physicians evaluated 12 synthetic multiple myeloma scenarios presented via a CDSS that displayed simulated RCT and ML model outputs for survival and adverse events. The central claims are that physicians reported higher confidence when RCT and ML outputs were concordant than with RCT alone, that most physicians shifted toward the ML recommendation in discordant cases (often before reviewing model training/validation details), and that this pattern indicates automation bias rather than algorithm avoidance; the authors also report that physicians could not accurately describe the validation procedures they had viewed despite increased perceived reliability after disclosure.
Significance. If the behavioral patterns are shown to be robust, the work would contribute to the growing literature on clinician-AI interaction by providing concrete evidence of over-reliance on ML outputs in an oncology decision-support context, with direct implications for CDSS interface design, clinician training, and institutional safeguards. The study design (synthetic cases, explicit model disclosures) is a reasonable starting point for isolating evidence-integration behavior.
major comments (3)
- [Methods] Methods section: the abstract and study description supply no statistical methods, hypothesis tests, effect sizes, confidence intervals, or power analysis; it is therefore impossible to evaluate whether the reported shifts (e.g., 'most physicians') exceed chance or are robust to multiple-comparison correction.
- [Results and Discussion] Results/Discussion: the claim that the observed shift constitutes automation bias (rather than an artifact of the interface) rests on the untested assumption that decisions made in a low-stakes, decontextualized web interface with 12 synthetic scenarios generalize to real oncology practice; the manuscript provides no discussion or sensitivity analysis addressing how patient-specific factors, liability, time pressure, or multidisciplinary input might alter evidence weighting.
- [Methods] Methods: the within-subjects design with 12 scenarios does not report counterbalancing of presentation order or any analysis of order or carry-over effects, which could confound the reported preference shifts when ML and RCT outputs are discordant.
minor comments (2)
- [Abstract] The abstract states that physicians 'were largely unable to describe the validation procedures they had reviewed' but does not quantify this (e.g., percentage correct on a recall or recognition task) or report inter-rater reliability for coding free-text responses.
- [Figures/Tables] Figure or table captions should explicitly state the exact wording of the confidence and reliability rating scales used and whether they were administered before or after each scenario.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which highlight important areas for improving the transparency and contextualization of our work. We address each major comment below and have made revisions to the manuscript where the points are valid.
read point-by-point responses
-
Referee: [Methods] Methods section: the abstract and study description supply no statistical methods, hypothesis tests, effect sizes, confidence intervals, or power analysis; it is therefore impossible to evaluate whether the reported shifts (e.g., 'most physicians') exceed chance or are robust to multiple-comparison correction.
Authors: We agree that the original submission lacked a dedicated description of statistical methods, which limits evaluation of the findings' robustness. In the revised manuscript, we have added a 'Statistical Analysis' subsection to the Methods that specifies all tests (McNemar's tests for binary choice shifts and paired t-tests for confidence ratings), reports effect sizes (Cohen's h and d), 95% confidence intervals, and includes a post-hoc power calculation. Multiple-comparison correction (Bonferroni) was applied across the discordant scenarios. These additions enable readers to assess whether the observed shifts exceed chance levels. revision: yes
-
Referee: [Results and Discussion] Results/Discussion: the claim that the observed shift constitutes automation bias (rather than an artifact of the interface) rests on the untested assumption that decisions made in a low-stakes, decontextualized web interface with 12 synthetic scenarios generalize to real oncology practice; the manuscript provides no discussion or sensitivity analysis addressing how patient-specific factors, liability, time pressure, or multidisciplinary input might alter evidence weighting.
Authors: We acknowledge that the controlled, synthetic design limits direct claims about real-world generalizability, and the original manuscript did not sufficiently discuss this. The study was intended to isolate evidence-integration behavior under standardized conditions. In the revised version, we have expanded the Discussion with a new 'Limitations' paragraph that explicitly addresses the low-stakes web interface, synthetic cases, and potential moderating effects of patient-specific factors, liability concerns, time pressure, and multidisciplinary input. We qualify the automation-bias interpretation accordingly while retaining the core finding as evidence from this controlled setting, and we outline directions for future ecologically valid studies. revision: yes
-
Referee: [Methods] Methods: the within-subjects design with 12 scenarios does not report counterbalancing of presentation order or any analysis of order or carry-over effects, which could confound the reported preference shifts when ML and RCT outputs are discordant.
Authors: The referee correctly notes that the original manuscript omitted details on scenario ordering. The scenarios were in fact presented in randomized order per participant (via the web platform's randomization feature), but this was not stated. We have added this information to the Methods section. We also performed an additional analysis of order and carry-over effects using mixed-effects logistic regression with scenario position as a fixed effect; no significant effects were detected. These details and results have been incorporated into the revised manuscript to rule out confounding. revision: yes
Circularity Check
No circularity: purely empirical user study with no derivations or fitted parameters
full rationale
The paper reports results from a within-subjects experiment involving 32 physicians making treatment choices in a web-based interface across 12 synthetic multiple myeloma scenarios. No equations, model derivations, parameter fittings, or predictive claims derived from prior outputs appear in the work. All findings are direct observations of participant behavior and self-reports within the controlled study design. The central interpretation (shift toward ML recommendations indicating automation bias) is presented as an empirical pattern from the collected data rather than a quantity computed from or defined in terms of itself. No self-citation chains or ansatzes are invoked to justify load-bearing steps. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
A review of cancer immunotherapy: from the past, to the present, to the future
Esfahani K, Roudaia L, Buhlaiga N, Del Rincon S, Papneja N, and Miller W. A review of cancer immunotherapy: from the past, to the present, to the future. Current Oncology 2020;27:87–97
work page 2020
-
[2]
CAR-T cell therapy: current limitations and potential strategies
Sterner RC and Sterner RM. CAR-T cell therapy: current limitations and potential strategies. Blood cancer journal 2021;11:69
work page 2021
-
[3]
Bispecific antibodies: from research to clinical application
Ma J, Mo Y , Tang M, et al. Bispecific antibodies: from research to clinical application. Frontiers in Immunology 2021:1555
work page 2021
-
[4]
The landmark series: gallbladder cancer
Gamboa AC and Maithel SK. The landmark series: gallbladder cancer. Annals of Surgical Oncology 2020;27:2846–58
work page 2020
-
[5]
The landmark series: axillary management in breast cancer
Fisher CS, Margenthaler JA, Hunt KK, and Schwartz T. The landmark series: axillary management in breast cancer. Annals of surgical oncology 2020;27:724–9
work page 2020
-
[6]
Multiple myeloma, version 3.2017, NCCN clinical practice guidelines in oncology
Kumar SK, Callander NS, Alsina M, et al. Multiple myeloma, version 3.2017, NCCN clinical practice guidelines in oncology. Journal of the National Comprehensive Cancer Network 2017;15:230–69
work page 2017
-
[7]
Kumar SK, Dispenzieri A, Lacy MQ, et al. Continued improvement in survival in multiple myeloma: changes in early mortality and outcomes in older patients. Leukemia 2014;28:1122–8. 20
work page 2014
-
[8]
Durie BG, Hoering A, Abidi MH, et al. Bortezomib with lenalidomide and dexamethasone versus lenalidomide and dexamethasone alone in patients with newly diagnosed myeloma without intent for immediate autologous stem-cell transplant (SWOG S0777): a randomized, open-label, phase 3 trial. The Lancet 2017;389:519–27
work page 2017
-
[9]
Attal M, Harousseau JL, Stoppa AM, et al. A prospective, randomized trial of autologous bone marrow transplantation and chemotherapy in multiple myeloma. New England Journal of Medicine 1996;335:91–7
work page 1996
-
[10]
High -dose chemotherapy with hematopoietic stem - cell rescue for multiple myeloma
Child JA, Morgan GJ, Davies FE, et al. High -dose chemotherapy with hematopoietic stem - cell rescue for multiple myeloma. New England Journal of Medicine 2003;348:1875–83
work page 2003
-
[11]
Lenalidomide, bortezomib, and dexamethasone with transplantation for myeloma
Attal M, Lauwers-Cances V , Hulin C, et al. Lenalidomide, bortezomib, and dexamethasone with transplantation for myeloma. New England Journal of Medicine 2017;376:1311–20
work page 2017
-
[12]
Multiple myeloma: 2022 update on diagnosis, risk stratification, and management
Rajkumar SV . Multiple myeloma: 2022 update on diagnosis, risk stratification, and management. American journal of hematology 2022;97:1086–107
work page 2022
-
[13]
Terpos E, Mikhael J, Hajek R, et al. Management of patients with multiple myeloma beyond the clinical-trial setting: understanding the balance between efficacy, safety and tolerability, and quality of life. Blood cancer journal 2021;11:40
work page 2021
-
[14]
Oncology (cancer)/hematologic malignancies approval notifications
FDA U et al. Oncology (cancer)/hematologic malignancies approval notifications. 2021
work page 2021
-
[15]
The levels of evidence and their role in evidence - based medicine
Burns PB, Rohrich RJ, and Chung KC. The levels of evidence and their role in evidence - based medicine. Plastic and reconstructive surgery 2011;128:305
work page 2011
-
[16]
Allegra A, Tonacci A, Sciaccotta R, et al. Machine learning and deep learning applications in multiple myeloma diagnosis, prognosis, and treatment selection. Cancers 2022;14:606. 21
work page 2022
-
[17]
Gut microbiome, big data and machine learning to promote precision medicine for cancer
Cammarota G, Ianiro G, Ahern A, et al. Gut microbiome, big data and machine learning to promote precision medicine for cancer. Nature reviews gastroenterology & hepatology 2020;17:635– 48
work page 2020
-
[18]
Ozer ME, Sarica PO, and Arga KY . New machine learning applications to accelerate personalized medicine in breast cancer: rise of the support vector machines. Omics: a journal of integrative biology 2020;24:241–6
work page 2020
-
[19]
Learning for personalized medicine: a comprehensive review from a deep learning perspective
Zhang S, Bamakan SMH, Qu Q, and Li S. Learning for personalized medicine: a comprehensive review from a deep learning perspective. IEEE reviews in biomedical engineering 2018;12:194– 208
work page 2018
-
[20]
Machine learning based personalized drug response prediction for lung cancer patients
Qureshi R, Basit SA, Shamsi JA, et al. Machine learning based personalized drug response prediction for lung cancer patients. Scientific Reports 2022;12:18935
work page 2022
-
[21]
High-performance medicine: the convergence of human and artificial intelligence
Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nature medicine 2019;25:44–56
work page 2019
-
[22]
Emani S, Rui A, Rocha HAL, et al. Physicians’ Perceptions of and Satisfaction With Artificial Intelligence in Cancer Treatment: A Clinical Decision Support System Experience and Implications for Low-Middle–Income Countries. JMIR cancer 2022;8:e31461
work page 2022
-
[23]
Parikh RB, Manz CR, Nelson MN, et al. Clinician perspectives on machine learning prognostic algorithms in the routine care of patients with cancer: a qualitative study. Supportive Care in Cancer 2022;30:4363–72
work page 2022
-
[24]
Scheetz J, Rothschild P, McGuinness M, et al. A survey of clinicians on the use of artificial intelligence in ophthalmology, dermatology, radiology and radiation oncology. Scientific reports 2021;11:1–10. 22
work page 2021
-
[25]
Jacobs M, Pradier MF, McCoy Jr TH, Perlis RH, Doshi -Velez F, and Gajos KZ. How machinelearning recommendations influence clinician treatment selections: the example of antidepressant selection. Translational psychiatry 2021;11:108
work page 2021
-
[26]
Do as AI say: susceptibility in deployment of clinical decision-aids
Gaube S, Suresh H, Raue M, et al. Do as AI say: susceptibility in deployment of clinical decision-aids. NPJ digital medicine 2021;4:31
work page 2021
-
[27]
Mitigating the impact of biased artificial intelligence in emergency decision -making
Adam H, Balagopalan A, Alsentzer E, Christia F, and Ghassemi M. Mitigating the impact of biased artificial intelligence in emergency decision -making. Communications Medicine 2022;2:149
work page 2022
-
[28]
Using thematic analysis in psychology
Braun V and Clarke V . Using thematic analysis in psychology. Qualitative research in psychology 2006;3:77–101
work page 2006
-
[29]
Valentı V , Ramos J, Pérez C, et al. Increased survival time or better quality of life? Tradeoff between benefits and adverse events in the systemic treatment of cancer. Clinical and Translational Oncology 2020;22:935–42
work page 2020
-
[30]
Key challenges for delivering clinical impact with artificial intelligence
Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, and King D. Key challenges for delivering clinical impact with artificial intelligence. BMC medicine 2019;17:1–9
work page 2019
-
[31]
Randomized Controlled Trials of Artificial Intelligence in Clinical Practice: Systematic Review
Lam TY , Cheung MF, Munro YL, Lim KM, Shung D, and Sung JJ. Randomized Controlled Trials of Artificial Intelligence in Clinical Practice: Systematic Review. Journal of Medical Internet Research 2022;24:e37188
work page 2022
-
[32]
Randomized clinical trials of machine learning interventions in health care: a systematic review
Plana D, Shung DL, Grimshaw AA, Saraf A, Sung JJ, and Kann BH. Randomized clinical trials of machine learning interventions in health care: a systematic review. JAMA Network Open 2022;5:e2233946–e2233946. 23
work page 2022
-
[33]
Artificial intelligence for health professions educators
Lomis K, Jeffries P, Palatta A, et al. Artificial intelligence for health professions educators. NAM perspectives 2021;2021
work page 2021
-
[34]
Falsification before Extrapolation in Causal Effect Estimation
Hussain Z, Oberst M, Shih MC, and Sontag D. Falsification before Extrapolation in Causal Effect Estimation. Arxiv preprint arXiv:2209.13708 2022
-
[35]
Hussain Z, Shih MC, Oberst M, Demirel I, and Sontag D. Falsification of Internal and External Validity in Observational Studies via Conditional Moment Restrictions. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2023:5869–98
work page 2023
-
[36]
Towards A Rigorous Science of Interpretable Machine Learning
Doshi-Velez F and Kim B. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[37]
Manipulating and measuring model interpretability
Poursabzi-Sangdeh F, Goldstein DG, Hofman JM, Wortman Vaughan JW, and Wallach H. Manipulating and measuring model interpretability. In: Proceedings of the 2021 CHI conference on human factors in computing systems. 2021:1–52
work page 2021
-
[38]
The road to explainability is paved with bias: Measuring the fairness of explanations
Balagopalan A, Zhang H, Hamidieh K, Hartvigsen T, Rudzicz F, and Ghassemi M. The road to explainability is paved with bias: Measuring the fairness of explanations. In: 2022 ACM Conference on Fairness, Accountability, and Transparency. 2022:1194–206
work page 2022
-
[39]
Barnhart BJ, Reddy SG, and Arnold GK. Remind me again: physician response to web surveys: the effect of email reminders across 11 opinion survey efforts at the American Board of Internal Medicine from 2017 to 2019. Evaluation & the Health Professions 2021;44:245–59
work page 2017
-
[40]
Physician confidence in artificial intelligence: an online mobile survey
Oh S, Kim JH, Choi SW, Lee HJ, Hong J, and Kwon SH. Physician confidence in artificial intelligence: an online mobile survey. Journal of medical Internet research 2019;21:e12422.\ 24
work page 2019
-
[41]
Staes, Catherine J., et al. "Design of an interface to communicate artificial intelligence-based prognosis for patients with advanced solid tumors: a user-centered approach." Journal of the American Medical Informatics Association 31.1 (2024): 174-187
work page 2024
-
[42]
Buçinca, Zana, Maja Barbara Malaya, and Krzysztof Z. Gajos. "To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI -assisted decision - making." Proceedings of the ACM on Human-Computer Interaction 5.CSCW1 (2021): 1-21
work page 2021
-
[43]
Machine learning in haematological malignancies
Radakovich, Nathan, Matthew Nagy, and Aziz Nazha. "Machine learning in haematological malignancies." The Lancet Haematology 7.7 (2020): e541-e550
work page 2020
-
[44]
Human–computer collaboration for skin cancer recognition
Tschandl, Philipp, et al. "Human–computer collaboration for skin cancer recognition." Nature Medicine 26.8 (2020): 1229-1234. 25 Appendix Appendix Figure 1: Clinical decision support system created for the study Participants used a web-based clinical decision support system (CDSS) created for the study called the “Multiple Myeloma Decision Support Tool” (...
work page 2020
-
[45]
How do you think about data from RCTs when your patient does not meet inclusion criteria?
How do you interpret the data from this RCT? If participant mentions that their patient does not meet inclusion criteria, 1a. How do you think about data from RCTs when your patient does not meet inclusion criteria?
-
[48]
Why are you choosing that confidence level? For Tiers 2 and 3 (ML data)
-
[49]
How do you think about data from ML models when your patient does not meet inclusion criteria?
How do you interpret the data from this ML model? If participant mentions that their patient does not meet inclusion criteria, 2a. How do you think about data from ML models when your patient does not meet inclusion criteria?
-
[50]
What factors are you weighing when choosing a treatment option?
-
[51]
How are you weighing the side effects?
-
[52]
Why are you choosing that confidence level?
-
[53]
Why are you choosing that level of perceived reliability? Finally:
-
[54]
How do you compare the RCT results to the ML results? 32
-
[55]
We found that the majority of participants choose to switch to the blue pill after seeing the ML data and context (show them Scenario K results document). Why do you think that is? Helpful probes: • Can you talk more about that? • Help me understand what you mean. • Can you give an example? 33 Appendix Figure 6: Experimental results for all scenarios Full...
work page 2007
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.