Evaluating Physician-AI Interaction for Cancer Management: Paving the Path towards Precision Oncology

Andrew J. Yee; Barbara D. Lam; David Sontag; Fernando A. Acosta-Perez; Irbaz Bin Riaz; Maia Jacobs; Zeshan Hussain

arxiv: 2404.15187 · v2 · pith:ILGOWQPDnew · submitted 2024-04-23 · 💻 cs.HC

Evaluating Physician-AI Interaction for Cancer Management: Paving the Path towards Precision Oncology

Zeshan Hussain , Barbara D. Lam , Fernando A. Acosta-Perez , Irbaz Bin Riaz , Maia Jacobs , Andrew J. Yee , David Sontag This is my paper

Pith reviewed 2026-05-24 02:13 UTC · model grok-4.3

classification 💻 cs.HC

keywords physician-AI interactionautomation biasclinical decision support systemsmultiple myelomamachine learning in oncologyRCT evidence integrationtreatment decision makingprecision oncology

0 comments

The pith

Physicians shifted toward ML-supported cancer treatments over conflicting RCT evidence, often without reviewing model details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how doctors combine machine learning predictions with randomized trial results inside a clinical decision support tool for multiple myeloma. Across 12 synthetic cases and 32 physicians, concordant ML and RCT outputs raised reported confidence above RCT data alone. When outputs disagreed, most physicians moved to the ML option, frequently before any model training or validation information was viewed. Even after seeing quality disclosures, participants could rarely describe the validation steps they had examined. The work shows that current decision support setups may not prevent over-reliance on ML outputs in oncology.

Core claim

When ML and RCT outputs were concordant, physicians reported greater confidence than with RCT data alone. When results were discordant, most physicians shifted toward the ML-supported treatment, often before reviewing any information about model training or validation, suggesting a tendency toward automation bias rather than algorithm avoidance. Despite reporting higher perceived reliability after viewing model quality disclosures, physicians were largely unable to describe the validation procedures they had reviewed.

What carries the argument

A within-subjects web-based clinical decision support system presenting survival and adverse event data from simulated RCT and ML models across 12 synthetic multiple myeloma scenarios, used to track physicians' synthesis of competing evidence sources.

If this is right

CDSS interfaces need redesign to prompt explicit review of ML validation before treatment selection.
Clinician training programs should address how to weigh RCT evidence against ML outputs when they conflict.
Institutional safeguards such as required validation summaries or second reviews become necessary before ML systems enter routine oncology use.
Perceived reliability of ML rises after disclosures even when users cannot articulate what those disclosures contained.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed shift may accelerate deployment of ML tools whose validation remains incomplete if real-world workflows mirror the simulated ones.
Similar automation bias could appear in other specialties where AI predictions compete with trial data, such as cardiology or neurology.
Mandating a minimum review time or simplified validation checklist inside the CDSS might reduce the early shift to ML recommendations.

Load-bearing premise

Physicians' treatment choices inside the web-based system with synthetic scenarios will reflect how they would integrate real RCT and ML evidence when treating actual patients.

What would settle it

A study in which the same physicians make decisions on real patient cases using actual RCT publications and deployed ML models and show no net shift toward ML recommendations or full use of validation information would falsify the central pattern.

read the original abstract

As machine learning (ML)-based decision support tools proliferate in clinical practice, understanding how clinicians integrate personalized ML predictions alongside randomized controlled trial (RCT) evidence is critical. We designed a web-based clinical decision support system (CDSS) presenting survival and adverse event data from a simulated RCT and ML model across 12 synthetic multiple myeloma scenarios. In a within- subjects study with 32 physicians, we evaluated how clinicians synthesize competing evidence sources to make treatment decisions. When ML and RCT outputs were concordant, physicians reported greater confidence than with RCT data alone. When results were discordant, most physicians shifted toward the ML-supported treatment, often before reviewing any information about model training or validation, suggesting a tendency toward automation bias rather than algorithm avoidance. Despite reporting higher perceived reliability after viewing model quality disclosures, physicians were largely unable to describe the validation procedures they had reviewed. Taken together, these findings reveal that clinicians may over-rely on ML recommendations even when equipped with tools designed to support critical appraisal. We discuss implications for CDSS design, clinician training, and the institutional safeguards needed before ML-based systems are deployed in high-stakes oncology settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The study finds physicians shifting to ML outputs in discordant cases inside a synthetic CDSS and retaining little about validation afterward, but the web-based design leaves open whether this reflects real oncology practice.

read the letter

The central observation is that when RCT and ML outputs disagreed, most of the 32 physicians moved toward the ML recommendation, often before looking at model details, and later could not recall the validation steps they had seen. The paper sets up a within-subjects web experiment with 12 synthetic multiple myeloma scenarios that present both concordant and discordant evidence sources, then tracks treatment choice, confidence, and information retention after disclosure. That controlled comparison of evidence integration is the concrete piece it adds to the automation-bias literature in clinical AI. The execution is straightforward and the measures are direct enough to let readers see the pattern in this interface. The main weakness is the distance from actual practice. Decisions here occur in a low-stakes browser tool with made-up cases and no comorbidities, time pressure, liability, or team input. Those missing elements could easily alter how physicians weigh the two evidence streams, so the automation-bias reading rests on an untested assumption that the synthetic behavior carries over. The abstract also gives no statistical methods or effect sizes, which makes it harder to judge how stable the “most physicians” claims are. This work is aimed at people building or studying clinical decision support tools in oncology and HCI. A reader already tracking automation bias would pick up the specific experimental contrast and the recall finding as one more data point. It is worth sending to peer review because the topic matters and the study design is replicable; referees can push on the external-validity discussion and the missing stats details without the paper being fundamentally broken.

Referee Report

3 major / 2 minor

Summary. The manuscript describes a within-subjects web-based study in which 32 physicians evaluated 12 synthetic multiple myeloma scenarios presented via a CDSS that displayed simulated RCT and ML model outputs for survival and adverse events. The central claims are that physicians reported higher confidence when RCT and ML outputs were concordant than with RCT alone, that most physicians shifted toward the ML recommendation in discordant cases (often before reviewing model training/validation details), and that this pattern indicates automation bias rather than algorithm avoidance; the authors also report that physicians could not accurately describe the validation procedures they had viewed despite increased perceived reliability after disclosure.

Significance. If the behavioral patterns are shown to be robust, the work would contribute to the growing literature on clinician-AI interaction by providing concrete evidence of over-reliance on ML outputs in an oncology decision-support context, with direct implications for CDSS interface design, clinician training, and institutional safeguards. The study design (synthetic cases, explicit model disclosures) is a reasonable starting point for isolating evidence-integration behavior.

major comments (3)

[Methods] Methods section: the abstract and study description supply no statistical methods, hypothesis tests, effect sizes, confidence intervals, or power analysis; it is therefore impossible to evaluate whether the reported shifts (e.g., 'most physicians') exceed chance or are robust to multiple-comparison correction.
[Results and Discussion] Results/Discussion: the claim that the observed shift constitutes automation bias (rather than an artifact of the interface) rests on the untested assumption that decisions made in a low-stakes, decontextualized web interface with 12 synthetic scenarios generalize to real oncology practice; the manuscript provides no discussion or sensitivity analysis addressing how patient-specific factors, liability, time pressure, or multidisciplinary input might alter evidence weighting.
[Methods] Methods: the within-subjects design with 12 scenarios does not report counterbalancing of presentation order or any analysis of order or carry-over effects, which could confound the reported preference shifts when ML and RCT outputs are discordant.

minor comments (2)

[Abstract] The abstract states that physicians 'were largely unable to describe the validation procedures they had reviewed' but does not quantify this (e.g., percentage correct on a recall or recognition task) or report inter-rater reliability for coding free-text responses.
[Figures/Tables] Figure or table captions should explicitly state the exact wording of the confidence and reliability rating scales used and whether they were administered before or after each scenario.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important areas for improving the transparency and contextualization of our work. We address each major comment below and have made revisions to the manuscript where the points are valid.

read point-by-point responses

Referee: [Methods] Methods section: the abstract and study description supply no statistical methods, hypothesis tests, effect sizes, confidence intervals, or power analysis; it is therefore impossible to evaluate whether the reported shifts (e.g., 'most physicians') exceed chance or are robust to multiple-comparison correction.

Authors: We agree that the original submission lacked a dedicated description of statistical methods, which limits evaluation of the findings' robustness. In the revised manuscript, we have added a 'Statistical Analysis' subsection to the Methods that specifies all tests (McNemar's tests for binary choice shifts and paired t-tests for confidence ratings), reports effect sizes (Cohen's h and d), 95% confidence intervals, and includes a post-hoc power calculation. Multiple-comparison correction (Bonferroni) was applied across the discordant scenarios. These additions enable readers to assess whether the observed shifts exceed chance levels. revision: yes
Referee: [Results and Discussion] Results/Discussion: the claim that the observed shift constitutes automation bias (rather than an artifact of the interface) rests on the untested assumption that decisions made in a low-stakes, decontextualized web interface with 12 synthetic scenarios generalize to real oncology practice; the manuscript provides no discussion or sensitivity analysis addressing how patient-specific factors, liability, time pressure, or multidisciplinary input might alter evidence weighting.

Authors: We acknowledge that the controlled, synthetic design limits direct claims about real-world generalizability, and the original manuscript did not sufficiently discuss this. The study was intended to isolate evidence-integration behavior under standardized conditions. In the revised version, we have expanded the Discussion with a new 'Limitations' paragraph that explicitly addresses the low-stakes web interface, synthetic cases, and potential moderating effects of patient-specific factors, liability concerns, time pressure, and multidisciplinary input. We qualify the automation-bias interpretation accordingly while retaining the core finding as evidence from this controlled setting, and we outline directions for future ecologically valid studies. revision: yes
Referee: [Methods] Methods: the within-subjects design with 12 scenarios does not report counterbalancing of presentation order or any analysis of order or carry-over effects, which could confound the reported preference shifts when ML and RCT outputs are discordant.

Authors: The referee correctly notes that the original manuscript omitted details on scenario ordering. The scenarios were in fact presented in randomized order per participant (via the web platform's randomization feature), but this was not stated. We have added this information to the Methods section. We also performed an additional analysis of order and carry-over effects using mixed-effects logistic regression with scenario position as a fixed effect; no significant effects were detected. These details and results have been incorporated into the revised manuscript to rule out confounding. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical user study with no derivations or fitted parameters

full rationale

The paper reports results from a within-subjects experiment involving 32 physicians making treatment choices in a web-based interface across 12 synthetic multiple myeloma scenarios. No equations, model derivations, parameter fittings, or predictive claims derived from prior outputs appear in the work. All findings are direct observations of participant behavior and self-reports within the controlled study design. The central interpretation (shift toward ML recommendations indicating automation bias) is presented as an empirical pattern from the collected data rather than a quantity computed from or defined in terms of itself. No self-citation chains or ansatzes are invoked to justify load-bearing steps. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical human-subjects study and introduces no free parameters, mathematical axioms, or invented entities.

pith-pipeline@v0.9.0 · 5755 in / 1222 out tokens · 32058 ms · 2026-05-24T02:13:01.487502+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 1 internal anchor

[1]

A review of cancer immunotherapy: from the past, to the present, to the future

Esfahani K, Roudaia L, Buhlaiga N, Del Rincon S, Papneja N, and Miller W. A review of cancer immunotherapy: from the past, to the present, to the future. Current Oncology 2020;27:87–97

work page 2020
[2]

CAR-T cell therapy: current limitations and potential strategies

Sterner RC and Sterner RM. CAR-T cell therapy: current limitations and potential strategies. Blood cancer journal 2021;11:69

work page 2021
[3]

Bispecific antibodies: from research to clinical application

Ma J, Mo Y , Tang M, et al. Bispecific antibodies: from research to clinical application. Frontiers in Immunology 2021:1555

work page 2021
[4]

The landmark series: gallbladder cancer

Gamboa AC and Maithel SK. The landmark series: gallbladder cancer. Annals of Surgical Oncology 2020;27:2846–58

work page 2020
[5]

The landmark series: axillary management in breast cancer

Fisher CS, Margenthaler JA, Hunt KK, and Schwartz T. The landmark series: axillary management in breast cancer. Annals of surgical oncology 2020;27:724–9

work page 2020
[6]

Multiple myeloma, version 3.2017, NCCN clinical practice guidelines in oncology

Kumar SK, Callander NS, Alsina M, et al. Multiple myeloma, version 3.2017, NCCN clinical practice guidelines in oncology. Journal of the National Comprehensive Cancer Network 2017;15:230–69

work page 2017
[7]

Continued improvement in survival in multiple myeloma: changes in early mortality and outcomes in older patients

Kumar SK, Dispenzieri A, Lacy MQ, et al. Continued improvement in survival in multiple myeloma: changes in early mortality and outcomes in older patients. Leukemia 2014;28:1122–8. 20

work page 2014
[8]

Durie BG, Hoering A, Abidi MH, et al. Bortezomib with lenalidomide and dexamethasone versus lenalidomide and dexamethasone alone in patients with newly diagnosed myeloma without intent for immediate autologous stem-cell transplant (SWOG S0777): a randomized, open-label, phase 3 trial. The Lancet 2017;389:519–27

work page 2017
[9]

A prospective, randomized trial of autologous bone marrow transplantation and chemotherapy in multiple myeloma

Attal M, Harousseau JL, Stoppa AM, et al. A prospective, randomized trial of autologous bone marrow transplantation and chemotherapy in multiple myeloma. New England Journal of Medicine 1996;335:91–7

work page 1996
[10]

High -dose chemotherapy with hematopoietic stem - cell rescue for multiple myeloma

Child JA, Morgan GJ, Davies FE, et al. High -dose chemotherapy with hematopoietic stem - cell rescue for multiple myeloma. New England Journal of Medicine 2003;348:1875–83

work page 2003
[11]

Lenalidomide, bortezomib, and dexamethasone with transplantation for myeloma

Attal M, Lauwers-Cances V , Hulin C, et al. Lenalidomide, bortezomib, and dexamethasone with transplantation for myeloma. New England Journal of Medicine 2017;376:1311–20

work page 2017
[12]

Multiple myeloma: 2022 update on diagnosis, risk stratification, and management

Rajkumar SV . Multiple myeloma: 2022 update on diagnosis, risk stratification, and management. American journal of hematology 2022;97:1086–107

work page 2022
[13]

Management of patients with multiple myeloma beyond the clinical-trial setting: understanding the balance between efficacy, safety and tolerability, and quality of life

Terpos E, Mikhael J, Hajek R, et al. Management of patients with multiple myeloma beyond the clinical-trial setting: understanding the balance between efficacy, safety and tolerability, and quality of life. Blood cancer journal 2021;11:40

work page 2021
[14]

Oncology (cancer)/hematologic malignancies approval notifications

FDA U et al. Oncology (cancer)/hematologic malignancies approval notifications. 2021

work page 2021
[15]

The levels of evidence and their role in evidence - based medicine

Burns PB, Rohrich RJ, and Chung KC. The levels of evidence and their role in evidence - based medicine. Plastic and reconstructive surgery 2011;128:305

work page 2011
[16]

Machine learning and deep learning applications in multiple myeloma diagnosis, prognosis, and treatment selection

Allegra A, Tonacci A, Sciaccotta R, et al. Machine learning and deep learning applications in multiple myeloma diagnosis, prognosis, and treatment selection. Cancers 2022;14:606. 21

work page 2022
[17]

Gut microbiome, big data and machine learning to promote precision medicine for cancer

Cammarota G, Ianiro G, Ahern A, et al. Gut microbiome, big data and machine learning to promote precision medicine for cancer. Nature reviews gastroenterology & hepatology 2020;17:635– 48

work page 2020
[18]

New machine learning applications to accelerate personalized medicine in breast cancer: rise of the support vector machines

Ozer ME, Sarica PO, and Arga KY . New machine learning applications to accelerate personalized medicine in breast cancer: rise of the support vector machines. Omics: a journal of integrative biology 2020;24:241–6

work page 2020
[19]

Learning for personalized medicine: a comprehensive review from a deep learning perspective

Zhang S, Bamakan SMH, Qu Q, and Li S. Learning for personalized medicine: a comprehensive review from a deep learning perspective. IEEE reviews in biomedical engineering 2018;12:194– 208

work page 2018
[20]

Machine learning based personalized drug response prediction for lung cancer patients

Qureshi R, Basit SA, Shamsi JA, et al. Machine learning based personalized drug response prediction for lung cancer patients. Scientific Reports 2022;12:18935

work page 2022
[21]

High-performance medicine: the convergence of human and artificial intelligence

Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nature medicine 2019;25:44–56

work page 2019
[22]

Emani S, Rui A, Rocha HAL, et al. Physicians’ Perceptions of and Satisfaction With Artificial Intelligence in Cancer Treatment: A Clinical Decision Support System Experience and Implications for Low-Middle–Income Countries. JMIR cancer 2022;8:e31461

work page 2022
[23]

Clinician perspectives on machine learning prognostic algorithms in the routine care of patients with cancer: a qualitative study

Parikh RB, Manz CR, Nelson MN, et al. Clinician perspectives on machine learning prognostic algorithms in the routine care of patients with cancer: a qualitative study. Supportive Care in Cancer 2022;30:4363–72

work page 2022
[24]

A survey of clinicians on the use of artificial intelligence in ophthalmology, dermatology, radiology and radiation oncology

Scheetz J, Rothschild P, McGuinness M, et al. A survey of clinicians on the use of artificial intelligence in ophthalmology, dermatology, radiology and radiation oncology. Scientific reports 2021;11:1–10. 22

work page 2021
[25]

How machinelearning recommendations influence clinician treatment selections: the example of antidepressant selection

Jacobs M, Pradier MF, McCoy Jr TH, Perlis RH, Doshi -Velez F, and Gajos KZ. How machinelearning recommendations influence clinician treatment selections: the example of antidepressant selection. Translational psychiatry 2021;11:108

work page 2021
[26]

Do as AI say: susceptibility in deployment of clinical decision-aids

Gaube S, Suresh H, Raue M, et al. Do as AI say: susceptibility in deployment of clinical decision-aids. NPJ digital medicine 2021;4:31

work page 2021
[27]

Mitigating the impact of biased artificial intelligence in emergency decision -making

Adam H, Balagopalan A, Alsentzer E, Christia F, and Ghassemi M. Mitigating the impact of biased artificial intelligence in emergency decision -making. Communications Medicine 2022;2:149

work page 2022
[28]

Using thematic analysis in psychology

Braun V and Clarke V . Using thematic analysis in psychology. Qualitative research in psychology 2006;3:77–101

work page 2006
[29]

Increased survival time or better quality of life? Tradeoff between benefits and adverse events in the systemic treatment of cancer

Valentı V , Ramos J, Pérez C, et al. Increased survival time or better quality of life? Tradeoff between benefits and adverse events in the systemic treatment of cancer. Clinical and Translational Oncology 2020;22:935–42

work page 2020
[30]

Key challenges for delivering clinical impact with artificial intelligence

Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, and King D. Key challenges for delivering clinical impact with artificial intelligence. BMC medicine 2019;17:1–9

work page 2019
[31]

Randomized Controlled Trials of Artificial Intelligence in Clinical Practice: Systematic Review

Lam TY , Cheung MF, Munro YL, Lim KM, Shung D, and Sung JJ. Randomized Controlled Trials of Artificial Intelligence in Clinical Practice: Systematic Review. Journal of Medical Internet Research 2022;24:e37188

work page 2022
[32]

Randomized clinical trials of machine learning interventions in health care: a systematic review

Plana D, Shung DL, Grimshaw AA, Saraf A, Sung JJ, and Kann BH. Randomized clinical trials of machine learning interventions in health care: a systematic review. JAMA Network Open 2022;5:e2233946–e2233946. 23

work page 2022
[33]

Artificial intelligence for health professions educators

Lomis K, Jeffries P, Palatta A, et al. Artificial intelligence for health professions educators. NAM perspectives 2021;2021

work page 2021
[34]

Falsification before Extrapolation in Causal Effect Estimation

Hussain Z, Oberst M, Shih MC, and Sontag D. Falsification before Extrapolation in Causal Effect Estimation. Arxiv preprint arXiv:2209.13708 2022

work page arXiv 2022
[35]

Falsification of Internal and External Validity in Observational Studies via Conditional Moment Restrictions

Hussain Z, Shih MC, Oberst M, Demirel I, and Sontag D. Falsification of Internal and External Validity in Observational Studies via Conditional Moment Restrictions. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2023:5869–98

work page 2023
[36]

Towards A Rigorous Science of Interpretable Machine Learning

Doshi-Velez F and Kim B. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

Manipulating and measuring model interpretability

Poursabzi-Sangdeh F, Goldstein DG, Hofman JM, Wortman Vaughan JW, and Wallach H. Manipulating and measuring model interpretability. In: Proceedings of the 2021 CHI conference on human factors in computing systems. 2021:1–52

work page 2021
[38]

The road to explainability is paved with bias: Measuring the fairness of explanations

Balagopalan A, Zhang H, Hamidieh K, Hartvigsen T, Rudzicz F, and Ghassemi M. The road to explainability is paved with bias: Measuring the fairness of explanations. In: 2022 ACM Conference on Fairness, Accountability, and Transparency. 2022:1194–206

work page 2022
[39]

Remind me again: physician response to web surveys: the effect of email reminders across 11 opinion survey efforts at the American Board of Internal Medicine from 2017 to 2019

Barnhart BJ, Reddy SG, and Arnold GK. Remind me again: physician response to web surveys: the effect of email reminders across 11 opinion survey efforts at the American Board of Internal Medicine from 2017 to 2019. Evaluation & the Health Professions 2021;44:245–59

work page 2017
[40]

Physician confidence in artificial intelligence: an online mobile survey

Oh S, Kim JH, Choi SW, Lee HJ, Hong J, and Kwon SH. Physician confidence in artificial intelligence: an online mobile survey. Journal of medical Internet research 2019;21:e12422.\ 24

work page 2019
[41]

Design of an interface to communicate artificial intelligence-based prognosis for patients with advanced solid tumors: a user-centered approach

Staes, Catherine J., et al. "Design of an interface to communicate artificial intelligence-based prognosis for patients with advanced solid tumors: a user-centered approach." Journal of the American Medical Informatics Association 31.1 (2024): 174-187

work page 2024
[42]

To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI -assisted decision - making

Buçinca, Zana, Maja Barbara Malaya, and Krzysztof Z. Gajos. "To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI -assisted decision - making." Proceedings of the ACM on Human-Computer Interaction 5.CSCW1 (2021): 1-21

work page 2021
[43]

Machine learning in haematological malignancies

Radakovich, Nathan, Matthew Nagy, and Aziz Nazha. "Machine learning in haematological malignancies." The Lancet Haematology 7.7 (2020): e541-e550

work page 2020
[44]

Human–computer collaboration for skin cancer recognition

Tschandl, Philipp, et al. "Human–computer collaboration for skin cancer recognition." Nature Medicine 26.8 (2020): 1229-1234. 25 Appendix Appendix Figure 1: Clinical decision support system created for the study Participants used a web-based clinical decision support system (CDSS) created for the study called the “Multiple Myeloma Decision Support Tool” (...

work page 2020
[45]

How do you think about data from RCTs when your patient does not meet inclusion criteria?

How do you interpret the data from this RCT? If participant mentions that their patient does not meet inclusion criteria, 1a. How do you think about data from RCTs when your patient does not meet inclusion criteria?

work page
[48]

Why are you choosing that confidence level? For Tiers 2 and 3 (ML data)

work page
[49]

How do you think about data from ML models when your patient does not meet inclusion criteria?

How do you interpret the data from this ML model? If participant mentions that their patient does not meet inclusion criteria, 2a. How do you think about data from ML models when your patient does not meet inclusion criteria?

work page
[50]

What factors are you weighing when choosing a treatment option?

work page
[51]

How are you weighing the side effects?

work page
[52]

Why are you choosing that confidence level?

work page
[53]

Why are you choosing that level of perceived reliability? Finally:

work page
[54]

How do you compare the RCT results to the ML results? 32

work page
[55]

red pill

We found that the majority of participants choose to switch to the blue pill after seeing the ML data and context (show them Scenario K results document). Why do you think that is? Helpful probes: • Can you talk more about that? • Help me understand what you mean. • Can you give an example? 33 Appendix Figure 6: Experimental results for all scenarios Full...

work page 2007

[1] [1]

A review of cancer immunotherapy: from the past, to the present, to the future

Esfahani K, Roudaia L, Buhlaiga N, Del Rincon S, Papneja N, and Miller W. A review of cancer immunotherapy: from the past, to the present, to the future. Current Oncology 2020;27:87–97

work page 2020

[2] [2]

CAR-T cell therapy: current limitations and potential strategies

Sterner RC and Sterner RM. CAR-T cell therapy: current limitations and potential strategies. Blood cancer journal 2021;11:69

work page 2021

[3] [3]

Bispecific antibodies: from research to clinical application

Ma J, Mo Y , Tang M, et al. Bispecific antibodies: from research to clinical application. Frontiers in Immunology 2021:1555

work page 2021

[4] [4]

The landmark series: gallbladder cancer

Gamboa AC and Maithel SK. The landmark series: gallbladder cancer. Annals of Surgical Oncology 2020;27:2846–58

work page 2020

[5] [5]

The landmark series: axillary management in breast cancer

Fisher CS, Margenthaler JA, Hunt KK, and Schwartz T. The landmark series: axillary management in breast cancer. Annals of surgical oncology 2020;27:724–9

work page 2020

[6] [6]

Multiple myeloma, version 3.2017, NCCN clinical practice guidelines in oncology

Kumar SK, Callander NS, Alsina M, et al. Multiple myeloma, version 3.2017, NCCN clinical practice guidelines in oncology. Journal of the National Comprehensive Cancer Network 2017;15:230–69

work page 2017

[7] [7]

Continued improvement in survival in multiple myeloma: changes in early mortality and outcomes in older patients

Kumar SK, Dispenzieri A, Lacy MQ, et al. Continued improvement in survival in multiple myeloma: changes in early mortality and outcomes in older patients. Leukemia 2014;28:1122–8. 20

work page 2014

[8] [8]

Durie BG, Hoering A, Abidi MH, et al. Bortezomib with lenalidomide and dexamethasone versus lenalidomide and dexamethasone alone in patients with newly diagnosed myeloma without intent for immediate autologous stem-cell transplant (SWOG S0777): a randomized, open-label, phase 3 trial. The Lancet 2017;389:519–27

work page 2017

[9] [9]

A prospective, randomized trial of autologous bone marrow transplantation and chemotherapy in multiple myeloma

Attal M, Harousseau JL, Stoppa AM, et al. A prospective, randomized trial of autologous bone marrow transplantation and chemotherapy in multiple myeloma. New England Journal of Medicine 1996;335:91–7

work page 1996

[10] [10]

High -dose chemotherapy with hematopoietic stem - cell rescue for multiple myeloma

Child JA, Morgan GJ, Davies FE, et al. High -dose chemotherapy with hematopoietic stem - cell rescue for multiple myeloma. New England Journal of Medicine 2003;348:1875–83

work page 2003

[11] [11]

Lenalidomide, bortezomib, and dexamethasone with transplantation for myeloma

Attal M, Lauwers-Cances V , Hulin C, et al. Lenalidomide, bortezomib, and dexamethasone with transplantation for myeloma. New England Journal of Medicine 2017;376:1311–20

work page 2017

[12] [12]

Multiple myeloma: 2022 update on diagnosis, risk stratification, and management

Rajkumar SV . Multiple myeloma: 2022 update on diagnosis, risk stratification, and management. American journal of hematology 2022;97:1086–107

work page 2022

[13] [13]

Management of patients with multiple myeloma beyond the clinical-trial setting: understanding the balance between efficacy, safety and tolerability, and quality of life

Terpos E, Mikhael J, Hajek R, et al. Management of patients with multiple myeloma beyond the clinical-trial setting: understanding the balance between efficacy, safety and tolerability, and quality of life. Blood cancer journal 2021;11:40

work page 2021

[14] [14]

Oncology (cancer)/hematologic malignancies approval notifications

FDA U et al. Oncology (cancer)/hematologic malignancies approval notifications. 2021

work page 2021

[15] [15]

The levels of evidence and their role in evidence - based medicine

Burns PB, Rohrich RJ, and Chung KC. The levels of evidence and their role in evidence - based medicine. Plastic and reconstructive surgery 2011;128:305

work page 2011

[16] [16]

Machine learning and deep learning applications in multiple myeloma diagnosis, prognosis, and treatment selection

Allegra A, Tonacci A, Sciaccotta R, et al. Machine learning and deep learning applications in multiple myeloma diagnosis, prognosis, and treatment selection. Cancers 2022;14:606. 21

work page 2022

[17] [17]

Gut microbiome, big data and machine learning to promote precision medicine for cancer

Cammarota G, Ianiro G, Ahern A, et al. Gut microbiome, big data and machine learning to promote precision medicine for cancer. Nature reviews gastroenterology & hepatology 2020;17:635– 48

work page 2020

[18] [18]

New machine learning applications to accelerate personalized medicine in breast cancer: rise of the support vector machines

Ozer ME, Sarica PO, and Arga KY . New machine learning applications to accelerate personalized medicine in breast cancer: rise of the support vector machines. Omics: a journal of integrative biology 2020;24:241–6

work page 2020

[19] [19]

Learning for personalized medicine: a comprehensive review from a deep learning perspective

Zhang S, Bamakan SMH, Qu Q, and Li S. Learning for personalized medicine: a comprehensive review from a deep learning perspective. IEEE reviews in biomedical engineering 2018;12:194– 208

work page 2018

[20] [20]

Machine learning based personalized drug response prediction for lung cancer patients

Qureshi R, Basit SA, Shamsi JA, et al. Machine learning based personalized drug response prediction for lung cancer patients. Scientific Reports 2022;12:18935

work page 2022

[21] [21]

High-performance medicine: the convergence of human and artificial intelligence

Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nature medicine 2019;25:44–56

work page 2019

[22] [22]

Emani S, Rui A, Rocha HAL, et al. Physicians’ Perceptions of and Satisfaction With Artificial Intelligence in Cancer Treatment: A Clinical Decision Support System Experience and Implications for Low-Middle–Income Countries. JMIR cancer 2022;8:e31461

work page 2022

[23] [23]

Clinician perspectives on machine learning prognostic algorithms in the routine care of patients with cancer: a qualitative study

Parikh RB, Manz CR, Nelson MN, et al. Clinician perspectives on machine learning prognostic algorithms in the routine care of patients with cancer: a qualitative study. Supportive Care in Cancer 2022;30:4363–72

work page 2022

[24] [24]

A survey of clinicians on the use of artificial intelligence in ophthalmology, dermatology, radiology and radiation oncology

Scheetz J, Rothschild P, McGuinness M, et al. A survey of clinicians on the use of artificial intelligence in ophthalmology, dermatology, radiology and radiation oncology. Scientific reports 2021;11:1–10. 22

work page 2021

[25] [25]

How machinelearning recommendations influence clinician treatment selections: the example of antidepressant selection

Jacobs M, Pradier MF, McCoy Jr TH, Perlis RH, Doshi -Velez F, and Gajos KZ. How machinelearning recommendations influence clinician treatment selections: the example of antidepressant selection. Translational psychiatry 2021;11:108

work page 2021

[26] [26]

Do as AI say: susceptibility in deployment of clinical decision-aids

Gaube S, Suresh H, Raue M, et al. Do as AI say: susceptibility in deployment of clinical decision-aids. NPJ digital medicine 2021;4:31

work page 2021

[27] [27]

Mitigating the impact of biased artificial intelligence in emergency decision -making

Adam H, Balagopalan A, Alsentzer E, Christia F, and Ghassemi M. Mitigating the impact of biased artificial intelligence in emergency decision -making. Communications Medicine 2022;2:149

work page 2022

[28] [28]

Using thematic analysis in psychology

Braun V and Clarke V . Using thematic analysis in psychology. Qualitative research in psychology 2006;3:77–101

work page 2006

[29] [29]

Increased survival time or better quality of life? Tradeoff between benefits and adverse events in the systemic treatment of cancer

Valentı V , Ramos J, Pérez C, et al. Increased survival time or better quality of life? Tradeoff between benefits and adverse events in the systemic treatment of cancer. Clinical and Translational Oncology 2020;22:935–42

work page 2020

[30] [30]

Key challenges for delivering clinical impact with artificial intelligence

Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, and King D. Key challenges for delivering clinical impact with artificial intelligence. BMC medicine 2019;17:1–9

work page 2019

[31] [31]

Randomized Controlled Trials of Artificial Intelligence in Clinical Practice: Systematic Review

Lam TY , Cheung MF, Munro YL, Lim KM, Shung D, and Sung JJ. Randomized Controlled Trials of Artificial Intelligence in Clinical Practice: Systematic Review. Journal of Medical Internet Research 2022;24:e37188

work page 2022

[32] [32]

Randomized clinical trials of machine learning interventions in health care: a systematic review

Plana D, Shung DL, Grimshaw AA, Saraf A, Sung JJ, and Kann BH. Randomized clinical trials of machine learning interventions in health care: a systematic review. JAMA Network Open 2022;5:e2233946–e2233946. 23

work page 2022

[33] [33]

Artificial intelligence for health professions educators

Lomis K, Jeffries P, Palatta A, et al. Artificial intelligence for health professions educators. NAM perspectives 2021;2021

work page 2021

[34] [34]

Falsification before Extrapolation in Causal Effect Estimation

Hussain Z, Oberst M, Shih MC, and Sontag D. Falsification before Extrapolation in Causal Effect Estimation. Arxiv preprint arXiv:2209.13708 2022

work page arXiv 2022

[35] [35]

Falsification of Internal and External Validity in Observational Studies via Conditional Moment Restrictions

Hussain Z, Shih MC, Oberst M, Demirel I, and Sontag D. Falsification of Internal and External Validity in Observational Studies via Conditional Moment Restrictions. In: International Conference on Artificial Intelligence and Statistics. PMLR. 2023:5869–98

work page 2023

[36] [36]

Towards A Rigorous Science of Interpretable Machine Learning

Doshi-Velez F and Kim B. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[37] [37]

Manipulating and measuring model interpretability

Poursabzi-Sangdeh F, Goldstein DG, Hofman JM, Wortman Vaughan JW, and Wallach H. Manipulating and measuring model interpretability. In: Proceedings of the 2021 CHI conference on human factors in computing systems. 2021:1–52

work page 2021

[38] [38]

The road to explainability is paved with bias: Measuring the fairness of explanations

Balagopalan A, Zhang H, Hamidieh K, Hartvigsen T, Rudzicz F, and Ghassemi M. The road to explainability is paved with bias: Measuring the fairness of explanations. In: 2022 ACM Conference on Fairness, Accountability, and Transparency. 2022:1194–206

work page 2022

[39] [39]

Remind me again: physician response to web surveys: the effect of email reminders across 11 opinion survey efforts at the American Board of Internal Medicine from 2017 to 2019

Barnhart BJ, Reddy SG, and Arnold GK. Remind me again: physician response to web surveys: the effect of email reminders across 11 opinion survey efforts at the American Board of Internal Medicine from 2017 to 2019. Evaluation & the Health Professions 2021;44:245–59

work page 2017

[40] [40]

Physician confidence in artificial intelligence: an online mobile survey

Oh S, Kim JH, Choi SW, Lee HJ, Hong J, and Kwon SH. Physician confidence in artificial intelligence: an online mobile survey. Journal of medical Internet research 2019;21:e12422.\ 24

work page 2019

[41] [41]

Design of an interface to communicate artificial intelligence-based prognosis for patients with advanced solid tumors: a user-centered approach

Staes, Catherine J., et al. "Design of an interface to communicate artificial intelligence-based prognosis for patients with advanced solid tumors: a user-centered approach." Journal of the American Medical Informatics Association 31.1 (2024): 174-187

work page 2024

[42] [42]

To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI -assisted decision - making

Buçinca, Zana, Maja Barbara Malaya, and Krzysztof Z. Gajos. "To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI -assisted decision - making." Proceedings of the ACM on Human-Computer Interaction 5.CSCW1 (2021): 1-21

work page 2021

[43] [43]

Machine learning in haematological malignancies

Radakovich, Nathan, Matthew Nagy, and Aziz Nazha. "Machine learning in haematological malignancies." The Lancet Haematology 7.7 (2020): e541-e550

work page 2020

[44] [44]

Human–computer collaboration for skin cancer recognition

Tschandl, Philipp, et al. "Human–computer collaboration for skin cancer recognition." Nature Medicine 26.8 (2020): 1229-1234. 25 Appendix Appendix Figure 1: Clinical decision support system created for the study Participants used a web-based clinical decision support system (CDSS) created for the study called the “Multiple Myeloma Decision Support Tool” (...

work page 2020

[45] [45]

How do you think about data from RCTs when your patient does not meet inclusion criteria?

How do you interpret the data from this RCT? If participant mentions that their patient does not meet inclusion criteria, 1a. How do you think about data from RCTs when your patient does not meet inclusion criteria?

work page

[46] [48]

Why are you choosing that confidence level? For Tiers 2 and 3 (ML data)

work page

[47] [49]

How do you think about data from ML models when your patient does not meet inclusion criteria?

How do you interpret the data from this ML model? If participant mentions that their patient does not meet inclusion criteria, 2a. How do you think about data from ML models when your patient does not meet inclusion criteria?

work page

[48] [50]

What factors are you weighing when choosing a treatment option?

work page

[49] [51]

How are you weighing the side effects?

work page

[50] [52]

Why are you choosing that confidence level?

work page

[51] [53]

Why are you choosing that level of perceived reliability? Finally:

work page

[52] [54]

How do you compare the RCT results to the ML results? 32

work page

[53] [55]

red pill

We found that the majority of participants choose to switch to the blue pill after seeing the ML data and context (show them Scenario K results document). Why do you think that is? Helpful probes: • Can you talk more about that? • Help me understand what you mean. • Can you give an example? 33 Appendix Figure 6: Experimental results for all scenarios Full...

work page 2007