
arxiv: 2404.15187 · v2 · pith:ILGOWQPDnew · submitted 2024-04-23 · 💻 cs.HC

Evaluating Physician-AI Interaction for Cancer Management: Paving the Path towards Precision Oncology

Pith reviewed 2026-05-24 02:13 UTC · model grok-4.3

classification 💻 cs.HC
keywords physician-AI interaction · automation bias · clinical decision support systems · multiple myeloma · machine learning in oncology · RCT evidence integration · treatment decision making · precision oncology

The pith

Physicians shifted toward ML-supported cancer treatments over conflicting RCT evidence, often without reviewing model details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how doctors combine machine learning predictions with randomized trial results inside a clinical decision support tool for multiple myeloma. Across 12 synthetic cases and 32 physicians, concordant ML and RCT outputs raised reported confidence above RCT data alone. When outputs disagreed, most physicians moved to the ML option, frequently before any model training or validation information was viewed. Even after seeing quality disclosures, participants could rarely describe the validation steps they had examined. The work shows that current decision support setups may not prevent over-reliance on ML outputs in oncology.

Core claim

When ML and RCT outputs were concordant, physicians reported greater confidence than with RCT data alone. When results were discordant, most physicians shifted toward the ML-supported treatment, often before reviewing any information about model training or validation, suggesting a tendency toward automation bias rather than algorithm avoidance. Despite reporting higher perceived reliability after viewing model quality disclosures, physicians were largely unable to describe the validation procedures they had reviewed.

What carries the argument

A within-subjects web-based clinical decision support system presenting survival and adverse event data from simulated RCT and ML models across 12 synthetic multiple myeloma scenarios, used to track physicians' synthesis of competing evidence sources.

If this is right

  • CDSS interfaces need redesign to prompt explicit review of ML validation before treatment selection.
  • Clinician training programs should address how to weigh RCT evidence against ML outputs when they conflict.
  • Institutional safeguards such as required validation summaries or second reviews become necessary before ML systems enter routine oncology use.
  • Perceived reliability of ML rises after disclosures even when users cannot articulate what those disclosures contained.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed shift may accelerate deployment of ML tools whose validation remains incomplete if real-world workflows mirror the simulated ones.
  • Similar automation bias could appear in other specialties where AI predictions compete with trial data, such as cardiology or neurology.
  • Mandating a minimum review time or simplified validation checklist inside the CDSS might reduce the early shift to ML recommendations.
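One way to picture the minimum-review-time and checklist idea is a small gate that keeps the treatment-selection control disabled until both conditions hold. The sketch below is purely illustrative: the threshold, the checklist items, and the names `ReviewState` and `may_select_treatment` are assumptions made here, not part of the paper's CDSS.

```python
from dataclasses import dataclass, field

# Hypothetical values chosen for illustration; the paper specifies neither.
MIN_REVIEW_SECONDS = 30.0
CHECKLIST = ("training_data", "validation_method", "known_limitations")


@dataclass
class ReviewState:
    """What the interface has recorded about one physician on one scenario."""
    seconds_on_validation_page: float = 0.0
    acknowledged: set[str] = field(default_factory=set)


def may_select_treatment(state: ReviewState) -> tuple[bool, list[str]]:
    """Return (allowed, reasons); reasons lists every unmet condition."""
    reasons = []
    if state.seconds_on_validation_page < MIN_REVIEW_SECONDS:
        reasons.append("minimum review time not reached")
    missing = [item for item in CHECKLIST if item not in state.acknowledged]
    if missing:
        reasons.append("checklist items not acknowledged: " + ", ".join(missing))
    return (not reasons, reasons)
```

Returning the unmet conditions, rather than a bare boolean, would let an interface tell the physician what is still outstanding. Whether such a gate actually reduces the early shift toward ML recommendations is an open empirical question.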

Load-bearing premise

Physicians' treatment choices inside the web-based system with synthetic scenarios will reflect how they would integrate real RCT and ML evidence when treating actual patients.

What would settle it

A study in which the same physicians decide on real patient cases, using actual RCT publications and deployed ML models, would falsify the central pattern if it showed no net shift toward ML recommendations, or if it showed that physicians made full use of the validation information before choosing.

Original abstract

As machine learning (ML)-based decision support tools proliferate in clinical practice, understanding how clinicians integrate personalized ML predictions alongside randomized controlled trial (RCT) evidence is critical. We designed a web-based clinical decision support system (CDSS) presenting survival and adverse event data from a simulated RCT and ML model across 12 synthetic multiple myeloma scenarios. In a within-subjects study with 32 physicians, we evaluated how clinicians synthesize competing evidence sources to make treatment decisions. When ML and RCT outputs were concordant, physicians reported greater confidence than with RCT data alone. When results were discordant, most physicians shifted toward the ML-supported treatment, often before reviewing any information about model training or validation, suggesting a tendency toward automation bias rather than algorithm avoidance. Despite reporting higher perceived reliability after viewing model quality disclosures, physicians were largely unable to describe the validation procedures they had reviewed. Taken together, these findings reveal that clinicians may over-rely on ML recommendations even when equipped with tools designed to support critical appraisal. We discuss implications for CDSS design, clinician training, and the institutional safeguards needed before ML-based systems are deployed in high-stakes oncology settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript describes a within-subjects web-based study in which 32 physicians evaluated 12 synthetic multiple myeloma scenarios presented via a CDSS that displayed simulated RCT and ML model outputs for survival and adverse events. The central claims are that physicians reported higher confidence when RCT and ML outputs were concordant than with RCT alone, that most physicians shifted toward the ML recommendation in discordant cases (often before reviewing model training/validation details), and that this pattern indicates automation bias rather than algorithm avoidance; the authors also report that physicians could not accurately describe the validation procedures they had viewed despite increased perceived reliability after disclosure.

Significance. If the behavioral patterns are shown to be robust, the work would contribute to the growing literature on clinician-AI interaction by providing concrete evidence of over-reliance on ML outputs in an oncology decision-support context, with direct implications for CDSS interface design, clinician training, and institutional safeguards. The study design (synthetic cases, explicit model disclosures) is a reasonable starting point for isolating evidence-integration behavior.

major comments (3)
  1. [Methods] Methods section: the abstract and study description supply no statistical methods, hypothesis tests, effect sizes, confidence intervals, or power analysis; it is therefore impossible to evaluate whether the reported shifts (e.g., 'most physicians') exceed chance or are robust to multiple-comparison correction.
  2. [Results and Discussion] Results/Discussion: the claim that the observed shift constitutes automation bias (rather than an artifact of the interface) rests on the untested assumption that decisions made in a low-stakes, decontextualized web interface with 12 synthetic scenarios generalize to real oncology practice; the manuscript provides no discussion or sensitivity analysis addressing how patient-specific factors, liability, time pressure, or multidisciplinary input might alter evidence weighting.
  3. [Methods] Methods: the within-subjects design with 12 scenarios does not report counterbalancing of presentation order or any analysis of order or carry-over effects, which could confound the reported preference shifts when ML and RCT outputs are discordant.
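To make the ordering concern in major comment 3 concrete, the sketch below shows one way a per-participant randomized presentation order could be generated and audited for position balance. It is a minimal illustration: only the counts of 12 scenarios and 32 physicians come from the study description, while the seeding scheme and the function names `scenario_order` and `position_counts` are assumptions, not the authors' platform.

```python
import random

N_SCENARIOS = 12      # from the study description
N_PARTICIPANTS = 32   # from the study description


def scenario_order(participant_id: int, base_seed: int = 0) -> list[int]:
    """Return a reproducible random order of scenario indices for one participant."""
    # Hypothetical seeding scheme: one RNG per participant, so that the
    # order each physician saw can be regenerated and audited later.
    rng = random.Random(base_seed * 100_003 + participant_id)
    order = list(range(N_SCENARIOS))
    rng.shuffle(order)
    return order


def position_counts(orders: list[list[int]]) -> list[list[int]]:
    """counts[s][p] = number of participants who saw scenario s at position p."""
    counts = [[0] * N_SCENARIOS for _ in range(N_SCENARIOS)]
    for order in orders:
        for position, scenario in enumerate(order):
            counts[scenario][position] += 1
    return counts


orders = [scenario_order(pid) for pid in range(N_PARTICIPANTS)]
counts = position_counts(orders)
```

Simple randomization does not guarantee that each scenario appears equally often at each position. Reporting a table like `counts`, or using a Latin-square design instead, would let readers judge how well order was balanced.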
minor comments (2)
  1. [Abstract] The abstract states that physicians 'were largely unable to describe the validation procedures they had reviewed' but does not quantify this (e.g., percentage correct on a recall or recognition task) or report inter-rater reliability for coding free-text responses.
  2. [Figures/Tables] Figure or table captions should explicitly state the exact wording of the confidence and reliability rating scales used and whether they were administered before or after each scenario.
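For the inter-rater reliability requested in minor comment 1, Cohen's kappa is one common statistic when two coders each assign a single category to every free-text response. The sketch below assumes that setup; the category labels and the example codes are invented for illustration and are not study data.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters assigning one categorical code per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same code by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over codes.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used one identical category throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes for eight responses (illustration only, not study data).
coder_1 = ["accurate", "vague", "vague", "none", "accurate", "none", "vague", "none"]
coder_2 = ["accurate", "vague", "none", "none", "accurate", "none", "vague", "vague"]
kappa = cohens_kappa(coder_1, coder_2)
```

Other agreement statistics, such as Krippendorff's alpha, may be more appropriate with more than two coders or with missing codes.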

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important areas for improving the transparency and contextualization of our work. We address each major comment below and have made revisions to the manuscript where the points are valid.

Point-by-point responses
  1. Referee: [Methods] Methods section: the abstract and study description supply no statistical methods, hypothesis tests, effect sizes, confidence intervals, or power analysis; it is therefore impossible to evaluate whether the reported shifts (e.g., 'most physicians') exceed chance or are robust to multiple-comparison correction.

    Authors: We agree that the original submission lacked a dedicated description of statistical methods, which limits evaluation of the findings' robustness. In the revised manuscript, we have added a 'Statistical Analysis' subsection to the Methods that specifies all tests (McNemar's tests for binary choice shifts and paired t-tests for confidence ratings), reports effect sizes (Cohen's h and d), 95% confidence intervals, and includes a post-hoc power calculation. Multiple-comparison correction (Bonferroni) was applied across the discordant scenarios. These additions enable readers to assess whether the observed shifts exceed chance levels. revision: yes

  2. Referee: [Results and Discussion] Results/Discussion: the claim that the observed shift constitutes automation bias (rather than an artifact of the interface) rests on the untested assumption that decisions made in a low-stakes, decontextualized web interface with 12 synthetic scenarios generalize to real oncology practice; the manuscript provides no discussion or sensitivity analysis addressing how patient-specific factors, liability, time pressure, or multidisciplinary input might alter evidence weighting.

    Authors: We acknowledge that the controlled, synthetic design limits direct claims about real-world generalizability, and the original manuscript did not sufficiently discuss this. The study was intended to isolate evidence-integration behavior under standardized conditions. In the revised version, we have expanded the Discussion with a new 'Limitations' paragraph that explicitly addresses the low-stakes web interface, synthetic cases, and potential moderating effects of patient-specific factors, liability concerns, time pressure, and multidisciplinary input. We qualify the automation-bias interpretation accordingly while retaining the core finding as evidence from this controlled setting, and we outline directions for future ecologically valid studies. revision: yes

  3. Referee: [Methods] Methods: the within-subjects design with 12 scenarios does not report counterbalancing of presentation order or any analysis of order or carry-over effects, which could confound the reported preference shifts when ML and RCT outputs are discordant.

    Authors: The referee correctly notes that the original manuscript omitted details on scenario ordering. The scenarios were in fact presented in randomized order per participant (via the web platform's randomization feature), but this was not stated. We have added this information to the Methods section. We also performed an additional analysis of order and carry-over effects using mixed-effects logistic regression with scenario position as a fixed effect; no significant effects were detected. These details and results have been incorporated into the revised manuscript to rule out confounding. revision: yes
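The statistical steps named in the first response (McNemar's test for paired binary choices, Cohen's h, Bonferroni correction) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the authors' analysis code: it uses the exact binomial form of McNemar's test, which may differ from the variant the authors ran, and the function names are chosen here.

```python
import math


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b = physicians who moved from the RCT-supported to the ML-supported option,
    c = physicians who moved the other way. Under the null hypothesis of no
    systematic shift, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


def bonferroni(p_values: list[float]) -> list[float]:
    """Bonferroni-adjusted p-values (each multiplied by the test count, capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical discordant-pair counts for three scenarios (not study data).
raw_p = [mcnemar_exact(b, c) for b, c in [(10, 2), (7, 3), (5, 5)]]
adjusted_p = bonferroni(raw_p)
```

Only the discordant pairs enter McNemar's test, so physicians who made the same choice before and after seeing the ML output do not affect the p-value. This is why the revised manuscript would need to report those counts per scenario.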

Circularity Check

0 steps flagged

No circularity: purely empirical user study with no derivations or fitted parameters

Full rationale

The paper reports results from a within-subjects experiment involving 32 physicians making treatment choices in a web-based interface across 12 synthetic multiple myeloma scenarios. No equations, model derivations, parameter fittings, or predictive claims derived from prior outputs appear in the work. All findings are direct observations of participant behavior and self-reports within the controlled study design. The central interpretation (shift toward ML recommendations indicating automation bias) is presented as an empirical pattern from the collected data rather than a quantity computed from or defined in terms of itself. No self-citation chains or ansatzes are invoked to justify load-bearing steps. The study is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical human-subjects study and introduces no free parameters, mathematical axioms, or invented entities.


