
arxiv: 2506.23040 · v5 · submitted 2025-06-29 · 📊 stat.OT · cs.AI

Treatment, evidence, imitation, and chat

Pith reviewed 2026-05-19 08:18 UTC · model grok-4.3

classification 📊 stat.OT cs.AI
keywords large language models · medical decision making · evidence-based medicine · treatment problem · imitation learning · observational data · statins · randomized trials

The pith

Imitation from chat data cannot solve the core medical treatment problem that clinicians and patients must address together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper distinguishes the treatment problem, which requires evidence-based choices about interventions like statins, from the chat problem of generating conversational responses. It shows that imitation of expert behavior can support useful interactions but leaves the actual decision-making task unsolved because it bypasses the need for experimental and observational evidence. Training large language models on treatment decisions therefore runs into barriers around running ethical experiments and making defensible assumptions from observational records. The discussion ties these limits back to longstanding practices in evidence-based medicine and suggests how the medical research community might adapt its methods.

Core claim

Solving the treatment problem demands integration of randomized experimental data and carefully interpreted observational data rather than imitation alone; an LLM-based system can participate in that process but only after the ethical and evidentiary challenges of obtaining suitable training signals are resolved.

What carries the argument

The contrast between the treatment problem (evidence-driven collaborative decision making) and the chat problem (imitation of conversational responses), with statins used to illustrate the evidentiary requirements.
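
The contrast can be made concrete with a minimal Python sketch. Everything in it is hypothetical and invented for illustration, not taken from the paper or from clinical evidence: the logged decisions in `clinician_log`, the event rates in `trial_event_rate`, and the helper names. An imitation policy copies the most frequent logged action, while an evidence-based policy picks the action with the lowest event rate in (assumed) randomized data, and the two can disagree.

```python
from collections import Counter

# Hypothetical log of clinician decisions: (risk_group, action).
# Invented for illustration only.
clinician_log = [
    ("high_risk", "statin"), ("high_risk", "statin"), ("high_risk", "no_statin"),
    ("low_risk", "statin"), ("low_risk", "statin"), ("low_risk", "no_statin"),
]

# Hypothetical adverse-event rates from an assumed randomized trial,
# keyed by (risk_group, action). Lower is better. Not real clinical data.
trial_event_rate = {
    ("high_risk", "statin"): 0.10,
    ("high_risk", "no_statin"): 0.15,
    ("low_risk", "statin"): 0.04,
    ("low_risk", "no_statin"): 0.03,
}

ACTIONS = ("statin", "no_statin")


def imitation_policy(group):
    """Chat-style imitation: return the action clinicians chose most often."""
    counts = Counter(action for g, action in clinician_log if g == group)
    return counts.most_common(1)[0][0]


def evidence_policy(group):
    """Treatment-style choice: return the action with the lowest trial event rate."""
    return min(ACTIONS, key=lambda action: trial_event_rate[(group, action)])


for group in ("high_risk", "low_risk"):
    print(group, imitation_policy(group), evidence_policy(group))
```

With these made-up numbers the two policies agree for the high-risk group and disagree for the low-risk group, which is the sense in which imitating behavior does not by itself answer the treatment question.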

If this is right

  • Experimental data from randomized trials remains indispensable for validating treatment choices even when language models participate.
  • Observational data can fill gaps but requires explicit handling of confounding and selection assumptions.
  • Imitation-trained chat capabilities may improve communication around decisions without replacing the evidence base.
  • Regulatory and ethical frameworks for medical AI will need to address how training data for treatment decisions is obtained.
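
The point above about confounding in observational data can be illustrated with a small Python sketch. The cohort counts are hypothetical and chosen for illustration, and the function names are not from the paper. The sketch compares a naive treated-versus-untreated difference with an inverse-probability-weighted (Horvitz-Thompson style) estimate, which is only valid under the assumptions that the single recorded risk indicator captures all confounding and that every stratum contains both treated and untreated patients.

```python
# Aggregated hypothetical cohort: (high_risk, treated, n_patients, n_events).
# High-risk patients are treated more often, so risk confounds the comparison.
cohort = [
    (1, 1, 800, 200),
    (1, 0, 200, 60),
    (0, 1, 200, 10),
    (0, 0, 800, 80),
]


def naive_difference(rows):
    """Event rate among treated minus event rate among untreated, ignoring risk."""
    def rate(treated):
        n = sum(r[2] for r in rows if r[1] == treated)
        events = sum(r[3] for r in rows if r[1] == treated)
        return events / n
    return rate(1) - rate(0)


def ipw_difference(rows):
    """Inverse-probability-weighted risk difference (treated minus untreated).

    Assumes no unmeasured confounding beyond the high_risk indicator.
    """
    total = sum(r[2] for r in rows)
    stratum_n = {}
    for high_risk, _, n, _ in rows:
        stratum_n[high_risk] = stratum_n.get(high_risk, 0) + n

    weighted_events = {0: 0.0, 1: 0.0}
    for high_risk, treated, n, events in rows:
        # Estimated probability of the received treatment within the risk stratum.
        propensity = n / stratum_n[high_risk]
        weighted_events[treated] += events / propensity
    return weighted_events[1] / total - weighted_events[0] / total


print(round(naive_difference(cohort), 3))  # 0.07: treatment looks harmful
print(round(ipw_difference(cohort), 3))    # -0.05: treatment looks beneficial
```

In this constructed example the naive comparison has the wrong sign because sicker patients are more likely to be treated. The weighted estimate recovers the within-stratum effect, but only because the confounder was recorded, which is exactly the kind of assumption the list item says must be made explicit.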

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar distinctions between imitation and evidence may apply to other high-stakes domains such as legal or financial advice.
  • One testable extension is whether hybrid training that injects trial results into language-model fine-tuning measurably improves downstream patient outcomes.
  • The argument implies that purely observational or chat-derived systems risk systematic bias unless paired with experimental benchmarks.

Load-bearing premise

Ethical experiments and defensible observational assumptions can be secured to generate the data needed for training systems on real treatment decisions.

What would settle it

A controlled study in which patients whose decisions are guided by an imitation-only model achieve the same or better health outcomes than those guided by current evidence-based protocols would undermine the central claim.

Original abstract

Large language models are thought to have the potential to aid in medical decision making. This work investigates the degree to which this might be the case. We start with the treatment problem, the patient's core medical decision-making task, which is solved in collaboration with a clinician. We discuss different approaches to solving it, including, within evidence-based medicine, experimental and observational data. We then discuss the chat problem, and how this differs from the treatment problem -- in particular with respect to imitation (and how imitation alone cannot solve the true treatment problem, although this does not mean it is not useful). We then discuss how a large-language-model-based system might be trained to solve the treatment problem, highlighting that the major challenges relate to the ethics of experimentation and the assumptions associated with observation. We finally discuss how these challenges relate to evidence-based medicine and how this might inform the efforts of the medical research community to solve the treatment problem. Throughout, we illustrate our arguments with the cholesterol medications, statins.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that large language models have potential to aid medical decision making but must distinguish the treatment problem, which requires causal evidence from experimental or observational studies under ethical constraints, from the chat problem based on imitation. Using the statins example, it argues that imitation alone cannot solve the treatment problem, though it can be useful, and identifies ethics of experimentation and observational assumptions as major challenges for training such systems, relating this to evidence-based medicine.

Significance. The result, if it holds, is significant in that it provides a clear conceptual separation between imitation-driven chat and evidence-based treatment in the context of AI for medicine. The concrete statins illustration deserves credit for making the ethical and observational issues tangible. This framework could help steer research away from over-reliance on pure imitation learning for high-stakes causal decisions.

minor comments (2)
  1. [Abstract] The abstract is concise but could include a sentence on the statins example to better prepare the reader for the full argument.
  2. [Discussion of training LLMs] The challenges are well-described qualitatively; adding a reference to specific methods in causal inference, such as those handling observational data biases, would enhance clarity without altering the conceptual nature.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of our manuscript and the recommendation for minor revision. We are pleased that the conceptual framework distinguishing the treatment problem from the chat problem, and the use of the statins example to illustrate ethical and observational challenges, was viewed as significant. We respond to the referee's summary of the paper below.

Point-by-point responses
  1. Referee: The manuscript claims that large language models have potential to aid medical decision making but must distinguish the treatment problem, which requires causal evidence from experimental or observational studies under ethical constraints, from the chat problem based on imitation. Using the statins example, it argues that imitation alone cannot solve the treatment problem, though it can be useful, and identifies ethics of experimentation and observational assumptions as major challenges for training such systems, relating this to evidence-based medicine.

    Authors: We appreciate this concise summary of our work, which accurately reflects the main points we sought to make. We agree that the distinction is important for guiding research in AI for medicine away from over-reliance on imitation for causal decisions.
    Revision: no

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

This is a conceptual discussion paper with no mathematical derivations, equations, or fitted parameters. It distinguishes the treatment problem (requiring causal evidence under ethical/observational constraints) from the chat/imitation problem using standard principles of evidence-based medicine and causal inference, illustrated via the statins example. All load-bearing claims rest on externally established distinctions rather than self-referential definitions, self-citations, or reductions to inputs by construction. The paper is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on established domain assumptions from evidence-based medicine without introducing new free parameters or invented entities.

axioms (2)
  • domain assumption: The treatment problem requires evidence from experimental or observational data rather than imitation alone.
    Central distinction drawn in the discussion of solving the treatment problem.
  • domain assumption: Ethics of experimentation and assumptions in observational data are the primary barriers to training LLM-based treatment systems.
    Highlighted as the major challenges in the final sections of the abstract.

pith-pipeline@v0.9.0 · 5690 in / 1278 out tokens · 51140 ms · 2026-05-19T08:18:11.443767+00:00 · methodology


