Treatment, evidence, imitation, and chat

Samuel J. Weisenthal

arXiv: 2506.23040 · v5 · submitted 2025-06-29 · stat.OT · cs.AI


Pith reviewed 2026-05-19 08:18 UTC · model grok-4.3

keywords: large language models · medical decision making · evidence-based medicine · treatment problem · imitation learning · observational data · statins · randomized trials

The pith

Imitation from chat data cannot solve the core medical treatment problem that clinicians and patients must address together.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper distinguishes the treatment problem, which requires evidence-based choices about interventions like statins, from the chat problem of generating conversational responses. It argues that imitation of expert behavior can support useful interactions but leaves the actual decision-making task unsolved because it bypasses the need for experimental and observational evidence. Training large language models on treatment decisions therefore runs into two barriers: running ethical experiments and making defensible assumptions from observational records. The discussion ties these limits back to longstanding practices in evidence-based medicine and suggests how the medical research community might adapt its methods.

Core claim

Solving the treatment problem demands integration of randomized experimental data and carefully interpreted observational data rather than imitation alone; an LLM-based system can participate in that process but only after the ethical and evidentiary challenges of obtaining suitable training signals are resolved.

What carries the argument

The contrast between the treatment problem (evidence-driven collaborative decision making) and the chat problem (imitation of conversational responses), with statins used to illustrate the evidentiary requirements.
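The contrast can be made concrete with a toy sketch. Everything in it is an assumption for illustration: the two patient types, the utility table, and the clinician behavior frequencies are invented numbers, not values from the paper. The point is only that a policy fitted to what clinicians do never consults the utility table, whereas the treatment problem is defined by it. In practice the utility table is unknown and has to be estimated from experimental or observational evidence, which is where the paper's argument applies.

```python
# Hypothetical numbers for illustration only; none are taken from the paper.
# Actions: 0 = no statin, 1 = statin.
PATIENT_TYPES = ["low_risk", "high_risk"]
ACTIONS = [0, 1]

# Assumed expected utility UTILITY[x][a] of action a for patient type x.
UTILITY = {
    "low_risk": {0: 0.90, 1: 0.85},
    "high_risk": {0: 0.60, 1: 0.80},
}

# Assumed clinician behavior: probability of each action per patient type.
# In this invented setup clinicians under-treat high-risk patients.
CLINICIAN = {
    "low_risk": {0: 0.7, 1: 0.3},
    "high_risk": {0: 0.6, 1: 0.4},
}

# Assumed share of each patient type in the population.
WEIGHTS = {"low_risk": 0.5, "high_risk": 0.5}


def imitation_policy(behavior):
    """Choose the action clinicians pick most often; utility is never used."""
    return {x: max(ACTIONS, key=lambda a: behavior[x][a]) for x in PATIENT_TYPES}


def optimal_policy(utility):
    """Choose the action with the highest expected utility."""
    return {x: max(ACTIONS, key=lambda a: utility[x][a]) for x in PATIENT_TYPES}


def value(policy, utility, weights):
    """Population-average utility of a deterministic policy."""
    return sum(weights[x] * utility[x][policy[x]] for x in PATIENT_TYPES)


imitated = imitation_policy(CLINICIAN)
optimal = optimal_policy(UTILITY)
print("imitation:", imitated, round(value(imitated, UTILITY, WEIGHTS), 3))
print("optimal:  ", optimal, round(value(optimal, UTILITY, WEIGHTS), 3))
```

With these invented inputs the imitation policy withholds statins from high-risk patients because clinicians usually do, and its average utility (0.75) is lower than that of the utility-maximizing policy (0.85). Imitation is only as good as the behavior it copies.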

If this is right

Experimental data from randomized trials remains indispensable for validating treatment choices even when language models participate.
Observational data can fill gaps but requires explicit handling of confounding and selection assumptions.
Imitation-trained chat capabilities may improve communication around decisions without replacing the evidence base.
Regulatory and ethical frameworks for medical AI will need to address how training data for treatment decisions is obtained.
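The second consequence above, that observational data needs explicit handling of confounding, can be illustrated with a small sketch. The cohort below is invented: the counts, the two risk strata, and the event rates are assumptions chosen so that higher-risk patients are more likely to receive a statin. The adjustment used is inverse probability weighting, one standard technique and not necessarily the one the paper has in mind. It is valid here only under the assumption that risk stratum is the sole confounder.

```python
# Hypothetical aggregated cohort; all counts are invented for illustration.
# Each row: (risk stratum, treated with statin?, patients, cardiovascular events)
COHORT = [
    ("high", True, 800, 160),
    ("high", False, 200, 60),
    ("low", True, 200, 10),
    ("low", False, 800, 80),
]


def naive_risk_difference(rows):
    """Event rate among treated minus event rate among untreated, ignoring risk."""
    def rate(flag):
        patients = sum(r[2] for r in rows if r[1] == flag)
        events = sum(r[3] for r in rows if r[1] == flag)
        return events / patients
    return rate(True) - rate(False)


def propensity(rows):
    """Estimated probability of receiving a statin within each stratum."""
    out = {}
    for stratum in {r[0] for r in rows}:
        treated = sum(r[2] for r in rows if r[0] == stratum and r[1])
        total = sum(r[2] for r in rows if r[0] == stratum)
        out[stratum] = treated / total
    return out


def ipw_risk_difference(rows):
    """Horvitz-Thompson style estimate: weight events by 1 / P(arm | stratum).

    Assumes the stratum is the only confounder (no unmeasured confounders).
    """
    e = propensity(rows)
    total = sum(r[2] for r in rows)
    treated = sum(r[3] / e[r[0]] for r in rows if r[1]) / total
    untreated = sum(r[3] / (1 - e[r[0]]) for r in rows if not r[1]) / total
    return treated - untreated


print("naive:", round(naive_risk_difference(COHORT), 3))
print("ipw:  ", round(ipw_risk_difference(COHORT), 3))
```

In this invented cohort the naive comparison makes statins look harmful (risk difference of +0.03) because treated patients were sicker to begin with, while the weighted estimate recovers a protective effect (-0.075), matching the average of the within-stratum differences. If an important confounder were missing from the table, the weighted estimate would be biased too, and nothing in the data would flag it.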

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar distinctions between imitation and evidence may apply to other high-stakes domains such as legal or financial advice.
One testable extension is whether hybrid training that injects trial results into language-model fine-tuning measurably improves downstream patient outcomes.
The argument implies that purely observational or chat-derived systems risk systematic bias unless paired with experimental benchmarks.

Load-bearing premise

Ethical experiments and defensible observational assumptions can be secured to generate the data needed for training systems on real treatment decisions.

What would settle it

A controlled study in which patients whose decisions are guided by an imitation-only model achieve the same or better health outcomes than those guided by current evidence-based protocols would undermine the central claim.

Original abstract

Large language models are thought to have the potential to aid in medical decision making. This work investigates the degree to which this might be the case. We start with the treatment problem, the patient's core medical decision-making task, which is solved in collaboration with a clinician. We discuss different approaches to solving it, including, within evidence-based medicine, experimental and observational data. We then discuss the chat problem, and how this differs from the treatment problem -- in particular with respect to imitation (and how imitation alone cannot solve the true treatment problem, although this does not mean it is not useful). We then discuss how a large-language-model-based system might be trained to solve the treatment problem, highlighting that the major challenges relate to the ethics of experimentation and the assumptions associated with observation. We finally discuss how these challenges relate to evidence-based medicine and how this might inform the efforts of the medical research community to solve the treatment problem. Throughout, we illustrate our arguments with the cholesterol medications, statins.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper clearly separates the medical treatment problem from LLM chat imitation using standard causal ideas and the statins example, but adds no new models or data.

The main thing to know is that imitation from chat data cannot solve the actual treatment problem in medicine because treatment requires causal evidence under ethical and observational limits, while chat is mostly about copying patterns. The paper draws this line explicitly and connects it to evidence-based medicine, showing with statins how experiments are hard to run and observations carry assumptions that limit what you can learn for decision-making. This framing is straightforward and could help clarify why current LLMs are better suited to one task than the other.

It does a decent job of making the distinction accessible without overclaiming. The statins illustration keeps the argument concrete and shows the practical stakes around data collection. What is new is mostly the application of these existing distinctions to LLM-based medical systems rather than a technical advance.

The soft spots are that everything stays conceptual. There is no formal model, no quantitative analysis, and no test of the claims beyond the single example. The points about ethics of experimentation and observational assumptions are standard in the field, so the paper rests on logical separation instead of fresh evidence. This means the central claim holds up on its own terms but does not move beyond restating known issues in a new context.

The paper is for people working on medical AI design or regulation who want a clear way to think about why pure imitation falls short. A reader already familiar with causal inference and evidence-based medicine will not learn much new technically, but the separation could be useful for discussion. It deserves a serious referee because the topic is timely and the argument is coherent, even if revisions would likely focus on adding more substance or examples.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that large language models have potential to aid medical decision making but must distinguish the treatment problem, which requires causal evidence from experimental or observational studies under ethical constraints, from the chat problem based on imitation. Using the statins example, it argues that imitation alone cannot solve the treatment problem, though it can be useful, and identifies ethics of experimentation and observational assumptions as major challenges for training such systems, relating this to evidence-based medicine.

Significance. The result, if it holds, is significant in that it provides a clear conceptual separation between imitation-driven chat and evidence-based treatment in the context of AI for medicine. The concrete statins illustration deserves credit for making the ethical and observational issues tangible. This framework could help steer research away from over-reliance on pure imitation learning for high-stakes causal decisions.

minor comments (2)

[Abstract] The abstract is concise but could include a sentence on the statins example to better prepare the reader for the full argument.
[Discussion of training LLMs] The challenges are well-described qualitatively; adding a reference to specific methods in causal inference, such as those handling observational data biases, would enhance clarity without altering the conceptual nature.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of our manuscript and the recommendation for minor revision. We are pleased that the conceptual framework distinguishing the treatment problem from the chat problem, and the use of the statins example to illustrate ethical and observational challenges, was viewed as significant. We respond to the referee's summary of the paper below.

Point-by-point responses

Referee: The manuscript claims that large language models have potential to aid medical decision making but must distinguish the treatment problem, which requires causal evidence from experimental or observational studies under ethical constraints, from the chat problem based on imitation. Using the statins example, it argues that imitation alone cannot solve the treatment problem, though it can be useful, and identifies ethics of experimentation and observational assumptions as major challenges for training such systems, relating this to evidence-based medicine.

Authors: We appreciate this concise summary of our work, which accurately reflects the main points we sought to make. We agree that the distinction is important for guiding research in AI for medicine away from over-reliance on imitation for causal decisions.

Revision: no

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

This is a conceptual discussion paper with no mathematical derivations, equations, or fitted parameters. It distinguishes the treatment problem (requiring causal evidence under ethical/observational constraints) from the chat/imitation problem using standard principles of evidence-based medicine and causal inference, illustrated via the statins example. All load-bearing claims rest on externally established distinctions rather than self-referential definitions, self-citations, or reductions to inputs by construction. The paper is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on established domain assumptions from evidence-based medicine without introducing new free parameters or invented entities.

axioms (2)

domain assumption: The treatment problem requires evidence from experimental or observational data rather than imitation alone.
Central distinction drawn in the discussion of solving the treatment problem.
domain assumption: Ethics of experimentation and assumptions in observational data are the primary barriers to training LLM-based treatment systems.
Highlighted as the major challenges in the final sections of the abstract.

pith-pipeline@v0.9.0 · 5690 in / 1278 out tokens · 51140 ms · 2026-05-19T08:18:11.443767+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


the optimization in (2) is the fundamental treatment problem... imitation objective in (7) does not take into account utility, U
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear


observational data... no unmeasured confounders assumption cannot be verified with data alone
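
The second quoted passage can be unpacked with the standard identification argument from causal inference. The notation below is assumed for illustration and is not taken from the paper: X denotes measured covariates, A the statin indicator, Y(a) the potential outcome under treatment a, and Y the observed outcome.

```latex
% Assumed notation, not the paper's: X covariates, A in {0,1} statin indicator,
% Y(a) potential outcome under treatment a, Y observed outcome.
% Requires amsmath for align*.
\begin{align*}
  &\text{Consistency:} && Y = Y(A), \\
  &\text{No unmeasured confounders:} && Y(a) \perp\!\!\!\perp A \mid X
      \quad \text{for } a \in \{0, 1\}, \\
  &\text{Positivity:} && 0 < P(A = 1 \mid X) < 1, \\
  &\text{Identification:} && E[Y(a)] = E\bigl[\, E[Y \mid A = a, X] \,\bigr].
\end{align*}
```

The right-hand side of the last line involves only observed quantities, but the equality holds only if the second condition does. That condition concerns Y(a) for patients who did not receive a, which is never observed, so it has to be argued from subject-matter knowledge and cannot be checked from the data.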

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

122 extracted references · 122 canonical work pages · 7 internal anchors

[1]

Constrained policy optimization, in: International Conference on Machine Learning, PMLR

Achiam, J., Held, D., Tamar, A., Abbeel, P., 2017. Constrained policy optimization, in: International Conference on Machine Learning, PMLR. pp. 22–31

work page 2017
[2]

Large language models as decision-making tools in oncology: Comparing artificial intelligence suggestions and expert recommendations

Ah-Thiane, L., Heudel, P.E., Campone, M., Robert, M., Brillaud-Meflah, V., Rousseau, C., Le Blanc-Onfroy, M., Tomaszewski, F., Supiot, S., Perennec, T., et al., 2025. Large language models as decision-making tools in oncology: Comparing artificial intelligence suggestions and expert recommendations. JCO Clinical Cancer Informatics 9, e2400230

work page 2025
[3]

Large language models as co-pilots for causal inference in medical studies

Alaa, A., Phillips, R.V., Kıcıman, E., Balzer, L.B., van der Laan, M., Petersen, M., 2024. Large language models as co-pilots for causal inference in medical studies. arXiv preprint arXiv:2407.19118

work page arXiv 2024
[4]

Artificial hallucinations in chatgpt: implications in scientific writing

Alkaissi, H., McFarlane, S.I., 2023. Artificial hallucinations in chatgpt: implications in scientific writing. Cureus 15

work page 2023
[5]

Randomized-controlled trials are methodologically inappropriate in adolescent transgender healthcare

Ashley, F., Tordoff, D.M., Olson-Kennedy, J., Restar, A.J., 2024. Randomized-controlled trials are methodologically inappropriate in adolescent transgender healthcare. International Journal of Transgender Health 25, 407–418

work page 2024
[6]

Evaluating artificial intelligence responses to public health questions

Ayers, J.W., Zhu, Z., Poliak, A., Leas, E.C., Dredze, M., Hogarth, M., Smith, D.M., 2023. Evaluating artificial intelligence responses to public health questions. JAMA Network Open 6, e2317517–e2317517

work page 2023
[7]

Why we need observational studies to evaluate the effectiveness of health care

Black, N., 1996. Why we need observational studies to evaluate the effectiveness of health care. Bmj 312, 1215–1218

work page 1996
[8]

Clinical intuition versus statistics: different modes of tacit knowledge in clinical epidemiology and evidence-based medicine

Braude, H.D., 2009. Clinical intuition versus statistics: different modes of tacit knowledge in clinical epidemiology and evidence-based medicine. Theoretical medicine and bioethics 30, 181–198

work page 2009
[9]

Superhuman performance of a large language model on the reasoning tasks of a physician

Brodeur, P.G., Buckley, T.A., Kanjee, Z., Goh, E., Ling, E.B., Jain, P., Cabral, S., Abdulnour, R.E., Haimovich, A., Freed, J.A., et al., 2024. Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv preprint arXiv:2412.10849

work page arXiv 2024
[10]

Language models are few-shot learners

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al., 2020. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901

work page 2020
[11]

Impact of a digital scribe system on clinical documentation time and quality: usability study

van Buchem, M.M., Kant, I.M., King, L., Kazmaier, J., Steyerberg, E.W., Bauer, M.P., 2024. Impact of a digital scribe system on clinical documentation time and quality: usability study. JMIR AI 3, e60020

work page 2024
[12]

Carr, K., . Tweet. https://x.com/kareem_carr/status/1930633158833136034. [Accessed 17-06-2025]. 13

work page arXiv 2025
[13]

Statistical Reinforcement Learning

Chakraborty, B., Moodie, E.E.M., 2013. Statistical Reinforcement Learning. Springer New York, New York, NY. pp. 31–52. URL: https://doi.org/10.1007/978-1-4614-7428-9_3 , doi:10.1007/978-1-4614-7428-9_3

work page doi:10.1007/978-1-4614-7428-9_3 2013
[14]

Use of bayesian decision analysis to maximize value in patient-centered randomized clinical trials in parkinson’s disease

Chaudhuri, S.E., Ben Chaouch, Z., Hauber, B., Mange, B., Zhou, M., Christopher, S., Bardot, D., Sheehan, M., Donnelly, A., McLaughlin, L., et al., 2023. Use of bayesian decision analysis to maximize value in patient-centered randomized clinical trials in parkinson’s disease. Journal of Biopharmaceutical Statistics , 1–20

work page 2023
[15]

Clinical judgement in the era of big data and predictive analytics

Chin-Yee, B., Upshur, R., 2018. Clinical judgement in the era of big data and predictive analytics. Journal of Evaluation in Clinical Practice 24, 638–645

work page 2018
[16]

Statin use for the primary prevention of cardiovascular disease in adults: updated evidence report and systematic review for the us preventive services task force

Chou, R., Cantor, A., Dana, T., Wagner, J., Ahmed, A.Y., Fu, R., Ferencik, M., 2022. Statin use for the primary prevention of cardiovascular disease in adults: updated evidence report and systematic review for the us preventive services task force. Jama 328, 754–771

work page 2022
[17]

Beyond randomised versus observational studies

Concato, J., Horwitz, R.I., 2004. Beyond randomised versus observational studies. The Lancet 363, 1660–1661

work page 2004
[18]

Understanding and misunderstanding randomized controlled trials

Deaton, A., Cartwright, N., 2018. Understanding and misunderstanding randomized controlled trials. Social science & medicine 210, 2–21

work page 2018
[19]

Health professionals’ adherence to stroke clinical guidelines: a review of the literature

Donnellan, C., Sweetman, S., Shelley, E., 2013. Health professionals’ adherence to stroke clinical guidelines: a review of the literature. Health policy 111, 245–263

work page 2013
[20]

Suffering, meaning, and healing: challenges of contemporary medicine

Egnew, T.R., 2009. Suffering, meaning, and healing: challenges of contemporary medicine. The Annals of Family Medicine 7, 170–175

work page 2009
[21]

Constructing dynamic treatment regimes over indefinite time horizons

Ertefaie, A., Strawderman, R.L., 2018. Constructing dynamic treatment regimes over indefinite time horizons. Biometrika 105, 963–977

work page 2018
[22]

Value-aware loss function for model learning in reinforcement learning

Farahmand, A.m., Barreto, A.M., Nikovski, D.N., 2016. Value-aware loss function for model learning in reinforcement learning

work page 2016
[23]

The intellectual crisis of psychiatric research

Fava, G.A., 2006. The intellectual crisis of psychiatric research. Psychotherapy and Psycho- somatics 75, 202–208

work page 2006
[24]

evidence

Feinstein, A.R., Horwitz, R.I., 1997. Problems in the “evidence” of “evidence-based medicine”. The American journal of medicine 103, 529–535

work page 1997
[25]

Judg- ment and decision-making in clinical dentistry

Feller, L., Lemmer, J., Nemutandani, M.S., Ballyram, R., Khammissa, R.A.G., 2020. Judg- ment and decision-making in clinical dentistry. Journal of International Medical Research 48, 0300060520972877

work page 2020
[26]

Can chatgpt pass the life support exams without entering the american heart association course? Resuscitation 185

Fijaˇ cko, N., Gosak, L.,ˇStiglic, G., Picard, C.T., Douma, M.J., 2023. Can chatgpt pass the life support exams without entering the american heart association course? Resuscitation 185

work page 2023
[27]

Statistical methods for research workers, in: Breakthroughs in statistics: Methodology and distribution

Fisher, R.A., 1970. Statistical methods for research workers, in: Breakthroughs in statistics: Methodology and distribution. Springer, pp. 66–70

work page 1970
[28]

Principal stratification in causal inference

Frangakis, C.E., Rubin, D.B., 2002. Principal stratification in causal inference. Biometrics 58, 21–29

work page 2002
[29]

Popcorn: Partially observed prediction constrained reinforcement learning

Futoma, J., Hughes, M.C., Doshi-Velez, F., 2020. Popcorn: Partially observed prediction constrained reinforcement learning. arXiv preprint arXiv:2001.04032 . 14

work page arXiv 2020
[30]

Personalized decision making for coronary artery disease treatment using offline reinforcement learning

Ghasemi, P., Greenberg, M., Southern, D.A., Li, B., White, J.A., Lee, J., 2025. Personalized decision making for coronary artery disease treatment using offline reinforcement learning. npj Digital Medicine 8, 99

work page 2025
[31]

2013 acc/aha guideline on the assessment of cardiovascular risk: a report of the american college of cardiology/american heart association task force on practice guidelines

Goff, D.C., Lloyd-Jones, D.M., Bennett, G., Coady, S., D’agostino, R.B., Gibbons, R., Green- land, P., Lackland, D.T., Levy, D., O’donnell, C.J., et al., 2014. 2013 acc/aha guideline on the assessment of cardiovascular risk: a report of the american college of cardiology/american heart association task force on practice guidelines. Journal of the American...

work page 2014
[32]

Large language model influence on management reasoning: A randomized controlled trial

Goh, E., Gallo, R., Strong, E., Weng, Y., Kerman, H., Freed, J., Cool, J.A., Kanjee, Z., Lane, K.P., Parsons, A.S., et al., 2024. Large language model influence on management reasoning: A randomized controlled trial. medRxiv

work page 2024
[33]

Accuracy and reliability of chatbot responses to physician questions

Goodman, R.S., Patrinely, J.R., Stone, C.A., Zimmerman, E., Donald, R.R., Chang, S.S., Berkowitz, S.T., Finn, A.P., Jahangir, E., Scoville, E.A., et al., 2023. Accuracy and reliability of chatbot responses to physician questions. JAMA Network Open 6, e2336483–e2336483

work page 2023
[34]

Gottesman, O., Futoma, J., Liu, Y., Parbhoo, S., Celi, L., Brunskill, E., Doshi-Velez, F.,

work page
[35]

Interpretable off-policy evaluation in reinforcement learning by highlighting influential transitions, in: International Conference on Machine Learning, PMLR. pp. 3658–3667

work page
[36]

Evaluating Reinforcement Learning Algorithms in Observational Health Settings

Gottesman, O., Johansson, F., Meier, J., Dent, J., Lee, D., Srinivasan, S., Zhang, L., Ding, Y., Wihl, D., Peng, X., et al., 2018. Evaluating reinforcement learning algorithms in observational health settings. arXiv preprint arXiv:1805.12298

work page internal anchor Pith review Pith/arXiv arXiv 2018
[37]

Intuition and evidence–uneasy bedfellows? British Journal of General Practice 52, 395–400

Greenhalgh, T., 2002. Intuition and evidence–uneasy bedfellows? British Journal of General Practice 52, 395–400

work page 2002
[38]

As- sessment of large language models (llms) in decision-making support for gynecologic oncology

Gumilar, K.E., Indraprasta, B.R., Faridzi, A.S., Wibowo, B.M., Herlambang, A., Rahestyn- ingtyas, E., Irawan, B., Tambunan, Z., Bustomi, A.F., Brahmantara, B.N., et al., 2024. As- sessment of large language models (llms) in decision-making support for gynecologic oncology. Computational and Structural Biotechnology Journal 23, 4019–4026

work page 2024
[39]

The impact of nuance dax ambient listening ai documentation: a cohort study

Haberle, T., Cleveland, C., Snow, G.L., Barber, C., Stookey, N., Thornock, C., Younger, L., Mullahkhel, B., Ize-Ludlow, D., 2024. The impact of nuance dax ambient listening ai documentation: a cohort study. Journal of the American Medical Informatics Association 31, 975–979

work page 2024
[40]

Artificial intelligence in medicine

Hamet, P., Tremblay, J., 2017. Artificial intelligence in medicine. metabolism 69, S36–S40

work page 2017
[41]

Medpair: Measuring physicians and ai relevance alignment in medical question answering

Hao, Y., Alhamoud, K., Jeong, H., Zhang, H., Puri, I., Torr, P., Schaekermann, M., Stern, A.D., Ghassemi, M., 2025. Medpair: Measuring physicians and ai relevance alignment in medical question answering. arXiv preprint arXiv:2505.24040

work page arXiv 2025
[42]

Why a bayesian approach to drug development and evalua- tion?

Harrell Jr, F.E., Vange, L., 2019. Why a bayesian approach to drug development and evalua- tion?

work page 2019
[43]

Recognizing racit knowledge in medical epistemology

Henry, S.G., 2006. Recognizing racit knowledge in medical epistemology. Theoretical medicine and bioethics 27, 187–213

work page 2006
[44]

Evidence-based practice– imperfect but necessary

Herbert, R.D., Sherrington, C., Maher, C., Moseley, A.M., 2001. Evidence-based practice– imperfect but necessary. Physiotherapy Theory and Practice 17, 201–211. 15

work page 2001
[45]

Artificial intelligence in medicine

Holmes, J., Sacchi, L., Bellazzi, R., et al., 2004. Artificial intelligence in medicine. Ann R Coll Surg Engl 86, 334–8

work page 2004
[46]

A generalization of sampling without replacement from a finite universe

Horvitz, D.G., Thompson, D.J., 1952. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association 47, 663–685

work page 1952
[47]

Adaptive experiment design with synthetic controls, in: International Conference on Artificial Intelligence and Statistics, PMLR

H¨ uy¨ uk, A., Qian, Z., van der Schaar, M., 2024. Adaptive experiment design with synthetic controls, in: International Conference on Artificial Intelligence and Statistics, PMLR. pp. 1180–1188

work page 2024
[48]

An evaluation framework for clinical use of large language models in patient interaction tasks

Johri, S., Jeong, J., Tran, B.A., Schlessinger, D.I., Wongvibulsin, S., Barnes, L.A., Zhou, H.Y., Cai, Z.R., Van Allen, E.M., Kim, D., et al., 2025. An evaluation framework for clinical use of large language models in patient interaction tasks. Nature Medicine , 1–10

work page 2025
[49]

Deep reinforcement learning in medicine

Jonsson, A., 2019. Deep reinforcement learning in medicine. Kidney diseases 5, 18–22

work page 2019
[50]

Thinking, fast and slow

Kahneman, D., 2011. Thinking, fast and slow. macmillan

work page 2011
[51]

Efficient evaluation of natural stochastic policies in offline reinforcement learning

Kallus, N., Uehara, M., 2020. Efficient evaluation of natural stochastic policies in offline reinforcement learning. arXiv preprint arXiv:2006.03886

work page arXiv 2020
[52]

Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge

Kanjee, Z., Crowe, B., Rodman, A., 2023. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA URL: https://doi.org/10.1001/jama. 2023.8288, doi:10.1001/jama.2023.8288

work page doi:10.1001/jama 2023
[53]

Diversity, equity, and inclusion in clinical trials

Keegan, G., Crown, A., Joseph, K.A., 2023. Diversity, equity, and inclusion in clinical trials. Surgical Oncology Clinics 32, 221–232

work page 2023
[54]

Towards optimal doubly robust estimation of heterogeneous causal effects

Kennedy, E.H., 2023. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics 17, 3008–3049

work page 2023
[55]

Abstentionbench: Reasoning llms fail on unanswerable questions

Kirichenko, P., Ibrahim, M., Chaudhuri, K., Bell, S.J., 2025. Abstentionbench: Reasoning llms fail on unanswerable questions. arXiv preprint arXiv:2506.09038

work page arXiv 2025
[56]

Imitation and reinforcement learning

Kober, J., Peters, J., 2010. Imitation and reinforcement learning. IEEE Robotics & Automa- tion Magazine 17, 55–62

work page 2010
[57]

On information and sufficiency

Kullback, S., Leibler, R.A., 1951. On information and sufficiency. The annals of mathematical statistics 22, 79–86

work page 1951
[58]

Chatgpt and the clinical informatics board examination: the end of unproctored maintenance of certification? Journal of the American Medical Informatics Association , ocad104

Kumah-Crystal, Y., Mankowitz, S., Embi, P., Lehmann, C.U., 2023. Chatgpt and the clinical informatics board examination: the end of unproctored maintenance of certification? Journal of the American Medical Informatics Association , ocad104

work page 2023
[59]

Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models

Kung, T.H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepa˜ no, C., Madriaga, M., Aggabao, R., Diaz-Candido, G., Maningo, J., et al., 2023. Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models. PLoS digital health 2, e0000198

work page 2023
[60]

Ehrnoteqa: A patient-specific question answering benchmark for evaluating large language models in clin- ical settings

Kweon, S., Kim, J., Kwak, H., Cha, D., Yoon, H., Kim, K., Won, S., Choi, E., 2024. Ehrnoteqa: A patient-specific question answering benchmark for evaluating large language models in clin- ical settings. Preprint

work page 2024
[61]

Dynamic treatment regimes: Technical challenges and applications

Laber, E.B., Lizotte, D.J., Qian, M., Pelham, W.E., Murphy, S.A., 2014. Dynamic treatment regimes: Technical challenges and applications. Electronic journal of statistics 8, 1225. 16

work page 2014
[62]

Recommender systems: a review

LeBlanc, P.M., Banks, D., Fu, L., Li, M., Tang, Z., Wu, Q., 2024. Recommender systems: a review. Journal of the American Statistical Association 119, 773–785

work page 2024
[63]

Learning neural network policies with guided policy search under unknown dynamics., in: NIPS, Citeseer

Levine, S., Abbeel, P., 2014. Learning neural network policies with guided policy search under unknown dynamics., in: NIPS, Citeseer. pp. 1071–1079

work page 2014
[64]

Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning

Li, S., Balachandran, V., Feng, S., Ilgen, J., Pierson, E., Koh, P.W.W., Tsvetkov, Y., 2024. Mediq: Question-asking llms and a benchmark for reliable interactive clinical reasoning. Ad- vances in Neural Information Processing Systems 37, 28858–28888

work page 2024
[65]

Making Decisions

Lindley, D., 1991. Making Decisions. Wiley. URL: https://books.google.com/books?id= 3-ZQAAAAMAAJ

work page 1991
[66]

Using AI-generated suggestions from ChatGPT to opti- mize clinical decision support

Liu, S., Wright, A.P., Patterson, B.L., Wanderer, J.P., Turer, R.W., Nelson, S.D., McCoy, A.B., Sittig, D.F., Wright, A., 2023. Using AI-generated suggestions from ChatGPT to opti- mize clinical decision support. Journal of the American Medical Informatics Association 30, 1237–1245. doi: 10.1093/jamia/ocad072

work page doi:10.1093/jamia/ocad072 2023
[67]

Luckett, D.J., Laber, E.B., Kahkoska, A.R., Maahs, D.M., Mayer-Davis, E., Kosorok, M.R.,

work page
[68]

Journal of the American Statistical Association

Estimating dynamic treatment regimes in mobile health using v-learning. Journal of the American Statistical Association

work page
[69]

Overview of artificial intelligence in medicine

Malik, P., Pathania, M., Rathaur, V.K., et al., 2019. Overview of artificial intelligence in medicine. Journal of family medicine and primary care 8, 2328–2331

work page 2019
[70]

Miao, B.Y., Williams, C.Y., Chinedu-Eneh, E., Zack, T., Alsentzer, E., Butte, A.J., Chen, I.Y., 2025. Understanding contraceptive switching rationales from real world clinical notes using large language models. npj Digital Medicine 8, 221

[71]

Murphy, S.A., 2003. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65, 331–355

[72]

Narayanan, A., Kapoor, S., 2024. AI snake oil: What artificial intelligence can do, what it can’t, and how to tell the difference, in: AI Snake Oil. Princeton University Press

[73]

Ng, A.Y., Jordan, M.I., 2013. PEGASUS: A policy search method for large MDPs and POMDPs. arXiv preprint arXiv:1301.3878

[74]

OpenAI, 2023. ChatGPT. https://chat.openai.com

[75]

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al., 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35, 27730–27744

[76]

Owen, A.B., 2013. Monte Carlo theory, methods and examples

[77]

Pauker, S.G., Kassirer, J.P., 1987. Decision analysis. New England Journal of Medicine 316, 250–258

[78]

Pearl, J., 2009. Causality. Cambridge University Press

[79]

Peters, J., Mulling, K., Altun, Y., 2010. Relative entropy policy search, in: Proceedings of the AAAI Conference on Artificial Intelligence

[80]

Petersen, B.K., Yang, J., Grathwohl, W.S., Cockrell, C., Santiago, C., An, G., Faissol, D.M.,

