pith. machine review for the scientific record.

arxiv: 2604.17988 · v1 · submitted 2026-04-20 · 💻 cs.CL

Employing General-Purpose and Biomedical Large Language Models with Advanced Prompt Engineering for Pharmacoepidemiologic Study Design

Pith reviewed 2026-05-10 04:24 UTC · model grok-4.3

keywords large language models · pharmacoepidemiology · prompt engineering · study design · GPT-4 · biomedical AI · ontology mapping · Sentinel System

The pith

Off-the-shelf general-purpose LLMs outperform specialized biomedical LLMs for pharmacoepidemiologic study design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests four large language models on 46 real pharmacoepidemiologic protocols drawn from the HMA-EMA Catalogue and Sentinel System. It measures how well each model generates relevant study designs, supplies logical justifications, and correctly maps concepts to standard medical coding systems. General-purpose models (GPT-4o and DeepSeek-R1) paired with Least-to-Most prompting receive the highest scores on relevance and justification, while the two biomedical fine-tuned models lag and often produce thin reasoning. All models show limited skill at ontology-code agreement, but the prompting strategy itself strongly affects how stable and useful the outputs become.

Core claim

When applied to pharmacoepidemiologic study design tasks, general-purpose LLMs such as GPT-4o and DeepSeek-R1 achieve higher median relevance scores and stronger logical justifications than biomedically fine-tuned models. On HMA-EMA protocols, GPT-4o with Least-to-Most prompting reaches a median relevance of 4 in eight of nine evaluation questions. Biomedical LLMs more frequently generate insufficient justification. Least-to-Most prompting improves reasoning stability across models, yet every LLM tested remains limited in mapping study elements to ontology codes.
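A statement such as "a median relevance of 4 in eight of nine evaluation questions" can be reproduced from per-question Likert ratings. The sketch below is illustrative only: the `ratings` values are invented, and the helper name `questions_reaching` is hypothetical rather than taken from the paper.

```python
from statistics import median

# Hypothetical Likert ratings (1-5) for one model-prompt combination.
# Keys are question numbers; values are scores across protocols.
# These numbers are invented for illustration, not taken from the paper.
ratings = {
    1: [4, 5, 4, 3],
    2: [4, 4, 5, 4],
    3: [2, 3, 3, 4],
}

def questions_reaching(ratings, threshold):
    """Return the sorted question ids whose median score is >= threshold."""
    return sorted(q for q, scores in ratings.items() if median(scores) >= threshold)

# Median score per question, then the questions that reach a median of 4.
medians = {q: median(scores) for q, scores in ratings.items()}
hits = questions_reaching(ratings, 4)
print(medians, f"{len(hits)} of {len(ratings)} questions reach a median of 4")
```

Because the median of ordinal scores ignores how far individual ratings sit from the centre, reporting it together with the interquartile range (as the box plots in Figure 1 do) gives a fuller picture.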

What carries the argument

Least-to-Most prompting applied to general-purpose LLMs, evaluated through team-scored relevance, logic of justification, and ontology-code agreement on 46 real protocols.
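Least-to-Most prompting breaks a task into ordered subquestions and feeds each answer into the next prompt. The sketch below shows only that chaining idea. The function `least_to_most`, the stub `fake_model`, and the three subquestions are assumptions made for illustration; they are not the authors' prompts or code, and no real model API is called.

```python
from typing import Callable, List, Tuple

def least_to_most(
    subquestions: List[str],
    ask: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Ask subquestions in order, prepending earlier Q/A pairs as context."""
    history: List[Tuple[str, str]] = []
    for question in subquestions:
        # Earlier answers become context for the current, harder question.
        context = "".join(f"Q: {q}\nA: {a}\n" for q, a in history)
        prompt = f"{context}Q: {question}\nA:"
        history.append((question, ask(prompt)))
    return history

def fake_model(prompt: str) -> str:
    """Stub standing in for a model call; reports how many prior answers it saw."""
    return f"answer given {prompt.count('A: ')} earlier answer(s)"

# Hypothetical subquestions, ordered from simplest to most dependent.
steps = [
    "Select the study design.",
    "Define the index date.",
    "Define inclusion and exclusion criteria.",
]
result = least_to_most(steps, fake_model)
for question, answer in result:
    print(question, "->", answer)
```

Replacing `fake_model` with a real client call would be the only change needed to run the chain against an actual model; how context is formatted is a design choice that would need tuning per model.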

If this is right

  • General-purpose models can already supply more usable first drafts of pharmacoepidemiologic protocols than current biomedical models.
  • Least-to-Most prompting offers a practical way to improve reasoning consistency without retraining.
  • Ontology-code mapping remains a shared weakness that limits immediate deployment for fully automated coding tasks.
  • Prompt choice matters more than model specialization for this application area.
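The ontology-code weakness noted in the list above can be made concrete with a simple overlap score between the codes a model proposes and the codes listed in a protocol. This page does not describe how the paper computes ontology-code agreement, so the Jaccard overlap below is a stand-in chosen for illustration, and the code lists are hypothetical.

```python
def code_agreement(predicted, reference):
    """Jaccard overlap between two collections of codes (1.0 means identical sets)."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # Two empty code lists agree trivially.
    return len(pred & ref) / len(pred | ref)

# Hypothetical example: codes proposed by a model vs. codes in a protocol.
llm_codes = ["I21", "I63", "C10AA05"]
protocol_codes = ["I21", "C10AA05", "C10AA07"]
score = code_agreement(llm_codes, protocol_codes)
print(score)
```

An exact-match overlap of this kind is strict: it gives no credit for a parent or sibling code, so a hierarchy-aware comparison might be preferable in practice.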

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results suggest that broad training data may capture more transferable patterns for study design than narrow biomedical fine-tuning.
  • Integrating general LLMs with external ontology lookup tools could address the persistent code-mapping shortfall.
  • Regulatory groups might begin pilot-testing general LLMs for protocol drafting sooner than expected if the performance gap holds in live use.

Load-bearing premise

The study team's manual ratings of relevance, justification quality, and code accuracy provide an objective and complete measure of how useful the LLM outputs would be when pharmacoepidemiologists actually design studies.

What would settle it

An independent blinded review in which practicing pharmacoepidemiologists rate the biomedical models higher than the general-purpose models on the same 46 protocols for relevance and justification quality would falsify the superiority claim.

Figures

Figures reproduced from arXiv: 2604.17988 by Francesco Paolo Speca, Manuela Del Castillo Suero, Maurizio Sessa, Nicole Sonne Heckmann, Xinyao Zhang.

Figure 1: Box plots of the distribution of 5-point Likert-scale relevance scores for outputs generated by different LLM architectures across 9 questions. Legend: Each subplot (1-9) represents one question of a specific pharmacoepidemiological study. The y-axis lists LLM–prompt combinations, while the x-axis indicates Likert scores from 1 (completely inaccurate) to 5 (highly accurate). Box plots illustrate the interquartile…
Original abstract

Background: The potential of large language models (LLMs) to automate and support pharmacoepidemiologic study design is an emerging area of interest, yet their reliability remains insufficiently characterized. General-purpose LLMs often display inaccuracies, while the comparative performance of specialized biomedical LLMs in this domain remains unknown. Methods: This study evaluated general-purpose LLMs (GPT-4o and DeepSeek-R1) versus biomedically fine-tuned LLMs (QuantFactory/Bio-Medical-Llama-3-8B-GGUF and Irathernotsay/qwen2-1.5B-medical_qa-Finetune) using 46 protocols (2018-2024) from the HMA-EMA Catalogue and Sentinel System. Performance was assessed across relevance, logic of justification, and ontology-code agreement across multiple coding systems using Least-to-Most (LTM) and Active Prompting strategies. Results: GPT-4o and DeepSeek-R1 paired with LTM prompting achieved the highest relevance and logic of justification scores, with GPT-4o-LTM reaching a median relevance score of 4 in 8 of 9 questions for HMA-EMA protocols. Biomedical LLMs showed lower relevance overall and frequently generated insufficient justification. All LLMs demonstrated limited proficiency in ontology-code mapping, although LTM provided the most consistent improvements in reasoning stability. Conclusion: Off-the-shelf general-purpose LLMs currently offer superior support for pharmacoepidemiologic design compared to biomedical LLMs. Prompt strategy strongly influenced LLM performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that general-purpose LLMs (GPT-4o and DeepSeek-R1) with advanced prompting strategies like Least-to-Most (LTM) outperform biomedically fine-tuned LLMs in supporting pharmacoepidemiologic study design, based on evaluations of relevance, logic of justification, and ontology-code agreement using 46 real-world protocols from HMA-EMA and Sentinel catalogues.

Significance. Should the results prove reliable upon addressing methodological details, this finding would be significant for the field as it challenges the assumption that domain-specific fine-tuning is necessary or beneficial for LLMs in specialized medical research design tasks. It emphasizes prompt engineering's impact and could guide researchers in selecting tools for automating aspects of pharmacoepidemiology, potentially improving efficiency in study protocol development.

major comments (3)
  1. [Results] The quantitative scores, such as the median relevance of 4 for GPT-4o-LTM on 8/9 HMA-EMA questions, rely on subjective assessments by the study team. No information is provided regarding blinding to the LLM identity, the number of evaluators, the detailed scoring criteria, or measures of inter-rater agreement, which undermines confidence in the comparative claims.
  2. [Methods] The biomedical LLMs evaluated are much smaller (8B and 1.5B parameters) than the general-purpose ones. This size disparity is a potential confounder for the observed performance gap, and the manuscript does not address whether the differences are due to domain specialization or model capacity.
  3. [Methods] The selection of 46 protocols from two catalogues may not sufficiently represent the full range of pharmacoepidemiologic study designs (e.g., those with complex time-dependent exposures or linked databases), raising questions about whether the superiority of general-purpose LLMs holds across broader applications.
minor comments (2)
  1. [Abstract] The abstract reports 'ontology-code agreement' but provides no specific quantitative results or examples, making it hard to assess the extent of the limitation.
  2. [Abstract] Acronyms such as HMA-EMA, LTM, and Sentinel System should be defined on first use for clarity.
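Major comment 1 asks for a measure of inter-rater agreement. One common choice for two raters is Cohen's kappa, sketched below in plain Python. The ratings are invented, and nothing here reflects how the study team actually scored outputs. Unweighted kappa treats Likert scores as unordered categories, so a weighted variant may suit ordinal data better.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical category throughout.
    return (observed - expected) / (1 - expected)

# Hypothetical Likert scores from two evaluators on six outputs.
scores_a = [4, 4, 3, 5, 2, 4]
scores_b = [4, 3, 3, 5, 2, 4]
kappa = cohen_kappa(scores_a, scores_b)
print(round(kappa, 3))
```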

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in transparency and discussion of limitations. We address each major comment below and indicate the revisions we will make to the manuscript.

  1. Referee: [Results] The quantitative scores, such as the median relevance of 4 for GPT-4o-LTM on 8/9 HMA-EMA questions, rely on subjective assessments by the study team. No information is provided regarding blinding to the LLM identity, the number of evaluators, the detailed scoring criteria, or measures of inter-rater agreement, which undermines confidence in the comparative claims.

    Authors: We agree that greater transparency regarding the evaluation process is warranted. The original manuscript did not include these details. We will revise the Methods section to specify the number of evaluators from the study team, the detailed scoring criteria and rubrics used for relevance and justification, and any inter-rater agreement measures. We will also state that evaluators were not blinded to LLM identity due to distinct output characteristics and discuss this as a limitation. revision: yes

  2. Referee: [Methods] The biomedical LLMs evaluated are much smaller (8B and 1.5B parameters) than the general-purpose ones. This size disparity is a potential confounder for the observed performance gap, and the manuscript does not address whether the differences are due to domain specialization or model capacity.

    Authors: This is a valid observation that was not addressed in the original submission. We will add a discussion in the Limitations section acknowledging the size disparity as a potential confounder and noting that the performance gap may reflect both domain specialization and model capacity. We will recommend future comparisons using models of comparable sizes to isolate these effects, while maintaining that the results reflect currently available biomedical LLMs. revision: partial

  3. Referee: [Methods] The selection of 46 protocols from two catalogues may not sufficiently represent the full range of pharmacoepidemiologic study designs (e.g., those with complex time-dependent exposures or linked databases), raising questions about whether the superiority of general-purpose LLMs holds across broader applications.

    Authors: We selected the 46 protocols from established HMA-EMA and Sentinel catalogues to represent real-world studies from 2018-2024. We acknowledge that this sample may not encompass all design variations. We will revise the Discussion to explicitly address generalizability limitations and recommend validation on additional protocol types, including those with complex time-dependent exposures or linked databases. revision: partial

Circularity Check

0 steps flagged

No significant circularity in this empirical LLM evaluation study


The paper conducts a direct empirical comparison of off-the-shelf general-purpose LLMs (GPT-4o, DeepSeek-R1) against smaller biomedical LLMs on 46 real pharmacoepidemiologic protocols drawn from external catalogues. Performance is measured by human-assigned scores on relevance, logic of justification, and ontology-code agreement under different prompting strategies. No mathematical derivations, equations, fitted parameters, or predictions appear; the central claims rest on straightforward evaluation against independent human-designed protocols rather than any self-referential reduction or self-citation chain. The study is therefore self-contained with no load-bearing steps that collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of the protocol sample and the validity of the qualitative scoring metrics. No free parameters or new entities introduced.

axioms (2)
  • domain assumption The selected 46 protocols from 2018-2024 are representative of pharmacoepidemiologic studies.
    Used as the test set for evaluation.
  • domain assumption Human or expert assessment of relevance and logic is reliable and unbiased.
    Basis for scoring LLM performance.

pith-pipeline@v0.9.0 · 5600 in / 1495 out tokens · 73105 ms · 2026-05-10T04:24:48.399807+00:00 · methodology


Reference graph

Works this paper leans on

71 extracted references · 15 canonical work pages

  1. [1]

    Introduction The growing need for rapid evidence generation in pharmacoepidemiology places strong demands on study design and data quality.[1] This is particularly important for populations underrepresented in clinical trials and comparative effectiveness/safety research in routine care.[2,3] As artificial intelligence (AI) becomes increasingly integrat...

  2. [2]

    drug and outcome analysis

    Methods 2.1 Data Sources This study evaluated LLM performance using study protocols from two pharmacoepidemiologic sources (i.e., HMA-EMA Catalogue and Sentinel System). From the HMA-EMA Catalogue of Real-World Data (RWD) Studies, all available DARWIN EU® protocols were included at the time of analysis (n=16). In addition, 15 non-DARWIN expert-develop...

  3. [3]

    LLMs were paired with seven prompt engineering strategies identified from the literature:

  4. [4]

    basic prompt with term definitions, 2) synthetic prompting, 3) active-prompt, 4) plan-and-solve,

  5. [5]

    [20–22] Together, this yielded 42 model-prompt combinations for the initial pre-assessment of the most performing combinations

    least-to-most (LTM), 6) Tree-of-Thought, and 7) decomposition prompting. [20–22] Together, this yielded 42 model-prompt combinations for the initial pre-assessment of the most performing combinations. Prompt construction was guided by the CLEAR framework, which emphasizes concise, logical, explicit, adaptive, and reflective prompt design. [23] A pre-asse...

  6. [6]

    The 46 protocols are available at https://github.com/madelsu/LLM-for-pharmacoepi-study-design/tree/main/Data

    Results 3.1 Descriptive Analysis A total of 46 pharmacoepidemiological protocols were included: 16 from DARWIN EU®, 15 from the HMA-EMA Catalogue, and 15 from the Sentinel System. The 46 protocols are available at https://github.com/madelsu/LLM-for-pharmacoepi-study-design/tree/main/Data. There was full agreement between the human experts when extracting ...

  7. [7]

    These uncertainties highlight the need for systematic evaluation of LLM capabilities in this domain

    Discussion The use of LLMs in pharmacoepidemiology has received increasing research attention in recent years.[11] Despite this growing interest, important concerns remain regarding the reliability of general-purpose LLMs, the suitability of their training data for scientific and regulatory tasks, and the performance of newly released models and prompt-e...

  8. [8]

    It is also in line with Dada et al., who showed that increasing biomedical specialization does not necessarily translate into better performance on medical tasks and may come at the cost of instruction-following or broader task adaptability. [27] Together, these findings suggest that pharmacoepidemiologic study design is not simply a knowledge retrieval t...

  9. [9]

    Conclusions This study showed that off-the-shelf general-purpose LLMs, particularly GPT-4o and DeepSeek-R1 combined with LTM prompting, outperformed the biomedical LLMs evaluated in this study for support of pharmacoepidemiologic study design. Their advantage was evident not only for the primary outcome of relevance, but also for logic of justification, ...

  10. [10]

    Because the study relied exclusively on publicly available study protocols, ethical approval and informed consent were not applicable

    Ethics, Funding, and Conflict of Interest No funding was obtained for this study. Because the study relied exclusively on publicly available study protocols, ethical approval and informed consent were not applicable. The authors declare no conflicts of interest with respect to the conduct or reporting of this study

  11. [11]

    MS and XZ contributed to data collection, formal analysis, interpretation of results, and revision of the manuscript

    Authors’ Contribution NSH and MS had primary responsibility for the study conception and design and for the acquisition, analysis, and interpretation of data, and they led the drafting of the manuscript. MS and XZ contributed to data collection, formal analysis, interpretation of results, and revision of the manuscript. MDCS and FPS contributed to interpr...

  12. [12]

    Pharmacoepidemiology

    Strom B, Kimmel S, Hennessy S. Pharmacoepidemiology. 6th ed. Hoboken (NJ): Wiley-Blackwell; 2020

  13. [13]

    Accelerating evidence generation: Addressing critical challenges and charting a path forward

    Rim JG, Jackman JG, Hornik CP, Rutter JL, Warraich H, Wittes J, et al. Accelerating evidence generation: Addressing critical challenges and charting a path forward. J Clin Transl Sci. 2024 Oct 31;8(1):e184. doi:10.1017/cts.2024.621

  14. [14]

    Now is the time to fix the evidence generation system

    Califf RM. Now is the time to fix the evidence generation system. Clinical Trials. 2023 Feb 17;20(1):3–12. doi:10.1177/17407745221147689

  15. [15]

    Epidemiological Data Challenges: Planning for a More Robust Future Through Data Standards

    Fairchild G, Tasseff B, Khalsa H, Generous N, Daughton AR, Velappan N, et al. Epidemiological Data Challenges: Planning for a More Robust Future Through Data Standards. Front Public Health. 2018 Nov 23;6. doi:10.3389/fpubh.2018.00336

  16. [16]

    Food and Drug Administration

    U.S. Food and Drug Administration. FDA [Internet]. 2024 [cited 2026 Mar 24]. BERTox Initiative. Available from: https://www.fda.gov/about-fda/nctr-research-focus-areas/bertox-initiative

  17. [17]

    European Union; 2025 Aug 2

    EU Artificial Intelligence Act. European Union; 2025 Aug 2

  18. [18]

    What Is an Exposure, What Is a Disease, and How Do We Measure Them? In: Epidemiology Matters

    Keyes KM, Galea S. What Is an Exposure, What Is a Disease, and How Do We Measure Them? In: Epidemiology Matters. Oxford University Press; 2014. p. 18–32. doi:10.1093/med/9780199331246.003.0003

  19. [19]

    Visualizations throughout pharmacoepidemiology study planning, implementation, and reporting

    Gatto NM, Wang S V., Murk W, Mattox P, Brookhart MA, Bate A, et al. Visualizations throughout pharmacoepidemiology study planning, implementation, and reporting. Pharmacoepidemiol Drug Saf. 2022 Nov 9;31(11):1140–52. doi:10.1002/pds.5529

  20. [20]

    Inclusion and exclusion criteria in research studies: definitions and why they matter

    Patino CM, Ferreira JC. Inclusion and exclusion criteria in research studies: definitions and why they matter. Jornal Brasileiro de Pneumologia. 2018 Apr;44(2):84–84. doi:10.1590/s1806-37562018000000088

  21. [21]

    Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology

    Zhu K, Zhang J, Klishin A, Esser M, Blumentals WA, Juhaeri J, et al. Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology. Pharmacoepidemiol Drug Saf. 2025 Feb 3;34(2). doi:10.1002/pds.70111

  22. [22]

    Large language models in medical and healthcare fields: applications, advances, and challenges

    Wang D, Zhang S. Large language models in medical and healthcare fields: applications, advances, and challenges. Artif Intell Rev. 2024 Sep 20;57(11):299. doi:10.1007/s10462-024-10921-0

  23. [23]

    Off-the-Shelf Large Language Models for Causality Assessment of Individual Case Safety Reports: A Proof-of-Concept with COVID-19 Vaccines

    Abate A, Poncato E, Barbieri M, Powell G, Rossi A, Peker S, et al. Off-the-Shelf Large Language Models for Causality Assessment of Individual Case Safety Reports: A Proof-of-Concept with COVID-19 Vaccines. Drug Saf. 2025 Jul 12;48(7):805–20. doi:10.1007/s40264-025-01531-y

  24. [24]

    Amsterdam: European Medicines Agency [Internet]

    European Medicines Agency, Heads of Medicines Agencies. Amsterdam: European Medicines Agency [Internet]. 2025. HMA-EMA Catalogues of real-world data sources and studies. Available from: https://catalogues.ema.europa.eu

  25. [25]

    Silver Spring (MD): Sentinel Initiative [Internet]

    Sentinel Initiative. Silver Spring (MD): Sentinel Initiative [Internet]. 2025. Drug Studies. Available from: sentinelinitiative.org/studies/drugs

  26. [26]

    Wang S V., Pottegård A, Crown W, Arlett P, Ashcroft DM, Benchimol EI, et al. HARmonized Protocol Template to Enhance Reproducibility of hypothesis evaluating real‐world evidence studies on treatment effects: A good practices report of a joint ISPE/ISPOR task force. Pharmacoepidemiol Drug Saf. 2023 Jan 10;32(1):44–55. doi:10.1002/pds.5507

  27. [27]

    Hugging Face [Internet]

    Irathernotsay. Hugging Face [Internet]. 2025. qwen2-1.5B-medical_qa-Finetune. Available from: https://huggingface.co/Irathernotsay/qwen2-1.5B-medical_qa-Finetune

  28. [28]

    Hugging Face [Internet]

    Plaban81. Hugging Face [Internet]. 2025. gemma-medical_qa-Finetune. Available from: https://huggingface.co/Plaban81/gemma-medical_qa-Finetune

  29. [29]

    Hugging Face [Internet]

    mradermacher. Hugging Face [Internet]. 2025. DeepSeek-r1-Medical-Mini-GGUF. Available from: https://huggingface.co/mradermacher/DeepSeek-r1-Medical-Mini-GGUF

  30. [30]

    Hugging Face [Internet]

    QuantFactory. Hugging Face [Internet]. 2024. Bio-Medical-Llama-3-8B-GGUF. Available from: https://huggingface.co/QuantFactory/Bio-Medical-Llama-3-8B-GGUF

  31. [31]

    Can GPT Improve the State of Prior Authorization via Guideline Based Automated Question Answering? ArXiv

    Vatsal S, Singh A, Tafreshi S. Can GPT Improve the State of Prior Authorization via Guideline Based Automated Question Answering? ArXiv. 2024 Feb 28

  32. [32]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Zhou D, Schärli N, Hou L, Wei J, Scales N, Wang X, et al. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. ArXiv. 2022 May 21

  33. [33]

    A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks

    Vatsal S, Dubey H. A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks. ArXiv. 2024 Jun 17

  34. [34]

    The CLEAR path: A framework for enhancing information literacy through prompt engineering

    Lo LS. The CLEAR path: A framework for enhancing information literacy through prompt engineering. The Journal of Academic Librarianship. 2023 Jul;49(4):102720. doi:10.1016/j.acalib.2023.102720

  35. [35]

    The TRIPOD-LLM reporting guideline for studies using large language models

    Gallifant J, Afshar M, Ameen S, Aphinyanaphongs Y, Chen S, Cacciamani G, et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med. 2025 Jan 8;31(1):60–9. doi:10.1038/s41591-024-03425-5

  36. [36]

    Reporting guidelines for chatbot health advice studies: explanation and elaboration for the Chatbot Assessment Reporting Tool (CHART). BMJ. 2025 Aug 1;390:e083305. doi:10.1136/bmj-2024-083305

  37. [37]

    Biomedical Large Language Models and Prompt Engineering for Causality Assessment of Individual Case Safety Reports in Pharmacovigilance

    Heckmann NS, Papoutsi DG, Barbieri MA, Battini V, Mølgaard SN, Schmidt SØ, et al. Biomedical Large Language Models and Prompt Engineering for Causality Assessment of Individual Case Safety Reports in Pharmacovigilance. medRxiv (preprint). 2026. doi:10.64898/2026.02.19.26346467

  38. [38]

    Does Biomedical Training Lead to Better Medical Performance? ArXiv

    Dada A, Bauer M, Contreras A, et al. Does Biomedical Training Lead to Better Medical Performance? ArXiv. 2024 Apr 5;arXiv:2404.04067.

    Table 1. Prompt Examples. Columns: Concept; Least to Most; Active prompt. Study design: Based on the title [TITLE], list all the candidate study design types (e.g., cohort study, case-control study) and select the most appropriate ...

  39. [39]

    Align with CONSORT/STROBE guidelines

  40. [40]

    Prioritize feasibility in real-world settings

  41. [41]

    Primary prompt: Define index dates of this study

    Explicitly state rejection reasons for alternative designs Then enhance prompts: [CoT Example] Objective: Assess long-term effects of Drug X on risk Step 1: Identify need for longitudinal exposure-outcome data → Cohort design Step 2: Check feasibility of randomization → Reject RCT (lack of equipoise) Step 3: Compare with case-control → Prefer cohort for ...

  42. [42]

    For pharmacological studies: Use first prescription date + 30-day washout

  43. [43]

    For surgical studies: Use procedure date ± 7-day pre-op assessment

  44. [44]

    Mandatory elements: 17 exclusion criteria medical history, missing data) should be developed for this study based on the [Q1 answer] design and [Q2 answer] index date

    Flag potential immortal time bias sources [Validation] Check FDA Sentinel Common Data Model, DARWIN, HMA-EMA for alignment Inclusion and Inclusion criteria (e.g., age ≥18 years, confirmed disease) and exclusion criteria (e.g., past Generate inclusion/exclusion criteria of this study. Mandatory elements: 17 exclusion criteria medical history, missing data)...

  45. [45]

    Explicit linkage to objective-specific endpoints

  46. [46]

    Use WHO International Classification of Functioning (ICF) framework

  47. [47]

    Constraints:

    Flag criteria causing selection bias >20% Inclusion and exclusion assessment window Based on the [answer to question 2] index date, identify the time window for assessing inclusion and exclusion criteria (e.g., a baseline period of 6 months before the index date, an exclusion period of 1 year after the index date) and its specific duration Primary prompt:...

  48. [48]

    Align with EMA Guideline on GCP (Rev 3)

  49. [49]

    Use moving window approach for chronic conditions

  50. [50]

    Specify allowable overlaps (±5% timeline tolerance) Enhance prompt: [CoT Example] Objective: Window Logic:

  51. [51]

    Exclusion: [Check] Apply inverse probability weighting for missing window data Exposure Based on the [Answer to Question 1] design and [Answer to Question 3] inclusion criteria, define exposure (e.g., drug dose, exposure duration) Primary prompt: Define study exposure of the study Required elements:

  52. [52]

    Dose-response granularity (ATC + RxNorm mapping)

  53. [53]

    Must include:

    Exposure lag periods with biological plausibility check Competing risk adjustment plan Outcome Based on the [answer to question 1] design, specify primary/secondary outcome Primary prompt: Specify outcome for this study. Must include:

  54. [54]

    Primary/secondary endpoints definitions (e.g., laboratory confirmed diagnosis, imaging evidence)

  55. [55]

    Competing risk definitions (e.g., death censoring rules) Sensitivity analysis protocols for outcome misclassification. Follow-up period In conjunction with [Q.2 answer] Index date and [Q.6 answer] Definition of ending, set the start and end of the follow up period, whether right censoring is allowed and the minimum length of follow-up (e.g. 1 year). Quest...

  56. [56]

    Account for disease-specific latency periods

  57. [57]

    Apply landmark analysis for time-varying exposures

  58. [58]

    Specify left-truncation handling methods Enhance prompt: [CoT Example] Objective: Follow-up Logic:

  59. [59]

    Primary prompt: Identify covariates in this study

    End: Censor: Death Covariate List covariates (e.g., age, sex, comorbidities, medication history) to be adjusted for based on [QUESTION 5 ANSWER] exposure and [QUESTION 6 ANSWER] outcome. Primary prompt: Identify covariates in this study. Must:

  60. [60]

    Categorize into confounders/mediators/effect modifiers

  61. [61]

    Provide LOINC codes for lab covariates

  62. [62]

    Include negative control variables Enhance prompt: [CoT Example] Objective: Covariates:

  63. [63]

    Primary prompt: Define temporal windows for covariate assessment in this study

    Effect modifier: Negative control: Covariate assessment window Based on the [QUESTION 2 ANSWER] index date, define the time window in which the covariates will be evaluated (e.g., laboratory data within 1 year prior to the index date). Primary prompt: Define temporal windows for covariate assessment in this study. Constraints:

  64. [64]

    Align with FDA's Structured Product Labeling (SPL) standards

  65. [65]

    Handle time-varying covariates using marginal structural models

  66. [66]

    ICD-10, ATC

    Specify missing data imputation thresholds Enhance prompt: [CoT Example] Objective: Window Strategy: • Static: • Time-varying: Lagged: Ontology-code Convert diagnoses, exposures, outcomes from [Q5/6 answers] to standard codes i.e. ICD-10, ATC. Map all medical concepts in previous answers to standard terminologies based on the answer to question 5, 6 list...

  67. [67]

    Data collection methods: Self-reported questionnaires; Medical records review

  68. [68]

    Data collection methods: Physician interviews; Standardized weight and height measurements

  69. [69]

    Data collection methods: Patient diaries; Blood sample analysis

  70. [70]

    Data collection methods: Telephone surveys; Review of electronic medical records

  71. [71]

    Most appropriate answer: D

    Data collection methods: Clinic visits; Nurse interviews. Most appropriate answer: D. Data collection methods: Telephone surveys; Review of electronic medical records. Justification: Given the inclusion criteria defined for this study, data collection methods should be chosen to ensure accuracy and reliability in capturing relevant information about ea...