arxiv: 2604.17988 · v1 · submitted 2026-04-20 · 💻 cs.CL

Employing General-Purpose and Biomedical Large Language Models with Advanced Prompt Engineering for Pharmacoepidemiologic Study Design

Xinyao Zhang, Nicole Sonne Heckmann, Manuela Del Castillo Suero, Francesco Paolo Speca, Maurizio Sessa

Pith reviewed 2026-05-10 04:24 UTC · model grok-4.3

classification 💻 cs.CL

keywords large language models · pharmacoepidemiology · prompt engineering · study design · GPT-4 · biomedical AI · ontology mapping · Sentinel System

The pith

Off-the-shelf general-purpose LLMs outperform specialized biomedical LLMs for pharmacoepidemiologic study design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests four large language models on 46 real pharmacoepidemiologic protocols drawn from the HMA-EMA Catalogue and Sentinel System. It measures how well each model generates relevant study designs, supplies logical justifications, and correctly maps concepts to standard medical coding systems. General-purpose models (GPT-4o and DeepSeek-R1) paired with Least-to-Most prompting receive the highest scores on relevance and justification, while the two biomedical fine-tuned models lag and often produce thin reasoning. All models show limited skill at ontology-code agreement, but the prompting strategy itself strongly affects how stable and useful the outputs become.

Core claim

When applied to pharmacoepidemiologic study design tasks, general-purpose LLMs such as GPT-4o and DeepSeek-R1 achieve higher median relevance scores and stronger logical justifications than biomedically fine-tuned models. On HMA-EMA protocols, GPT-4o with Least-to-Most prompting reaches a median relevance of 4 in eight of nine evaluation questions. Biomedical LLMs more frequently generate insufficient justification. Least-to-Most prompting improves reasoning stability across models, yet every LLM tested remains limited in mapping study elements to ontology codes.
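To make the "median relevance of 4 in eight of nine evaluation questions" statistic concrete, the following is a minimal sketch of how per-question medians could be computed from 1 to 5 Likert ratings. The ratings, the three-question layout, and the function names are all hypothetical and invented for illustration; they are not the paper's data, and counting medians at or above the threshold (rather than exactly equal to it) is an assumption.

```python
from statistics import median

# Hypothetical 1-5 Likert relevance ratings, one list per evaluation question
# and one rating per protocol. Invented numbers, not the paper's data.
ratings_by_question = {
    "Q1": [4, 5, 4, 3, 4],
    "Q2": [2, 3, 3, 4, 3],
    "Q3": [5, 4, 5, 4, 5],
}


def median_by_question(ratings):
    """Return the median rating for each question."""
    return {question: median(scores) for question, scores in ratings.items()}


def count_at_or_above(medians, threshold=4):
    """Count questions whose median rating is at or above the threshold."""
    return sum(1 for value in medians.values() if value >= threshold)


medians = median_by_question(ratings_by_question)
questions_meeting_threshold = count_at_or_above(medians)
```

With odd-length rating lists the median is always one of the observed Likert values; with even-length lists `statistics.median` averages the two middle values, so a reported median may be fractional.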

What carries the argument

Least-to-Most prompting applied to general-purpose LLMs, evaluated through team-scored relevance, logic of justification, and ontology-code agreement on 46 real protocols.
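Least-to-Most prompting breaks a task into ordered sub-questions and feeds each answer into the next prompt. The sketch below illustrates that chaining pattern for three study-design steps. The prompt wording is paraphrased for illustration and is not the authors' prompt text, and `ask_model` is a hypothetical callable standing in for whatever LLM client is used.

```python
from typing import Callable, Dict, List, Tuple

# Ordered sub-questions. Each template may reference the title and the
# answers to earlier steps by key. Wording is illustrative only.
STEPS: List[Tuple[str, str]] = [
    ("design",
     "Based on the title '{title}', list candidate study design types "
     "and select the most appropriate one."),
    ("index_date",
     "Given the study design: {design}\nDefine the index date for this study."),
    ("criteria",
     "Given the design: {design}\nand the index date: {index_date}\n"
     "Propose inclusion and exclusion criteria."),
]


def run_least_to_most(title: str, ask_model: Callable[[str], str]) -> Dict[str, str]:
    """Ask each sub-question in order, passing earlier answers forward."""
    context = {"title": title}
    for key, template in STEPS:
        prompt = template.format(**context)  # earlier answers fill the template
        context[key] = ask_model(prompt)     # hypothetical LLM call
    answers = dict(context)
    answers.pop("title")
    return answers
```

Because each later prompt depends on earlier answers, an error in the first step can propagate through the chain, which is one reason the choice of step order matters.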

If this is right

General-purpose models can already supply more usable first drafts of pharmacoepidemiologic protocols than current biomedical models.
Least-to-Most prompting offers a practical way to improve reasoning consistency without retraining.
Ontology-code mapping remains a shared weakness that limits immediate deployment for fully automated coding tasks.
In these results, prompt choice mattered more than model specialization for this application area.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The results suggest that broad training data may capture more transferable patterns for study design than narrow biomedical fine-tuning.
Integrating general LLMs with external ontology lookup tools could address the persistent code-mapping shortfall.
Regulatory groups might begin pilot-testing general LLMs for protocol drafting sooner than expected if the performance gap holds in live use.
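The external-lookup idea above can be sketched as a post-processing check: codes proposed by a model are compared against a vocabulary, and unknown codes are flagged instead of being trusted. This is an illustrative sketch only. The in-memory `vocabulary` set is a stand-in for a real terminology service, the code strings are example inputs, and Jaccard overlap is an assumed agreement measure, not necessarily the paper's definition of ontology-code agreement.

```python
from typing import Iterable, Set, Tuple


def normalise(code: str) -> str:
    """Trim whitespace and upper-case a code string before comparison."""
    return code.strip().upper()


def split_known_unknown(proposed: Iterable[str],
                        vocabulary: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Separate proposed codes into those found in the vocabulary and those not."""
    codes = {normalise(c) for c in proposed}
    return codes & vocabulary, codes - vocabulary


def jaccard_agreement(proposed: Iterable[str], reference: Iterable[str]) -> float:
    """Share of codes common to both sets, out of all codes in either set."""
    a = {normalise(c) for c in proposed}
    b = {normalise(c) for c in reference}
    if not a and not b:
        return 1.0  # two empty sets agree trivially
    return len(a & b) / len(a | b)
```

A check of this kind only catches codes that do not exist in the vocabulary; a code that exists but describes the wrong concept would still need human review.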

Load-bearing premise

The study team's manual ratings of relevance, justification quality, and code accuracy provide an objective and complete measure of how useful the LLM outputs would be when pharmacoepidemiologists actually design studies.

What would settle it

An independent blinded review in which practicing pharmacoepidemiologists rate the biomedical models higher than the general-purpose models on the same 46 protocols for relevance and justification quality would falsify the superiority claim.
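A blinded review of this kind needs model identities hidden from raters until scoring is complete. The following is a minimal sketch of one way to do that: shuffle the outputs, give each an opaque ID, and keep the ID-to-model key separate. The function name, the `item-001` ID format, and the fixed seed are assumptions made for illustration. Masking IDs does not remove stylistic cues in the text itself, so it is not a complete blinding procedure.

```python
import random
from typing import Dict, List, Tuple


def blind_outputs(outputs: List[Tuple[str, str]],
                  seed: int) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Shuffle (model, text) pairs and replace model names with opaque IDs.

    Returns the blinded (item_id, text) list for raters and a separate
    item_id -> model key to be held back until scoring is finished.
    """
    rng = random.Random(seed)  # seeded so the allocation can be reproduced
    order = list(range(len(outputs)))
    rng.shuffle(order)

    blinded: List[Tuple[str, str]] = []
    key: Dict[str, str] = {}
    for position, original_index in enumerate(order):
        item_id = f"item-{position + 1:03d}"
        model, text = outputs[original_index]
        blinded.append((item_id, text))
        key[item_id] = model
    return blinded, key
```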

Figures

Figures reproduced from arXiv: 2604.17988 by Francesco Paolo Speca, Manuela Del Castillo Suero, Maurizio Sessa, Nicole Sonne Heckmann, Xinyao Zhang.

**Figure 1.** Box plots on the distribution of 5-point Likert-scale relevance scores for outputs generated by different LLM architectures across 9 questions. Legend: Each subplot (1-9) represents one question of a specific pharmacoepidemiological study. The y-axis lists LLM–prompt combinations, while the x-axis indicates Likert scores from 1 (completely inaccurate) to 5 (highly accurate). Box plots illustrate the interquartile…

Original abstract

Background: The potential of large language models (LLMs) to automate and support pharmacoepidemiologic study design is an emerging area of interest, yet their reliability remains insufficiently characterized. General-purpose LLMs often display inaccuracies, while the comparative performance of specialized biomedical LLMs in this domain remains unknown.

Methods: This study evaluated general-purpose LLMs (GPT-4o and DeepSeek-R1) versus biomedically fine-tuned LLMs (QuantFactory/Bio-Medical-Llama-3-8B-GGUF and Irathernotsay/qwen2-1.5B-medical_qa-Finetune) using 46 protocols (2018-2024) from the HMA-EMA Catalogue and Sentinel System. Performance was assessed across relevance, logic of justification, and ontology-code agreement across multiple coding systems using Least-to-Most (LTM) and Active Prompting strategies.

Results: GPT-4o and DeepSeek-R1 paired with LTM prompting achieved the highest relevance and logic of justification scores, with GPT-4o-LTM reaching a median relevance score of 4 in 8 of 9 questions for HMA-EMA protocols. Biomedical LLMs showed lower relevance overall and frequently generated insufficient justification. All LLMs demonstrated limited proficiency in ontology-code mapping, although LTM provided the most consistent improvements in reasoning stability.

Conclusion: Off-the-shelf general-purpose LLMs currently offer superior support for pharmacoepidemiologic design compared to biomedical LLMs. Prompt strategy strongly influenced LLM performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

General LLMs with LTM prompting scored higher than two small biomedical models on relevance and justification for real pharmacoepidemiology protocols, but the gap rests on unblinded team ratings and size differences.

The main takeaway is that off-the-shelf general models like GPT-4o and DeepSeek-R1, paired with least-to-most prompting, produced more relevant and logically sound study designs than the two biomedical models tested, according to the authors' scores on 46 protocols from HMA-EMA and Sentinel catalogues. Prompt strategy mattered more than domain fine-tuning in their results, and ontology mapping stayed weak across the board.

Referee Report

3 major / 2 minor

Summary. The paper claims that general-purpose LLMs (GPT-4o and DeepSeek-R1) with advanced prompting strategies like Least-to-Most (LTM) outperform biomedically fine-tuned LLMs in supporting pharmacoepidemiologic study design, based on evaluations of relevance, logic of justification, and ontology-code agreement using 46 real-world protocols from HMA-EMA and Sentinel catalogues.

Significance. Should the results prove reliable upon addressing methodological details, this finding would be significant for the field as it challenges the assumption that domain-specific fine-tuning is necessary or beneficial for LLMs in specialized medical research design tasks. It emphasizes prompt engineering's impact and could guide researchers in selecting tools for automating aspects of pharmacoepidemiology, potentially improving efficiency in study protocol development.

major comments (3)

[Results] The quantitative scores, such as the median relevance of 4 for GPT-4o-LTM on 8/9 HMA-EMA questions, rely on subjective assessments by the study team. No information is provided regarding blinding to the LLM identity, the number of evaluators, the detailed scoring criteria, or measures of inter-rater agreement, which undermines confidence in the comparative claims.
[Methods] The biomedical LLMs evaluated are much smaller (8B and 1.5B parameters) than the general-purpose ones. This size disparity is a potential confounder for the observed performance gap, and the manuscript does not address whether the differences are due to domain specialization or model capacity.
[Methods] The selection of 46 protocols from two catalogues may not sufficiently represent the full range of pharmacoepidemiologic study designs (e.g., those with complex time-dependent exposures or linked databases), raising questions about whether the superiority of general-purpose LLMs holds across broader applications.
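The first major comment asks for a measure of inter-rater agreement. As an illustration of what such a measure could look like, here is a minimal sketch of unweighted Cohen's kappa for two raters scoring the same items. The paper does not state which agreement statistic, if any, the authors used, so this choice is an assumption; for ordinal Likert scores a weighted kappa is often preferred because it gives partial credit for near-misses.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must score the same non-empty set of items.")

    n = len(rater_a)
    # Observed agreement: share of items given identical scores.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from each rater's marginal score frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw percentage agreement for agreement expected by chance, which matters when most scores cluster on one or two Likert values.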

minor comments (2)

[Abstract] The abstract reports 'ontology-code agreement' but provides no specific quantitative results or examples, making it hard to assess the extent of the limitation.
[Abstract] Acronyms such as HMA-EMA, LTM, and Sentinel System should be defined on first use for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in transparency and discussion of limitations. We address each major comment below and indicate the revisions we will make to the manuscript.

Referee: [Results] The quantitative scores, such as the median relevance of 4 for GPT-4o-LTM on 8/9 HMA-EMA questions, rely on subjective assessments by the study team. No information is provided regarding blinding to the LLM identity, the number of evaluators, the detailed scoring criteria, or measures of inter-rater agreement, which undermines confidence in the comparative claims.

Authors: We agree that greater transparency regarding the evaluation process is warranted. The original manuscript did not include these details. We will revise the Methods section to specify the number of evaluators from the study team, the detailed scoring criteria and rubrics used for relevance and justification, and any inter-rater agreement measures. We will also state that evaluators were not blinded to LLM identity due to distinct output characteristics and discuss this as a limitation.
revision: yes
Referee: [Methods] The biomedical LLMs evaluated are much smaller (8B and 1.5B parameters) than the general-purpose ones. This size disparity is a potential confounder for the observed performance gap, and the manuscript does not address whether the differences are due to domain specialization or model capacity.

Authors: This is a valid observation that was not addressed in the original submission. We will add a discussion in the Limitations section acknowledging the size disparity as a potential confounder and noting that the performance gap may reflect both domain specialization and model capacity. We will recommend future comparisons using models of comparable sizes to isolate these effects, while maintaining that the results reflect currently available biomedical LLMs.
revision: partial
Referee: [Methods] The selection of 46 protocols from two catalogues may not sufficiently represent the full range of pharmacoepidemiologic study designs (e.g., those with complex time-dependent exposures or linked databases), raising questions about whether the superiority of general-purpose LLMs holds across broader applications.

Authors: We selected the 46 protocols from established HMA-EMA and Sentinel catalogues to represent real-world studies from 2018-2024. We acknowledge that this sample may not encompass all design variations. We will revise the Discussion to explicitly address generalizability limitations and recommend validation on additional protocol types, including those with complex time-dependent exposures or linked databases.
revision: partial

Circularity Check

0 steps flagged

No significant circularity in this empirical LLM evaluation study

The paper conducts a direct empirical comparison of off-the-shelf general-purpose LLMs (GPT-4o, DeepSeek-R1) against smaller biomedical LLMs on 46 real pharmacoepidemiologic protocols drawn from external catalogues. Performance is measured by human-assigned scores on relevance, logic of justification, and ontology-code agreement under different prompting strategies. No mathematical derivations, equations, fitted parameters, or predictions appear; the central claims rest on straightforward evaluation against independent human-designed protocols rather than any self-referential reduction or self-citation chain. The study is therefore self-contained with no load-bearing steps that collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of the protocol sample and the validity of the qualitative scoring metrics. No free parameters or new entities introduced.

axioms (2)

domain assumption: The selected 46 protocols from 2018-2024 are representative of pharmacoepidemiologic studies.
Used as the test set for evaluation.
domain assumption: Human or expert assessment of relevance and logic is reliable and unbiased.
Basis for scoring LLM performance.

pith-pipeline@v0.9.0 · 5600 in / 1495 out tokens · 73105 ms · 2026-05-10T04:24:48.399807+00:00 · methodology

Reference graph

Works this paper leans on

71 extracted references · 15 canonical work pages

[1]

Introduction The growing need for rapid evidence generation in pharmacoepidemiology places strong demands on study design and data quality.[1] This is particularly important for populations underrepresented in clinical trials and comparative effectiveness/safety research in routine care. [2,3] As artificial intelligence (AI) becomes increasingly integrat...
[2]

drug and outcome analysis

Methods 2.1 Data Sources This study evaluated LLM performance using study protocols from two pharmacoepidemiologic sources (i.e., HMA-EMA Catalogue and Sentinel System). From the HMA-EMA Catalogue of Real-World Data (RWD) Studies, all available DARWIN EU® protocols were included at the time of analysis (n=16). In addition, 15 non-DARWIN expert-develop...

2023
[3]

LLMs were paired with seven prompt engineering strategies identified from the literature:
[4]

1) basic prompt with term definitions, 2) synthetic prompting, 3) active-prompt, 4) plan-and-solve,
[5]

[20–22] Together, this yielded 42 model-prompt combinations for the initial pre-assessment of the most performing combinations

5) least-to-most (LTM), 6) Tree-of-Thought, and 7) decomposition prompting. [20–22] Together, this yielded 42 model-prompt combinations for the initial pre-assessment of the most performing combinations. Prompt construction was guided by the CLEAR framework, which emphasizes concise, logical, explicit, adaptive, and reflective prompt design. [23] A pre-asse...
[6]

The 46 protocols are available at https://github.com/madelsu/LLM-for-pharmacoepi-study-design/tree/main/Data

Results 3.1 Descriptive Analysis A total of 46 pharmacoepidemiological protocols were included: 16 from DARWIN EU®, 15 from the HMA-EMA Catalogue, and 15 from the Sentinel System. The 46 protocols are available at https://github.com/madelsu/LLM-for-pharmacoepi-study-design/tree/main/Data. There was full agreement between the human experts when extracting ...

2018
[7]

These uncertainties highlight the need for systematic evaluation of LLM capabilities in this domain

Discussion The use of LLMs in pharmacoepidemiology has received increasing research attention in recent years.[11] Despite this growing interest, important concerns remain regarding the reliability of general-purpose LLMs, the suitability of their training data for scientific and regulatory tasks, and the performance of newly released models and prompt-e...
[8]

It is also in line with Dada et al., who showed that increasing biomedical specialization does not necessarily translate into better performance on medical tasks and may come at the cost of instruction-following or broader task adaptability. [27] Together, these findings suggest that pharmacoepidemiologic study design is not simply a knowledge retrieval t...
[9]

Conclusions This study showed that off-the-shelf general-purpose LLMs, particularly GPT-4o and DeepSeek-R1 combined with LTM prompting, outperformed the biomedical LLMs evaluated in this study for support of pharmacoepidemiologic study design. Their advantage was evident not only for the primary outcome of relevance, but also for logic of justification, ...
[10]

Because the study relied exclusively on publicly available study protocols, ethical approval and informed consent were not applicable

Ethics, Funding, and Conflict of Interest No funding was obtained for this study. Because the study relied exclusively on publicly available study protocols, ethical approval and informed consent were not applicable. The authors declare no conflicts of interest with respect to the conduct or reporting of this study
[11]

MS and XZ contributed to data collection, formal analysis, interpretation of results, and revision of the manuscript

Authors’ Contribution NSH and MS had primary responsibility for the study conception and design and for the acquisition, analysis, and interpretation of data, and they led the drafting of the manuscript. MS and XZ contributed to data collection, formal analysis, interpretation of results, and revision of the manuscript. MDCS and FPS contributed to interpr...
[12]

Pharmacoepidemiology

Strom B, Kimmel S, Hennessy S. Pharmacoepidemiology. 6th ed. Hoboken (NJ): Wiley-Blackwell; 2020

2020
[13]

Accelerating evidence generation: Addressing critical challenges and charting a path forward

Rim JG, Jackman JG, Hornik CP, Rutter JL, Warraich H, Wittes J, et al. Accelerating evidence generation: Addressing critical challenges and charting a path forward. J Clin Transl Sci. 2024 Oct 31;8(1):e184. doi:10.1017/cts.2024.621

work page doi:10.1017/cts.2024.621 2024
[14]

Now is the time to fix the evidence generation system

Califf RM. Now is the time to fix the evidence generation system. Clinical Trials. 2023 Feb 17;20(1):3–12. doi:10.1177/17407745221147689

work page doi:10.1177/17407745221147689 2023
[15]

Epidemiological Data Challenges: Planning for a More Robust Future Through Data Standards

Fairchild G, Tasseff B, Khalsa H, Generous N, Daughton AR, Velappan N, et al. Epidemiological Data Challenges: Planning for a More Robust Future Through Data Standards. Front Public Health. 2018 Nov 23;6. doi:10.3389/fpubh.2018.00336

work page doi:10.3389/fpubh.2018.00336 2018
[16]

Food and Drug Administration

U.S. Food and Drug Administration. FDA [Internet]. 2024 [cited 2026 Mar 24]. BERTox Initiative. Available from: https://www.fda.gov/about-fda/nctr-research-focus-areas/bertox-initiative

2024
[17]

European Union; 2025 Aug 2

EU Artificial Intelligence Act. European Union; 2025 Aug 2

2025
[18]

What Is an Exposure, What Is a Disease, and How Do We Measure Them? In: Epidemiology Matters

Keyes KM, Galea S. What Is an Exposure, What Is a Disease, and How Do We Measure Them? In: Epidemiology Matters. Oxford University Press; 2014. p. 18–32. doi:10.1093/med/9780199331246.003.0003

work page doi:10.1093/med/9780199331246.003.0003 2014
[19]

Visualizations throughout pharmacoepidemiology study planning, implementation, and reporting

Gatto NM, Wang S V., Murk W, Mattox P, Brookhart MA, Bate A, et al. Visualizations throughout pharmacoepidemiology study planning, implementation, and reporting. Pharmacoepidemiol Drug Saf. 2022 Nov 9;31(11):1140–52. doi:10.1002/pds.5529

work page doi:10.1002/pds.5529 2022
[20]

Inclusion and exclusion criteria in research studies: definitions and why they matter

Patino CM, Ferreira JC. Inclusion and exclusion criteria in research studies: definitions and why they matter. Jornal Brasileiro de Pneumologia. 2018 Apr;44(2):84–84. doi:10.1590/s1806-37562018000000088

work page doi:10.1590/s1806-37562018000000088 2018
[21]

Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology

Zhu K, Zhang J, Klishin A, Esser M, Blumentals WA, Juhaeri J, et al. Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology. Pharmacoepidemiol Drug Saf. 2025 Feb 3;34(2). doi:10.1002/pds.70111

work page doi:10.1002/pds.70111 2025
[22]

Large language models in medical and healthcare fields: applications, advances, and challenges

Wang D, Zhang S. Large language models in medical and healthcare fields: applications, advances, and challenges. Artif Intell Rev. 2024 Sep 20;57(11):299. doi:10.1007/s10462-024-10921-0

work page doi:10.1007/s10462-024-10921-0 2024
[23]

Off-the-Shelf Large Language Models for Causality Assessment of Individual Case Safety Reports: A Proof-of-Concept with COVID-19 Vaccines

Abate A, Poncato E, Barbieri M, Powell G, Rossi A, Peker S, et al. Off-the-Shelf Large Language Models for Causality Assessment of Individual Case Safety Reports: A Proof-of-Concept with COVID-19 Vaccines. Drug Saf. 2025 Jul 12;48(7):805–20. doi:10.1007/s40264-025-01531-y

work page doi:10.1007/s40264-025-01531-y 2025
[24]

Amsterdam: European Medicines Agency [Internet]

European Medicines Agency, Heads of Medicines Agencies. Amsterdam: European Medicines Agency [Internet]. 2025. HMA-EMA Catalogues of real-world data sources and studies. Available from: https://catalogues.ema.europa.eu

2025
[25]

Silver Spring (MD): Sentinel Initiative [Internet]

Sentinel Initiative. Silver Spring (MD): Sentinel Initiative [Internet]. 2025. Drug Studies. Available from: sentinelinitiative.org/studies/drugs

2025
[26]

Wang S V., Pottegård A, Crown W, Arlett P, Ashcroft DM, Benchimol EI, et al. HARmonized Protocol Template to Enhance Reproducibility of hypothesis evaluating real‐world evidence studies on treatment effects: A good practices report of a joint ISPE/ISPOR task force. Pharmacoepidemiol Drug Saf. 2023 Jan 10;32(1):44–55. doi:10.1002/pds.5507

work page doi:10.1002/pds.5507 2023
[27]

Hugging Face [Internet]

Irathernotsay. Hugging Face [Internet]. 2025. qwen2-1.5B-medical_qa-Finetune. Available from: https://huggingface.co/Irathernotsay/qwen2-1.5B-medical_qa-Finetune

2025
[28]

Hugging Face [Internet]

Plaban81. Hugging Face [Internet]. 2025. gemma-medical_qa-Finetune. Available from: https://huggingface.co/Plaban81/gemma-medical_qa-Finetune

2025
[29]

Hugging Face [Internet]

mradermacher. Hugging Face [Internet]. 2025. DeepSeek-r1-Medical-Mini-GGUF. Available from: https://huggingface.co/mradermacher/DeepSeek-r1-Medical-Mini-GGUF

2025
[30]

Hugging Face [Internet]

QuantFactory. Hugging Face [Internet]. 2024. Bio-Medical-Llama-3-8B-GGUF. Available from: https://huggingface.co/QuantFactory/Bio-Medical-Llama-3-8B-GGUF

2024
[31]

Can GPT Improve the State of Prior Authorization via Guideline Based Automated Question Answering? ArXiv

Vatsal S, Singh A, Tafreshi S. Can GPT Improve the State of Prior Authorization via Guideline Based Automated Question Answering? ArXiv. 2024 Feb 28

2024
[32]

Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Zhou D, Schärli N, Hou L, Wei J, Scales N, Wang X, et al. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. ArXiv. 2022 May 21

2022
[33]

A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks

Vatsal S, Dubey H. A Survey of Prompt Engineering Methods in Large Language Models for Different NLP Tasks. ArXiv. 2024 Jun 17

2024
[34]

The CLEAR path: A framework for enhancing information literacy through prompt engineering

Lo LS. The CLEAR path: A framework for enhancing information literacy through prompt engineering. The Journal of Academic Librarianship. 2023 Jul;49(4):102720. doi:10.1016/j.acalib.2023.102720

work page doi:10.1016/j.acalib.2023.102720 2023
[35]

The TRIPOD-LLM reporting guideline for studies using large language models

Gallifant J, Afshar M, Ameen S, Aphinyanaphongs Y, Chen S, Cacciamani G, et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med. 2025 Jan 8;31(1):60–9. doi:10.1038/s41591-024-03425-5

work page doi:10.1038/s41591-024-03425-5 2025
[36]

Reporting guidelines for chatbot health advice studies: explanation and elaboration for the Chatbot Assessment Reporting Tool (CHART). BMJ. 2025 Aug 1;390:e083305. doi:10.1136/bmj-2024-083305

work page doi:10.1136/bmj-2024-083305 2025
[37]

Biomedical Large Language Models and Prompt Engineering for Causality Assessment of Individual Case Safety Reports in Pharmacovigilance

Heckmann NS, Papoutsi DG, Barbieri MA, Battini V, Mølgaard SN, Schmidt SØ, et al. Biomedical Large Language Models and Prompt Engineering for Causality Assessment of Individual Case Safety Reports in Pharmacovigilance. medRxiv (preprint). 2026. doi:10.64898/2026.02.19.26346467

work page doi:10.64898/2026.02.19.26346467 2026
[38]

Does Biomedical Training Lead to Better Medical Performance? ArXiv

Dada A, Bauer M, Contreras A, et al. Does Biomedical Training Lead to Better Medical Performance? ArXiv. 2024 Apr 5;arXiv:2404.04067.

Table 1. Prompt Examples (columns: Concept, Least to Most, Active prompt). Study design: Based on the title [TITLE], list all the candidate study design types (e.g., cohort study, case-control study) and select the most appropriate ...

work page arXiv 2024
[39]

Align with CONSORT/STROBE guidelines
[40]

Prioritize feasibility in real-world settings
[41]

Primary prompt: Define index dates of this study

Explicitly state rejection reasons for alternative designs Then enhance prompts: [CoT Example] Objective: Assess long-term effects of Drug X on risk Step 1: Identify need for longitudinal exposure- outcome data → Cohort design Step 2: Check feasibility of randomization → Reject RCT (lack of equipoise) Step 3: Compare with case-control → Prefer cohort for ...
[42]

For pharmacological studies: Use first prescription date + 30-day washout
[43]

For surgical studies: Use procedure date ± 7-day pre-op assessment
[44]

Mandatory elements: 17 exclusion criteria medical history, missing data) should be developed for this study based on the [Q1 answer] design and [Q2 answer] index date

Flag potential immortal time bias sources [Validation] Check FDA Sentinel Common Data Model, DARWIN, HMA-EMA for alignment Inclusion and Inclusion criteria (e.g., age ≥18 years, confirmed disease) and exclusion criteria (e.g., past Generate inclusion/exclusion criteria of this study. Mandatory elements: 17 exclusion criteria medical history, missing data)...
[45]

Explicit linkage to objective-specific endpoints
[46]

Use WHO International Classification of Functioning (ICF) framework
[47]

Constraints:

Flag criteria causing selection bias >20% Inclusion and exclusion assessment window Based on the [answer to question 2] index date, identify the time window for assessing inclusion and exclusion criteria (e.g., a baseline period of 6 months before the index date, an exclusion period of 1 year after the index date) and its specific duration Primary prompt:...
[48]

Align with EMA Guideline on GCP (Rev 3)
[49]

Use moving window approach for chronic conditions
[50]

Specify allowable overlaps (±5% timeline tolerance) Enhance prompt: [CoT Example] Objective: Window Logic:
[51]

Exclusion: [Check] Apply inverse probability weighting for missing window data Exposure Based on the [Answer to Question 1] design and [Answer to Question 3] inclusion criteria, define exposure (e.g., drug dose, exposure duration) Primary prompt: Define study exposure of the study Required elements:
[52]

Dose-response granularity (ATC + RxNorm mapping)
[53]

Must include:

Exposure lag periods with biological plausibility check Competing risk adjustment plan Outcome Based on the [answer to question 1] design, specify primary/secondary outcome Primary prompt: Specify outcome for this study. Must include:
[54]

Primary/secondary endpoints definitions (e.g., laboratory confirmed diagnosis, imaging evidence)
[55]

Competing risk definitions (e.g., death censoring rules) Sensitivity analysis protocols for outcome misclassification. Follow-up period In conjunction with [Q.2 answer] Index date and [Q.6 answer] Definition of ending, set the start and end of the follow up period, whether right censoring is allowed and the minimum length of follow-up (e.g. 1 year). Quest...
[56]

Account for disease-specific latency periods
[57]

Apply landmark analysis for time-varying exposures
[58]

Specify left-truncation handling methods Enhance prompt: [CoT Example] Objective: Follow-up Logic:
[59]

Primary prompt: Identify covariates in this study

End: Censor: Death Covariate List covariates (e.g., age, sex, comorbidities, medication history) to be adjusted for based on [QUESTION 5 ANSWER] exposure and [QUESTION 6 ANSWER] outcome. Primary prompt: Identify covariates in this study. Must:
[60]

Categorize into confounders/mediators/effect modifiers
[61]

Provide LOINC codes for lab covariates
[62]

Include negative control variables Enhance prompt: [CoT Example] Objective: Covariates:
[63]

Primary prompt: Define temporal windows for covariate assessment in this study

Effect modifier: 19 Negative control: Covariate assessment window Based on the [QUESTION 2 ANSWER] index date, define the time window in which the covariates will be evaluated (e.g., laboratory data within 1 year prior to the index date). Primary prompt: Define temporal windows for covariate assessment in this study. Constraints:
[64]

Align with FDA's Structured Product Labeling (SPL) standards
[65]

Handle time-varying covariates using marginal structural models
[66]

ICD-10, ATC

Specify missing data imputation thresholds Enhance prompt: [CoT Example] Objective: Window Strategy: • Static: • Time-varying: Lagged: Ontology- code Convert diagnoses, exposures, outcomes from [Q5/6 answers] to standard codes i.e. ICD-10, ATC. Map all medical concepts in previous answers to standard terminologies based on the answer to question 5, 6 list...
[67]

Data collection methods: Self-reported questionnaires; Medical records review
[68]

Data collection methods: Physician interviews; Standardized weight and height measurements
[69]

Data collection methods: Patient diaries; Blood sample analysis
[70]

Data collection methods: Telephone surveys; Review of electronic medical records
[71]

Most appropriate answer: D

Data collection methods: Clinic visits; Nurse interviews. Most appropriate answer: D. Data collection methods: Telephone surveys; Review of electronic medical records. 29 Justification: Given the inclusion criteria defined for this study, data collection methods should be chosen to ensure accuracy and reliability in capturing relevant information about ea...

2012