A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Caitlin Rizy; Elizabeth M. Oliva; Elliot M. Fielstein; Gregory M. Dams; Ioana Danciu; Jodie Trafton; Joseph Erdos; Josh Arnold; Kamonica L. Craig; Maria Mahbub

arXiv: 2604.06028 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.AI · cs.IR

Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan, Elliot M. Fielstein, Minu A. Aghevli, Kamonica L. Craig, Elizabeth M. Oliva, Joseph Erdos, Jodie Trafton, Ioana Danciu

Pith reviewed 2026-05-10 19:05 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR

keywords clinical information extraction · large language models · validation framework · substance use disorder · weak supervision · predictive validity · trustworthy AI

The pith

A multi-stage validation process allows large language models to extract substance use disorder diagnoses reliably from nearly a million clinical notes without exhaustive manual labeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a validation framework that chains prompt calibration, rule-based plausibility checks, semantic grounding, a higher-capacity judge model for uncertain outputs, limited expert review, and external checks against real care outcomes. Applied to extraction of eleven substance use categories across 919,783 notes, the framework removed unsupported or implausible extractions and produced outputs that aligned well with expert judgments while outperforming structured data in predicting later specialty care. This combination lets researchers and clinicians assess LLM performance at population scale using weak supervision instead of full annotation.

Core claim

The multi-stage validation framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation, as demonstrated by substantial agreement with experts and superior predictive performance.

What carries the argument

The multi-stage validation framework that combines automated filters with selective review by a higher-capacity judge model and limited experts to assess LLM outputs under weak supervision.
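To make the staging concrete, here is a minimal Python sketch of how one extraction might be routed through automated filters before any judge or expert sees it. It is not the authors' implementation. The `Extraction` fields, the three-item category set, the token-overlap grounding score (a crude lexical stand-in for whatever semantic grounding measure the paper uses), and both thresholds are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass

# Hypothetical subset of substance categories; the paper covers 11.
VALID_CATEGORIES = {"alcohol", "opioid", "cannabis"}


@dataclass
class Extraction:
    category: str   # substance category asserted by the primary LLM
    evidence: str   # text span the LLM offers as support
    note_text: str  # source clinical note


def plausible(ex: Extraction) -> bool:
    """Rule-based check: known category and non-empty evidence."""
    return ex.category in VALID_CATEGORIES and bool(ex.evidence.strip())


def grounding_score(ex: Extraction) -> float:
    """Fraction of evidence tokens that also occur in the note.

    Assumption: a lexical proxy standing in for semantic grounding.
    """
    ev = ex.evidence.lower().split()
    note = set(ex.note_text.lower().split())
    return sum(t in note for t in ev) / len(ev) if ev else 0.0


def route(ex: Extraction, drop_below: float = 0.5, accept_at: float = 0.9) -> str:
    """Return 'drop', 'judge' (send to the judge LLM), or 'accept'."""
    if not plausible(ex):
        return "drop"
    score = grounding_score(ex)
    if score < drop_below:
        return "drop"
    return "accept" if score >= accept_at else "judge"
```

Under these assumed thresholds, only the middle band of grounding scores reaches the judge model, which is the sense in which the framework reserves expensive review for uncertain outputs.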

If this is right

Rule-based filtering and semantic grounding can remove the roughly 15 percent of LLM-positive extractions (14.59% in this study) that are unsupported, irrelevant, or structurally implausible.
Judge LLM assessments can serve as scalable references that agree substantially with expert review.
LLM-extracted diagnoses can predict subsequent clinical engagement more accurately than structured data alone.
Population-scale clinical extraction becomes feasible without annotation-intensive reference standards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged approach could be adapted to extract other clinical entities such as medications or procedures from notes.
Lower annotation costs might allow smaller health systems to adopt LLM tools for record review.
Combining LLMs with predictive validity checks against outcomes offers one route to building trust in deployed models.

Load-bearing premise

The higher-capacity judge LLM supplies reliable confirmatory labels for uncertain cases and the rule-based plus semantic filters capture most errors without introducing new biases.

What would settle it

Substantial disagreement between the judge LLM and independent expert review on a new set of high-uncertainty extractions would undermine the claim of trustworthy validation.
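The agreement in question is reported as Gwet's AC1, so a replication on new high-uncertainty cases would need the same statistic. The sketch below computes AC1 for two raters from the standard definition, AC1 = (p_a - p_e) / (1 - p_e), with chance agreement p_e = (1 / (q - 1)) Σ π_k (1 - π_k), where π_k is the mean of the two raters' marginal proportions for category k. The function name, the default binary category set, and the toy ratings are assumptions for illustration. The paper's actual case-level data are not reproduced here.

```python
def gwet_ac1(rater_a, rater_b, categories=(0, 1)):
    """Gwet's AC1 for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    if len(categories) < 2:
        raise ValueError("need at least two categories")

    n = len(rater_a)
    q = len(categories)

    # Observed agreement: share of items on which both raters match.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement from the averaged marginal proportions.
    p_e = 0.0
    for k in categories:
        pi_k = (list(rater_a).count(k) + list(rater_b).count(k)) / (2 * n)
        p_e += pi_k * (1 - pi_k)
    p_e /= (q - 1)

    return (p_a - p_e) / (1 - p_e)


# Toy example (made-up ratings): 1 = "diagnosis supported", 0 = "not supported".
expert = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
judge = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
ac1 = gwet_ac1(expert, judge)
```

AC1 is often preferred over Cohen's kappa when one label dominates, because its chance-agreement term does not inflate under skewed prevalence in the way kappa's does.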

Figures

Figures reproduced from arXiv: 2604.06028 by Caitlin Rizy, Elizabeth M. Oliva, Elliot M. Fielstein, Gregory M. Dams, Ioana Danciu, Jodie Trafton, Joseph Erdos, Josh Arnold, Kamonica L. Craig, Maria Mahbub, Minu A. Aghevli, Sudarshan Srinivasan.

**Figure 2.** Direct vs chain-of-thought prompting to extract SUD diagnoses information from clinical notes.

**Figure 3.** Prompt for judge LLM. […] was supported by the documentation. Independently, the same cases were evaluated by the judge LLM using identical source material. The SME additionally assessed the quality and appropriateness of the judge LLM's reasoning and decision. Evaluation Metrics: Agreement between the SME and the judge LLM was quantified using inter-annotator agreement (IAA) metrics. Specifically, we used Gwet'…

**Figure 4.** Predictive validity of LLM-extracted SUD diagnoses shown by ROC curves comparing outcome […]

Original abstract

Large language models (LLMs) show promise for extracting clinically meaningful information from unstructured health records, yet their translation into real-world settings is constrained by the lack of scalable and trustworthy validation approaches. Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale. We propose a multi-stage validation framework for LLM-based clinical information extraction that enables rigorous assessment under weak supervision. The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation. We applied this framework to extraction of substance use disorder (SUD) diagnoses across 11 substance categories from 919,783 clinical notes. Rule-based filtering and semantic grounding removed 14.59% of LLM-positive extractions that were unsupported, irrelevant, or structurally implausible. For high-uncertainty cases, the judge LLM's assessments showed substantial agreement with subject matter expert review (Gwet's AC1=0.80). Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria. LLM-extracted SUD diagnoses also predicted subsequent engagement in SUD specialty care more accurately than structured-data baselines (AUC=0.80). These findings demonstrate that scalable, trustworthy deployment of LLM-based clinical information extraction is feasible without annotation-intensive evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete multi-stage pipeline for validating LLM extractions of substance use diagnoses at scale, but the judge-LLM step leaves a moderate circularity risk that needs tighter checks.

The main thing here is a practical pipeline that combines prompt calibration, rule-based filtering, semantic grounding, a higher-capacity judge LLM for uncertain cases, selective expert review, and external predictive validity against care records. They run it on 919,783 notes for 11 substance categories, drop 14.59% of LLM positives as unsupported, report a Gwet's AC1 of 0.80 with experts on the hard cases, get an F1 of 0.80 for the primary model, and show an AUC of 0.80 for predicting specialty care engagement, which beats the structured baselines. That integrated stack at this volume is the real addition; most prior work stops at one or two of these checks.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a multi-stage validation framework for trustworthy large-scale clinical information extraction using LLMs. The framework combines prompt calibration, rule-based plausibility filtering, semantic grounding, confirmatory evaluation by a higher-capacity judge LLM, selective expert review, and external predictive validity analysis. It is demonstrated on extracting SUD diagnoses from 919,783 clinical notes across 11 substance categories, reporting that 14.59% of LLM-positive extractions were filtered, substantial agreement (Gwet's AC1 = 0.80) with experts on high-uncertainty cases, F1 = 0.80 for the primary LLM, and superior predictive validity (AUC = 0.80) for care engagement compared to structured baselines. The authors conclude that scalable, trustworthy LLM-based clinical IE is feasible without annotation-intensive evaluation.

Significance. If the reported metrics are robust, this work is significant for enabling population-scale clinical NLP applications by reducing the need for exhaustive manual annotations. The application to a very large corpus (nearly 1 million notes) and the inclusion of external validation against real-world care engagement records provide concrete evidence of feasibility and utility. Strengths include the integration of multiple validation stages and the focus on error mode characterization.

major comments (3)

[Abstract] The claim of 'trustworthy' extraction depends on the judge LLM's reliability for high-uncertainty cases, yet only selective expert review is reported (Gwet AC1=0.80); there is no validation reported for cases where the primary and judge LLMs agree, which constitutes the bulk of outputs and leaves open the possibility of shared biases.
[Methods (framework)] Exact thresholds for rule-based plausibility filtering and semantic grounding assessment are not detailed, nor are the criteria for identifying 'high-uncertainty cases' for judge LLM review; these are load-bearing for evaluating whether the 14.59% filtering introduces selection bias or misses error modes.
[Results (predictive validity)] While AUC=0.80 for predicting SUD specialty care engagement is promising, this external validity does not directly confirm the correctness of the extracted diagnoses, as the association could arise from correlated but inaccurate signals; additional analyses to rule this out are needed to support the trustworthiness claim.
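For readers weighing the third comment, it helps to recall what an AUC measures. The sketch below computes it as a rank statistic: the probability that a randomly chosen patient who later engaged in specialty care received a higher score than a randomly chosen patient who did not, with ties counted as half. The function name and the six-patient toy data are invented for illustration and are unrelated to the paper's cohort.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positives and negatives (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up outcome: 1 = later engaged in SUD specialty care.
engaged = [1, 1, 1, 0, 0, 0]
llm_flag = [1, 1, 1, 0, 0, 1]         # hypothetical note-derived flag
structured_flag = [1, 0, 0, 0, 0, 0]  # hypothetical code-derived flag

auc_llm = roc_auc(engaged, llm_flag)
auc_structured = roc_auc(engaged, structured_flag)
```

The statistic only rewards ranking engaged patients above non-engaged ones. A flag can achieve that through any signal correlated with engagement, which is why the referee treats a high AUC as supportive rather than confirmatory evidence of diagnostic correctness.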

minor comments (2)

[Abstract] The 'relaxed matching criteria' for the F1 score of 0.80 should be explicitly defined in the methods section for reproducibility.
Consider adding a table summarizing the multi-stage framework components and their roles to improve clarity.
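The first minor comment matters because an F1 score can shift considerably with the matching rule. The sketch below contrasts a strict rule with a relaxed one on toy data. The paper's actual relaxed criteria are not given in the material reviewed here, so the rule used (match on note and substance category while ignoring a severity field) is purely a hypothetical example, as are the tuple layout and all values.

```python
def f1(predicted: set, reference: set) -> float:
    """F1 over two sets of items; returns 0.0 when nothing overlaps."""
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)


def relax(items: set) -> set:
    """Hypothetical relaxed view: keep (note_id, category), drop severity."""
    return {(note_id, category) for note_id, category, _severity in items}


# Made-up (note_id, category, severity) triples.
primary = {
    ("n1", "alcohol", "moderate"),
    ("n1", "opioid", "severe"),
    ("n2", "cannabis", "mild"),
}
judge = {
    ("n1", "alcohol", "severe"),
    ("n1", "opioid", "severe"),
    ("n3", "cocaine", "mild"),
}

strict_f1 = f1(primary, judge)                  # exact triple match
relaxed_f1 = f1(relax(primary), relax(judge))   # category-level match
```

On this toy input the relaxed score is twice the strict one, which illustrates why the reported 0.80 cannot be interpreted without the definition the referee asks for.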

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas to strengthen the claims around trustworthiness in our multi-stage LLM validation framework. We respond point-by-point to the major comments below, committing to revisions that enhance transparency and address the concerns raised.

Referee: [Abstract] The claim of 'trustworthy' extraction depends on the judge LLM's reliability for high-uncertainty cases, yet only selective expert review is reported (Gwet AC1=0.80); there is no validation reported for cases where the primary and judge LLMs agree, which constitutes the bulk of outputs and leaves open the possibility of shared biases.

Authors: We acknowledge this as a valid limitation: while the judge LLM provides confirmatory evaluation on high-uncertainty cases and expert review on a selective subset yields substantial agreement, the bulk of outputs (where primary and judge LLMs concur) lack direct expert validation, leaving room for shared biases. To address this, the revised manuscript will include a post-hoc expert review on a random sample of agreed cases, with results reported in the Results section to quantify agreement and characterize potential biases.
revision: yes

Referee: [Methods (framework)] Exact thresholds for rule-based plausibility filtering and semantic grounding assessment are not detailed, nor are the criteria for identifying 'high-uncertainty cases' for judge LLM review; these are load-bearing for evaluating whether the 14.59% filtering introduces selection bias or misses error modes.

Authors: We agree that the absence of exact thresholds and criteria in the Methods section hinders reproducibility and evaluation of selection bias from the 14.59% filtering. In the revised manuscript, we will expand the Methods to detail the specific thresholds for rule-based plausibility filtering and semantic grounding assessment, along with the precise criteria (e.g., confidence scores or disagreement flags) used to identify high-uncertainty cases for judge LLM review. We will also add a sensitivity analysis on these parameters.
revision: yes

Referee: [Results (predictive validity)] While AUC=0.80 for predicting SUD specialty care engagement is promising, this external validity does not directly confirm the correctness of the extracted diagnoses, as the association could arise from correlated but inaccurate signals; additional analyses to rule this out are needed to support the trustworthiness claim.

Authors: This is a substantive point: the predictive validity analysis demonstrates utility for downstream tasks but remains indirect and could reflect correlated signals rather than diagnostic accuracy. In the revised manuscript, we will add analyses comparing LLM-extracted diagnoses to overlapping structured ICD codes and explicitly discuss this limitation in the Discussion, while maintaining that the multi-stage internal validations combined with external utility provide supportive evidence for scalable trustworthiness.
revision: partial
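The first response promises an expert review of a random sample of cases where the primary and judge LLMs agree. One plausible shape for that analysis is sketched below: draw a reproducible simple random sample, then report the expert-confirmed proportion with a Wilson score interval. The function names, the seed, the sample size, and the counts are all hypothetical. The simulated rebuttal does not say how the sample would be drawn or summarized.

```python
import math
import random


def draw_review_sample(case_ids, k, seed=0):
    """Reproducible simple random sample of k case IDs for expert review."""
    rng = random.Random(seed)
    return rng.sample(list(case_ids), k)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical: 200 agreed cases sampled, experts confirm 190 of them.
sample = draw_review_sample(range(10_000), k=200, seed=42)
low, high = wilson_interval(successes=190, n=200)
```

An interval of this kind would let readers see how tightly the agreed-case error rate is bounded, which bears directly on the shared-bias concern.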

Circularity Check

0 steps flagged

No significant circularity detected in the multi-stage validation framework

The paper proposes and applies a multi-stage framework consisting of prompt calibration, rule-based plausibility filtering, semantic grounding assessment, confirmatory evaluation by an independent higher-capacity judge LLM on high-uncertainty cases, selective expert review (with Gwet's AC1=0.80 agreement), and external predictive validity against care engagement records (AUC=0.80). The primary LLM's F1=0.80 is computed using judge outputs as references, but this is an explicit component of the framework and is supported by the reported expert agreement on the judge assessments rather than reducing to a self-referential fit or definition. No equations, self-citations, or ansatzes are invoked in a load-bearing way that collapses the central claim to its inputs by construction. The derivation remains self-contained against the described external benchmarks and selective human validation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about LLM reliability and the sufficiency of rule-based and semantic filters to remove unsupported extractions; no new physical entities or free parameters are introduced beyond standard NLP thresholds.

axioms (2)

domain assumption: Higher-capacity judge LLMs can serve as reliable proxies for expert review on uncertain cases. Invoked when using judge LLM assessments to evaluate primary LLM outputs and report agreement with experts.
domain assumption: Rule-based plausibility filters and semantic grounding capture the majority of LLM error modes. Central to the claim that 14.59% removal leaves trustworthy extractions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "Rule-based filtering and semantic grounding removed 14.59% of LLM-positive extractions... Gwet's AC1=0.80... AUC=0.80"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 4 internal anchors

[1]

arXiv preprint arXiv:2205.12689 , year=

M. Agrawal, S. Hegselmann, H. Hunter, and D. Sontag, “Large language models are few-shot clinical information extractors,”arXiv preprint arXiv:2205.12689, 2022

work page arXiv 2022
[2]

Arlington, VA: American Psychiatric Publishing, 2013

American Psychiatric Association,Diagnostic and statistical manual of mental disorders (DSM-5®). Arlington, VA: American Psychiatric Publishing, 2013

work page 2013
[3]

Abuse and M

S. Abuse and M. H. S. Administration, “Key substance use and mental health indicators in the united states: Results from the 2024 national survey on drug use and health (hhs publication no. pep25-07-007, nsduh series h-60),”Center for Behavioral Health Statistics and Quality, Substance Abuse and Mental Health Services Administration, 2025. [Online]. Avail...

work page 2024
[4]

The global burden of disease attributable to alcohol and drug use in 195 countries and territories, 1990–2016: A systematic analysis for the global burden of disease study 2016,

L. Degenhardt, F. Charlson, A. Ferrari, et al., “The global burden of disease attributable to alcohol and drug use in 195 countries and territories, 1990–2016: A systematic analysis for the global burden of disease study 2016,”The Lancet Psychiatry, vol. 5, no. 12, pp. 987–1012, 2018

work page 1990
[5]

Clinical implications of using administrative data to identify substance use disorders,

R. H. Perlis, D. V. Iosifescu, V. Castro, et al., “Clinical implications of using administrative data to identify substance use disorders,”Psychiatric Services, vol. 63, no. 8, pp. 837–837, 2012

work page 2012
[6]

The Llama 3 Herd of Models

A. Dubey, A. Jauhri, A. Pandey, et al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, et al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

anthropic

Anthropic,The claude 3 model family: Opus, sonnet, haiku,https : / / www - cdn . anthropic . com / files/4b/claude-3-model-card.pdf, 2024. 16

work page 2024
[9]

Large language models encode clinical knowledge,

K. Singhal, S. Azizi, T. Tu, et al., “Large language models encode clinical knowledge,”Nature, vol. 620, no. 7972, pp. 172–180, 2023

work page 2023
[10]

Large language models in medicine,

A. J. Thirunavukarasu, D. S. W. Ting, K. Elangovan, et al., “Large language models in medicine,” Nature Medicine, vol. 29, no. 8, pp. 1930–1940, 2023

work page 1930
[11]

Automated extraction of substance use information from clinical texts,

Y. Wang et al., “Automated extraction of substance use information from clinical texts,” inAMIA Annual Symposium Proceedings, vol. 2015, 2015, p. 2121

work page 2015
[12]

Decoding substance use disorder severity from clinical notes using a large language model,

M. Mahbub et al., “Decoding substance use disorder severity from clinical notes using a large language model,”npj Mental Health Research, vol. 4, no. 1, p. 5, 2025

work page 2025
[13]

Extracting social determinants of health from electronic health records: Development and comparison of rule-based and large language models-based methods,

B. Wang, D. Kabir, C. R. Clark, K. W. Choi, and J. W. Smoller, “Extracting social determinants of health from electronic health records: Development and comparison of rule-based and large language models-based methods,”medRxiv, pp. 2025–11, 2025.doi:10.1101/2025.11.15.25339520

work page doi:10.1101/2025.11.15.25339520 2025
[14]

Llms accelerate annotation for medical information extraction,

A. Goel et al., “Llms accelerate annotation for medical information extraction,” inmachine learning for health (ML4H), PMLR, 2023, pp. 82–100

work page 2023
[15]

Clinical text annotation–what factors are associated with the cost of time?

Q. Wei, A. Franklin, T. Cohen, and H. Xu, “Clinical text annotation–what factors are associated with the cost of time?” InAMIA Annual Symposium Proceedings, vol. 2018, 2018, p. 1552

work page 2018
[16]

Geneva: World Health Organization, 1992, vol

W.H.Organization,The ICD-10 classification of mental and behavioural disorders: clinical descriptions and diagnostic guidelines. Geneva: World Health Organization, 1992, vol. 1

work page 1992
[17]

Coding reliability and agreement of international classification of disease, 10th revision (icd-10) codes in emergency department data,

M. Peng et al., “Coding reliability and agreement of international classification of disease, 10th revision (icd-10) codes in emergency department data,”International journal of population data science, vol. 3, no. 1, p. 445, 2018

work page 2018
[18]

Coding rules for uncertain and “ruled out

O. O. Atolagbe, P. S. Romano, D. A. Southern, W. Wongtanasarasin, and W. A. Ghali, “Coding rules for uncertain and “ruled out” diagnoses in icd-10 and icd-11,”BMC Medical Informatics and Decision Making, vol. 21, no. Suppl 6, p. 386, 2021

work page 2021
[19]

Validating opioid use disorder diagnoses in administrative data: A commentary on existing evidence and future directions,

J. F. Scherrer, M. D. Sullivan, M. R. LaRochelle, and R. Grucza, “Validating opioid use disorder diagnoses in administrative data: A commentary on existing evidence and future directions,”Addiction Science & Clinical Practice, vol. 18, no. 1, p. 49, 2023

work page 2023
[20]

Diagnosis and coding of opioid misuse: A systematic scoping review and implementation framework,

R. W. Hurley, K. T. Bland, M. D. Chaskes, E. L. Hill, and M. C. Adams, “Diagnosis and coding of opioid misuse: A systematic scoping review and implementation framework,”Pain Medicine, pnaf019, 2025

work page 2025
[21]

A large language model for electronic health records,

X. Yang et al., “A large language model for electronic health records,”NPJ digital medicine, vol. 5, no. 1, p. 194, 2022

work page 2022
[22]

Leveraging open-source large language models for clinical information extraction in resource-constrained settings,

L. Builtjes, J. Bosma, M. Prokop, B. van Ginneken, and A. Hering, “Leveraging open-source large language models for clinical information extraction in resource-constrained settings,”JAMIA open, vol. 8, no. 5, ooaf109, 2025

work page 2025
[23]

Automating evaluation of ai text generation in healthcare with a large language model (llm)-as-a-judge,

E. Croxford et al., “Automating evaluation of ai text generation in healthcare with a large language model (llm)-as-a-judge,”medRxiv, pp. 2025–04, 2025

work page 2025
[24]

Llm-as-a-judge for software engineering: Literature review, vision, and the road ahead,

J. He et al., “Llm-as-a-judge for software engineering: Literature review, vision, and the road ahead,” ACM Transactions on Software Engineering and Methodology, 2025

work page 2025
[25]

A framework to assess clinical safety and hallucination rates of llms for medical text summarisation,

E. Asgari et al., “A framework to assess clinical safety and hallucination rates of llms for medical text summarisation,”npj Digital Medicine, vol. 8, no. 1, p. 274, 2025

work page 2025
[26]

Survey and analysis of hallucinations in large language models: Attribution to prompting strategies or model behavior,

D. Anh-Hoang, V. Tran, and L.-M. Nguyen, “Survey and analysis of hallucinations in large language models: Attribution to prompting strategies or model behavior,”Frontiers in Artificial Intelligence, vol. 8, p. 1622292, 2025

work page 2025
[27]

MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models, February 2025

S. Pandit et al., “Medhallu: A comprehensive benchmark for detecting medical hallucinations in large language models,”arXiv preprint arXiv:2502.14302, 2025

work page arXiv 2025
[28]

Available: https://arxiv.org/abs/2503.05777

Y. Kim et al., “Medical hallucinations in foundation models and their impact on healthcare,”arXiv preprint arXiv:2503.05777, 2025. 17

work page arXiv 2025
[29]

& Valdes, G

C. Garcia-Fernandez et al., “Trustworthy ai for medicine: Continuous hallucination detection and elimination with check,”arXiv preprint arXiv:2506.11129, 2025

work page arXiv 2025
[30]

Faithfulness hallucination detection in healthcare ai,

P. R. Vishwanath et al., “Faithfulness hallucination detection in healthcare ai,” inArtificial Intelligence and Data Science for Healthcare: Bridging Data-Centric AI and People-Centric Healthcare, 2024

work page 2024
[31]

Anempiricalevalua- tion of prompting strategies for large language models in zero-shot clinical natural language processing: Algorithm development and validation study,

S.Sivarajkumar,M.Kelley,A.Samolyk-Mazzanti,S.Visweswaran,andY.Wang,“Anempiricalevalua- tion of prompting strategies for large language models in zero-shot clinical natural language processing: Algorithm development and validation study,”JMIR Medical Informatics, vol. 12, e55318, 2024

work page 2024
[32]

Prompt engineering paradigms for medical applications: Scoping review,

J. Zaghir, M. Naguib, M. Bjelogrlic, A. Névéol, X. Tannier, and C. Lovis, “Prompt engineering paradigms for medical applications: Scoping review,”Journal of Medical Internet Research, vol. 26, e60501, 2024

work page 2024
[33]

Improving large language models for adverse drug reactions named entity recognition via error correction prompt engineering,

Y. Zhang and W. Liao, “Improving large language models for adverse drug reactions named entity recognition via error correction prompt engineering,”Journal of Biomedical Informatics, p. 104893, 2025

work page 2025
[34]

Evaluation and mitigation of the limitations of large language models in clinical decision-making,

P. Hager et al., “Evaluation and mitigation of the limitations of large language models in clinical decision-making,”Nature medicine, vol. 30, no. 9, pp. 2613–2622, 2024

work page 2024
[35]

Prompt engineering in clinical practice: Tutorial for clinicians,

J. Liu, F. Liu, C. Wang, and S. Liu, “Prompt engineering in clinical practice: Tutorial for clinicians,” Journal of Medical Internet Research, vol. 27, e72644, 2025

work page 2025
[36]

Streamlining evidence based clinical recommendations with large language models,

D. Li et al., “Streamlining evidence based clinical recommendations with large language models,”npj Digital Medicine, 2025

work page 2025
[37]

Medpromptextract (med- ical data extraction tool): Anonymization and high-fidelity automated data extraction using natural language processing and prompt engineering,

R. Srivastava, L. Bhat, S. Prasad, S. Deshpande, B. Das, and K. Jadhav, “Medpromptextract (med- ical data extraction tool): Anonymization and high-fidelity automated data extraction using natural language processing and prompt engineering,”The Journal of Applied Laboratory Medicine, vol. 10, no. 4, pp. 793–805, 2025

work page 2025
[38]

Construct validity in psychological tests.,

L. J. Cronbach and P. E. Meehl, “Construct validity in psychological tests.,”Psychological bulletin, vol. 52, no. 4, p. 281, 1955

work page 1955
[39]

The phq-9: Validity of a brief depression severity measure,

K. Kroenke, R. L. Spitzer, and J. B. Williams, “The phq-9: Validity of a brief depression severity measure,”Journal of general internal medicine, vol. 16, no. 9, pp. 606–613, 2001

work page 2001
[40]

Between-visit changes in suicidal ideation and risk of subsequent suicide attempt,

G. E. Simon et al., “Between-visit changes in suicidal ideation and risk of subsequent suicide attempt,” Depression and anxiety, vol. 34, no. 9, pp. 794–800, 2017

work page 2017
[41]

Predicting dsm-iv dependence diagnoses from addiction severity index composite scores,

S. H. Rikoon, J. S. Cacciola, D. Carise, A. I. Alterman, and A. T. McLellan, “Predicting dsm-iv dependence diagnoses from addiction severity index composite scores,”Journal of substance abuse treatment, vol. 31, no. 1, pp. 17–24, 2006

work page 2006
[42]

Predictive capacity of the audit questionnaire for alcohol-related harm,

K. M. Conigrave, J. B. Saunders, and R. B. Reznik, “Predictive capacity of the audit questionnaire for alcohol-related harm,”Addiction, vol. 90, no. 11, pp. 1479–1485, 1995

work page 1995
[43]

The audit questionnaire: Choosing a cut-off score,

K. M. Conigrave, W. D. Hall, and J. B. Saunders, “The audit questionnaire: Choosing a cut-off score,” Addiction, vol. 90, no. 10, pp. 1349–1356, 1995

work page 1995
[44]

Dsm-5 criteria for substance use disorders: Recommendations and rationale,

D. S. Hasin et al., “Dsm-5 criteria for substance use disorders: Recommendations and rationale,” American Journal of Psychiatry, vol. 170, no. 8, pp. 834–851, 2013

work page 2013
[45]

Development and multimodal validation of a substance misuse algorithm for referral to treatment using artificial intelligence (smart-ai): A retrospective deep learning study,

M. Afshar et al., “Development and multimodal validation of a substance misuse algorithm for referral to treatment using artificial intelligence (smart-ai): A retrospective deep learning study,”The Lancet Digital Health, vol. 4, no. 6, e426–e435, 2022

work page 2022
[46]

M. Afshar et al., “Deployment of real-time natural language processing and deep learning clinical decision support in the electronic health record: Pipeline implementation for an opioid misuse screener in hospitalized adults,”JMIR Medical Informatics, vol. 11, e44977, 2023

work page 2023
[47]

Automated detection of substance use information from electronic health records for a pediatric population,

Y. Ni, A. Bachtel, K. Nause, and S. Beal, “Automated detection of substance use information from electronic health records for a pediatric population,”Journal of the American Medical Informatics Association, vol. 28, no. 10, pp. 2116–2127, 2021. 18

work page 2021
[48]

Predicting treatment retention in medication for opioid use disorder: A machine learning approach using nlp and llm-derived clinical features,

F. Nateghi Haredasht et al., “Predicting treatment retention in medication for opioid use disorder: A machine learning approach using nlp and llm-derived clinical features,”Journal of the American Medical Informatics Association, vol. 32, no. 12, pp. 1865–1876, 2025

work page 2025
[49]

D. Chen, S. A. Alnassar, K. E. Avison, R. S. Huang, and S. Raman, “Large language model applications for health information extraction in oncology: Scoping review,” JMIR Cancer, vol. 11, e65984, 2025
[50]

D. Reichenpfader, H. Müller, and K. Denecke, “A scoping review of large language model based approaches for information extraction from radiology reports,” npj Digital Medicine, vol. 7, no. 1, p. 222, 2024
[51]

C. D. Manning, P. Raghavan, and H. Schütze, Introduction to information retrieval. Cambridge University Press, 2008
[52]

G. K. Savova, J. J. Masanz, P. V. Ogren, et al., “Mayo clinical text analysis and knowledge extraction system (cTAKES): Architecture, component evaluation and applications,” Journal of the American Medical Informatics Association, vol. 17, no. 5, pp. 507–513, 2010
[53]

N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using siamese BERT-networks,” arXiv preprint arXiv:1908.10084, 2019
[54]

K. Gwet, “Handbook of inter-rater reliability,” Gaithersburg, MD: STATAXIS Publishing Company, pp. 223–246, 2001
[55]

B. M. Derksen, W. Bruinsma, J. C. Goslings, and N. W. Schep, “The kappa paradox explained,” The Journal of Hand Surgery, vol. 49, no. 5, pp. 482–485, 2024
[56]

J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, pp. 159–174, 1977
[57]

S. Agarwal et al., “gpt-oss-120b & gpt-oss-20b model card,” arXiv preprint arXiv:2508.10925, 2025
[58]

American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders: DSM-IV-TR®, 4th ed., text rev. Washington, DC: American Psychiatric Publishing, 2000, ISBN: 978-0-89042-024-9. doi:10.1176/appi.books.9780890420249.dsm-iv-tr

Acknowledgment

This initiative is sponsored by the U.S. Department of Veterans Affairs (VA) and utilizes VA-fun...
