arxiv: 2604.06028 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.AI · cs.IR

A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Pith reviewed 2026-05-10 19:05 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords clinical information extraction · large language models · validation framework · substance use disorder · weak supervision · predictive validity · trustworthy AI

The pith

A multi-stage validation process allows large language models to extract substance use disorder diagnoses reliably from nearly a million clinical notes without exhaustive manual labeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a validation framework that chains prompt calibration, rule-based plausibility checks, semantic grounding, a higher-capacity judge model for uncertain outputs, limited expert review, and external checks against real care outcomes. Applied to extraction of eleven substance use categories across 919,783 notes, the framework removed unsupported or implausible extractions and produced outputs that aligned well with expert judgments while outperforming structured data in predicting later specialty care. This combination lets researchers and clinicians assess LLM performance at population scale using weak supervision instead of full annotation.

Core claim

The multi-stage validation framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation, as demonstrated by substantial agreement with experts and superior predictive performance.
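
One way to picture how the stages chain together is a routing function that sends each LLM-positive extraction to the first stage that rejects or escalates it. The sketch below is illustrative only: the `Extraction` fields, the 0.5 grounding cutoff, and the confidence-based escalation rule are assumptions made for this example, not details reported by the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Extraction:
    note_id: str
    category: str      # substance category assigned by the primary LLM
    plausible: bool    # hypothetical result of the rule-based checks
    grounding: float   # hypothetical grounding score in [0, 1]
    confidence: float  # hypothetical primary-LLM confidence in [0, 1]


def route(x: Extraction, min_grounding: float = 0.5, min_confidence: float = 0.8) -> str:
    """Name the stage that decides one LLM-positive extraction."""
    if not x.plausible:
        return "removed_by_rules"
    if x.grounding < min_grounding:
        return "removed_by_grounding"
    if x.confidence < min_confidence:
        return "sent_to_judge"  # uncertain cases go to the higher-capacity judge
    return "accepted"


def summarize(batch):
    """Count how many extractions end up at each stage."""
    return Counter(route(x) for x in batch)


batch = [
    Extraction("n1", "alcohol", True, 0.9, 0.95),
    Extraction("n2", "opioid", False, 0.9, 0.95),
    Extraction("n3", "cannabis", True, 0.2, 0.95),
    Extraction("n4", "alcohol", True, 0.8, 0.40),
]
print(summarize(batch))
```

Ordering the checks from cheapest to most expensive means the judge model and the experts only see cases that the automated filters could not settle.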

What carries the argument

The multi-stage validation framework that combines automated filters with selective review by a higher-capacity judge model and limited experts to assess LLM outputs under weak supervision.

If this is right

  • Rule-based filtering and semantic grounding can remove roughly 15 percent of unsupported or implausible LLM extractions.
  • Judge LLM assessments can serve as scalable references that agree substantially with expert review.
  • LLM-extracted diagnoses can predict subsequent clinical engagement more accurately than structured data alone.
  • Population-scale clinical extraction becomes feasible without annotation-intensive reference standards.
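
The first bullet can be made concrete with a toy filter. This is a minimal sketch under stated assumptions: token overlap stands in for whatever semantic grounding measure the paper uses, the category set is a hypothetical subset of the eleven categories, and the 0.6 cutoff is invented for illustration.

```python
import re

# Hypothetical subset; the paper's 11 categories are not listed in this summary.
ALLOWED_CATEGORIES = {"alcohol", "opioid", "cannabis"}


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(evidence: str, note: str) -> float:
    """Fraction of evidence tokens that also occur in the note (a crude proxy)."""
    ev = tokens(evidence)
    if not ev:
        return 0.0
    return len(ev & tokens(note)) / len(ev)


def keep(extraction: dict, note: str, min_score: float = 0.6) -> bool:
    """Rule-based check first, then the grounding check."""
    if extraction["category"] not in ALLOWED_CATEGORIES:
        return False
    return grounding_score(extraction["evidence"], note) >= min_score


note = "Patient reports daily drinking. Assessment: alcohol use disorder, severe."
extractions = [
    {"category": "alcohol", "evidence": "alcohol use disorder, severe"},
    {"category": "opioid", "evidence": "opioid dependence in remission"},
    {"category": "caffeine", "evidence": "daily drinking"},
]
kept = [e for e in extractions if keep(e, note)]
removed_fraction = 1 - len(kept) / len(extractions)
print(f"kept {len(kept)} of {len(extractions)}, removed {removed_fraction:.2%}")
```

The removed fraction computed this way is the toy analogue of the share of LLM-positive extractions that the paper reports as filtered.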

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged approach could be adapted to extract other clinical entities such as medications or procedures from notes.
  • Lower annotation costs might allow smaller health systems to adopt LLM tools for record review.
  • Combining LLMs with predictive validity checks against outcomes offers one route to building trust in deployed models.

Load-bearing premise

The higher-capacity judge LLM supplies reliable confirmatory labels for uncertain cases and the rule-based plus semantic filters capture most errors without introducing new biases.

What would settle it

Substantial disagreement between the judge LLM and independent expert review on a new set of high-uncertainty extractions would undermine the claim of trustworthy validation.
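
Agreement of this kind is what Gwet's AC1 measures. For two raters it is (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and p_e = (1 / (q - 1)) * sum_k pi_k * (1 - pi_k), with pi_k the average share of ratings in category k and q the number of categories. The sketch below implements that two-rater formula; the example labels are made up and do not come from the paper.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b, categories=None):
    """Gwet's AC1 for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    if categories is None:
        categories = set(rater_a) | set(rater_b)
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")

    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from the average category prevalence across both raters.
    counts = Counter(rater_a) + Counter(rater_b)
    p_e = sum((counts[k] / (2 * n)) * (1 - counts[k] / (2 * n)) for k in categories) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


# Made-up labels: 1 = extraction supported, 0 = not supported.
expert = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
judge = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
print(round(gwet_ac1(expert, judge), 3))
```

Unlike Cohen's kappa, AC1 stays stable when one label dominates, which is the usual situation when most reviewed extractions are correct.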

Figures

Figures reproduced from arXiv: 2604.06028 by Caitlin Rizy, Elizabeth M. Oliva, Elliot M. Fielstein, Gregory M. Dams, Ioana Danciu, Jodie Trafton, Joseph Erdos, Josh Arnold, Kamonica L. Craig, Maria Mahbub, Minu A. Aghevli, Sudarshan Srinivasan.

Figure 1: A multi-stage framework for trustworthy large-scale LLM extraction of clinical information
Figure 2: Direct vs chain-of-thought prompting to extract SUD diagnoses information from clinical notes.
Figure 3: Prompt for judge LLM
  Excerpt captured with the caption: "[...] was supported by the documentation. Independently, the same cases were evaluated by the judge LLM using identical source material. The SME additionally assessed the quality and appropriateness of the judge LLM’s reasoning and decision. Evaluation Metrics. Agreement between the SME and the judge LLM was quantified using inter-annotator agreement (IAA) metrics. Specifically, we used Gwet’…"
Figure 4: Predictive validity of LLM-extracted SUD diagnoses shown by ROC curves comparing outcome
Original abstract

Large language models (LLMs) show promise for extracting clinically meaningful information from unstructured health records, yet their translation into real-world settings is constrained by the lack of scalable and trustworthy validation approaches. Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale. We propose a multi-stage validation framework for LLM-based clinical information extraction that enables rigorous assessment under weak supervision. The framework integrates prompt calibration, rule-based plausibility filtering, semantic grounding assessment, targeted confirmatory evaluation using an independent higher-capacity judge LLM, selective expert review, and external predictive validity analysis to quantify uncertainty and characterize error modes without exhaustive manual annotation. We applied this framework to extraction of substance use disorder (SUD) diagnoses across 11 substance categories from 919,783 clinical notes. Rule-based filtering and semantic grounding removed 14.59% of LLM-positive extractions that were unsupported, irrelevant, or structurally implausible. For high-uncertainty cases, the judge LLM's assessments showed substantial agreement with subject matter expert review (Gwet's AC1=0.80). Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria. LLM-extracted SUD diagnoses also predicted subsequent engagement in SUD specialty care more accurately than structured-data baselines (AUC=0.80). These findings demonstrate that scalable, trustworthy deployment of LLM-based clinical information extraction is feasible without annotation-intensive evaluation.
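
The predictive validity comparison in the abstract rests on AUC, which equals the probability that a randomly chosen patient with the outcome receives a higher score than a randomly chosen patient without it, counting ties as one half. The sketch below computes it by direct pairwise comparison. The labels and scores are invented, and the two predictors are only stand-ins for an LLM-derived signal and a structured-data baseline.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented data: 1 = later engaged in specialty care, 0 = did not.
engaged = [1, 1, 0, 0]
llm_signal = [0.9, 0.4, 0.5, 0.1]  # hypothetical score derived from LLM extractions
structured_flag = [1, 0, 1, 0]     # hypothetical binary flag from structured data

print(roc_auc(engaged, llm_signal), roc_auc(engaged, structured_flag))
```

The pairwise loop is quadratic in the number of patients, so a rank-based implementation would be the practical choice at population scale.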

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a multi-stage validation framework for trustworthy large-scale clinical information extraction using LLMs. The framework combines prompt calibration, rule-based plausibility filtering, semantic grounding, confirmatory evaluation by a higher-capacity judge LLM, selective expert review, and external predictive validity analysis. It is demonstrated on extracting SUD diagnoses from 919,783 clinical notes across 11 substance categories, reporting that 14.59% of LLM-positive extractions were filtered, substantial agreement (Gwet's AC1 = 0.80) with experts on high-uncertainty cases, F1 = 0.80 for the primary LLM, and superior predictive validity (AUC = 0.80) for care engagement compared to structured baselines. The authors conclude that scalable, trustworthy LLM-based clinical IE is feasible without annotation-intensive evaluation.

Significance. If the reported metrics are robust, this work is significant for enabling population-scale clinical NLP applications by reducing the need for exhaustive manual annotations. The application to a very large corpus (nearly 1 million notes) and the inclusion of external validation against real-world care engagement records provide concrete evidence of feasibility and utility. Strengths include the integration of multiple validation stages and the focus on error mode characterization.

major comments (3)
  1. [Abstract] The claim of 'trustworthy' extraction depends on the judge LLM's reliability for high-uncertainty cases, yet only selective expert review is reported (Gwet AC1=0.80); there is no validation reported for cases where the primary and judge LLMs agree, which constitutes the bulk of outputs and leaves open the possibility of shared biases.
  2. [Methods (framework)] Exact thresholds for rule-based plausibility filtering and semantic grounding assessment are not detailed, nor are the criteria for identifying 'high-uncertainty cases' for judge LLM review; these are load-bearing for evaluating whether the 14.59% filtering introduces selection bias or misses error modes.
  3. [Results (predictive validity)] While AUC=0.80 for predicting SUD specialty care engagement is promising, this external validity does not directly confirm the correctness of the extracted diagnoses, as the association could arise from correlated but inaccurate signals; additional analyses to rule this out are needed to support the trustworthiness claim.
minor comments (2)
  1. [Abstract] The 'relaxed matching criteria' for the F1 score of 0.80 should be explicitly defined in the methods section for reproducibility.
  2. Consider adding a table summarizing the multi-stage framework components and their roles to improve clarity.
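
Major comment 2 concerns how sensitive the filtered fraction is to unreported thresholds. A simple way to probe that is to sweep the cutoff and record the share of extractions removed at each value. The sketch below assumes each extraction has a single grounding score in [0, 1]; both the scores and the threshold grid are hypothetical.

```python
def removal_rate(scores, threshold):
    """Share of extractions whose grounding score falls below the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s < threshold for s in scores) / len(scores)


def sweep(scores, thresholds):
    """Removal rate at each candidate threshold, in the order given."""
    return [(t, removal_rate(scores, t)) for t in thresholds]


# Hypothetical grounding scores for four extractions.
grounding_scores = [0.1, 0.4, 0.6, 0.9]
for threshold, rate in sweep(grounding_scores, [0.3, 0.5, 0.7]):
    print(f"threshold={threshold:.1f} removed={rate:.0%}")
```

A flat curve around the chosen cutoff would suggest the reported filtering rate is robust; a steep one would support the referee's concern about selection effects.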

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas to strengthen the claims around trustworthiness in our multi-stage LLM validation framework. We respond point-by-point to the major comments below, committing to revisions that enhance transparency and address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] The claim of 'trustworthy' extraction depends on the judge LLM's reliability for high-uncertainty cases, yet only selective expert review is reported (Gwet AC1=0.80); there is no validation reported for cases where the primary and judge LLMs agree, which constitutes the bulk of outputs and leaves open the possibility of shared biases.

    Authors: We acknowledge this as a valid limitation: while the judge LLM provides confirmatory evaluation on high-uncertainty cases and expert review on a selective subset yields substantial agreement, the bulk of outputs (where primary and judge LLMs concur) lack direct expert validation, leaving room for shared biases. To address this, the revised manuscript will include a post-hoc expert review on a random sample of agreed cases, with results reported in the Results section to quantify agreement and characterize potential biases. revision: yes

  2. Referee: [Methods (framework)] Exact thresholds for rule-based plausibility filtering and semantic grounding assessment are not detailed, nor are the criteria for identifying 'high-uncertainty cases' for judge LLM review; these are load-bearing for evaluating whether the 14.59% filtering introduces selection bias or misses error modes.

    Authors: We agree that the absence of exact thresholds and criteria in the Methods section hinders reproducibility and evaluation of selection bias from the 14.59% filtering. In the revised manuscript, we will expand the Methods to detail the specific thresholds for rule-based plausibility filtering and semantic grounding assessment, along with the precise criteria (e.g., confidence scores or disagreement flags) used to identify high-uncertainty cases for judge LLM review. We will also add a sensitivity analysis on these parameters. revision: yes

  3. Referee: [Results (predictive validity)] While AUC=0.80 for predicting SUD specialty care engagement is promising, this external validity does not directly confirm the correctness of the extracted diagnoses, as the association could arise from correlated but inaccurate signals; additional analyses to rule this out are needed to support the trustworthiness claim.

    Authors: This is a substantive point: the predictive validity analysis demonstrates utility for downstream tasks but remains indirect and could reflect correlated signals rather than diagnostic accuracy. In the revised manuscript, we will add analyses comparing LLM-extracted diagnoses to overlapping structured ICD codes and explicitly discuss this limitation in the Discussion, while maintaining that the multi-stage internal validations combined with external utility provide supportive evidence for scalable trustworthiness. revision: partial
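
The post-hoc review promised in the first response amounts to drawing a random sample of cases where the primary and judge LLMs agree, having an expert label them, and reporting the agreement rate with an uncertainty interval. The sketch below is one possible way to do that, using a seeded sample and a Wilson score interval. The case identifiers, sample size, and counts are hypothetical.

```python
import math
import random


def sample_for_review(case_ids, k, seed=0):
    """Draw k distinct case ids reproducibly for expert review."""
    rng = random.Random(seed)
    return rng.sample(sorted(case_ids), k)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical pool of cases where both models agreed.
agreed_cases = [f"case-{i}" for i in range(1000)]
review_set = sample_for_review(agreed_cases, k=100)

# Suppose the expert confirms 90 of the 100 sampled cases (made-up count).
low, high = wilson_interval(90, 100)
print(f"expert-confirmed share: 0.90, 95% interval ({low:.3f}, {high:.3f})")
```

Fixing the seed makes the review set auditable, and the Wilson interval behaves better than the normal approximation when the confirmed share is close to 1.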

Circularity Check

0 steps flagged

No significant circularity detected in the multi-stage validation framework

full rationale

The paper proposes and applies a multi-stage framework consisting of prompt calibration, rule-based plausibility filtering, semantic grounding assessment, confirmatory evaluation by an independent higher-capacity judge LLM on high-uncertainty cases, selective expert review (with Gwet's AC1=0.80 agreement), and external predictive validity against care engagement records (AUC=0.80). The primary LLM's F1=0.80 is computed using judge outputs as references, but this is an explicit component of the framework and is supported by the reported expert agreement on the judge assessments rather than reducing to a self-referential fit or definition. No equations, self-citations, or ansatzes are invoked in a load-bearing way that collapses the central claim to its inputs by construction. The derivation remains self-contained against the described external benchmarks and selective human validation.
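
To make the reference relationship explicit: when judge outputs serve as the reference, the primary LLM's F1 is computed by treating the judge's positive set as ground truth. The sketch below uses strict matching on hypothetical (note, category) pairs. The paper's relaxed matching criteria are not defined in this summary, so they are not modelled here.

```python
def precision_recall_f1(predicted: set, reference: set):
    """Precision, recall, and F1 of predicted items against a reference set."""
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# Hypothetical (note_id, category) pairs.
primary_llm = {("n1", "alcohol"), ("n2", "opioid"), ("n3", "cannabis")}
judge_llm = {("n1", "alcohol"), ("n2", "opioid"), ("n4", "alcohol")}

precision, recall, f1 = precision_recall_f1(primary_llm, judge_llm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Any error the judge shares with the primary model counts as a true positive here, which is why the expert agreement on the judge matters for interpreting the score.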

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about LLM reliability and the sufficiency of rule-based and semantic filters to remove unsupported extractions; no new physical entities or free parameters are introduced beyond standard NLP thresholds.

axioms (2)
  • domain assumption: Higher-capacity judge LLMs can serve as reliable proxies for expert review on uncertain cases
    Invoked when using judge LLM assessments to evaluate primary LLM outputs and report agreement with experts
  • domain assumption: Rule-based plausibility filters and semantic grounding capture the majority of LLM error modes
    Central to the claim that 14.59% removal leaves trustworthy extractions

pith-pipeline@v0.9.0 · 5622 in / 1436 out tokens · 49421 ms · 2026-05-10T19:05:30.031510+00:00 · methodology

Acknowledgment

This initiative is sponsored by the U.S. Department of Veterans Affairs (VA) and utilizes VA-fun...