Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3
The pith
Large language models can identify HIV stigma in clinical notes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors built an LLM-based pipeline that first filters clinical sentences using expert-curated stigma keywords expanded with clinical embeddings; the team then manually annotated 1,332 of the resulting sentences into four subscales and applied models to classify them. Encoder-based models such as GatorTron-large produced the strongest overall results, while generative models improved markedly once given a few labeled examples. Classification accuracy varied across subscales, highest for Negative Self-Image and lowest for Personalized Stigma, thereby demonstrating that automated stigma detection is feasible with current language models.
What carries the argument
Keyword-filtered sentence selection followed by LLM classification into four HIV stigma subscales.
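That carrying mechanism can be sketched in a few lines. This is a minimal illustration, not the authors' code: the seed keyword, toy embeddings, similarity threshold, and example sentences below are all invented for the demonstration, and the paper's actual expansion used clinical word embeddings over a large note corpus.

```python
import numpy as np

def expand_keywords(seeds, vectors, threshold=0.7):
    """Add vocabulary terms whose embedding is cosine-close to any seed keyword."""
    expanded = set(seeds)
    for word, vec in vectors.items():
        for seed in seeds:
            if seed in vectors:
                sv = vectors[seed]
                sim = np.dot(vec, sv) / (np.linalg.norm(vec) * np.linalg.norm(sv))
                if sim >= threshold:
                    expanded.add(word)
    return expanded

def filter_sentences(sentences, keywords):
    """Keep only sentences that mention at least one stigma-related keyword."""
    return [s for s in sentences if any(k in s.lower() for k in keywords)]

# Toy embeddings: "ashamed" sits close to the seed "shame"; "glucose" does not.
vectors = {
    "shame":   np.array([1.0, 0.1]),
    "ashamed": np.array([0.9, 0.2]),
    "glucose": np.array([0.0, 1.0]),
}
keywords = expand_keywords({"shame"}, vectors)
notes = [
    "Patient reports feeling ashamed of diagnosis.",
    "Fasting glucose within normal limits.",
]
candidates = filter_sentences(notes, keywords)
```

Only the candidate sentences surviving this filter would go on to annotation and model classification.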
If this is right
- Large volumes of historical and future clinical notes can be scanned automatically to flag stigma without exhaustive manual review.
- Stigma types differ in how readily they can be detected, with Negative Self-Image proving more predictable than Personalized Stigma.
- Few-shot prompting brings smaller generative models close to encoder performance, lowering the barrier to using open models.
- The resulting labels enable large-scale studies of how stigma documentation correlates with care outcomes over time.
- The pipeline supplies a reusable baseline that future work can extend or fine-tune on new data.
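The few-shot prompting mentioned above amounts to prepending labeled exemplars to each query. A minimal sketch of that prompt construction follows; the four subscale names come from the paper, while the exemplar sentences, target sentence, and prompt wording are invented for illustration and no model call is made here.

```python
SUBSCALES = [
    "Concern with Public Attitudes",
    "Disclosure Concerns",
    "Negative Self-Image",
    "Personalized Stigma",
]

def build_few_shot_prompt(examples, sentence):
    """Compose a classification prompt from labeled exemplars plus the target sentence."""
    header = (
        "Classify the clinical sentence into one of these HIV stigma subscales: "
        + ", ".join(SUBSCALES) + ".\n\n"
    )
    shots = "".join(
        f"Sentence: {text}\nLabel: {label}\n\n" for text, label in examples
    )
    return header + shots + f"Sentence: {sentence}\nLabel:"

# Two invented exemplars (a "2-shot" prompt); the paper used up to 5 shots.
examples = [
    ("Patient worries coworkers will find out about the diagnosis.", "Disclosure Concerns"),
    ("Patient describes feeling worthless since the diagnosis.", "Negative Self-Image"),
]
prompt = build_few_shot_prompt(
    examples, "Patient avoids clinic visits out of fear of being judged."
)
```

The resulting string would be sent to a generative model such as GPT-OSS-20B or LLaMA-8B, whose completion after the final "Label:" is parsed as the predicted subscale.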
Where Pith is reading between the lines
- The same keyword-plus-LLM approach could be repurposed to detect stigma tied to other conditions such as mental illness or substance use.
- Real-world use would require testing whether performance holds across different hospitals, regions, and patient demographics.
- Linking detected stigma labels to downstream outcomes could quantify how much documented stigma predicts missed appointments or treatment interruptions.
- Embedding the detector inside electronic record systems might let care teams receive timely alerts when stigma language appears in a patient's chart.
Load-bearing premise
The expert-curated keywords plus the team's manual annotations fully capture the range of stigma language that appears in real clinical notes without systematic omission or bias.
What would settle it
Independent clinicians annotate a fresh sample of notes drawn from the same population and find many stigma-bearing sentences that the original keyword filter missed or that the trained models misclassify.
Original abstract
Human immunodeficiency virus (HIV)-related stigma is a critical psychosocial determinant of health for people living with HIV (PLWH), influencing mental health, engagement in care, and treatment outcomes. Although stigma-related experiences are documented in clinical narratives, there is a lack of off-the-shelf tools to extract and categorize them. This study aims to develop a large language model (LLM)-based tool for identifying HIV stigma from clinical notes. We identified clinical notes from PLWH receiving care at the University of Florida (UF) Health between 2012 and 2022. Candidate sentences were identified using expert-curated stigma-related keywords and iteratively expanded via clinical word embeddings. A total of 1,332 sentences were manually annotated across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma. We compared GatorTron-large and BERT as encoder-based baselines, and GPT-OSS-20B, LLaMA-8B, and MedGemma-27B as generative LLMs, under zero-shot and few-shot prompting. GatorTron-large achieved the best overall performance (Micro F1 = 0.62). Few-shot prompting substantially improved generative model performance, with 5-shot GPT-OSS-20B and LLaMA-8B achieving Micro-F1 scores of 0.57 and 0.59, respectively. Performance varied by stigma subscale, with Negative Self-Image showing the highest predictability and Personalized Stigma remaining the most challenging. Zero-shot generative inference exhibited non-trivial failure rates (up to 32%). This study develops the first practical NLP tool for identifying HIV stigma in clinical notes.
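The Micro F1 metric reported in the abstract pools true positives, false positives, and false negatives across all subscales before computing F1. A short sketch, assuming a multi-label setting in which each sentence carries a set of subscale labels; the toy gold and predicted label sets below are invented:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-sentence sets of subscale labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{"Disclosure Concerns"}, {"Negative Self-Image"}, {"Personalized Stigma"}]
pred = [{"Disclosure Concerns"}, {"Negative Self-Image", "Personalized Stigma"}, set()]
score = micro_f1(gold, pred)  # pooled tp=2, fp=1, fn=1
```

Because the counts are pooled, frequent subscales dominate the score, which is one reason subscale-level results can diverge sharply from the headline 0.62.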
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the development of an LLM-based NLP system to detect HIV-related stigma in clinical narratives from people living with HIV at UF Health (2012-2022). Candidate sentences are selected via expert-curated keywords iteratively expanded with clinical embeddings, yielding 1,332 sentences manually annotated by the team into four subscales (Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, Personalized Stigma). Encoder models (GatorTron-large, BERT) and generative LLMs (GPT-OSS-20B, LLaMA-8B, MedGemma-27B) are evaluated in zero- and few-shot settings, with GatorTron-large achieving the highest Micro F1 of 0.62; few-shot prompting improves generative models, but performance varies by subscale and zero-shot generative inference shows up to 32% failure rates. The work claims to deliver the first practical tool for this task.
Significance. If the reported performance and generalizability hold under broader validation, the work would offer a useful contribution to clinical NLP by enabling extraction of psychosocial determinants from notes, potentially supporting stigma-aware care for PLWH. The multi-model comparison, subscale-specific analysis, and use of domain-specific encoders like GatorTron provide empirical grounding. However, the moderate peak F1 and reliance on internally curated data limit immediate claims of practicality.
Major comments (2)
- [Methods] Candidate sentence identification and annotation: Candidate sentences were obtained exclusively via expert-curated keywords expanded by clinical embeddings, followed by internal team annotation of 1,332 sentences into the four subscales. No inter-annotator agreement metrics, external annotators, or sampling of non-keyword sentences are reported to assess missed stigma expressions or label reliability. This directly limits the support for the central claim of a reliable, practical detection tool, as systematic omissions in the curation step would propagate to model recall and deployment performance.
- [Results] GatorTron-large reports the best Micro F1 of 0.62, with substantial subscale variation (Negative Self-Image highest, Personalized Stigma lowest) and up to 32% zero-shot failure in generative models. These figures, obtained on the keyword-filtered and internally labeled set, do not yet demonstrate the robustness required for the 'practical clinical tool' asserted in the abstract and conclusion; external validation on unfiltered notes would be needed to substantiate generalizability.
Minor comments (2)
- [Abstract] Lacks summary details on the annotation process (e.g., number of annotators, agreement), data splits, or statistical testing, which would help readers assess the strength of the reported F1 scores.
- [Methods] The manuscript would benefit from explicit discussion of how the four subscales were operationalized during annotation and any guidelines provided to annotators.
Simulated Author's Rebuttal
We are grateful to the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have made revisions to the manuscript to address the concerns raised.
Point-by-point responses
Referee: [Methods] Candidate sentence identification and annotation: Candidate sentences were obtained exclusively via expert-curated keywords expanded by clinical embeddings, followed by internal team annotation of 1,332 sentences into the four subscales. No inter-annotator agreement metrics, external annotators, or sampling of non-keyword sentences are reported to assess missed stigma expressions or label reliability. This directly limits the support for the central claim of a reliable, practical detection tool, as systematic omissions in the curation step would propagate to model recall and deployment performance.
Authors: We appreciate the referee's emphasis on the importance of robust annotation practices. Our approach to candidate sentence identification using expert-curated keywords iteratively expanded with clinical embeddings was intended to focus annotation efforts on likely relevant content in a large corpus of clinical notes, which is a practical necessity given the volume of data. The annotations were performed by a team with combined expertise in HIV clinical care, psychology, and natural language processing. Nevertheless, we acknowledge that the lack of reported inter-annotator agreement, use of external annotators, and evaluation of non-keyword sentences represents a limitation that could affect assessments of label reliability and potential missed expressions. In the revised manuscript, we will add details on the annotation guidelines and team composition in the Methods section and include a new Limitations subsection that explicitly discusses these aspects, along with plans for future studies involving broader sampling and multi-annotator validation to better support the tool's reliability.
Revision: yes
Referee: [Results] GatorTron-large reports the best Micro F1 of 0.62, with substantial subscale variation (Negative Self-Image highest, Personalized Stigma lowest) and up to 32% zero-shot failure in generative models. These figures, obtained on the keyword-filtered and internally labeled set, do not yet demonstrate the robustness required for the 'practical clinical tool' asserted in the abstract and conclusion; external validation on unfiltered notes would be needed to substantiate generalizability.
Authors: We agree that the performance metrics are derived from the keyword-filtered dataset and that this constrains strong claims about generalizability to unfiltered clinical notes. The Micro F1 score of 0.62 for GatorTron-large, while the highest among the models tested, indeed shows variation across subscales, and the failure rates in zero-shot generative approaches highlight areas for improvement. We recognize that asserting a 'practical clinical tool' may overstate the current evidence without external validation. Accordingly, in the revised manuscript, we will revise the abstract and conclusion to describe the work as developing an initial LLM-based approach with promising results on an internal dataset, and we will add explicit statements regarding the need for future external validation on diverse, unfiltered notes to establish broader applicability and robustness.
Revision: yes
Circularity Check
No circularity: empirical pipeline with independent annotations and held-out evaluation
Full rationale
The paper follows a standard empirical NLP workflow: expert-curated keywords iteratively expanded by clinical embeddings to select candidate sentences, followed by internal manual annotation of 1,332 sentences into four stigma subscales, then supervised evaluation of encoder and generative LLMs with reported Micro F1 on the annotated data. No mathematical derivations, equations, or first-principles claims exist. No predictions reduce to fitted inputs by construction, no self-citations serve as load-bearing uniqueness theorems, and no ansatzes are smuggled via prior work. Performance metrics are measured against the team's annotations rather than being tautological. The central claim rests on these empirical comparisons, which are self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Expert-curated keywords and clinical word embeddings can identify candidate sentences containing HIV stigma.
- Domain assumption: The four stigma subscales are distinct and annotatable from text.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We compared GatorTron-large and BERT as encoder-based baselines, and GPT-OSS-20B, LLaMA-8B, and MedGemma-27B as generative LLMs, under zero-shot and few-shot prompting. GatorTron-large achieved the best overall performance (Micro F1 = 0.62)."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat_induction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "A total of 1,332 sentences were manually annotated across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.