pith. machine review for the scientific record.

arxiv: 2604.07717 · v2 · submitted 2026-04-09 · 💻 cs.CL · cs.AI

Recognition: 2 Lean theorem links

Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords HIV stigma · clinical narratives · large language models · stigma detection · electronic health records · natural language processing · stigma subscales

The pith

Large language models can identify HIV stigma in clinical notes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops the first practical tool that uses large language models to automatically extract and categorize HIV-related stigma from free-text clinical notes. Stigma acts as a barrier that affects mental health, care engagement, and treatment success for people living with HIV, yet it has remained hidden in unstructured records without dedicated extraction methods. The authors started with notes from one health system, used expert keywords and embeddings to surface candidate sentences, and created a manually labeled set covering four standard stigma subscales. They then benchmarked encoder and generative models under different prompting conditions, establishing baseline performance for the task. A successful tool would let researchers and clinicians scan thousands of records to surface patterns that manual review cannot reach.

Core claim

The authors created an LLM-based pipeline that filters clinical sentences using curated stigma keywords and clinical embeddings, draws on 1,332 manually annotated sentences spanning four subscales, and applies models to classify them. Encoder-based models such as GatorTron-large produced the strongest overall results, while generative models improved markedly once given a few labeled examples. Classification accuracy varied across subscales, highest for Negative Self-Image and lowest for Personalized Stigma, demonstrating that automated stigma detection is feasible with current language models.
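The paper's headline comparisons use micro-averaged F1 (the metric in Figure 2). As a reference point, here is a minimal pure-Python sketch of the pooled metric, assuming multi-label subscale annotations per sentence; the paper's exact labeling scheme may be strictly single-label:

```python
# Minimal micro-averaged F1, pooling counts across all four subscales.
def micro_f1(gold, pred):
    """gold, pred: lists of label sets per sentence (multi-label form)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and annotated
        fp += len(p - g)   # labels predicted but not annotated
        fn += len(g - p)   # labels annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{"Disclosure Concerns"}, {"Negative Self-Image"}, {"Personalized Stigma"}]
pred = [{"Disclosure Concerns"}, {"Negative Self-Image"}, {"Disclosure Concerns"}]
print(round(micro_f1(gold, pred), 2))  # → 0.67
```

Because true positives, false positives, and false negatives are pooled before the ratio is taken, frequent subscales dominate the score, which is one reason per-subscale breakdowns (Figure 3) matter.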

What carries the argument

Keyword-filtered sentence selection followed by LLM classification into four HIV stigma subscales.
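A minimal sketch of that two-stage front end, with a hypothetical seed keyword list and toy word vectors standing in for the paper's expert-curated keywords and clinical embeddings:

```python
import re

# Hypothetical seeds; the paper's actual curated keyword list is not public here.
SEED_KEYWORDS = {"stigma", "ashamed", "disclose", "judged"}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def expand_keywords(seeds, vectors, threshold=0.8):
    """Add vocabulary terms whose embedding is close to any seed term."""
    expanded = set(seeds)
    for word, vec in vectors.items():
        for seed in seeds:
            if seed in vectors and cosine(vec, vectors[seed]) >= threshold:
                expanded.add(word)
    return expanded

def candidate_sentences(notes, keywords):
    """Yield sentences containing at least one keyword (case-insensitive)."""
    for note in notes:
        for sentence in re.split(r"(?<=[.!?])\s+", note):
            tokens = set(re.findall(r"[a-z']+", sentence.lower()))
            if tokens & keywords:
                yield sentence

vectors = {"stigma": [1.0, 0.0], "embarrassed": [0.9, 0.1], "glucose": [0.0, 1.0]}
keywords = expand_keywords(SEED_KEYWORDS, vectors)
notes = ["Patient feels embarrassed about the diagnosis. Vitals stable."]
print(list(candidate_sentences(notes, keywords)))
# → ['Patient feels embarrassed about the diagnosis.']
```

The selected sentences then go to the classifier; everything the filter misses never reaches annotation, which is exactly the recall risk the load-bearing premise below names.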

If this is right

  • Large volumes of historical and future clinical notes can be scanned automatically to flag stigma without exhaustive manual review.
  • Stigma types differ in how readily they can be detected, with Negative Self-Image proving more predictable than Personalized Stigma.
  • Few-shot prompting brings smaller generative models close to encoder performance, lowering the barrier to using open models.
  • The resulting labels enable large-scale studies of how stigma documentation correlates with care outcomes over time.
  • The pipeline supplies a reusable baseline that future work can extend or fine-tune on new data.
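Few-shot prompting of the kind credited above amounts to simple prompt assembly; the instruction wording and example format here are illustrative guesses, not the paper's actual template:

```python
SUBSCALES = [
    "Concern with Public Attitudes",
    "Disclosure Concerns",
    "Negative Self-Image",
    "Personalized Stigma",
]

def build_prompt(sentence, examples):
    """Assemble a k-shot classification prompt from (sentence, label) pairs."""
    lines = [
        "Classify the clinical sentence into one HIV stigma subscale: "
        + ", ".join(SUBSCALES) + "."
    ]
    for ex_sentence, ex_label in examples:
        lines.append(f"Sentence: {ex_sentence}\nLabel: {ex_label}")
    lines.append(f"Sentence: {sentence}\nLabel:")
    return "\n\n".join(lines)

prompt = build_prompt(
    "He worries coworkers will find out about his status.",
    [("She avoids telling family about her diagnosis.", "Disclosure Concerns")],
)
print(prompt.endswith("Label:"))  # → True
```

The abstract reports 5-shot GPT-OSS-20B and LLaMA-8B reaching Micro-F1 of 0.57 and 0.59, so the `examples` list would hold five labeled sentences per prompt under that setting.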

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same keyword-plus-LLM approach could be repurposed to detect stigma tied to other conditions such as mental illness or substance use.
  • Real-world use would require testing whether performance holds across different hospitals, regions, and patient demographics.
  • Linking detected stigma labels to downstream outcomes could quantify how much documented stigma predicts missed appointments or treatment interruptions.
  • Embedding the detector inside electronic record systems might let care teams receive timely alerts when stigma language appears in a patient's chart.

Load-bearing premise

The expert-curated keywords plus the team's manual annotations fully capture the range of stigma language that appears in real clinical notes without systematic omission or bias.

What would settle it

Independent clinicians annotate a fresh sample of notes drawn from the same population and find many stigma-bearing sentences that the original keyword filter missed or that the trained models misclassify.

Figures

Figures reproduced from arXiv: 2604.07717 by Cheng Peng, Krishna Vaddiparti, Mattia Prosperi, Mengxian Lyu, Mengyuan Zhang, Robert L Cook, Yasir Khan, Yiyang Liu, Yonghui Wu, Ziyi Chen.

Figure 1. End-to-end workflow for developing the HIV stigma prediction model.
Figure 2. Overall Micro-F1, Micro-Precision, Micro-Recall, and Accuracy Performance Across HIV Stigma Classification Models.
Figure 3. Overall Model Performance Distribution by Stigma Subcategory.
Original abstract

Human immunodeficiency virus (HIV)-related stigma is a critical psychosocial determinant of health for people living with HIV (PLWH), influencing mental health, engagement in care, and treatment outcomes. Although stigma-related experiences are documented in clinical narratives, there is a lack of off-the-shelf tools to extract and categorize them. This study aims to develop a large language model (LLM)-based tool for identifying HIV stigma from clinical notes. We identified clinical notes from PLWH receiving care at the University of Florida (UF) Health between 2012 and 2022. Candidate sentences were identified using expert-curated stigma-related keywords and iteratively expanded via clinical word embeddings. A total of 1,332 sentences were manually annotated across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma. We compared GatorTron-large and BERT as encoder-based baselines, and GPT-OSS-20B, LLaMA-8B, and MedGemma-27B as generative LLMs, under zero-shot and few-shot prompting. GatorTron-large achieved the best overall performance (Micro F1 = 0.62). Few-shot prompting substantially improved generative model performance, with 5-shot GPT-OSS-20B and LLaMA-8B achieving Micro-F1 scores of 0.57 and 0.59, respectively. Performance varied by stigma subscale, with Negative Self-Image showing the highest predictability and Personalized Stigma remaining the most challenging. Zero-shot generative inference exhibited non-trivial failure rates (up to 32%). This study develops the first practical NLP tool for identifying HIV stigma in clinical notes.
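The zero-shot failure rates the abstract reports correspond to generative outputs that cannot be mapped to any subscale. One common handling (a hypothetical sketch, not the paper's described method) is a tolerant parser that returns None for unmappable answers, which are then counted as failures:

```python
SUBSCALES = [
    "Concern with Public Attitudes",
    "Disclosure Concerns",
    "Negative Self-Image",
    "Personalized Stigma",
]

def parse_label(raw_output):
    """Map a model's free-text answer to exactly one subscale, or None.

    Case-insensitive substring matching; answers naming zero or several
    subscales are treated as failures.
    """
    text = raw_output.lower()
    hits = [s for s in SUBSCALES if s.lower() in text]
    return hits[0] if len(hits) == 1 else None

outputs = ["Label: Negative Self-Image.", "I am not able to determine this."]
labels = [parse_label(o) for o in outputs]
failure_rate = labels.count(None) / len(labels)
print(failure_rate)  # → 0.5
```

Under a convention like this, the abstract's "up to 32%" figure would mean roughly a third of zero-shot generative answers fell through the parser.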

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the development of an LLM-based NLP system to detect HIV-related stigma in clinical narratives from people living with HIV at UF Health (2012-2022). Candidate sentences are selected via expert-curated keywords iteratively expanded with clinical embeddings, yielding 1,332 sentences manually annotated by the team into four subscales (Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, Personalized Stigma). Encoder models (GatorTron-large, BERT) and generative LLMs (GPT-OSS-20B, LLaMA-8B, MedGemma-27B) are evaluated in zero- and few-shot settings, with GatorTron-large achieving the highest Micro F1 of 0.62; few-shot prompting improves generative models, but performance varies by subscale and zero-shot generative inference shows up to 32% failure rates. The work claims to deliver the first practical tool for this task.

Significance. If the reported performance and generalizability hold under broader validation, the work would offer a useful contribution to clinical NLP by enabling extraction of psychosocial determinants from notes, potentially supporting stigma-aware care for PLWH. The multi-model comparison, subscale-specific analysis, and use of domain-specific encoders like GatorTron provide empirical grounding. However, the moderate peak F1 and reliance on internally curated data limit immediate claims of practicality.

major comments (2)
  1. [Methods] Candidate Sentence Identification and Annotation: Candidate sentences were obtained exclusively via expert-curated keywords expanded by clinical embeddings, followed by internal team annotation of 1,332 sentences into the four subscales. No inter-annotator agreement metrics, external annotators, or sampling of non-keyword sentences are reported to assess missed stigma expressions or label reliability. This directly limits the support for the central claim of a reliable, practical detection tool, as systematic omissions in the curation step would propagate to model recall and deployment performance.
  2. [Results] GatorTron-large reports the best Micro F1 of 0.62, with substantial subscale variation (Negative Self-Image highest, Personalized Stigma lowest) and up to 32% zero-shot failure in generative models. These figures, obtained on the keyword-filtered and internally labeled set, do not yet demonstrate the robustness required for the 'practical clinical tool' asserted in the abstract and conclusion; external validation on unfiltered notes would be needed to substantiate generalizability.
minor comments (2)
  1. [Abstract] The abstract lacks summary details on the annotation process (e.g., number of annotators, agreement), data splits, or statistical testing, which would help readers assess the strength of the reported F1 scores.
  2. [Methods] The manuscript would benefit from explicit discussion of how the four subscales were operationalized during annotation and any guidelines provided to annotators.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have made revisions to the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [Methods] Candidate Sentence Identification and Annotation: Candidate sentences were obtained exclusively via expert-curated keywords expanded by clinical embeddings, followed by internal team annotation of 1,332 sentences into the four subscales. No inter-annotator agreement metrics, external annotators, or sampling of non-keyword sentences are reported to assess missed stigma expressions or label reliability. This directly limits the support for the central claim of a reliable, practical detection tool, as systematic omissions in the curation step would propagate to model recall and deployment performance.

    Authors: We appreciate the referee's emphasis on the importance of robust annotation practices. Our approach to candidate sentence identification using expert-curated keywords iteratively expanded with clinical embeddings was intended to focus annotation efforts on likely relevant content in a large corpus of clinical notes, which is a practical necessity given the volume of data. The annotations were performed by a team with combined expertise in HIV clinical care, psychology, and natural language processing. Nevertheless, we acknowledge that the lack of reported inter-annotator agreement, use of external annotators, and evaluation of non-keyword sentences represents a limitation that could affect assessments of label reliability and potential missed expressions. In the revised manuscript, we will add details on the annotation guidelines and team composition in the Methods section and include a new Limitations subsection that explicitly discusses these aspects, along with plans for future studies involving broader sampling and multi-annotator validation to better support the tool's reliability. revision: yes

  2. Referee: [Results] GatorTron-large reports the best Micro F1 of 0.62, with substantial subscale variation (Negative Self-Image highest, Personalized Stigma lowest) and up to 32% zero-shot failure in generative models. These figures, obtained on the keyword-filtered and internally labeled set, do not yet demonstrate the robustness required for the 'practical clinical tool' asserted in the abstract and conclusion; external validation on unfiltered notes would be needed to substantiate generalizability.

    Authors: We agree that the performance metrics are derived from the keyword-filtered dataset and that this constrains strong claims about generalizability to unfiltered clinical notes. The Micro F1 score of 0.62 for GatorTron-large, while the highest among the models tested, indeed shows variation across subscales, and the failure rates in zero-shot generative approaches highlight areas for improvement. We recognize that asserting a 'practical clinical tool' may overstate the current evidence without external validation. Accordingly, in the revised manuscript, we will revise the abstract and conclusion to describe the work as developing an initial LLM-based approach with promising results on an internal dataset, and we will add explicit statements regarding the need for future external validation on diverse, unfiltered notes to establish broader applicability and robustness. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent annotations and held-out evaluation

Full rationale

The paper follows a standard empirical NLP workflow: expert-curated keywords iteratively expanded by clinical embeddings to select candidate sentences, followed by internal manual annotation of 1,332 sentences into four stigma subscales, then supervised evaluation of encoder and generative LLMs with reported Micro F1 on the annotated data. There are no mathematical derivations, equations, or first-principles claims. No predictions reduce to fitted inputs by construction, no self-citations serve as load-bearing uniqueness theorems, and no ansatzes are smuggled in via prior work. Performance metrics are measured against the team's annotations rather than being tautological. The central claim rests on these empirical comparisons, which are self-contained rather than validated against external benchmarks.
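The held-out evaluation this audit refers to presumes some split of the 1,332 annotated sentences. The paper's actual split protocol is not described here; a generic stratified split that preserves each subscale's proportion would look like:

```python
import random
from collections import defaultdict

def stratified_split(items, test_frac=0.2, seed=0):
    """Split (sentence, label) pairs into train/test, keeping each
    label's share roughly constant in both halves."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[item[1]].append(item)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        k = max(1, round(len(group) * test_frac))
        test.extend(group[:k])
        train.extend(group[k:])
    return train, test

items = [(f"sent_{i}", "Disclosure Concerns") for i in range(10)] + \
        [(f"sent_{i}", "Negative Self-Image") for i in range(10, 20)]
train, test = stratified_split(items)
print(len(train), len(test))  # → 16 4
```

Stratification matters here because the subscales are unlikely to be balanced, and a random split could starve the rarest subscale's test set.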

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on domain-specific assumptions about how stigma manifests in clinical text and the reliability of manual labeling.

axioms (2)
  • domain assumption Expert-curated keywords and clinical word embeddings can identify candidate sentences containing HIV stigma
    Used to select 1,332 sentences for annotation from notes between 2012-2022.
  • domain assumption The four stigma subscales are distinct and annotatable from text
    Based on established scales for annotation.

pith-pipeline@v0.9.0 · 5636 in / 1269 out tokens · 41686 ms · 2026-05-10T18:34:00.571980+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

25 extracted references · 6 canonical work pages · 4 internal anchors


  3. [3]

    The association between HIV-stigma and depressive symptoms among people living with HIV/AIDS: A systematic review of studies conducted in South Africa

    MacLean JR, Wetherall K. The association between HIV-stigma and depressive symptoms among people living with HIV/AIDS: A systematic review of studies conducted in South Africa. J Affect Disord. 2021;287:125-137

  4. [4]

    Turan B, Budhwani H, Fazeli PL, et al. How does stigma affect people living with HIV? The mediating roles of internalized and anticipated HIV stigma in the effects of perceived community stigma on health and psychosocial outcomes. AIDS Behav. 2017;21(1):283-291

  5. [5]

    The association of HIV-related stigma to HIV medication adherence: A systematic review and synthesis of the literature

    Sweeney SM, Vanable PA. The association of HIV-related stigma to HIV medication adherence: A systematic review and synthesis of the literature. AIDS Behav. 2016;20(1):29-50

  6. [6]

    Interpersonal mechanisms contributing to the association between HIV-related internalized stigma and medication adherence

    Blake Helms C, Turan JM, Atkins G, et al. Interpersonal mechanisms contributing to the association between HIV-related internalized stigma and medication adherence. AIDS Behav. 2017;21(1):238-247

  7. [7]

    HIV-related stigma as a barrier to achievement of global PMTCT and maternal health goals: a review of the evidence

    Turan JM, Nyblade L. HIV-related stigma as a barrier to achievement of global PMTCT and maternal health goals: a review of the evidence. AIDS Behav. 2013;17(7):2528-2539

  8. [8]

    Social support and moment-to-moment changes in treatment self-efficacy in men living with HIV: Psychosocial moderators and clinical outcomes

    Turan B, Fazeli PL, Raper JL, Mugavero MJ, Johnson MO. Social support and moment-to-moment changes in treatment self-efficacy in men living with HIV: Psychosocial moderators and clinical outcomes. Health Psychol. 2016;35(10):1126-1134

  9. [9]

    Strengthening adherence to Anti Retroviral Therapy (ART) monitoring and support: operation research to identify barriers and facilitators in Nepal

    Bam K, Rajbhandari RM, Karmacharya DB, Dixit SM. Strengthening adherence to Anti Retroviral Therapy (ART) monitoring and support: operation research to identify barriers and facilitators in Nepal. BMC Health Serv Res. 2015;15(1):188

  10. [10]

    doi:10.5114/hivar.2022.115763

  11. [11]

    Measuring stigma in people with HIV: psychometric assessment of the HIV stigma scale

    Berger BE, Ferrans CE, Lashley FR. Measuring stigma in people with HIV: psychometric assessment of the HIV stigma scale. Res Nurs Health. 2001;24(6):518-529

  12. [12]

    Validation of the HIV/AIDS Stigma Instrument - PLWA (HASI-P)

    Holzemer WL, Uys LR, Chirwa ML, et al. Validation of the HIV/AIDS Stigma Instrument - PLWA (HASI-P). AIDS Care. 2007;19(8):1002-1012

  13. [13]

    A topic modeling analysis of stigma dimensions, social, and related behavioral circumstances in clinical notes among patients with HIV

    Chen Z, Liu Y, Prosperi M, et al. A topic modeling analysis of stigma dimensions, social, and related behavioral circumstances in clinical notes among patients with HIV. Int J Med Inform. 2026;209(106269):106269

  14. [14]

    Enhanced language models for predicting and understanding HIV care disengagement: a case study in Tanzania

    Wei W, Shao J, Lyu RQ, et al. Enhanced language models for predicting and understanding HIV care disengagement: a case study in Tanzania. NPJ Digit Med. 2026;9(1):165

  15. [15]

    HIV risk score and prediction model in the United States: A scoping review

    Albernas A, Patel MD, Cook RL, Vaddiparti K, Prosperi M, Liu Y. HIV risk score and prediction model in the United States: A scoping review. AIDS Behav. 2025;29(8):2388-2407

  16. [16]

    Development of an electronic health record-based Human Immunodeficiency Virus (HIV) risk prediction model for women, incorporating social determinants of health

    Liu Y, Chen A, Cho H, Siddiqi KA, Cook RL, Prosperi M. Development of an electronic health record-based Human Immunodeficiency Virus (HIV) risk prediction model for women, incorporating social determinants of health. BMC Public Health. 2025;25(1):2257

  17. [17]

    Optimizing identification of people living with HIV from electronic medical records: Computable phenotype development and validation

    Liu Y, Siddiqi KA, Cook RL, et al. Optimizing identification of people living with HIV from electronic medical records: Computable phenotype development and validation. Methods Inf Med. 2021;60(3-04):84-94

  18. [18]

    A large language model for electronic health records

    Yang X, Chen A, PourNejatian N, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5(1):194

  19. [19]

    http://arxiv.org/abs/2403.11425

  20. [20]

    http://arxiv.org/abs/1810.04805

  21. [21]

    http://arxiv.org/abs/2407.21783

  22. [22]

    http://arxiv.org/abs/2508.10925

  23. [23]

    http://arxiv.org/abs/2507.05201

  24. [24]

    Generative large language models are all-purpose text analytics engines: text-to-text learning is all your need

    Peng C, Yang X, Chen A, et al. Generative large language models are all-purpose text analytics engines: text-to-text learning is all your need. J Am Med Inform Assoc. 2024;31(9):1892-1903

  25. [25]

    Natural language generation in healthcare: A review of methods and applications

    Lyu M, Li X, Chen Z, et al. Natural language generation in healthcare: A review of methods and applications. J Biomed Inform. 2026;176(104997):104997