pith. machine review for the scientific record.

arxiv: 2604.07717 · v2 · submitted 2026-04-09 · 💻 cs.CL · cs.AI

Recognition: 2 Lean theorem links

Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:34 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords HIV stigma · clinical narratives · large language models · stigma detection · electronic health records · natural language processing · stigma subscales

The pith

Large language models can identify HIV stigma in clinical notes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops the first practical tool that uses large language models to automatically extract and categorize HIV-related stigma from free-text clinical notes. Stigma acts as a barrier that affects mental health, care engagement, and treatment success for people living with HIV, yet it has remained hidden in unstructured records without dedicated extraction methods. The authors started with notes from one health system, used expert keywords and embeddings to surface candidate sentences, and created a manually labeled set covering four standard stigma subscales. They then benchmarked encoder and generative models under different prompting conditions, establishing baseline performance for the task. A successful tool would let researchers and clinicians scan thousands of records to surface patterns that manual review cannot reach.

Core claim

The authors created an LLM-based pipeline that filters clinical sentences using curated stigma keywords and clinical embeddings, draws on 1,332 manually annotated sentences spanning four subscales, and applies models to classify them. Encoder-based models such as GatorTron-large produced the strongest overall results, while generative models improved markedly once given a few labeled examples. Classification accuracy varied across subscales, highest for Negative Self-Image and lowest for Personalized Stigma, demonstrating that automated stigma detection is feasible with current language models.
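The paper's headline comparisons use micro-averaged F1 (the metric in Figure 2). As a reference point, here is a minimal pure-Python sketch of the pooled metric, assuming multi-label subscale annotations per sentence; the paper's exact labeling scheme may be strictly single-label:

```python
# Minimal micro-averaged F1, pooling counts across all four subscales.
def micro_f1(gold, pred):
    """gold, pred: lists of label sets per sentence (multi-label form)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and annotated
        fp += len(p - g)   # labels predicted but not annotated
        fn += len(g - p)   # labels annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{"Disclosure Concerns"}, {"Negative Self-Image"}, {"Personalized Stigma"}]
pred = [{"Disclosure Concerns"}, {"Negative Self-Image"}, {"Disclosure Concerns"}]
print(round(micro_f1(gold, pred), 2))  # → 0.67
```

Because true positives, false positives, and false negatives are pooled before the ratio is taken, frequent subscales dominate the score, which is one reason per-subscale breakdowns (Figure 3) matter.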

What carries the argument

Keyword-filtered sentence selection followed by LLM classification into four HIV stigma subscales.
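A minimal sketch of that two-stage front end, with a hypothetical seed keyword list and toy word vectors standing in for the paper's expert-curated keywords and clinical embeddings:

```python
import re

# Hypothetical seeds; the paper's actual curated keyword list is not public here.
SEED_KEYWORDS = {"stigma", "ashamed", "disclose", "judged"}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def expand_keywords(seeds, vectors, threshold=0.8):
    """Add vocabulary terms whose embedding is close to any seed term."""
    expanded = set(seeds)
    for word, vec in vectors.items():
        for seed in seeds:
            if seed in vectors and cosine(vec, vectors[seed]) >= threshold:
                expanded.add(word)
    return expanded

def candidate_sentences(notes, keywords):
    """Yield sentences containing at least one keyword (case-insensitive)."""
    for note in notes:
        for sentence in re.split(r"(?<=[.!?])\s+", note):
            tokens = set(re.findall(r"[a-z']+", sentence.lower()))
            if tokens & keywords:
                yield sentence

vectors = {"stigma": [1.0, 0.0], "embarrassed": [0.9, 0.1], "glucose": [0.0, 1.0]}
keywords = expand_keywords(SEED_KEYWORDS, vectors)
notes = ["Patient feels embarrassed about the diagnosis. Vitals stable."]
print(list(candidate_sentences(notes, keywords)))
# → ['Patient feels embarrassed about the diagnosis.']
```

The selected sentences then go to the classifier; everything the filter misses never reaches annotation, which is exactly the recall risk the load-bearing premise below names.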

If this is right

  • Large volumes of historical and future clinical notes can be scanned automatically to flag stigma without exhaustive manual review.
  • Stigma types differ in how readily they can be detected, with Negative Self-Image proving more predictable than Personalized Stigma.
  • Few-shot prompting brings smaller generative models close to encoder performance, lowering the barrier to using open models.
  • The resulting labels enable large-scale studies of how stigma documentation correlates with care outcomes over time.
  • The pipeline supplies a reusable baseline that future work can extend or fine-tune on new data.
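Few-shot prompting of the kind credited above amounts to simple prompt assembly; the instruction wording and example format here are illustrative guesses, not the paper's actual template:

```python
SUBSCALES = [
    "Concern with Public Attitudes",
    "Disclosure Concerns",
    "Negative Self-Image",
    "Personalized Stigma",
]

def build_prompt(sentence, examples):
    """Assemble a k-shot classification prompt from (sentence, label) pairs."""
    lines = [
        "Classify the clinical sentence into one HIV stigma subscale: "
        + ", ".join(SUBSCALES) + "."
    ]
    for ex_sentence, ex_label in examples:
        lines.append(f"Sentence: {ex_sentence}\nLabel: {ex_label}")
    lines.append(f"Sentence: {sentence}\nLabel:")
    return "\n\n".join(lines)

prompt = build_prompt(
    "He worries coworkers will find out about his status.",
    [("She avoids telling family about her diagnosis.", "Disclosure Concerns")],
)
print(prompt.endswith("Label:"))  # → True
```

The abstract reports 5-shot GPT-OSS-20B and LLaMA-8B reaching Micro-F1 of 0.57 and 0.59, so the `examples` list would hold five labeled sentences per prompt under that setting.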

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same keyword-plus-LLM approach could be repurposed to detect stigma tied to other conditions such as mental illness or substance use.
  • Real-world use would require testing whether performance holds across different hospitals, regions, and patient demographics.
  • Linking detected stigma labels to downstream outcomes could quantify how much documented stigma predicts missed appointments or treatment interruptions.
  • Embedding the detector inside electronic record systems might let care teams receive timely alerts when stigma language appears in a patient's chart.

Load-bearing premise

The expert-curated keywords plus the team's manual annotations fully capture the range of stigma language that appears in real clinical notes without systematic omission or bias.

What would settle it

Independent clinicians annotate a fresh sample of notes drawn from the same population and find many stigma-bearing sentences that the original keyword filter missed or that the trained models misclassify.

Figures

Figures reproduced from arXiv: 2604.07717 by Cheng Peng, Krishna Vaddiparti, Mattia Prosperi, Mengxian Lyu, Mengyuan Zhang, Robert L Cook, Yasir Khan, Yiyang Liu, Yonghui Wu, Ziyi Chen.

Figure 1. End-to-end workflow for developing the HIV stigma prediction model.
Figure 2. Overall Micro-F1, Micro-Precision, Micro-Recall, and Accuracy Performance Across HIV Stigma Classification Models.
Figure 3. Overall Model Performance Distribution by Stigma Subcategory.
Original abstract

Human immunodeficiency virus (HIV)-related stigma is a critical psychosocial determinant of health for people living with HIV (PLWH), influencing mental health, engagement in care, and treatment outcomes. Although stigma-related experiences are documented in clinical narratives, there is a lack of off-the-shelf tools to extract and categorize them. This study aims to develop a large language model (LLM)-based tool for identifying HIV stigma from clinical notes. We identified clinical notes from PLWH receiving care at the University of Florida (UF) Health between 2012 and 2022. Candidate sentences were identified using expert-curated stigma-related keywords and iteratively expanded via clinical word embeddings. A total of 1,332 sentences were manually annotated across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma. We compared GatorTron-large and BERT as encoder-based baselines, and GPT-OSS-20B, LLaMA-8B, and MedGemma-27B as generative LLMs, under zero-shot and few-shot prompting. GatorTron-large achieved the best overall performance (Micro F1 = 0.62). Few-shot prompting substantially improved generative model performance, with 5-shot GPT-OSS-20B and LLaMA-8B achieving Micro-F1 scores of 0.57 and 0.59, respectively. Performance varied by stigma subscale, with Negative Self-Image showing the highest predictability and Personalized Stigma remaining the most challenging. Zero-shot generative inference exhibited non-trivial failure rates (up to 32%). This study develops the first practical NLP tool for identifying HIV stigma in clinical notes.
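The zero-shot failure rates the abstract reports correspond to generative outputs that cannot be mapped to any subscale. One common handling (a hypothetical sketch, not the paper's described method) is a tolerant parser that returns None for unmappable answers, which are then counted as failures:

```python
SUBSCALES = [
    "Concern with Public Attitudes",
    "Disclosure Concerns",
    "Negative Self-Image",
    "Personalized Stigma",
]

def parse_label(raw_output):
    """Map a model's free-text answer to exactly one subscale, or None.

    Case-insensitive substring matching; answers naming zero or several
    subscales are treated as failures.
    """
    text = raw_output.lower()
    hits = [s for s in SUBSCALES if s.lower() in text]
    return hits[0] if len(hits) == 1 else None

outputs = ["Label: Negative Self-Image.", "I am not able to determine this."]
labels = [parse_label(o) for o in outputs]
failure_rate = labels.count(None) / len(labels)
print(failure_rate)  # → 0.5
```

Under a convention like this, the abstract's "up to 32%" figure would mean roughly a third of zero-shot generative answers fell through the parser.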

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the development of an LLM-based NLP system to detect HIV-related stigma in clinical narratives from people living with HIV at UF Health (2012-2022). Candidate sentences are selected via expert-curated keywords iteratively expanded with clinical embeddings, yielding 1,332 sentences manually annotated by the team into four subscales (Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, Personalized Stigma). Encoder models (GatorTron-large, BERT) and generative LLMs (GPT-OSS-20B, LLaMA-8B, MedGemma-27B) are evaluated in zero- and few-shot settings, with GatorTron-large achieving the highest Micro F1 of 0.62; few-shot prompting improves generative models, but performance varies by subscale and zero-shot generative inference shows up to 32% failure rates. The work claims to deliver the first practical tool for this task.

Significance. If the reported performance and generalizability hold under broader validation, the work would offer a useful contribution to clinical NLP by enabling extraction of psychosocial determinants from notes, potentially supporting stigma-aware care for PLWH. The multi-model comparison, subscale-specific analysis, and use of domain-specific encoders like GatorTron provide empirical grounding. However, the moderate peak F1 and reliance on internally curated data limit immediate claims of practicality.

major comments (2)
  1. [Methods] Candidate Sentence Identification and Annotation: Candidate sentences were obtained exclusively via expert-curated keywords expanded by clinical embeddings, followed by internal team annotation of 1,332 sentences into the four subscales. No inter-annotator agreement metrics, external annotators, or sampling of non-keyword sentences are reported to assess missed stigma expressions or label reliability. This directly limits the support for the central claim of a reliable, practical detection tool, as systematic omissions in the curation step would propagate to model recall and deployment performance.
  2. [Results] GatorTron-large reports the best Micro F1 of 0.62, with substantial subscale variation (Negative Self-Image highest, Personalized Stigma lowest) and up to 32% zero-shot failure in generative models. These figures, obtained on the keyword-filtered and internally labeled set, do not yet demonstrate the robustness required for the 'practical clinical tool' asserted in the abstract and conclusion; external validation on unfiltered notes would be needed to substantiate generalizability.
minor comments (2)
  1. [Abstract] The abstract lacks summary details on the annotation process (e.g., number of annotators, agreement), data splits, or statistical testing, which would help readers assess the strength of the reported F1 scores.
  2. [Methods] The manuscript would benefit from explicit discussion of how the four subscales were operationalized during annotation and any guidelines provided to annotators.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we have made revisions to the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [Methods] Candidate Sentence Identification and Annotation: Candidate sentences were obtained exclusively via expert-curated keywords expanded by clinical embeddings, followed by internal team annotation of 1,332 sentences into the four subscales. No inter-annotator agreement metrics, external annotators, or sampling of non-keyword sentences are reported to assess missed stigma expressions or label reliability. This directly limits the support for the central claim of a reliable, practical detection tool, as systematic omissions in the curation step would propagate to model recall and deployment performance.

    Authors: We appreciate the referee's emphasis on the importance of robust annotation practices. Our approach to candidate sentence identification using expert-curated keywords iteratively expanded with clinical embeddings was intended to focus annotation efforts on likely relevant content in a large corpus of clinical notes, which is a practical necessity given the volume of data. The annotations were performed by a team with combined expertise in HIV clinical care, psychology, and natural language processing. Nevertheless, we acknowledge that the lack of reported inter-annotator agreement, use of external annotators, and evaluation of non-keyword sentences represents a limitation that could affect assessments of label reliability and potential missed expressions. In the revised manuscript, we will add details on the annotation guidelines and team composition in the Methods section and include a new Limitations subsection that explicitly discusses these aspects, along with plans for future studies involving broader sampling and multi-annotator validation to better support the tool's reliability. revision: yes

  2. Referee: [Results] GatorTron-large reports the best Micro F1 of 0.62, with substantial subscale variation (Negative Self-Image highest, Personalized Stigma lowest) and up to 32% zero-shot failure in generative models. These figures, obtained on the keyword-filtered and internally labeled set, do not yet demonstrate the robustness required for the 'practical clinical tool' asserted in the abstract and conclusion; external validation on unfiltered notes would be needed to substantiate generalizability.

    Authors: We agree that the performance metrics are derived from the keyword-filtered dataset and that this constrains strong claims about generalizability to unfiltered clinical notes. The Micro F1 score of 0.62 for GatorTron-large, while the highest among the models tested, indeed shows variation across subscales, and the failure rates in zero-shot generative approaches highlight areas for improvement. We recognize that asserting a 'practical clinical tool' may overstate the current evidence without external validation. Accordingly, in the revised manuscript, we will revise the abstract and conclusion to describe the work as developing an initial LLM-based approach with promising results on an internal dataset, and we will add explicit statements regarding the need for future external validation on diverse, unfiltered notes to establish broader applicability and robustness. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent annotations and held-out evaluation

Full rationale

The paper follows a standard empirical NLP workflow: expert-curated keywords iteratively expanded by clinical embeddings to select candidate sentences, followed by internal manual annotation of 1,332 sentences into four stigma subscales, then supervised evaluation of encoder and generative LLMs with reported Micro F1 on the annotated data. There are no mathematical derivations, equations, or first-principles claims. No predictions reduce to fitted inputs by construction, no self-citations serve as load-bearing uniqueness theorems, and no ansatzes are smuggled in via prior work. Performance metrics are measured against the team's annotations rather than being tautological. The central claim rests on these empirical comparisons, which are self-contained rather than validated against external benchmarks.
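The held-out evaluation this audit refers to presumes some split of the 1,332 annotated sentences. The paper's actual split protocol is not described here; a generic stratified split that preserves each subscale's proportion would look like:

```python
import random
from collections import defaultdict

def stratified_split(items, test_frac=0.2, seed=0):
    """Split (sentence, label) pairs into train/test, keeping each
    label's share roughly constant in both halves."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[item[1]].append(item)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        k = max(1, round(len(group) * test_frac))
        test.extend(group[:k])
        train.extend(group[k:])
    return train, test

items = [(f"sent_{i}", "Disclosure Concerns") for i in range(10)] + \
        [(f"sent_{i}", "Negative Self-Image") for i in range(10, 20)]
train, test = stratified_split(items)
print(len(train), len(test))  # → 16 4
```

Stratification matters here because the subscales are unlikely to be balanced, and a random split could starve the rarest subscale's test set.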

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on domain-specific assumptions about how stigma manifests in clinical text and the reliability of manual labeling.

axioms (2)
  • domain assumption Expert-curated keywords and clinical word embeddings can identify candidate sentences containing HIV stigma
    Used to select 1,332 sentences for annotation from notes between 2012-2022.
  • domain assumption The four stigma subscales are distinct and annotatable from text
    Based on established scales for annotation.

pith-pipeline@v0.9.0 · 5636 in / 1269 out tokens · 41686 ms · 2026-05-10T18:34:00.571980+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

25 extracted references · 6 canonical work pages · 4 internal anchors


  3. [3]

    The association between HIV-stigma and depressive symptoms among people living with HIV/AIDS: A systematic review of studies conducted in South Africa

    MacLean JR, Wetherall K. The association between HIV-stigma and depressive symptoms among people living with HIV/AIDS: A systematic review of studies conducted in South Africa. J Affect Disord. 2021;287:125-137

  4. [4]

    Turan B, Budhwani H, Fazeli PL, et al. How does stigma affect people living with HIV? The mediating roles of internalized and anticipated HIV stigma in the effects of perceived community stigma on health and psychosocial outcomes. AIDS Behav. 2017;21(1):283-291

  5. [5]

    The association of HIV-related stigma to HIV medication adherence: A systematic review and synthesis of the literature

    Sweeney SM, Vanable PA. The association of HIV-related stigma to HIV medication adherence: A systematic review and synthesis of the literature. AIDS Behav. 2016;20(1):29-50

  6. [6]

    Interpersonal mechanisms contributing to the association between HIV-related internalized stigma and medication adherence

    Blake Helms C, Turan JM, Atkins G, et al. Interpersonal mechanisms contributing to the association between HIV-related internalized stigma and medication adherence. AIDS Behav. 2017;21(1):238-247

  7. [7]

    HIV-related stigma as a barrier to achievement of global PMTCT and maternal health goals: a review of the evidence

    Turan JM, Nyblade L. HIV-related stigma as a barrier to achievement of global PMTCT and maternal health goals: a review of the evidence. AIDS Behav. 2013;17(7):2528-2539

  8. [8]

    Social support and moment-to-moment changes in treatment self-efficacy in men living with HIV: Psychosocial moderators and clinical outcomes

    Turan B, Fazeli PL, Raper JL, Mugavero MJ, Johnson MO. Social support and moment-to-moment changes in treatment self-efficacy in men living with HIV: Psychosocial moderators and clinical outcomes. Health Psychol. 2016;35(10):1126-1134

  9. [9]

    Strengthening adherence to Anti Retroviral Therapy (ART) monitoring and support: operation research to identify barriers and facilitators in Nepal

    Bam K, Rajbhandari RM, Karmacharya DB, Dixit SM. Strengthening adherence to Anti Retroviral Therapy (ART) monitoring and support: operation research to identify barriers and facilitators in Nepal. BMC Health Serv Res. 2015;15(1):188

  10. [10]

    doi:10.5114/hivar.2022.115763

  11. [11]

    Measuring stigma in people with HIV: psychometric assessment of the HIV stigma scale

    Berger BE, Ferrans CE, Lashley FR. Measuring stigma in people with HIV: psychometric assessment of the HIV stigma scale. Res Nurs Health. 2001;24(6):518-529

  12. [12]

    Validation of the HIV/AIDS Stigma Instrument - PLWA (HASI-P)

    Holzemer WL, Uys LR, Chirwa ML, et al. Validation of the HIV/AIDS Stigma Instrument - PLWA (HASI-P). AIDS Care. 2007;19(8):1002-1012

  13. [13]

    A topic modeling analysis of stigma dimensions, social, and related behavioral circumstances in clinical notes among patients with HIV

    Chen Z, Liu Y, Prosperi M, et al. A topic modeling analysis of stigma dimensions, social, and related behavioral circumstances in clinical notes among patients with HIV. Int J Med Inform. 2026;209(106269):106269

  14. [14]

    Enhanced language models for predicting and understanding HIV care disengagement: a case study in Tanzania

    Wei W, Shao J, Lyu RQ, et al. Enhanced language models for predicting and understanding HIV care disengagement: a case study in Tanzania. NPJ Digit Med. 2026;9(1):165

  15. [15]

    HIV risk score and prediction model in the United States: A scoping review

    Albernas A, Patel MD, Cook RL, Vaddiparti K, Prosperi M, Liu Y. HIV risk score and prediction model in the United States: A scoping review. AIDS Behav. 2025;29(8):2388-2407

  16. [16]

    Development of an electronic health record-based Human Immunodeficiency Virus (HIV) risk prediction model for women, incorporating social determinants of health

    Liu Y, Chen A, Cho H, Siddiqi KA, Cook RL, Prosperi M. Development of an electronic health record-based Human Immunodeficiency Virus (HIV) risk prediction model for women, incorporating social determinants of health. BMC Public Health. 2025;25(1):2257

  17. [17]

    Optimizing identification of people living with HIV from electronic medical records: Computable phenotype development and validation

    Liu Y, Siddiqi KA, Cook RL, et al. Optimizing identification of people living with HIV from electronic medical records: Computable phenotype development and validation. Methods Inf Med. 2021;60(3-04):84-94

  18. [18]

    A large language model for electronic health records

    Yang X, Chen A, PourNejatian N, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5(1):194

  19. [19]

    http://arxiv.org/abs/2403.11425

  20. [20]

    http://arxiv.org/abs/1810.04805

  21. [21]

    http://arxiv.org/abs/2407.21783

  22. [22]

    http://arxiv.org/abs/2508.10925

  23. [23]

    http://arxiv.org/abs/2507.05201

  24. [24]

    Generative large language models are all-purpose text analytics engines: text-to-text learning is all your need

    Peng C, Yang X, Chen A, et al. Generative large language models are all-purpose text analytics engines: text-to-text learning is all your need. J Am Med Inform Assoc. 2024;31(9):1892-1903

  25. [25]

    Natural language generation in healthcare: A review of methods and applications

    Lyu M, Li X, Chen Z, et al. Natural language generation in healthcare: A review of methods and applications. J Biomed Inform. 2026;176(104997):104997