PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations
Pith reviewed 2026-05-10 06:00 UTC · model grok-4.3
The pith
Large language models produce plausible individual mental health profiles but systematically misrepresent real population distributions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models produce clinically plausible individuals while misrepresenting the populations they are drawn from. Variance compression ranges from 14 percent to 62 percent, eliminating the distributional tails of clinical reality. Test-retest correlations exceed r = 0.90, yet 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices diverge across demographic groups, calibration bias overestimates depression severity for most cohorts, and transgender populations show under-capture of documented minority stress elevations. Patterns replicate across US and Chinese model architectures.
What carries the argument
A PsychBench epidemiological audit that measures variance compression, threshold stability, symptom correlations, and calibration bias by comparing LLM-generated profiles against NHANES and NESARC-III baselines across 120 intersectional cohorts.
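The paper's exact formulas are not reproduced in this summary, but the metric suite can be illustrated with a minimal sketch on synthetic data. The definitions below are assumptions for illustration only: compression as the fraction of reference variance lost, a PHQ-9-style cutoff of 10 for threshold crossing, and bias as the mean residual against the survey mean.

```python
import numpy as np

def variance_compression(sim_scores, ref_variance):
    """Fraction of reference variance lost by the simulator (assumed definition)."""
    return 1.0 - np.var(sim_scores, ddof=1) / ref_variance

def threshold_crossing_rate(run_a, run_b, cutoff=10):
    """Share of cases whose diagnostic status flips between two runs;
    a PHQ-9-style cutoff of 10 is used purely for illustration."""
    return np.mean((np.asarray(run_a) >= cutoff) != (np.asarray(run_b) >= cutoff))

def calibration_bias(sim_scores, ref_mean):
    """Mean simulated severity minus the survey baseline mean."""
    return np.mean(sim_scores) - ref_mean

# Synthetic data mimicking the reported failure mode: a compressed,
# upward-shifted simulator that is nonetheless highly test-retest stable.
rng = np.random.default_rng(0)
ref = rng.normal(8.0, 5.0, 5000)            # stand-in survey distribution
sim_a = rng.normal(12.0, 3.0, 1000)         # run 1: narrower, higher mean
sim_b = sim_a + rng.normal(0.0, 1.5, 1000)  # run 2: correlated re-generation

print(variance_compression(sim_a, np.var(ref, ddof=1)))  # ≈ 0.64
print(threshold_crossing_rate(sim_a, sim_b))             # nonzero despite high r
print(calibration_bias(sim_a, np.mean(ref)))             # ≈ +4.0
```

The toy run reproduces the paper's qualitative pattern: a simulator can be tightly reproducible case by case while still shrinking variance and shifting the mean.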
If this is right
- For most users, LLM mental health tools risk pathologizing ordinary distress.
- For transgender users, the tools produce algorithmic erasure of documented minority stress elevations.
- Models encode racialized and gendered assumptions, such as excess irritability attributed to Black men and fatigue to women.
- The failures appear tied to current training paradigms rather than any single architecture.
- Population-level fidelity checks become necessary before deploying these simulations in training or research.
Where Pith is reading between the lines
- Training data drawn from clinical corpora with elevated base rates likely drives the overestimation of severity for most groups.
- Developers could test whether fine-tuning on demographically balanced survey responses reduces the observed variance compression.
- Synthetic patient datasets generated by these models may understate the prevalence of severe cases in real populations.
- Similar audits could be applied to non-mental-health domains where LLMs simulate individuals from known statistical populations.
Load-bearing premise
That NHANES and NESARC-III survey data serve as the correct ground-truth target distributions for what LLM-generated populations should match.
What would settle it
If new LLM simulations were shown to reproduce the same variance, symptom correlations, diagnostic prevalence rates, and group differences as the NHANES and NESARC-III data without compression or systematic bias.
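A hedged sketch of what such a check could look like on synthetic data: the variance ratio and survivor-tail mass below are illustrative stand-ins, not the paper's actual test statistics.

```python
import numpy as np

def fidelity_report(survey, simulated):
    """Two crude distribution checks: variance ratio (1.0 = faithful) and
    mass beyond the survey's 95th percentile (about 0.05 if tails survive)."""
    var_ratio = np.var(simulated, ddof=1) / np.var(survey, ddof=1)
    tail_mass = np.mean(simulated > np.percentile(survey, 95))
    return var_ratio, tail_mass

rng = np.random.default_rng(1)
survey = rng.normal(8.0, 5.0, 5000)      # stand-in for a survey baseline
compressed = rng.normal(8.0, 3.0, 2000)  # simulator with squeezed variance
faithful = rng.normal(8.0, 5.0, 2000)    # what passing the audit would look like

print(fidelity_report(survey, compressed))  # low variance ratio, near-empty tail
print(fidelity_report(survey, faithful))    # ratio near 1, tail mass near 0.05
```

A compressed simulator fails both checks at once: its variance ratio drops well below one and almost no cases land in the survey's upper tail, which is the "eliminated tails" signature the paper describes.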
Original abstract
Large language models are increasingly deployed to simulate patients for clinical training, research, and mental health tools, yet population-level validity remains largely untested. We introduce PsychBench, the first epidemiological audit of LLM patient simulation: 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) evaluated against NHANES and NESARC-III baselines across 120 intersectional cohorts. The central finding is a coherence-fidelity dissociation: models produce clinically plausible individuals while misrepresenting the populations they are drawn from. Variance compression ranges from 14 percent (GLM-4.7) to 62 percent (DeepSeek-V3), eliminating the distributional tails of clinical reality. Despite test-retest correlations above r = 0.90, 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices diverge across demographic groups beyond split-half noise, with transgender populations diverging three to five times more than racial differences. Calibration bias is systematic and asymmetric. Models overestimate depression severity for most groups by 3.6 to 6.1 points (Cohen d = 1.13 to 1.91), consistent with training on clinical corpora with elevated base rates. For transgender women the direction inverts: models capture only 8 to 46 percent of documented minority stress elevation, yielding a -5.42 residual (d = -1.55). Models also attribute irritability to Black men and fatigue to women beyond matched controls, encoding racialized and gendered assumptions. Patterns replicate across US and Chinese architectures, indicating failures tied to current training paradigms rather than isolated implementations. For most users, LLM mental health tools risk pathologizing ordinary distress; for transgender users, algorithmic erasure of genuine need. The patients look right. They do not represent real populations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PsychBench, the first epidemiological audit of LLM patient simulations for mental health. It generates 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) across 120 intersectional cohorts and evaluates them against NHANES and NESARC-III baselines on variance, test-retest stability (r > 0.90), symptom correlations, and calibration. The central claim is a coherence-fidelity dissociation: models produce clinically plausible individuals but exhibit variance compression (14% for GLM-4.7 to 62% for DeepSeek-V3), 36.66% diagnostic threshold crossing between runs, divergent symptom matrices (especially 3-5x larger for transgender groups), systematic overestimation of depression severity (3.6-6.1 points, d=1.13-1.91), and under-capture of minority-stress elevation in transgender women (only 8-46% captured, -5.42 residual, d=-1.55). Patterns replicate across architectures, with additional encoding of racialized/gendered assumptions.
Significance. If the results hold, the work is significant as a large-scale empirical demonstration that LLM mental health simulations can appear coherent at the individual level while failing to represent population distributions, with direct implications for clinical training tools and risk of pathologizing ordinary distress or erasing genuine need in transgender cohorts. The scale (28,800 profiles, 120 cohorts, four models), concrete quantitative metrics, and cross-architecture replication provide a reproducible foundation for future audits. Credit is due for the falsifiable predictions and focus on intersectional effects rather than aggregate performance.
major comments (2)
- [Results (transgender cohort analysis)] The central fidelity-dissociation claim rests on NHANES and NESARC-III serving as the normative ground truth for population-level statistics, including for transgender cohorts where the paper reports under-capture of minority stress (Results section on transgender women, -5.42 residual). However, these surveys have documented small subsample sizes and under-sampling for transgender respondents; without a sensitivity analysis to alternative epidemiological sources or explicit discussion of how survey limitations affect the 8-46% capture rate and variance-compression metrics, the attribution of deviations to 'algorithmic erasure' rather than ground-truth noise is not fully supported. This is load-bearing for the population-misrepresentation conclusion.
- [Methods and Results (variance and calibration subsections)] The variance-compression range (14-62%) and claim of 'eliminating the distributional tails' are presented as model failures, but the manuscript does not report whether the chosen statistical metrics (variance, calibration) were pre-registered or if post-hoc cohort definitions influenced the effect sizes. A sensitivity check excluding or reweighting small-N intersectional cells would strengthen the claim that the dissociation is not an artifact of the 120-cohort design.
minor comments (2)
- [Abstract and Results] Ensure all percentages and effect sizes (e.g., 36.66%, d=1.13-1.91) are reported with consistent precision and accompanied by confidence intervals or exact p-values in the main text and tables.
- [Methods] The abstract states patterns 'replicate across US and Chinese architectures'; clarify in the methods whether model selection was exhaustive or representative and whether any architecture-specific hyperparameters were controlled.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which identify important areas for strengthening the robustness of our claims regarding population-level fidelity in LLM mental health simulations. We address each major comment below and outline the revisions we will implement.
Point-by-point responses
- Referee: The central fidelity-dissociation claim rests on NHANES and NESARC-III serving as the normative ground truth for population-level statistics, including for transgender cohorts where the paper reports under-capture of minority stress (Results section on transgender women, -5.42 residual). However, these surveys have documented small subsample sizes and under-sampling for transgender respondents; without a sensitivity analysis to alternative epidemiological sources or explicit discussion of how survey limitations affect the 8-46% capture rate and variance-compression metrics, the attribution of deviations to 'algorithmic erasure' rather than ground-truth noise is not fully supported. This is load-bearing for the population-misrepresentation conclusion.
Authors: We agree that the small subsample sizes for transgender respondents in NHANES and NESARC-III introduce meaningful uncertainty that requires explicit acknowledgment. In the revised manuscript, we will expand the Limitations section with a dedicated paragraph reporting the approximate transgender subsample sizes in each survey (typically under 200 respondents for relevant items) and discussing how under-sampling may contribute to variability in the reported 8-46% minority stress capture rates and associated residuals. We will also qualify the attribution language to note that while cross-model consistency supports a model-related component, survey noise could partially explain the deviations. A full sensitivity analysis using alternative epidemiological sources is not feasible, as no other large-scale, nationally representative US datasets provide comparable intersectional measures of minority stress with sufficient power.
revision: partial
- Referee: The variance-compression range (14-62%) and claim of 'eliminating the distributional tails' are presented as model failures, but the manuscript does not report whether the chosen statistical metrics (variance, calibration) were pre-registered or if post-hoc cohort definitions influenced the effect sizes. A sensitivity check excluding or reweighting small-N intersectional cells would strengthen the claim that the dissociation is not an artifact of the 120-cohort design.
Authors: The 120 intersectional cohorts were defined a priori using the demographic strata directly available in the NHANES and NESARC-III datasets (age bands, gender including transgender identity, race/ethnicity). The variance and calibration metrics were selected to match standard epidemiological approaches for assessing distributional fidelity rather than being post-hoc inventions. We did not pre-register the specific analyses, as this was an exploratory large-scale audit. In the revision, we will add a Methods subsection explicitly describing the cohort construction process, noting the absence of pre-registration, and reporting the sensitivity analysis requested: we will re-compute key metrics after excluding or reweighting cohorts with N < 50 to demonstrate that the 14-62% variance compression and coherence-fidelity dissociation persist.
revision: yes
- Declined: a full sensitivity analysis against alternative epidemiological sources for transgender minority stress, since no equivalent large-scale, nationally representative datasets with comparable intersectional measures are available.
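The small-N reweighting the referee asks for can be illustrated as follows. The cohort summaries and the size-weighted pooling below are hypothetical, not the manuscript's procedure or its values.

```python
import numpy as np

def weighted_variance_compression(cohorts, min_n=50):
    """Pooled variance compression after dropping cohorts below min_n,
    weighting the remainder by cohort size (illustrative check only)."""
    kept = [c for c in cohorts if c["n"] >= min_n]
    weights = np.array([c["n"] for c in kept], dtype=float)
    compression = np.array([1.0 - c["sim_var"] / c["ref_var"] for c in kept])
    return np.average(compression, weights=weights)

# Hypothetical cohort summaries (not values from the manuscript).
cohorts = [
    {"n": 400, "sim_var": 9.0,  "ref_var": 25.0},
    {"n": 120, "sim_var": 12.0, "ref_var": 24.0},
    {"n": 15,  "sim_var": 2.0,  "ref_var": 30.0},  # small cell, excluded
]
print(round(weighted_variance_compression(cohorts), 3))  # → 0.608
```

If the pooled estimate barely moves when small cells are dropped or downweighted, the compression finding is unlikely to be an artifact of the 120-cohort design.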
Circularity Check
No circularity: empirical audit against independent external surveys
Full rationale
The paper conducts a direct empirical comparison: it generates 28,800 LLM patient profiles from four models and evaluates them against fixed external baselines (NHANES and NESARC-III) using pre-specified statistical metrics (variance compression, test-retest stability, symptom correlations, calibration bias). No equations, parameters, or claims are derived from the paper's own outputs or prior self-citations. The coherence-fidelity dissociation is measured by deviation from these public datasets, not by fitting or renaming internal constructs. This structure is self-contained against external benchmarks and contains none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: NHANES and NESARC-III provide representative baselines for mental health symptom distributions across the tested demographic intersections.
Reference graph
Works this paper leans on
- [1] Singhal, K., Azizi, S., Tu, T., et al. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172–180.
- [2] Eckardt, J.-N., Wendt, K., Bornhäuser, M., & Middeke, J.M. (2024). Mimicking clinical trials with synthetic acute myeloid leukemia patients using generative artificial intelligence. npj Digital Medicine, 7(1), 76.
- [3] Ji, Z., Lee, N., Frieske, R., et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38.
- [4] Patel, J.S., Oh, Y., Engel, K.L., & Grzywacz, J.G. (2019). Trends in depressive symptoms among United States adults: National Health and Nutrition Examination Survey 2005–2016. Depression and Anxiety, 36(10), 919–926.
- [5] Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
- [6] Bai, Y., Jones, A., Kaplan, J., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- [7] Kroenke, K., Spitzer, R.L., & Williams, J.B. (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613.
- [8] Bland, J.M., & Altman, D.G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 327(8476), 307–310.
- [9] Borsboom, D., & Cramer, A.O. (2013). Network analysis: An integrative approach to the structure of psychopathology. Annual Review of Clinical Psychology, 9, 91–121.
- [10] Fried, E.I., van Borkulo, C.D., Cramer, A.O., et al. (2017). Mental disorders as networks of problems: A review of recent insights. Social Psychiatry and Psychiatric Epidemiology, 52(1), 1–10.
- [11] Spitzer, R.L., Kroenke, K., Williams, J.B., & Löwe, B. (2006). A brief measure for assessing generalized anxiety disorder: The GAD-7. Archives of Internal Medicine, 166(10), 1092–1097.
- [12] Bradley, K.A., DeBenedetti, A.F., Volk, R.J., et al. (2007). AUDIT-C as a brief screen for alcohol misuse in primary care. Alcoholism: Clinical and Experimental Research, 31(7), 1208–1217.
- [13] Weathers, F.W., Litz, B.T., Keane, T.M., et al. (2013). The PTSD Checklist for DSM-5 (PCL-5). National Center for PTSD.
- [14] Kalibatseva, Z., & Leong, F.T. (2011). Depression among Asian Americans: Review and recommendations. Depression Research and Treatment, 2011, 320902.
- [15] Hughto, J.M.W., Gunn, H.A., Engel, L.E., et al. (2024). Rates of depression and anxiety among transgender and gender-diverse adults in the United States, 2014–2022. JAMA Internal Medicine, 184(8), 981–984.
- [16] Williams, D.R., González, H.M., Neighbors, H., et al. (2007). Prevalence and distribution of major depressive disorder in African Americans, Caribbean Blacks, and non-Hispanic Whites. Archives of General Psychiatry, 64(3), 305–315.
- [17]
- [18] Dror, R., Baumer, G., Shlomov, S., & Reichart, R. (2018). The hitchhiker's guide to testing statistical significance in natural language processing. Proceedings of ACL 2018, 1383–1392.
- [19] Santomauro, D.F., Mantilla Herrera, A.M., Shadid, J., et al. (2021). Global prevalence and burden of depressive and anxiety disorders in 204 countries and territories in 2020 due to the COVID-19 pandemic. The Lancet, 398(10312), 1700–1712.
- [20] Centers for Disease Control and Prevention. (2020). National Health and Nutrition Examination Survey 2005–2018 Data Documentation. Hyattsville, MD: National Center for Health Statistics.
- [21] Grant, B.F., Goldstein, R.B., Saha, T.D., et al. (2015). Epidemiology of DSM-5 alcohol use disorder: Results from NESARC-III. JAMA Psychiatry, 72(8), 757–766.
- [22] Zheng, L., Chiang, W.L., Sheng, Y., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.
- [23] Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. Proceedings of ACL 2022, 3214–3252.
- [24] Hagendorff, T., Fabi, S., & Kosinski, M. (2023). Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. Nature Reviews Psychology, 2, 1–10.
- [25] Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used on more than 200 million people. Science, 366(6464), 447–453.
- [26] Bolukbasi, T., Chang, K.W., Zou, J.Y., Saligrama, V., & Kalai, A.T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29, 4349–4357.
- [27] Abid, A., Farooqi, M., & Zou, J. (2021). Persistent anti-Muslim bias in large language models. Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 298–306.
- [28] Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 5356–5371.
- [29]