PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations
Pith reviewed 2026-05-10 06:00 UTC · model grok-4.3
The pith
Large language models produce plausible individual mental health profiles but systematically misrepresent real population distributions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Models produce clinically plausible individuals while misrepresenting the populations they are drawn from. Variance compression ranges from 14 percent to 62 percent, eliminating the distributional tails of clinical reality. Test-retest correlations exceed r = 0.90, yet 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices diverge across demographic groups, calibration bias overestimates depression severity for most cohorts, and transgender populations show under-capture of documented minority stress elevations. Patterns replicate across US and Chinese model architectures.
What carries the argument
A PsychBench epidemiological audit that measures variance compression, threshold stability, symptom correlations, and calibration bias by comparing LLM-generated profiles against NHANES and NESARC-III baselines across 120 intersectional cohorts.
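The paper's exact formulas are not reproduced in this summary, but the metric suite can be illustrated with a minimal sketch on synthetic data. The definitions below are assumptions for illustration only: compression as the fraction of reference variance lost, a PHQ-9-style cutoff of 10 for threshold crossing, and bias as the mean residual against the survey mean.

```python
import numpy as np

def variance_compression(sim_scores, ref_variance):
    """Fraction of reference variance lost by the simulator (assumed definition)."""
    return 1.0 - np.var(sim_scores, ddof=1) / ref_variance

def threshold_crossing_rate(run_a, run_b, cutoff=10):
    """Share of cases whose diagnostic status flips between two runs;
    a PHQ-9-style cutoff of 10 is used purely for illustration."""
    return np.mean((np.asarray(run_a) >= cutoff) != (np.asarray(run_b) >= cutoff))

def calibration_bias(sim_scores, ref_mean):
    """Mean simulated severity minus the survey baseline mean."""
    return np.mean(sim_scores) - ref_mean

# Synthetic data mimicking the reported failure mode: a compressed,
# upward-shifted simulator that is nonetheless highly test-retest stable.
rng = np.random.default_rng(0)
ref = rng.normal(8.0, 5.0, 5000)            # stand-in survey distribution
sim_a = rng.normal(12.0, 3.0, 1000)         # run 1: narrower, higher mean
sim_b = sim_a + rng.normal(0.0, 1.5, 1000)  # run 2: correlated re-generation

print(variance_compression(sim_a, np.var(ref, ddof=1)))  # ≈ 0.64
print(threshold_crossing_rate(sim_a, sim_b))             # nonzero despite high r
print(calibration_bias(sim_a, np.mean(ref)))             # ≈ +4.0
```

The toy run reproduces the paper's qualitative pattern: a simulator can be tightly reproducible case by case while still shrinking variance and shifting the mean.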
If this is right
- For most users, LLM mental health tools risk pathologizing ordinary distress.
- For transgender users, the tools produce algorithmic erasure of documented minority stress elevations.
- Models encode racialized and gendered assumptions, such as excess irritability attributed to Black men and fatigue to women.
- The failures appear tied to current training paradigms rather than any single architecture.
- Population-level fidelity checks become necessary before deploying these simulations in training or research.
Where Pith is reading between the lines
- Training data drawn from clinical corpora with elevated base rates likely drives the overestimation of severity for most groups.
- Developers could test whether fine-tuning on demographically balanced survey responses reduces the observed variance compression.
- Synthetic patient datasets generated by these models may understate the prevalence of severe cases in real populations.
- Similar audits could be applied to non-mental-health domains where LLMs simulate individuals from known statistical populations.
Load-bearing premise
That NHANES and NESARC-III survey data serve as the correct ground-truth target distributions for what LLM-generated populations should match.
What would settle it
If new LLM simulations were shown to reproduce the same variance, symptom correlations, diagnostic prevalence rates, and group differences as the NHANES and NESARC-III data without compression or systematic bias.
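A hedged sketch of what such a check could look like on synthetic data: the variance ratio and survivor-tail mass below are illustrative stand-ins, not the paper's actual test statistics.

```python
import numpy as np

def fidelity_report(survey, simulated):
    """Two crude distribution checks: variance ratio (1.0 = faithful) and
    mass beyond the survey's 95th percentile (about 0.05 if tails survive)."""
    var_ratio = np.var(simulated, ddof=1) / np.var(survey, ddof=1)
    tail_mass = np.mean(simulated > np.percentile(survey, 95))
    return var_ratio, tail_mass

rng = np.random.default_rng(1)
survey = rng.normal(8.0, 5.0, 5000)      # stand-in for a survey baseline
compressed = rng.normal(8.0, 3.0, 2000)  # simulator with squeezed variance
faithful = rng.normal(8.0, 5.0, 2000)    # what passing the audit would look like

print(fidelity_report(survey, compressed))  # low variance ratio, near-empty tail
print(fidelity_report(survey, faithful))    # ratio near 1, tail mass near 0.05
```

A compressed simulator fails both checks at once: its variance ratio drops well below one and almost no cases land in the survey's upper tail, which is the "eliminated tails" signature the paper describes.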
Original abstract
Large language models are increasingly deployed to simulate patients for clinical training, research, and mental health tools, yet population-level validity remains largely untested. We introduce PsychBench, the first epidemiological audit of LLM patient simulation: 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) evaluated against NHANES and NESARC-III baselines across 120 intersectional cohorts. The central finding is a coherence-fidelity dissociation: models produce clinically plausible individuals while misrepresenting the populations they are drawn from. Variance compression ranges from 14 percent (GLM-4.7) to 62 percent (DeepSeek-V3), eliminating the distributional tails of clinical reality. Despite test-retest correlations above r = 0.90, 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices diverge across demographic groups beyond split-half noise, with transgender populations diverging three to five times more than racial differences. Calibration bias is systematic and asymmetric. Models overestimate depression severity for most groups by 3.6 to 6.1 points (Cohen d = 1.13 to 1.91), consistent with training on clinical corpora with elevated base rates. For transgender women the direction inverts: models capture only 8 to 46 percent of documented minority stress elevation, yielding a -5.42 residual (d = -1.55). Models also attribute irritability to Black men and fatigue to women beyond matched controls, encoding racialized and gendered assumptions. Patterns replicate across US and Chinese architectures, indicating failures tied to current training paradigms rather than isolated implementations. For most users, LLM mental health tools risk pathologizing ordinary distress; for transgender users, algorithmic erasure of genuine need. The patients look right. They do not represent real populations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PsychBench, the first epidemiological audit of LLM patient simulations for mental health. It generates 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) across 120 intersectional cohorts and evaluates them against NHANES and NESARC-III baselines on variance, test-retest stability (r > 0.90), symptom correlations, and calibration. The central claim is a coherence-fidelity dissociation: models produce clinically plausible individuals but exhibit variance compression (14% for GLM-4.7 to 62% for DeepSeek-V3), 36.66% diagnostic threshold crossing between runs, divergent symptom matrices (especially 3-5x larger for transgender groups), systematic overestimation of depression severity (3.6-6.1 points, d=1.13-1.91), and under-capture of minority-stress elevation in transgender women (only 8-46% captured, -5.42 residual, d=-1.55). Patterns replicate across architectures, with additional encoding of racialized/gendered assumptions.
Significance. If the results hold, the work is significant as a large-scale empirical demonstration that LLM mental health simulations can appear coherent at the individual level while failing to represent population distributions, with direct implications for clinical training tools and risk of pathologizing ordinary distress or erasing genuine need in transgender cohorts. The scale (28,800 profiles, 120 cohorts, four models), concrete quantitative metrics, and cross-architecture replication provide a reproducible foundation for future audits. Credit is due for the falsifiable predictions and focus on intersectional effects rather than aggregate performance.
major comments (2)
- [Results (transgender cohort analysis)] The central fidelity-dissociation claim rests on NHANES and NESARC-III serving as the normative ground truth for population-level statistics, including for transgender cohorts where the paper reports under-capture of minority stress (Results section on transgender women, -5.42 residual). However, these surveys have documented small subsample sizes and under-sampling for transgender respondents; without a sensitivity analysis to alternative epidemiological sources or explicit discussion of how survey limitations affect the 8-46% capture rate and variance-compression metrics, the attribution of deviations to 'algorithmic erasure' rather than ground-truth noise is not fully supported. This is load-bearing for the population-misrepresentation conclusion.
- [Methods and Results (variance and calibration subsections)] The variance-compression range (14-62%) and claim of 'eliminating the distributional tails' are presented as model failures, but the manuscript does not report whether the chosen statistical metrics (variance, calibration) were pre-registered or if post-hoc cohort definitions influenced the effect sizes. A sensitivity check excluding or reweighting small-N intersectional cells would strengthen the claim that the dissociation is not an artifact of the 120-cohort design.
minor comments (2)
- [Abstract and Results] Ensure all percentages and effect sizes (e.g., 36.66%, d=1.13-1.91) are reported with consistent precision and accompanied by confidence intervals or exact p-values in the main text and tables.
- [Methods] The abstract states patterns 'replicate across US and Chinese architectures'; clarify in the methods whether model selection was exhaustive or representative and whether any architecture-specific hyperparameters were controlled.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which identify important areas for strengthening the robustness of our claims regarding population-level fidelity in LLM mental health simulations. We address each major comment below and outline the revisions we will implement.
Point-by-point responses
- Referee: The central fidelity-dissociation claim rests on NHANES and NESARC-III serving as the normative ground truth for population-level statistics, including for transgender cohorts where the paper reports under-capture of minority stress (Results section on transgender women, -5.42 residual). However, these surveys have documented small subsample sizes and under-sampling for transgender respondents; without a sensitivity analysis to alternative epidemiological sources or explicit discussion of how survey limitations affect the 8-46% capture rate and variance-compression metrics, the attribution of deviations to 'algorithmic erasure' rather than ground-truth noise is not fully supported. This is load-bearing for the population-misrepresentation conclusion.
Authors: We agree that the small subsample sizes for transgender respondents in NHANES and NESARC-III introduce meaningful uncertainty that requires explicit acknowledgment. In the revised manuscript, we will expand the Limitations section with a dedicated paragraph reporting the approximate transgender subsample sizes in each survey (typically under 200 respondents for relevant items) and discussing how under-sampling may contribute to variability in the reported 8-46% minority stress capture rates and associated residuals. We will also qualify the attribution language to note that while cross-model consistency supports a model-related component, survey noise could partially explain the deviations. A full sensitivity analysis using alternative epidemiological sources is not feasible, as no other large-scale, nationally representative US datasets provide comparable intersectional measures of minority stress with sufficient power.
revision: partial
- Referee: The variance-compression range (14-62%) and claim of 'eliminating the distributional tails' are presented as model failures, but the manuscript does not report whether the chosen statistical metrics (variance, calibration) were pre-registered or if post-hoc cohort definitions influenced the effect sizes. A sensitivity check excluding or reweighting small-N intersectional cells would strengthen the claim that the dissociation is not an artifact of the 120-cohort design.
Authors: The 120 intersectional cohorts were defined a priori using the demographic strata directly available in the NHANES and NESARC-III datasets (age bands, gender including transgender identity, race/ethnicity). The variance and calibration metrics were selected to match standard epidemiological approaches for assessing distributional fidelity rather than being post-hoc inventions. We did not pre-register the specific analyses, as this was an exploratory large-scale audit. In the revision, we will add a Methods subsection explicitly describing the cohort construction process, noting the absence of pre-registration, and reporting the sensitivity analysis requested: we will re-compute key metrics after excluding or reweighting cohorts with N < 50 to demonstrate that the 14-62% variance compression and coherence-fidelity dissociation persist.
revision: yes
- Declined: a full sensitivity analysis against alternative epidemiological sources for transgender minority stress, since no equivalent large-scale, nationally representative datasets with comparable intersectional measures are available.
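The small-N reweighting the referee asks for can be illustrated as follows. The cohort summaries and the size-weighted pooling below are hypothetical, not the manuscript's procedure or its values.

```python
import numpy as np

def weighted_variance_compression(cohorts, min_n=50):
    """Pooled variance compression after dropping cohorts below min_n,
    weighting the remainder by cohort size (illustrative check only)."""
    kept = [c for c in cohorts if c["n"] >= min_n]
    weights = np.array([c["n"] for c in kept], dtype=float)
    compression = np.array([1.0 - c["sim_var"] / c["ref_var"] for c in kept])
    return np.average(compression, weights=weights)

# Hypothetical cohort summaries (not values from the manuscript).
cohorts = [
    {"n": 400, "sim_var": 9.0,  "ref_var": 25.0},
    {"n": 120, "sim_var": 12.0, "ref_var": 24.0},
    {"n": 15,  "sim_var": 2.0,  "ref_var": 30.0},  # small cell, excluded
]
print(round(weighted_variance_compression(cohorts), 3))  # → 0.608
```

If the pooled estimate barely moves when small cells are dropped or downweighted, the compression finding is unlikely to be an artifact of the 120-cohort design.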
Circularity Check
No circularity: empirical audit against independent external surveys
Full rationale
The paper conducts a direct empirical comparison: it generates 28,800 LLM patient profiles from four models and evaluates them against fixed external baselines (NHANES and NESARC-III) using pre-specified statistical metrics (variance compression, test-retest stability, symptom correlations, calibration bias). No equations, parameters, or claims are derived from the paper's own outputs or prior self-citations. The coherence-fidelity dissociation is measured by deviation from these public datasets, not by fitting or renaming internal constructs. This structure is self-contained against external benchmarks and contains none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: NHANES and NESARC-III provide representative baselines for mental health symptom distributions across the tested demographic intersections.
Reference graph
Works this paper leans on
- [1] Singhal, K., Azizi, S., Tu, T., et al. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172–180.
- [2] Eckardt, J.-N., Wendt, K., Bornhäuser, M., & Middeke, J.M. (2024). Mimicking clinical trials with synthetic acute myeloid leukemia patients using generative artificial intelligence. npj Digital Medicine, 7(1), 76.
- [3] Ji, Z., Lee, N., Frieske, R., et al. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38.
- [4] Patel, J.S., Oh, Y., Engel, K.L., & Grzywacz, J.G. (2019). Trends in depressive symptoms among United States adults: National Health and Nutrition Examination Survey 2005–2016. Depression and Anxiety, 36(10), 919–926.
- [5] Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
- [6] Bai, Y., Jones, A., Kaplan, J., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- [7] Kroenke, K., Spitzer, R.L., & Williams, J.B. (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613.
- [8] Bland, J.M., & Altman, D.G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 327(8476), 307–310.
- [9] Borsboom, D., & Cramer, A.O. (2013). Network analysis: An integrative approach to the structure of psychopathology. Annual Review of Clinical Psychology, 9, 91–121.
- [10] Fried, E.I., van Borkulo, C.D., Cramer, A.O., et al. (2017). Mental disorders as networks of problems: A review of recent insights. Social Psychiatry and Psychiatric Epidemiology, 52(1), 1–10.
- [11] Spitzer, R.L., Kroenke, K., Williams, J.B., & Löwe, B. (2006). A brief measure for assessing generalized anxiety disorder: The GAD-7. Archives of Internal Medicine, 166(10), 1092–1097.
- [12] Bradley, K.A., DeBenedetti, A.F., Volk, R.J., et al. (2007). AUDIT-C as a brief screen for alcohol misuse in primary care. Alcoholism: Clinical and Experimental Research, 31(7), 1208–1217.
- [13] Weathers, F.W., Litz, B.T., Keane, T.M., et al. (2013). The PTSD Checklist for DSM-5 (PCL-5). National Center for PTSD.
- [14] Kalibatseva, Z., & Leong, F.T. (2011). Depression among Asian Americans: Review and recommendations. Depression Research and Treatment, 2011, 320902.
- [15] Hughto, J.M.W., Gunn, H.A., Engel, L.E., et al. (2024). Rates of depression and anxiety among transgender and gender-diverse adults in the United States, 2014–2022. JAMA Internal Medicine, 184(8), 981–984.
- [16] Williams, D.R., González, H.M., Neighbors, H., et al. (2007). Prevalence and distribution of major depressive disorder in African Americans, Caribbean Blacks, and non-Hispanic Whites. Archives of General Psychiatry, 64(3), 305–315.
- [17]
- [18] Dror, R., Baumer, G., Shlomov, S., & Reichart, R. (2018). The hitchhiker's guide to testing statistical significance in natural language processing. Proceedings of ACL 2018, 1383–1392.
- [19] Santomauro, D.F., Mantilla Herrera, A.M., Shadid, J., et al. (2021). Global prevalence and burden of depressive and anxiety disorders in 204 countries and territories in 2020 due to the COVID-19 pandemic. The Lancet, 398(10312), 1700–1712.
- [20] Centers for Disease Control and Prevention. (2020). National Health and Nutrition Examination Survey 2005–2018 Data Documentation. Hyattsville, MD: National Center for Health Statistics.
- [21] Grant, B.F., Goldstein, R.B., Saha, T.D., et al. (2015). Epidemiology of DSM-5 alcohol use disorder: Results from NESARC-III. JAMA Psychiatry, 72(8), 757–766.
- [22] Zheng, L., Chiang, W.L., Sheng, Y., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.
- [23] Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. Proceedings of ACL 2022, 3214–3252.
- [24] Hagendorff, T., Fabi, S., & Kosinski, M. (2023). Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. Nature Reviews Psychology, 2, 1–10.
- [25] Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used on more than 200 million people. Science, 366(6464), 447–453.
- [26] Bolukbasi, T., Chang, K.W., Zou, J.Y., Saligrama, V., & Kalai, A.T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. Advances in Neural Information Processing Systems, 29, 4349–4357.
- [27] Abid, A., Farooqi, M., & Zou, J. (2021). Persistent anti-Muslim bias in large language models. Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 298–306.
- [28] Nadeem, M., Bethke, A., & Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, 5356–5371.
- [29]