Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives

Anna Sterna; Kacper Dudzic; Karolina Drożdż; Marcin Moskalewicz

arXiv: 2512.20298 · v4 · pith:CN4XQMCWnew · submitted 2025-12-23 · 💻 cs.CL · cs.AI · cs.CY · cs.HC

Pith reviewed 2026-05-25 07:33 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY · cs.HC

keywords LLMs · personality disorders · diagnosis · mental health · first-person narratives · borderline personality disorder · narcissistic personality disorder · diagnostic bias

The pith

LLMs achieve higher diagnostic scores than mental health professionals when assessing personality disorders from first-person narratives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates state-of-the-art LLMs against mental health professionals on diagnosing borderline and narcissistic personality disorders using Polish first-person autobiographical accounts. Top-performing Gemini Pro models scored 65.48 percent overall compared to 43.57 percent for the human professionals, a gap of nearly 22 points. Both performed well on borderline cases, but the models severely underdiagnosed narcissism. The models offered confident, pattern-focused justifications while the experts were more cautious and attended to the patient's sense of self over time. This comparison shows LLMs can handle complex clinical narratives but raises questions about their biases and reliability in practice.

Core claim

Within the studied sample, Gemini Pro models attained an overall diagnostic score of 65.48% on the personality disorder assessments, exceeding the human professionals' average of 43.57% by 21.91 percentage points. Performance on borderline personality disorder was strong, but narcissistic personality disorder was markedly underdiagnosed: the models' F1 score was 6.7 versus 50.0 for humans. The models' justifications centered on formal patterns and categories rather than on the patients' temporal self-experience.
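As a quick arithmetic check on the figures above, the sketch below recomputes the percentage-point gap and shows how an F1 score is formed from precision and recall. The confusion counts are hypothetical and chosen only for illustration; the source does not report precision, recall, or case counts, so nothing here reproduces the paper's actual scoring.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1 (harmonic mean of precision and recall) in percent."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100 * 2 * precision * recall / (precision + recall)


# Overall scores as reported in the abstract, in percent.
model_score = 65.48
human_score = 43.57
gap = round(model_score - human_score, 2)
print(f"gap = {gap} percentage points")

# Hypothetical counts (NOT from the paper): a rater who flags the disorder
# in only 1 of 10 true cases, versus one with balanced errors.
low_recall_f1 = f1_score(tp=1, fp=0, fn=9)
balanced_f1 = f1_score(tp=5, fp=5, fn=5)
print(f"low-recall F1 = {low_recall_f1:.1f}, balanced F1 = {balanced_f1:.1f}")
```

The low-recall case shows why systematic underdiagnosis depresses F1 even when every positive call is correct: recall enters the harmonic mean directly.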

What carries the argument

The direct comparison of diagnostic performance and qualitative reasoning styles between LLMs and mental health professionals on first-person narratives for BPD and NPD.

Load-bearing premise

The selected set of Polish first-person autobiographical accounts and the scoring method accurately represent real clinical diagnostic skill without additional context.

What would settle it

A replication on a larger sample of narratives, with more narcissism cases and longitudinal follow-up data, that shows no performance advantage for the models.

Figures

Figures reproduced from arXiv: 2512.20298 by Anna Sterna, Kacper Dudzic, Karolina Drożdż, Marcin Moskalewicz.

**Figure 2.** Average model and mental health profession...

**Figure 3.** MDS and UMAP projections of the semantic...

Original abstract

Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. This depth over breadth case study directly compares state-of-the-art LLMs and mental health professionals in assessing Borderline (BPD) and Narcissistic (NPD) Personality Disorders based on Polish-language first-person autobiographical accounts. Within our sample, the overall diagnostic scores of the top-performing Gemini Pro models (65.48%) were 21.91 percentage points higher than the average scores of the human professionals (43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 & F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a potential reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patients' sense of self and temporal experience. Our findings demonstrate that while LLMs might be competent at interpreting complex first-person clinical data, their outputs still carry critical reliability and bias issues.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper shows Gemini models beating human clinicians by 22 points on Polish first-person narratives but missing NPD almost entirely, yet the missing sample details and scoring validation make that gap hard to trust.

The one thing to take away is that the top Gemini models reached a 65% overall diagnostic score against the humans' 44% average, with a sharp drop to 6.7 F1 on NPD while humans hit 50. The qualitative contrast is also clear: models gave elaborate pattern-based justifications, humans stayed short and focused on the patient's self-view over time. That NPD miss stands out as the concrete signal here.

What is new is the head-to-head on Polish autobiographical accounts for exactly these two disorders; prior work has done similar LLM-clinician comparisons but not this language and source combination. The paper does a clean job of reporting the numeric split and the style difference without overclaiming.

The soft spots sit in the methods. The abstract gives no sample size, no selection criteria for the narratives, no inter-rater numbers for the human scorers, and no explicit rubric for how the diagnostic scores were assigned. That leaves the central performance gap resting on an unvalidated proxy. The stress-test concern about the scoring method is fair: without those details it is possible the setup rewards the models' categorical style while the humans' caution looks weaker by comparison.

The paper is aimed at people working on AI for psychiatric screening or self-assessment tools. A reader who wants a quick empirical flag on bias patterns will get something from it, but anyone needing reproducible evidence will want the missing protocol first. It deserves a serious referee once the authors add the sample information, reliability stats, and scoring details; the current version is too thin for strong conclusions but the NPD observation is worth checking.

Referee Report

3 major / 1 minor

Summary. The manuscript is a depth-over-breadth empirical comparison of state-of-the-art LLMs (chiefly Gemini Pro variants) versus mental health professionals on diagnosing Borderline (BPD) and Narcissistic (NPD) Personality Disorders from Polish-language first-person autobiographical accounts. It reports that the best LLMs attained an overall diagnostic score of 65.48% versus 43.57% for the human average (a 21.91-point gap), with both groups performing well on BPD (F1 83.4 and 80.0) but LLMs severely under-diagnosing NPD (F1 6.7 versus 50.0); it also contrasts the models’ confident, pattern-focused justifications with the humans’ cautious, self-focused responses.

Significance. If the scoring procedure can be shown to be a reliable proxy for clinical judgment, the work would supply concrete evidence on both the pattern-recognition strengths and the categorical biases of current LLMs when applied to qualitative psychiatric narratives, thereby informing debates about AI-assisted self-assessment.

major comments (3)

[Abstract] Abstract and implied Methods: the headline performance gap (65.48% vs 43.57%) and the NPD F1 disparity rest on a quantitative scoring method whose rubric, inter-rater reliability among the human professionals, sample size, and narrative selection criteria are not reported; without these, it is impossible to determine whether the numeric scores validly measure diagnostic competence rather than prompt sensitivity or surface pattern matching.
[Abstract] Abstract: the study deliberately withholds longitudinal data and additional clinical context yet treats the resulting scores as directly comparable to real diagnostic skill; this design choice is load-bearing for the claim that LLMs are “competent at interpreting complex first-person clinical data,” because the same isolation may systematically favor the models’ categorical style while penalizing the humans’ cautious responses.
[Abstract] Abstract: the qualitative observation that models “severely underdiagnosed NPD” is presented as evidence of bias against the term “narcissism,” but no control condition or prompt-variation experiment is described to rule out the possibility that the disparity is an artifact of the evaluation protocol rather than a stable model property.

minor comments (1)

[Abstract] Abstract: state the exact number of narratives and the number of human raters so that readers can assess the stability of the reported percentages and F1 scores.
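To illustrate why the sample size matters for the stability of a reported percentage, the sketch below computes a Wilson score interval for a proportion at a few sample sizes. It assumes, purely for illustration, that a score such as 65.48% behaves like a simple proportion of correct judgments; the source does not say how the overall score is computed, and the sample sizes used here are hypothetical.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


p_hat = 0.6548                       # reported top-model score, read as a proportion (assumption)
hypothetical_sizes = [10, 30, 100]   # NOT from the paper
intervals = {n: wilson_interval(p_hat, n) for n in hypothetical_sizes}

for n, (low, high) in intervals.items():
    print(f"n={n:>3}: {100 * low:.1f}% to {100 * high:.1f}%")
```

The interval narrows as n grows, which is the practical reason a reader needs the number of narratives and raters before weighing a 22-point gap.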

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for your thorough and constructive review. We address each major comment below and have revised the manuscript to improve methodological transparency and qualify interpretive claims where needed.

Referee: [Abstract] Abstract and implied Methods: the headline performance gap (65.48% vs 43.57%) and the NPD F1 disparity rest on a quantitative scoring method whose rubric, inter-rater reliability among the human professionals, sample size, and narrative selection criteria are not reported; without these, it is impossible to determine whether the numeric scores validly measure diagnostic competence rather than prompt sensitivity or surface pattern matching.

Authors: We agree that these details should be visible in the abstract. The Methods section already specifies the scoring rubric (DSM-5 symptom criteria mapped to narrative indicators), the sample of narratives, selection criteria from Polish autobiographical sources, and inter-rater reliability among the human professionals. We have revised the abstract to include a brief summary of the evaluation protocol and added a supplementary table that consolidates these elements for immediate accessibility. revision: yes
Referee: [Abstract] Abstract: the study deliberately withholds longitudinal data and additional clinical context yet treats the resulting scores as directly comparable to real diagnostic skill; this design choice is load-bearing for the claim that LLMs are “competent at interpreting complex first-person clinical data,” because the same isolation may systematically favor the models’ categorical style while penalizing the humans’ cautious responses.

Authors: The isolation of first-person narratives is an intentional design choice to model self-assessment scenarios. We acknowledge that this limits direct equivalence to full clinical practice and may interact with observed stylistic differences. The revised manuscript adds an explicit Limitations paragraph discussing this scope and tempers the abstract language to frame the results as applying to narrative-only interpretation rather than claiming broad diagnostic competence. revision: yes
Referee: [Abstract] Abstract: the qualitative observation that models “severely underdiagnosed NPD” is presented as evidence of bias against the term “narcissism,” but no control condition or prompt-variation experiment is described to rule out the possibility that the disparity is an artifact of the evaluation protocol rather than a stable model property.

Authors: We accept that the current text does not contain controls to isolate the cause of the NPD disparity. The revised version rephrases the abstract to present the underdiagnosis as an observed disparity whose etiology (possible term bias or protocol effects) requires further investigation, and adds a Limitations note calling for prompt-variation experiments in future work. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical comparison study

This paper reports a head-to-head empirical evaluation of LLM vs. human diagnostic performance on fixed Polish autobiographical texts. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the abstract or described methods. The central claim rests on observed numeric scores (Gemini Pro 65.48% vs. human 43.57%) obtained by applying the same rubric to both groups; the scoring procedure itself is not derived from the results. The study is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical evaluation study with no mathematical derivations. The central claim rests on the representativeness of the narrative sample and the validity of the diagnostic scoring procedure.

axioms (1)

domain assumption: The selected first-person autobiographical accounts contain sufficient diagnostic information to support reliable BPD and NPD judgments by both models and humans.
Invoked when the study treats narrative-based diagnosis as a fair test of capability.

pith-pipeline@v0.9.0 · 5765 in / 1209 out tokens · 50776 ms · 2026-05-25T07:33:39.667503+00:00 · methodology

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · 2 internal anchors

[1]

https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf. Accessed: 2025-12-18. John W. Ayers, Adam Poliak, Mark Dredze, Eric C. Leas, Zechariah Zhu, Jessica B. Kelley, Dennis J. Faix, Aaron M. Goodman, Christopher A. Longhurst, Michael Hogarth, and Davey M. Smith. 2023. Comparing Physician and Artificial Intelligence Chatbot Respons...

[2]

A Comprehensive Evaluation of Large Language Models on Mental Illnesses. Preprint, arXiv:2409.15687. Stefan Harrer. 2023. Attention is not all you need: the complicated case of ethically using large language models in healthcare and medicine. eBioMedicine, 90. Yining Hua, Hongbin Na, Zehan Li, Fenglin Liu, Xiao Fang, David Clifton, and John Torous. 2025. A ...

[3]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Preprint, arXiv:1802.03426. Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict. Political Analysis, 16(4):372–403. Viet Cuong Nguyen, Mohammad Taher, Dongwan Hon...

[4]

The Sense of Self and Interpersonal Functioning in Borderline Personality Disorder: Toward Qualitative Evidence-Based Phenomenological Conceptualization. Qualitative Health Research, page 10497323251376224. Ching-Fang Sun, Christoph U. Correll, Robert L. Trestman, Yezhe Lin, Hui Xie, Maria Stack Hankey, Raymond Paglinawan Uymatiao, Riya T. Patel, V...

[5]

The texts of all justifications were aggregated into groupings. For humans, due to the small amount of data relative to model outputs, a single grouping was created from all justifications for a single human participant. For each model, we created two separate groupings: one based on all categorical justifications and the other based on all dimensional ju...

[6]

Before the embedding process, simple text pre-processing with regular expressions was applied to remove LLM-characteristic text formatting artifacts, such as Markdown syntax and redundant whitespace

[7]

Each grouping was converted into a dense embedding using the chosen BAAI/bge-multilingual-gemma2 embedding model. The model operated in 16-bit precision on a single NVIDIA A100 (40GB) GPU, with a batch size of 8 and default remaining hyperparameter values

[8]

A summary embedding representing the semantic contents of justifications was created for each model by first calculating the mean value for the categorical and dimensional grouping embeddings separately, and subsequently averaging these two values. For human participants, since there was no categorical-dimensional grouping separation, a single summary emb...
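The excerpts above describe three steps: stripping Markdown artifacts with regular expressions, embedding each grouping, and averaging embeddings into one summary vector per model or for the human group. The sketch below illustrates the first and last steps in plain Python. The regular expressions, the function names, and the toy three-dimensional vectors are assumptions made for illustration; the source does not list the actual patterns, and real embeddings would come from the BAAI/bge-multilingual-gemma2 model rather than being typed by hand.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove a few common Markdown artifacts and collapse whitespace.

    The patterns are illustrative guesses, not the authors' actual rules.
    """
    text = re.sub(r"^\s{0,3}#{1,6}\s+", "", text, flags=re.MULTILINE)  # headings
    text = re.sub(r"^\s*[-*+]\s+", "", text, flags=re.MULTILINE)       # list bullets
    text = re.sub(r"(\*\*|__|\*|_|`)", "", text)                        # emphasis, inline code
    return re.sub(r"\s+", " ", text).strip()                            # redundant whitespace


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def model_summary(categorical: list[list[float]],
                  dimensional: list[list[float]]) -> list[float]:
    """Mean of each grouping type first, then the average of the two means."""
    return mean_vector([mean_vector(categorical), mean_vector(dimensional)])


def human_summary(per_participant: list[list[float]]) -> list[float]:
    """Humans have no categorical/dimensional split: average all individuals."""
    return mean_vector(per_participant)


raw = "## Diagnosis\n\n- **BPD**: unstable   self-image\n- `NPD`: not indicated"
clean = strip_markdown(raw)
print(clean)

# Toy vectors standing in for real embeddings (hypothetical values).
summary = model_summary(
    categorical=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    dimensional=[[0.0, 0.0, 1.0]],
)
print(summary)
```

Averaging the two grouping means separately, rather than pooling all vectors, keeps the categorical and dimensional justifications equally weighted even when one type has more entries.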
