Multi-Level Narrative Evaluation Outperforms Lexical Features for Mental Health
Pith reviewed 2026-05-07 04:53 UTC · model grok-4.3
The pith
Macro-level LLM evaluation of narrative structure outperforms lexical features and embeddings for predicting mental health from therapeutic writing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 830 Chinese therapeutic texts spanning depression, anxiety, and trauma, macro-level LLM narrative evaluation substantially outperforms micro-level lexical features and meso-level semantic embeddings for mental health prediction. Formal structural features such as Labov's story grammar, RST coherence, and propositional composition show that narrative organization per se carries predictive signal, while clinically-grounded narrative dimensions capture how psychological states are expressed through discourse. Semantic embeddings add minimal independent value but yield incremental gains in multi-level classification.
What carries the argument
A three-level framework that maps micro-level lexical features, meso-level semantic embeddings, and macro-level LLM narrative evaluation onto the hierarchical processes of narrative construction, with the macro level assessing global organization via story grammar, coherence relations, and clinically relevant dimensions.
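To make the three-level structure concrete, here is a minimal sketch of how features from each level might be held together and selectively combined. The class name `NarrativeFeatures`, its field contents, and the example values are all hypothetical; the paper does not describe its feature representation at this level of detail.

```python
from dataclasses import dataclass
from typing import List, Sequence

LEVELS = ("micro", "meso", "macro")


@dataclass
class NarrativeFeatures:
    """Hypothetical container for one text's features at the three levels."""
    micro: List[float]  # assumed: lexical category rates
    meso: List[float]   # assumed: a (tiny) embedding vector
    macro: List[float]  # assumed: LLM ratings of narrative structure

    def as_vector(self, levels: Sequence[str] = LEVELS) -> List[float]:
        """Concatenate the requested levels, in the order given."""
        vector: List[float] = []
        for level in levels:
            if level not in LEVELS:
                raise ValueError(f"unknown level: {level}")
            vector.extend(getattr(self, level))
        return vector


# Illustrative values only, not data from the paper.
example = NarrativeFeatures(micro=[0.12, 0.03], meso=[0.5, -0.2, 0.1], macro=[4.0, 2.0])
full = example.as_vector()                     # all three levels
macro_only = example.as_vector(["macro"])      # a single-level view
print(full, macro_only)
```

Selecting levels by name keeps single-level and multi-level comparisons on the same footing, which is the kind of comparison the core claim depends on.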
If this is right
- Narrative organization at the macro level carries predictive signal independent of lexical counts.
- Semantic embeddings contribute only incremental value when added to other levels in classification.
- Clinically grounded narrative dimensions capture how psychological states are expressed through discourse.
- The framework generates testable hypotheses for intervention design based on narrative patterns.
- Longitudinal studies can track changes in macro-level organization during therapy.
Where Pith is reading between the lines
- The macro-level advantage could be tested by applying the framework to English-language or other non-Chinese therapeutic texts to check cross-linguistic stability.
- Integration into digital therapy tools might enable real-time feedback on global story coherence to support patient progress.
- Direct comparison of LLM macro scores against human clinician ratings on the same texts would clarify clinical validity.
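The LLM-versus-clinician comparison in the last point could be quantified with an agreement statistic. The sketch below computes unweighted Cohen's kappa for categorical ratings. The function name and the example ratings are hypothetical, and the source does not say which agreement measure, if any, the authors used.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative coherence ratings on a 1-3 scale, not data from the paper.
llm_scores = [1, 2, 3, 3, 2, 1, 3, 2]
clinician_scores = [1, 2, 3, 2, 2, 1, 3, 3]
kappa = cohen_kappa(llm_scores, clinician_scores)
print(round(kappa, 3))
```

For ordinal rating scales a weighted kappa or a rank correlation may be more appropriate, since unweighted kappa treats a one-point disagreement the same as a two-point one.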
Load-bearing premise
That large language model assessments of narrative structure accurately reflect clinically relevant discourse organization without bias from the model or prompt design.
What would settle it
A replication on a fresh dataset in which macro-level LLM narrative scores show no predictive advantage over lexical feature baselines for mental health outcomes.
Original abstract
How people narrate their experiences offers a window into how the mind organizes them. Computational approaches to therapeutic writing have evolved from lexical counting to neural methods, yet remain fragmented: dictionary tools miss discourse structure, while embeddings conflate local coherence with global organization. No existing framework maps these techniques onto the hierarchical processes through which narratives are constructed. Here we introduce a three-level framework - micro-level lexical features, meso-level semantic embeddings, and macro-level LLM narrative evaluation - and show, across 830 Chinese therapeutic texts spanning depression, anxiety, and trauma, that macro-level evaluation substantially outperforms lexical and embedding features for mental health prediction. This challenges the field's emphasis on word-counting: formal structural features (Labov's story grammar, RST coherence, propositional composition) demonstrate that narrative organization per se carries predictive signal, while clinically-grounded narrative dimensions capture how psychological states are expressed through discourse. Semantic embeddings add minimal independent value but yield incremental gains in multi-level classification. By grounding computational levels in discourse processing theory, this framework identifies macro-structural organization as the primary locus of clinical signal and generates testable hypotheses for intervention design and longitudinal research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a three-level framework for computational analysis of therapeutic writing: micro-level lexical features, meso-level semantic embeddings, and macro-level LLM narrative evaluation. On a corpus of 830 Chinese texts spanning depression, anxiety, and trauma, it claims that macro-level LLM evaluation substantially outperforms lexical and embedding baselines for mental-health prediction. The work grounds the levels in discourse-processing theory (Labov story grammar, RST coherence, propositional composition) and argues that narrative organization per se supplies clinically relevant signal beyond what lower-level features capture, while also noting that embeddings add only incremental value in multi-level models.
Significance. If the reported superiority is robust and the LLM scores are shown to be independent of lexical leakage, the paper would advance the field by replacing fragmented word-counting or embedding approaches with a theoretically motivated hierarchy that identifies macro-structural organization as the primary locus of clinical signal. The explicit linkage to Labov/RST constructs and the use of Chinese therapeutic data are positive features that could generate testable hypotheses for longitudinal studies and intervention design. However, the current absence of methodological transparency and quantitative results prevents any assessment of whether these benefits are realized.
major comments (3)
- [Abstract] Abstract: The central claim that 'macro-level evaluation substantially outperforms lexical and embedding features' is asserted without any performance metrics, statistical tests, baseline definitions, or error analysis. This absence makes it impossible to evaluate whether the data support the claim or whether the reported gains are additive beyond what embeddings already encode.
- [Methods] Methods (inferred from abstract description): No details are supplied on the LLM employed, the precise prompt template, the set of narrative dimensions scored, temperature, few-shot examples, or any human validation of the LLM ratings. Without these, it cannot be determined whether the macro scores reflect independent discourse organization (Labov/RST/propositional structure) or simply re-express lexical or stylistic cues already captured at the micro and meso levels, especially on Chinese text where tokenization and cultural alignment issues are well-documented.
- [Results] Results section: The statements that 'semantic embeddings add minimal independent value' and that 'formal structural features demonstrate that narrative organization per se carries predictive signal' require explicit ablation studies, feature-importance metrics, or cross-level correlation analyses. The current text provides none, leaving open the possibility that the macro-level advantage is an artifact of prompt leakage rather than genuine hierarchical structure.
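One way to probe the leakage concern raised in these comments is a first-order partial correlation: correlate macro scores with the outcome after controlling for a lexical feature. The sketch below uses the standard formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The function names and the synthetic data are assumptions for illustration, not the authors' analysis.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(x: Sequence[float], y: Sequence[float], z: Sequence[float]) -> float:
    """Correlation of x and y after removing the linear effect of z from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denom


# Synthetic worst case: macro score and outcome share only the lexical feature.
lexical = [1.0, 1.0, -1.0, -1.0]
macro_score = [2.0, 0.0, 0.0, -2.0]   # lexical + independent noise
outcome = [2.0, 0.0, -2.0, 0.0]       # lexical + different independent noise
raw = pearson(macro_score, outcome)
controlled = partial_correlation(macro_score, outcome, lexical)
print(raw, controlled)
```

In this constructed case the raw correlation is 0.5 but the partial correlation is zero (up to floating-point error), which is the pattern one would expect if macro scores merely re-expressed lexical cues. A partial correlation that stays well above zero would point the other way.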
minor comments (2)
- [Abstract] The abstract would be strengthened by a single sentence reporting the key quantitative result (e.g., macro F1 or AUC) and the exact number of classes or regression targets.
- [Dataset] Clarify whether the 830 texts are balanced across the three mental-health categories and whether any demographic or writing-length covariates were controlled.
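For reference, the macro F1 suggested in the first minor comment averages per-class F1 scores with equal weight, so it is sensitive to the class balance raised in the second. A minimal sketch follows; the function name, class labels, and predictions are hypothetical.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative labels only, not results from the paper.
y_true = ["dep", "dep", "anx", "anx", "tra", "tra"]
y_pred = ["dep", "anx", "anx", "anx", "tra", "dep"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Because each class counts equally, a model that ignores a small class is penalized more under macro F1 than under plain accuracy.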
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which identifies important opportunities to strengthen the transparency and rigor of our presentation. We address each major comment below and will make substantial revisions to the manuscript to incorporate the suggested improvements.
-
Referee: [Abstract] Abstract: The central claim that 'macro-level evaluation substantially outperforms lexical and embedding features' is asserted without any performance metrics, statistical tests, baseline definitions, or error analysis. This absence makes it impossible to evaluate whether the data support the claim or whether the reported gains are additive beyond what embeddings already encode.
Authors: We agree that the abstract should include quantitative evidence to support the central claim. In the revised manuscript, we will add specific performance metrics (accuracy, macro-F1), statistical test results (e.g., paired t-tests or McNemar tests with p-values), brief baseline definitions, and a summary of error patterns. These additions will allow readers to directly assess the magnitude and reliability of the reported outperformance.
revision: yes
-
Referee: [Methods] Methods (inferred from abstract description): No details are supplied on the LLM employed, the precise prompt template, the set of narrative dimensions scored, temperature, few-shot examples, or any human validation of the LLM ratings. Without these, it cannot be determined whether the macro scores reflect independent discourse organization (Labov/RST/propositional structure) or simply re-express lexical or stylistic cues already captured at the micro and meso levels, especially on Chinese text where tokenization and cultural alignment issues are well-documented.
Authors: We acknowledge that the current Methods section lacks sufficient implementation detail. In the revision, we will expand it with a new subsection that specifies the exact LLM (model name, version, and provider), the full prompt template (including the narrative dimensions explicitly mapped to Labov story grammar, RST relations, and propositional structure), all hyperparameters (temperature, top-p, few-shot examples if used), and quantitative human validation results (e.g., Pearson correlations or Cohen’s kappa between LLM scores and expert annotations on a held-out subset). We will also add a paragraph addressing Chinese-language tokenization and cultural alignment, with evidence that the macro scores target discourse-level constructs rather than lexical or stylistic leakage.
revision: yes
-
Referee: [Results] Results section: The statements that 'semantic embeddings add minimal independent value' and that 'formal structural features demonstrate that narrative organization per se carries predictive signal' require explicit ablation studies, feature-importance metrics, or cross-level correlation analyses. The current text provides none, leaving open the possibility that the macro-level advantage is an artifact of prompt leakage rather than genuine hierarchical structure.
Authors: We agree that the Results section requires additional quantitative support for the hierarchical claims. In the revised manuscript, we will insert new analyses including: (1) systematic ablation experiments that remove each level in turn and report the resulting performance drops, (2) feature-importance rankings (e.g., from logistic regression or random-forest coefficients) across the multi-level model, and (3) cross-level correlation matrices together with partial-correlation controls to quantify independence. We will also include a targeted check for prompt leakage by correlating macro scores against micro- and meso-level features and by reporting performance on lexical-only controls. These additions will directly address the concern that the macro advantage may be artifactual.
revision: yes
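The paired comparisons promised in the responses above (e.g., McNemar tests) could look like the following sketch, which runs an exact two-sided McNemar test on per-text correctness of two classifiers evaluated on the same texts. The function name and the correctness vectors are hypothetical; nothing here reproduces the authors' results.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired correctness indicators.

    Returns (b, c, p_value), where b counts texts only model A got right
    and c counts texts only model B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same texts")
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if bb and not a)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    # Under H0 the discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Illustrative per-text correctness (1 = correct), not data from the paper.
macro_correct = [bool(v) for v in [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]]
lexical_correct = [bool(v) for v in [1, 0, 0, 0, 0, 0, 0, 1, 1, 1]]
b, c, p_value = mcnemar_exact(macro_correct, lexical_correct)
print(b, c, p_value)
```

The test uses only the discordant pairs, so it asks whether one model wins the disagreements more often than chance would predict, rather than comparing two accuracy figures as if they came from independent samples.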
Circularity Check
No significant circularity; empirical framework is self-contained
The paper introduces a three-level framework (micro lexical features, meso semantic embeddings, macro LLM narrative evaluation) and reports empirical results showing macro-level evaluation outperforms baselines on 830 Chinese therapeutic texts for mental health prediction. The central claim rests on direct experimental comparisons against external mental-health labels, with grounding in discourse theory (Labov, RST) but no mathematical derivations, self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations that reduce outputs to inputs by construction. No equations or closed loops appear in the provided text; the superiority claim is tested rather than assumed via prior author work. This is the standard case of an honest empirical study with independent validation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM narrative evaluation can reliably assess macro-level discourse features such as Labov's story grammar and RST coherence in a clinically meaningful way
- domain assumption: The 830 Chinese therapeutic texts spanning depression, anxiety, and trauma constitute a representative sample for mental-health prediction