RSPC: A Benchmark for Modeling Stress and Psychiatric Conditions in Digitally Mediated Relationships using Psychiatrist Annotations
Pith reviewed 2026-06-26 05:19 UTC · model grok-4.3
The pith
A new corpus of psychiatrist-annotated Reddit posts about long-distance relationships shows anxiety disorders strongly associate with chronic relational uncertainty.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The RSPC corpus supplies psychiatrist annotations for diagnostic categories, relational stressor triggers, and indications of relationship phase across 1,799 posts. Benchmarking across model families reveals clear task-dependent performance gaps and identifies strong associations between anxiety disorders and chronic relational uncertainty. These results support a shift in NLP mental health modeling from individual-centric to context-aware approaches that incorporate social and temporal dynamics of distress.
What carries the argument
The RSPC corpus of psychiatrist-annotated long-distance relationship posts, which supplies labels for multi-label disorder classification, relational trigger detection, and temporal phase prediction.
If this is right
- Model families exhibit distinct capabilities, with some excelling at disorder classification and others at relational trigger detection.
- Anxiety disorders associate strongly with chronic relational uncertainty in the annotated posts.
- Benchmark results enable fine-tuning or selection of models for specific relational mental health subtasks.
- The corpus supports evaluation of temporal phase prediction alongside disorder and trigger detection.
Where Pith is reading between the lines
- The benchmark could be applied to posts from other social platforms to test whether the observed model differences and anxiety associations hold beyond Reddit.
- Integrating relational trigger detection into screening tools might improve identification of context-specific distress patterns.
- Longitudinal studies using similar annotations could check whether model predictions align with actual changes in relationship phases or distress levels over time.
- The task-dependent findings suggest that future model development should include separate training objectives for classification versus trigger identification.
Load-bearing premise
Psychiatrist annotations of the Reddit posts provide accurate and reliable labels for diagnostic categories, relational stressor triggers, and relationship phases.
What would settle it
A replication study that finds the RSPC labels diverge from independent clinical diagnoses for the same individuals, or that reports inter-annotator agreement below 0.6, would undermine the claim that the dataset is reliable for these tasks.
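As a rough illustration of the agreement check, the sketch below computes Cohen's κ for two annotators on a single-label task and compares it with the 0.6 threshold. The choice of κ as the statistic is an assumption (the text above does not say which agreement measure the threshold refers to), and the function name and toy phase labels are hypothetical, not taken from RSPC.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items (single-label)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical relationship-phase labels for six posts (not real RSPC data).
annotator_1 = ["early", "early", "stable", "stable", "crisis", "crisis"]
annotator_2 = ["early", "stable", "stable", "stable", "crisis", "early"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}; below 0.6 threshold: {kappa < 0.6}")
```

For the multi-label tasks, a per-label κ or a set-aware measure such as Krippendorff's α would be the more natural choice; this sketch covers only the single-label case.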
Original abstract
In NLP, mental health conditions are often modeled as isolated phenomena, without interpersonal context. We use Reddit posts about long-distance relationships to capture both mental health distress and associated relational triggers. We introduce the Relational Stress and Psychiatry Corpus (RSPC) containing 1,799 Reddit posts annotated by psychiatrists for diagnostic categories, including the most prevalent mood disorders (anxiety and depression), relational stressor triggers, and indications of relationship phase. We benchmark seven fine-tuned transformer models and five large language models across multi-label disorder classification, relational trigger detection, and temporal phase prediction tasks. We find clear task-dependent differences between model families, with Claude-3-Haiku achieving the best disorder classification performance (Macro-F1 = 0.538) and GPT-4o obtaining the strongest relational trigger detection performance (Macro-F1 = 0.519), suggesting distinct model capabilities. We further find strong associations between anxiety disorders and chronic relational uncertainty. Overall, RSPC establishes a benchmark for NLP tasks that consider relational context and supports a shift from individual-centric to context-aware mental health modeling that captures the social and temporal dynamics of distress.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Relational Stress and Psychiatry Corpus (RSPC) consisting of 1,799 Reddit posts about long-distance relationships, annotated by psychiatrists for diagnostic categories (primarily anxiety and depression), relational stressor triggers, and relationship phases. It benchmarks seven fine-tuned transformer models and five LLMs on multi-label disorder classification, relational trigger detection, and temporal phase prediction tasks, reporting task-dependent performance differences (Claude-3-Haiku Macro-F1=0.538 on disorders; GPT-4o Macro-F1=0.519 on triggers) and a strong association between anxiety disorders and chronic relational uncertainty.
Significance. If the annotations are shown to be reliable, RSPC would provide a useful benchmark for shifting NLP mental-health modeling from isolated individual symptoms to relational and temporal context, with the reported model-family differences and anxiety-uncertainty association offering concrete, falsifiable claims for follow-up studies.
major comments (2)
- [Abstract] Abstract: the headline Macro-F1 scores and the anxiety–chronic-uncertainty association are derived directly from the psychiatrist labels on the 1,799 posts, yet the abstract supplies no inter-annotator agreement statistics, annotation guidelines, or external validation (e.g., against clinical records), rendering the support for all reported performance differences and associations impossible to evaluate.
- [Abstract] Abstract and benchmark framing: no information is given on data splits, statistical testing for the model comparisons, or exclusion criteria, so the claim of “clear task-dependent differences between model families” cannot be assessed for robustness.
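One common way to address the statistical-testing concern is a paired bootstrap over test posts. The sketch below is not the paper's procedure, which the abstract does not describe. It computes multi-label Macro-F1 and estimates how often one model beats another across resampled test sets. All function names, the toy labels, and the zero-score convention for unsupported labels are assumptions made for illustration.

```python
import random

def macro_f1(y_true, y_pred, n_labels):
    """Unweighted mean of per-label F1; inputs are lists of label-index sets."""
    scores = []
    for label in range(n_labels):
        tp = fp = fn = 0
        for truth, pred in zip(y_true, y_pred):
            if label in pred and label in truth:
                tp += 1
            elif label in pred:
                fp += 1
            elif label in truth:
                fn += 1
        denom = 2 * tp + fp + fn
        # Assumed convention: a label with no support and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels

def paired_bootstrap_win_rate(y_true, pred_a, pred_b, n_labels,
                              n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which model A strictly beats model B."""
    rng = random.Random(seed)
    n = len(y_true)
    wins = 0
    for _ in range(n_resamples):
        # Resample post indices with replacement; both models see the same sample.
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        score_a = macro_f1(truth, [pred_a[i] for i in idx], n_labels)
        score_b = macro_f1(truth, [pred_b[i] for i in idx], n_labels)
        if score_a > score_b:
            wins += 1
    return wins / n_resamples

# Hypothetical two-label toy data (e.g. 0 = anxiety, 1 = depression).
y_true = [{0}, {1}, {0, 1}, set(), {0}, {1}]
pred_a = [{0}, {1}, {0, 1}, set(), {0}, {1}]   # model A: perfect on this toy set
pred_b = [{0}, set(), {1}, {1}, {0}, {0}]      # model B: several errors

f1_a = macro_f1(y_true, pred_a, 2)
f1_b = macro_f1(y_true, pred_b, 2)
win_rate = paired_bootstrap_win_rate(y_true, pred_a, pred_b, 2)
print(f"Macro-F1 A = {f1_a:.3f}, B = {f1_b:.3f}, A beats B in {win_rate:.1%} of resamples")
```

A win rate near 50% would suggest the gap between two models is within resampling noise. Headline differences such as 0.538 versus a close competitor would need this kind of check on the actual test split.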
minor comments (1)
- [Abstract] Abstract: specify the exact fine-tuning procedures and hyper-parameters used for the seven transformer models.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We address each point below and will revise the abstract to improve transparency on annotation reliability and experimental framing while preserving conciseness.
- Referee: [Abstract] Abstract: the headline Macro-F1 scores and the anxiety–chronic-uncertainty association are derived directly from the psychiatrist labels on the 1,799 posts, yet the abstract supplies no inter-annotator agreement statistics, annotation guidelines, or external validation (e.g., against clinical records), rendering the support for all reported performance differences and associations impossible to evaluate.
  Authors: We agree the abstract should reference annotation quality to support the reported results. The full manuscript details inter-annotator agreement and the annotation protocol (following DSM-5 criteria) in Section 3. We will revise the abstract to include a brief clause on agreement statistics and guidelines. External validation against clinical records is not possible for this anonymized public Reddit corpus due to privacy and linkage constraints; we will add this as an explicit limitation in the discussion section.
  Revision: partial
- Referee: [Abstract] Abstract and benchmark framing: no information is given on data splits, statistical testing for the model comparisons, or exclusion criteria, so the claim of “clear task-dependent differences between model families” cannot be assessed for robustness.
  Authors: The full manuscript specifies the data split (Section 4.1), statistical testing for model comparisons (Section 4.3 and Table 3), and exclusion criteria (Section 2.2). We will revise the abstract to note the split and that differences are supported by statistical testing. This will allow readers to evaluate the robustness of the task-dependent model-family differences.
  Revision: yes
Circularity Check
No circularity: empirical benchmark with direct model evaluations
The paper constructs a new annotated corpus (RSPC) from Reddit posts and reports model performance metrics (Macro-F1 scores) plus one association finding on that corpus. No equations, parameter fits, derivations, or self-citation chains are present that could reduce any reported result to its own inputs by construction. All outputs are direct empirical measurements on the newly collected and labeled data; annotation reliability is a validity issue outside the scope of circularity analysis.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Reddit posts discussing long-distance relationships can be used as a valid proxy for studying mental health conditions and associated relational stressors.
- Domain assumption: Annotations performed by psychiatrists yield reliable multi-label diagnostic categories, trigger identifications, and phase labels.
Reference graph
Works this paper leans on
[7]
Post Collection. Reddit posts were collected from long-distance relationship communities (r/LongDistance, r/LDR) and filtered for narrative completeness, relational relevance, and linguistic consistency.
-
[8]
Guideline Training. Annotators were trained using a detailed annotation manual aligned with DSM-5-TR and ICD-11 criteria, developed in consultation with licensed psychiatrists. The manual included category definitions, worked examples, and disambiguation guidelines for commonly confused labels (e.g., ADJ vs. GAD, Silence Gap vs. Lack of Communication).
-
[9]
Independent Annotation. Four trained annotators independently labeled each post across all three tiers, applying multi-label annotation for Tasks 1 and 2 and single-label classification for Task 3.
-
[10]
Adjudication. Disagreements between annotators were resolved by a senior psychiatrist’s adjudication, with a third expert reviewer serving as the tiebreaker for contested labels. Systematic disagreement patterns were reviewed to identify and clarify ambiguous schema boundaries.
-
[11]
how can I trust him
Quality Assurance. Inter-annotator agreement was measured using Cohen’s κ, Fleiss’ κ, and Krippendorff’s α across all annotation tiers (see Appendix-1). Final labels were confirmed following adjudication, yielding the clinically grounded, multi-faceted RSPC corpus of 1,799 annotated posts.
Appendix-3: Dataset Examples
Table 6 presents three representat...
-
[12]
with β1 = 0.9, β2 = 0.999, ϵ = 1e−
-
[13]
Learning rate schedules used linear decay with warmup over the first 100 steps. B. Early Stopping. Training was stopped if the validation Macro-F1 did not improve for 3 consecutive epochs. Final model checkpoints were selected based on the best validation Macro-F1 rather than training loss to prioritize rare-class performance. C. Class Weighting. For mult...
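The schedule and stopping rule described in this excerpt can be sketched as follows. This is a minimal illustration, not the authors' training code. The 100-step warmup, the patience of 3 epochs, and checkpoint selection by validation Macro-F1 come from the excerpt. The base learning rate, total step count, validation scores, and function names are placeholder assumptions, since the excerpt elides those details.

```python
def linear_warmup_decay(step, total_steps, base_lr, warmup_steps=100):
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * (step / warmup_steps)
    remaining = max(total_steps - step, 0)
    return base_lr * (remaining / max(total_steps - warmup_steps, 1))

def select_checkpoint(val_macro_f1_per_epoch, patience=3):
    """Return (best_epoch, best_score, last_epoch_run) under early stopping.

    Training stops once validation Macro-F1 has failed to improve for
    `patience` consecutive epochs; the best-scoring epoch is kept.
    """
    best_epoch, best_score = -1, float("-inf")
    stale, last_epoch = 0, -1
    for epoch, score in enumerate(val_macro_f1_per_epoch):
        last_epoch = epoch
        if score > best_score:
            best_epoch, best_score, stale = epoch, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_score, last_epoch

# Placeholder values: base_lr and total_steps are NOT given in the excerpt.
BASE_LR, TOTAL_STEPS = 2e-5, 1000
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, TOTAL_STEPS, BASE_LR))

# Hypothetical validation Macro-F1 per epoch; the late 0.50 is never reached
# because three non-improving epochs (3, 4, 5) trigger the stop first.
val_scores = [0.40, 0.45, 0.47, 0.46, 0.47, 0.44, 0.50]
result = select_checkpoint(val_scores)
print(result)
```

Selecting on validation Macro-F1 rather than loss, as the excerpt notes, favors checkpoints that do well on rare classes. The toy run also shows the usual cost of a fixed patience: a later improvement can be missed.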