RSPC: A Benchmark for Modeling Stress and Psychiatric Conditions in Digitally Mediated Relationships using Psychiatrist Annotations

Johannes C. Eichstaedt; MSVPJ Sathvik; Parmitha Vangapandu; Sai Ganesh Mokkapati; Sathwik Narkedimilli; Simon See; Timothy Liu

arxiv: 2606.27247 · v1 · pith:RROMIG37new · submitted 2026-06-25 · 💻 cs.LG

Parmitha Vangapandu, Sai Ganesh Mokkapati, Sathwik Narkedimilli, MSVPJ Sathvik, Timothy Liu, Simon See, Johannes C. Eichstaedt

Pith reviewed 2026-06-26 05:19 UTC · model grok-4.3

classification 💻 cs.LG

keywords: mental health NLP · relational stress · psychiatrist annotations · benchmark dataset · anxiety disorders · long-distance relationships · transformer models · LLM evaluation

The pith

A new corpus of psychiatrist-annotated Reddit posts about long-distance relationships shows anxiety disorders strongly associate with chronic relational uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Relational Stress and Psychiatry Corpus (RSPC) containing 1,799 Reddit posts from long-distance relationships, annotated by psychiatrists for mood disorders such as anxiety and depression, relational stressor triggers, and relationship phases. It benchmarks seven fine-tuned transformer models and five large language models on three tasks: multi-label disorder classification, relational trigger detection, and temporal phase prediction. Results indicate task-dependent model differences, with Claude-3-Haiku reaching the highest Macro-F1 of 0.538 on disorder classification and GPT-4o reaching 0.519 on trigger detection. The work also reports strong associations between anxiety disorders and chronic relational uncertainty. This approach matters because most NLP mental health models treat conditions as isolated rather than embedded in interpersonal and temporal contexts.
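The headline numbers above are Macro-F1 scores on multi-label tasks. As a minimal sketch of what that metric measures, assuming the common definition (per-label binary F1, averaged with equal weight per label), the toy example below uses three hypothetical label columns and made-up predictions; it is not the paper's evaluation code.

```python
from typing import List, Sequence


def f1_for_label(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for one label column; returns 0.0 when the label never occurs."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(Y_true: List[List[int]], Y_pred: List[List[int]]) -> float:
    """Unweighted mean of per-label F1 over all label columns."""
    n_labels = len(Y_true[0])
    per_label = [
        f1_for_label([row[j] for row in Y_true], [row[j] for row in Y_pred])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Hypothetical toy data: columns stand for three illustrative labels.
Y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
Y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0]]
score = macro_f1(Y_true, Y_pred)
print(round(score, 4))
```

Because every label counts equally, the rare third label that the toy predictor misses entirely pulls the average down as much as a frequent label would, which is why Macro-F1 is sensitive to rare-class performance.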

Core claim

The RSPC corpus supplies psychiatrist annotations for diagnostic categories, relational stressor triggers, and indications of relationship phase across 1,799 posts. Benchmarking across model families reveals clear task-dependent performance gaps and identifies strong associations between anxiety disorders and chronic relational uncertainty. These results support a shift in NLP mental health modeling from individual-centric to context-aware approaches that incorporate social and temporal dynamics of distress.

What carries the argument

The RSPC corpus of psychiatrist-annotated long-distance relationship posts, which supplies labels for multi-label disorder classification, relational trigger detection, and temporal phase prediction.

If this is right

Model families exhibit distinct capabilities, with some excelling at disorder classification and others at relational trigger detection.
Anxiety disorders associate strongly with chronic relational uncertainty in the annotated posts.
Benchmark results enable fine-tuning or selection of models for specific relational mental health subtasks.
The corpus supports evaluation of temporal phase prediction alongside disorder and trigger detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be applied to posts from other social platforms to test whether the observed model differences and anxiety associations hold beyond Reddit.
Integrating relational trigger detection into screening tools might improve identification of context-specific distress patterns.
Longitudinal studies using similar annotations could check whether model predictions align with actual changes in relationship phases or distress levels over time.
The task-dependent findings suggest that future model development should include separate training objectives for classification versus trigger identification.

Load-bearing premise

Psychiatrist annotations of the Reddit posts provide accurate and reliable labels for diagnostic categories, relational stressor triggers, and relationship phases.

What would settle it

A replication study that compared the RSPC labels against independent clinical diagnoses for the same individuals and found poor correspondence, or one that reported inter-annotator agreement below 0.6, would undermine the dataset's reliability for the claimed tasks.
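To make the agreement threshold concrete, here is a minimal sketch of Fleiss' κ, one standard agreement statistic for more than two raters. The choice of statistic, the four-rater setup, and the rating counts are assumptions for illustration; none of the numbers come from the paper.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)
    ]
    p_chance = sum(p * p for p in proportions)

    if p_chance == 1.0:  # all ratings fall in a single category
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical ratings: 5 posts, 4 raters, columns = [label present, label absent].
toy_counts = [[4, 0], [3, 1], [0, 4], [4, 0], [1, 3]]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 3))
```

In this toy case three of five posts have unanimous ratings, yet κ still lands just under 0.6, which shows how quickly a few split decisions lower chance-corrected agreement.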

Figures

Figures reproduced from arXiv: 2606.27247 by Johannes C. Eichstaedt, MSVPJ Sathvik, Parmitha Vangapandu, Sai Ganesh Mokkapati, Sathwik Narkedimilli, Simon See, Timothy Liu.

**Figure 1.** Workflow diagram of RSPC. Consistent with prior mental health NLP research, only publicly accessible posts were used, with no attempts made to infer user identities or contact individuals directly. The annotation framework was developed in consultation with a team of four licensed psychiatrists from Andhra University and grounded in DSM-5-TR (Association,…

**Figure 2.** Relational trigger distribution across high-anxiety (GAD-positive) and low-anxiety (GAD-negative) …

**Figure 3.** Conditional probability matrix P(Column Disorder | Row Disorder) for psychiatric symptom categories in RSPC (n = 1,799). Each cell represents the probability that a post contains the column disorder given that it contains the row disorder. Diagonal entries are 1.00 by definition. The anxiety cluster (SAD, GAD, ADJ) exhibits strong mutual overlap, while MDD and insomnia display asymmetric comorbidity patte…

**Figure 4.** Row-normalized disorder–trigger co-occurrence heatmap with Ward linkage hierarchical clustering on the …

**Figure 5.** RSPC Annotation Taxonomy. Each Reddit post is annotated along three complementary dimensions: …

**Figure 6.** Training and validation loss curves for BERT-base across Tasks 1-3. Vertical dashed lines indicate early …
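The Figure 3 caption describes a matrix of P(Column Disorder | Row Disorder). A minimal sketch of how such a matrix can be computed from multi-label annotations follows; the post label sets are invented for illustration, and the function name `conditional_matrix` is hypothetical rather than taken from the paper.

```python
from typing import Dict, List, Set


def conditional_matrix(
    post_labels: List[Set[str]], names: List[str]
) -> Dict[str, Dict[str, float]]:
    """Return m[row][col] = P(col label present | row label present)."""
    matrix: Dict[str, Dict[str, float]] = {}
    for row in names:
        posts_with_row = [labels for labels in post_labels if row in labels]
        matrix[row] = {}
        for col in names:
            if not posts_with_row:
                # Undefined when the conditioning label never occurs.
                matrix[row][col] = float("nan")
            else:
                hits = sum(1 for labels in posts_with_row if col in labels)
                matrix[row][col] = hits / len(posts_with_row)
    return matrix


# Hypothetical label sets for five posts (not RSPC data).
toy_posts = [
    {"GAD", "SAD"},
    {"GAD"},
    {"MDD", "INSOMNIA"},
    {"MDD"},
    {"GAD", "MDD"},
]
label_names = ["GAD", "SAD", "MDD", "INSOMNIA"]
cond = conditional_matrix(toy_posts, label_names)

for row in label_names:
    print(row, [round(cond[row][col], 2) for col in label_names])
```

The matrix is not symmetric: in the toy data every post with insomnia also has MDD, but only a third of MDD posts mention insomnia, which is the kind of asymmetry the caption refers to.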

Original abstract

In NLP, mental health conditions are often modeled as isolated phenomena, without interpersonal context. We use Reddit posts about long-distance relationships to capture both mental health distress and associated relational triggers. We introduce the Relational Stress and Psychiatry Corpus (RSPC) containing 1,799 Reddit posts annotated by psychiatrists for diagnostic categories, including the most prevalent mood disorders (anxiety and depression), relational stressor triggers, and indications of relationship phase. We benchmark seven fine-tuned transformer models and five large language models across multi-label disorder classification, relational trigger detection, and temporal phase prediction tasks. We find clear task-dependent differences between model families, with Claude-3-Haiku achieving the best disorder classification performance (Macro-F1 = 0.538) and GPT-4o obtaining the strongest relational trigger detection performance (Macro-F1 = 0.519), suggesting distinct model capabilities. We further find strong associations between anxiety disorders and chronic relational uncertainty. Overall, RSPC establishes a benchmark for NLP tasks that consider relational context and supports a shift from individual-centric to context-aware mental health modeling that captures the social and temporal dynamics of distress.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RSPC gives a new relational-context corpus for mental health NLP, but the results depend on unverified psychiatrist labels.

The main thing to know is that this paper introduces the RSPC corpus of 1,799 Reddit posts from long-distance relationships, annotated by psychiatrists for anxiety, depression, relational stressor triggers, and relationship phases. It then benchmarks seven fine-tuned transformers and five LLMs on multi-label disorder classification, trigger detection, and phase prediction, reporting task-dependent differences (Claude-3-Haiku at 0.538 Macro-F1 on disorders, GPT-4o at 0.519 on triggers) plus an association between anxiety and chronic relational uncertainty.

The work does a reasonable job of building a dataset that explicitly connects mental health labels to interpersonal context and temporal phases, which addresses the gap the abstract notes in isolated-phenomena modeling. The observation that different model families perform better on different tasks is a straightforward empirical point that could help guide choices in this subfield.

The soft spots center on the annotations. All the reported scores and the association finding rest directly on the psychiatrist labels treated as ground truth, yet the abstract supplies no inter-annotator agreement, annotation protocol, or validation against clinical records. Without those, the performance gaps and the anxiety-uncertainty link are difficult to interpret. The abstract also omits data splits, exclusion criteria, and statistical testing, which adds to the evaluation challenge.

This paper is for NLP researchers working on digital mental health who want benchmarks that incorporate relational stressors. A reader focused on context-aware modeling would get value from the corpus itself and the task setup, provided the full paper supplies the missing annotation details.

I would send it for peer review so the annotation process and experimental rigor can be examined.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Relational Stress and Psychiatry Corpus (RSPC) consisting of 1,799 Reddit posts about long-distance relationships, annotated by psychiatrists for diagnostic categories (primarily anxiety and depression), relational stressor triggers, and relationship phases. It benchmarks seven fine-tuned transformer models and five LLMs on multi-label disorder classification, relational trigger detection, and temporal phase prediction tasks, reporting task-dependent performance differences (Claude-3-Haiku Macro-F1=0.538 on disorders; GPT-4o Macro-F1=0.519 on triggers) and a strong association between anxiety disorders and chronic relational uncertainty.

Significance. If the annotations are shown to be reliable, RSPC would provide a useful benchmark for shifting NLP mental-health modeling from isolated individual symptoms to relational and temporal context, with the reported model-family differences and anxiety-uncertainty association offering concrete, falsifiable claims for follow-up studies.

major comments (2)

[Abstract] Abstract: the headline Macro-F1 scores and the anxiety–chronic-uncertainty association are derived directly from the psychiatrist labels on the 1,799 posts, yet the abstract supplies no inter-annotator agreement statistics, annotation guidelines, or external validation (e.g., against clinical records), rendering the support for all reported performance differences and associations impossible to evaluate.
[Abstract] Abstract and benchmark framing: no information is given on data splits, statistical testing for the model comparisons, or exclusion criteria, so the claim of “clear task-dependent differences between model families” cannot be assessed for robustness.

minor comments (1)

[Abstract] Abstract: specify the exact fine-tuning procedures and hyper-parameters used for the seven transformer models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and will revise the abstract to improve transparency on annotation reliability and experimental framing while preserving conciseness.

Referee: [Abstract] Abstract: the headline Macro-F1 scores and the anxiety–chronic-uncertainty association are derived directly from the psychiatrist labels on the 1,799 posts, yet the abstract supplies no inter-annotator agreement statistics, annotation guidelines, or external validation (e.g., against clinical records), rendering the support for all reported performance differences and associations impossible to evaluate.

Authors: We agree the abstract should reference annotation quality to support the reported results. The full manuscript details inter-annotator agreement and the annotation protocol (following DSM-5 criteria) in Section 3. We will revise the abstract to include a brief clause on agreement statistics and guidelines. External validation against clinical records is not possible for this anonymized public Reddit corpus due to privacy and linkage constraints; we will add this as an explicit limitation in the discussion section. revision: partial

Referee: [Abstract] Abstract and benchmark framing: no information is given on data splits, statistical testing for the model comparisons, or exclusion criteria, so the claim of “clear task-dependent differences between model families” cannot be assessed for robustness.

Authors: The full manuscript specifies the data split (Section 4.1), statistical testing for model comparisons (Section 4.3 and Table 3), and exclusion criteria (Section 2.2). We will revise the abstract to note the split and that differences are supported by statistical testing. This will allow readers to evaluate the robustness of the task-dependent model-family differences. revision: yes
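The exchange above turns on whether score gaps between models survive statistical testing. The page does not say which test the paper uses, so the following is only a sketch of one common option, a paired bootstrap over test posts. The function names, the accuracy metric, and the toy predictions are all assumptions.

```python
import random
from typing import Callable, List, Sequence


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Share of items where the prediction matches the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def paired_bootstrap(
    gold: List[int],
    pred_a: List[int],
    pred_b: List[int],
    metric: Callable[[Sequence[int], Sequence[int]], float] = accuracy,
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of resamples in which system A fails to beat system B.

    The same resampled post indices are used for both systems, which keeps
    the comparison paired. Small values suggest A's advantage is stable.
    """
    rng = random.Random(seed)
    n = len(gold)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        score_a = metric(g, [pred_a[i] for i in idx])
        score_b = metric(g, [pred_b[i] for i in idx])
        if score_a <= score_b:
            not_better += 1
    return not_better / n_resamples


# Hypothetical single-label toy data.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
system_a = [1, 0, 1, 1, 0, 0, 1, 1]   # one error
system_b = [1, 1, 0, 1, 0, 1, 1, 1]   # four errors
p_value = paired_bootstrap(gold, system_a, system_b)
print(p_value)
```

Passing a multi-label Macro-F1 function as `metric` would adapt the same procedure to the benchmark's tasks, at the cost of recomputing the metric on every resample.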

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct model evaluations

The paper constructs a new annotated corpus (RSPC) from Reddit posts and reports model performance metrics (Macro-F1 scores) plus one association finding on that corpus. No equations, parameter fits, derivations, or self-citation chains are present that could reduce any reported result to its own inputs by construction. All outputs are direct empirical measurements on the newly collected and labeled data; annotation reliability is a validity issue outside the scope of circularity analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central contribution rests on the creation and use of a new annotated corpus whose validity depends on domain assumptions about social media text and expert labeling rather than new free parameters or invented entities.

axioms (2)

Domain assumption: Reddit posts discussing long-distance relationships can be used as a valid proxy for studying mental health conditions and associated relational stressors.
This choice of data source is central to the corpus construction described in the abstract.
Domain assumption: Annotations performed by psychiatrists yield reliable multi-label diagnostic categories, trigger identifications, and phase labels.
Invoked directly in the creation of the RSPC corpus and subsequent benchmarking.

pith-pipeline@v0.9.1-grok · 5763 in / 1371 out tokens · 39870 ms · 2026-06-26T05:19:25.201666+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 3 linked inside Pith

[1]

In Proceedings of the conference on fairness, accountability, and transparency, pages 79–88

A taxonomy of ethical tensions in inferring mental health states from social media. In Proceedings of the conference on fairness, accountability, and transparency, pages 79–88. Lee Anna Clark and David Watson. 1991. Tripartite model of anxiety and depression: psychometric evidence and taxonomic implications. Journal of abnormal psychology, 100(3):316....

1991
[2]

Zhijun Guo, Alvina Lai, Johan H Thygesen, Joseph Farrington, Thomas Keen, and Kezhi Li

ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences, 120(30):e2305016120. Zhijun Guo, Alvina Lai, Johan H Thygesen, Joseph Farrington, Thomas Keen, and Kezhi Li. 2024. Large language models for mental health applications: systematic review. JMIR mental health, 11(1):e57400. Keith Harrigian, Carlos...

2024
[3]

In Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access, pages 15–24

On the state of social media data for mental health research. In Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access, pages 15–24. Allison G Harvey. 2002. A cognitive model of insomnia. Behaviour research and therapy, 40(8):869–893. Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhan...

Pith/arXiv arXiv 2002
[4]

In International conference of the cross-language evaluation forum for european languages, pages 343–361

Overview of eRisk: early risk prediction on the internet. In International conference of the cross-language evaluation forum for european languages, pages 343–361. Springer. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Daniel M Low, Laurie Rumker, Tanya Talkar, John Torous, Guillermo Cecc...

Pith/arXiv arXiv 2017
[5]

Sean MacAvaney, Bart Desmet, Arman Cohan, Luca Soldaini, Andrew Yates, Ayah Zirikly, and Nazli Goharian

Natural language processing reveals vulnerable mental health support groups and heightened health anxiety on reddit during covid-19: Observational study. Journal of medical Internet research, 22(10):e22635. Sean MacAvaney, Bart Desmet, Arman Cohan, Luca Soldaini, Andrew Yates, Ayah Zirikly, and Nazli Goharian. 2018. Rsdd-time: Temporal annotation of ...

2018
[6]

Charles M Morin

Comorbidity of anxiety and unipolar mood disorders. Fear and anxiety, pages 113–148. Charles M Morin. 1993. Insomnia: Psychological assessment and management. Guilford press. Moin Nadeem. 2016. Identifying depression on twitter. arXiv preprint arXiv:1607.07384. Carman Neustaedter and Saul Greenberg. 2012. Intimacy in long-distance relationships over vide...

Pith/arXiv arXiv 1993
[7]

Post Collection. Reddit posts were collected from long-distance relationship communities (r/LongDistance, r/LDR) and filtered for narrative completeness, relational relevance, and linguistic consistency
[8]

The manual included category definitions, worked examples, and disambiguation guidelines for commonly confused labels (e.g., ADJ vs

Guideline Training. Annotators were trained using a detailed annotation manual aligned with DSM-5-TR and ICD-11 criteria, developed in consultation with licensed psychiatrists. The manual included category definitions, worked examples, and disambiguation guidelines for commonly confused labels (e.g., ADJ vs. GAD, Silence Gap vs. Lack of Communication)
[9]

Independent Annotation. Four trained annotators independently labeled each post across all three tiers, applying multi-label annotation for Tasks 1 and 2 and single-label classification for Task 3
[10]

Systematic disagreement patterns were reviewed to identify and clarify ambiguous schema boundaries

Adjudication. Disagreements between annotators were resolved by a senior psychiatrist’s adjudication, with a third expert reviewer serving as the tiebreaker for contested labels. Systematic disagreement patterns were reviewed to identify and clarify ambiguous schema boundaries
[11]

how can I trust him

Quality Assurance. Inter-annotator agreement was measured using Cohen’s κ, Fleiss’ κ, and Krippendorff’s α across all annotation tiers (see Appendix-1). Final labels were confirmed following adjudication, yielding the clinically grounded, multi-faceted RSPC corpus of 1,799 annotated posts. Appendix-3: Dataset Examples Table 6 presents three representat...
[12]

with β1 = 0.9, β2 = 0.999, ϵ = 1e−
[13]

We’ve been apart for 8 months, and I can’t stop crying. I feel like nothing matters anymore. I don’t even enjoy the things I used to love. I just want this pain to end

Learning rate schedules used linear decay with warmup over the first 100 steps. B. Early Stopping. Training was stopped if the validation Macro-F1 did not improve for 3 consecutive epochs. Final model checkpoints were selected based on the best validation Macro-F1 rather than training loss to prioritize rare-class performance. C. Class Weighting. For mult...
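The training details in the excerpt above (linear decay with a 100-step warmup, and early stopping after 3 epochs without validation Macro-F1 improvement) can be sketched as follows. The peak learning rate, total step count, and validation scores are hypothetical placeholders, and this is a reading of the excerpt rather than the authors' code.

```python
def linear_warmup_decay(
    step: int,
    total_steps: int,
    warmup_steps: int = 100,   # from the excerpt: warmup over the first 100 steps
    peak_lr: float = 2e-5,     # hypothetical value, not stated in the excerpt
) -> float:
    """Learning rate that rises linearly to peak_lr, then decays linearly to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


class EarlyStopper:
    """Stop when validation Macro-F1 has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3) -> None:
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = -1      # epoch whose checkpoint would be kept
        self.bad_epochs = 0

    def update(self, epoch: int, val_macro_f1: float) -> bool:
        """Record one epoch's score; return True if training should stop."""
        if val_macro_f1 > self.best_score:
            self.best_score = val_macro_f1
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation Macro-F1 per epoch.
val_scores = [0.40, 0.45, 0.44, 0.43, 0.42, 0.50]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, score in enumerate(val_scores):
    if stopper.update(epoch, score):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch)
```

Selecting the checkpoint at `best_epoch` rather than the last epoch mirrors the excerpt's point that checkpoints were chosen by best validation Macro-F1 rather than training loss.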
