Overview of the PsyDefDetect Shared Task at BioNLP 2026: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations

Hongbin Na; John Torous; Kailai Yang; Ling Chen; Rena Gao; Shaoxiong Ji; Sophia Ananiadou; Wei Wang; Yining Hua; Zhaoming Chen; Zimu Wang

arxiv: 2605.24907 · v1 · pith:QYNKHTSTnew · submitted 2026-05-24 · 💻 cs.CL

Pith reviewed 2026-06-30 12:30 UTC · model grok-4.3

classification 💻 cs.CL

keywords: psychological defense mechanisms · emotional support dialogues · shared task · DMRS · text classification · clinical NLP · class imbalance

The pith

A shared task on the PsyDefConv corpus shows that the best submitted system classifies defense-mechanism levels in support dialogues at 0.42 macro F1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PsyDefDetect, a shared task that asks NLP systems to label individual seeker utterances in emotional support dialogues with one of nine categories drawn from the Defense Mechanism Rating Scales. The new PsyDefConv corpus supplies 200 dialogues and 2336 annotated utterances that exhibit substantial inter-annotator agreement under the DMRS framework. Twenty-one teams submitted results; the strongest entry reached 0.420 macro F1, beating the prior fine-tuned baseline while exposing a bias toward the majority High-Adaptive class and a widening accuracy-to-macro-F1 gap caused by imbalance. The overview underscores the practical value of theory-aware and LLM-based methods for this fine-grained clinical classification task.

Core claim

Grounded in the clinically validated DMRS framework, the PsyDefDetect task shows that automated classification of seeker utterances into seven hierarchical defense-mechanism levels plus two auxiliary labels, given the preceding context, is feasible but leaves clear headroom. The top submitted system attains 0.420 macro F1 on the held-out PsyDefConv test set, surpassing the strongest fine-tuned baseline reported in the dataset paper.

What carries the argument

The nine-way label scheme (seven hierarchical DMRS levels plus two auxiliary labels), applied to each target seeker utterance given its preceding dialogue context.
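To make the input format concrete, here is a minimal sketch of how classification instances might be assembled: one instance per seeker utterance, paired with every turn that precedes it. The `Turn` and `Instance` types, the speaker strings, the example dialogue, and the `build_instances` helper are all illustrative assumptions, not the organizers' actual data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str  # assumed values: "seeker" or "supporter"
    text: str


@dataclass
class Instance:
    context: List[str]  # all turns before the target, oldest first
    target: str         # the seeker utterance to classify


def build_instances(dialogue: List[Turn]) -> List[Instance]:
    """Create one instance per seeker turn, using only preceding turns as context."""
    instances = []
    for i, turn in enumerate(dialogue):
        if turn.speaker == "seeker":
            context = [f"{t.speaker}: {t.text}" for t in dialogue[:i]]
            instances.append(Instance(context=context, target=turn.text))
    return instances


# Made-up dialogue for illustration only.
dialogue = [
    Turn("seeker", "I failed my exam, but honestly it doesn't matter."),
    Turn("supporter", "That sounds disappointing. How are you feeling about it?"),
    Turn("seeker", "I'm fine. I'll make a plan and retake it next term."),
]
instances = build_instances(dialogue)
```

Each resulting instance would then receive exactly one of the nine labels. Supporter turns appear only as context and are never classification targets in this sketch.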

If this is right

Theory-aware and LLM-based models become preferable for fine-grained defensive-function classification.
Class-imbalance handling becomes essential for closing the gap between accuracy and macro F1.
Over-prediction of the High-Adaptive class remains a systematic error that future systems must address.
Continued community work on this clinical-NLP intersection is invited through released task materials.
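The gap between accuracy and macro F1 can be reproduced on a toy example. The sketch below uses made-up labels and a degenerate system that always predicts the majority class; the function names and the data are assumptions for illustration, not taken from the task's evaluation scripts.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: List[str]) -> float:
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN).

    A class with no true positives, false positives, or false negatives
    is scored 0 here; other conventions exist.
    """
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 8 majority-class items and one item each of two minority classes.
labels = ["A", "B", "C"]
y_true = ["A"] * 8 + ["B", "C"]
y_pred = ["A"] * 10  # always predict the majority class

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred, labels)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

Because macro F1 weights every class equally, the two classes that are never predicted each contribute an F1 of zero, so the score falls well below accuracy even though 8 of 10 predictions are correct.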

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If DMRS annotations prove stable across languages, the same task design could transfer to non-English support corpora.
A system that reliably tags defense levels could be inserted into real-time support-chat monitors to flag low-adaptive patterns.
The best observed macro F1 of 0.42 suggests the nine-class taxonomy may need coarser or hierarchical evaluation metrics in follow-up work.

Load-bearing premise

Annotators can reliably assign DMRS levels to individual utterances using only the preceding dialogue context.

What would settle it

A fresh annotation round on the same PsyDefConv utterances that yields low or chance-level inter-annotator agreement.

Figures

Figures reproduced from arXiv: 2605.24907 by Hongbin Na, John Torous, Kailai Yang, Ling Chen, Rena Gao, Shaoxiong Ji, Sophia Ananiadou, Wei Wang, Yining Hua, Zhaoming Chen, Zimu Wang.

**Figure 2.** Per-class F1-scores across all submitted sys… (caption truncated in source)

**Figure 3.** Confusion matrices for the four top-ranked systems on the held-out test set. Rows denote ground-truth… (caption truncated in source)
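For readers who want to reproduce this kind of analysis on their own predictions, the following is a minimal sketch of a confusion matrix with rows as ground truth (matching the Figure 3 caption) and a simple over-prediction indicator. The label names, the toy data, and the `prediction_ratio` helper are assumptions introduced for illustration.

```python
from typing import Dict, List, Sequence


def confusion_matrix(y_true: Sequence[str], y_pred: Sequence[str],
                     labels: List[str]) -> List[List[int]]:
    """matrix[i][j] counts items with true label labels[i] predicted as labels[j]."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def prediction_ratio(matrix: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """Predicted count divided by true count per class; above 1 means over-prediction."""
    ratios = {}
    for k, label in enumerate(labels):
        true_count = sum(matrix[k])                  # row sum
        pred_count = sum(row[k] for row in matrix)   # column sum
        ratios[label] = pred_count / true_count if true_count else float("inf")
    return ratios


# Toy data with hypothetical label names.
labels = ["majority", "minor_1", "minor_2"]
y_true = ["majority"] * 6 + ["minor_1"] * 2 + ["minor_2"] * 2
y_pred = ["majority"] * 6 + ["majority", "minor_1"] + ["majority", "majority"]

matrix = confusion_matrix(y_true, y_pred, labels)
ratios = prediction_ratio(matrix, labels)
```

In this toy case the majority column absorbs errors from both minority rows, which is the pattern the overview describes for the High-Adaptive class.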

Original abstract

We present an overview of PsyDefDetect, the shared task on detecting levels of psychological defense mechanisms in emotional support dialogues, co-located with BioNLP@ACL 2026. Grounded in the clinically validated Defense Mechanism Rating Scales (DMRS) framework, the task asks systems to classify a target seeker utterance, given its preceding dialogue context, into one of nine categories: seven hierarchical DMRS levels plus two auxiliary labels. Participants worked on PsyDefConv, a newly released corpus of 200 dialogues and 2336 help-seeker utterances annotated under DMRS with substantial inter-annotator agreement. The task attracted 172 participants on CodaBench who produced 563 submissions, with 21 teams officially registering their results for the final ranking. The best system achieved a macro F1-score of 0.420, surpassing the strongest fine-tuned baseline reported in the dataset paper by a notable margin, yet leaving clear headroom. Our analysis highlights (i) a persistent tendency to over-predict the majority High-Adaptive class, (ii) a widening gap between accuracy and macro-F1 that reveals class-imbalance sensitivity, and (iii) the value of theory-aware and LLM-based approaches for fine-grained defensive-function classification. We release all task materials and invite the community to continue work on this novel intersection of clinical psychology and NLP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a standard shared task overview that releases a new DMRS-annotated corpus but leaves annotation reliability details thin and performance modest at 0.42 macro F1.

The main point is that this is a shared task overview paper that releases a new corpus for detecting DMRS defense mechanism levels in emotional support dialogues, with the top system scoring 0.42 macro F1.

What stands out is the PsyDefConv corpus itself—200 dialogues and 2336 seeker utterances annotated into nine categories including seven DMRS levels. The paper organizes the task, attracts a decent number of submissions, and releases the materials. It also does a good job documenting the practical issues: systems over-predict the majority class and macro F1 suffers from imbalance. That kind of analysis is useful for anyone thinking about similar fine-grained classification in clinical text.

The weaker parts are the details on how the annotations were produced. Calling the agreement "substantial" without a coefficient, annotator count, or per-class statistics leaves a gap, especially since the load-bearing premise above identifies annotation reliability as the key precondition for trusting the results. For a task this granular, more transparency on label reliability would strengthen the paper. The margin over the baseline is reported together with an acknowledgment of remaining headroom, which is honest but also makes clear that the problem is not solved.

This kind of paper is for the computational psychology and BioNLP community. Readers interested in new datasets or shared tasks in mental health applications will find it relevant. It shows clear engagement with the DMRS literature and reports results without overclaiming.

I would probably bring this to a reading group focused on applied NLP or clinical text. I would not cite it in the next year unless I use the corpus. It should be sent for peer review because documenting these tasks helps the field track progress on hard problems.

Referee Report

1 major / 0 minor

Summary. The manuscript is an overview of the PsyDefDetect shared task (BioNLP@ACL 2026) on classifying seeker utterances in emotional-support dialogues into nine categories: seven hierarchical DMRS defense-mechanism levels plus two auxiliary labels. It introduces the PsyDefConv corpus (200 dialogues, 2336 utterances) annotated under the DMRS framework with reported substantial inter-annotator agreement, describes participation (172 CodaBench users, 563 submissions, 21 official teams), reports that the winning system reached a macro F1 of 0.420 (surpassing the dataset-paper baseline), and analyzes persistent issues of majority-class over-prediction and accuracy-vs-macro-F1 divergence.

Significance. If the gold labels prove reliable, the work opens a clinically grounded evaluation setting for fine-grained defensive-function detection in dialogue, supplies a reproducible benchmark with released materials, and demonstrates that theory-aware and LLM-based systems can improve over standard fine-tuning baselines while exposing class-imbalance sensitivities that future work must address.

major comments (1)

[Abstract / annotation section] Abstract (and the annotation-protocol section): the central performance claim (best macro F1 = 0.420) and the comparison to the dataset-paper baseline presuppose that the 9-class DMRS labels constitute a low-noise gold standard. The only supporting statement supplied is “substantial inter-annotator agreement”; no numerical coefficient (e.g., Fleiss’ κ or Krippendorff’s α), no annotator count per utterance, no resolution procedure for hierarchical-level disagreements, and no per-class agreement breakdown are provided. In a 9-way task already shown to be majority-class biased, this quantitative gap directly affects the interpretability of the reported margin over baseline.
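As a concrete illustration of the kind of coefficient the referee asks for, the sketch below computes Fleiss' κ from a table of per-item category counts. The toy counts are invented and the function is a plain-Python sketch under the standard definition, not the procedure used for PsyDefConv. Under the commonly cited Landis and Koch convention, "substantial" corresponds to κ between 0.61 and 0.80, though the overview does not state whether the authors use the word in that sense.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a fixed number of raters per item.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of raters (>= 2)")
    n_categories = len(counts[0])

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_exp) / (1 - p_exp)


# Toy table: 4 items, 3 raters, 3 categories (invented numbers).
counts = [
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [0, 0, 3],
]
kappa = fleiss_kappa(counts)
print(f"Fleiss' kappa = {kappa:.3f}")
```

A per-class breakdown, as the referee requests, would need additional computation beyond this single pooled coefficient.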

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the single major comment below and will revise the manuscript to improve clarity on annotation reliability.

Referee: [Abstract / annotation section] Abstract (and the annotation-protocol section): the central performance claim (best macro F1 = 0.420) and the comparison to the dataset-paper baseline presuppose that the 9-class DMRS labels constitute a low-noise gold standard. The only supporting statement supplied is “substantial inter-annotator agreement”; no numerical coefficient (e.g., Fleiss’ κ or Krippendorff’s α), no annotator count per utterance, no resolution procedure for hierarchical-level disagreements, and no per-class agreement breakdown are provided. In a 9-way task already shown to be majority-class biased, this quantitative gap directly affects the interpretability of the reported margin over baseline.

Authors: We agree that the manuscript currently provides only a qualitative reference to "substantial inter-annotator agreement" and does not include the requested quantitative details. The full annotation protocol, including numerical coefficients (Fleiss’ κ and Krippendorff’s α), annotator counts, disagreement resolution (majority vote with senior adjudication), and per-class breakdowns, is documented in the referenced PsyDefConv dataset paper. To make the shared-task overview self-contained and to address the interpretability concern directly, we will insert a concise summary of these metrics and procedures into the annotation section of the revised version. This revision will not alter the reported results but will make the gold-standard claims more transparent.

Revision: yes
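The resolution rule named in the rebuttal, majority vote with senior adjudication, could look roughly like the sketch below. The strict-majority threshold, the label strings, and the `adjudicate` callback standing in for a senior annotator are all assumptions; the actual protocol is described only in the dataset paper.

```python
from collections import Counter
from typing import Callable, List


def resolve_label(votes: List[str], adjudicate: Callable[[List[str]], str]) -> str:
    """Return the strict-majority label, or defer to the adjudicator if none exists."""
    label, count = Counter(votes).most_common(1)[0]
    if count > len(votes) / 2:
        return label
    return adjudicate(votes)


def senior_annotator(votes: List[str]) -> str:
    """Hypothetical stand-in for a human decision; here it always returns 'level_5'."""
    return "level_5"


agreed = resolve_label(["level_7", "level_7", "level_6"], senior_annotator)
escalated = resolve_label(["level_7", "level_6", "level_5"], senior_annotator)
```

With three annotators, any two matching votes settle the label, and only three-way splits reach the adjudicator in this sketch.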

Circularity Check

0 steps flagged

No circularity: shared-task overview reports external results without derivations or self-referential fitting

The paper is a competition overview that describes the PsyDefConv corpus, participant submissions (172 participants, 563 submissions), and the winning macro F1 of 0.420 against a baseline from the dataset paper. No equations, predictions, or derivations appear. Claims rest on external team results and reported inter-annotator agreement rather than any fitted parameter renamed as prediction or self-citation chain that reduces the central result to its own inputs. The content is self-contained reporting of empirical outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a shared-task overview paper. No mathematical models, derivations, or theoretical constructs are introduced, so the ledger contains no entries.

pith-pipeline@v0.9.1-grok · 5813 in / 1089 out tokens · 29820 ms · 2026-06-30T12:30:33.727819+00:00 · methodology

Reference graph

Works this paper leans on

6 extracted references · 1 canonical work pages · 1 internal anchor

[1]

EmpCRL: Controllable empathetic response generation via in-context commonsense reasoning and reinforcement learning. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 5734–5746, Torino, Italia. ELRA and ICCL. Tong Chen, Zimu Wang, Yiyi Miao, Haoran Luo, Yu...

2024

[2]

CLPsych 2018 shared task: Predicting current and future psychological health from childhood essays. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pages 37–46, New Orleans, LA. Association for Computational Linguistics. Jiayuan Ma, Hongbin Na, Zimu Wang, Yining Hua, Yue Liu, Wei Wang, a...

2018

[3]

CS_Metro at PsyDefDetect: Detecting psychological defense mechanisms in mental health dialogues with summarization-enhanced transformer ensembles. In Proceedings of the 25th Workshop on Biomedical Language Processing (Shared Tasks), San Diego, CA, USA. Association for Computational Linguistics. Pritha Saha, Shuvodwip Saha, and Anik Mahmud Shanto. 2026....

2026

[4]

Large language models and empathy: Systematic review. J Med Internet Res, 26:e52597. Philipp Steigerwald, Eric Rudolph, and Jens Albrecht

[5]

Nürnberg NLP at PsyDefDetect: Multi-axis voter ensembles for psychological defence mechanism classification. In Proceedings of the 25th Workshop on Biomedical Language Processing (Shared Tasks), San Diego, CA, USA. Association for Computational Linguistics. Duc-Luong Tran, Phuong-Anh Chu, Hoang-Dat Do, Tu-Phuong Mai, Duy-Cat Can, and Hoang-Quynh Le. ...

2026

[6]

Do no harm: Exposing hidden vulnerabilities of LLMs via persona-based client simulation attack in psychological counseling. Preprint, arXiv:2604.04842. Zhen Xu, Sergio Escalera, Adrien Pavão, Magali Richard, Wei-Wei Tu, Quanming Yao, Huan Zhao, and Isabelle Guyon. 2022. Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform. Patter...
