Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation

Kairit Sirts; Navneet Agarwal; Neha Sharma

arxiv: 2511.01482 · v2 · pith:Y3QRMSYOnew · submitted 2025-11-03 · 💻 cs.CL

Neha Sharma, Navneet Agarwal, Kairit Sirts

Pith reviewed 2026-05-21 20:43 UTC · model grok-4.3

classification 💻 cs.CL

keywords: cognitive distortions, LLM annotation, subjective tasks, dataset-agnostic evaluation, consistency in annotations, GPT-4, Fleiss kappa, Cohen kappa


The pith

Large language models generate consistent annotations for cognitive distortions, improving detection model performance over human labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper addresses the challenge of subjective annotation in detecting cognitive distortions in text, where even experts show low agreement. It proposes using large language models like GPT-4 to create annotations by running them multiple times independently to find stable patterns. The authors also develop a dataset-agnostic evaluation method based on Cohen's kappa to compare models fairly across different datasets. Results indicate that GPT-4 achieves high consistency with a Fleiss's Kappa of 0.78, and models trained on these annotations perform better on test sets than those using human annotations. This points to LLMs as a scalable way to produce reliable training data for subjective tasks in natural language processing.

Core claim

The paper establishes that GPT-4 can produce consistent annotations for cognitive distortions across multiple independent runs, measured by Fleiss's Kappa of 0.78, and that training detection models on these LLM-generated annotations leads to improved performance on test sets compared to training on human-labeled data. It introduces a dataset-agnostic evaluation framework using Cohen's kappa as an effect size to allow fair comparisons between models from different datasets.
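For readers unfamiliar with the statistic, the following is a minimal sketch of Fleiss's kappa computed over repeated annotation runs. It treats each independent LLM run as one rater, which is an assumption about how the paper applies the measure; the function name and the toy labels are illustrative and do not come from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss's kappa for a list of items, each a list of labels (one per run).

    Every item must carry the same number of labels. Returns NaN when all
    labels are identical everywhere, because chance agreement is then 1.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of labels")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: mean share of agreeing rater pairs per item.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the pooled label proportions.
    total = n_items * n_raters
    p_e = sum((sum(cnt[c] for cnt in counts) / total) ** 2 for c in categories)

    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example: three texts, five hypothetical runs each.
toy_runs = [
    ["overgeneralization"] * 5,
    ["mind reading", "mind reading", "mind reading", "labeling", "mind reading"],
    ["no distortion"] * 5,
]
kappa = fleiss_kappa(toy_runs)
```

Values near 1 indicate that the runs almost always agree; the statistic says nothing by itself about whether the agreed label is correct.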

What carries the argument

Multiple independent LLM annotation runs to extract stable labeling patterns in subjective tasks, paired with a Cohen's kappa effect size measure for dataset-agnostic model evaluation.

If this is right

Training on LLM annotations leads to higher test set performance in cognitive distortion detection models.
The dataset-agnostic framework permits direct comparisons of model results across studies with varying datasets.
LLMs serve as a scalable source of consistent labels for subjective NLP tasks where human agreement is low.
Internal consistency from repeated LLM runs supports stronger downstream task performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Adopting this method could lower the cost and effort of building large training datasets for mental health related NLP applications.
The stable patterns identified by LLMs might help create more standardized definitions of cognitive distortions for research.
Testing these annotations against clinical validation data would show if they capture real-world distortion patterns better.
This technique could extend to other subjective labeling problems such as detecting emotions or biases in text.

Load-bearing premise

High agreement among repeated runs of the same LLM indicates that the annotations are more accurate or useful for detecting cognitive distortions in practice, rather than just being artifacts of the model's training or the prompt design.

What would settle it

An experiment showing that models trained on the LLM annotations perform no better than, or worse than, models trained on human labels when evaluated on a fresh set of annotations from multiple human experts or on real patient outcome data.

Figures

Figures reproduced from arXiv: 2511.01482 by Kairit Sirts, Navneet Agarwal, Neha Sharma.

Figure 2: Distribution of maximum label repetitions
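The caption suggests a per-item count of how often the most frequent label recurs across runs. A minimal sketch of that quantity follows, under the assumption that "maximum label repetitions" means exactly this count; the `min_repeats` threshold and the helper names are hypothetical and are not described on this page.

```python
from collections import Counter


def max_label_repetitions(run_labels):
    """Number of runs that produced the most frequent label for one item."""
    return Counter(run_labels).most_common(1)[0][1]


def stable_label(run_labels, min_repeats):
    """Return the most frequent label if it recurs at least `min_repeats`
    times, otherwise None (the item is treated as unstable).

    `min_repeats` is an illustrative cut-off, not a value from the paper.
    """
    label, count = Counter(run_labels).most_common(1)[0]
    return label if count >= min_repeats else None


def repetition_distribution(items):
    """Histogram: maximum repetition count -> number of items."""
    return Counter(max_label_repetitions(labels) for labels in items)


# Toy example with five hypothetical runs per text.
toy_items = [
    ["A", "A", "A", "B", "A"],
    ["A", "B", "C", "B", "C"],
    ["C", "C", "C", "C", "C"],
]
histogram = repetition_distribution(toy_items)
```

Items whose top label recurs in nearly every run are the "stable" cases; how ties and low-repetition items should be handled is a design choice this sketch leaves open.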

Original abstract

Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.
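As a reference point for the evaluation idea in the abstract, here is a minimal sketch of standard Cohen's kappa between a model's predictions and the gold labels of a test set: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the two label distributions. How the paper turns this into a cross-dataset effect size is not specified on this page, so the code shows only the textbook statistic; the function name and toy data are illustrative.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between two equal-length label sequences.

    Returns NaN when chance agreement is 1 (both sequences use a single,
    identical label), since the statistic is undefined there.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(y_true)

    # Observed agreement: fraction of positions with matching labels.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Chance agreement: product of the marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    p_e = sum(true_counts[label] * pred_counts[label] for label in labels) / (n * n)

    if p_e == 1.0:
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Toy example: 1 = distortion present, 0 = absent.
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
score = cohen_kappa(gold, pred)
```

Because p_e is derived from each test set's own label distribution, the score is chance-corrected per dataset, which is the property the abstract appeals to.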

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Multi-run GPT-4 labeling hits high internal consistency for cognitive distortions and adds a kappa-based cross-dataset eval, but the case for better quality than human labels rests on thin evidence.


The paper's main point is that running GPT-4 multiple times on cognitive distortion labeling produces stable outputs (Fleiss's kappa 0.78) and that models trained on those labels outperform ones trained on human labels. They also introduce a Cohen's kappa effect-size approach to compare results across datasets where F1 scores don't transfer well.

This pairing of repeated LLM runs for stability with the dataset-agnostic comparison framework is not something I recall from the cited prior work on this task. It usefully flags the low human agreement problem in subjective mental-health NLP and offers a practical workaround for generating larger training sets. The evaluation idea is a reasonable fix for the incomparability issue that plagues this area.

The soft spot is the central claim that the LLM labels are actually superior. High agreement across identical prompts mainly shows the model is reproducible with itself, which could just embed prompt or model biases rather than deliver cleaner or more accurate labels. The abstract reports performance gains but skips dataset sizes, exact prompts, baseline details, and any statistical tests, so the improvement is hard to evaluate. The concern that internal consistency does not equal external validity or clinical utility still stands on the evidence given.

This is aimed at researchers building or evaluating datasets for subjective NLP tasks in health applications. A reader already working on annotation pipelines or cross-study comparisons could pick up the framework and try it. The work has enough concrete numbers and a clear methodological suggestion to deserve peer review rather than a desk reject, though the authors will need to add validation against held-out expert judgments or other external checks to make the quality argument stick.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs such as GPT-4 can serve as consistent annotators for subjective cognitive distortion detection tasks, achieving Fleiss's Kappa of 0.78 across independent runs, and that models trained on these LLM-generated labels outperform those trained on human annotations. It further introduces a dataset-agnostic evaluation framework that uses Cohen's kappa as an effect-size measure to enable fair cross-dataset comparisons where standard metrics like F1 are inadequate.

Significance. If the central empirical claims hold after addressing verification gaps, the work would offer a practical, scalable alternative for generating training data in low-agreement subjective NLP tasks and a reusable evaluation protocol for cross-study comparisons. The concrete reporting of Fleiss's Kappa and downstream performance deltas is a strength, as is the focus on reproducibility via multiple LLM runs.

major comments (3)

[Results] Results section: The claim that GPT-4 annotations yield improved test-set performance lacks essential details on training-set sizes, exact annotation prompts, the specific baseline models compared, and any statistical significance testing of the reported gains; without these, the central performance advantage cannot be verified or reproduced.
[Abstract / Results] Abstract and results paragraph: The interpretation that intra-LLM Fleiss's Kappa of 0.78 demonstrates superior annotation quality for downstream detection rests on the untested assumption that high internal consistency reduces noise rather than capturing model-specific biases or prompt artifacts; the paper does not report alignment with held-out expert judgments or external clinical indicators to support this.
[Methods / Evaluation] Evaluation framework description: The dataset-agnostic Cohen's kappa approach is presented as enabling fair comparisons, but the manuscript provides no explicit formula, normalization procedure, or worked example showing how effect sizes are computed across datasets with differing label distributions or class imbalances.

minor comments (2)

[Abstract] The abstract states low human agreement as motivation but does not cite specific prior inter-annotator agreement figures from the cognitive distortion literature for context.
[Methods] Notation for the proposed evaluation metric should be introduced with a clear equation or pseudocode to distinguish it from standard Cohen's kappa.
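To illustrate why the referee's question about differing label distributions matters, the sketch below scores a trivial classifier that always predicts "distortion present" on two synthetic test sets with different prevalence. Its positive-class F1 equals 2p / (1 + p) for prevalence p, so it changes with the dataset, while standard Cohen's kappa is 0 in both cases. The data and function names are invented for illustration and nothing here reproduces the paper's framework.

```python
def f1_positive(y_true, y_pred):
    """F1 score for the positive class (label 1) in a binary task."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa_binary(y_true, y_pred):
    """Standard Cohen's kappa for binary labels (assumes p_e < 1)."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_pos, pred_pos = sum(y_true), sum(y_pred)
    p_e = (true_pos * pred_pos + (n - true_pos) * (n - pred_pos)) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)


def always_positive_scores(prevalence, n=1000):
    """Score an always-positive predictor on a synthetic test set."""
    n_pos = round(prevalence * n)
    y_true = [1] * n_pos + [0] * (n - n_pos)
    y_pred = [1] * n
    return f1_positive(y_true, y_pred), cohen_kappa_binary(y_true, y_pred)


f1_rare, kappa_rare = always_positive_scores(0.2)      # distortions are rare
f1_common, kappa_common = always_positive_scores(0.8)  # distortions are common
```

This only covers the degenerate case. Kappa is still sensitive to prevalence for non-trivial classifiers, which is why an explicit formula and worked example from the authors would help.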

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive feedback, which has helped us strengthen the verifiability and transparency of the manuscript. We address each major comment below and have revised the paper accordingly where possible to improve reproducibility and address interpretive concerns.


Referee: [Results] Results section: The claim that GPT-4 annotations yield improved test-set performance lacks essential details on training-set sizes, exact annotation prompts, the specific baseline models compared, and any statistical significance testing of the reported gains; without these, the central performance advantage cannot be verified or reproduced.

Authors: We agree that these details are essential for reproducibility. In the revised manuscript, we have added the exact training-set sizes for each experiment, included the full annotation prompts in a new supplementary appendix, specified the baseline models (logistic regression, SVM, BERT, and RoBERTa variants), and incorporated statistical significance testing via bootstrap resampling with reported p-values confirming the performance improvements.
revision: yes

Referee: [Abstract / Results] Abstract and results paragraph: The interpretation that intra-LLM Fleiss's Kappa of 0.78 demonstrates superior annotation quality for downstream detection rests on the untested assumption that high internal consistency reduces noise rather than capturing model-specific biases or prompt artifacts; the paper does not report alignment with held-out expert judgments or external clinical indicators to support this.

Authors: We clarify that our primary claim is not absolute superiority over human experts but rather that LLM consistency (Fleiss's kappa = 0.78) produces training labels that yield stronger downstream performance than the human annotations available in the datasets. This is supported by direct comparison to human-labeled training data. We acknowledge the referee's point on potential biases and have added a dedicated limitations paragraph discussing model-specific artifacts and the value of future alignment studies with held-out experts. However, obtaining new expert judgments falls outside the scope of the current work.
revision: partial

Referee: [Methods / Evaluation] Evaluation framework description: The dataset-agnostic Cohen's kappa approach is presented as enabling fair comparisons, but the manuscript provides no explicit formula, normalization procedure, or worked example showing how effect sizes are computed across datasets with differing label distributions or class imbalances.

Authors: We appreciate this request for explicitness. The revised Methods section now includes the full mathematical definition of the Cohen's kappa effect-size measure, the normalization steps to adjust for differing label distributions and imbalances, and a concrete worked example comparing two datasets with varying class distributions to illustrate the computation.
revision: yes
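The first simulated response mentions significance testing via bootstrap resampling without giving a procedure. One common variant is the paired bootstrap sketched below: test items are resampled with replacement, and the reported value is the fraction of resamples in which model A fails to beat model B. This is an assumption about what such a test could look like, not the authors' method; the function names, the accuracy metric, and the default of 2000 resamples are illustrative.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and gold label match."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_bootstrap(y_true, pred_a, pred_b, metric=accuracy,
                     n_resamples=2000, seed=0):
    """Paired bootstrap comparison of two models on one test set.

    Returns (observed_difference, fraction_not_better), where the second
    value is the share of resamples with metric(A) - metric(B) <= 0 and
    is often read as an approximate one-sided p-value.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(y_true)
    observed = metric(y_true, pred_a) - metric(y_true, pred_b)

    not_better = 0
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping A and B paired.
        idx = [rng.randrange(n) for _ in range(n)]
        gold = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if metric(gold, a) - metric(gold, b) <= 0:
            not_better += 1
    return observed, not_better / n_resamples


# Toy example: model A matches the gold labels, model B is always wrong.
gold_labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = list(gold_labels)
model_b = [1 - label for label in gold_labels]
diff, p_value = paired_bootstrap(gold_labels, model_a, model_b)
```

Any metric with the same `(y_true, y_pred)` signature, such as a kappa function, can be passed in place of `accuracy`.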

standing simulated objections not resolved

New held-out expert judgments or clinical indicator alignments, which would require fresh data collection beyond the existing study resources.

Circularity Check

0 steps flagged

No significant circularity: empirical comparison of LLM vs human annotations


The paper reports an experimental pipeline: multiple independent GPT-4 annotation runs, computation of Fleiss's kappa among those runs, training of downstream classifiers on the resulting labels, and direct performance comparison against models trained on human labels using held-out test data and a dataset-agnostic Cohen's kappa effect-size framework. None of these steps reduce by construction to quantities defined from the same inputs; the reported kappa values and accuracy deltas are measured outcomes, not algebraic identities or self-referential fits. No uniqueness theorems, ansatzes, or self-citations are invoked to force the central claims. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical reliability of LLM outputs for a subjective labeling task and on the appropriateness of kappa as a cross-dataset comparator. No new free parameters are introduced. The main axiom is the domain assumption that internal model consistency correlates with useful training signal for downstream detection.

axioms (1)

domain assumption: Cohen's kappa serves as an appropriate effect-size measure that enables fair comparisons across datasets with different characteristics where F1 scores are insufficient.
Invoked to justify the dataset-agnostic evaluation framework.



