Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation
Pith reviewed 2026-05-21 20:43 UTC · model grok-4.3
The pith
Large language models generate consistent annotations for cognitive distortions, and detection models trained on those annotations outperform models trained on human labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that GPT-4 can produce consistent annotations for cognitive distortions across multiple independent runs, measured by Fleiss's Kappa of 0.78, and that training detection models on these LLM-generated annotations leads to improved performance on test sets compared to training on human-labeled data. It introduces a dataset-agnostic evaluation framework using Cohen's kappa as an effect size to allow fair comparisons between models from different datasets.
What carries the argument
Multiple independent LLM annotation runs to extract stable labeling patterns in subjective tasks, paired with a Cohen's kappa effect size measure for dataset-agnostic model evaluation.
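The consistency figure above is a Fleiss's kappa computed over repeated annotation runs. As a minimal sketch of how such a score is obtained, the Python below implements the standard Fleiss's kappa formula, treating each LLM run as one "rater". The label names (`distorted`, `none`) and the toy data are hypothetical and are not taken from the paper, which may use a different label set and a different number of runs.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Standard Fleiss's kappa.

    ratings: list of items; each item is a list with one label per run.
    categories: list of all possible labels.
    Every item must have the same number of labels (runs).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]

    # Per-item agreement: share of rater pairs that agree on the item.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    total = n_items * n_raters
    p_e = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical toy data: 4 texts, each annotated in 3 independent runs.
runs = [
    ["distorted", "distorted", "distorted"],
    ["none", "none", "none"],
    ["distorted", "distorted", "none"],
    ["none", "none", "none"],
]
kappa = fleiss_kappa(runs, ["distorted", "none"])
print(round(kappa, 3))
```

A high value from this computation says only that the runs agree with each other; it does not by itself say the labels are correct.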
If this is right
- Training on LLM annotations leads to higher test set performance in cognitive distortion detection models.
- The dataset-agnostic framework permits direct comparisons of model results across studies with varying datasets.
- LLMs serve as a scalable source of consistent labels for subjective NLP tasks where human agreement is low.
- Internal consistency from repeated LLM runs supports stronger downstream task performance.
Where Pith is reading between the lines
- Adopting this method could lower the cost and effort of building large training datasets for mental-health-related NLP applications.
- The stable patterns identified by LLMs might help create more standardized definitions of cognitive distortions for research.
- Testing these annotations against clinical validation data would show whether they capture real-world distortion patterns better than the existing human labels do.
- This technique could extend to other subjective labeling problems such as detecting emotions or biases in text.
Load-bearing premise
High agreement among repeated runs of the same LLM indicates that the annotations are more accurate or useful for detecting cognitive distortions in practice, rather than just being artifacts of the model's training or the prompt design.
What would settle it
An experiment showing that models trained on the LLM annotations perform no better than, or worse than, human-trained models when evaluated on a fresh set of annotations from multiple human experts or on real patient outcome data.
Original abstract
Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs such as GPT-4 can serve as consistent annotators for subjective cognitive distortion detection tasks, achieving Fleiss's Kappa of 0.78 across independent runs, and that models trained on these LLM-generated labels outperform those trained on human annotations. It further introduces a dataset-agnostic evaluation framework that uses Cohen's kappa as an effect-size measure to enable fair cross-dataset comparisons where standard metrics like F1 are inadequate.
Significance. If the central empirical claims hold after addressing verification gaps, the work would offer a practical, scalable alternative for generating training data in low-agreement subjective NLP tasks and a reusable evaluation protocol for cross-study comparisons. The concrete reporting of Fleiss's Kappa and downstream performance deltas is a strength, as is the focus on reproducibility via multiple LLM runs.
major comments (3)
- [Results] Results section: The claim that GPT-4 annotations yield improved test-set performance lacks essential details on training-set sizes, exact annotation prompts, the specific baseline models compared, and any statistical significance testing of the reported gains; without these, the central performance advantage cannot be verified or reproduced.
- [Abstract / Results] Abstract and results paragraph: The interpretation that intra-LLM Fleiss's Kappa of 0.78 demonstrates superior annotation quality for downstream detection rests on the untested assumption that high internal consistency reduces noise rather than capturing model-specific biases or prompt artifacts; the paper does not report alignment with held-out expert judgments or external clinical indicators to support this.
- [Methods / Evaluation] Evaluation framework description: The dataset-agnostic Cohen's kappa approach is presented as enabling fair comparisons, but the manuscript provides no explicit formula, normalization procedure, or worked example showing how effect sizes are computed across datasets with differing label distributions or class imbalances.
minor comments (2)
- [Abstract] The abstract states low human agreement as motivation but does not cite specific prior inter-annotator agreement figures from the cognitive distortion literature for context.
- [Methods] Notation for the proposed evaluation metric should be introduced with a clear equation or pseudocode to distinguish it from standard Cohen's kappa.
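For reference when reading the comments on the evaluation framework, the standard definition of Cohen's kappa is written out below. This is the textbook form only; the symbols $p_o$, $p_e$, $p_k^{(1)}$ and $p_k^{(2)}$ are notation introduced here, and whether the paper's effect-size variant adds any normalization on top of it is exactly what the referee asks the authors to spell out.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_e = \sum_{k} p_k^{(1)} \, p_k^{(2)}
```

Here $p_o$ is the observed proportion of items on which the two label sources agree (for a classifier, predictions versus reference labels), and $p_k^{(1)}$, $p_k^{(2)}$ are the proportions of items that each source assigns to class $k$, so $p_e$ is the agreement expected by chance given those marginals.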
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which has helped us strengthen the verifiability and transparency of the manuscript. We address each major comment below and have revised the paper accordingly where possible to improve reproducibility and address interpretive concerns.
- Referee: [Results] Results section: The claim that GPT-4 annotations yield improved test-set performance lacks essential details on training-set sizes, exact annotation prompts, the specific baseline models compared, and any statistical significance testing of the reported gains; without these, the central performance advantage cannot be verified or reproduced.
  Authors: We agree that these details are essential for reproducibility. In the revised manuscript, we have added the exact training-set sizes for each experiment, included the full annotation prompts in a new supplementary appendix, specified the baseline models (logistic regression, SVM, BERT, and RoBERTa variants), and incorporated statistical significance testing via bootstrap resampling with reported p-values confirming the performance improvements.
  revision: yes
- Referee: [Abstract / Results] Abstract and results paragraph: The interpretation that intra-LLM Fleiss's Kappa of 0.78 demonstrates superior annotation quality for downstream detection rests on the untested assumption that high internal consistency reduces noise rather than capturing model-specific biases or prompt artifacts; the paper does not report alignment with held-out expert judgments or external clinical indicators to support this.
  Authors: We clarify that our primary claim is not absolute superiority over human experts but rather that LLM consistency (Fleiss's kappa = 0.78) produces training labels that yield stronger downstream performance than the human annotations available in the datasets. This is supported by direct comparison to human-labeled training data. We acknowledge the referee's point on potential biases and have added a dedicated limitations paragraph discussing model-specific artifacts and the value of future alignment studies with held-out experts. However, obtaining new expert judgments falls outside the scope of the current work.
  revision: partial
- Referee: [Methods / Evaluation] Evaluation framework description: The dataset-agnostic Cohen's kappa approach is presented as enabling fair comparisons, but the manuscript provides no explicit formula, normalization procedure, or worked example showing how effect sizes are computed across datasets with differing label distributions or class imbalances.
  Authors: We appreciate this request for explicitness. The revised Methods section now includes the full mathematical definition of the Cohen's kappa effect-size measure, the normalization steps to adjust for differing label distributions and imbalances, and a concrete worked example comparing two datasets with varying class distributions to illustrate the computation.
  revision: yes
- New held-out expert judgments or clinical indicator alignments, which would require fresh data collection beyond the existing study resources.
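The first response above mentions significance testing via bootstrap resampling. As a rough illustration of what such a test can look like, the sketch below runs a paired bootstrap over test items and compares two models' accuracy. It is not the authors' procedure: the choice of accuracy as the metric, the one-sided p-value definition, the function name `paired_bootstrap`, and the toy predictions are all assumptions made for this example.

```python
import random


def paired_bootstrap(y_true, pred_a, pred_b, n_resamples=2000, seed=0):
    """Paired bootstrap on accuracy difference (model A minus model B).

    Returns (observed_difference, p_value), where p_value is the share
    of resamples in which A is NOT better than B (difference <= 0).
    """
    rng = random.Random(seed)
    n = len(y_true)
    # 1 if the model got the item right, 0 otherwise.
    correct_a = [int(p == t) for p, t in zip(pred_a, y_true)]
    correct_b = [int(p == t) for p, t in zip(pred_b, y_true)]
    observed = (sum(correct_a) - sum(correct_b)) / n

    not_better = 0
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping A/B paired.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx) / n
        if diff <= 0:
            not_better += 1
    return observed, not_better / n_resamples


# Hypothetical toy test set: A is the model trained on LLM labels,
# B is the model trained on human labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
pred_a = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
pred_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
diff, p_value = paired_bootstrap(y_true, pred_a, pred_b)
print(diff, p_value)
```

Resampling the same indices for both models keeps the comparison paired, so items that are hard for both models do not inflate the apparent difference.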
Circularity Check
No significant circularity: empirical comparison of LLM vs human annotations
The paper reports an experimental pipeline: multiple independent GPT-4 annotation runs, computation of Fleiss's kappa among those runs, training of downstream classifiers on the resulting labels, and direct performance comparison against models trained on human labels using held-out test data and a dataset-agnostic Cohen's kappa effect-size framework. None of these steps reduce by construction to quantities defined from the same inputs; the reported kappa values and accuracy deltas are measured outcomes, not algebraic identities or self-referential fits. No uniqueness theorems, ansatzes, or self-citations are invoked to force the central claims. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Cohen's kappa serves as an appropriate effect-size measure that enables fair comparisons across datasets with different characteristics where F1 scores are insufficient.
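To make the intuition behind this assumption concrete, the sketch below computes standard Cohen's kappa between reference labels and predictions and contrasts it with accuracy for a trivial classifier on two hypothetical test sets with different class balance. Accuracy stands in here for the "traditional metrics" the paper mentions; the datasets, label encoding, and the convention for the undefined case are assumptions of this example, and the paper's own effect-size framework may differ.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Standard (unweighted) Cohen's kappa between two label lists."""
    n = len(y_true)
    # Observed agreement.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the two marginal label distributions.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    p_e = sum(true_counts[l] * pred_counts[l] for l in labels) / (n * n)
    if p_e == 1.0:
        # Both sides use a single identical label; kappa is undefined.
        # Returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return (p_o - p_e) / (1 - p_e)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Two hypothetical test sets: balanced (50/50) and imbalanced (90/10).
balanced = [0] * 50 + [1] * 50
imbalanced = [0] * 90 + [1] * 10
# A trivial classifier that always predicts the negative class.
always_negative = [0] * 100

for name, y in [("balanced", balanced), ("imbalanced", imbalanced)]:
    print(name, accuracy(y, always_negative), cohen_kappa(y, always_negative))
```

The trivial classifier's accuracy changes with the class balance of the test set, while its kappa is zero on both, which is the kind of chance correction that motivates using kappa for cross-dataset comparison.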