Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models

Adam Kuczynski; Alex Cohen; Angelina Pei-Tzu Tsai; Benjamin Buck; Changye Li; Dror Ben-Zeev; Feng Chen; Justin Tauscher; Meliha Yetisgen; Trevor Cohen

arxiv: 2605.24755 · v1 · pith:XMJV2QHNnew · submitted 2026-05-23 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-30 13:00 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords delusion detection · multi-agent language models · naturalistic audio diaries · persecutory ideation · clinical NLP · majority voting · mental health monitoring · transcript classification

The pith

Majority voting among three language models detects delusion-related content in naturalistic audio diary transcripts at Micro F1 scores of 0.872 for detection and 0.779 for classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a multi-agent system of large language models that extracts fine-grained evidence of delusional beliefs along with linked affective and behavioral responses from transcripts of free-form speech recorded by people experiencing moderate persecutory ideation. The system relies on detailed diagnostic prompts rather than task-specific training data. It finds that majority voting among agents outperforms conversational debate on ambiguous clinical material because debate tends to produce premature consensus. A sympathetic reader would care because the resulting pipeline offers a way to monitor changes in symptom expression at scale in everyday settings without requiring extensive human labeling.

Core claim

An ensemble of foundation models guided by diagnostic prompts and aggregated via majority voting forms a validated pipeline for multi-label extraction of delusional themes, affective responses, and behavioral responses from naturalistic audio diary transcripts; detailed prompts reduce false positives on theme classification while majority voting yields more reliable results than complex agent debate on clinically ambiguous text.

What carries the argument

Multi-agent LLM pipeline with majority voting for adjudication of delusion detection and multi-label classification.
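
A minimal sketch of how majority-vote adjudication over three agents might look, assuming each agent returns a binary presence flag and a set of theme labels per transcript. The function names and the example labels `referential` and `grandiose` are hypothetical and do not come from the paper; only the idea of a three-model majority vote does.

```python
from collections import Counter


def majority_vote_binary(votes):
    """Return True when a strict majority of agents flag delusion presence."""
    return sum(votes) > len(votes) / 2


def majority_vote_labels(agent_labels):
    """Keep every label asserted by a strict majority of agents.

    agent_labels: list of sets, one set of predicted labels per agent.
    """
    threshold = len(agent_labels) // 2 + 1  # 2 of 3 agents, 3 of 5, ...
    counts = Counter(label for labels in agent_labels for label in labels)
    return {label for label, n in counts.items() if n >= threshold}


# Hypothetical outputs from three agents for one transcript.
agent_outputs = [
    {"persecutory", "referential"},
    {"persecutory"},
    {"persecutory", "grandiose"},
]

print(majority_vote_labels(agent_outputs))        # {'persecutory'}
print(majority_vote_binary([True, True, False]))  # True
```

Voting per label, rather than on whole label sets, lets agents that disagree about a secondary theme still agree on the primary one. Whether the paper votes per label or per label set is not stated here.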

If this is right

Detailed diagnostic prompt instructions reduce false positives during classification of specific delusional themes.
Complex conversational debate among agents lowers accuracy on clinically ambiguous text through premature consensus.
Majority voting delivers more robust performance than debate-based adjudication frameworks.
The resulting pipeline supplies a scalable method for automated characterization of delusion-related content in naturalistic speech.
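
The claims above are stated in terms of false positives and micro-averaged scores. As a rough illustration, micro-averaging pools true positives, false positives, and false negatives across all transcripts and labels before computing precision, recall, and F1. The sketch below assumes gold and predicted labels are given as one set per transcript; the data and the label `referential` are invented for the example.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 for multi-label predictions.

    gold, pred: equal-length lists of label sets, one pair per transcript.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)  # labels predicted and present in the gold set
        fp += len(p - g)  # labels predicted but absent from the gold set
        fn += len(g - p)  # gold labels the prediction missed
    # When there is nothing to score, return 0.0 by convention.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Invented example: four transcripts with expert labels and model predictions.
gold = [{"persecutory"}, set(), {"persecutory", "referential"}, {"referential"}]
pred = [{"persecutory"}, {"referential"}, {"persecutory"}, {"referential"}]

precision, recall, f1 = micro_prf(gold, pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

A prompt change that removes false positives raises precision directly. If it also makes the model more conservative, recall can fall, which matches the trade-off the abstract reports for affective and behavioral responses.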

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompting and voting approach could be adapted to track other psychiatric symptoms if comparable diagnostic instructions are developed for them.
Embedding the pipeline in mobile apps might enable passive, longitudinal monitoring of symptom fluctuations in people with persecutory ideation.
Simpler aggregation rules such as majority voting may prove preferable to elaborate agent interactions in other clinical language tasks that involve interpretive ambiguity.
Results could shift if the underlying foundation models are replaced with newer versions or with models trained on different clinical corpora.

Load-bearing premise

That diagnostic prompt instructions can be written to cut false positives on delusional theme classification while the models still interpret affective and behavioral cues in ambiguous everyday speech.

What would settle it

A drop in Micro F1 below 0.872 for detection or 0.779 for classification when the same pipeline is run on a fresh collection of audio diary transcripts from a different cohort of participants with comparable persecutory ideation.
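
When comparing a new cohort against the reported scores, it helps to know how much Micro F1 varies by chance. One common option, not described in the paper, is a percentile bootstrap over transcripts. The sketch below is an assumption-laden illustration: the function names, the resample count, and the toy data are all hypothetical.

```python
import random


def micro_f1(gold, pred):
    """Micro-averaged F1 over paired lists of label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # 0.0 by convention if no labels


def bootstrap_ci(gold, pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Micro F1, resampling transcripts."""
    rng = random.Random(seed)
    n = len(gold)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        scores.append(micro_f1([gold[i] for i in idx], [pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data, invented for illustration only.
gold = [{"persecutory"}, set(), {"persecutory"}, {"persecutory"}, set()] * 4
pred = [{"persecutory"}, {"persecutory"}, {"persecutory"}, set(), set()] * 4

lower, upper = bootstrap_ci(gold, pred)
print(f"Micro F1 = {micro_f1(gold, pred):.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

Transcripts from the same participant are likely correlated, so resampling participants rather than individual transcripts would probably give a more honest interval.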

Figures

Figures reproduced from arXiv: 2605.24755 by Adam Kuczynski, Alex Cohen, Angelina Pei-Tzu Tsai, Benjamin Buck, Changye Li, Dror Ben-Zeev, Feng Chen, Justin Tauscher, Meliha Yetisgen, Trevor Cohen.

**Figure 2.** Micro-averaged precision (left) and recall (right) against human expert annotations for …

**Figure 3.** Micro-averaged F1 for the Three Independent Models and Tri-Model Majority Vote against human expert annotations for Delusion Type across the four prompt complexity levels.

… conversational adjudication reduced performance on Delusion Type down to a 0.545 micro-averaged F1. This degradation was even more severe on the high-stakes binary screening task of Delusion Presence, dropping to 0.667. A representative …

Original abstract

Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom exacerbation. Large language models (LLMs) offer new possibilities for automating this process, as they require annotated data primarily for evaluation rather than training. In this paper, we present a novel automated, multi-agent LLM pipeline for the fine-grained, multi-label extraction of language suggestive of delusional beliefs, associated affective responses, and behavioral responses from transcripts of naturalistic audio diaries collected from people with moderate persecutory ideation. Evaluating an ensemble of three foundation models, we demonstrate that detailed diagnostic prompt instructions successfully reduce false positives for delusional theme classification, but also constrain the interpretation of affective or behavioral responses. Furthermore, comparing multi-agent adjudication frameworks shows that complex conversational debate between agents diminishes accuracy on clinically ambiguous text by inducing premature consensus. Instead, majority voting establishes robust performance (Micro F1 of 0.872 and 0.779 for delusion detection and classification respectively). This work provides a validated and scalable pipeline for the automated detection and characterization of content suggesting delusional beliefs in naturalistic speech.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The F1 scores can't be fully trusted without details on how the ground truth was created.

The one thing to know is that without any report on how the transcripts were annotated or how consistent the labels are, the Micro F1 numbers of 0.872 and 0.779 don't tell us much about whether the pipeline actually works on this kind of data.

What the paper does is take three foundation models and set them up as agents to pull out delusional themes, affective responses, and behavioral responses from naturalistic audio diary transcripts. They test detailed diagnostic prompts and find they reduce false positives on themes but limit the models on the other parts. They also pit conversational debate against majority voting and show voting holds up better on ambiguous cases. That's a useful data point for multi-agent setups in clinical applications.

The evaluation is on held-out data, so no obvious fitting issues.

The soft spot is exactly the annotation reliability the stress test flags. The abstract mentions annotated data but gives zero details on raters or agreement. For clinically ambiguous speech, that matters a lot. If the labels are noisy, the scores just reflect that noise.

This paper is aimed at researchers doing digital phenotyping or automated symptom detection in psychosis. Someone building LLM tools for mental health monitoring would find the prompt and adjudication results worth looking at.

I would send it for peer review. The core idea is practical and the comparison is new enough to be worth referee feedback, provided they fill in the data description.

Referee Report

1 major / 0 minor

Summary. The paper proposes a multi-agent LLM pipeline for automated, fine-grained, multi-label extraction of delusion-related content (delusional themes, affective responses, behavioral responses) from naturalistic audio diary transcripts of individuals with moderate persecutory ideation. It evaluates an ensemble of three foundation models, shows that detailed diagnostic prompts reduce false positives on theme classification but constrain affective/behavioral interpretation, finds that conversational debate harms accuracy on ambiguous text, and reports that majority voting achieves Micro F1 of 0.872 (detection) and 0.779 (classification).

Significance. If the evaluation holds, the work supplies a scalable, low-training-data approach to characterizing mental-illness phenomenology in real-world speech and supplies a concrete comparison of multi-agent adjudication strategies that is relevant to clinical NLP. The observation that debate induces premature consensus on ambiguous material is a useful domain-specific finding.

major comments (1)

[Abstract and Evaluation] The central performance claims (Abstract) rest on Micro F1 scores of 0.872 and 0.779 obtained from majority voting over annotated transcripts, yet the manuscript supplies no dataset size, annotation protocol, number or qualifications of raters, or inter-rater reliability statistic (e.g., Cohen’s kappa). Because the transcripts are described as clinically ambiguous, the absence of these details makes it impossible to determine whether the reported scores reflect robust detection or alignment with noisy or inconsistent human labels.
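
For reference, Cohen's kappa corrects the raw agreement between two raters for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below assumes two raters assigning one categorical label per transcript. The ratings are invented, and nothing here reflects the paper's actual annotation protocol.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented binary "delusion present" ratings for eight transcripts.
rater_a = [1, 1, 0, 0, 1, 0, 1, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 1]

print(cohens_kappa(rater_a, rater_b))  # 0.5
```

Here the raters agree on 6 of 8 items (0.75), but half of that agreement is expected by chance, so kappa is 0.5. Model F1 against labels with modest kappa is harder to interpret than the same F1 against labels with high kappa.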

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for highlighting an important omission in our manuscript. We address the major comment below and will make the corresponding revisions.

Referee: [Abstract and Evaluation] The central performance claims (Abstract) rest on Micro F1 scores of 0.872 and 0.779 obtained from majority voting over annotated transcripts, yet the manuscript supplies no dataset size, annotation protocol, number or qualifications of raters, or inter-rater reliability statistic (e.g., Cohen’s kappa). Because the transcripts are described as clinically ambiguous, the absence of these details makes it impossible to determine whether the reported scores reflect robust detection or alignment with noisy or inconsistent human labels.

Authors: We agree that the submitted manuscript omitted these essential details, which are required to properly interpret the reported Micro F1 scores given the clinical ambiguity of the material. In the revised version we will add a dedicated subsection in the Methods section that reports: (1) the total number of transcripts and participants, (2) the full annotation protocol, (3) the number and qualifications of the raters, and (4) the inter-rater reliability statistic (Cohen’s kappa). These additions will directly address the referee’s concern and allow readers to evaluate the quality of the human labels.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; evaluation metrics are independent of model inputs

The paper reports Micro F1 scores from majority-voting LLM ensembles evaluated on held-out annotated transcripts. No equations, fitted parameters, or self-citations are used to derive the reported performance; the metrics are computed directly from comparison to external human labels. The abstract explicitly states that annotated data are used only for evaluation, not training. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text. The central claim therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an applied empirical study using off-the-shelf foundation models and prompt engineering; it introduces no mathematical free parameters, domain axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5752 in / 1127 out tokens · 35505 ms · 2026-06-30T13:00:38.740328+00:00 · methodology
