arxiv: 2604.24376 · v1 · submitted 2026-04-27 · 💻 cs.CL

Recognition: unknown

Learning Evidence of Depression Symptoms via Prompt Induction

Eliseo Bao , Anxo Perez , David Otero , Javier Parapar

Authors on Pith no claims yet

Pith reviewed 2026-05-08 03:48 UTC · model grok-4.3

classification 💻 cs.CL

keywords depression symptomsprompt inductionsymptom classificationlarge language modelsBDI-Senmental health NLPimbalanced text classificationguideline prompting

0 comments

The pith

Symptom Induction turns labeled examples into guidelines that improve LLM classification of depression symptoms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Detecting specific symptoms of depression in online posts is challenging because the 21 symptoms from the BDI-II questionnaire are fine-grained and appear at very different rates in text. Standard prompting and fine-tuning of large language models often fail to maintain consistent standards for what text counts as evidence of each symptom. The paper proposes Symptom Induction, a technique that distills sets of labeled examples into concise, human-readable guidelines defining relevance criteria for every symptom. These guidelines are then used to condition the model's classification decisions. Experiments across multiple model families show that this yields higher weighted F1 scores than baselines, with the largest benefits on the rarest symptoms, and that the guidelines transfer effectively to texts about related conditions like bipolar disorder and eating disorders.

Core claim

Symptom Induction (SI) derives short interpretable guidelines from labeled examples that specify what counts as evidence for each of the 21 BDI-II depression symptoms and conditions LLM classification on these guidelines. On the BDI-Sen dataset, SI achieves the highest overall weighted F1 score among zero-shot, in-context learning, and fine-tuning approaches, with especially large gains for infrequent symptoms. The induced guidelines also generalize to an external dataset covering bipolar and eating disorder texts that share symptomatology.

What carries the argument

Symptom Induction, the process of compressing labeled examples into short, interpretable guidelines that define relevance criteria for each symptom and then using those guidelines to prompt the language model.

Load-bearing premise

That the short guidelines distilled from labeled examples capture stable, consistent relevance criteria that language models can follow more reliably than direct instructions or fine-tuning.

What would settle it

Running the same models on BDI-Sen with Symptom Induction guidelines and finding no increase in weighted F1 score compared to standard prompting, or seeing no transfer benefit on the external bipolar and eating disorder dataset.

Figures

Figures reproduced from arXiv: 2604.24376 by Anxo Perez, David Otero, Eliseo Bao, Javier Parapar.

**Figure 1.** Figure 1: Aggregated confusion matrices for Gemma 3 4B on BDI-Sen. Row-normalized, aggregated across all symptoms. Gemma 3 4B with SI (0.389), improving substantially over ZS (0.267), ICL (0.171), and SFT (0.242). Notably, strategy choice can dominate model scale: Gemma 3 4B with SI outperforms all other model/strategy combinations in our study, including larger models with fine-tuning view at source ↗

read the original abstract

Depression places substantial pressure on mental health services, and many people describe their experiences outside clinical settings in high-volume user-generated text (e.g., online forums and social media). Automatically identifying clinical symptom evidence in such text can therefore complement limited clinical capacity and scale to large populations. We address this need through sentence-level classification of 21 depression symptoms from the BDI-II questionnaire, using BDI-Sen, a dataset annotated for symptom relevance. This task is fine-grained and highly imbalanced, and we find that common LLM approaches (zero-shot, in-context learning, and fine-tuning) struggle to apply consistent relevance criteria for most symptoms. We propose Symptom Induction (SI), a novel approach which compresses labeled examples into short, interpretable guidelines that specify what counts as evidence for each symptom and uses these guidelines to condition classification. Across four LLM families and eight models, SI achieves the best overall weighted F1 on BDI-Sen, with especially large gains for infrequent symptoms. Cross-domain evaluation on an external dataset further shows that induced guidelines generalize across other diseases shared symptomatology (bipolar and eating disorders).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Symptom Induction turns labeled examples into short guidelines that lift weighted F1 on imbalanced symptom classification and transfer across related disorders.

read the letter

The paper's core move is Symptom Induction: take the labeled sentences for each of the 21 BDI-II symptoms, compress them into concise, human-readable guidelines that spell out what counts as evidence, then use those guidelines to prompt an LLM for sentence classification. On BDI-Sen this beats zero-shot, in-context learning, and fine-tuning baselines on overall weighted F1, with the largest gains on the infrequent symptoms, and the same guidelines produce reasonable results on bipolar and eating-disorder text as well.

Referee Report

2 major / 3 minor

Summary. The paper introduces Symptom Induction (SI), a method that compresses labeled examples into short, interpretable guidelines specifying evidence criteria for each of the 21 BDI-II depression symptoms. These guidelines condition LLM prompts for sentence-level classification on the BDI-Sen dataset. The central claim is that SI yields the highest weighted F1 across eight models from four LLM families, with large gains on infrequent symptoms, and that the induced guidelines generalize to cross-domain text from bipolar and eating disorders.

Significance. If the results hold under full experimental scrutiny, the work provides a falsifiable, interpretable prompting alternative to fine-tuning or standard ICL for imbalanced fine-grained clinical NLP tasks. It directly addresses the challenge of consistent relevance criteria in symptom detection from user-generated text and supplies a reusable guideline format that could scale to other symptom inventories.

major comments (2)

[Abstract and §4] The abstract and §4 (experimental setup) report only that SI achieves the 'best overall weighted F1' without listing the actual scores, standard deviations, or per-symptom breakdowns for the eight models. This omission makes it impossible to verify the magnitude of gains on infrequent symptoms or to compare against the fine-tuning baseline.
[§3] The induction algorithm itself (how labeled examples are compressed into guidelines) is described at a high level in §3 but lacks pseudocode, hyper-parameter choices, or the exact prompt template used for induction. Without these, the claim that the guidelines supply 'consistent relevance criteria' cannot be reproduced or stress-tested.

minor comments (3)

[Table 1] Table 1 (dataset statistics) should include the exact class distribution per symptom to contextualize the 'especially large gains for infrequent symptoms' claim.
[§5.3] The cross-domain evaluation section would benefit from an explicit statement of which symptoms overlap between BDI-Sen and the bipolar/eating-disorder corpora.
[§2] A few citations to prior work on prompt compression or guideline-based prompting (e.g., in clinical NLP) are missing from the related-work section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We have carefully considered the comments and made revisions to enhance the manuscript's clarity, reproducibility, and verifiability.

read point-by-point responses

Referee: [Abstract and §4] The abstract and §4 (experimental setup) report only that SI achieves the 'best overall weighted F1' without listing the actual scores, standard deviations, or per-symptom breakdowns for the eight models. This omission makes it impossible to verify the magnitude of gains on infrequent symptoms or to compare against the fine-tuning baseline.

Authors: We agree with this observation. The original manuscript focused on comparative statements without providing the raw metrics. In the revised version, we have expanded §4 to include a comprehensive table reporting the weighted F1 scores for SI and all baselines across the eight models, including standard deviations from multiple runs. Per-symptom F1 scores are also provided to highlight gains on infrequent symptoms. The abstract has been updated to report the specific overall weighted F1 achieved by SI. revision: yes
Referee: [§3] The induction algorithm itself (how labeled examples are compressed into guidelines) is described at a high level in §3 but lacks pseudocode, hyper-parameter choices, or the exact prompt template used for induction. Without these, the claim that the guidelines supply 'consistent relevance criteria' cannot be reproduced or stress-tested.

Authors: We acknowledge that the algorithmic details were presented at a high level. To address this, we have added pseudocode for the Symptom Induction procedure in §3. We also specify all hyper-parameters used in the induction process and include the exact prompt template in the revised manuscript (or as supplementary material if space is limited). These additions ensure that the method can be fully reproduced and the consistency of the induced guidelines can be evaluated. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is independent

full rationale

The paper introduces Symptom Induction (SI) as a prompting method that distills labeled training examples into short guidelines for sentence-level symptom classification. Claims of superior weighted F1 (especially on rare symptoms) and cross-domain generalization are supported by direct comparisons against zero-shot, ICL, and fine-tuning baselines across eight models on BDI-Sen, plus evaluation on an external bipolar/eating-disorder dataset. No equations, derivations, or parameter-fitting steps are described that reduce the reported performance metrics to the inputs by construction. The evaluation protocol is falsifiable and uses held-out and external data, providing independent grounding rather than self-referential loops.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical free parameters, axioms, or invented entities; the work is an empirical NLP method relying on LLM instruction-following and the existence of the BDI-Sen dataset.

pith-pipeline@v0.9.0 · 5492 in / 1066 out tokens · 33088 ms · 2026-05-08T03:48:12.322547+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 20 canonical work pages

[1]

Eliseo Bao, Anxo Perez, David Otero, and Javier Parapar. 2025. How does depres- sion talk on social media? Modeling depression language with relevance-based statistical language models.Online Social Networks and Media50 (2025), 100339. doi:10.1016/j.osnem.2025.100339

work page doi:10.1016/j.osnem.2025.100339 2025
[2]

Eliseo Bao, Anxo Pérez, and Javier Parapar. 2024. Explainable depression symptom detection in social media.Health Information Science and Systems12, 1 (06 Sep 2024), 47. doi:10.1007/s13755-024-00303-9

work page doi:10.1007/s13755-024-00303-9 2024
[3]

Eliseo Bao, Anxo Perez, and Javier Parapar. 2025. ReDSM5: A Reddit Dataset for DSM-5 Depression Detection. InProceedings of the 34th ACM International Conference on Information and Knowledge Management(Seoul, Republic of Ko- rea)(CIKM ’25). Association for Computing Machinery, New York, NY, USA, 6323–6327. doi:10.1145/3746252.3761610

work page doi:10.1145/3746252.3761610 2025
[4]

Aaron T. Beck, R. A. Steer, and G. Brown. 1996. Beck Depression Inventory–II. doi:10.1037/t00742-000

work page doi:10.1037/t00742-000 1996
[5]

Loris Belcastro, Riccardo Cantini, Fabrizio Marozzo, Domenico Talia, and Paolo Trunfio. 2025. Detecting mental disorder on social media: A ChatGPT-augmented explainable approach.Online Social Networks and Media48 (2025), 100321. doi:10. 1016/j.osnem.2025.100321

work page arXiv 2025
[6]

Chen Chen, Fenghuan Li, Haopeng Chen, and Yuankun Lin. 2025. Heterogeneous subgraph network with prompt learning for interpretable depression detection on social media.Knowledge-Based Systems315 (2025), 113215. doi:10.1016/j. knosys.2025.113215

work page doi:10.1016/j 2025
[7]

Munmun De Choudhury and Sushovan De. 2014. Mental Health Discourse on reddit: Self-Disclosure, Social Support, and Anonymity.Proceedings of the International AAAI Conference on Web and Social Media8, 1 (May 2014), 71–80. doi:10.1609/icwsm.v8i1.14526

work page doi:10.1609/icwsm.v8i1.14526 2014
[8]

Zhuohan Ge, Nicole Hu, Darian Li, Yubo Wang, Shihao Qi, Yuming Xu, Han Shi, and Jason Zhang. 2025. A Survey of Large Language Models in Mental Health Disorder Detection on Social Media. In2025 IEEE 41st International Conference on Data Engineering Workshops (ICDEW). IEEE, Los Alamitos, CA, USA, 164–176. doi:10.1109/ICDEW67478.2025.00027

work page doi:10.1109/icdew67478.2025.00027 2025
[9]

Goodwin, Lisa C

Renee D. Goodwin, Lisa C. Dierker, Melody Wu, Sandro Galea, Christina W. Hoven, and Andrea H. Weinberger. 2022. Trends in U.S. Depression Prevalence From 2015 to 2020: The Widening Treatment Gap.American Journal of Preventive Medicine63, 5 (2022), 726–733. doi:10.1016/j.amepre.2022.05.014

work page doi:10.1016/j.amepre.2022.05.014 2022
[10]

and Levy, Omer

Or Honovich, Uri Shaham, Samuel R. Bowman, and Omer Levy. 2023. Instruction Induction: From Few Examples to Natural Language Task Descriptions. InPro- ceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Lingui...

work page doi:10.18653/v1/2023.acl-long.108 2023
[11]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. InInternational Conference on Learning Representations. https: //openreview.net/forum?id=nZeVKeeFYf9

2022
[12]

Xiaochong Lan, Zhiguang Han, Yiming Cheng, Li Sheng, Jie Feng, Chen Gao, and Yong Li. 2025. Depression Detection on Social Media with Large Language Models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, Saloni Potdar, Lina Rojas-Barahona, and Sebastien Montella (Eds.). Association for Computation...

work page doi:10.18653/v1/2025.emnlp-industry.151 2025
[13]

Mayank Mishra, Prince Kumar, Riyaz Bhat, Rudra Murthy, Danish Contractor, and Srikanth Tamilselvam. 2023. Prompting with Pseudo-Code Instructions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 15178–15197. do...

work page doi:10.18653/v1/2023.emnlp- 2023
[14]

Anxo Pérez, Javier Parapar, Álvaro Barreiro, and Silvia Lopez-Larrosa. 2023. BDI- Sen: A Sentence Dataset for Clinical Symptoms of Depression. InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval(Taipei, Taiwan)(SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 2996–3006. doi:1...

work page doi:10.1145/3539618.3591905 2023
[15]

Federico Ravenda, Seyed Ali Bahrainian, Andrea Raballo, Antonietta Mira, and Noriko Kando. 2025. Are LLMs effective psychological assessors? Leveraging adaptive RAG for interpretable mental health screening through psychometric practice. InProceedings of the 63rd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), Wa...

work page doi:10.18653/v1/2025.acl-long.440 2025
[16]

Ríssola, Mario Ezra Aragón, David E

Esteban A. Ríssola, Mario Ezra Aragón, David E. Losada, and Fabio Crestani
[17]

doi:10.1007/s42001-025- 00377-9

On the incidence of depression symptoms on social media.Journal of Computational Social Science8, 2 (08 Mar 2025), 48. doi:10.1007/s42001-025- 00377-9

work page doi:10.1007/s42001-025- 2025
[18]

Hoyun Song, Huije Lee, Jisu Shin, Sukmin Cho, Changgeon Ko, and Jong C. Park
[19]

InFindings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.)

Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation. InFindings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 21738–21756. doi:10.18653/v1/2025.fin...

work page doi:10.18653/v1/2025.findings-acl.1119 2025
[20]

Yuxi Wang, Diana Inkpen, and Prasadith Kirinde Gamaarachchige. 2024. Ex- plainable Depression Detection Using Large Language Models on Social Media Data. InProceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024), Andrew Yates, Bart Desmet, Emily Prud’hommeaux, Ayah Zirikly, Steven Bedrick, Sean MacAvaney, Kfir B...

work page doi:10.18653/v1/2024.clpsych-1.8 2024
[21]

2025.World Mental Health Today: Men- tal Health Atlas 2024

World Health Organization. 2025.World Mental Health Today: Men- tal Health Atlas 2024. Technical Report. World Health Organization, Geneva. https://www.who.int/news/item/02-09-2025-over-a-billion-people- living-with-mental-health-conditions-services-require-urgent-scale-up Over one billion people live with a mental health condition, 91% of people living w...

2025
[22]

Emily Xiao, Yixiao Zeng, Ada Chen, Chin-Jou Li, Amanda Bertsch, and Gra- ham Neubig. 2026. Prompt-MII: Meta-Learning Instruction Induction for LLMs. InThe Fourteenth International Conference on Learning Representations (ICLR). arXiv:2510.16932 [cs.CL] https://openreview.net/forum?id=zD9fjEj4Oz

work page arXiv 2026
[23]

Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. 2024. MentaLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models. InProceedings of the ACM Web Conference 2024(Singapore, Singapore)(WWW ’24). Association for Computing Machinery, New York, NY, USA, 4489–4500. doi:10.1145/3589334.3648137

work page doi:10.1145/3589334.3648137 2024
[24]

Zhiling Zhang, Siyuan Chen, Mengyue Wu, and Kenny Zhu. 2022. Symptom Identification for Interpretable Detection of Multiple Mental Disorders on Social Media. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi,...

work page doi:10.18653/v1/2022.emnlp-main.677 2022