pith. machine review for the scientific record.

arxiv: 2604.18302 · v1 · submitted 2026-04-20 · 💻 cs.AI

Recognition: unknown

Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support


Pith reviewed 2026-05-10 05:15 UTC · model grok-4.3

classification 💻 cs.AI
keywords on-device AI · psychiatric diagnosis · privacy-preserving · mental health · LLM deployment · mobile application · DSM-5 assessment · zero-egress

The pith

A mobile app runs fine-tuned LLMs entirely locally to deliver psychiatric assessments without any patient data leaving the device.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that psychiatric decision support can shift from cloud servers to on-device execution using a consortium of compact, quantized language models. This matters because current cloud pipelines force sensitive mental health data to exit the device, which can discourage people from seeking help in military, correctional, or remote settings. The system orchestrates local inference to generate DSM-5-aligned outputs for clinicians and patients while keeping all processing on the phone. If the accuracy holds, the approach removes a major barrier to AI use in environments that reject external data transmission.

Core claim

The work presents a zero-egress, cross-platform mobile application that integrates three lightweight, fine-tuned, and quantized open-source LLMs, coordinated by an on-device orchestration layer that performs ensemble inference and consensus-based diagnostic reasoning. The result is DSM-5-aligned assessments for differential diagnosis and symptom mapping, with accuracy comparable to the server-side version and real-time latency on commodity hardware.

What carries the argument

An on-device orchestration layer that coordinates ensemble inference and consensus-based diagnostic reasoning among three quantized LLMs.
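The paper does not publish its orchestration logic, so as an editorial illustration only, here is a minimal sketch of consensus-based ensemble reasoning over three local models' outputs. The function name, diagnosis labels, and confidence scores are all hypothetical, not taken from the paper:

```python
from collections import Counter

def consensus_diagnosis(model_outputs):
    """Majority vote over per-model (label, confidence) pairs.

    Falls back to the single most confident model when no label
    wins a strict majority - one plausible tie-break, not the paper's.
    """
    labels = [label for label, _ in model_outputs]
    label, votes = Counter(labels).most_common(1)[0]
    if votes > len(model_outputs) // 2:
        return label
    # No majority: defer to the most confident single model.
    return max(model_outputs, key=lambda pair: pair[1])[0]

# Three hypothetical on-device models score the same conversation.
outputs = [("MDD", 0.82), ("MDD", 0.74), ("GAD", 0.91)]
print(consensus_diagnosis(outputs))  # prints "MDD": two of three agree
```

A real deployment would vote over richer structures (symptom maps, DSM-5 criteria hits) rather than a single label, but the load-bearing idea is the same: agreement among independently fine-tuned models substitutes for a single large server-side model.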

If this is right

  • Clinicians gain local access to differential diagnosis support and evidence-linked symptom mapping without data transmission.
  • Patients can use self-screening features with built-in safeguards while data remains on-device.
  • The platform becomes usable in operational environments that prohibit any external data flow.
  • Real-time performance is maintained on standard mobile hardware rather than requiring specialized servers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The design could extend to other privacy-sensitive medical domains by swapping the diagnostic focus while retaining the local orchestration layer.
  • Long-term, repeated local use might allow the models to adapt to individual users through on-device updates without cloud involvement.
  • Testing the system on diverse populations would reveal whether quantization introduces biases in specific demographic groups.

Load-bearing premise

The quantized and fine-tuned models retain enough diagnostic fidelity for DSM-5 assessments to match the accuracy of their full server-side versions.

What would settle it

A direct comparison of diagnostic outputs from the on-device system versus the server version on the same set of clinical cases, measuring agreement rates and specific error types.
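Such a head-to-head comparison reduces to standard agreement statistics. A sketch, using Cohen's kappa as the referee suggests; the case labels below are invented for illustration:

```python
def cohens_kappa(a, b):
    """Chance-corrected agreement between two diagnosis label sequences."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Hypothetical per-case diagnoses from the on-device and server pipelines.
on_device = ["MDD", "MDD", "GAD", "PTSD", "MDD", "GAD"]
server    = ["MDD", "GAD", "GAD", "PTSD", "MDD", "GAD"]
kappa = cohens_kappa(on_device, server)  # 17/23, roughly 0.74
```

Raw agreement alone would overstate comparability whenever one diagnosis dominates the case mix; kappa corrects for that, which is why it belongs in the evaluation the referee asks for, alongside an error typology for the disagreeing cases.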

Figures

Figures reproduced from arXiv: 2604.18302 by Anita H. Clayton, Asanga Gunaratna, Atmaram Yarlagadda, Christopher K. Rhea, Eranga Bandara, Isurunima Kularathna, Preston Samuel, Ravi Mukkamala, Ross Gore, Sachini Rajapakse, Sachin Shetty, Xueping Liang.

Figure 1. Zero-egress on-device architecture for privacy-preserving psychiatric AI decision support.
Figure 2. End-to-end offline fine-tuning and on-device deployment pipeline.
Figure 3. On-device ensemble inference flow. A clinical conversation is captured via the …
Figure 4. AI model selection panel in the mobile application. Three modes are available: …
Figure 5. Home screen of the mobile application, subtitled …
Figure 6. Conversational interface (AI Session screen) of the mobile application.
Figure 7. SOAP Notes task flow. Upon receiving a Generate SOAP note request, the platform returns a structured intake prompt requesting four categories of clinical information: (1) Subjective — patient complaints, symptoms, and history; (2) Objective — vital signs, physical examination findings, lab results, and imaging; (3) Assessment — clinical impressions and differential diagnoses; and (4) Plan — proposed treat…
Original abstract

Privacy represents one of the most critical yet underaddressed barriers to AI adoption in mental healthcare -- particularly in high-sensitivity operational environments such as military, correctional, and remote healthcare settings, where the risk of patient data exposure can deter help-seeking behavior entirely. Existing AI-enabled psychiatric decision support systems predominantly rely on cloud-based inference pipelines, requiring sensitive patient data to leave the device and traverse external servers, creating unacceptable privacy and security risks in these contexts. In this paper, we propose a zero-egress, on-device AI platform for privacy-preserving psychiatric decision support, deployed as a cross-platform mobile application. The proposed system extends our prior work on fine-tuned LLM consortiums for psychiatric diagnosis standardization by fundamentally re-architecting the inference pipeline for fully local execution -- ensuring that no patient data is transmitted to, processed by, or stored on any external server at any stage. The platform integrates a consortium of three lightweight, fine-tuned, and quantized open-source LLMs -- Gemma, Phi-3.5-mini, and Qwen2 -- selected for their compact architectures and proven efficiency on resource-constrained mobile hardware. An on-device orchestration layer coordinates ensemble inference and consensus-based diagnostic reasoning, producing DSM-5-aligned assessments for conditions. The platform is designed to assist clinicians with differential diagnosis and evidence-linked symptom mapping, as well as to support patient-facing self-screening with appropriate clinical safeguards. Initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor while sustaining real-time inference latency on commodity mobile hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a zero-egress on-device AI platform for privacy-preserving psychiatric decision support, implemented as a cross-platform mobile application. It extends prior work on fine-tuned LLM consortiums by deploying an ensemble of three quantized lightweight models (Gemma, Phi-3.5-mini, Qwen2) with an on-device orchestration layer for consensus-based, DSM-5-aligned diagnostic reasoning. The system targets differential diagnosis assistance and patient self-screening in sensitive settings (military, correctional, remote care) while ensuring no patient data leaves the device. The abstract asserts that initial evaluation shows diagnostic accuracy comparable to the server-side predecessor alongside real-time inference latency on commodity mobile hardware.

Significance. If the accuracy and latency claims are substantiated, the work would offer a concrete technical path to address a major adoption barrier for AI in mental healthcare: the privacy risk of data egress in high-stakes environments. Demonstrating a practical, fully local ensemble deployment using open-source models could enable safer clinician tools and self-screening applications without external servers, potentially increasing help-seeking behavior. The emphasis on consensus reasoning and clinical safeguards adds operational relevance beyond pure model compression.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor' is presented without any quantitative metrics (accuracy, F1, Cohen's kappa), dataset description (size, conditions, ground-truth source), evaluation protocol, baselines, or error analysis. This absence leaves the primary performance assertion unsupported and prevents assessment of whether quantization and mobile constraints degrade DSM-5 diagnostic fidelity.
  2. [Abstract] Evaluation/Results section (implied by abstract claim): No details are provided on how the on-device ensemble was tested against the server-side predecessor, including any ablation on quantization effects, inter-rater agreement with clinicians, or statistical significance of the 'comparable' result. Without these, the claim that the architecture preserves diagnostic quality cannot be evaluated and is load-bearing for the paper's contribution.
minor comments (2)
  1. [Architecture] The description of the orchestration layer and consensus mechanism would benefit from a high-level diagram or pseudocode to clarify how the three models coordinate DSM-5 symptom mapping and differential diagnosis.
  2. [Introduction] Explicit citation to the prior server-side LLM consortium paper should be added in the introduction to clearly delineate the novel on-device re-architecture from the earlier fine-tuning work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments correctly identify that the abstract's performance claim requires quantitative substantiation to be evaluable. We will revise the manuscript to add a dedicated Evaluation section with metrics, dataset details, ablations, and analysis, while updating the abstract accordingly. Our point-by-point responses follow.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor' is presented without any quantitative metrics (accuracy, F1, Cohen's kappa), dataset description (size, conditions, ground-truth source), evaluation protocol, baselines, or error analysis. This absence leaves the primary performance assertion unsupported and prevents assessment of whether quantization and mobile constraints degrade DSM-5 diagnostic fidelity.

    Authors: We agree that the abstract claim is currently unsupported without supporting numbers and context. In revision we will expand the abstract to report key quantitative results (e.g., accuracy, F1, Cohen's kappa) and will add a new Evaluation section that fully describes the test dataset (size, conditions, ground-truth source), protocol, baselines, and error analysis so readers can assess any impact of quantization and on-device constraints. revision: yes

  2. Referee: [Abstract] Evaluation/Results section (implied by abstract claim): No details are provided on how the on-device ensemble was tested against the server-side predecessor, including any ablation on quantization effects, inter-rater agreement with clinicians, or statistical significance of the 'comparable' result. Without these, the claim that the architecture preserves diagnostic quality cannot be evaluated and is load-bearing for the paper's contribution.

    Authors: We accept this assessment. The current manuscript presents only a high-level claim. We will add a full Results/Evaluation section containing ablation studies on quantization, inter-rater agreement (Cohen's kappa) with clinicians and the server-side model, statistical significance tests, and error analysis. These additions will directly substantiate the comparability claim and allow evaluation of diagnostic fidelity preservation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; architecture and claims are self-contained

full rationale

The paper presents an engineering description of an on-device LLM ensemble for psychiatric decision support, extending prior fine-tuned models via re-architecting for local execution. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The accuracy comparability claim is asserted from 'initial evaluation' without reducing by construction to the inputs or prior work; it is an evidentiary assertion rather than a self-referential loop. Self-citation of the authors' earlier LLM consortium work is present but does not bear load on any derivation chain, as the deployment pipeline stands independently. This matches the default expectation of no circularity for descriptive systems papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the untested premise that existing open-source LLMs, after fine-tuning and quantization, can produce clinically reliable DSM-5 outputs when orchestrated locally; no new entities or free parameters are introduced in the abstract.

axioms (2)
  • domain assumption Lightweight open-source LLMs can be fine-tuned to produce DSM-5-aligned psychiatric assessments
    Invoked when selecting Gemma, Phi-3.5-mini, and Qwen2 for the consortium
  • domain assumption Quantization preserves sufficient diagnostic accuracy for clinical decision support
    Required for the on-device performance claim
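The second axiom can be made concrete. The paper's stack points at GGUF/llama.cpp-style deployment; the symmetric int8 scheme below is a deliberately minimal illustration of what quantization costs, not the exact format the authors use. Per-weight reconstruction error is bounded by half the quantization scale:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric int8 quantization: map weights onto [-127, 127] with one scale."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)  # stand-in for a weight tensor
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale              # dequantized approximation
max_err = float(np.abs(w - w_hat).max())          # bounded by scale / 2
```

The bound says nothing about diagnostic fidelity, which is exactly why the axiom is load-bearing: small per-weight errors can still compound through a transformer into clinically meaningful output differences, and only the head-to-head evaluation demanded above can rule that out.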

pith-pipeline@v0.9.0 · 5638 in / 1350 out tokens · 23888 ms · 2026-05-10T05:15:59.137640+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

    cs.CL 2026-05 unverdicted novelty 5.0

    Small open-weight language models can self-optimize prompts for clinical named entity recognition in dental notes, reaching micro F1 of 0.864 after DPO on Qwen2.5-14B.

Reference graph

Works this paper leans on

72 extracted references · 55 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1] World Health Organization, World mental health report: Transforming mental health for all, Tech. rep., WHO, Geneva, Switzerland (2022). URL: https://www.who.int/publications/i/item/9789240049338
  2. [2] S. Saxena, G. Thornicroft, M. Knapp, H. Whiteford, Resources for mental health: scarcity, inequity, and inefficiency, The Lancet 370 (9590) (2007) 878–889. doi:10.1016/S0140-6736(07)61239-2
  3. [3] G. Thornicroft, et al., Undertreatment of people with major depressive disorder in 21 countries, The British Journal of Psychiatry 210 (2) (2016) 119–124. doi:10.1192/bjp.bp.116.188078
  4. [4] C. W. Hoge, et al., Combat duty in Iraq and Afghanistan, mental health problems, and barriers to care, New England Journal of Medicine 351 (1) (2004) 13–22. doi:10.1056/NEJMoa040603
  5. [5] P. Y. Kim, et al., Stigma, barriers to care, and use of mental health services among active duty and National Guard soldiers after combat, Psychiatric Services 62 (1) (2011) 27–34. doi:10.1176/ps.62.1.pss6201_0027
  6. [6] T. Greene, et al., Stigma and barriers to mental health treatment in the military, Military Medicine 175 (2) (2010) 86–91. doi:10.7205/MILMED-D-09-00120
  7. [7] Z. Guo, et al., Automated depression detection using deep learning and natural language processing, ACM Transactions on Computing for Healthcare 1 (3) (2020) 1–19. doi:10.1145/3372168
  8. [8] M. Shim, et al., Machine learning-based diagnostic models for psychiatric disorders: a systematic review, Journal of Psychiatric Research 133 (2021) 1–12. doi:10.1016/j.jpsychires.2020.12.019
  9. [10] Gemma Team, Google DeepMind, Gemma: Open models based on Gemini research and technology, arXiv preprint arXiv:2403.08295 (2024).
  10. [11] M. Abdin, et al., Phi-3 technical report: A highly capable language model locally on your phone, arXiv preprint arXiv:2404.14219 (2024).
  11. [12] A. Yang, et al., Qwen2 technical report, arXiv preprint arXiv:2407.10671 (2024).
  12. [13] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, in: Advances in Neural Information Processing Systems (NeurIPS), 2023.
  13. [14] M. Xu, et al., A survey of resource-efficient LLM and multimodal foundation models, arXiv preprint arXiv:2401.08092 (2024).
  14. [15] S. Laskaridis, et al., MELTing point: Mobile evaluation of language transformers, arXiv preprint arXiv:2403.12844 (2024).
  15. [16] American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), American Psychiatric Publishing, Arlington, VA, 2013. doi:10.1176/appi.books.9780890425596
  16. [17] World Health Organization, International classification of diseases, 11th revision (ICD-11), Tech. rep., WHO, Geneva, Switzerland (2019). URL: https://icd.who.int
  17. [18] D. A. Regier, et al., DSM-5 field trials in the United States and Canada, part II: Test-retest reliability of selected categorical diagnoses, American Journal of Psychiatry 170 (1) (2013) 59–70. doi:10.1176/appi.ajp.2012.12070999
  18. [19] R. Freedman, et al., The initial field trials of DSM-5: new blooms and old thorns, American Journal of Psychiatry 170 (1) (2013) 1–5. doi:10.1176/appi.ajp.2012.12091189
  19. [20] K. S. Kendler, An historical framework for psychiatric nosology, Psychological Medicine 39 (12) (2009) 1935–1941. doi:10.1017/S0033291709005753
  20. [21] R. M. A. Hirschfeld, et al., Perceptions and impact of bipolar disorder: how far have we really come? Results of the National Depressive and Manic-Depressive Association 2000 survey, Journal of Clinical Psychiatry 64 (2) (2003) 161–174. doi:10.4088/JCP.v64n0209
  21. [22] E. Bandara, R. Gore, A. Yarlagadda, A. H. Clayton, P. Samuel, C. K. Rhea, S. Shetty, Standardization of psychiatric diagnoses – role of fine-tuned LLM consortium and OpenAI-GPT-OSS reasoning LLM enabled decision support system, arXiv preprint arXiv:2510.25588 (2025).
  22. [23] A. Vaswani, et al., Attention is all you need, Advances in Neural Information Processing Systems (NeurIPS) 30 (2017).
  23. [24] T. B. Brown, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems (NeurIPS) 33 (2020) 1877–1901.
  24. [25] R. Gore, E. Bandara, S. Shetty, A. E. Musto, P. Rana, A. Valencia-Romero, C. Rhea, L. Tayebi, H. Richter, A. Yarlagadda, et al., Proof-of-TBI – fine-tuned vision language model consortium and OpenAI-o3 reasoning LLM-based medical diagnosis support system for mild traumatic brain injury (TBI) prediction, arXiv preprint arXiv:2504.18671 (2025).
  25. [26] M. Gaur, et al., Characterization of time-variant and time-invariant assessment of suicidality on Reddit using C-SSRS, PLOS ONE 16 (5) (2021) e0250448. doi:10.1371/journal.pone.0250448
  26. [27] N. Flemotomos, et al., Automated quality assessment of cognitive behavioral therapy sessions through extracting psycholinguistic features, in: Proceedings of Interspeech, 2021, pp. 4251–4255. doi:10.21437/Interspeech.2021-357
  27. [28] I. Y. Chen, et al., Ethical machine learning in healthcare, Annual Review of Biomedical Data Science 4 (2021) 123–144. doi:10.1146/annurev-biodatasci-092820-114757
  28. [29] E. Bandara, A. Hass, S. Shetty, R. Mukkamala, R. Gore, A. Rahman, S. H. Bouk, Deep-STRIDE: Automated security threat modeling with vision-language models, in: 2025 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), 2025, pp. 1–7.
  29. [30] GGML Contributors, GGUF: GPT-generated unified format (2023). URL: https://github.com/ggerganov/ggml
  30. [31] G. Gerganov, llama.cpp: LLM inference in C/C++ (2023). URL: https://github.com/ggerganov/llama.cpp
  31. [32] MLC Team, MLC LLM: Universal LLM deployment engine (2023). URL: https://github.com/mlc-ai/mlc-llm
  32. [33] United States Congress, Health Insurance Portability and Accountability Act of 1996 (HIPAA), Public Law 104-191, United States Department of Health and Human Services, Washington, DC (1996).
  33. [34] European Parliament and Council, General Data Protection Regulation (GDPR), Regulation (EU) 2016/679, Official Journal of the European Union (2016). URL: https://gdpr-info.eu
  34. [35] U.S. Department of Defense, DoD Instruction 8582.01: Privacy in the DoD (2012). URL: https://www.esd.whs.mil/DD/
  35. [36] U.S. General Services Administration, FedRAMP: Federal Risk and Authorization Management Program (2011). URL: https://www.fedramp.gov
  36. [37] B. Blobel, et al., Trustworthy, secure and privacy-protecting electronic health record systems, Methods of Information in Medicine 57 (2018) e47–e57. doi:10.3414/ME17-01-0048
  37. [38] P. S. Appelbaum, Privacy in psychiatric treatment: threats and responses, American Journal of Psychiatry 159 (11) (2002) 1809–1818. doi:10.1176/appi.ajp.159.11.1809
  38. [39] N. Rieke, et al., The future of digital health with federated learning, npj Digital Medicine 3 (1) (2020) 119. doi:10.1038/s41746-020-00323-1
  39. [40] J. C. Duchi, M. I. Jordan, M. J. Wainwright, Local privacy and statistical minimax rates, in: IEEE Symposium on Foundations of Computer Science (FOCS), 2013, pp. 429–438. doi:10.1109/FOCS.2013.53
  40. [41] E. Bandara, A. Hass, R. Gore, S. Shetty, R. Mukkamala, S. H. Bouk, X. Liang, N. W. Keong, K. De Zoysa, A. Withanage, et al., ASTRIDE: A security threat modeling platform for agentic-AI applications, arXiv preprint arXiv:2512.04785 (2025).
  41. [42] Unsloth Contributors, Unsloth: Fast and memory-efficient LLM fine-tuning (2024). URL: https://github.com/unslothai/unsloth
  42. [43] D. B. Acharya, K. Kuppan, B. Divya, Agentic AI: Autonomous intelligence for complex goals – a comprehensive survey, IEEE Access (2025).
  43. [44] E. Bandara, R. Gore, P. Foytik, S. Shetty, R. Mukkamala, A. Rahman, X. Liang, S. H. Bouk, A. Hass, S. Rajapakse, et al., A practical guide for designing, developing, and deploying production-grade agentic AI workflows, arXiv preprint arXiv:2512.08769 (2025).
  44. [45] A. Yehudai, L. Eden, A. Li, G. Uziel, Y. Zhao, R. Bar-Haim, A. Cohan, M. Shmueli-Scheuer, Survey on evaluation of LLM-based agents, arXiv preprint arXiv:2503.16416 (2025).
  45. [46] E. Bandara, R. Gore, X. Liang, S. Rajapakse, I. Kularathne, P. Karunarathna, P. Foytik, S. Shetty, R. Mukkamala, A. Rahman, et al., Agentsway – software development methodology for AI agents-based teams, arXiv preprint arXiv:2510.23664 (2025).
  46. [47] E. Bandara, T. Hewa, R. Gore, S. Shetty, R. Mukkamala, P. Foytik, A. Rahman, S. H. Bouk, X. Liang, A. Hass, et al., Towards responsible and explainable AI agents with consensus-driven reasoning, arXiv preprint arXiv:2512.21699 (2025).
  47. [48] E. Bandara, R. Gore, S. Shetty, S. Rajapakse, I. Kularathna, P. Karunarathna, R. Mukkamala, P. Foytik, S. H. Bouk, A. Rahman, et al., A practical guide to agentic AI transition in organizations, arXiv preprint arXiv:2602.10122 (2026).
  48. [49] K. Kroenke, R. L. Spitzer, J. B. W. Williams, The PHQ-9: Validity of a brief depression severity measure, Journal of General Internal Medicine 16 (9) (2001) 606–613. doi:10.1046/j.1525-1497.2001.016009606.x
  49. [50] F. W. Weathers, et al., PTSD checklist for DSM-5 (PCL-5), Tech. rep., National Center for PTSD (2013). URL: https://www.ptsd.va.gov/professional/assessment/adult-sr/ptsd-checklist.asp
  50. [51] ARM Ltd., ARM TrustZone technology (2023). URL: https://developer.arm.com/ip-products/security-ip/trustzone
  51. [52] Apple Inc., Apple platform security: Secure Enclave (2023). URL: https://support.apple.com/guide/security/secure-enclave-sec59b0b31ff/web
  52. [53] E. Bandara, mental-reasoning: A psychiatric diagnostic conversational dataset for DSM-5 aligned LLM fine-tuning (2025). URL: https://huggingface.co/datasets/lambdaeranga/mental-reasoning
  53. [54] E. J. Hu, et al., LoRA: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
  54. [55] E. Bandara, R. Gore, S. Shetty, R. Mukkamala, C. Rhea, A. Yarlagadda, S. Kaushik, L. De Silva, A. Maznychenko, I. Sokolowska, et al., Standardization of neuromuscular reflex analysis – role of fine-tuned vision-language model consortium and OpenAI GPT-OSS reasoning LLM enabled decision support system, arXiv preprint arXiv:2508.12473 (2025).
  55. [56] R. L. Spitzer, K. Kroenke, J. B. W. Williams, B. Löwe, A brief measure for assessing generalized anxiety disorder: the GAD-7, Archives of Internal Medicine 166 (10) (2006) 1092–1097. doi:10.1001/archinte.166.10.1092
  56. [57] R. M. A. Hirschfeld, et al., Development and validation of a screening instrument for bipolar spectrum disorder: the Mood Disorder Questionnaire, American Journal of Psychiatry 157 (11) (2000) 1873–1875. doi:10.1176/appi.ajp.157.11.1873
  57. [58] S. R. Kay, A. Fiszbein, L. A. Opler, The positive and negative syndrome scale (PANSS) for schizophrenia, Schizophrenia Bulletin 13 (2) (1987) 261–276. doi:10.1093/schbul/13.2.261
  58. [59] E. J. Topol, Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again, Basic Books, New York, NY, 2019.
  59. [60] Suicide Prevention Resource Center, Safe messaging guidelines for suicide and mental health (2022). URL: https://www.sprc.org/resources-programs/safe-messaging-guidelines
  60. [61] Y. Kim, et al., Promises and pitfalls of large language models in psychiatric diagnosis and knowledge tasks, The British Journal of Psychiatry (2024). doi:10.1192/bjp.2024.83
  61. [62] K. Singhal, et al., Large language models encode clinical knowledge, Nature 620 (2023) 172–180. doi:10.1038/s41586-023-06291-2
  62. [63] K. Yang, T. Zhang, Z. Kuang, Q. Xie, S. Ananiadou, MentalLLaMA: Interpretable mental health analysis on social media with large language models, arXiv preprint arXiv:2309.13567 (2024).
  63. [64] MHINDR: A DSM-5 based mental health diagnosis and recommendation framework using LLM, arXiv preprint arXiv:2509.25992 (2025).
  64. [65] O. Golan, et al., LLM questionnaire completion for automatic psychiatric assessment, in: Findings of EMNLP, 2024. doi:10.18653/v1/2024.findings-emnlp.23
  65. [66] Trustworthy AI psychotherapy: Multi-agent LLM workflow for counseling and explainable mental disorder diagnosis, arXiv preprint arXiv:2508.11398 (2025).
  66. [67] N. Sarwar, et al., FedMentalCare: Towards privacy-preserving fine-tuned LLMs to analyze mental health status using federated learning framework, arXiv preprint arXiv:2503.05786 (2025).
  67. [68] FedMentor: Domain-aware differential privacy for heterogeneous federated LLMs in mental health, arXiv preprint arXiv:2509.14275 (2025).
  68. [70] S. Pati, et al., Privacy preservation for federated learning in health care, Patterns 5 (7) (2024). doi:10.1016/j.patter.2024.100974
  69. [71] Are we there yet? A measurement study of efficiency for LLM applications on mobile devices, arXiv preprint arXiv:2504.00002 (2025).
  70. [72] B. Yang, et al., DRHouse: An LLM-empowered diagnostic reasoning system through harnessing outcomes from sensor data and expert knowledge, in: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Vol. 8, 2024, pp. 1–29. doi:10.1145/3699771
  71. [73] Systematic review of large language models in mental health care, JMIR Mental Health 12 (2025) e78410. doi:10.2196/78410
  72. [74] The evolving field of digital mental health: current evidence and implementation issues for smartphone apps, generative artificial intelligence, and virtual reality, World Psychiatry (2025). doi:10.1002/wps.21307