CPEMH: An Agentic Framework for Prompt-Driven Behavior Evaluation and Assurance in Foundation-Model Systems for Mental Health Screening

Donald Cowan (University of Waterloo); Giuliano Lorenzoni; Ivens Portugal; Paulo Alencar

arxiv: 2605.11341 · v1 · submitted 2026-05-11 · 💻 cs.AI

Pith reviewed 2026-05-13 01:24 UTC · model grok-4.3

classification 💻 cs.AI

keywords agentic framework · prompt engineering · mental health screening · foundation models · behavioral assurance · depression detection · traceability · prompt stability

The pith

CPEMH deploys modular agents to autonomously design, evaluate, and select prompts that stabilize foundation-model outputs for mental health screening from transcripts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CPEMH as an engineering approach to make large language models behave more reliably when processing interview transcripts for mental health tasks such as depression screening. It claims that an orchestrated system of agents can take over the full prompting lifecycle (designing strategies, testing them, and choosing the best ones) and thereby reduce inconsistent or variable responses across contexts. A case study applies the framework to automated depression detection and reports that the setup improves traceability, reproducibility, and robustness in clinically sensitive conversations. If the claim holds, the method would give developers a systematic way to audit and control how models handle conversational data, rather than relying solely on manual prompt tweaking.

Core claim

CPEMH introduces an agentic framework whose orchestrated architecture autonomously performs the design, evaluation, and selection of prompt strategies, thereby enabling systematic control of behavioral variability across contexts. Its modular design combines orchestrator, inference, and evaluation agents to maintain traceability, reproducibility, and robustness throughout the prompting lifecycle. A case study on automated depression screening from interview transcripts is presented as demonstrating stabilization and auditing of foundation-model behavior in conversational and clinically sensitive domains.

What carries the argument

CPEMH's orchestrated architecture of orchestrator, inference, and evaluation agents that autonomously manage prompt design, evaluation, and selection to control behavioral variability.
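
The division of labor among the three agents can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `PromptStrategy`, `inference_agent`, `evaluation_agent`, and `orchestrator` are hypothetical, the foundation model is abstracted as any callable that maps a prompt to a 0/1 label, and selection by F1 alone is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical stand-in: any callable mapping a filled prompt to a label
# (1 = screen positive, 0 = screen negative) plays the foundation model.
Model = Callable[[str], int]


@dataclass(frozen=True)
class PromptStrategy:
    name: str
    template: str  # must contain the placeholder "{transcript}"


def inference_agent(model: Model, strategy: PromptStrategy,
                    transcripts: Sequence[str]) -> List[int]:
    """Apply one prompt strategy to every transcript and collect labels."""
    return [model(strategy.template.format(transcript=t)) for t in transcripts]


def evaluation_agent(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Binary F1 for the positive class; returns 0.0 when undefined."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def orchestrator(model: Model, strategies: Sequence[PromptStrategy],
                 transcripts: Sequence[str],
                 labels: Sequence[int]) -> Tuple[PromptStrategy, Dict[str, float]]:
    """Score every strategy and return the best one plus all scores."""
    scores: Dict[str, float] = {}
    for strategy in strategies:
        preds = inference_agent(model, strategy, transcripts)
        scores[strategy.name] = evaluation_agent(preds, labels)
    best = max(strategies, key=lambda s: scores[s.name])
    return best, scores
```

Returning the full score table alongside the winner keeps the rejected strategies visible, which is the kind of record an audit of the selection step would need.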

If this is right

Modular orchestration produces traceable and reproducible prompt choices for clinical AI applications.
Stability and robustness can be treated as primary acceptance criteria alongside F1 scores when selecting prompts.
Prioritizing stability over added architectural complexity yields more reliable behavior in sensitive domains.
The approach extends behavioral assurance methods to conversational mental health data.
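
Treating stability and robustness as acceptance criteria alongside F1 amounts to a gate that a prompt must pass before it can be selected. The sketch below is one possible reading, in Python. The metric definitions and threshold values are assumptions made for illustration; the page does not state how the paper defines bias or robustness, or what thresholds it uses.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Scores:
    f1: float          # higher is better
    bias_gap: float    # assumed: |F1(group A) - F1(group B)|, lower is better
    robustness: float  # assumed: share of labels unchanged under paraphrase


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only, not taken from the paper.
    min_f1: float = 0.70
    max_bias_gap: float = 0.10
    min_robustness: float = 0.90


def accept(scores: Scores, t: Thresholds = Thresholds()) -> bool:
    """A prompt is acceptable only if it clears all three criteria."""
    return (scores.f1 >= t.min_f1
            and scores.bias_gap <= t.max_bias_gap
            and scores.robustness >= t.min_robustness)


def select(candidates: Dict[str, Scores],
           t: Thresholds = Thresholds()) -> Optional[str]:
    """Pick the highest-F1 prompt among those that pass the gate."""
    passing = {name: s for name, s in candidates.items() if accept(s, t)}
    if not passing:
        return None
    return max(passing, key=lambda name: passing[name].f1)
```

Under this gate a prompt with the highest F1 but poor robustness is rejected outright, which is one concrete way to prioritize stability over raw accuracy.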

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The emphasis on stability over complexity could encourage simpler agent designs in other high-stakes transcript analysis tasks.
Adopting F1, bias, and robustness as joint criteria might create reusable evaluation standards for prompt-driven systems in healthcare.
If the orchestration works, it could reduce manual iteration cycles when deploying models on new clinical transcript datasets.

Load-bearing premise

The agents can independently design, test, and choose prompt strategies that deliver consistent model behavior across different contexts, with the depression screening case study serving as sufficient validation.

What would settle it

Running the same interview transcripts through the foundation model both with and without CPEMH orchestration and measuring changes in output variance, F1 stability, and bias metrics would test whether the agentic control actually reduces behavioral variability.
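
One way such a comparison could be instrumented is sketched below in Python. It assumes each condition (with and without orchestration) is run several times over the same transcripts, and it summarizes run-to-run spread in F1 plus the share of transcripts whose label changes between runs. The function names and the choice of these two summaries are assumptions, not the paper's protocol.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def f1(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Binary F1 for the positive class; returns 0.0 when undefined."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def stability_report(runs: Sequence[Sequence[int]],
                     labels: Sequence[int]) -> Dict[str, float]:
    """Summarize variability across repeated runs on the same transcripts.

    runs[i][j] is the label predicted for transcript j in run i.
    """
    per_run_f1 = [f1(run, labels) for run in runs]
    # A transcript "disagrees" if its label is not identical in every run.
    disagreeing = sum(1 for column in zip(*runs) if len(set(column)) > 1)
    return {
        "f1_mean": mean(per_run_f1),
        "f1_std": pstdev(per_run_f1),
        "disagreement_rate": disagreeing / len(labels),
    }
```

Calling `stability_report` once on the uncontrolled runs and once on the orchestrated runs gives two directly comparable summaries; a lower `f1_std` and `disagreement_rate` under orchestration would be evidence for the claimed stabilization.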

Figures

Figures reproduced from arXiv: 2605.11341 by Donald Cowan (University of Waterloo), Giuliano Lorenzoni, Ivens Portugal, Paulo Alencar.

Figure 2: IS vs. OOS joint comparison of bias and robustness.

Original abstract

This paper presents CPEMH, an agentic framework designed to evaluate prompt-driven behavior in foundation-model systems operating on transcript-based datasets for mental-health screening. CPEMH serves as an engineering methodology for behavioral assurance in large-scale language systems, introducing an orchestrated architecture that autonomously performs the design, evaluation, and selection of prompt strategies, enabling systematic control of behavioral variability across contexts. Its modular agentic design, combining orchestrator, inference, and evaluation agents, ensures traceability, reproducibility, and robustness throughout the prompting lifecycle. A case study on automated depression screening from interview transcripts demonstrates the framework's capacity to stabilize and audit foundation-model behavior in conversational and clinically sensitive domains. Lessons learned emphasize the role of modular orchestration in behavioral assurance, the prioritization of stability over architectural complexity, and the integration of F1, bias, and robustness as core acceptance criteria.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents CPEMH, an agentic framework for evaluating and assuring prompt-driven behavior in foundation-model systems for mental-health screening from transcripts. It proposes a modular architecture with orchestrator, inference, and evaluation agents that autonomously design, evaluate, and select prompt strategies to achieve traceability, reproducibility, and robustness. A case study on automated depression screening from interview transcripts is described as demonstrating the framework's ability to stabilize and audit model behavior in clinically sensitive domains, with lessons on modular orchestration and the use of F1, bias, and robustness metrics as acceptance criteria.

Significance. If the central claims hold with supporting evidence, the work would contribute a structured engineering methodology for behavioral control in LLM-based systems applied to high-stakes mental health tasks, where variability poses clinical risks. The emphasis on autonomous orchestration and explicit acceptance criteria (F1, bias, robustness) could inform responsible deployment practices, though the absence of quantitative validation limits immediate impact.

major comments (1)

[Case Study] Case study section: The central claim that the framework 'demonstrates the capacity to stabilize and audit foundation-model behavior' is unsupported by any quantitative results. No metrics (e.g., F1 scores, bias measures, robustness scores), baselines (non-agentic prompting), before/after comparisons, inter-run variance, or statistical tests are reported, which directly undermines the assertion of systematic control and reproducible assurance.
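
The baseline comparison and statistical test this comment asks for could take many forms. A minimal Python sketch of one option, a paired bootstrap confidence interval for the F1 difference between a non-agentic baseline and an orchestrated prompt on the same transcripts, is shown here. The function name, resample count, and 95% level are assumptions chosen for illustration; nothing here comes from the paper.

```python
import random
from typing import List, Sequence, Tuple


def f1(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Binary F1 for the positive class; returns 0.0 when undefined."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_f1_diff(baseline: Sequence[int],
                             candidate: Sequence[int],
                             labels: Sequence[int],
                             n_resamples: int = 2000,
                             seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed F1 difference, lower bound, upper bound) at ~95%.

    Transcripts are resampled with replacement, and both systems are scored
    on the same resample so the comparison stays paired.
    """
    rng = random.Random(seed)  # fixed seed keeps the analysis reproducible
    n = len(labels)
    diffs: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        b = f1([baseline[i] for i in idx], y)
        c = f1([candidate[i] for i in idx], y)
        diffs.append(c - b)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    observed = f1(candidate, labels) - f1(baseline, labels)
    return observed, lower, upper
```

An interval that excludes zero would support a real difference between the two conditions; an interval that straddles zero would not.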

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We address the major comment regarding the case study below and outline the revisions we will make to align the claims with the presented evidence.

Point-by-point responses

Referee: [Case Study] Case study section: The central claim that the framework 'demonstrates the capacity to stabilize and audit foundation-model behavior' is unsupported by any quantitative results. No metrics (e.g., F1 scores, bias measures, robustness scores), baselines (non-agentic prompting), before/after comparisons, inter-run variance, or statistical tests are reported, which directly undermines the assertion of systematic control and reproducible assurance.

Authors: We agree that the case study section, as currently written, does not report quantitative metrics, baselines, variance measures, or statistical tests to substantiate the claim of stabilization and auditability. The case study was intended to illustrate the end-to-end orchestration workflow of CPEMH (orchestrator, inference, and evaluation agents) and the role of F1, bias, and robustness as acceptance criteria in a mental-health transcript scenario, rather than to serve as a full empirical validation study. This distinction was not made sufficiently clear, which weakens the central claim. In the revised manuscript we will: (1) qualify the abstract and case-study description to state that the example demonstrates the framework's modular design and traceability mechanisms but does not constitute quantitative proof of behavioral stabilization; (2) add an explicit limitations paragraph noting the absence of comparative baselines and statistical analysis; and (3) include any readily available procedural logs or acceptance-criterion thresholds from the existing runs if they can be reported without new experiments. These changes will ensure all claims are proportionate to the evidence supplied.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in framework description

Full rationale

The paper introduces CPEMH as an engineering methodology and agentic framework whose claims about traceability, reproducibility, and behavioral stabilization are presented as direct consequences of its described modular architecture (orchestrator, inference, and evaluation agents). The case study is characterized as a demonstration of the framework's capacity rather than a quantitative prediction, fitted result, or derived quantity obtained through equations or parameters that reduce to the inputs by construction. No mathematical derivations, self-citations of load-bearing uniqueness theorems, or ansatzes smuggled via prior work are invoked. The work remains self-contained as a proposal of an architecture and associated lessons, without tautological reductions of its central assertions to its own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central contribution is the introduction of the CPEMH framework and its agent components, which are postulated without independent evidence or external validation beyond the high-level case study description.

axioms (1)

domain assumption: Modular agentic design ensures traceability, reproducibility, and robustness throughout the prompting lifecycle.
Invoked as the basis for behavioral assurance in the framework description.
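
Traceability and reproducibility in a prompting lifecycle usually come down to recording enough about each run to replay and audit it. The Python sketch below shows one hypothetical record format. The field names, the use of a SHA-256 digest of the prompt template, and the JSON-lines output are all assumptions; the page does not describe the logging format CPEMH uses.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TraceRecord:
    strategy_name: str
    prompt_sha256: str  # digest of the exact template text that was used
    model_id: str       # hypothetical identifier of the foundation model
    seed: int           # sampling seed, if the model exposes one
    f1: float


def make_record(strategy_name: str, prompt_template: str,
                model_id: str, seed: int, f1: float) -> TraceRecord:
    """Build an audit record that pins the prompt by content hash."""
    digest = hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()
    return TraceRecord(strategy_name, digest, model_id, seed, f1)


def to_json_line(record: TraceRecord) -> str:
    """Serialize deterministically so identical runs give identical lines."""
    return json.dumps(asdict(record), sort_keys=True)
```

Hashing the template rather than storing only its name means any silent edit to a prompt shows up as a different digest in the log.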

invented entities (3)

Orchestrator agent (no independent evidence)
purpose: Autonomously manages design, evaluation, and selection of prompt strategies
New component introduced as part of the CPEMH architecture.

Inference agent (no independent evidence)
purpose: Executes foundation model inferences with selected prompts
New component introduced as part of the CPEMH architecture.

Evaluation agent (no independent evidence)
purpose: Assesses prompt performance using F1, bias, and robustness criteria
New component introduced as part of the CPEMH architecture.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

modular agentic design, combining orchestrator, inference, and evaluation agents... integration of F1, bias, and robustness as core acceptance criteria

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
