pith. machine review for the scientific record.

arxiv: 2604.12258 · v2 · submitted 2026-04-14 · 💻 cs.CL · cs.AI

Recognition: unknown

Coding-Free and Privacy-Preserving Agentic Framework for Data-Driven Clinical Research

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:05 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords clinical research automation · agentic AI · large language models · IRB documentation · privacy-preserving analysis · cohort construction · human-in-the-loop · data-driven medicine

The pith

CARIS uses language models to automate clinical research from planning to reports without any coding or direct data access.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CARIS as a system that lets clinicians drive data research through natural language alone. It automates the full pipeline of planning, literature search, cohort building, IRB paperwork, machine learning runs, and report writing, while routing all outputs through human review and exposing only derived outputs, never raw patient records, to users. An evaluation on three heterogeneous clinical datasets showed the system completing planning and IRB steps in four or fewer rounds and reaching 96 percent completeness under automated LLM-based checks and 82 percent under human review. This approach aims to lower the technical and documentation hurdles that currently limit who can perform rigorous clinical studies on real data.

Core claim

CARIS integrates large language models with modular tools via the Model Context Protocol to execute end-to-end clinical research workflows, completing research planning and IRB documentation within four iterations, supporting Vibe ML, and producing reports with 96 percent LLM-evaluated completeness and 82 percent human-evaluated completeness across three heterogeneous datasets, all while preserving privacy by exposing only outputs to users.

What carries the argument

The Model Context Protocol (MCP) that links LLMs to specialized tools for tasks such as cohort construction and IRB drafting, allowing the entire workflow to be driven by natural language while keeping patient data private.
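The mechanics of such a loop can be sketched in outline. This is a hypothetical illustration, not CARIS's actual implementation: the tool names, the `run_tool` dispatcher, and the returned fields are all invented for the example. The property the paper relies on is that tools execute inside the secure environment and hand back only derived outputs.

```python
# Hypothetical sketch of an MCP-style tool loop: the model requests a tool
# by name, the tool runs inside the secure data environment, and only the
# derived output (never patient-level rows) is returned to the model and user.

def build_cohort(criteria: dict) -> dict:
    # Placeholder: a real system would query the clinical database here.
    # It returns only aggregate counts, not patient-level records.
    return {"n_matched": 412, "criteria": criteria}

def draft_irb_section(topic: str) -> dict:
    # Placeholder: would call an LLM with dataset-specific context.
    return {"section": f"Draft IRB text for: {topic}"}

TOOLS = {
    "build_cohort": build_cohort,
    "draft_irb_section": draft_irb_section,
}

def run_tool(name: str, args: dict) -> dict:
    """Dispatch a model-issued tool call; unknown tools are refused."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**args)

# One agent step: the model emits a tool call, the host executes it.
result = run_tool("build_cohort", {"criteria": {"age_min": 65}})
```

In this shape, the natural-language interface sits entirely on the model side of `run_tool`; nothing the user types ever touches the data store directly.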

If this is right

  • Clinicians without programming skills can generate IRB-ready documentation and cohort definitions directly from study questions.
  • Research teams can iterate on analysis plans while keeping raw patient records inside secure environments.
  • Report generation becomes a repeatable output of the same agent loop rather than a separate manual step.
  • The same framework can be applied to both public and private datasets without changing the user interface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the four-iteration bound holds across more institutions, the time from question to IRB submission could drop from weeks to days for many observational studies.
  • The privacy design opens the possibility of running the agent on federated hospital data without moving records to a central server.
  • Extending the tool set to include statistical power calculations or regulatory checklist completion could further compress the pre-study phase.
  • Human oversight remains essential, so the framework may be most useful as a co-pilot rather than a fully autonomous researcher.

Load-bearing premise

Large language models connected through MCP can reliably carry out clinical tasks such as cohort construction and IRB documentation with only the small number of human corrections reported.

What would settle it

Running CARIS on a fresh clinical dataset where planning or IRB documents require more than four rounds of correction or contain critical factual errors that human reviewers must rewrite.

Figures

Figures reproduced from arXiv: 2604.12258 by Hyeonhoon Lee, Hyeryun Park, Hyung-Chul Lee, Kyungsang Kim, Taehun Kim, Yushin Lee.

Figure 1. Overview of the Clinical Agentic Research Intelligence System (CARIS).
Figure 2. Overview of the clinical research workflow.
Figure 3. Revision patterns in IRB document generation.
Figure 4. Radar chart of IRB document evaluation results across four criteria. Each axis represents the pass rate (%) for each criterion, assessed across three tasks.
Figure 5. Checklist coverage of the final report across nine criteria.
Original abstract

Clinical data-driven research requires clinical expertise, programming skills, access to patient data, and extensive documentation, creating barriers and slowing the pace for clinicians and external researchers. To address this, we developed the Clinical Agentic Research Intelligence System (CARIS) that automates the workflow: research planning, literature search, cohort construction, Institutional Review Board (IRB) documentation, Vibe Machine Learning (ML), and report generation, with human-in-the-loop refinement. CARIS integrates Large Language Models (LLMs) with modular tools through the Model Context Protocol (MCP), enabling natural language-driven research without coding while allowing users to access only outputs. We evaluated CARIS on three heterogeneous datasets with distinct clinical tasks, where it completed planning and IRB documentation within four iterations, supported Vibe ML, and generated reports, achieving 96% completeness in LLM-based evaluation and 82% in human evaluation. CARIS demonstrates potential to reduce documentation burden and technical barriers, accelerating data-driven clinical research across public and private data environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CARIS (Clinical Agentic Research Intelligence System), an LLM-based agentic framework using the Model Context Protocol (MCP) to automate end-to-end data-driven clinical research workflows—research planning, literature search, cohort construction, IRB documentation, Vibe ML, and report generation—without requiring user coding while preserving privacy by exposing only outputs. Evaluated on three heterogeneous datasets with distinct clinical tasks, the system is reported to complete planning and IRB documentation in four human-in-the-loop iterations, support Vibe ML, and generate reports, achieving 96% completeness under LLM-based evaluation and 82% under human evaluation.

Significance. If the performance and safety claims are substantiated, CARIS could meaningfully reduce technical and documentation barriers for clinicians and external researchers working with sensitive clinical data, enabling faster iteration on cohort definition and regulatory documentation across public and private environments. The modular MCP integration for tool use without code exposure offers a practical template for privacy-preserving agentic systems in regulated domains.

major comments (3)
  1. [Abstract / Evaluation] Evaluation methodology (abstract and results): The central performance claims rest on 96% LLM-based and 82% human completeness scores, yet no definition of 'completeness,' error typology (e.g., factual inaccuracies in IRB text or incorrect cohort logic), baseline comparisons, or inter-rater reliability is provided. This leaves the metrics vulnerable to superficial coverage rather than verified clinical accuracy or safety.
  2. [Results] Human evaluation protocol: The 82% human score is reported without specifying the number or expertise of raters, blinding procedures, ground-truth references for the three datasets, or breakdown of error types (e.g., medical logic errors vs. formatting issues), which is required to support the claim that outputs are usable after only four iterations.
  3. [Evaluation] Generalizability across datasets: While three heterogeneous datasets are mentioned, no per-dataset or per-task performance breakdown is given, nor any analysis of failure modes on varying data schemas or clinical domains, undermining the assertion of broad applicability.
minor comments (2)
  1. [Abstract] The term 'Vibe ML' is used without an explicit expansion or reference on first use in the abstract.
  2. [Methods] The description of MCP integration would benefit from a brief diagram or pseudocode showing the tool-calling loop to clarify how privacy is enforced at the protocol level.
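The pseudocode the referee asks for does not exist in the paper as provided. Purely as an illustration of what protocol-level enforcement could look like, one mechanism is an output allow-list applied at the tool boundary; the field names and allow-list below are invented for the sketch.

```python
# Illustrative only: one way a host could enforce output-only access at the
# protocol boundary. Every tool result passes through this gate before the
# model or user sees it; field names here are invented examples.

ALLOWED_OUTPUT_KEYS = {"summary", "table", "figure_path", "n", "metrics"}

def sanitize(tool_output: dict) -> dict:
    """Drop any key not on the allow-list before crossing the boundary."""
    return {k: v for k, v in tool_output.items() if k in ALLOWED_OUTPUT_KEYS}

raw = {
    "n": 412,
    "metrics": {"auroc": 0.81},
    "patient_rows": ["row-1", "row-2"],  # must never leave the enclave
}
safe = sanitize(raw)  # "patient_rows" is stripped; only aggregates remain
```

Whether CARIS enforces privacy this way, via query restrictions, or elsewhere in the stack is exactly what the requested diagram would clarify.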

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important opportunities to strengthen the transparency and rigor of the evaluation methodology, and we address each major comment point by point below. We will incorporate revisions to address the identified gaps.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Evaluation methodology (abstract and results): The central performance claims rest on 96% LLM-based and 82% human completeness scores, yet no definition of 'completeness,' error typology (e.g., factual inaccuracies in IRB text or incorrect cohort logic), baseline comparisons, or inter-rater reliability is provided. This leaves the metrics vulnerable to superficial coverage rather than verified clinical accuracy or safety.

    Authors: We agree that the evaluation methodology requires greater specificity. The submitted manuscript reports the aggregate completeness scores but does not define 'completeness,' provide an error typology, include baseline comparisons, or report inter-rater reliability. In the revised version we will add an explicit definition of completeness (proportion of workflow components generated without critical factual or logical errors), a categorized error typology, a discussion of baseline limitations given the novelty of the end-to-end task, and inter-rater reliability statistics to better substantiate claims regarding clinical accuracy and safety. revision: yes
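The definitions promised here are simple to state concretely. As a hedged sketch (the ratings and counts below are invented, not from the paper), completeness reduces to a proportion and inter-rater reliability to Cohen's kappa for two raters:

```python
# Illustrative metric definitions matching the rebuttal's promises.
# Inputs are invented examples, not data reported in the paper.

def completeness(flags):
    """Proportion of workflow components with no critical error
    (True = component is error-free)."""
    return sum(flags) / len(flags)

def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters over the same items."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n                    # per-rater positive rates
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)              # chance agreement
    return (p_o - p_e) / (1 - p_e)

# e.g. 82 of 100 components pass -> completeness 0.82
flags = [True] * 82 + [False] * 18
```

Reporting kappa alongside the raw percentage would distinguish genuine rater consensus from agreement expected by chance.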

  2. Referee: [Results] Human evaluation protocol: The 82% human score is reported without specifying the number or expertise of raters, blinding procedures, ground-truth references for the three datasets, or breakdown of error types (e.g., medical logic errors vs. formatting issues), which is required to support the claim that outputs are usable after only four iterations.

    Authors: We acknowledge that the human evaluation protocol details are not described in the current manuscript. We will revise the Results section to specify the number and expertise of raters, blinding procedures, how ground-truth references were constructed for each dataset, and a breakdown of error types. These additions will provide the necessary context to evaluate the 82% score and the usability claim after four iterations. revision: yes

  3. Referee: [Evaluation] Generalizability across datasets: While three heterogeneous datasets are mentioned, no per-dataset or per-task performance breakdown is given, nor any analysis of failure modes on varying data schemas or clinical domains, undermining the assertion of broad applicability.

    Authors: We agree that the absence of per-dataset and per-task breakdowns, together with failure-mode analysis, limits the strength of the generalizability claim. The manuscript currently reports only aggregate results across the three datasets. In the revision we will add a table with performance metrics disaggregated by dataset and task, plus a dedicated analysis of observed failure modes related to data schemas and clinical domains. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system description without derivations or self-referential predictions

full rationale

The paper presents CARIS as an LLM-integrated agentic framework for clinical research workflows and reports empirical completeness metrics (96% LLM-based, 82% human) on three datasets. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text or abstract. Claims rest on iterative human-in-the-loop execution and external evaluations rather than any chain that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any mathematical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the unverified assumption that current LLMs can be reliably orchestrated for high-stakes clinical documentation and analysis tasks; no independent evidence or formal verification of this capability is provided.

axioms (1)
  • domain assumption Large language models can be safely and accurately guided through clinical research workflows using modular tools and human-in-the-loop refinement.
    This assumption underpins the entire automation claim and is not independently tested in the provided abstract.
invented entities (2)
  • CARIS no independent evidence
    purpose: End-to-end automation of clinical research tasks
    Newly introduced system name and architecture.
  • Vibe ML no independent evidence
    purpose: Machine learning component within the agentic workflow
    Mentioned as a supported capability but not defined or evidenced.

pith-pipeline@v0.9.0 · 5496 in / 1309 out tokens · 79745 ms · 2026-05-10T16:05:35.731871+00:00 · methodology


Reference graph

Works this paper leans on

162 extracted references · 22 canonical work pages · 2 internal anchors

