M3: Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis

Amelia Fiske; Leo Anthony Celi; Pedro Moreira; Rafi Al Attrach; Rajna Fani; Renato Umeton

arxiv: 2507.01053 · v4 · pith:OVPDE5HOnew · submitted 2025-06-27 · 💻 cs.IR · cs.AI · cs.DB

Rafi Al Attrach, Pedro Moreira, Rajna Fani, Renato Umeton, Amelia Fiske, Leo Anthony Celi

Pith reviewed 2026-05-22 00:01 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.DB

keywords MIMIC-IV · natural language querying · clinical data analysis · large language models · SQL generation · privacy-preserving deployment · electronic health records

The pith

M3 lets researchers query the MIMIC-IV clinical database in plain English using large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces M3, a system that connects conversational AI to the large MIMIC-IV intensive care database so users can ask clinical questions in natural language rather than writing SQL. It reports that both a proprietary model and a smaller open-weights model running locally reach 93 to 94 percent accuracy on answerable benchmark questions drawn from EHRSQL 2024. The open model also abstains correctly on 69 percent of unanswerable questions instead of guessing. The system includes built-in security steps such as authentication and query validation to support safe handling of patient data.

Core claim

M3 shows that large language models can translate natural-language clinical questions into correct SQL queries against MIMIC-IV, execute them, and return results with the original query visible for verification. On one hundred answerable questions, Claude Sonnet 4 reaches 94 percent accuracy and the locally deployable gpt-oss-20B reaches 93 percent; on one hundred matched unanswerable questions, gpt-oss-20B correctly abstains 69 percent of the time.
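The two headline numbers measure different things: accuracy on answerable questions and abstention on unanswerable ones. The sketch below shows one way such scores could be tallied. The `Outcome` record, its field names, and the rule that a missing SQL string counts as an abstention are assumptions made for illustration; the paper's exact scoring procedure is not described on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    """One benchmark question and what the model did with it (hypothetical record)."""
    answerable: bool                    # gold label from the benchmark
    predicted_sql: Optional[str]        # None is treated here as "the model abstained"
    result_matches_gold: bool = False   # only meaningful when SQL was produced


def score(outcomes):
    """Return (accuracy on answerable questions, abstention rate on unanswerable ones)."""
    answerable = [o for o in outcomes if o.answerable]
    unanswerable = [o for o in outcomes if not o.answerable]
    if not answerable or not unanswerable:
        raise ValueError("need at least one question of each kind")
    correct = sum(
        o.predicted_sql is not None and o.result_matches_gold for o in answerable
    )
    abstained = sum(o.predicted_sql is None for o in unanswerable)
    return correct / len(answerable), abstained / len(unanswerable)
```

Keeping the two rates separate matters because a model that never abstains could score well on the first and zero on the second.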

What carries the argument

The M3 system uses the Model Context Protocol to retrieve MIMIC-IV data, launch a local SQLite instance or connect to hosted BigQuery, convert plain-English questions into SQL, run the queries, and present structured results alongside the underlying query for checking.
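The last two steps of that pipeline, running the query and returning results together with the SQL that produced them, can be sketched with Python's standard `sqlite3` module. This is a minimal illustration, not M3's code: `translate_to_sql` is a hypothetical stand-in for the LLM step and returns a canned query, and the in-memory `patients` table is toy data invented for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class QueryResult:
    sql: str        # the query that was run, kept for user verification
    columns: list   # column names of the result set
    rows: list      # result rows as tuples


def translate_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM step; always returns one canned query."""
    return (
        "SELECT gender, COUNT(*) AS n FROM patients "
        "GROUP BY gender ORDER BY gender"
    )


def answer(conn: sqlite3.Connection, question: str) -> QueryResult:
    """Translate a question, run the SQL, and return rows plus the SQL itself."""
    sql = translate_to_sql(question)
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]
    return QueryResult(sql=sql, columns=columns, rows=cur.fetchall())


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory table (toy data, not MIMIC-IV)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patients (subject_id INTEGER PRIMARY KEY, gender TEXT)"
    )
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?)", [(1, "F"), (2, "M"), (3, "F")]
    )
    return conn


if __name__ == "__main__":
    result = answer(demo_connection(), "How many patients are there per gender?")
    print(result.sql)
    print(result.columns, result.rows)
```

Returning the SQL string in the same object as the rows is what lets a reader check the query rather than trust the answer blindly.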

If this is right

Researchers without SQL training can directly explore large critical-care datasets.
An open model running on ordinary hardware makes fully local, privacy-preserving analysis practical.
Built-in validation and logging steps reduce risks when working with protected health information.
Most remaining errors trace to temporal reasoning or unclear phrasing rather than core system limits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same natural-language interface could be adapted for other large medical record collections beyond MIMIC-IV.
Adding better handling of ambiguous questions might raise real-world usefulness without changing the architecture.
Direct trials with active clinical teams would show whether the benchmark numbers translate to everyday research workflows.

Load-bearing premise

Performance measured on selected questions from the EHRSQL 2024 benchmark will carry over to the wider range of ambiguous and complex questions that actual clinical researchers ask when exploring MIMIC-IV data.

What would settle it

Testing M3 on a fresh set of questions collected directly from practicing clinical researchers and checking whether accuracy stays near 90 percent and correct abstention rates remain high.

Figures

Figures reproduced from arXiv: 2507.01053 by Amelia Fiske, Leo Anthony Celi, Pedro Moreira, Rafi Al Attrach, Rajna Fani, Renato Umeton.

**Figure 2.** Conceptual Diagram of the M3 System Architecture

**Figure 3.** Query: “Show trends in systolic blood pressure for patients on vasopressors within 48 hours of ICU admission.”

Original abstract

Large-scale clinical databases offer opportunities for medical research, but their complexity creates barriers to effective use. The Medical Information Mart for Intensive Care (MIMIC-IV), one of the world's largest open-source electronic health record databases, traditionally requires both SQL proficiency and clinical domain expertise. We introduce M3, a system that enables natural language querying of MIMIC-IV data through the Model Context Protocol. With a single command, M3 retrieves MIMIC-IV from PhysioNet, launches a local SQLite instance or connects to hosted BigQuery, and allows researchers to pose clinical questions in plain English. We evaluated M3 using samples from the EHRSQL 2024 benchmark with two language models. On one hundred answerable questions, the proprietary Claude Sonnet 4 achieved 94% accuracy and the open-weights gpt-oss-20B (deployable locally on consumer hardware) achieved 93%; on a matched sample of one hundred unanswerable questions, where correct behavior is to abstain rather than produce SQL, gpt-oss-20B correctly abstained on 69%. Both models translate natural language into SQL, execute queries against MIMIC-IV, and return structured results alongside the underlying query for verification. Error analysis revealed that most failures stemmed from complex temporal reasoning or ambiguous question phrasing rather than fundamental architectural limitations. The comparable performance of a smaller open-weights model demonstrates that privacy-preserving local deployment is viable for sensitive clinical data analysis. M3 lowers technical barriers to critical care data analysis and is designed with security measures including OAuth2 authentication, query validation, and audit logging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

M3 gives a workable local LLM setup for plain-English MIMIC-IV queries with 93% benchmark accuracy, but real clinical questions may expose more weaknesses than the tested samples show.

Quick note on the M3 paper for conversational access to clinical data. This work introduces a system that makes it easier to query the MIMIC-IV database using natural language through large language models. It integrates the Model Context Protocol so that with minimal setup, you can pull the data, connect to a local database or BigQuery, and start asking questions in plain English. They report solid results on a benchmark: 93% accuracy for an open local model on answerable questions and 69% correct abstention when the question can't be answered from the data. The error analysis highlights temporal reasoning as the main challenge.

What the paper does well is focus on practical deployment. The local model option addresses privacy concerns for sensitive health data, and they include security features like authentication and logging. Having specific numbers from defined samples and identifying failure modes gives a clearer picture than many similar efforts. It's an extension of existing text-to-SQL techniques tailored to this database with added safeguards.

The soft spots are around generalization. The evaluation uses samples from EHRSQL 2024, which may not fully capture the messier, more ambiguous questions that actual clinical researchers ask when exploring the data. Real queries often involve complex joins or domain-specific temporal constraints that could trip up the system more than the benchmark suggests. The paper acknowledges some of these issues, but without additional validation on user-generated questions, the practical impact is harder to gauge.

This paper is aimed at people in critical care informatics who need to work with large EHR databases but lack SQL expertise. It could help broaden participation in research using these resources. The approach shows honest engagement with the technical and privacy challenges.

I'd recommend putting this through peer review. The concrete implementation and results make it worth a detailed look from referees familiar with clinical data systems.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the M3 system, which integrates conversational LLMs with the Model Context Protocol to allow natural language querying of the MIMIC-IV clinical database. The system automates data retrieval from PhysioNet, supports local SQLite or BigQuery connections, incorporates security features such as OAuth2 and query validation, and is evaluated on samples from the EHRSQL 2024 benchmark, where Claude Sonnet 4 reaches 94% accuracy and gpt-oss-20B reaches 93% on 100 answerable questions while gpt-oss-20B correctly abstains on 69% of 100 unanswerable questions; error analysis identifies temporal reasoning and ambiguous phrasing as primary failure modes.

Significance. If the empirical results hold, the work has clear significance for lowering barriers to clinical data analysis, enabling researchers without SQL expertise to explore MIMIC-IV while supporting privacy via local open-weights model deployment. The near-parity performance of the smaller gpt-oss-20B model and the explicit error analysis pinpointing temporal reasoning failures are particular strengths that strengthen the case for practical viability.

major comments (1)

[Evaluation] Evaluation section: the central performance claims rest on 100-question samples drawn from the EHRSQL 2024 benchmark, yet the manuscript provides no direct evidence or distributional analysis showing that these samples capture the greater ambiguity, multi-table complexity, and domain-specific temporal constraints typical of real clinical researchers' queries on MIMIC-IV; this assumption is load-bearing for the practical-utility and local-deployment conclusions.
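A reader can gauge how much a 100-question sample constrains the reported rates by computing a confidence interval for each proportion. The sketch below uses the standard Wilson score interval at roughly the 95% level (z = 1.96). It is a back-of-the-envelope check and not an analysis from the paper, which reports point estimates only.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Counts taken from the reported results on 100-question samples.
    for label, k in [
        ("Claude Sonnet 4 accuracy", 94),
        ("gpt-oss-20B accuracy", 93),
        ("gpt-oss-20B abstention", 69),
    ]:
        lo, hi = wilson_interval(k, 100)
        print(f"{label}: {k}/100, interval {lo:.3f} to {hi:.3f}")
```

With samples of this size, the intervals for 94/100 and 93/100 overlap heavily, so the one-point gap between the two models should not be read as a meaningful difference.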

minor comments (2)

[Abstract] The abstract and methods would benefit from a brief reference or one-sentence description of the Model Context Protocol for readers outside the immediate subfield.
[Evaluation] Additional detail on the exact query validation logic and how the 100-question test sets were constructed and validated would strengthen reproducibility without altering the core claims.
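To make the reproducibility request concrete, here is what a conservative read-only check on generated SQL might look like. The rules below (single statement, must start with `SELECT` or `WITH`, no write or schema keywords) are assumptions chosen for illustration; the page does not describe M3's actual validation logic, and a keyword filter alone is not a complete defense against injection.

```python
import re

# Keywords that modify data or schema; any occurrence causes a rejection.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|REPLACE|ATTACH|DETACH|PRAGMA|VACUUM)\b",
    re.IGNORECASE,
)


def is_read_only_select(sql: str) -> bool:
    """Accept only a single SELECT/WITH statement with no write keywords.

    Deliberately strict: a harmless query whose string literal contains a
    forbidden word or a semicolon is rejected as well.
    """
    stripped = sql.strip().rstrip(";").strip()
    if not stripped:
        return False
    if ";" in stripped:  # more than one statement
        return False
    if not re.match(r"(SELECT|WITH)\b", stripped, re.IGNORECASE):
        return False
    return FORBIDDEN.search(stripped) is None
```

In practice a check like this would usually be paired with a database connection opened in read-only mode, so that a query slipping past the filter still cannot write.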

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive comment on the evaluation. We address the point below and are prepared to revise the manuscript accordingly to strengthen the claims regarding practical utility.

Referee: [Evaluation] Evaluation section: the central performance claims rest on 100-question samples drawn from the EHRSQL 2024 benchmark, yet the manuscript provides no direct evidence or distributional analysis showing that these samples capture the greater ambiguity, multi-table complexity, and domain-specific temporal constraints typical of real clinical researchers' queries on MIMIC-IV; this assumption is load-bearing for the practical-utility and local-deployment conclusions.

Authors: We agree that a more explicit distributional analysis would strengthen the manuscript. The EHRSQL 2024 benchmark was developed specifically to emulate realistic clinical researcher queries on MIMIC-style EHR data, incorporating temporal constraints, multi-table joins, and ambiguous phrasing drawn from actual clinical workflows. Our error analysis already identifies temporal reasoning and ambiguous phrasing as the dominant failure modes, which directly map to the referee's concerns. To address the gap, we will add a new paragraph and table in the Evaluation section that reports query complexity metrics (average number of tables joined, frequency of temporal operators such as date ranges and interval comparisons, and prevalence of ambiguous terms) for both the 100-question samples and the full EHRSQL 2024 test set. We will also cite prior studies on clinical query patterns to link these characteristics to real-world MIMIC-IV usage. This revision will make the representativeness argument explicit rather than implicit.

revision: yes
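The complexity metrics promised in this response could be approximated directly from the gold SQL strings. The sketch below counts `JOIN` keywords and a small set of SQLite-style temporal functions and operators. The keyword list and the regex approach are assumptions for illustration: they are rough proxies, not a SQL parser, and not the authors' method.

```python
import re
from statistics import mean

JOIN_RE = re.compile(r"\bJOIN\b", re.IGNORECASE)
# Rough proxy for temporal reasoning: SQLite date functions plus BETWEEN.
TEMPORAL_RE = re.compile(
    r"\b(strftime|datetime|date|julianday|current_time|current_date|between)\b",
    re.IGNORECASE,
)


def complexity(sql: str) -> dict:
    """Count JOIN keywords and temporal tokens in one SQL string."""
    return {
        "joins": len(JOIN_RE.findall(sql)),
        "temporal_ops": len(TEMPORAL_RE.findall(sql)),
    }


def summarize(queries) -> dict:
    """Average joins per query and share of queries with any temporal token."""
    stats = [complexity(q) for q in queries]
    return {
        "mean_joins": mean(s["joins"] for s in stats),
        "share_temporal": mean(1 if s["temporal_ops"] else 0 for s in stats),
    }
```

Running `summarize` on both the 100-question sample and the full test set would show whether the sample is skewed toward simpler queries.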

Circularity Check

0 steps flagged

No significant circularity; evaluation rests on external benchmark

The paper describes an LLM-based system for natural-language querying of MIMIC-IV and reports direct empirical accuracy (93-94% on answerable EHRSQL 2024 samples, 69% correct abstention on unanswerable samples) without any mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations. All performance claims are obtained by running the models on an independently constructed external benchmark, so the results remain falsifiable outside the paper's own inputs or assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the reliability of LLM text-to-SQL translation for clinical queries and the representativeness of the EHRSQL benchmark samples; no free parameters are introduced, and the M3 system itself is the primary new entity without independent falsifiable evidence beyond the reported evaluation.

axioms (1)

domain assumption: Samples from the EHRSQL 2024 benchmark are representative of real clinical questions that researchers would ask of MIMIC-IV.
The accuracy claims rest on evaluation against this benchmark; if the sample does not reflect typical usage, the reported performance may not generalize.

invented entities (1)

M3 system (no independent evidence)
purpose: To enable natural language querying of MIMIC-IV with security and local deployment options
The paper introduces M3 as a new integrated tool; no external falsifiable prediction (such as a specific new clinical finding) is provided to validate the entity independently.

pith-pipeline@v0.9.0 · 5841 in / 1451 out tokens · 58216 ms · 2026-05-22T00:01:15.024134+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We introduce M3, a system that enables natural language querying of MIMIC-IV data through the Model Context Protocol... 94% accuracy... 69% correct abstention

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Coding-Free and Privacy-Preserving Agentic Framework for Data-Driven Clinical Research
cs.CL · 2026-04 · unverdicted · novelty 5.0

CARIS is a new agentic LLM framework that automates clinical research workflows from planning to reporting in a coding-free and privacy-preserving manner, achieving high completeness scores on heterogeneous datasets.

ClinQueryAgent: A Conversational Agent for Population Health Management
cs.IR · 2026-04 · unverdicted · novelty 4.0

The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 s...

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 2 Pith papers · 1 internal anchor

[1] R Scott Evans. Electronic health records: then, now, and in the future. Yearbook of medical informatics, 25(S 01):S48–S61, 2016.

[2] Alistair E. W. Johnson, Lucas Bulgarelli, Li Shen, Amy Gayles, Ahmed Shammout, Steven Horng, Tom J. Pollard, Leo Anthony Celi, and Roger G. Mark. Mimic-iv, a freely accessible electronic health record dataset. Scientific Data, 10(1):1, 2023. doi: 10.1038/s41597-022-01899-x

[3] STAT News. Generative ai tracker, 2025. URL https://apps.statnews.com/ai-tracker/public/index.html. [Accessed 25-06-2025]

[4] Lavender Yao Jiang, Xujin Chris Liu, Nima Pour Nejatian, Mustafa Nasir-Moin, Duo Wang, Anas Abidin, Kevin Eaton, Howard Antony Riina, Ilya Laufer, Paawan Punjabi, et al. Health system-scale language models are all-purpose prediction engines. Nature, 619(7969):357–362, 2023.

[5] Renato Umeton, Anne Kwok, Rahul Maurya, Domenic Leco, Naomi Lenane, Jennifer Willcox, Gregory A. Abel, Mary Tolikas, and Jason M. Johnson. Gpt-4 in a cancer center — institute-wide deployment challenges and lessons learned. NEJM AI, 1(4):AIcs2300191, 2024. doi: 10.1056/AIcs2300191. URL https://ai.nejm.org/doi/full/10.1056/AIcs2300191

[6] Anthropic. Model context protocol (mcp). https://www.anthropic.com/news/model-context-protocol, Nov 2024. Accessed: 2025-06-07

[7] Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Benjamin Gow, Benjamin Moody, Steven Horng, Leo A. Celi, and Roger Mark. Mimic-iv (version 3.1), 2024. URL https://physionet.org/content/mimiciv/3.1/

[8] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. Physiobank, physiotoolkit, and physionet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000.

[9] Gyubok Lee et al. EHRSQL 2024 Dataset – MIMIC-IV Test Set. https://github.com/glee4810/ehrsql-2024/tree/master/data/mimic_iv/test, 2024. GitHub repository, accessed on 2025-06-25

[10] Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. Mimic-iv clinical database demo (version 2.2). https://doi.org/10.13026/dp1f-ex47, 2023. RRID:SCR-007345

[11] Mohammed Saeed, Mauricio Villarroel, Andrew T. Reisner, Gari Clifford, Li-wei Lehman, George Moody, Thomas Heldt, Tin H. Kyaw, Benjamin Moody, and Roger G. Mark. Multiparameter intelligent monitoring in intensive care ii (mimic-ii): A public-access intensive care unit database. Critical Care Medicine, 39(5):952–960, 2011. doi: 10.1097/CCM.0b013e31820a92c6

[12] Alistair E W Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iv (version 2.2). https://physionet.org/content/mimiciv/2.2/, 2022. Accessed: 2025-05-30

[13] Yuan Tian, Jonathan K Kummerfeld, Toby Jia-Jun Li, and Tianyi Zhang. Sqlucid: Grounding natural language database queries with interactive explanations. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, pages 1–20, 2024.

[14] HL7 International. Hl7 fhir standard, Accessed 2025. URL https://www.hl7.org/fhir/. Official HL7 FHIR documentation

[15] Alex M Bennett, Hannes Ulrich, Philip Van Damme, Joshua Wiedekopf, and Alistair EW Johnson. Mimic-iv on fhir: converting a decade of in-patient data into an exchangeable, interoperable format. Journal of the American Medical Informatics Association, 30(4):718–725, 2023.

[16] OHDSI. Standardized data: The omop common data model, Accessed 2025. URL https://www.ohdsi.org/data-standardization/. Accessed: 25 June 2025

[17] Ping Wang, Tian Shi, and Chandan K Reddy. Text-to-sql generation for question answering on electronic medical records. In Proceedings of The Web Conference 2020, pages 350–361, 2020.

[18] Gyubok Lee, Hyeonji Hwang, Seongsu Bae, Yeonsu Kwon, Woncheol Shin, Seongjun Yang, Minjoon Seo, Jong-Yeup Kim, and Edward Choi. Ehrsql: A practical text-to-sql benchmark for electronic health records. Advances in Neural Information Processing Systems, 35:15589–15601, 2022.

[19] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems, 36, 2024.

[20] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. Spider: A large-scale human-labeled dataset for complex and cross-domain text-to-sql tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, 2018.

[21] Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103, 2017.

[22] Mathew J Koretsky, Maya Willey, Adi Asija, Owen Bianchi, Chelsea X Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A Nalls, Daniel Khashabi, et al. Biomedsql: Text-to-sql for scientific reasoning on biomedical knowledge bases. arXiv preprint arXiv:2505.20321, 2025.

[23] Edward Choi, Jinfeng Liang, and Wenxuan Xu. Overview of the EHRSQL 2024 shared task on reliable text-to-SQL modeling. In Proceedings of the 5th Clinical Natural Language Processing Workshop, page —, 2024.

[24] OWASP Foundation. Sql injection prevention cheat sheet. https://cheatsheetseries.owasp.org/cheatsheets/SQL_Injection_Prevention_Cheat_Sheet.html, 2023. Accessed: 2025-06-04

[25] IntuitionLabs. Model context protocol (mcp) in pharma. https://intuitionlabs.ai/articles/model-context-protocol-mcp-in-pharma, 2025. Accessed: 2025-06-14

[26] SuperAGI. Future of industrial automation: Trends and predictions for mcp server adoption in smart manufacturing. https://superagi.com/future-of-industrial-automation-trends-and-predictions-for-mcp-server-adoption-in-smart-manufacturing/

[27] Accessed: 2025-06-14

[28] Gyubok Lee et al. EHRSQL-2024 GitHub Repository. https://github.com/glee4810/ehrsql-2024, 2024. Accessed: 2025-06-12

[29] Mohammadreza Pourreza and Davood Rafiei. Evaluating cross-domain text-to-sql models and benchmarks. arXiv preprint arXiv:2310.18538, 2023.

[30] Sewon Min, Julian Michael, Luke Zettlemoyer, and Hannaneh Hajishirzi. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9199–9212, 2020.

[31] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
