M3: Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis
Pith reviewed 2026-05-22 00:01 UTC · model grok-4.3
The pith
M3 lets researchers query the MIMIC-IV clinical database in plain English using large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
M3 shows that large language models can translate natural-language clinical questions into correct SQL queries against MIMIC-IV, execute them, and return results with the underlying query visible for verification. On one hundred answerable questions, Claude Sonnet 4 reached 94 percent accuracy and the locally deployable gpt-oss-20B reached 93 percent; on one hundred matched unanswerable questions, gpt-oss-20B correctly abstained 69 percent of the time.
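Both headline figures are simple proportions over labeled outcomes. A minimal sketch of how they could be computed follows. The record fields (`answerable`, `abstained`, `correct`) and the `score` function are hypothetical names, the synthetic records only reproduce the counts reported for gpt-oss-20B, and counting an abstention on an answerable question as an error is an assumption rather than the paper's stated scoring rule.

```python
def score(records):
    """Return (accuracy on answerable, abstention rate on unanswerable)."""
    answerable = [r for r in records if r["answerable"]]
    unanswerable = [r for r in records if not r["answerable"]]
    # Assumption: abstaining on an answerable question counts as a miss.
    n_correct = sum(1 for r in answerable if r["correct"] and not r["abstained"])
    n_abstained = sum(1 for r in unanswerable if r["abstained"])
    return n_correct / len(answerable), n_abstained / len(unanswerable)


# Synthetic records that mirror the reported gpt-oss-20B counts:
# 93 of 100 answerable correct, 69 of 100 unanswerable abstained.
records = (
    [{"answerable": True, "abstained": False, "correct": True}] * 93
    + [{"answerable": True, "abstained": False, "correct": False}] * 7
    + [{"answerable": False, "abstained": True, "correct": False}] * 69
    + [{"answerable": False, "abstained": False, "correct": False}] * 31
)

accuracy, abstention_rate = score(records)
print(f"accuracy={accuracy:.2f} abstention={abstention_rate:.2f}")
```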
What carries the argument
The M3 system, which uses the Model Context Protocol to retrieve MIMIC-IV data, start a local SQLite or hosted BigQuery instance, convert plain-English questions into SQL, run the queries, and present structured results with the underlying query for checking.
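The last two steps of that pipeline, running the SQL and returning the result together with the query, can be sketched with Python's standard `sqlite3` module. This is an illustration only: the `run_with_provenance` helper, the toy `admissions` table, and its two columns are assumptions made for the example, not M3's code or the real MIMIC-IV schema, and the natural-language-to-SQL step is left out (the SQL string stands in for model output).

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a tiny stand-in table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE admissions (subject_id INTEGER, admission_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO admissions VALUES (?, ?)",
        [(1, "URGENT"), (2, "ELECTIVE"), (3, "URGENT")],
    )
    return conn


def run_with_provenance(conn, sql):
    """Execute sql and return the rows alongside the query that produced them."""
    cursor = conn.execute(sql)
    columns = [col[0] for col in cursor.description]
    return {"sql": sql, "columns": columns, "rows": cursor.fetchall()}


conn = build_toy_db()
# Stand-in for the SQL a model might produce for
# "How many admissions are there of each type?"
generated_sql = (
    "SELECT admission_type, COUNT(*) AS n FROM admissions "
    "GROUP BY admission_type ORDER BY admission_type"
)
result = run_with_provenance(conn, generated_sql)
print(result["sql"])
print(result["columns"], result["rows"])
```

Returning the query text in the same structure as the rows is what lets a reader check the SQL before trusting the numbers.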
If this is right
- Researchers without SQL training can directly explore large critical-care datasets.
- An open model running on ordinary hardware makes fully local, privacy-preserving analysis practical.
- Built-in validation and logging steps reduce risks when working with protected health information.
- Most remaining errors trace to temporal reasoning or unclear phrasing rather than core system limits.
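The abstract names query validation and audit logging but does not describe their logic. The sketch below shows one plausible shape for such a guard, assuming a read-only policy: accept a single `SELECT` or `WITH` statement and record every decision. The names `is_read_only` and `audit`, the keyword list, and the log format are all hypothetical, and the check is deliberately conservative (it would also reject a harmless query that merely mentions a blocked keyword inside a string literal).

```python
import re
from datetime import datetime, timezone

# Assumed blocklist of statement keywords that modify data or schema.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma)\b",
    re.IGNORECASE,
)


def is_read_only(sql):
    """Accept a single statement that starts with SELECT or WITH."""
    stripped = sql.strip().rstrip(";").strip()
    if not stripped or ";" in stripped:
        return False  # empty input or several statements chained together
    if not re.match(r"(?i)(select|with)\b", stripped):
        return False
    return FORBIDDEN.search(stripped) is None


def audit(log, sql):
    """Append a timestamped decision to log and return whether sql is allowed."""
    allowed = is_read_only(sql)
    log.append(
        {
            "time": datetime.now(timezone.utc).isoformat(),
            "sql": sql,
            "allowed": allowed,
        }
    )
    return allowed


audit_log = []
print(audit(audit_log, "SELECT COUNT(*) FROM patients"))
print(audit(audit_log, "SELECT 1; DROP TABLE patients"))
```

A keyword filter like this is a coarse first line of defense; opening the database connection in read-only mode would be a natural complement.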
Where Pith is reading between the lines
- The same natural-language interface could be adapted for other large medical record collections beyond MIMIC-IV.
- Adding better handling of ambiguous questions might raise real-world usefulness without changing the architecture.
- Direct trials with active clinical teams would show whether the benchmark numbers translate to everyday research workflows.
Load-bearing premise
Performance measured on selected questions from the EHRSQL 2024 benchmark will carry over to the wider range of ambiguous and complex questions that actual clinical researchers ask when exploring MIMIC-IV data.
What would settle it
Testing M3 on a fresh set of questions collected directly from practicing clinical researchers and checking whether accuracy stays near 90 percent and correct abstention rates remain high.
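Judging whether accuracy "stays near" a reported value requires some sense of how much a 100-question estimate can move by chance. One standard tool is the Wilson score interval for a binomial proportion, sketched below. This calculation is not part of the paper; the function name `wilson_interval` and the choice of z = 1.96 (about 95 percent coverage) are assumptions made for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    )
    return center - half_width, center + half_width


# Reported counts: 94 of 100 answerable questions correct (Claude Sonnet 4).
low, high = wilson_interval(94, 100)
print(f"94/100 -> [{low:.3f}, {high:.3f}]")
```

With 100 questions the interval spans roughly ten percentage points, so a fresh sample that lands a few points below 94 percent would not by itself contradict the reported figure. The interval captures sampling noise only and says nothing about whether benchmark questions resemble real researcher queries.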
Abstract
Large-scale clinical databases offer opportunities for medical research, but their complexity creates barriers to effective use. The Medical Information Mart for Intensive Care (MIMIC-IV), one of the world's largest open-source electronic health record databases, traditionally requires both SQL proficiency and clinical domain expertise. We introduce M3, a system that enables natural language querying of MIMIC-IV data through the Model Context Protocol. With a single command, M3 retrieves MIMIC-IV from PhysioNet, launches a local SQLite instance or connects to hosted BigQuery, and allows researchers to pose clinical questions in plain English. We evaluated M3 using samples from the EHRSQL 2024 benchmark with two language models. On one hundred answerable questions, the proprietary Claude Sonnet 4 achieved 94% accuracy and the open-weights gpt-oss-20B (deployable locally on consumer hardware) achieved 93%; on a matched sample of one hundred unanswerable questions, where correct behavior is to abstain rather than produce SQL, gpt-oss-20B correctly abstained on 69%. Both models translate natural language into SQL, execute queries against MIMIC-IV, and return structured results alongside the underlying query for verification. Error analysis revealed that most failures stemmed from complex temporal reasoning or ambiguous question phrasing rather than fundamental architectural limitations. The comparable performance of a smaller open-weights model demonstrates that privacy-preserving local deployment is viable for sensitive clinical data analysis. M3 lowers technical barriers to critical care data analysis and is designed with security measures including OAuth2 authentication, query validation, and audit logging.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the M3 system, which integrates conversational LLMs with the Model Context Protocol to allow natural language querying of the MIMIC-IV clinical database. The system automates data retrieval from PhysioNet, supports local SQLite or BigQuery connections, incorporates security features such as OAuth2 and query validation, and is evaluated on samples from the EHRSQL 2024 benchmark, where Claude Sonnet 4 reaches 94% accuracy and gpt-oss-20B reaches 93% on 100 answerable questions while gpt-oss-20B correctly abstains on 69% of 100 unanswerable questions; error analysis identifies temporal reasoning and ambiguous phrasing as primary failure modes.
Significance. If the empirical results hold, the work has clear significance for lowering barriers to clinical data analysis: it enables researchers without SQL expertise to explore MIMIC-IV and supports privacy through local deployment of an open-weights model. The near-parity performance of the smaller gpt-oss-20B model and the explicit error analysis pinpointing temporal reasoning failures are particular strengths that support the case for practical viability.
major comments (1)
- [Evaluation] Evaluation section: the central performance claims rest on 100-question samples drawn from the EHRSQL 2024 benchmark, yet the manuscript provides no direct evidence or distributional analysis showing that these samples capture the greater ambiguity, multi-table complexity, and domain-specific temporal constraints typical of real clinical researchers' queries on MIMIC-IV; this assumption is load-bearing for the practical-utility and local-deployment conclusions.
minor comments (2)
- [Abstract] The abstract and methods would benefit from a brief reference or one-sentence description of the Model Context Protocol for readers outside the immediate subfield.
- [Evaluation] Additional detail on the exact query validation logic and how the 100-question test sets were constructed and validated would strengthen reproducibility without altering the core claims.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's significance and for the constructive comment on the evaluation. We address the point below and are prepared to revise the manuscript accordingly to strengthen the claims regarding practical utility.
Point-by-point responses
- Referee: [Evaluation] Evaluation section: the central performance claims rest on 100-question samples drawn from the EHRSQL 2024 benchmark, yet the manuscript provides no direct evidence or distributional analysis showing that these samples capture the greater ambiguity, multi-table complexity, and domain-specific temporal constraints typical of real clinical researchers' queries on MIMIC-IV; this assumption is load-bearing for the practical-utility and local-deployment conclusions.
  Authors: We agree that a more explicit distributional analysis would strengthen the manuscript. The EHRSQL 2024 benchmark was developed specifically to emulate realistic clinical researcher queries on MIMIC-style EHR data, incorporating temporal constraints, multi-table joins, and ambiguous phrasing drawn from actual clinical workflows. Our error analysis already identifies temporal reasoning and ambiguous phrasing as the dominant failure modes, which directly map to the referee's concerns. To address the gap, we will add a new paragraph and table in the Evaluation section that reports query complexity metrics (average number of tables joined, frequency of temporal operators such as date ranges and interval comparisons, and prevalence of ambiguous terms) for both the 100-question samples and the full EHRSQL 2024 test set. We will also cite prior studies on clinical query patterns to link these characteristics to real-world MIMIC-IV usage. This revision will make the representativeness argument explicit rather than implicit.
  Revision: yes
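Two of the complexity metrics proposed in the rebuttal, the average number of tables joined and the frequency of temporal operators, could be estimated from gold SQL with a few regular expressions. The sketch below is a rough heuristic under stated assumptions: the function names, the list of temporal functions (SQLite-style), and the two sample queries are invented for illustration and do not come from the paper or from EHRSQL. A proper analysis would use a SQL parser, since these patterns do not handle subqueries in `FROM` or quoted identifiers.

```python
import re

# Identifier that directly follows FROM or JOIN.
TABLE_RE = re.compile(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)
# Assumed markers of temporal logic in SQLite-flavored SQL.
TEMPORAL_RE = re.compile(
    r"\b(?:strftime|datetime|julianday|date)\s*\(|\bcurrent_(?:time|date|timestamp)\b",
    re.IGNORECASE,
)


def complexity(sql):
    """Count distinct tables referenced and flag temporal operators."""
    tables = {name.lower() for name in TABLE_RE.findall(sql)}
    return {"n_tables": len(tables), "temporal": TEMPORAL_RE.search(sql) is not None}


def summarize(queries):
    """Average the per-query metrics over a set of SQL strings."""
    stats = [complexity(q) for q in queries]
    n = len(stats)
    return {
        "mean_tables": sum(s["n_tables"] for s in stats) / n,
        "temporal_share": sum(1 for s in stats if s["temporal"]) / n,
    }


# Hypothetical queries, written only to exercise the functions.
sample = [
    "SELECT COUNT(*) FROM patients",
    "SELECT a.hadm_id FROM admissions a JOIN patients p "
    "ON a.subject_id = p.subject_id "
    "WHERE datetime(a.admittime) >= datetime('2100-01-01')",
]
summary = summarize(sample)
print(summary)
```

Running the same summary over the 100-question samples and the full test set would give the side-by-side comparison the authors describe.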
Circularity Check
No significant circularity; evaluation rests on external benchmark
The paper describes an LLM-based system for natural-language querying of MIMIC-IV and reports direct empirical accuracy (93-94% on answerable EHRSQL 2024 samples, 69% correct abstention on unanswerable samples) without any mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations. All performance claims are obtained by running the models on an independently constructed external benchmark, so the results remain falsifiable outside the paper's own inputs or assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Samples from the EHRSQL 2024 benchmark are representative of real clinical questions that researchers would ask of MIMIC-IV.
invented entities (1)
- M3 system: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel: unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "We introduce M3, a system that enables natural language querying of MIMIC-IV data through the Model Context Protocol... 94% accuracy... 69% correct abstention"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Coding-Free and Privacy-Preserving Agentic Framework for Data-Driven Clinical Research
  CARIS is a new agentic LLM framework that automates clinical research workflows from planning to reporting in a coding-free and privacy-preserving manner, achieving high completeness scores on heterogeneous datasets.
- ClinQueryAgent: A Conversational Agent for Population Health Management
  The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 s...
CRediT Author Statement
Rafi Al Attrach: Investigation (lead), Software (lead), Writing – original draft (lead), Writi...