arXiv: 2507.01053 · v4 · pith:OVPDE5HOnew · submitted 2025-06-27 · 💻 cs.IR · cs.AI · cs.DB

M3: Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis

Pith reviewed 2026-05-22 00:01 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.DB
keywords MIMIC-IV · natural language querying · clinical data analysis · large language models · SQL generation · privacy-preserving deployment · electronic health records

The pith

M3 lets researchers query the MIMIC-IV clinical database in plain English using large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces M3, a system that connects conversational AI to the large MIMIC-IV intensive care database so users can ask clinical questions in natural language rather than writing SQL. It reports that both a proprietary model and a smaller open-weights model running locally reach 93 to 94 percent accuracy on answerable benchmark questions drawn from EHRSQL 2024. The open model also abstains correctly on 69 percent of unanswerable questions instead of guessing. The system includes built-in security steps such as authentication and query validation to support safe handling of patient data.

Core claim

M3 shows that large language models can translate natural-language clinical questions into correct SQL queries against MIMIC-IV, execute them, and return results with the original query visible for verification. On one hundred answerable questions, Claude Sonnet 4 reaches 94 percent accuracy and the locally deployable gpt-oss-20B reaches 93 percent; on a matched set of one hundred unanswerable questions, gpt-oss-20B correctly abstains 69 percent of the time.
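The accuracy and abstention figures are two different metrics computed on two different question sets. The sketch below shows one way they could be tallied. It is an illustration only: the `Outcome` record, the `score` function, and the exact-match comparison are assumptions made here, not the paper's scoring procedure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Outcome:
    """One evaluated question (hypothetical record layout)."""
    answerable: bool          # can the question be answered from the database?
    predicted: Optional[str]  # model's answer, or None if the model abstained
    expected: Optional[str]   # reference answer, or None for unanswerable questions


def score(outcomes: List[Outcome]) -> Tuple[float, float]:
    """Return (accuracy on answerable, abstention rate on unanswerable)."""
    answerable = [o for o in outcomes if o.answerable]
    unanswerable = [o for o in outcomes if not o.answerable]

    # An answerable question counts as correct only if the model answered
    # and the answer matches the reference exactly (an assumed criterion).
    correct = sum(
        o.predicted is not None and o.predicted == o.expected for o in answerable
    )
    # An unanswerable question counts as correct only if the model abstained.
    abstained = sum(o.predicted is None for o in unanswerable)

    accuracy = correct / len(answerable) if answerable else float("nan")
    abstention = abstained / len(unanswerable) if unanswerable else float("nan")
    return accuracy, abstention


# Toy data, not drawn from EHRSQL 2024.
toy = [
    Outcome(True, "42", "42"),   # answered correctly
    Outcome(True, "7", "8"),     # answered incorrectly
    Outcome(False, None, None),  # abstained, as desired
    Outcome(False, "3", None),   # guessed instead of abstaining
]
accuracy, abstention = score(toy)
```

On the toy data both rates come out to 0.5. Keeping the two sets separate matters because a model that always answers would score well on the first metric and zero on the second.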

What carries the argument

The M3 system. Built on the Model Context Protocol, it retrieves MIMIC-IV data from PhysioNet, starts a local SQLite instance or connects to hosted BigQuery, converts plain-English questions into SQL, runs the queries, and presents structured results alongside the underlying query for checking.
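The question-to-SQL-to-result flow can be sketched with Python's standard `sqlite3` module. This is a toy under stated assumptions, not M3's code: `translate_to_sql` is a hypothetical stand-in for the language-model step (it returns a canned query), the `patients` table is a three-row toy rather than the MIMIC-IV schema, and the Model Context Protocol layer is left out entirely.

```python
import sqlite3
from typing import Any, Dict


def translate_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM step: looks up a canned query."""
    canned = {
        "how many patients are there?": "SELECT COUNT(*) AS n_patients FROM patients",
    }
    return canned[question.strip().lower()]


def answer(conn: sqlite3.Connection, question: str) -> Dict[str, Any]:
    """Run the translated query and return rows together with the SQL used."""
    sql = translate_to_sql(question)
    cur = conn.execute(sql)
    columns = [d[0] for d in cur.description]  # column names of the result set
    rows = cur.fetchall()
    # Returning the SQL next to the rows lets the user verify what was executed.
    return {"question": question, "sql": sql, "columns": columns, "rows": rows}


# Toy in-memory database (assumed layout, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (subject_id INTEGER PRIMARY KEY, gender TEXT)")
conn.executemany("INSERT INTO patients VALUES (?, ?)", [(1, "F"), (2, "M"), (3, "F")])

result = answer(conn, "How many patients are there?")
```

Here `result["rows"]` is `[(3,)]` and `result["sql"]` holds the query that produced it, which is the property the paper emphasizes: the answer never travels without the query behind it.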

If this is right

  • Researchers without SQL training can directly explore large critical-care datasets.
  • An open model running on ordinary hardware makes fully local, privacy-preserving analysis practical.
  • Built-in validation and logging steps reduce risks when working with protected health information.
  • Most remaining errors trace to temporal reasoning or unclear phrasing rather than core system limits.
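The validation step mentioned in the list above is not specified in this summary. As a rough illustration of what a read-only check might look like, the sketch below accepts a single `SELECT` or `WITH` statement and rejects anything containing a write or schema keyword. The function name `is_safe_select` and the keyword list are assumptions made here; the check is deliberately conservative (it will also reject a harmless query that mentions one of the keywords in a string literal or calls a function named `replace`) and is no substitute for database-level permissions.

```python
import re

# Keywords that indicate a write, schema change, or configuration change.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|detach|pragma|replace|vacuum)\b",
    re.IGNORECASE,
)


def is_safe_select(sql: str) -> bool:
    """Return True only for a single statement that looks read-only."""
    # Allow one optional trailing semicolon, then forbid any other semicolon
    # so that stacked statements are rejected.
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:
        return False
    # The statement must start with SELECT or a WITH clause.
    if not re.match(r"(?is)^(select|with)\b", stripped):
        return False
    # A WITH clause can precede INSERT/UPDATE/DELETE, so scan the whole text.
    return FORBIDDEN.search(stripped) is None


checks = {
    "SELECT * FROM patients;": is_safe_select("SELECT * FROM patients;"),
    "SELECT 1; DROP TABLE patients": is_safe_select("SELECT 1; DROP TABLE patients"),
    "DELETE FROM patients": is_safe_select("DELETE FROM patients"),
    "WITH t AS (SELECT 1) INSERT INTO patients SELECT * FROM t": is_safe_select(
        "WITH t AS (SELECT 1) INSERT INTO patients SELECT * FROM t"
    ),
}
```

Only the first query passes. A text-level filter like this is best treated as one layer; opening the database in read-only mode would be a stronger guarantee.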

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same natural-language interface could be adapted for other large medical record collections beyond MIMIC-IV.
  • Adding better handling of ambiguous questions might raise real-world usefulness without changing the architecture.
  • Direct trials with active clinical teams would show whether the benchmark numbers translate to everyday research workflows.

Load-bearing premise

Performance measured on selected questions from the EHRSQL 2024 benchmark will carry over to the wider range of ambiguous and complex questions that actual clinical researchers ask when exploring MIMIC-IV data.

What would settle it

Testing M3 on a fresh set of questions collected directly from practicing clinical researchers and checking whether accuracy stays near 90 percent and correct abstention rates remain high.

Figures

Figures reproduced from arXiv: 2507.01053 by Amelia Fiske, Leo Anthony Celi, Pedro Moreira, Rafi Al Attrach, Rajna Fani, Renato Umeton.

Figure 1: Results of a complex query, described in natural language as …
Figure 2: Conceptual Diagram of the M3 System Architecture
Figure 3: Query: “Show trends in systolic blood pressure for patients on vasopressors within 48 hours of ICU admission.”
Original abstract

Large-scale clinical databases offer opportunities for medical research, but their complexity creates barriers to effective use. The Medical Information Mart for Intensive Care (MIMIC-IV), one of the world's largest open-source electronic health record databases, traditionally requires both SQL proficiency and clinical domain expertise. We introduce M3, a system that enables natural language querying of MIMIC-IV data through the Model Context Protocol. With a single command, M3 retrieves MIMIC-IV from PhysioNet, launches a local SQLite instance or connects to hosted BigQuery, and allows researchers to pose clinical questions in plain English. We evaluated M3 using samples from the EHRSQL 2024 benchmark with two language models. On one hundred answerable questions, the proprietary Claude Sonnet 4 achieved 94% accuracy and the open-weights gpt-oss-20B (deployable locally on consumer hardware) achieved 93%; on a matched sample of one hundred unanswerable questions, where correct behavior is to abstain rather than produce SQL, gpt-oss-20B correctly abstained on 69%. Both models translate natural language into SQL, execute queries against MIMIC-IV, and return structured results alongside the underlying query for verification. Error analysis revealed that most failures stemmed from complex temporal reasoning or ambiguous question phrasing rather than fundamental architectural limitations. The comparable performance of a smaller open-weights model demonstrates that privacy-preserving local deployment is viable for sensitive clinical data analysis. M3 lowers technical barriers to critical care data analysis and is designed with security measures including OAuth2 authentication, query validation, and audit logging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the M3 system, which integrates conversational LLMs with the Model Context Protocol to allow natural language querying of the MIMIC-IV clinical database. The system automates data retrieval from PhysioNet, supports local SQLite or BigQuery connections, incorporates security features such as OAuth2 and query validation, and is evaluated on samples from the EHRSQL 2024 benchmark, where Claude Sonnet 4 reaches 94% accuracy and gpt-oss-20B reaches 93% on 100 answerable questions while gpt-oss-20B correctly abstains on 69% of 100 unanswerable questions; error analysis identifies temporal reasoning and ambiguous phrasing as primary failure modes.

Significance. If the empirical results hold, the work has clear significance for lowering barriers to clinical data analysis, enabling researchers without SQL expertise to explore MIMIC-IV while supporting privacy via local open-weights model deployment. The near-parity performance of the smaller gpt-oss-20B model and the explicit error analysis pinpointing temporal reasoning failures are particular strengths that strengthen the case for practical viability.

major comments (1)
  1. [Evaluation] Evaluation section: the central performance claims rest on 100-question samples drawn from the EHRSQL 2024 benchmark, yet the manuscript provides no direct evidence or distributional analysis showing that these samples capture the greater ambiguity, multi-table complexity, and domain-specific temporal constraints typical of real clinical researchers' queries on MIMIC-IV; this assumption is load-bearing for the practical-utility and local-deployment conclusions.
minor comments (2)
  1. [Abstract] The abstract and methods would benefit from a brief reference or one-sentence description of the Model Context Protocol for readers outside the immediate subfield.
  2. [Evaluation] Additional detail on the exact query validation logic and how the 100-question test sets were constructed and validated would strengthen reproducibility without altering the core claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the constructive comment on the evaluation. We address the point below and are prepared to revise the manuscript accordingly to strengthen the claims regarding practical utility.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the central performance claims rest on 100-question samples drawn from the EHRSQL 2024 benchmark, yet the manuscript provides no direct evidence or distributional analysis showing that these samples capture the greater ambiguity, multi-table complexity, and domain-specific temporal constraints typical of real clinical researchers' queries on MIMIC-IV; this assumption is load-bearing for the practical-utility and local-deployment conclusions.

    Authors: We agree that a more explicit distributional analysis would strengthen the manuscript. The EHRSQL 2024 benchmark was developed specifically to emulate realistic clinical researcher queries on MIMIC-style EHR data, incorporating temporal constraints, multi-table joins, and ambiguous phrasing drawn from actual clinical workflows. Our error analysis already identifies temporal reasoning and ambiguous phrasing as the dominant failure modes, which directly map to the referee's concerns. To address the gap, we will add a new paragraph and table in the Evaluation section that reports query complexity metrics (average number of tables joined, frequency of temporal operators such as date ranges and interval comparisons, and prevalence of ambiguous terms) for both the 100-question samples and the full EHRSQL 2024 test set. We will also cite prior studies on clinical query patterns to link these characteristics to real-world MIMIC-IV usage. This revision will make the representativeness argument explicit rather than implicit.

    Revision: yes
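Two of the complexity metrics proposed in the rebuttal, tables per query and the share of queries using temporal operators, could be approximated directly from the reference SQL. The sketch below is a heuristic illustration, not the authors' analysis: the regular expressions, the list of temporal function names, and the `complexity` and `summarize` helpers are all assumptions made here, and the two example queries are invented rather than taken from EHRSQL 2024.

```python
import re
from typing import Dict, List

# Table names are taken to be the identifier right after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)
# SQLite date/time functions and keywords, used as a proxy for temporal logic.
TEMPORAL = re.compile(
    r"\b(strftime|datetime|date|julianday|current_time|current_date|current_timestamp)\b",
    re.IGNORECASE,
)


def complexity(sql: str) -> Dict[str, int]:
    """Count distinct referenced tables and temporal tokens in one query."""
    tables = {name.lower() for name in TABLE_REF.findall(sql)}
    return {"n_tables": len(tables), "n_temporal": len(TEMPORAL.findall(sql))}


def summarize(queries: List[str]) -> Dict[str, float]:
    """Average table count and share of queries with any temporal token."""
    stats = [complexity(q) for q in queries]
    n = len(stats)
    return {
        "avg_tables": sum(s["n_tables"] for s in stats) / n,
        "share_temporal": sum(s["n_temporal"] > 0 for s in stats) / n,
    }


# Invented example queries for illustration.
queries = [
    "SELECT COUNT(*) FROM patients",
    "SELECT a.hadm_id FROM admissions a JOIN patients p "
    "ON a.subject_id = p.subject_id "
    "WHERE date(a.admittime) >= date('2100-01-01')",
]
summary = summarize(queries)
```

For these two queries the summary is an average of 1.5 tables and a temporal share of 0.5. Running the same function over the 100-question samples and the full test set would give the side-by-side comparison the referee asks for, though regex counting will miss subqueries in `FROM (...)` and any temporal logic expressed as plain string comparison.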

Circularity Check

0 steps flagged

No significant circularity; evaluation rests on external benchmark

full rationale

The paper describes an LLM-based system for natural-language querying of MIMIC-IV and reports direct empirical accuracy (93-94% on answerable EHRSQL 2024 samples, 69% correct abstention on unanswerable samples) without any mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations. All performance claims are obtained by running the models on an independently constructed external benchmark, so the results remain falsifiable outside the paper's own inputs or assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the reliability of LLM text-to-SQL translation for clinical queries and the representativeness of the EHRSQL benchmark samples; no free parameters are introduced, and the M3 system itself is the primary new entity without independent falsifiable evidence beyond the reported evaluation.

axioms (1)
  • domain assumption: Samples from the EHRSQL 2024 benchmark are representative of real clinical questions that researchers would ask of MIMIC-IV.
    The accuracy claims rest on evaluation against this benchmark; if the sample does not reflect typical usage, the reported performance may not generalize.
invented entities (1)
  • M3 system (no independent evidence)
    purpose: To enable natural language querying of MIMIC-IV with security and local deployment options
    The paper introduces M3 as a new integrated tool; no external falsifiable prediction (such as a specific new clinical finding) is provided to validate the entity independently.

pith-pipeline@v0.9.0 · 5841 in / 1451 out tokens · 58216 ms · 2026-05-22T00:01:15.024134+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Coding-Free and Privacy-Preserving Agentic Framework for Data-Driven Clinical Research

    cs.CL · 2026-04 · unverdicted · novelty 5.0

    CARIS is a new agentic LLM framework that automates clinical research workflows from planning to reporting in a coding-free and privacy-preserving manner, achieving high completeness scores on heterogeneous datasets.

  2. ClinQueryAgent: A Conversational Agent for Population Health Management

    cs.IR · 2026-04 · unverdicted · novelty 4.0

    The paper introduces ClinQueryAgent, a conversational agent that converts natural language queries into database queries for population health management while keeping patient data secure, and reports its use by 128 s...
