GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval
Pith reviewed 2026-05-21 04:59 UTC · model grok-4.3
The pith
GraphRAG runs on consumer hardware with open-source LLMs of 7B parameters or larger for EHR schema retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GraphRAG is feasible on consumer hardware with open-source LLMs of approximately 7B parameters or larger for EHR schema retrieval. Models below this threshold fail to produce valid structured outputs. Indexing richness and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in latency and factual grounding, with reduced hallucination.
What carries the argument
The Microsoft GraphRAG pipeline run locally via Ollama on consumer GPU hardware, with knowledge graph construction from real-world EHR schema documentation and comparison of global versus local retrieval modes for query answering.
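As a rough illustration of what "run locally via Ollama" involves at the request level, the sketch below builds a non-streaming call to Ollama's `/api/generate` endpoint with JSON-constrained output. The default URL, the model tag in the usage note, and the helper names `build_payload` and `generate` are assumptions for illustration; how the authors actually wired GraphRAG to Ollama is not described on this page.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint; adjust if the server is configured differently.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming generate request that asks for JSON output."""
    return {
        "model": model,    # model tag as known to the local Ollama install
        "prompt": prompt,
        "stream": False,   # return one complete response object
        "format": "json",  # ask Ollama to constrain the reply to valid JSON
    }


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request to a locally running Ollama server and return the reply text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With a server running, a call such as `generate("qwen2.5:7b", "List the tables in the schema.")` would return the model's text; the tag name is an assumption and depends on what has been pulled locally.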
If this is right
- Local retrieval mode reduces latency and hallucinations relative to global summarization.
- Answer quality and indexing richness vary independently by model choice.
- Models smaller than approximately 7B parameters cannot reliably finish the pipeline with valid structured outputs.
- The approach supports privacy-compliant GraphRAG use in regulated healthcare settings without cloud dependence.
Where Pith is reading between the lines
- The same local GraphRAG setup could be applied to schema retrieval in other regulated domains such as legal or financial documents.
- Selecting retrieval mode by query complexity might yield further gains in speed and accuracy.
- The decoupling of graph construction quality from final answer quality points to separate model optimizations for indexing versus querying stages.
Load-bearing premise
The selected real-world EHR schema documentation represents typical complex, regulated healthcare data, and manual scoring of answer quality gives a reliable, unbiased performance measure.
What would settle it
A 3.8B-parameter model completing the full GraphRAG pipeline on the same EHR documentation and generating valid structured outputs for every test query would disprove the claimed capacity threshold around 7B parameters.
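The test described above reduces to a pass/fail check on each model output. A minimal sketch follows, assuming the pipeline expects a JSON object with list-valued `entities` and `relationships` keys. That schema is hypothetical: the paper's actual output format and parsing logic are not given here, which is exactly what the referee report flags.

```python
import json

# Hypothetical schema: the output format actually required by the pipeline is not specified here.
REQUIRED_KEYS = {"entities", "relationships"}


def is_valid_output(raw: str) -> bool:
    """Return True if raw parses as a JSON object whose required keys hold lists."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(isinstance(parsed.get(key), list) for key in REQUIRED_KEYS)


def pass_rate(outputs: list[str]) -> float:
    """Fraction of outputs that validate; 1.0 means every query produced valid output."""
    if not outputs:
        raise ValueError("no outputs to score")
    return sum(is_valid_output(o) for o in outputs) / len(outputs)
```

Under this framing, the threshold claim would be contradicted by a sub-7B model reaching a pass rate of 1.0 over the full query set.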
Original abstract
Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, including Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition behavior. We further show that GraphRAG exhibits a practical capacity threshold, where models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.
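The abstract reports "degenerate repetition behavior" for Mistral without saying how it was detected. One simple way to flag such outputs automatically is to measure how many word n-grams repeat an earlier n-gram. The sketch below is an assumption about how this could be done, not the authors' method; the n-gram size and the 0.5 threshold are arbitrary illustrative choices.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-grams in text that duplicate an earlier n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first of a given n-gram counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def looks_degenerate(text: str, n: int = 4, threshold: float = 0.5) -> bool:
    """Flag text whose repeated n-gram fraction meets the (arbitrary) threshold."""
    return repeated_ngram_fraction(text, n) >= threshold
```

A check like this would let repetition be reported as a rate per model rather than as a qualitative observation.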
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks GraphRAG implemented with four local open-source LLMs (Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, Phi-4-mini 3.8B) via Ollama on a single consumer GPU (8 GB VRAM) for EHR schema retrieval on real-world healthcare documentation. It measures indexing efficiency, knowledge-graph construction (e.g., 1,172 entities for Llama 3.1), query latency, answer quality (peak 3.3/5 for Qwen 2.5), and hallucination under global summarization versus local retrieval modes, concluding that models at or above ~7B parameters succeed, local retrieval is superior in latency and factual grounding, and smaller models fail on structured outputs.
Significance. If the empirical results prove robust, the work offers timely practical guidance for privacy-preserving, resource-constrained deployments of structured RAG in regulated domains such as healthcare. It supplies direct measurements of pipeline behavior on consumer hardware and highlights a model-capacity threshold together with the advantage of local retrieval, which could inform model selection and retrieval design choices without requiring cloud services.
major comments (2)
- [Results and Evaluation] Evaluation / Results: The reported answer quality of 3.3/5 and claims of reduced hallucination with local retrieval rest on manual scoring, yet the manuscript supplies neither the scoring rubric, the number of test queries, inter-annotator agreement, nor quantitative hallucination rates or statistical comparisons between modes. These omissions make it impossible to verify the asserted superiority in factual grounding.
- [Abstract and Model Evaluation] Abstract and § on model comparison: The capacity threshold claim (models below ~7B fail to produce valid structured outputs) is illustrated by Phi-4-mini’s failure, but without details on the exact output schema, prompt templates, or parsing logic used, the threshold cannot be assessed for generality or reproduced.
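The inter-annotator agreement requested in the first comment is commonly reported as Cohen's kappa. The sketch below computes the unweighted form for two annotators who scored the same answers on the 1-to-5 scale; the two-annotator setup is an assumption, since the manuscript's annotation procedure is not described.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both raters use one identical label; return 1.0 by convention here.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Because the scores are ordinal, a weighted kappa that penalizes a 1-versus-5 disagreement more than a 3-versus-4 one may be the better choice; the unweighted version is shown only because it is the simplest to verify.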
minor comments (2)
- A summary table aggregating entity counts, latency, quality scores, and failure modes across all four models and both retrieval modes would improve readability and allow direct comparison.
- The manuscript would benefit from explicit citation to the original Microsoft GraphRAG paper and from a brief description of the EHR schema’s scale (number of tables, relationships, regulatory constraints).
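The summary table suggested in the first minor comment could be generated from a small record per model. The sketch below fills in only the figures stated in the abstract and marks everything else as not reported ("n/r"); the column names and the `render_table` helper are illustrative assumptions, and per-mode latency columns are omitted because those numbers do not appear on this page.

```python
# Only values stated in the abstract are filled in; None marks figures not reported there.
ROWS = [
    {"model": "Llama 3.1", "params": "8B", "entities": 1172, "quality": None, "failure": None},
    {"model": "Mistral", "params": "7B", "entities": None, "quality": None,
     "failure": "degenerate repetition"},
    {"model": "Qwen 2.5", "params": "7B", "entities": None, "quality": 3.3, "failure": None},
    {"model": "Phi-4-mini", "params": "3.8B", "entities": None, "quality": None,
     "failure": "structured-output errors"},
]
COLUMNS = ["model", "params", "entities", "quality", "failure"]


def render_table(rows: list[dict], columns: list[str]) -> str:
    """Render rows as an aligned plain-text table, showing None as 'n/r'."""
    cells = [["n/r" if row[c] is None else str(row[c]) for c in columns] for row in rows]
    widths = [max([len(c)] + [len(line[i]) for line in cells]) for i, c in enumerate(columns)]

    def fmt(values: list[str]) -> str:
        return " | ".join(v.ljust(w) for v, w in zip(values, widths)).rstrip()

    header = [fmt(columns), "-+-".join("-" * w for w in widths)]
    return "\n".join(header + [fmt(line) for line in cells])


if __name__ == "__main__":
    print(render_table(ROWS, COLUMNS))
```

The number of "n/r" cells in the output is itself a quick indication of how much of the comparison the abstract leaves unstated.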
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of reproducibility and verifiability. We address each major comment below and indicate planned revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Results and Evaluation] Evaluation / Results: The reported answer quality of 3.3/5 and claims of reduced hallucination with local retrieval rest on manual scoring, yet the manuscript supplies neither the scoring rubric, the number of test queries, inter-annotator agreement, nor quantitative hallucination rates or statistical comparisons between modes. These omissions make it impossible to verify the asserted superiority in factual grounding.
  Authors: We agree that the current description of the manual evaluation lacks sufficient detail for independent verification. We will revise the Results and Evaluation sections to include the complete scoring rubric, the exact number of test queries used, clarification on annotator procedures (including any inter-annotator agreement metrics or justification for single-annotator design), quantitative hallucination rates, and statistical comparisons (e.g., significance tests) between global summarization and local retrieval modes. These additions will directly support the claims regarding factual grounding and reduced hallucination.
  Revision: yes
- Referee: [Abstract and Model Evaluation] Abstract and § on model comparison: The capacity threshold claim (models below ~7B fail to produce valid structured outputs) is illustrated by Phi-4-mini’s failure, but without details on the exact output schema, prompt templates, or parsing logic used, the threshold cannot be assessed for generality or reproduced.
  Authors: We acknowledge that greater transparency on implementation details is needed to allow readers to evaluate the generality of the observed capacity threshold. In the revised manuscript we will add the precise output schema required from the models, the full prompt templates for entity/relationship extraction and other pipeline stages, and the parsing/validation logic used to detect and handle invalid structured outputs. This will improve reproducibility and permit assessment of whether the ~7B threshold holds under alternative schemas or prompts.
  Revision: yes
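The statistical comparison between retrieval modes promised in the first response could take the form of an exact McNemar test on paired per-query hallucination flags. The sketch below assumes each test query is answered in both modes and a reviewer marks each answer as hallucinated or not; that design, and the choice of test, are assumptions rather than anything the authors have committed to.

```python
from math import comb


def hallucination_rate(flags: list[bool]) -> float:
    """Share of answers flagged as containing a hallucination."""
    if not flags:
        raise ValueError("no answers to score")
    return sum(flags) / len(flags)


def exact_mcnemar_p(global_flags: list[bool], local_flags: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-query hallucination flags."""
    if len(global_flags) != len(local_flags):
        raise ValueError("flags must be paired per query")
    # Discordant pairs: queries where exactly one mode hallucinated.
    only_global = sum(g and not l for g, l in zip(global_flags, local_flags))
    only_local = sum(l and not g for g, l in zip(global_flags, local_flags))
    n = only_global + only_local
    if n == 0:
        return 1.0  # the modes never disagree, so there is no evidence of a difference
    k = min(only_global, only_local)
    # Binomial(n, 0.5) tail probability of a split at least this uneven, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

As a purely synthetic example, if global mode hallucinated on 6 of 10 queries and local mode on none, the function returns 0.03125. The test uses only the queries on which the modes disagree, so a small query set can leave even a large apparent gap statistically inconclusive.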
Circularity Check
No significant circularity: pure empirical benchmarking with direct measurements
Full rationale
The paper conducts a direct experimental evaluation of the Microsoft GraphRAG pipeline on four open-source LLMs deployed locally via Ollama, reporting observed quantities such as entity counts (e.g., 1,172 for Llama 3.1), answer quality scores (3.3/5 for Qwen 2.5), latency differences between local and global retrieval modes, and failure thresholds for models below ~7B parameters. No mathematical derivations, equations, fitted parameters, or predictions are defined in terms of the study's own outputs. Claims rest on measured pipeline runs rather than self-referential reductions, self-citation chains, or imported uniqueness results. The work is self-contained against external benchmarks of model behavior and retrieval performance.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The chosen EHR schema documentation is representative of real-world complex healthcare data.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models... local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix A: Pipeline Configuration Details
The GraphRAG configuration used across all experiments sets chunk size to 512 tokens with 256-token ...