arxiv: 2605.20815 · v1 · pith:VFNZ4VDPnew · submitted 2026-05-20 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

Pith reviewed 2026-05-21 04:59 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords GraphRAG · local LLMs · EHR schema retrieval · consumer hardware · healthcare data · knowledge graph · retrieval augmented generation · hallucination

The pith

GraphRAG runs on consumer hardware with open-source LLMs of 7B parameters or larger for EHR schema retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether GraphRAG, which builds a knowledge graph to support structured reasoning over documents, can operate locally on everyday computers for pulling details from Electronic Health Record schemas. It deploys four open-source models on a single consumer GPU with 8 GB VRAM and compares indexing speed, graph quality, query response time, answer accuracy, and hallucination rates between global summarization and local retrieval modes. Results indicate that models at or above roughly 7 billion parameters complete the full pipeline with valid outputs while local retrieval yields faster and more grounded answers than global mode. The findings address privacy and cost concerns in healthcare by showing a path to avoid cloud services for regulated data.

Core claim

GraphRAG is feasible on consumer hardware with open-source LLMs of approximately 7B parameters or larger for EHR schema retrieval. Models below this threshold fail to produce valid structured outputs. Local retrieval consistently outperforms global summarization in latency and factual grounding, with reduced hallucination. Indexing quality and answer quality are decoupled across models.

What carries the argument

The Microsoft GraphRAG pipeline run locally via Ollama on consumer GPU hardware, with knowledge graph construction from real-world EHR schema documentation and comparison of global versus local retrieval modes for query answering.
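
As a rough illustration of what running a model "locally via Ollama" involves, the sketch below builds a request for Ollama's native `/api/generate` endpoint on its default local port and asks for JSON-constrained output. The summary does not say which endpoint, prompts, or model tags the authors used, so the tags in `ASSUMED_MODEL_TAGS`, the helper names, and the example prompt are assumptions for illustration only. The request is only constructed here; `generate` sends it and needs a running Ollama server.

```python
import json
import urllib.request

# Ollama's default local address; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Assumed Ollama tags for the four benchmarked models (not taken from the paper).
ASSUMED_MODEL_TAGS = {
    "Llama 3.1 (8B)": "llama3.1:8b",
    "Mistral (7B)": "mistral:7b",
    "Qwen 2.5 (7B)": "qwen2.5:7b",
    "Phi-4-mini (3.8B)": "phi4-mini",
}


def build_request(model_tag: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {
        "model": model_tag,
        "prompt": prompt,
        "stream": False,   # return one JSON object instead of a token stream
        "format": "json",  # ask Ollama to constrain the reply to valid JSON
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model_tag: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request to a local Ollama server and return the generated text."""
    request = build_request(model_tag, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Keeping request construction separate from sending makes it possible to inspect the payload without a GPU or a running server. All traffic stays on `localhost`, which is the property the privacy argument depends on.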

If this is right

  • Local retrieval mode reduces latency and hallucinations relative to global summarization.
  • Answer quality and indexing richness vary independently by model choice.
  • Models smaller than approximately 7B parameters cannot reliably finish the pipeline with valid structured outputs.
  • The approach supports privacy-compliant GraphRAG use in regulated healthcare settings without cloud dependence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local GraphRAG setup could be applied to schema retrieval in other regulated domains such as legal or financial documents.
  • Selecting retrieval mode by query complexity might yield further gains in speed and accuracy.
  • The decoupling of graph construction quality from final answer quality points to separate model optimizations for indexing versus querying stages.

Load-bearing premise

The selected real-world EHR schema documentation is representative of typical complex, regulated healthcare data, and manual scoring of answer quality gives a reliable, unbiased measure of performance.

What would settle it

A 3.8B-parameter model completing the full GraphRAG pipeline on the same EHR documentation and generating valid structured outputs for every test query would disprove the claimed capacity threshold around 7B parameters.

Figures

Figures reproduced from arXiv: 2605.20815 by Peter Fernandes, Ria Kanjilal.

Figure 1: GraphRAG system architecture. (a) Offline indexing: HTML schema files are chunked

Original abstract

Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, including Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition behavior. We further show that GraphRAG exhibits a practical capacity threshold, where models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.
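
The abstract reports "degenerate repetition behavior" for Mistral but this page does not say how it was detected. One simple way to flag such output, shown here as a minimal sketch and not as the authors' method, is to measure how many word n-grams in a response repeat an earlier n-gram. The function names, the n-gram size, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that repeat an earlier n-gram (0.0 means none)."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(count - 1 for count in counts.values())
    return repeated / len(ngrams)


def looks_degenerate(text: str, n: int = 3, threshold: float = 0.5) -> bool:
    """Flag a response whose repeated n-gram share exceeds an assumed threshold."""
    return repetition_ratio(text, n) > threshold
```

A response that loops over the same phrase scores close to 1.0, while ordinary prose usually scores near 0.0. The threshold would need tuning on real outputs before being used to score a benchmark.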

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper benchmarks GraphRAG implemented with four local open-source LLMs (Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, Phi-4-mini 3.8B) via Ollama on a single consumer GPU (8 GB VRAM) for EHR schema retrieval on real-world healthcare documentation. It measures indexing efficiency, knowledge-graph construction (e.g., 1,172 entities for Llama 3.1), query latency, answer quality (peak 3.3/5 for Qwen 2.5), and hallucination under global summarization versus local retrieval modes, concluding that models at or above ~7B parameters succeed, local retrieval is superior in latency and factual grounding, and smaller models fail on structured outputs.

Significance. If the empirical results prove robust, the work offers timely practical guidance for privacy-preserving, resource-constrained deployments of structured RAG in regulated domains such as healthcare. It supplies direct measurements of pipeline behavior on consumer hardware and highlights a model-capacity threshold together with the advantage of local retrieval, which could inform model selection and retrieval design choices without requiring cloud services.

major comments (2)
  1. [Results and Evaluation] Evaluation / Results: The reported answer quality of 3.3/5 and claims of reduced hallucination with local retrieval rest on manual scoring, yet the manuscript supplies neither the scoring rubric, the number of test queries, inter-annotator agreement, nor quantitative hallucination rates or statistical comparisons between modes. These omissions make it impossible to verify the asserted superiority in factual grounding.
  2. [Abstract and Model Evaluation] Abstract and § on model comparison: The capacity threshold claim (models below ~7B fail to produce valid structured outputs) is illustrated by Phi-4-mini’s failure, but without details on the exact output schema, prompt templates, or parsing logic used, the threshold cannot be assessed for generality or reproduced.
minor comments (2)
  1. A summary table aggregating entity counts, latency, quality scores, and failure modes across all four models and both retrieval modes would improve readability and allow direct comparison.
  2. The manuscript would benefit from explicit citation to the original Microsoft GraphRAG paper and from a brief description of the EHR schema’s scale (number of tables, relationships, regulatory constraints).
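
The statistics requested in the first major comment can be computed with very little code. The sketch below is an assumption about what such an analysis might look like, not the authors' procedure: Cohen's kappa for agreement between two annotators, a hallucination rate from per-answer flags, and an exact two-sided sign test for paired local-versus-global comparisons. All function names and inputs are hypothetical.

```python
from collections import Counter
from math import comb


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (e.g. 1-5 scores)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def hallucination_rate(flags):
    """Share of answers flagged as containing at least one unsupported claim."""
    if not flags:
        raise ValueError("need at least one answer")
    return sum(1 for flag in flags if flag) / len(flags)


def sign_test_p(wins, losses):
    """Exact two-sided sign test p-value; tied pairs are excluded beforehand."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, if local retrieval scored higher than global on 5 queries and lower on none, `sign_test_p(5, 0)` gives 0.0625, which shows how few queries it takes for a "consistent" advantage to fall short of conventional significance. This is one reason the number of test queries matters to the referee's objection.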

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of reproducibility and verifiability. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Results and Evaluation] Evaluation / Results: The reported answer quality of 3.3/5 and claims of reduced hallucination with local retrieval rest on manual scoring, yet the manuscript supplies neither the scoring rubric, the number of test queries, inter-annotator agreement, nor quantitative hallucination rates or statistical comparisons between modes. These omissions make it impossible to verify the asserted superiority in factual grounding.

    Authors: We agree that the current description of the manual evaluation lacks sufficient detail for independent verification. We will revise the Results and Evaluation sections to include the complete scoring rubric, the exact number of test queries used, clarification on annotator procedures (including any inter-annotator agreement metrics or justification for single-annotator design), quantitative hallucination rates, and statistical comparisons (e.g., significance tests) between global summarization and local retrieval modes. These additions will directly support the claims regarding factual grounding and reduced hallucination.

    Revision: yes

  2. Referee: [Abstract and Model Evaluation] Abstract and § on model comparison: The capacity threshold claim (models below ~7B fail to produce valid structured outputs) is illustrated by Phi-4-mini’s failure, but without details on the exact output schema, prompt templates, or parsing logic used, the threshold cannot be assessed for generality or reproduced.

    Authors: We acknowledge that greater transparency on implementation details is needed to allow readers to evaluate the generality of the observed capacity threshold. In the revised manuscript we will add the precise output schema required from the models, the full prompt templates for entity/relationship extraction and other pipeline stages, and the parsing/validation logic used to detect and handle invalid structured outputs. This will improve reproducibility and permit assessment of whether the ~7B threshold holds under alternative schemas or prompts.

    Revision: yes
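
To make "parsing/validation logic" concrete, here is a minimal sketch of a check that would reject malformed extraction output. The real output schema is not given on this page, so the JSON shape used here (an object with `entities` and `relationships` lists and the required keys listed below) is purely hypothetical, as are the function and constant names.

```python
import json
from typing import Tuple

# Hypothetical schema: the paper's actual required fields are not stated here.
REQUIRED_ENTITY_KEYS = {"name", "type", "description"}
REQUIRED_RELATION_KEYS = {"source", "target", "description"}


def validate_extraction(raw: str) -> Tuple[bool, str]:
    """Return (is_valid, reason) for one model response under the assumed schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top level is not an object"
    checks = (
        ("entities", REQUIRED_ENTITY_KEYS),
        ("relationships", REQUIRED_RELATION_KEYS),
    )
    for field, required in checks:
        items = data.get(field)
        if not isinstance(items, list):
            return False, f"'{field}' is missing or not a list"
        for index, item in enumerate(items):
            if not isinstance(item, dict):
                return False, f"{field}[{index}] is not an object"
            missing = required - item.keys()
            if missing:
                return False, f"{field}[{index}] is missing {sorted(missing)}"
    return True, "ok"
```

Logging the returned reason for each failed response would help distinguish a model that never emits parseable output from one that only occasionally drops a field. That distinction bears directly on whether the ~7B threshold reflects model capacity or this particular schema and prompt.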

Circularity Check

0 steps flagged

No significant circularity: pure empirical benchmarking with direct measurements

The paper conducts a direct experimental evaluation of the Microsoft GraphRAG pipeline on four open-source LLMs deployed locally via Ollama, reporting observed quantities such as entity counts (e.g., 1,172 for Llama 3.1), answer quality scores (3.3/5 for Qwen 2.5), latency differences between local and global retrieval modes, and failure thresholds for models below ~7B parameters. No mathematical derivations, equations, fitted parameters, or predictions are defined in terms of the study's own outputs. Claims rest on measured pipeline runs rather than self-referential reductions, self-citation chains, or imported uniqueness results. The work is self-contained against external benchmarks of model behavior and retrieval performance.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study rests on standard assumptions about data representativeness and evaluation validity rather than introducing new free parameters or postulated entities.

axioms (1)
  • domain assumption The chosen EHR schema documentation is representative of real-world complex healthcare data.
    Invoked when applying the pipeline to real-world EHR schema documentation for the benchmark.

pith-pipeline@v0.9.0 · 5872 in / 1318 out tokens · 44136 ms · 2026-05-21T04:59:38.710071+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/Cost.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models... local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 9 internal anchors

  1. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in neural information processing systems, 33:9459–9474, 2020.

  2. Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.

  3. Vincent A Traag, Ludo Waltman, and Nees Jan Van Eck. From Louvain to Leiden: guaranteeing well-connected communities. Scientific reports, 9(1):5233, 2019.

  4. Alon Y Halevy, Naveen Ashish, Dina Bitton, Michael Carey, Denise Draper, Jeff Pollock, Arnon Rosenthal, and Vishal Sikka. Enterprise information integration: successes, challenges and controversies. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 778–787, 2005.

  5. Michael Stonebraker, Samuel Madden, Daniel J Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. The end of an architectural era: it's time for a complete rewrite. In Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker, pages 463–489. 2018.

  6. Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2(1):32, 2023.

  7. George Hripcsak, Jon D Duke, Nigam H Shah, Christian G Reich, Vojtech Huser, Martijn J Schuemie, Marc A Suchard, Rae Woong Park, Ian Chi Kei Wong, Peter R Rijnbeek, et al. Observational health data sciences and informatics (OHDSI): opportunities for observational researchers. Studies in health technology and informatics, 216:574, 2015.

  8. Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.

  9. Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  10. Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2):1–124, 2023.

  11. W Nicholson Price and I Glenn Cohen. Privacy in the age of medical big data. Nature medicine, 25(1):37–43, 2019.

  12. David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.

  13. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  14. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  15. Yihang Jiang, Xiaoyang Li, Guangxu Zhu, Hang Li, Jing Deng, Kaifeng Han, Chao Shen, Qingjiang Shi, and Rui Zhang. 6G non-terrestrial networks enabled low-altitude economy: Opportunities and challenges. arXiv preprint arXiv:2311.09047, 2023.

  16. Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.

  17. Feiyang Cai, Jiahui Bai, Tao Tang, Guijuan He, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, and Feng Luo. Mollangbench: A comprehensive benchmark for language-prompted molecular structure recognition, editing, and generation. arXiv preprint arXiv:2505.15054, 2025.

  18. Haoyu Han, Li Ma, Yu Wang, Harry Shomer, Yongjia Lei, Zhisheng Qi, Kai Guo, Zhigang Hua, Bo Long, Hui Liu, et al. RAG vs. GraphRAG: A systematic evaluation and key insights. arXiv preprint arXiv:2502.11371, 2025.

  19. Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering, 36(7):3580–3599, 2024.

  20. Zach Nussbaum, John X Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic embed: Training a reproducible long context text embedder. arXiv preprint arXiv:2402.01613, 2024.

  21. Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023.

A Pipeline Configuration Details

The GraphRAG configuration used across all experiments sets chunk size to 512 tokens with 256-token ...