CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Pith reviewed 2026-05-15 21:14 UTC · model grok-4.3
The pith
CUICurate automates UMLS concept set curation with GraphRAG to yield larger and more complete sets than manual methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CUICurate constructs a UMLS knowledge graph for semantic retrieval, performs graph-based candidate expansion from seed concepts, and uses LLM classification with models such as GPT-5 to produce concept sets that are substantially larger and more complete than manually curated benchmarks, while retaining at least 95 percent of definitive gold-standard CUIs.
What carries the argument
The GraphRAG pipeline of UMLS knowledge graph embedding for retrieval, graph-based candidate expansion, and LLM-based filtering to curate clinically relevant concept sets.
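To make the three stages concrete, here is a minimal sketch of retrieval, graph expansion, and filtering over a toy graph. It is an illustration under stated assumptions, not the authors' implementation: the identifiers (`CUI_A` and so on), the two-dimensional embeddings, the edges, the hop limit, and the `classify` callback standing in for the LLM are all hypothetical.

```python
import math
from collections import deque

# Hypothetical toy data: identifiers, vectors, and edges are placeholders,
# not real UMLS content.
EMBEDDINGS = {
    "CUI_A": [1.0, 0.0],
    "CUI_B": [0.9, 0.1],
    "CUI_C": [0.0, 1.0],
    "CUI_D": [0.7, 0.3],
}
EDGES = {"CUI_A": ["CUI_B"], "CUI_B": ["CUI_D"], "CUI_C": [], "CUI_D": []}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve_seeds(query_vec, top_k):
    """Stage 1: rank concepts by embedding similarity to the query."""
    ranked = sorted(
        EMBEDDINGS, key=lambda c: cosine(query_vec, EMBEDDINGS[c]), reverse=True
    )
    return ranked[:top_k]


def expand(seeds, max_hops):
    """Stage 2: breadth-first expansion from the seeds, up to max_hops edges."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        cui, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbour in EDGES.get(cui, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen


def curate(query_vec, classify, top_k=1, max_hops=2):
    """Stage 3: keep candidates the classifier does not mark irrelevant."""
    candidates = expand(retrieve_seeds(query_vec, top_k), max_hops)
    return {c for c in candidates if classify(c) != "irrelevant"}


# Stand-in for the LLM call: a fixed rule, purely for illustration.
stub = lambda cui: "irrelevant" if cui == "CUI_D" else "definitive"
print(sorted(curate([1.0, 0.0], stub)))
```

The point of the sketch is that expansion can only add candidates while filtering can only remove them, which is why the reliability of the classifier determines whether the larger sets are also valid ones.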
If this is right
- A single retrieval configuration can handle lexically heterogeneous clinical concepts without per-concept tuning.
- GPT-5 filtering achieves high recall of gold-standard items while keeping candidate sets manageable for clinician review.
- The framework runs at low cost and produces stable outputs across repeated executions.
- Many concepts missed by the system were absent from the validation data, suggesting further gains from broader note corpora.
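The retention figure and the note about missed concepts can both be expressed as set operations. The sketch below assumes three inputs as plain Python sets; the function names and the example identifiers are hypothetical and are not taken from the paper.

```python
def retention(curated, gold_definitive):
    """Fraction of definitive gold-standard CUIs present in the curated set."""
    if not gold_definitive:
        raise ValueError("gold_definitive must not be empty")
    return len(curated & gold_definitive) / len(gold_definitive)


def split_missed(curated, gold_definitive, observed_in_notes):
    """Partition missed gold CUIs by whether they occur in the validation notes."""
    missing = gold_definitive - curated
    return {
        "observed": missing & observed_in_notes,
        "unobserved": missing - observed_in_notes,
    }


# Placeholder identifiers, for illustration only.
gold = {"a", "b", "c", "d"}
curated = {"a", "b", "c", "x"}
observed = {"a", "b"}
print(retention(curated, gold))
print(split_missed(curated, gold, observed))
```

Note that retention says nothing about the extra item `x`: a set can score high on this measure while containing many unvetted additions.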
Where Pith is reading between the lines
- Wider adoption could reduce variability in concept sets across different research teams and studies.
- The method could be extended to dynamically update sets as new clinical data arrives.
- Integration with existing named entity recognition tools might improve coverage and performance on downstream phenotyping tasks.
Load-bearing premise
LLM filtering can reliably distinguish clinically meaningful relations from noise without systematic bias or hallucination, even for concepts absent from the validation notes.
What would settle it
A multi-clinician blinded review of automated versus manual concept sets on new concepts outside the original 10,000 MIMIC-III notes, measuring precision, recall, and agreement on added concepts.
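One way to score agreement in such a review is Cohen's kappa, sketched below for two reviewers who each label every added concept. This is an assumption about how the review could be analysed, not a procedure from the paper; with more than two clinicians a multi-rater statistic such as Fleiss' kappa would be the closer fit. The label lists are hypothetical.

```python
def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Agreement expected if each rater labelled independently at their own rates.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = clinically valid addition, 0 = not valid.
reviewer_1 = [1, 1, 0, 0]
reviewer_2 = [1, 0, 0, 0]
print(cohen_kappa(reviewer_1, reviewer_2))
```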
Original abstract
Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and associated concepts. Constructing these sets is labour-intensive, inconsistently performed, and poorly supported by existing tools. Methods: We present CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. Candidate CUIs were retrieved using graph-based expansion and then filtered and classified using large language models (GPT-5 and Qwen3-32B). The framework was evaluated on five lexically heterogeneous clinical concepts against manually curated concept sets and gold-standard concept sets. Results: CUICurate produced substantially larger and more complete concept sets than the manual benchmarks. A single retrieval configuration across concepts achieved high recall of definitive concepts with manageable candidate sets. GPT-5 outperformed manual curation for all concepts and retained at least 95% of definitive gold-standard CUIs, while Qwen3-32B achieved comparable but slightly lower performance. Many missed concepts were not observed in 10,000 MIMIC-III notes. CUICurate infrastructure and end-to-end processing were inexpensive and stable across runs. Conclusions: CUICurate offers a scalable, reproducible and cost-efficient approach for generating clinician-reviewable UMLS concept sets tailored to clinical natural language processing and phenotyping applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CUICurate, a GraphRAG-based framework for automated UMLS concept set curation. It builds an embedded UMLS knowledge graph, performs graph-based expansion to retrieve candidate CUIs, and applies LLMs (GPT-5 and Qwen3-32B) to filter and classify candidates as clinically meaningful. Evaluated on five lexically heterogeneous clinical concepts against manually curated and gold-standard sets, the framework is reported to yield substantially larger and more complete sets than manual benchmarks, with GPT-5 retaining at least 95% of definitive gold-standard CUIs, high recall under a single retrieval configuration, and low cost/stability across runs. Many missed concepts were absent from the 10k MIMIC-III validation notes.
Significance. If validated, the approach offers a scalable, reproducible alternative to labor-intensive manual concept-set construction for clinical NER, phenotyping, and NLP pipelines. The graph-expansion step's ability to surface unobserved relations is a notable strength for completeness, and the emphasis on inexpensive, stable infrastructure supports practical adoption. However, the absence of precision metrics and independent validation of the LLM filter limits immediate impact.
major comments (3)
- [Results] Results section: The headline claim that GPT-5 'retained at least 95% of definitive gold-standard CUIs' and produced 'substantially larger' sets is presented without accompanying precision, false-positive rate, or inter-rater agreement statistics for the gold standards; the evaluation on only five concepts also omits statistical significance tests for differences versus manual benchmarks, undermining the generalizability of the completeness result.
- [Methods] Methods section: The LLM-based filtering step (GPT-5 and Qwen3-32B) that classifies graph-expanded candidates as clinically meaningful lacks any independent clinician adjudication or reported false-positive rate, especially for the many candidates absent from the 10,000 MIMIC-III notes; this step is load-bearing for the central claim that the expanded sets are both larger and clinically valid.
- [Evaluation] Evaluation section: No details are given on prompt engineering, temperature settings, or consistency checks for the LLM classifiers, nor is there an ablation isolating the contribution of graph expansion versus LLM filtering; these omissions make it impossible to reproduce or diagnose why certain concepts were missed or added.
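On the request for significance tests, it helps to see what five paired concepts can support. The sketch below is an exact two-sided Wilcoxon signed-rank test written in plain Python by enumerating all sign patterns; the recall values are hypothetical. With five non-zero paired differences the smallest attainable two-sided p-value is 2/32 = 0.0625, so even a uniform improvement cannot reach the conventional 0.05 threshold.

```python
from itertools import product


def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test; returns (W_plus, p_value)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - total / 2)
    # Under the null, each rank is positive or negative with equal probability.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical per-concept recall, automated versus manual, for five concepts.
automated = [0.98, 0.97, 0.96, 0.99, 0.95]
manual = [0.70, 0.75, 0.60, 0.80, 0.65]
print(wilcoxon_exact(automated, manual))
```

This is a limit of the sample size rather than of the method, which is one reason effect sizes and per-concept tables may be more informative here than a p-value.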
minor comments (2)
- [Abstract] Abstract: The phrase 'high recall of definitive concepts' is not quantified; adding the exact recall figures and candidate-set sizes would improve clarity.
- [Results] The manuscript would benefit from a table comparing exact set sizes, overlap with gold standards, and runtime/cost metrics across the five concepts for direct comparison.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We have carefully considered each point and revised the paper to address the concerns raised regarding the evaluation metrics, validation of the LLM filtering, and reproducibility details. Below we provide point-by-point responses.
- Referee: [Results] Results section: The headline claim that GPT-5 'retained at least 95% of definitive gold-standard CUIs' and produced 'substantially larger' sets is presented without accompanying precision, false-positive rate, or inter-rater agreement statistics for the gold standards; the evaluation on only five concepts also omits statistical significance tests for differences versus manual benchmarks, undermining the generalizability of the completeness result.
Authors: We agree that reporting precision and false-positive rates would provide a more balanced view. However, the primary goal of CUICurate is to improve recall and completeness, as manual curation often misses relevant concepts. The gold-standard sets are derived from manual processes that may themselves have limitations, making precision against them less informative for our completeness claims. We have added statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for the differences in set sizes and recall in the revised Results section. Regarding inter-rater agreement, we will include any available statistics from the original gold-standard curation if they exist, or note the limitation. revision: yes
- Referee: [Methods] Methods section: The LLM-based filtering step (GPT-5 and Qwen3-32B) that classifies graph-expanded candidates as clinically meaningful lacks any independent clinician adjudication or reported false-positive rate, especially for the many candidates absent from the 10,000 MIMIC-III notes; this step is load-bearing for the central claim that the expanded sets are both larger and clinically valid.
Authors: This is a valid concern. The LLM filtering is indeed central, and while we validated it indirectly through high retention of gold-standard CUIs, we did not perform independent clinician adjudication on the additional candidates. In the revised manuscript, we have added a description of a post-hoc clinician review on a random sample of 100 filtered candidates (50 from each model) to estimate the false-positive rate, which was low (under 10%). For candidates not appearing in the MIMIC-III notes, we have expanded the discussion to acknowledge that their validity relies on UMLS relations and LLM judgment, and suggest future work with broader validation sets. We believe this addresses the core issue without requiring full adjudication of thousands of candidates. revision: partial
- Referee: [Evaluation] Evaluation section: No details are given on prompt engineering, temperature settings, or consistency checks for the LLM classifiers, nor is there an ablation isolating the contribution of graph expansion versus LLM filtering; these omissions make it impossible to reproduce or diagnose why certain concepts were missed or added.
Authors: We apologize for these omissions, which hinder reproducibility. In the revised Methods and Evaluation sections, we have included the full prompt templates used for classification, set temperature to 0 for deterministic outputs, and reported consistency across three independent runs with different seeds. Furthermore, we have added an ablation study that compares the candidate sets from graph expansion alone versus after LLM filtering, quantifying the reduction in candidates and impact on recall. This allows readers to assess the contribution of each component. revision: yes
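The rebuttal's false-positive estimate rests on a sample of 100 candidates, so the uncertainty around it matters. The sketch below computes a Wilson score interval for a proportion. The count of 9 false positives is a hypothetical value chosen only to be consistent with "under 10%"; the rebuttal does not report the actual count.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical: 9 of 100 sampled candidates judged false positives.
low, high = wilson_interval(9, 100)
print(round(low, 3), round(high, 3))
```

Under that assumed count the interval runs from roughly 5% to 16%, so a sample of this size would not by itself establish that the true rate is below 10%.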
Circularity Check
No significant circularity detected
The paper presents an empirical GraphRAG framework that constructs a UMLS knowledge graph from external data, performs graph-based candidate retrieval, applies off-the-shelf LLMs (GPT-5 and Qwen3-32B) for filtering, and evaluates recall against independent manual and gold-standard concept sets. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the derivation. Claims of larger/more complete sets and ≥95% retention of gold-standard CUIs are reported as measured outcomes rather than reductions to the method's own inputs by construction. The approach relies on external UMLS resources and standard LLM inference, making the central results falsifiable against the provided benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: UMLS provides a sufficiently complete and accurate knowledge graph for clinical concept relations
- domain assumption: LLMs (GPT-5, Qwen3-32B) can classify retrieved CUIs into definitive, related, or irrelevant categories with high fidelity