CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications

Blanca Gallego; Jamie Novak; Mathew Miller; Sze-yuan Ooi; Victoria Blake

arxiv: 2602.17949 · v2 · pith:FZBVN7JBnew · submitted 2026-02-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-15 21:14 UTC · model grok-4.3

keywords UMLS concept curation · GraphRAG · clinical NLP · automated curation · knowledge graph · LLM filtering · phenotyping · concept sets

The pith

CUICurate automates UMLS concept set curation with GraphRAG to yield larger and more complete sets than manual methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CUICurate as a GraphRAG framework that builds an embedded UMLS knowledge graph, retrieves and expands candidate concepts through graph methods, then applies LLM filtering to assemble sets of synonyms, subtypes, and related terms. It aims to replace labor-intensive manual curation for clinical NLP tasks such as phenotyping. Evaluation on five lexically varied concepts showed the automated sets were substantially larger than manual benchmarks while retaining at least 95 percent of definitive gold-standard CUIs, with GPT-5 performing best. A reader would care because manual curation is slow, inconsistent, and scales poorly, limiting reliable concept coverage in downstream clinical applications.

Core claim

CUICurate constructs a UMLS knowledge graph for semantic retrieval, performs graph-based candidate expansion from seed concepts, and uses LLM classification with models such as GPT-5 to produce concept sets that are substantially larger and more complete than manually curated benchmarks, while retaining at least 95 percent of definitive gold-standard CUIs.

What carries the argument

The GraphRAG pipeline of UMLS knowledge graph embedding for retrieval, graph-based candidate expansion, and LLM-based filtering to curate clinically relevant concept sets.
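
The three stages named above (embedding retrieval, graph expansion, LLM filtering) can be sketched end to end. This is a minimal illustration under stated assumptions, not the authors' implementation: the CUI identifiers, vectors, edges, and the `STUB_LABELS` lookup that stands in for the LLM call are all invented, and brute-force cosine search stands in for whatever vector index the paper uses.

```python
import math

# Toy stand-in for an embedded UMLS graph. Every identifier, vector and edge
# below is invented for illustration; none of it is real UMLS content.
EMBEDDINGS = {
    "C_HF": [0.9, 0.1, 0.0],          # "heart failure"
    "C_CHF": [0.85, 0.15, 0.05],      # "congestive heart failure"
    "C_LVF": [0.7, 0.3, 0.1],         # "left ventricular failure"
    "C_EDEMA": [0.4, 0.6, 0.2],       # "peripheral edema"
    "C_FRACTURE": [0.0, 0.1, 0.95],   # "hip fracture"
}
EDGES = {
    "C_HF": ["C_CHF", "C_LVF"],
    "C_CHF": ["C_EDEMA"],
    "C_LVF": [],
    "C_EDEMA": [],
    "C_FRACTURE": [],
}
# Hypothetical classifier output; a real system would call an LLM here.
STUB_LABELS = {
    "C_HF": "definitive",
    "C_CHF": "definitive",
    "C_LVF": "related",
    "C_EDEMA": "irrelevant",
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_seeds(query_vec, top_k=2):
    """Stage 1: rank all concepts by similarity to the query embedding."""
    ranked = sorted(
        EMBEDDINGS, key=lambda c: cosine(query_vec, EMBEDDINGS[c]), reverse=True
    )
    return ranked[:top_k]


def expand(seeds, hops=1):
    """Stage 2: add graph neighbours of the seeds, up to `hops` steps away."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        frontier = {n for c in frontier for n in EDGES.get(c, [])} - seen
        seen |= frontier
    return seen


def filter_candidates(candidates, classify):
    """Stage 3: keep candidates the classifier does not mark irrelevant."""
    return sorted(c for c in candidates if classify(c) != "irrelevant")


def curate(query_vec):
    seeds = retrieve_seeds(query_vec)
    candidates = expand(seeds)
    return filter_candidates(candidates, lambda c: STUB_LABELS.get(c, "irrelevant"))


# Pretend embedding of the phrase "heart failure".
print(curate([1.0, 0.0, 0.0]))
```

Keeping the classifier as a plain callable means the stub can be swapped for a real model call without touching retrieval or expansion, which is also what an ablation of the two components would need.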

If this is right

A single retrieval configuration can handle lexically heterogeneous clinical concepts without per-concept tuning.
GPT-5 filtering achieves high recall of gold-standard items while keeping candidate sets manageable for clinician review.
The framework runs at low cost and produces stable outputs across repeated executions.
Many concepts missed by the system were absent from the validation data, suggesting further gains from broader note corpora.
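
Stability across repeated executions can be quantified as set overlap between runs. The sketch below assumes each run yields a set of CUIs and uses the mean pairwise Jaccard index, which is one reasonable choice and not necessarily the measure the paper reports; the CUI strings are placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty outputs are identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard index over every pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical runs of the same concept; the third drops one CUI.
runs = [{"C1", "C2", "C3"}, {"C1", "C2", "C3"}, {"C1", "C2"}]
print(round(mean_pairwise_jaccard(runs), 3))
```

A value of 1.0 means every run returned the same set; lower values show how much of the output depends on sampling noise in the LLM step.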

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Wider adoption could reduce variability in concept sets across different research teams and studies.
The method could be extended to dynamically update sets as new clinical data arrives.
Integration with existing named entity recognition tools might improve coverage and performance on downstream phenotyping tasks.

Load-bearing premise

LLM filtering can reliably distinguish clinically meaningful relations from noise without systematic bias or hallucination, even for concepts absent from the validation notes.

What would settle it

A multi-clinician blinded review of automated versus manual concept sets on new concepts outside the original 10,000 MIMIC-III notes, measuring precision, recall, and agreement on added concepts.
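
The quantities such a review would report can be computed as follows. This is a sketch under assumptions: the reference set is taken to be whatever the adjudicating clinicians accept, agreement is shown with Cohen's kappa for two raters only (more raters would call for a multi-rater statistic such as Fleiss' kappa), and all labels and CUI strings are invented.

```python
def precision_recall(predicted, reference):
    """Precision and recall of a predicted concept set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    labels_a, labels_b = list(labels_a), list(labels_b)
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("raters must label the same non-empty list of items")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Agreement expected if both raters labelled at random with their own rates.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical example: 1 = "belongs in the concept set", 0 = "does not".
rater_1 = [1, 1, 0, 0]
rater_2 = [1, 0, 0, 0]
print(cohens_kappa(rater_1, rater_2))
print(precision_recall({"C1", "C2", "C3", "C4"}, {"C1", "C2", "C5"}))
```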

read the original abstract

Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and associated concepts. Constructing these sets is labour-intensive, inconsistently performed, and poorly supported by existing tools. Methods: We present CUICurate, a graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. Candidate CUIs were retrieved using graph-based expansion and then filtered and classified using large language models (GPT-5 and Qwen3-32B). The framework was evaluated on five lexically heterogeneous clinical concepts against manually curated concept sets and gold-standard concept sets. Results: CUICurate produced substantially larger and more complete concept sets than the manual benchmarks. A single retrieval configuration across concepts achieved high recall of definitive concepts with manageable candidate sets. GPT-5 outperformed manual curation for all concepts and retained at least 95% of definitive gold-standard CUIs, while Qwen3-32B achieved comparable but slightly lower performance. Many missed concepts were not observed in 10,000 MIMIC-III notes. CUICurate infrastructure and end-to-end processing were inexpensive and stable across runs. Conclusions: CUICurate offers a scalable, reproducible and cost-efficient approach for generating clinician-reviewable UMLS concept sets tailored to clinical natural language processing and phenotyping applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CUICurate shows a workable GraphRAG pipeline for building larger UMLS concept sets than manual ones on five test cases, but the LLM filter step lacks independent checks on precision for concepts outside the observed notes.

The core takeaway is that this framework uses UMLS graph expansion to pull candidate concepts, then applies off-the-shelf LLMs to filter them, producing bigger sets than manual benchmarks while retaining at least 95 percent of the gold-standard CUIs on their five examples. That addresses a real pain point in clinical NLP where curating synonym and subtype sets by hand is slow and inconsistent. The setup is straightforward: embed the KG for retrieval, run graph-based expansion, then classify with GPT-5 or Qwen3-32B. They note the whole process stayed cheap and stable across runs, which is a practical plus for anyone who has tried scaling this kind of curation. The claim that it surfaces concepts absent from the 10k MIMIC notes is the part that could matter most if it holds up.

What the paper does cleanly is demonstrate an end-to-end, reproducible pipeline tailored to this task rather than just describing another general GraphRAG variant. The results on heterogeneous concepts give at least a proof-of-concept that a single configuration can work across cases.

The soft spots sit mainly in the evaluation. Five concepts is a narrow base, and while recall looks strong there are no precision figures, no inter-rater stats on the gold sets, and no separate validation that the LLM is not over-accepting noisy or hallucinated relations on the unobserved candidates. The stress-test point lands here: if the filter systematically errs on out-of-note relations, the completeness advantage becomes harder to trust. No statistical tests on the differences are reported either.

This is the kind of paper that would interest clinical NLP groups building phenotyping tools or anyone who needs quick, reviewable concept sets for downstream tasks. A reader already working on UMLS curation or GraphRAG applications would pick up usable implementation ideas. It deserves peer review because the method is concrete, the infrastructure claims are testable, and the practical motivation is clear, even though the current evidence is preliminary and would need tighter controls on the LLM step to strengthen the main result.

Referee Report

3 major / 2 minor

Summary. The paper introduces CUICurate, a GraphRAG-based framework for automated UMLS concept set curation. It builds an embedded UMLS knowledge graph, performs graph-based expansion to retrieve candidate CUIs, and applies LLMs (GPT-5 and Qwen3-32B) to filter and classify candidates as clinically meaningful. Evaluated on five lexically heterogeneous clinical concepts against manually curated and gold-standard sets, the framework is reported to yield substantially larger and more complete sets than manual benchmarks, with GPT-5 retaining at least 95% of definitive gold-standard CUIs, high recall under a single retrieval configuration, and low cost/stability across runs. Many missed concepts were absent from the 10k MIMIC-III validation notes.

Significance. If validated, the approach offers a scalable, reproducible alternative to labor-intensive manual concept-set construction for clinical NER, phenotyping, and NLP pipelines. The graph-expansion step's ability to surface unobserved relations is a notable strength for completeness, and the emphasis on inexpensive, stable infrastructure supports practical adoption. However, the absence of precision metrics and independent validation of the LLM filter limits immediate impact.

major comments (3)

[Results] Results section: The headline claim that GPT-5 'retained at least 95% of definitive gold-standard CUIs' and produced 'substantially larger' sets is presented without accompanying precision, false-positive rate, or inter-rater agreement statistics for the gold standards; the evaluation on only five concepts also omits statistical significance tests for differences versus manual benchmarks, undermining the generalizability of the completeness result.
[Methods] Methods section: The LLM-based filtering step (GPT-5 and Qwen3-32B) that classifies graph-expanded candidates as clinically meaningful lacks any independent clinician adjudication or reported false-positive rate, especially for the many candidates absent from the 10,000 MIMIC-III notes; this step is load-bearing for the central claim that the expanded sets are both larger and clinically valid.
[Evaluation] Evaluation section: No details are given on prompt engineering, temperature settings, or consistency checks for the LLM classifiers, nor is there an ablation isolating the contribution of graph expansion versus LLM filtering; these omissions make it impossible to reproduce or diagnose why certain concepts were missed or added.
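
The false-positive rate requested in the second comment can be estimated from a clinician-adjudicated random sample, with an interval that reflects the sample size. The sketch below uses the Wilson score interval as one standard choice; the counts (8 rejected out of 100 reviewed) are invented for illustration and are not figures from the paper.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0 <= events <= n:
        raise ValueError("events must lie between 0 and n")
    p = events / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical adjudication: clinicians reject 8 of 100 sampled LLM-accepted CUIs.
low, high = wilson_interval(events=8, n=100)
print(f"estimated false-positive rate 0.08, 95% interval {low:.3f} to {high:.3f}")
```

With a sample of this size the upper bound sits well above the point estimate, which is why a reported rate is more informative when the interval accompanies it.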

minor comments (2)

[Abstract] Abstract: The phrase 'high recall of definitive concepts' is not quantified; adding the exact recall figures and candidate-set sizes would improve clarity.
[Results] The manuscript would benefit from a table comparing exact set sizes, overlap with gold standards, and runtime/cost metrics across the five concepts for direct comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We have carefully considered each point and revised the paper to address the concerns raised regarding the evaluation metrics, validation of the LLM filtering, and reproducibility details. Below we provide point-by-point responses.

Referee: [Results] Results section: The headline claim that GPT-5 'retained at least 95% of definitive gold-standard CUIs' and produced 'substantially larger' sets is presented without accompanying precision, false-positive rate, or inter-rater agreement statistics for the gold standards; the evaluation on only five concepts also omits statistical significance tests for differences versus manual benchmarks, undermining the generalizability of the completeness result.

Authors: We agree that reporting precision and false-positive rates would provide a more balanced view. However, the primary goal of CUICurate is to improve recall and completeness, as manual curation often misses relevant concepts. The gold-standard sets are derived from manual processes that may themselves have limitations, making precision against them less informative for our completeness claims. We have added statistical significance tests (e.g., paired t-tests or Wilcoxon tests) for the differences in set sizes and recall in the revised Results section. Regarding inter-rater agreement, we will include any available statistics from the original gold-standard curation if they exist, or note the limitation. revision: yes
Referee: [Methods] Methods section: The LLM-based filtering step (GPT-5 and Qwen3-32B) that classifies graph-expanded candidates as clinically meaningful lacks any independent clinician adjudication or reported false-positive rate, especially for the many candidates absent from the 10,000 MIMIC-III notes; this step is load-bearing for the central claim that the expanded sets are both larger and clinically valid.

Authors: This is a valid concern. The LLM filtering is indeed central, and while we validated it indirectly through high retention of gold-standard CUIs, we did not perform independent clinician adjudication on the additional candidates. In the revised manuscript, we have added a description of a post-hoc clinician review on a random sample of 100 filtered candidates (50 from each model) to estimate the false-positive rate, which was low (under 10%). For candidates not appearing in the MIMIC-III notes, we have expanded the discussion to acknowledge that their validity relies on UMLS relations and LLM judgment, and suggest future work with broader validation sets. We believe this addresses the core issue without requiring full adjudication of thousands of candidates. revision: partial
Referee: [Evaluation] Evaluation section: No details are given on prompt engineering, temperature settings, or consistency checks for the LLM classifiers, nor is there an ablation isolating the contribution of graph expansion versus LLM filtering; these omissions make it impossible to reproduce or diagnose why certain concepts were missed or added.

Authors: We apologize for these omissions, which hinder reproducibility. In the revised Methods and Evaluation sections, we have included the full prompt templates used for classification, set temperature to 0 for deterministic outputs, and reported consistency across three independent runs with different seeds. Furthermore, we have added an ablation study that compares the candidate sets from graph expansion alone versus after LLM filtering, quantifying the reduction in candidates and impact on recall. This allows readers to assess the contribution of each component. revision: yes
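
On the significance tests mentioned in the first response, it helps to see what a paired test can show with only five concepts. The sketch below computes an exact two-sided Wilcoxon signed-rank p-value by enumerating all sign assignments; the per-concept differences are invented for illustration. With five non-zero pairs the smallest attainable two-sided p-value is 2/2^5 = 0.0625, so the test cannot fall below 0.05 even when the automated sets win on every concept.

```python
from itertools import product


def signed_ranks(diffs):
    """Return (W+, ranks) for the non-zero differences, averaging tied ranks."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied absolute values.
        while j + 1 < len(order) and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    return w_plus, ranks


def exact_two_sided_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value by full enumeration."""
    w_plus, ranks = signed_ranks(diffs)
    if not ranks:
        raise ValueError("all differences are zero")
    total = sum(ranks)
    observed = min(w_plus, total - w_plus)
    as_extreme = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            as_extreme += 1
    return as_extreme / 2 ** len(ranks)


# Hypothetical recall gains (automated minus manual) for five concepts.
recall_gain = [0.10, 0.22, 0.05, 0.31, 0.18]
print(exact_two_sided_p(recall_gain))
```

Enumeration is only practical for small samples, which is the situation here; it also makes the floor on the p-value explicit instead of hiding it behind a normal approximation.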

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical GraphRAG framework that constructs a UMLS knowledge graph from external data, performs graph-based candidate retrieval, applies off-the-shelf LLMs (GPT-5 and Qwen3-32B) for filtering, and evaluates recall against independent manual and gold-standard concept sets. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the derivation. Claims of larger/more complete sets and ≥95% retention of gold-standard CUIs are reported as measured outcomes rather than reductions to the method's own inputs by construction. The approach relies on external UMLS resources and standard LLM inference, making the central results falsifiable against the provided benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the premise that UMLS relations plus LLM judgment can reliably identify clinically meaningful concept sets without external validation beyond the five test cases. No free parameters are explicitly fitted; the main assumptions are domain-level rather than ad-hoc inventions.

axioms (2)

domain assumption UMLS provides a sufficiently complete and accurate knowledge graph for clinical concept relations
Invoked when constructing the KG and using graph expansion to retrieve candidates
domain assumption LLMs (GPT-5, Qwen3-32B) can classify retrieved CUIs into definitive, related, or irrelevant categories with high fidelity
Central to the filtering step that produces the final concept sets

pith-pipeline@v0.9.0 · 5597 in / 1436 out tokens · 18110 ms · 2026-05-15T21:14:14.730908+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] Xu H, Demner Fushman D, Hong N, Raja K. Medical Concept Normalization. In: Xu H, Demner Fushman D, editors. Natural Language Processing in Biomedicine: A Practical Guide. Cham: Springer International Publishing; 2024. p. 137-64
[2] Jing X. The Unified Medical Language System at 30 Years and How It Is Used and Published: Systematic Review and Content Analysis. JMIR Med Inform 2021; 9: e20675
[3] Gold S, Batch A, McClure R, et al. Clinical Concept Value Sets and Interoperability in Health Data Analytics. AMIA Annu Symp Proc 2018; 2018: 480-9
[4] Soldaini L, Goharian N. QuickUMLS: a fast, unsupervised approach for medical concept extraction. MedIR Workshop, SIGIR; 2016. p. 1-4
[5] Kraljevic Z, Bean D, Mascio A, et al. MedCAT -- Medical Concept Annotation Tool. arXiv pre-print server 2019
[6] Rodriguez VA, Tony S, Thangaraj P, et al. Phenotype Concept Set Construction from Concept Pair Likelihoods. AMIA Annu Symp Proc 2020; 2020: 1080-9
[7] UMLS Metathesaurus Browser. [cited 28/09/2025]; Available from: https://uts.nlm.nih.gov/uts/umls
[8] Athena Search Terms. [cited 28/09/2025]; Available from: https://athena.ohdsi.org/search-terms/start
[9] Kipp M. From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance. Information 2024; 15: 543
[10] Lewis P, Perez E, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Proceedings of the 34th International Conference on Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates Inc.; 2020. p. Article 793
[11] Han H, Wang Y, Shomer H, et al. Retrieval-augmented generation with graphs (GraphRAG). arXiv preprint arXiv:2501.00309 2024
[12] Lukyanchikov N, Kawamoto K. Evaluation of Discrepancies Among National Library of Medicine (NLM) Value Set Authority Center (VSAC) ICD-10-CM Value Sets: Case Study for Diagnoses of Common Chronic Conditions, Implications, and Potential Solutions. AMIA Annu Symp Proc 2023; 2023: 1087-95
[13] Dobbins NJ. Generalizable and scalable multistage biomedical concept normalization leveraging large language models. Research Synthesis Methods 2025; 16: 479-90
[14] Banf M, Kuhn J. A Tripartite Perspective on GraphRAG. arXiv preprint arXiv:2504.19667 2025
[15] Douze M, Guzhva A, Deng C, et al. The faiss library. arXiv preprint arXiv:2401.08281 2024
[16] Thieu T, Maldonado JC, Ho P-S, et al. A comprehensive study of mobility functioning information in clinical notes: Entity hierarchy, corpus annotation, and sequence labeling. International Journal of Medical Informatics 2021; 147: 104351
[17] Messmer AS, Moser M, Zuercher P, Schefold JC, Müller M, Pfortmueller CA. Fluid Overload Phenotypes in Critical Illness—A Machine Learning Approach. Journal of Clinical Medicine; 2022. p. 336.
