Retrieve, Then Classify: Corpus-Grounded Automation of Clinical Value Set Authoring

Celena Wheeler; Chris Sidey-Gibbons; Juan Shu; Nairwita Mazumder; Shannon Hastings; Sumit Mukherjee; Tate Kernell

arxiv: 2604.14616 · v1 · submitted 2026-04-16 · 💻 cs.CL · cs.AI · cs.LG

Sumit Mukherjee, Juan Shu, Nairwita Mazumder, Tate Kernell, Celena Wheeler, Shannon Hastings, Chris Sidey-Gibbons

Pith reviewed 2026-05-10 11:55 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords value set authoring · VSAC · retrieval-augmented classification · clinical phenotyping · SAPBert · clinical vocabularies · value set completion

The pith

Retrieving similar existing value sets then classifying candidates automates clinical value set authoring better than direct LLM prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Clinical value set authoring identifies all standardized codes that define a clinical concept, and it is a recurring manual bottleneck in quality measurement and phenotyping. The paper proposes Retrieval-Augmented Set Completion (RASC), which first pulls the K most similar value sets from a curated corpus to shrink the candidate pool, then applies a classifier to select the relevant codes from that pool. The approach is tested on a new benchmark built from 11,803 public VSAC value sets. A cross-encoder fine-tuned on SAPBert reaches AUROC 0.852 and value-set-level F1 0.298 while cutting irrelevant candidates per true positive from 12.3 to about 3.2, outperforming both a three-layer MLP and zero-shot GPT-4o. The performance margin widens as value-set size grows, consistent with the theoretical claim that retrieval reduces output-space complexity.

Core claim

The paper argues that retrieve-then-classify lowers statistical complexity by restricting the effective output space from the full vocabulary to a small retrieved candidate pool. On the VSAC benchmark this yields higher value-set-level F1 than zero-shot generation (0.298 vs. 0.105) and fewer irrelevant candidates per true positive than retrieval alone (about 3.2 vs. 12.3), with the gap increasing for larger value sets and holding across multiple classifier architectures.
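One common way to make the output-space argument concrete is a counting bound for finite hypothesis classes. The sketch below is an illustration under that assumption, not the paper's own derivation. The symbols are introduced here: $U$ is the full vocabulary with $|U| = N$, $Y(q)$ is the target label set for query $q$, $C(q)$ is the retrieved candidate pool with $|C(q)| = M$, $n$ is the number of training examples, and $\delta$ is a failure probability.

```latex
% Number of possible label sets for one query
\#\{S \subseteq U\} = 2^{N}, \qquad \#\{S \subseteq C(q)\} = 2^{M}, \qquad M \ll N.

% Hoeffding plus a union bound over a finite class \mathcal{H}, loss in [0,1],
% n i.i.d. examples: with probability at least 1-\delta, for all h \in \mathcal{H},
R(h) \le \hat{R}(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2n}}.

% If \ln|\mathcal{H}| scales with the number of selectable codes,
% the complexity term shrinks roughly from
\sqrt{\frac{N\ln 2 + \ln(1/\delta)}{2n}}
\quad\text{to}\quad
\sqrt{\frac{M\ln 2 + \ln(1/\delta)}{2n}}.
```

The reduction only helps when the pool still contains the target codes: any code in $Y(q) \setminus C(q)$ cannot be recovered by the classifier, so the smaller complexity term is traded against retrieval recall.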

What carries the argument

Retrieval-Augmented Set Completion (RASC): retrieve the K most similar existing value sets to form a candidate pool, then apply a classifier to each code in the pool.
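The two stages can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity over precomputed embeddings for the retrieval step, uses 2-dimensional toy vectors in place of SAPBert embeddings, and replaces the trained classifier with a hypothetical `toy_score` function. The names `top_k_value_sets` and `rasc_predict` are invented for this sketch.

```python
import numpy as np

def top_k_value_sets(query_vec, corpus_vecs, k):
    """Indices of the k corpus rows with highest cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

def rasc_predict(query_vec, corpus_vecs, corpus_codes, score_fn, k, threshold=0.5):
    """Retrieve k value sets, pool their codes, keep codes scoring >= threshold."""
    idx = top_k_value_sets(query_vec, corpus_vecs, k)
    pool = set().union(*(corpus_codes[i] for i in idx))
    return {code for code in pool if score_fn(code) >= threshold}

# Toy corpus: 2-d vectors stand in for embeddings of value set titles.
corpus_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
corpus_codes = [{"E11.9", "E11.65"}, {"E11.9", "E10.9"}, {"I10"}]
query_vec = np.array([1.0, 0.05])

# Hypothetical scorer; a trained cross-encoder or MLP would go here.
def toy_score(code):
    return 1.0 if code.startswith("E11") else 0.0

selected = rasc_predict(query_vec, corpus_vecs, corpus_codes, toy_score, k=2)
print(sorted(selected))  # ['E11.65', 'E11.9']
```

In this toy run the third value set is never retrieved, so its code `I10` is never scored. That is the intended effect of the retrieval stage: the classifier only sees codes from the pooled neighbours.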

If this is right

Both the cross-encoder and MLP reduce irrelevant candidates per true positive from 12.3 to roughly 3.2–4.4.
The accuracy advantage over retrieval-only or direct LLM prompting grows with larger value-set size.
Similar gains appear across cross-encoder, MLP, and LightGBM classifiers.
Zero-shot GPT-4o achieves lower value-set-level F1 (0.105), and 48.6% of the codes it returns are absent from VSAC entirely.
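The "irrelevant candidates per true positive" figures map directly onto precision. The short calculation below assumes the metric is the ratio of false positives to true positives, which is the natural reading but is not spelled out in this text; the function name is chosen for this sketch.

```python
def precision_from_irrelevant_per_tp(irrelevant_per_tp):
    """Precision = TP / (TP + FP), with FP / TP given as irrelevant_per_tp."""
    return 1.0 / (1.0 + irrelevant_per_tp)

for label, ratio in [("retrieval-only", 12.3), ("MLP", 4.4), ("cross-encoder", 3.2)]:
    print(f"{label}: {precision_from_irrelevant_per_tp(ratio):.3f}")
# retrieval-only: 0.075
# MLP: 0.185
# cross-encoder: 0.238
```

Under that reading, classification roughly triples the precision of the raw retrieved pool, from about 7.5% to about 24% for the cross-encoder.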

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could serve as a draft generator: seeded with related existing value sets, it would produce a candidate list that clinicians then refine.
Expanding the corpus with value sets from additional vocabularies would test coverage for rarer concepts.
Real-time use in EHR workflows could flag missing codes as new data patterns emerge.

Load-bearing premise

A corpus of existing VSAC value sets is representative enough that similarity retrieval will surface the right codes for new clinical concepts.

What would settle it

A new clinical concept whose correct codes are absent from all top-K retrieved value sets, so the classifier never sees them.

Figures

Figures reproduced from arXiv: 2604.14616 by Celena Wheeler, Chris Sidey-Gibbons, Juan Shu, Nairwita Mazumder, Shannon Hastings, Sumit Mukherjee, Tate Kernell.

**Figure 1.** Overview of RASC workflow. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]

§3 Statistical Motivation for RASC (excerpt captured with the caption): Let U denote the universe of (code, system) pairs with |U| = N. For a query q (a value set title), the task is to identify the target label set Y(q) ⊆ U, where |Y(q)| ≪ N since each value set contains only a small fraction of the full ontology (see …

**Figure 2.** Stratified value-set-level F1 across clinical type (a) and value set size (b). [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]

**Figure 3.** Temporal distribution of VSAC value sets by year of creation or last update. Growth … [PITH_FULL_IMAGE:figures/full_fig_p012_3.png]

**Figure 4.** Left: number of value sets per code system (top 15). SNOMED-CT and ICD-10-CM … [PITH_FULL_IMAGE:figures/full_fig_p013_4.png]

**Figure 5.** Left: inferred value set type distribution. Condition types dominate the corpus. Right: … [PITH_FULL_IMAGE:figures/full_fig_p013_5.png]

**Figure 6.** Left: fraction of value sets with a human-authored textual description. Only 19.6% carry … [PITH_FULL_IMAGE:figures/full_fig_p014_6.png]

**Figure 7.** Top 20 VSAC publisher organizations by number of value sets. The distribution is highly … [PITH_FULL_IMAGE:figures/full_fig_p014_7.png]

read the original abstract

Clinical value set authoring -- the task of identifying all codes in a standardized vocabulary that define a clinical concept -- is a recurring bottleneck in clinical quality measurement and phenotyping. A natural approach is to prompt a large language model (LLM) to generate the required codes directly, but structured clinical vocabularies are large, version-controlled, and not reliably memorized during pretraining. We propose Retrieval-Augmented Set Completion (RASC): retrieve the $K$ most similar existing value sets from a curated corpus to form a candidate pool, then apply a classifier to each candidate code. Theoretically, retrieve-and-select can reduce statistical complexity by shrinking the effective output space from the full vocabulary to a much smaller retrieved candidate pool. We demonstrate the utility of RASC on 11,803 publicly available VSAC value sets, constructing the first large-scale benchmark for this task. A cross-encoder fine-tuned on SAPBert achieves AUROC~0.852 and value-set-level F1~0.298, outperforming a simpler three-layer Multilayer Perceptron (AUROC~0.799, F1~0.250) and both reduce the number of irrelevant candidates per true positive from 12.3 (retrieval-only) to approximately 3.2 and 4.4 respectively. Zero-shot GPT-4o achieves value-set-level F1~0.105, with 48.6% of returned codes absent from VSAC entirely. This performance gap widens with increasing value set size, consistent with RASC's theoretical advantage. We observe similar performance gains across two other classifier model types, namely a cross-encoder initialized from pre-trained SAPBert and a LightGBM model, demonstrating that RASC's benefits extend beyond a single model class. The code to download and create the benchmark dataset, as well as the model training code is available at: https://github.com/mukhes3/RASC.
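The "absent from VSAC" statistic for generated codes amounts to a membership check against the known vocabulary. The sketch below shows one way such a rate could be computed; it is an assumption about the measurement, not the authors' script. The function name `absent_rate` and the toy vocabulary are invented here, and two of the returned pairs are deliberately invalid placeholders.

```python
def absent_rate(returned_codes, vocabulary):
    """Fraction of returned (code, system) pairs that are not in the vocabulary."""
    returned = list(returned_codes)
    if not returned:
        return 0.0
    absent = [pair for pair in returned if pair not in vocabulary]
    return len(absent) / len(returned)

# Toy vocabulary of (code, system) pairs standing in for the VSAC universe.
vocabulary = {
    ("E11.9", "ICD10CM"),
    ("I10", "ICD10CM"),
    ("44054006", "SNOMEDCT"),
}

# Hypothetical model output; the second and fourth pairs are made up on purpose.
returned = [
    ("E11.9", "ICD10CM"),
    ("E11.999", "ICD10CM"),
    ("44054006", "SNOMEDCT"),
    ("99999999", "SNOMEDCT"),
]

print(absent_rate(returned, vocabulary))  # 0.5
```

Matching on the (code, system) pair rather than the bare code avoids counting a code as present when it only exists in a different code system.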

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RASC shows retrieve-then-classify beats direct LLM generation on a new 11k-set clinical value set benchmark, with clear in-distribution gains, but the setup leaves generalization to novel concepts untested.

read the letter

The paper's main point is that pulling similar existing value sets first, then classifying codes from that smaller pool, improves results over retrieval alone or zero-shot GPT-4o for clinical value set completion. On their benchmark built from 11,803 VSAC value sets, a fine-tuned SAPBert cross-encoder reaches AUROC 0.852 and value-set F1 0.298, cutting irrelevant candidates per true positive from 12.3 down to around 3.2, with bigger gains on larger sets. They also show the pattern holds for an MLP and LightGBM, and they release the benchmark construction code plus training scripts.

Referee Report

2 major / 2 minor

Summary. The paper proposes Retrieval-Augmented Set Completion (RASC) to automate clinical value set authoring: retrieve the K most similar existing value sets from a corpus of 11,803 VSAC entries to create a candidate pool, then apply a classifier to identify relevant codes for a new clinical concept. On a held-out benchmark, a cross-encoder fine-tuned from SAPBert achieves AUROC ~0.852 and value-set-level F1 ~0.298, outperforming a 3-layer MLP (AUROC ~0.799, F1 ~0.250), retrieval-only (12.3 irrelevant candidates per true positive), and zero-shot GPT-4o (F1 ~0.105); gains widen with value-set size. Code and benchmark construction scripts are released.
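For readers unfamiliar with the set-level metric, the sketch below computes an F1 score per value set and then averages across value sets. The averaging scheme is an assumption (a macro average); this text does not state how the paper aggregates, and the function names are chosen for illustration.

```python
def set_f1(predicted, gold):
    """F1 between a predicted code set and the gold code set for one value set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def value_set_level_f1(predictions, golds):
    """Mean of per-value-set F1 scores (macro average over value sets)."""
    scores = [set_f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)

predictions = [{"a", "b", "c"}, {"x"}]
golds = [{"a", "b", "d"}, {"y"}]
print(round(value_set_level_f1(predictions, golds), 3))  # 0.333
```

Because every value set contributes equally under this averaging, a few large, well-predicted value sets cannot mask poor performance on many small ones, which is one reason a set-level F1 can sit well below code-level AUROC.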

Significance. If the central empirical claims hold, the work offers a practical, corpus-grounded method for a recurring bottleneck in clinical phenotyping and quality measurement. Strengths include construction of the first large-scale benchmark for this task, consistent gains across three classifier families, explicit comparison to strong baselines including an LLM, and public release of the benchmark-construction and training code, which directly supports reproducibility and follow-on work.

major comments (2)

[Benchmark construction] Benchmark construction (abstract and §4): test value sets are held out from the identical 11,803-VSAC corpus used both for retrieval and for training the classifier. This evaluates in-distribution retrieval but provides no results on out-of-distribution novel clinical concepts (e.g., emerging phenotypes or lab measures whose textual descriptions diverge from historical entries). If retrieval recall@K drops for such cases, the downstream classifier receives an incomplete pool and the claimed reduction in candidate complexity (from 12.3 to ~3.2 irrelevant per true positive) does not materialize.
[Results] Experimental details (results section): the manuscript reports AUROC/F1 numbers and scaling behavior but does not specify the exact train/test split procedure, whether value-set overlap was prevented between retrieval corpus and classifier training, or the precise implementation of similarity retrieval. These omissions are load-bearing for interpreting whether the reported gains of roughly 0.05 in AUROC and F1 over the MLP baseline (0.852 vs. 0.799 and 0.298 vs. 0.250) reflect genuine generalization or data leakage.

minor comments (2)

[Abstract] Abstract: approximate notation ('AUROC~0.852') should be replaced by exact values or ranges with standard errors to allow precise comparison with future work.
[Methods] Notation: the symbol K for retrieval depth is introduced without an explicit range or sensitivity analysis; a short table or plot showing performance vs. K would clarify the practical operating point.
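A sensitivity analysis over K of the kind requested could start from the raw candidate pool, before any classifier is applied. The sketch below is a hypothetical illustration on toy data: `ranked_sets` is assumed to be the retrieved value sets for one query, already sorted by decreasing similarity, and `sweep_k` is a name invented here.

```python
def sweep_k(ranked_sets, target, ks):
    """For each k, report pool recall and irrelevant candidates per true positive."""
    target = set(target)
    rows = []
    for k in ks:
        pool = set().union(*ranked_sets[:k])
        tp = len(pool & target)
        fp = len(pool - target)
        rows.append({
            "k": k,
            "recall": tp / len(target),
            "irrelevant_per_tp": fp / tp if tp else float("inf"),
        })
    return rows

# Toy retrieval result for one query, most similar value set first.
ranked_sets = [{"a", "b"}, {"b", "c", "x"}, {"d", "y", "z", "w"}]
target = {"a", "b", "c", "d"}

for row in sweep_k(ranked_sets, target, ks=[1, 2, 3]):
    print(row)
# k=1: recall 0.5,  irrelevant_per_tp 0.0
# k=2: recall 0.75, irrelevant_per_tp 0.333...
# k=3: recall 1.0,  irrelevant_per_tp 1.0
```

The toy numbers show the trade-off a K table would expose: a larger K raises the recall ceiling of the pool but also admits more irrelevant candidates for the classifier to reject.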

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance, benchmark construction, and reproducibility contributions. We address each major comment point by point below.

read point-by-point responses

Referee: [Benchmark construction] Benchmark construction (abstract and §4): test value sets are held out from the identical 11,803-VSAC corpus used both for retrieval and for training the classifier. This evaluates in-distribution retrieval but provides no results on out-of-distribution novel clinical concepts (e.g., emerging phenotypes or lab measures whose textual descriptions diverge from historical entries). If retrieval recall@K drops for such cases, the downstream classifier receives an incomplete pool and the claimed reduction in candidate complexity (from 12.3 to ~3.2 irrelevant per true positive) does not materialize.

Authors: We agree that the benchmark evaluates in-distribution performance by holding out value sets from the same VSAC corpus. This setup is deliberate, as the practical use case for value set authoring typically involves clinical concepts with some textual or semantic similarity to existing VSAC entries. For entirely novel emerging phenotypes with no close historical analogs, retrieval recall would likely decrease and the candidate pool would be less complete. However, no separate out-of-distribution benchmark of such novel concepts exists within or outside VSAC. We will add an explicit discussion of this scope limitation and its implications for the claimed complexity reduction in the revised benchmark construction section. revision: partial
Referee: [Results] Experimental details (results section): the manuscript reports AUROC/F1 numbers and scaling behavior but does not specify the exact train/test split procedure, whether value-set overlap was prevented between retrieval corpus and classifier training, or the precise implementation of similarity retrieval. These omissions are load-bearing for interpreting whether the reported gains of roughly 0.05 in AUROC and F1 over the MLP baseline (0.852 vs. 0.799 and 0.298 vs. 0.250) reflect genuine generalization or data leakage.

Authors: We apologize for these omissions in the results section. The split procedure randomly holds out 20% of the 11,803 value sets (by unique ID) as the test set; the remaining 80% form the retrieval corpus used for both similarity search and classifier training, with no value-set ID overlap between them. Similarity retrieval computes cosine similarity between the SAPBert embedding of the input clinical concept description and the embeddings of value-set descriptions in the corpus. We will add these details, including pseudocode for the split and retrieval steps, to the revised manuscript to enable full reproduction and rule out leakage concerns. revision: yes
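The split described in this response could look like the following. This is a sketch of a seeded 80/20 split by unique value-set ID with an explicit disjointness check, written from the description above rather than taken from the released code; `split_by_id`, the seed, and the `oid-` identifiers are assumptions for illustration.

```python
import random

def split_by_id(value_set_ids, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of unique value-set IDs as the test set."""
    ids = sorted(set(value_set_ids))      # deduplicate, fix order before shuffling
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    test_ids = set(ids[:n_test])
    corpus_ids = set(ids[n_test:])        # used for retrieval and classifier training
    return corpus_ids, test_ids

# Toy identifiers; real VSAC value sets are keyed by OID.
all_ids = [f"oid-{i}" for i in range(100)]
corpus_ids, test_ids = split_by_id(all_ids)

# Leakage guard: no value-set ID may appear on both sides of the split.
assert corpus_ids.isdisjoint(test_ids)
print(len(corpus_ids), len(test_ids))  # 80 20
```

ID-level disjointness is a necessary check but may not be sufficient: value sets with different IDs can share near-identical titles and code lists, so a stricter audit would also compare content across the split.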

standing simulated objections not resolved

Empirical results on out-of-distribution novel clinical concepts (no such benchmark is available from VSAC or public sources).

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical ML study that constructs a held-out benchmark from the VSAC corpus, trains classifiers (cross-encoder, MLP, LightGBM) on retrieved candidates, and reports AUROC/F1 metrics against explicit baselines including retrieval-only and zero-shot GPT-4o. No equations, derivations, or first-principles results are presented that reduce by construction to fitted inputs or self-citations; the sole theoretical remark is a general statement that shrinking the output space reduces complexity, which is not used to derive any specific numerical claim. All reported gains are direct empirical measurements on the test partition, with no self-referential fitting or renaming of known results. This is a standard benchmark comparison and receives score 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that existing value sets form a representative corpus for retrieval. No new entities are postulated. The retrieval depth K is a free hyperparameter whose specific value is not reported in the abstract.

free parameters (1)

K (retrieval depth)
Number of most similar value sets retrieved to form the candidate pool; treated as a tunable hyperparameter.

axioms (1)

domain assumption: A curated corpus of existing value sets is representative of the distribution of clinical concepts that future value sets will need to cover.
The retrieve step presupposes that similarity search within VSAC will surface relevant codes for new concepts.

