pith. machine review for the scientific record.

arxiv: 2604.25605 · v1 · submitted 2026-04-28 · 💻 cs.IR · cs.AI · cs.DB

Recognition: unknown

Health System Scale Semantic Search Across Unstructured Clinical Notes

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 15:10 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.DB
keywords semantic search · clinical notes · vector embeddings · health system scale · chart abstraction · unstructured data · HIPAA compliance · cohort generation

The pith

A semantic search system over 166 million clinical notes delivers sub-second results at roughly $4,000 per month while cutting chart review time by up to 89 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that embedding-based semantic search can be run at the full scale of a large children's hospital's note archive using standard cloud tools and a modest embedding model. It indexes 484 million vectors, keeps metadata in a fast key-value store, and stays inside HIPAA rules without demanding special informatics staff. A physician-written benchmark reached 94.6 percent accuracy with 300-token chunks, and three abstraction tasks finished 24 to 89 percent faster than manual review while preserving inter-rater agreement. If the approach holds, hospitals gain an infrastructure layer for interactive queries, cohort building, and later LLM tools on top of existing unstructured data.

Core claim

The authors deployed a production semantic search service that indexes every clinical note in the system, stores vectors in a managed database with optimized indexing, and pairs them with full-text metadata. On a physician-authored benchmark it reached 94.6 percent accuracy; at full load it sustained median query times of 237 ms for one user and 451 ms for twenty concurrent users at roughly four thousand dollars monthly. In three real abstraction tasks the system shortened completion time by 24 to 89 percent compared with unaided clinician review while keeping agreement levels comparable.

What carries the argument

Instruction-tuned qwen3-embedding-0.6B vectors with 300-token chunking, stored in a managed vector database and paired with a low-latency key-value store for metadata, all running inside a HIPAA-compliant governance layer.
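The chunking step named above (300-token chunks, per Figure 1 with 50-token overlap) can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the whitespace tokenizer is a stand-in for the embedding model's real tokenizer.

```python
# Illustrative sketch of the chunking step: 300-token windows with
# 50-token overlap, i.e. a stride of 250 tokens between window starts.
# A whitespace split stands in for the model's real tokenizer.

def chunk_note(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split a clinical note into overlapping token windows for embedding."""
    tokens = text.split()  # stand-in tokenizer (assumption)
    if not tokens:
        return []
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # final window reached the end of the note
    return chunks
```

Each chunk would then be embedded (here, with qwen3-embedding-0.6B) and written to the vector index, with the full note text and metadata kept in the key-value store.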

If this is right

  • Hospitals can add interactive semantic search and cohort generation without hiring specialized informatics teams.
  • The same index supports downstream LLM applications that need rapid retrieval of relevant notes.
  • Time savings of 24 to 89 percent in abstraction tasks scale to research, quality reporting, and care coordination workflows.
  • Full-text metadata remains queryable alongside semantic matches, preserving exact phrase lookup when needed.
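The last point, exact-phrase lookup alongside semantic matching, can be sketched as a toy hybrid query. The index schema (`vec`, `text`, `note_id`) and the in-memory scan are illustrative assumptions; the deployed system uses a managed vector database, not a linear scan.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def hybrid_search(query_vec, index, k=3, phrase=None):
    """Rank chunks by embedding similarity; optionally keep only chunks
    whose text contains an exact phrase, mirroring full-text metadata
    lookup alongside the semantic match. Schema is illustrative."""
    hits = []
    for entry in index:
        if phrase is not None and phrase not in entry["text"]:
            continue  # exact-phrase filter applied before ranking
        hits.append((cosine(query_vec, entry["vec"]), entry["note_id"]))
    hits.sort(reverse=True)
    return hits[:k]
```

The design point is that the filter and the rank compose: the semantic index narrows by meaning, while the metadata store preserves literal string matching for queries where paraphrase is unacceptable.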

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern could let smaller hospitals or clinics pool notes across sites for federated queries once privacy controls are standardized.
  • Real-time integration into electronic health record screens might let clinicians surface relevant history during visits rather than after the fact.
  • Cost and latency numbers suggest the approach could extend to multi-hospital networks without proportional increases in infrastructure spend.

Load-bearing premise

The physician benchmark and three abstraction tasks capture enough of everyday clinical search needs that the observed speed gains will appear in ordinary practice without increasing missed information or new errors.

What would settle it

A live workflow study in which clinicians using the system miss or misclassify information that independent manual review later finds, or in which query latency exceeds acceptable thresholds during normal peak hours.

Figures

Figures reproduced from arXiv: 2604.25605 by Alex B. Ruan, Anna Lin, Barbara H. Chaiyachati, Faith Wavinya Mutinda, Heather M. Griffis, Hessam Shahriari, Ian M. Campbell, Irit R. Rasooly, Jeffrey M. Miller, Kevin Murphy, Patrick Dibussolo, Robert W. Grundmeier, Sanjay Chainani, Scott M. Haag, Shivaji Dutta, Shivani Kamath Belman, Spandana Makeneni.

Figure 1
Figure 1: System architecture for health system–scale semantic search. Clinical notes are extracted from the CHOP EHR database, chunked into 300-token chunks with 50-token overlap, and embedded using qwen3-embedding-0.6B. Chunk embeddings are stored in a vector database, while the full note text and metadata are stored in a low-latency key-value store (chosen for cost efficiency since BigTable offers lower-cost stora… view at source ↗
Figure 2
Figure 2: Screenshot of the semantic search user interface. The interface enables users to query the semantic search system using natural language and apply advanced filters to refine results. Left panel: users enter a question and select filters such as patient identifier (MRN), number of notes to retrieve, note category, encounter type, and so on to customize the search. Additional cohort-building tools allow user… view at source ↗
Figure 3
Figure 3: Accuracy evaluation of embedding models and chunking strategies on the CHOP_MCQA_v0.5 benchmark. Qwen3-embedding-0.6B with a 300-token chunk size achieved the highest accuracy of 95.51%. These experiments were run on a reduced index containing only the notes for patients included in the benchmarking dataset. view at source ↗
original abstract

Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption. Methods: We deployed a semantic search system at a large children's hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework. We evaluated the system through three experiments: optimization of embedding model and chunking strategy using a physician-authored benchmark dataset, characterization of full-scale performance (cost, latency, retrieval quality), and clinical utility assessment via comparison of chart abstraction efficiency across three tasks. Results: The system delivers sub-second query latency (median 237 ms single-user, 451 ms 20-user concurrency) with monthly costs of approximately USD 4,000. Qwen3 embeddings with 300-token chunk size achieved 94.6% accuracy on a clinical question-answering benchmark. In clinical utility evaluation across three abstraction tasks, semantic search reduced time-to-completion by 24 to 89% compared to clinician-performed chart review while maintaining comparable inter-rater agreement. Conclusion: Health-system-scale semantic search is both technically and operationally feasible. The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript describes deployment of a semantic search system over 166 million clinical notes (484 million vectors) from 1.68 million patients at a large children's hospital. It uses instruction-tuned qwen3-embedding-0.6B embeddings with 300-token chunks, a managed vector database, low-latency metadata store, and HIPAA-compliant governance. Evaluation covers embedding/chunk optimization on a physician-authored benchmark (94.6% accuracy), full-scale performance (median 237 ms latency, ~USD 4,000 monthly cost), and clinical utility via three abstraction tasks showing 24–89% time reductions with comparable inter-rater agreement. The central claim is that health-system-scale semantic search is both technically and operationally feasible and supports interactive search, cohort generation, and downstream LLM applications without specialized informatics expertise.

Significance. If the results hold, the work provides concrete evidence that semantic search can be engineered and governed at the scale of hundreds of millions of notes with sub-second interactive latency and modest operating cost. The real deployment, benchmark results, and measured clinician time savings constitute a practical contribution to clinical information retrieval that could accelerate adoption and enable downstream applications.

major comments (1)
  1. [Results] Results, clinical utility assessment: the operational-feasibility conclusion rests on time savings of 24–89% across three physician-authored abstraction tasks performed in a controlled setting while maintaining inter-rater agreement. The manuscript reports no recall, false-negative rates for critical or rare findings, or performance on open-ended longitudinal queries; without these data it is unclear whether the observed speed gains mask missed information or new errors, weakening support for routine operational use.
minor comments (2)
  1. [Abstract] Abstract and Methods: the description of the three abstraction tasks, inter-rater agreement metric, and statistical comparisons lacks error bars, confidence intervals, or p-values, making it difficult to assess the reliability of the reported time savings and accuracy figures.
  2. [Methods] Methods: the exact chunking strategy, embedding instruction template, and storage-optimized index parameters are referenced but not fully specified, hindering reproducibility of the 94.6% benchmark result.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments. We address the major comment below.

point-by-point responses
  1. Referee: [Results] Results, clinical utility assessment: the operational-feasibility conclusion rests on time savings of 24–89% across three physician-authored abstraction tasks performed in a controlled setting while maintaining inter-rater agreement. The manuscript reports no recall, false-negative rates for critical or rare findings, or performance on open-ended longitudinal queries; without these data it is unclear whether the observed speed gains mask missed information or new errors, weakening support for routine operational use.

    Authors: We agree that the clinical utility evaluation is confined to time-to-completion reductions (24–89%) and preserved inter-rater agreement on three physician-authored abstraction tasks performed under controlled conditions. The manuscript does not report recall, false-negative rates for critical or rare findings, or results on open-ended longitudinal queries. This constitutes a real limitation for claims about routine operational deployment, as speed gains could in principle conceal missed information. The three tasks were selected to mirror common chart-abstraction workflows at our institution, and the fact that inter-rater agreement remained comparable indicates that the retrieved notes did not systematically alter clinical conclusions in those specific scenarios. In the revised manuscript we will add an explicit paragraph in the Discussion section acknowledging the absence of recall-oriented metrics and the need for future studies that quantify missed critical findings and evaluate open-ended longitudinal queries. We believe the present evidence still supports technical and operational feasibility for interactive search and cohort generation, while concurring that broader validation is required before asserting routine clinical use.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical deployment and measured evaluation

full rationale

The paper reports a production deployment of a semantic search system over 166M notes, with performance characterized by direct measurements of latency, cost, retrieval accuracy on a held-out benchmark, and time savings in three controlled abstraction tasks. No equations, predictions, or uniqueness claims are present; all results are grounded in observed metrics rather than any derivation that reduces to fitted inputs or self-citations by construction. The work is self-contained against external benchmarks and does not invoke load-bearing self-citations for its central feasibility conclusion.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

Central claim rests on empirical selection of embedding model and chunk size validated via benchmark; no new mathematical axioms, free parameters fitted to target data, or invented entities are introduced.

free parameters (2)
  • chunk size = 300 tokens
    300 tokens selected after optimization experiment on benchmark dataset
  • embedding model = qwen3-embedding-0.6B
    qwen3-embedding-0.6B chosen after model comparison

pith-pipeline@v0.9.0 · 5684 in / 1194 out tokens · 51616 ms · 2026-05-07T15:10:49.216217+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    Jh, H. et al. Development of an electronic health records datamart to support clinical and population health research. J. Clin. Transl. Sci. 5 (2020)

  2. [2]

    Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. in Advances in Neural Information Processing Systems vol. 33 9459–9474 (Curran Associates, Inc., 2020)

  3. [3]

    Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023)

  4. [4]

    Hill, J. R., Visweswaran, S., Ning, X. & Schleyer, T. K. Use, Impact, Weaknesses, and Advanced Features of Search Functions for Clinical Use in Electronic Health Records: A Scoping Review. Appl. Clin. Inform. 12, 417–428 (2021)

  5. [5]

    Jin, Q., Leaman, R. & Lu, Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. eBioMedicine 100, 104988 (2024)

  6. [6]

    Savage, C. H., Chaudhari, G., Smith, A. D. & Sohn, J. H. RadSearch, a Semantic Search Model for Accurate Radiology Report Retrieval with Large Language Model Integration. Radiology 315, e240686 (2025)

  7. [7]

    Pressat-Laffouilhère, T. et al. Evaluation of Doc’EDS: a French semantic search tool to query health documents from a clinical data warehouse. BMC Med. Inform. Decis. Mak. 22, 34 (2022)

  8. [8]

    Son, N. et al. Development and Evaluation of a Retrieval-Augmented Generation-Based Electronic Medical Record Chatbot System. Healthc. Inform. Res. 31, 218–225 (2025)

  9. [9]

    Kresevic, S. et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. Npj Digit. Med. 7, 102 (2024)

  10. [10]

    Ong, C. S., Obey, N. T., Zheng, Y., Cohan, A. & Schneider, E. B. SurgeryLLM: a retrieval-augmented generation large language model framework for surgical decision support and workflow enhancement. Npj Digit. Med. 7, 364 (2024)

  11. [11]

    Benfenati, D., De Filippis, G. M., Rinaldi, A. M., Russo, C. & Tommasino, C. A Retrieval-augmented Generation application for Question-Answering in Nutrigenetics Domain. Procedia Comput. Sci. 246, 586–595 (2024)

  12. [12]

    Ge, J. et al. Development of a liver disease–specific large language model chat interface using retrieval-augmented generation. Hepatology 80, 1158 (2024)

  13. [13]

    Zhang, Y. et al. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. Preprint at https://doi.org/10.48550/arXiv.2506.05176 (2025)

  14. [14]

    Nussbaum, Z., Morris, J. X., Duderstadt, B. & Mulyar, A. Nomic Embed: Training a Reproducible Long Context Text Embedder. Preprint at https://doi.org/10.48550/arXiv.2402.01613 (2025)

  15. [15]

    Wang, L. et al. Text Embeddings by Weakly-Supervised Contrastive Pre-training. Preprint at https://doi.org/10.48550/arXiv.2212.03533 (2024)

  16. [16]

    Alsentzer, E. et al. Publicly Available Clinical BERT Embeddings. Preprint at https://doi.org/10.48550/arXiv.1904.03323 (2019)

  17. [17]

    Tang, Y. & Yang, Y. Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models? Preprint at https://doi.org/10.48550/arXiv.2409.02727 (2024)

  18. [18]

    Vector Search | Vertex AI. Google Cloud Documentation https://docs.cloud.google.com/vertex-ai/docs/vector-search/overview

  19. [19]

    Sun, P., Simcha, D., Dopson, D., Guo, R. & Kumar, S. SOAR: Improved Indexing for Approximate Nearest Neighbor Search. Preprint at https://doi.org/10.48550/arXiv.2404.00774 (2024)

  20. [20]

    Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025)

  21. [21]

    Mann, H. B. & Whitney, D. R. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Stat. 18, 50–60 (1947)

  22. [22]

    Fleiss, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382. https://doi.org/10.1037/h0031619 (1971)

  23. [23]

    Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 37–46 (1960)

  24. [24]

    Krippendorff, K. Content Analysis: An Introduction to Its Methodology. (SAGE Publications, Inc., 2019). doi:10.4135/9781071878781

  25. [25]

    Arcus. CHOP Research Institute https://www.research.chop.edu/applications/arcus (2022)

  26. [26]

    Enevoldsen, K. et al. MMTEB: Massive Multilingual Text Embedding Benchmark. Preprint at https://doi.org/10.48550/arXiv.2502.13595 (2025)

  27. [27]

    Lee, C., Vogt, K. A. & Kumar, S. Prospects for AI clinical summarization to reduce the burden of patient chart review. Front. Digit. Health 6, 1475092 (2024)

  28. [28]

    Ostropolets, A. et al. Scalable and interpretable alternative to chart review for phenotype evaluation using standardized structured data from electronic health records. J. Am. Med. Inform. Assoc. 31, 119–129 (2024)

  29. [29]

    Goldschmidt, D. E. & Krishnamoorthy, M. Comparing keyword search to semantic search: a case study in solving crossword puzzles using the Google™ API. Softw. Pract. Exp. 38, 417–445 (2008)

  30. [30]

    Miller, T. P. et al. Automated Ascertainment of Typhlitis From the Electronic Health Record. JCO Clin. Cancer Inform. 6, e2200081 (2022)

  31. [31]

    Kuhn, L. & Eickhoff, C. Implicit Negative Feedback in Clinical Information Retrieval. in Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval 1243–1243 (2016)