Health System Scale Semantic Search Across Unstructured Clinical Notes
Pith reviewed 2026-05-07 15:10 UTC · model grok-4.3
The pith
A semantic search system over 166 million clinical notes delivers sub-second results at roughly $4,000 per month while cutting chart review time by up to 89 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors deployed a production semantic search service that indexes every clinical note in the system, stores vectors in a managed database with optimized indexing, and pairs them with full-text metadata. On a physician-authored benchmark it reached 94.6 percent accuracy; at full load it sustained median query times of 237 ms for a single user and 451 ms for twenty concurrent users, at an operating cost of roughly USD 4,000 per month. In three real abstraction tasks the system shortened completion time by 24 to 89 percent compared with unaided clinician review while keeping agreement levels comparable.
What carries the argument
Instruction-tuned qwen3-embedding-0.6B vectors with 300-token chunking, stored in a managed vector database and paired with a low-latency key-value store for metadata, all running inside a HIPAA-compliant governance layer.
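The chunking step can be sketched in miniature. Below is a minimal token-window chunker; whitespace tokenization stands in for the embedding model's tokenizer, and the 30-token overlap is an illustrative assumption (the paper, as summarized here, specifies only the 300-token chunk length):

```python
def chunk_note(text: str, chunk_tokens: int = 300, overlap: int = 30) -> list[str]:
    """Split a clinical note into fixed-size token windows.

    Whitespace tokenization is a stand-in for the real model tokenizer;
    the overlap size is an assumption for illustration.
    """
    tokens = text.split()
    if not tokens:
        return []
    step = chunk_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks

# A 700-token note yields three overlapping windows, the last one shorter.
note = " ".join(f"tok{i}" for i in range(700))
print([len(c.split()) for c in chunk_note(note)])  # [300, 300, 160]
```

Each chunk would then be embedded and stored as one of the 484 million vectors; at 166 million notes this averages about three chunks per note, consistent with short clinical documents.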
If this is right
- Hospitals can add interactive semantic search and cohort generation without hiring specialized informatics teams.
- The same index supports downstream LLM applications that need rapid retrieval of relevant notes.
- Time savings of 24 to 89 percent in abstraction tasks scale to research, quality reporting, and care coordination workflows.
- Full-text metadata remains queryable alongside semantic matches, preserving exact phrase lookup when needed.
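The hybrid pattern in the last bullet — semantic ranking with exact-phrase lookup still available — can be sketched as a brute-force cosine search over toy vectors plus a keyword filter on full-text metadata. Everything here (the tiny in-memory index, the note texts) is illustrative; the deployed system uses a managed vector database and a low-latency key-value store instead:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy index: (chunk_id, embedding, metadata carrying the full text).
INDEX = [
    ("n1", [0.9, 0.1, 0.0], {"text": "fever and neutropenia after chemo"}),
    ("n2", [0.1, 0.9, 0.0], {"text": "routine well-child visit"}),
    ("n3", [0.8, 0.2, 0.1], {"text": "febrile neutropenia, typhlitis suspected"}),
]

def search(query_vec, phrase=None, k=2):
    """Rank chunks by cosine similarity; optionally require an exact
    phrase in the full-text metadata (the hybrid lookup described above)."""
    hits = [(cid, cosine(query_vec, vec)) for cid, vec, meta in INDEX
            if phrase is None or phrase in meta["text"]]
    hits.sort(key=lambda h: h[1], reverse=True)
    return [cid for cid, _ in hits[:k]]

print(search([1.0, 0.0, 0.0]))                      # ['n1', 'n3']
print(search([1.0, 0.0, 0.0], phrase="typhlitis"))  # ['n3']
```

At 484 million vectors the brute-force scan is replaced by approximate nearest-neighbor indexing, but the query contract — vector in, ranked chunk IDs out, with optional exact-match constraints — is the same.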
Where Pith is reading between the lines
- The same pattern could let smaller hospitals or clinics pool notes across sites for federated queries once privacy controls are standardized.
- Real-time integration into electronic health record screens might let clinicians surface relevant history during visits rather than after the fact.
- Cost and latency numbers suggest the approach could extend to multi-hospital networks without proportional increases in infrastructure spend.
Load-bearing premise
The physician benchmark and three abstraction tasks capture enough of everyday clinical search needs that the observed speed gains will appear in ordinary practice without increasing missed information or new errors.
What would settle it
A live workflow study in which clinicians using the system miss or misclassify information that independent manual review later finds, or in which query latency exceeds acceptable thresholds during normal peak hours.
Original abstract
Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption. Methods: We deployed a semantic search system at a large children's hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework. We evaluated the system through three experiments: optimization of embedding model and chunking strategy using a physician-authored benchmark dataset, characterization of full-scale performance (cost, latency, retrieval quality), and clinical utility assessment via comparison of chart abstraction efficiency across three tasks. Results: The system delivers sub-second query latency (median 237 ms single-user, 451 ms 20-user concurrency) with monthly costs of approximately USD 4,000. Qwen3 embeddings with 300-token chunk size achieved 94.6% accuracy on a clinical question-answering benchmark. In clinical utility evaluation across three abstraction tasks, semantic search reduced time-to-completion by 24 to 89% compared to clinician-performed chart review while maintaining comparable inter-rater agreement. Conclusion: Health-system-scale semantic search is both technically and operationally feasible. The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise.
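The latency figures in the abstract (median 237 ms single-user, 451 ms at 20-user concurrency) imply a measurement harness along these lines — a sketch only, with a stand-in for the real search call; the authors' actual instrumentation is not described in this summary:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def timed_query(run_query, payload):
    """Wall-clock latency of one query, in milliseconds."""
    t0 = time.perf_counter()
    run_query(payload)
    return (time.perf_counter() - t0) * 1000.0

def median_latency_ms(run_query, queries, concurrency=1):
    """Median latency across queries issued by `concurrency` workers,
    mirroring the single-user vs. 20-user comparison in the abstract."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = list(pool.map(lambda q: timed_query(run_query, q), queries))
    return statistics.median(samples)

# Stand-in for a real vector-search call (illustrative only).
fake_search = lambda q: sum(i * i for i in range(1000))
print(round(median_latency_ms(fake_search, ["q"] * 10, concurrency=5), 3))
```

Reporting the median rather than the mean is the natural choice here, since tail latencies under concurrency are typically skewed.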
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes deployment of a semantic search system over 166 million clinical notes (484 million vectors) from 1.68 million patients at a large children's hospital. It uses instruction-tuned qwen3-embedding-0.6B embeddings with 300-token chunks, a managed vector database, low-latency metadata store, and HIPAA-compliant governance. Evaluation covers embedding/chunk optimization on a physician-authored benchmark (94.6% accuracy), full-scale performance (median 237 ms latency, ~USD 4,000 monthly cost), and clinical utility via three abstraction tasks showing 24–89% time reductions with comparable inter-rater agreement. The central claim is that health-system-scale semantic search is both technically and operationally feasible and supports interactive search, cohort generation, and downstream LLM applications without specialized informatics expertise.
Significance. If the results hold, the work provides concrete evidence that semantic search can be engineered and governed at the scale of hundreds of millions of notes with sub-second interactive latency and modest operating cost. The real deployment, benchmark results, and measured clinician time savings constitute a practical contribution to clinical information retrieval that could accelerate adoption and enable downstream applications.
major comments (1)
- [Results] Results, clinical utility assessment: the operational-feasibility conclusion rests on time savings of 24–89% across three physician-authored abstraction tasks performed in a controlled setting while maintaining inter-rater agreement. The manuscript reports no recall, false-negative rates for critical or rare findings, or performance on open-ended longitudinal queries; without these data it is unclear whether the observed speed gains mask missed information or new errors, weakening support for routine operational use.
minor comments (2)
- [Abstract] Abstract and Methods: the description of the three abstraction tasks, inter-rater agreement metric, and statistical comparisons lacks error bars, confidence intervals, or p-values, making it difficult to assess the reliability of the reported time savings and accuracy figures.
- [Methods] Methods: the exact chunking strategy, embedding instruction template, and storage-optimized index parameters are referenced but not fully specified, hindering reproducibility of the 94.6% benchmark result.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address the major comment below.
Point-by-point responses
Referee: [Results] Results, clinical utility assessment: the operational-feasibility conclusion rests on time savings of 24–89% across three physician-authored abstraction tasks performed in a controlled setting while maintaining inter-rater agreement. The manuscript reports no recall, false-negative rates for critical or rare findings, or performance on open-ended longitudinal queries; without these data it is unclear whether the observed speed gains mask missed information or new errors, weakening support for routine operational use.
Authors: We agree that the clinical utility evaluation is confined to time-to-completion reductions (24–89%) and preserved inter-rater agreement on three physician-authored abstraction tasks performed under controlled conditions. The manuscript does not report recall, false-negative rates for critical or rare findings, or results on open-ended longitudinal queries. This constitutes a real limitation for claims about routine operational deployment, as speed gains could in principle conceal missed information. The three tasks were selected to mirror common chart-abstraction workflows at our institution, and the fact that inter-rater agreement remained comparable indicates that the retrieved notes did not systematically alter clinical conclusions in those specific scenarios. In the revised manuscript we will add an explicit paragraph in the Discussion section acknowledging the absence of recall-oriented metrics and the need for future studies that quantify missed critical findings and evaluate open-ended longitudinal queries. We believe the present evidence still supports technical and operational feasibility for interactive search and cohort generation, while concurring that broader validation is required before asserting routine clinical use.
Revision: yes
Circularity Check
No circularity: empirical deployment and measured evaluation
full rationale
The paper reports a production deployment of a semantic search system over 166M notes, with performance characterized by direct measurements of latency, cost, retrieval accuracy on a held-out benchmark, and time savings in three controlled abstraction tasks. No equations, predictions, or uniqueness claims are present; all results are grounded in observed metrics rather than any derivation that reduces to fitted inputs or self-citations by construction. The work is self-contained against external benchmarks and does not invoke load-bearing self-citations for its central feasibility conclusion.
Axiom & Free-Parameter Ledger
free parameters (2)
- chunk size = 300 tokens
- embedding model = qwen3-embedding-0.6B
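The "instruction-tuned" qualifier means queries are prefixed with a task instruction before embedding, while documents are embedded as-is. A minimal formatter in the style of the Qwen3-Embedding model card — the exact instruction wording used by the authors is not given in this review, so the template and task text below are assumptions:

```python
def format_query(query: str,
                 task: str = "Given a clinical question, retrieve relevant "
                             "passages from clinical notes") -> str:
    """Prepend a retrieval instruction to a query, as instruction-tuned
    embedding models such as Qwen3-Embedding expect. The task text here
    is an illustrative assumption, not the authors' actual prompt."""
    return f"Instruct: {task}\nQuery: {query}"

print(format_query("history of febrile neutropenia"))
```

Because only the query side carries the instruction, the 484 million stored document vectors never need re-embedding when the retrieval task description changes.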
Reference graph
Works this paper leans on
- [1] Jh, H. et al. Development of an electronic health records datamart to support clinical and population health research. J. Clin. Transl. Sci. 5 (2020)
- [2] Lewis, P. et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems vol. 33, 9459–9474 (Curran Associates, Inc., 2020)
- [3] Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023)
- [4] Hill, J. R., Visweswaran, S., Ning, X. & Schleyer, T. K. Use, Impact, Weaknesses, and Advanced Features of Search Functions for Clinical Use in Electronic Health Records: A Scoping Review. Appl. Clin. Inform. 12, 417–428 (2021)
- [5] Jin, Q., Leaman, R. & Lu, Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. eBioMedicine 100, 104988 (2024)
- [6] Savage, C. H., Chaudhari, G., Smith, A. D. & Sohn, J. H. RadSearch, a Semantic Search Model for Accurate Radiology Report Retrieval with Large Language Model Integration. Radiology 315, e240686 (2025)
- [7] Pressat-Laffouilhère, T. et al. Evaluation of Doc'EDS: a French semantic search tool to query health documents from a clinical data warehouse. BMC Med. Inform. Decis. Mak. 22, 34 (2022)
- [8] Son, N. et al. Development and Evaluation of a Retrieval-Augmented Generation-Based Electronic Medical Record Chatbot System. Healthc. Inform. Res. 31, 218–225 (2025)
- [9] Kresevic, S. et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. Npj Digit. Med. 7, 102 (2024)
- [10] Ong, C. S., Obey, N. T., Zheng, Y., Cohan, A. & Schneider, E. B. SurgeryLLM: a retrieval-augmented generation large language model framework for surgical decision support and workflow enhancement. Npj Digit. Med. 7, 364 (2024)
- [11] Benfenati, D., De Filippis, G. M., Rinaldi, A. M., Russo, C. & Tommasino, C. A Retrieval-augmented Generation application for Question-Answering in Nutrigenetics Domain. Procedia Comput. Sci. 246, 586–595 (2024)
- [12] Ge, J. et al. Development of a liver disease–specific large language model chat interface using retrieval-augmented generation. Hepatology 80, 1158 (2024)
- [13] Zhang, Y. et al. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. Preprint at https://doi.org/10.48550/arXiv.2506.05176 (2025)
- [14] Nussbaum, Z., Morris, J. X., Duderstadt, B. & Mulyar, A. Nomic Embed: Training a Reproducible Long Context Text Embedder. Preprint at https://doi.org/10.48550/arXiv.2402.01613 (2025)
- [15] Wang, L. et al. Text Embeddings by Weakly-Supervised Contrastive Pre-training. Preprint at https://doi.org/10.48550/arXiv.2212.03533 (2024)
- [16] Alsentzer, E. et al. Publicly Available Clinical BERT Embeddings. Preprint at https://doi.org/10.48550/arXiv.1904.03323 (2019)
- [17] Tang, Y. & Yang, Y. Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models? Preprint at https://doi.org/10.48550/arXiv.2409.02727 (2024)
- [18] Vector Search | Vertex AI. Google Cloud Documentation. https://docs.cloud.google.com/vertex-ai/docs/vector-search/overview
- [19] Sun, P., Simcha, D., Dopson, D., Guo, R. & Kumar, S. SOAR: Improved Indexing for Approximate Nearest Neighbor Search. Preprint at https://doi.org/10.48550/arXiv.2404.00774 (2024)
- [20] Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025)
- [21] Mann, H. B. & Whitney, D. R. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Stat. 18, 50–60 (1947)
- [22] Fleiss, J. L. Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378–382 (1971). https://doi.org/10.1037/h0031619
- [23] Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 37–46 (1960)
- [24] Krippendorff, K. Content Analysis: An Introduction to Its Methodology. (SAGE Publications, Inc., 2019). doi:10.4135/9781071878781
- [25] Arcus. CHOP Research Institute. https://www.research.chop.edu/applications/arcus (2022)
- [26] Enevoldsen, K. et al. MMTEB: Massive Multilingual Text Embedding Benchmark. Preprint at https://doi.org/10.48550/arXiv.2502.13595 (2025)
- [27] Lee, C., Vogt, K. A. & Kumar, S. Prospects for AI clinical summarization to reduce the burden of patient chart review. Front. Digit. Health 6, 1475092 (2024)
- [28] Ostropolets, A. et al. Scalable and interpretable alternative to chart review for phenotype evaluation using standardized structured data from electronic health records. J. Am. Med. Inform. Assoc. 31, 119–129 (2024)
- [29] Goldschmidt, D. E. & Krishnamoorthy, M. Comparing keyword search to semantic search: a case study in solving crossword puzzles using the Google™ API. Softw. Pract. Exp. 38, 417–445 (2008)
- [30] Miller, T. P. et al. Automated Ascertainment of Typhlitis From the Electronic Health Record. JCO Clin. Cancer Inform. 6, e2200081 (2022)
- [31] Kuhn, L. & Eickhoff, C. Implicit Negative Feedback in Clinical Information Retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1243–1243 (2016)
Figures
Figure 1: System architecture for health system-scale semantic search. Clinical notes are extracted from the CHOP EHR database, c...