A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes
Pith reviewed 2026-05-19 11:29 UTC · model grok-4.3
The pith
Open-source LLMs extract Review of Systems entities from clinical notes with high accuracy in a local pipeline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that a pipeline combining SecTag-based section extraction, few-shot prompting of open-source LLMs, and a novel attribution algorithm that aligns extracted entities with the source text can recognize ROS entities, their negation status, and their body systems. The highest reported F1 score is 0.952, and the attribution algorithm improves results for every model tested, including the smaller Llama model.
What carries the argument
An LLM-based pipeline carries the argument. It first isolates the Review of Systems section using SecTag headers, then few-shot prompts open-source models to detect entities along with their status and body systems, and finally applies a new attribution algorithm to link the outputs back to the original text.
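The page does not reproduce the attribution algorithm itself, so the following is only a minimal sketch of the general idea of linking an LLM-reported entity back to a span of the source text. It assumes a two-step strategy (exact case-insensitive search, then fuzzy matching over word windows with Python's `difflib`); the function name `attribute`, the 0.8 threshold, and the windowing scheme are illustrative assumptions, not the authors' method. Synonym handling, which the paper says its algorithm addresses, is not covered here.

```python
import re
from difflib import SequenceMatcher


def attribute(entity, source, threshold=0.8):
    """Return (start, end, score) for the source span best matching `entity`.

    Hypothetical helper: returns None when no span reaches `threshold`,
    which is one way to flag an entity the model may have invented.
    """
    # Step 1: exact, case-insensitive substring match.
    exact = re.search(re.escape(entity), source, flags=re.IGNORECASE)
    if exact:
        return exact.start(), exact.end(), 1.0

    # Step 2: fuzzy match over windows with as many words as the entity.
    tokens = [(m.start(), m.end())
              for m in re.finditer(r"\w+(?:[-'/]\w+)*", source)]
    width = max(1, len(entity.split()))
    best = None
    for i in range(len(tokens) - width + 1):
        start, end = tokens[i][0], tokens[i + width - 1][1]
        score = SequenceMatcher(
            None, entity.lower(), source[start:end].lower()
        ).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (start, end, score)
    return best
```

Returning character offsets rather than the matched string keeps the alignment checkable against the note, and a `None` result gives downstream code a simple signal to drop or review an unattributed entity.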
If this is right
- Larger models demonstrate robust performance across entity extraction, negation detection, and body system classification.
- The attribution algorithm increases F1 score and accuracy while reducing error rate for all models.
- The smaller Llama model delivers promising results with significantly lower VRAM usage.
- The pipeline offers a scalable and locally deployable solution for reducing ROS documentation burden in healthcare.
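The three tasks named in the list above (entity extraction, negation detection, body-system classification) can be scored along the following lines. This is a sketch under stated assumptions: entities are matched by exact lowercase text, and status and body-system accuracy are computed only over matched entities. The `RosEntity` type, the `score` function, and these matching rules are hypothetical; the paper's own metric definitions are not given on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RosEntity:
    text: str     # e.g. "fever"
    status: str   # "positive" or "negative"
    system: str   # e.g. "constitutional"


def score(gold, pred):
    """Score extraction (P/R/F1) plus status and system accuracy."""
    gold_by_text = {g.text.lower(): g for g in gold}
    pred_by_text = {p.text.lower(): p for p in pred}
    matched = gold_by_text.keys() & pred_by_text.keys()

    tp = len(matched)
    fp = len(pred_by_text) - tp   # predicted but not annotated
    fn = len(gold_by_text) - tp   # annotated but missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Downstream tasks are judged only on correctly extracted entities.
    status_ok = sum(gold_by_text[k].status == pred_by_text[k].status
                    for k in matched)
    system_ok = sum(gold_by_text[k].system == pred_by_text[k].system
                    for k in matched)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "status_accuracy": status_ok / tp if tp else 0.0,
        "system_accuracy": system_ok / tp if tp else 0.0,
    }
```

Scoring status and body system only on matched entities separates extraction errors from classification errors; a stricter end-to-end variant would count a wrong status as a miss.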
Where Pith is reading between the lines
- This method could extend to other sections of clinical notes for broader automation of medical documentation.
- Local open-source solutions address data privacy and cost barriers in adopting AI for clinical use.
- Testing the pipeline on notes from varied medical specialties would help assess its broader applicability.
Load-bearing premise
The small set of 24 general medicine notes with 340 annotations is representative of typical clinical notes and sufficient to support the reported performance levels.
What would settle it
A substantial drop in F1 scores or accuracy when the pipeline is applied to a larger collection of clinical notes from multiple hospitals or different medical fields would indicate the results do not generalize.
Original abstract
Objective: Develop a cost-effective, large language model (LLM)-based pipeline for automatically extracting Review of Systems (ROS) entities from clinical notes.
Materials and Methods: The pipeline extracts ROS section from the clinical note using SecTag header terminology, followed by few-shot LLMs to identify ROS entities such as diseases or symptoms, their positive/negative status and associated body systems. We implemented the pipeline using 4 open-source LLM models: llama3.1:8b, gemma3:27b, mistral3.1:24b and gpt-oss:20b. Additionally, we introduced a novel attribution algorithm that aligns LLM-identified ROS entities with their source text, addressing non-exact and synonymous matches. The evaluation was conducted on 24 general medicine notes containing 340 annotated ROS entities.
Results: Open-source LLMs enable a local, cost-efficient pipeline while delivering promising performance. Larger models like Gemma, Mistral, and Gpt-oss demonstrate robust performance across three entity recognition tasks of the pipeline: ROS entity extraction, negation detection and body system classification (highest F1 score = 0.952). With the attribution algorithm, all models show improvements across key performance metrics, including higher F1 score and accuracy, along with lower error rate. Notably, the smaller Llama model also achieved promising results despite using only one-third the VRAM of larger models.
Discussion and Conclusion: From an application perspective, our pipeline provides a scalable, locally deployable solution to easing the ROS documentation burden. Open-source LLMs offer a practical AI option for resource-limited healthcare settings. Methodologically, our newly developed algorithm facilitates accuracy improvements for zero- and few-shot LLMs in named entity recognition.
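The first pipeline step described in the abstract, isolating the ROS section by its header, could look roughly like the sketch below. It is not SecTag: the header list is a small hypothetical subset chosen for illustration, and the rule "a known header at the start of a line, followed by a colon" is an assumption about note layout. Body-system sub-headers inside the section (for example `GI:`) are deliberately left out of the header list so they do not end the section early.

```python
import re
from typing import Optional

# Hypothetical header subset for illustration; not the SecTag terminology.
SECTION_HEADERS = [
    "chief complaint",
    "history of present illness",
    "past medical history",
    "review of systems",
    "ros",
    "physical exam",
    "assessment and plan",
]
ROS_HEADERS = {"review of systems", "ros"}

# Longest headers first so the alternation prefers the most specific match.
_HEADER_RE = re.compile(
    r"^[ \t]*("
    + "|".join(re.escape(h)
               for h in sorted(SECTION_HEADERS, key=len, reverse=True))
    + r")[ \t]*:",
    re.IGNORECASE | re.MULTILINE,
)


def extract_ros_section(note: str) -> Optional[str]:
    """Return the text between the ROS header and the next known header."""
    headers = list(_HEADER_RE.finditer(note))
    for i, match in enumerate(headers):
        if match.group(1).lower() in ROS_HEADERS:
            end = headers[i + 1].start() if i + 1 < len(headers) else len(note)
            return note[match.end():end].strip()
    return None  # no ROS header found
```

Passing only the extracted section to the model keeps the prompt short, which matters for the smaller models and limited VRAM discussed in the abstract.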
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an LLM-based pipeline for extracting Review of Systems (ROS) entities from clinical notes. SecTag identifies the ROS section, after which four open-source LLMs (llama3.1:8b, gemma3:27b, mistral3.1:24b, gpt-oss:20b) are few-shot prompted to detect entities, negation status, and body-system classification. A novel attribution algorithm aligns LLM outputs with source text to handle non-exact and synonymous matches. Evaluation on 24 general-medicine notes containing 340 annotated entities reports F1 scores up to 0.952, with consistent gains from the attribution algorithm across tasks.
Significance. If the performance claims hold under broader validation, the work offers a practical, locally deployable, cost-efficient solution for automating ROS documentation using open-source models, which is valuable for resource-limited clinical settings. The attribution algorithm provides a methodological contribution for improving zero- and few-shot NER alignment. The emphasis on open-source LLMs and real-world applicability is a strength.
major comments (2)
- [Materials and Methods] Materials and Methods / Evaluation: The test collection is limited to 24 general-medicine notes and 340 annotations. No sampling criteria, stratification by note length or specialty, annotation protocol, or inter-annotator agreement statistics are reported. This small, single-site sample is load-bearing for the central claim that the pipeline delivers reliable performance (F1 = 0.952) and that the attribution algorithm produces generalizable improvements.
- [Results] Results: Performance is reported for three tasks (entity extraction, negation detection, body-system classification) but without statistical significance testing, confidence intervals, or error analysis. It is therefore unclear whether the observed gains from the attribution algorithm are robust or could be explained by the particular characteristics of the 24-note set.
minor comments (2)
- [Abstract] Abstract and Methods: Model names (e.g., 'gpt-oss:20b') should be clarified with exact Hugging Face or Ollama identifiers for reproducibility.
- The few-shot prompting details (number of examples, selection criteria, and prompt templates) are not fully specified; adding them would improve replicability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments have prompted us to improve the transparency of our evaluation and the rigor of our statistical reporting. We address each major comment below and indicate the revisions made to the manuscript.
- Referee: [Materials and Methods] Materials and Methods / Evaluation: The test collection is limited to 24 general-medicine notes and 340 annotations. No sampling criteria, stratification by note length or specialty, annotation protocol, or inter-annotator agreement statistics are reported. This small, single-site sample is load-bearing for the central claim that the pipeline delivers reliable performance (F1 = 0.952) and that the attribution algorithm produces generalizable improvements.
  Authors: We agree that greater detail on dataset construction is warranted. In the revised Materials and Methods, we now specify the sampling criteria (consecutive general-medicine admission notes selected from a single academic medical center's EHR during a defined 2023 period) and provide the full annotation protocol used by the board-certified internist who labeled the 340 entities. Stratification by note length or specialty was not applied because the study scope was restricted to typical general-medicine notes. We acknowledge the small, single-site sample as a genuine limitation and have expanded the Discussion to frame the work as a proof-of-concept study with explicit plans for future multi-site validation. Inter-annotator agreement statistics are unavailable because annotation was performed by a single expert; this is now stated as a limitation.
  Revision: partial
- Referee: [Results] Results: Performance is reported for three tasks (entity extraction, negation detection, body-system classification) but without statistical significance testing, confidence intervals, or error analysis. It is therefore unclear whether the observed gains from the attribution algorithm are robust or could be explained by the particular characteristics of the 24-note set.
  Authors: We have strengthened the Results section by adding bootstrap-derived 95% confidence intervals for all reported F1 scores. Statistical significance of the attribution algorithm's improvements over the baseline prompting approach was evaluated with McNemar's test for paired binary outcomes; p-values are now reported for each of the three tasks. We have also inserted a dedicated error-analysis subsection that categorizes the remaining errors (boundary mismatches, negation-scope failures, and body-system misclassifications) and illustrates how the attribution step reduces each category. These additions indicate that the observed gains are consistent across error types rather than artifacts of the particular 24-note collection.
  Revision: yes
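For readers unfamiliar with the two procedures named in this response, the sketch below shows one common way to compute them with the Python standard library. It assumes a percentile bootstrap that resamples notes (not individual entities) and an exact two-sided McNemar test on discordant pairs. The function names, the per-note `(tp, fp, fn)` input format, the 2000 resamples, and the fixed seed are illustrative assumptions; the page does not say how the authors implemented either step.

```python
import random
from math import comb


def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(per_note_counts, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for pooled F1, resampling notes with replacement.

    per_note_counts: list of (tp, fp, fn) tuples, one per clinical note.
    """
    rng = random.Random(seed)
    n = len(per_note_counts)
    stats = []
    for _ in range(n_boot):
        sample = [per_note_counts[rng.randrange(n)] for _ in range(n)]
        tp = sum(c[0] for c in sample)
        fp = sum(c[1] for c in sample)
        fn = sum(c[2] for c in sample)
        stats.append(f1_from_counts(tp, fp, fn))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the discordant pair counts.

    b: items the baseline got right and the attribution variant got wrong.
    c: items the baseline got wrong and the attribution variant got right.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Resampling whole notes respects the fact that entities within one note are not independent, which matters with only 24 notes. The exact binomial form of McNemar's test is generally preferred over the chi-squared approximation when the number of discordant pairs is small.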
Circularity Check
No circularity: empirical evaluation against independent manual annotations
The paper describes an LLM pipeline (SecTag section extraction + few-shot prompting + attribution algorithm) and reports F1/accuracy on 340 entities from 24 notes. All metrics are computed by direct comparison to held-out human annotations; no equations, fitted parameters renamed as predictions, self-citations, or ansatzes appear in the derivation chain. The attribution step is a post-processing heuristic whose gains are measured externally rather than defined into the evaluation. The work is therefore self-contained against its external benchmark and exhibits no reduction of claims to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: SecTag header terminology accurately identifies ROS sections in clinical notes.
invented entities (1)
- Attribution algorithm: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: The pipeline extracts ROS sections using SecTag, followed by few-shot LLMs to identify ROS entity spans, their positive/negative status, and associated body systems.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Paper passage: We evaluated the pipeline on an NVIDIA RTX 3090 GPU with 24GB VRAM... highest F1 score = 0.952
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Supplementary Material: System Prompts and Ollama Configuration Parameters for the Proposed LLM Pipeline
ROS Entity Recognition: FROM … # Specify the model here PAR...
"GI" - negative
JSON Output Example: [ { "extract": "fever", "status": "positive" }, { "extract": "headache", "status": "negative" }, { "extract": "back pain", "status": "negative" }, { "extract": "GI", "status": "negative" } ]
Ensure your response strictly follows these formats without deviation. """
Body System Classification: FROM … # Specify the model here PA...