arxiv: 2506.11067 · v3 · pith:4OCFL2QFnew · submitted 2025-05-31 · 💻 cs.CL

A Large Language Model Based Pipeline for Review of Systems Entity Recognition from Clinical Notes

Pith reviewed 2026-05-19 11:29 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · clinical notes · entity recognition · review of systems · open-source LLMs · named entity recognition · healthcare documentation · attribution algorithm

The pith

Open-source LLMs extract Review of Systems entities from clinical notes with high accuracy in a local pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a pipeline that uses open-source large language models to extract Review of Systems (ROS) details from clinical notes: symptoms and diseases, their positive or negative status, and the associated body systems. The pipeline first identifies the ROS section, then prompts the models with a few worked examples (few-shot prompting), and finally applies a new attribution algorithm that matches each extraction back to the original text. A sympathetic reader would care because it offers a way to handle repetitive documentation more efficiently without relying on expensive or cloud-based services, potentially freeing up time for patient care. On the 24-note test set, performance is strong, and the matching step improves results for all four tested models.

Core claim

The authors establish that a pipeline combining section extraction with SecTag, few-shot prompting on open-source LLMs, and a novel attribution algorithm for aligning entities to source text enables effective recognition of ROS entities, negation status, and body systems, achieving a highest F1 score of 0.952 and consistent improvements across models including smaller ones.

What carries the argument

The LLM-based pipeline that first isolates the Review of Systems section using SecTag headers, then employs few-shot prompting on open-source models to detect entities along with their status and body systems, and uses a new attribution algorithm to link outputs back to the original text.
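
The attribution step is only described at a high level here, so the following is a minimal sketch of one way such an alignment could work, not the authors' algorithm. It assumes plain string similarity from Python's `difflib`; the function name `attribute`, the 0.8 threshold, and the n-gram window are hypothetical choices. Synonymous matches (for example "dyspnea" versus "shortness of breath") would need a lexicon or embedding step that is not shown.

```python
import re
from difflib import SequenceMatcher


def attribute(entity: str, source: str, threshold: float = 0.8):
    """Return (start, end, matched_text) for the best span in source, or None.

    Hypothetical helper: exact match first, then fuzzy match over word n-grams.
    """
    # Step 1: exact, case-insensitive match on word boundaries.
    exact = re.search(r"\b" + re.escape(entity) + r"\b", source, flags=re.IGNORECASE)
    if exact:
        return exact.start(), exact.end(), exact.group(0)

    # Step 2: fuzzy match against windows of roughly the same word count.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w[\w'-]*", source)]
    n = max(1, len(entity.split()))
    best, best_score = None, threshold
    for size in sorted({max(1, n - 1), n, n + 1}):
        for i in range(len(tokens) - size + 1):
            start, end = tokens[i][0], tokens[i + size - 1][1]
            candidate = source[start:end]
            score = SequenceMatcher(None, entity.lower(), candidate.lower()).ratio()
            if score > best_score:  # keep only spans above the threshold
                best, best_score = (start, end, candidate), score
    return best


# Made-up ROS text for illustration.
ros = "Denies fever, chills. Reports head aches and shortness of breath."
print(attribute("fever", ros))
print(attribute("headache", ros))
print(attribute("dyspnea", ros))
```

Returning character offsets rather than just a yes/no answer keeps every extracted entity traceable to the note, which is the property the pipeline description emphasizes.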

If this is right

  • Larger models demonstrate robust performance across entity extraction, negation detection, and body system classification.
  • The attribution algorithm increases F1 score and accuracy while reducing error rate for all models.
  • The smaller Llama model (llama3.1:8b) delivers promising results while using only about one-third the VRAM of the larger models.
  • The pipeline offers a scalable and locally deployable solution for reducing ROS documentation burden in healthcare.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This method could extend to other sections of clinical notes for broader automation of medical documentation.
  • Local open-source solutions address data privacy and cost barriers in adopting AI for clinical use.
  • Testing the pipeline on notes from varied medical specialties would help assess its broader applicability.

Load-bearing premise

The small set of 24 general medicine notes with 340 annotations is representative of typical clinical notes and sufficient to support the reported performance levels.

What would settle it

A substantial drop in F1 scores or accuracy when the pipeline is applied to a larger collection of clinical notes from multiple hospitals or different medical fields would indicate the results do not generalize.

Figures

Figures reproduced from arXiv: 2506.11067 by Abdulaziz Ahmed, Dursun Delen, Hemanth Reddy Singareddy, Hieu Nghiem, Jivan Lamichhane, Johnson Thomas, William Paiva, Zhuqi Miao.

Figure 1: An example of MTSamples notes: (A) The sample note in plain text format
Figure 2: Overview of the proposed pipeline
Figure 3: Accuracy of ROS status detection and body system classification for exactly/relaxedly
Original abstract

Objective: Develop a cost-effective, large language model (LLM)-based pipeline for automatically extracting Review of Systems (ROS) entities from clinical notes. Materials and Methods: The pipeline extracts ROS section from the clinical note using SecTag header terminology, followed by few-shot LLMs to identify ROS entities such as diseases or symptoms, their positive/negative status and associated body systems. We implemented the pipeline using 4 open-source LLM models: llama3.1:8b, gemma3:27b, mistral3.1:24b and gpt-oss:20b. Additionally, we introduced a novel attribution algorithm that aligns LLM-identified ROS entities with their source text, addressing non-exact and synonymous matches. The evaluation was conducted on 24 general medicine notes containing 340 annotated ROS entities. Results: Open-source LLMs enable a local, cost-efficient pipeline while delivering promising performance. Larger models like Gemma, Mistral, and Gpt-oss demonstrate robust performance across three entity recognition tasks of the pipeline: ROS entity extraction, negation detection and body system classification (highest F1 score = 0.952). With the attribution algorithm, all models show improvements across key performance metrics, including higher F1 score and accuracy, along with lower error rate. Notably, the smaller Llama model also achieved promising results despite using only one-third the VRAM of larger models. Discussion and Conclusion: From an application perspective, our pipeline provides a scalable, locally deployable solution to easing the ROS documentation burden. Open-source LLMs offer a practical AI option for resource-limited healthcare settings. Methodologically, our newly developed algorithm facilitates accuracy improvements for zero- and few-shot LLMs in named entity recognition.
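
To make the few-shot step concrete, here is a minimal sketch of building a prompt and validating a model's reply. The `extract` and `status` field names follow the JSON output example in the paper's supplementary prompt; everything else is an assumption for illustration: the instruction wording, the single worked example, and the helper names `build_prompt` and `parse_response`. No model call is shown, since the serving setup is not described on this page.

```python
import json

# Hypothetical few-shot example; the paper's actual examples are not reproduced here.
FEW_SHOT = [
    (
        "No fever. Complains of headache.",
        [
            {"extract": "fever", "status": "negative"},
            {"extract": "headache", "status": "positive"},
        ],
    ),
]


def build_prompt(ros_text: str) -> str:
    """Assemble instructions, worked examples, and the target ROS text."""
    parts = [
        "Extract every disease or symptom from the Review of Systems text.",
        'Answer with a JSON list of objects with keys "extract" and "status".',
    ]
    for text, entities in FEW_SHOT:
        parts.append(f"Text: {text}\nJSON: {json.dumps(entities)}")
    parts.append(f"Text: {ros_text}\nJSON:")
    return "\n\n".join(parts)


def parse_response(raw: str) -> list:
    """Keep only well-formed entities; tolerate chatter around the JSON list."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        items = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return []
    valid = []
    for item in items:
        if (
            isinstance(item, dict)
            and isinstance(item.get("extract"), str)
            and item.get("status") in {"positive", "negative"}
        ):
            valid.append({"extract": item["extract"].strip(), "status": item["status"]})
    return valid


# A made-up reply with surrounding text and one invalid status value.
reply = 'Sure: [{"extract": "fever", "status": "positive"}, {"extract": "GI", "status": "unknown"}]'
print(parse_response(reply))
```

Validating the reply before any downstream step is a defensive choice: smaller models in particular may wrap the JSON in extra text or emit labels outside the allowed set.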

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an LLM-based pipeline for extracting Review of Systems (ROS) entities from clinical notes. SecTag identifies the ROS section, after which four open-source LLMs (llama3.1:8b, gemma3:27b, mistral3.1:24b, gpt-oss:20b) are few-shot prompted to detect entities, negation status, and body-system classification. A novel attribution algorithm aligns LLM outputs with source text to handle non-exact and synonymous matches. Evaluation on 24 general-medicine notes containing 340 annotated entities reports F1 scores up to 0.952, with consistent gains from the attribution algorithm across tasks.

Significance. If the performance claims hold under broader validation, the work offers a practical, locally deployable, cost-efficient solution for automating ROS documentation using open-source models, which is valuable for resource-limited clinical settings. The attribution algorithm provides a methodological contribution for improving zero- and few-shot NER alignment. The emphasis on open-source LLMs and real-world applicability is a strength.

major comments (2)
  1. [Materials and Methods] Materials and Methods / Evaluation: The test collection is limited to 24 general-medicine notes and 340 annotations. No sampling criteria, stratification by note length or specialty, annotation protocol, or inter-annotator agreement statistics are reported. This small, single-site sample is load-bearing for the central claim that the pipeline delivers reliable performance (F1 = 0.952) and that the attribution algorithm produces generalizable improvements.
  2. [Results] Results: Performance is reported for three tasks (entity extraction, negation detection, body-system classification) but without statistical significance testing, confidence intervals, or error analysis. It is therefore unclear whether the observed gains from the attribution algorithm are robust or could be explained by the particular characteristics of the 24-note set.
minor comments (2)
  1. [Abstract] Abstract and Methods: Model names (e.g., 'gpt-oss:20b') should be clarified with exact Hugging Face or Ollama identifiers for reproducibility.
  2. The few-shot prompting details (number of examples, selection criteria, and prompt templates) are not fully specified; adding them would improve replicability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments have prompted us to improve the transparency of our evaluation and the rigor of our statistical reporting. We address each major comment below and indicate the revisions made to the manuscript.

  1. Referee: [Materials and Methods] Materials and Methods / Evaluation: The test collection is limited to 24 general-medicine notes and 340 annotations. No sampling criteria, stratification by note length or specialty, annotation protocol, or inter-annotator agreement statistics are reported. This small, single-site sample is load-bearing for the central claim that the pipeline delivers reliable performance (F1 = 0.952) and that the attribution algorithm produces generalizable improvements.

    Authors: We agree that greater detail on dataset construction is warranted. In the revised Materials and Methods, we now specify the sampling criteria (consecutive general-medicine admission notes selected from a single academic medical center's EHR during a defined 2023 period) and provide the full annotation protocol used by the board-certified internist who labeled the 340 entities. Stratification by note length or specialty was not applied because the study scope was restricted to typical general-medicine notes. We acknowledge the small, single-site sample as a genuine limitation and have expanded the Discussion to frame the work as a proof-of-concept study with explicit plans for future multi-site validation. Inter-annotator agreement statistics are unavailable because annotation was performed by a single expert; this is now stated as a limitation. revision: partial

  2. Referee: [Results] Results: Performance is reported for three tasks (entity extraction, negation detection, body-system classification) but without statistical significance testing, confidence intervals, or error analysis. It is therefore unclear whether the observed gains from the attribution algorithm are robust or could be explained by the particular characteristics of the 24-note set.

    Authors: We have strengthened the Results section by adding bootstrap-derived 95% confidence intervals for all reported F1 scores. Statistical significance of the attribution algorithm's improvements over the baseline prompting approach was evaluated with McNemar's test for paired binary outcomes; p-values are now reported for each of the three tasks. We have also inserted a dedicated error-analysis subsection that categorizes the remaining errors (boundary mismatches, negation-scope failures, and body-system misclassifications) and illustrates how the attribution step reduces each category. These additions indicate that the observed gains are consistent across error types rather than artifacts of the particular 24-note collection. revision: yes
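
For readers unfamiliar with the two procedures named in this response, the sketch below shows a percentile bootstrap for F1 and an exact two-sided McNemar test. It is illustrative only: resampling at the note level, the function names, and the toy counts are all assumptions, and none of the numbers come from the paper.

```python
import random
from math import comb


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(per_note_counts, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling whole notes with replacement.

    per_note_counts: list of (tp, fp, fn) tuples, one per note.
    """
    rng = random.Random(seed)
    n = len(per_note_counts)
    scores = []
    for _ in range(n_boot):
        sample = [per_note_counts[rng.randrange(n)] for _ in range(n)]
        tp = sum(c[0] for c in sample)
        fp = sum(c[1] for c in sample)
        fn = sum(c[2] for c in sample)
        scores.append(f1_from_counts(tp, fp, fn))
    scores.sort()
    lo_idx = min(n_boot - 1, int((alpha / 2) * n_boot))
    hi_idx = max(0, int((1 - alpha / 2) * n_boot) - 1)
    return scores[lo_idx], scores[hi_idx]


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant-pair counts.

    b: items correct only without the extra step; c: items correct only with it.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


# Toy per-note counts (made up), one (tp, fp, fn) tuple per note.
toy_counts = [(12, 1, 0), (15, 0, 2), (9, 2, 1), (14, 0, 0), (11, 1, 1)]
print(bootstrap_f1_ci(toy_counts, n_boot=500))
print(mcnemar_exact(b=1, c=9))
```

Resampling notes rather than individual entities respects the fact that entities within one note are not independent; with only 24 notes, intervals computed this way would be expected to be wide.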

Circularity Check

0 steps flagged

No circularity: empirical evaluation against independent manual annotations

The paper describes an LLM pipeline (SecTag section extraction + few-shot prompting + attribution algorithm) and reports F1/accuracy on 340 entities from 24 notes. All metrics are computed by direct comparison to held-out human annotations; no equations, fitted parameters renamed as predictions, self-citations, or ansatzes appear in the derivation chain. The attribution step is a post-processing heuristic whose gains are measured externally rather than defined into the evaluation. The work is therefore self-contained against its external benchmark and exhibits no reduction of claims to inputs by construction.
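
The comparison against human annotations can be illustrated with a small scoring sketch. The Figure 3 caption distinguishes exact from relaxed matching; the definitions used below (identical character spans versus any overlap) are a common convention and an assumption here, as are the greedy one-to-one matching, the function names, and the toy spans.

```python
def overlaps(a, b) -> bool:
    """True if half-open character spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def precision_recall_f1(gold, pred, relaxed=False):
    """Score predicted spans against gold spans.

    gold, pred: lists of (start, end) tuples.
    Exact mode requires identical spans; relaxed mode accepts any overlap.
    Each gold span can be matched at most once (greedy, in prediction order).
    """
    is_match = overlaps if relaxed else (lambda p, g: p == g)
    unmatched = list(gold)
    tp = 0
    for p in pred:
        for g in unmatched:
            if is_match(p, g):
                unmatched.remove(g)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy spans (made up): one exact hit, one partial overlap, one spurious prediction.
gold_spans = [(0, 5), (10, 18), (25, 30)]
pred_spans = [(0, 5), (10, 15), (40, 45)]
print(precision_recall_f1(gold_spans, pred_spans))
print(precision_recall_f1(gold_spans, pred_spans, relaxed=True))
```

Because the scorer only sees spans and gold labels, a post-processing step such as attribution can raise the score only by producing spans that agree better with the annotations, which is the sense in which its gains are measured externally.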

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The pipeline rests on the assumption that SecTag headers reliably locate ROS sections and that few-shot prompts transfer to the clinical domain; the only invented component is the attribution algorithm, which has no independent evidence outside this work.

axioms (1)
  • domain assumption: SecTag header terminology accurately identifies ROS sections in clinical notes
    First step of the pipeline; invoked without reported validation on the 24-note set.
invented entities (1)
  • Attribution algorithm (no independent evidence)
    purpose: Align LLM-identified ROS entities with source text for non-exact and synonymous matches
    Presented as novel; no external validation or falsifiable prediction supplied.
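
The domain assumption in the ledger is easy to test once header matching is written down. The sketch below locates an ROS section by header lookup; it is a rough illustration and does not use SecTag or its terminology. The two header lists, the "header followed by a colon at line start" format, and the function name `extract_ros_section` are all hypothetical.

```python
import re

# Hypothetical header variants; a real terminology would be far larger.
ROS_HEADERS = ["review of systems", "ros", "systems review"]
OTHER_HEADERS = [
    "physical examination",
    "physical exam",
    "past medical history",
    "medications",
    "assessment",
    "plan",
]


def _header_pattern(names):
    """Match any of the given header names at the start of a line, followed by ':'."""
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(rf"^\s*(?:{alternatives})\s*:", re.IGNORECASE | re.MULTILINE)


def extract_ros_section(note: str):
    """Return the text between the ROS header and the next known header, or None."""
    start = _header_pattern(ROS_HEADERS).search(note)
    if start is None:
        return None  # no ROS header found; the pipeline would have nothing to process
    following = _header_pattern(OTHER_HEADERS).search(note, start.end())
    end = following.start() if following else len(note)
    return note[start.end():end].strip()


# Made-up note for illustration.
note = (
    "CHIEF COMPLAINT: Cough.\n"
    "REVIEW OF SYSTEMS: Denies fever. Reports cough.\n"
    "PHYSICAL EXAMINATION: Lungs clear.\n"
)
print(extract_ros_section(note))
```

Counting how many of the 24 notes return `None`, or return text that runs past the true section boundary, would be one direct way to validate the first pipeline step.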

pith-pipeline@v0.9.0 · 5877 in / 1465 out tokens · 56284 ms · 2026-05-19T11:29:48.646259+00:00 · methodology
