Training Models to Extract Treatment Plans from Clinical Notes Using Contents of Sections with Headings
Pith reviewed 2026-05-25 14:39 UTC · model grok-4.3
The pith
Models trained on sentences from plan-headed sections in clinical notes identify treatment plans across all notes with F-measures up to 0.97.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying common variations of plan headings and rule-based heuristics to 117,730 clinical notes, the authors extract 13,492 plan sentences as noisy training data. SVM and CNN models trained on this data achieve F-measures of 0.89 and 0.91 respectively under ten-fold cross validation on the noisy set, and 0.96 and 0.97 on a separate manually annotated evaluation set. The results establish that sections with informal plan headings supply effective training data for identifying treatment plans in every clinical note.
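For reference, the F-measure quoted throughout is taken here to be the balanced F1 score, the harmonic mean of precision and recall; the abstract does not say which variant was used, so that is an assumption. A minimal sketch with hypothetical counts (not taken from the paper):

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts chosen only for illustration.
score = f_measure(tp=97, fp=3, fn=3)
print(round(score, 2))  # prints 0.97
```

When precision and recall are equal, as in this made-up example, the F-measure equals both of them.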
What carries the argument
Rule-based location of sections via common plan heading variations, followed by sentence extraction to form noisy labeled training data for SVM and CNN classification of plan sentences.
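The heading-based extraction step might look roughly like the sketch below. The heading variants, the rule that any short line ending in a colon starts a new section, and the naive sentence splitter are all assumptions made for illustration; the paper's actual heading list and heuristics are not given in the abstract.

```python
import re
from typing import List

# Hypothetical plan-heading variants (assumption, not the paper's list).
PLAN_HEADING = re.compile(
    r"^\s*(?:assessment\s*(?:and|&|/)\s*plan|treatment\s+plan|plan|a/p)\s*:?\s*$",
    re.IGNORECASE,
)
# Assumed heuristic: a short line ending in a colon opens a new section.
ANY_HEADING = re.compile(r"^\s*[A-Za-z][A-Za-z /&]{0,40}:\s*$")
# Naive sentence boundary: whitespace that follows ., ! or ?
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def extract_plan_sentences(note: str) -> List[str]:
    """Return sentences found under plan-like headings in one note."""
    sentences: List[str] = []
    in_plan = False
    for line in note.splitlines():
        if PLAN_HEADING.match(line):
            in_plan = True       # entering a plan section
            continue
        if ANY_HEADING.match(line):
            in_plan = False      # any other heading closes the plan section
            continue
        if in_plan and line.strip():
            parts = SENTENCE_BREAK.split(line.strip())
            sentences.extend(p.strip() for p in parts if p.strip())
    return sentences


example_note = (
    "History:\n"
    "Patient reports knee pain.\n"
    "Plan:\n"
    "Start ibuprofen 400 mg. Follow up in 2 weeks.\n"
    "Medications:\n"
    "Lisinopril 10 mg daily.\n"
)
print(extract_plan_sentences(example_note))
```

Sentences collected this way would serve as the noisy positive examples; how negatives were sampled is not stated in the abstract, so it is left out of the sketch.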
If this is right
- Treatment plan sentences can be extracted from clinical notes that contain no explicit plan section.
- CNN models slightly outperform SVM for classifying plan sentences when trained on this noisy data.
- The approach eliminates the need to create expensive manually annotated gold-standard datasets for this task.
- Sections with informal headings in clinical notes can generate training data for other supervised clinical NLP tasks.
Where Pith is reading between the lines
- The same heading-based noisy labeling could be tried for extracting other structured elements such as medication lists or follow-up instructions.
- If the models maintain performance on notes from different institutions, they could support large-scale automated processing for care coordination without per-institution relabeling.
- The 13 percent coverage rate implies that combining data from multiple institutions would quickly produce much larger noisy training sets.
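The coverage figures can be put in rough proportion with a little arithmetic. The sketch below uses only the counts quoted in the abstract and treats "about 13%" as exactly 0.13, so the derived numbers are approximations rather than reported results.

```python
NOTES = 117_730          # clinical notes in the corpus
RECORDS = 1_001          # longitudinal patient records
PLAN_SENTENCES = 13_492  # noisy plan sentences extracted
COVERAGE = 0.13          # "about 13%" of notes, taken as exact here (assumption)

# Approximate number of notes with a recognizable plan heading.
notes_with_plan_heading = round(NOTES * COVERAGE)

# Average extracted sentences per headed note, under the same assumption.
sentences_per_headed_note = PLAN_SENTENCES / notes_with_plan_heading

# Average notes per patient record.
notes_per_record = NOTES / RECORDS

print(notes_with_plan_heading)
print(round(sentences_per_headed_note, 2))
print(round(notes_per_record, 1))
```

Under these assumptions the yield is a little under one extracted sentence per headed note, so the size of a pooled training set would depend on the yield of the heuristics as well as on heading coverage.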
Load-bearing premise
The rule-based heuristics using common heading variations accurately locate sections whose contents are predominantly treatment plans with limited contamination from other content.
What would settle it
A manual review finding that a substantial fraction of the extracted sentences from the identified plan sections do not describe treatment plans would show the training data is too contaminated for the reported model accuracy.
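One way such a review could be quantified is to label a random sample of the extracted sentences and put a confidence interval on the fraction that are true plan sentences. The sketch below uses a Wilson score interval; the sample size and counts are hypothetical, and nothing here comes from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical audit: 200 sampled sentences, 180 judged to be real plans.
low, high = wilson_interval(successes=180, n=200)
print(f"estimated purity: 0.90, 95% interval: [{low:.3f}, {high:.3f}]")
```

If the lower bound of such an interval were well below what the reported accuracy would require, that would support the contamination concern; a tight interval near 1.0 would count against it.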
Original abstract
Objective: Using natural language processing (NLP) to find sentences that state treatment plans in a clinical note, would automate plan extraction and would further enable their use in tools that help providers and care managers. However, as in the most NLP tasks on clinical text, creating gold standard to train and test NLP models is tedious and expensive. Fortuitously, sometimes but not always clinical notes contain sections with a heading that identifies the section as a plan. Leveraging contents of such labeled sections as a noisy training data, we assessed accuracy of NLP models trained with the data.
Methods: We used common variations of plan headings and rule-based heuristics to find plan sections with headings in clinical notes, and we extracted sentences from them and formed a noisy training data of plan sentences. We trained Support Vector Machine (SVM) and Convolutional Neural Network (CNN) models with the data. We measured accuracy of the trained models on the noisy dataset using ten-fold cross validation and separately on a set-aside manually annotated dataset.
Results: About 13% of 117,730 clinical notes contained treatment plans sections with recognizable headings in the 1001 longitudinal patient records that were obtained from Cleveland Clinic under an IRB approval. We were able to extract and create a noisy training data of 13,492 plan sentences from the clinical notes. CNN achieved best F measures, 0.91 and 0.97 in the cross-validation and set-aside evaluation experiments respectively. SVM slightly underperformed with F measures of 0.89 and 0.96 in the same experiments.
Conclusion: Our study showed that the training supervised learning models using noisy plan sentences was effective in identifying them in all clinical notes. More broadly, sections with informal headings in clinical notes can be a good source for generating effective training data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that rule-based heuristics based on common heading variations can reliably identify treatment-plan sections in clinical notes; sentences from these sections can then serve as noisy positive training data for SVM and CNN classifiers that identify plan sentences in arbitrary notes. On 13,492 sentences extracted from 13% of 117,730 Cleveland Clinic notes, the CNN reaches F-measures of 0.91 (10-fold CV on the noisy data) and 0.97 (held-out manually annotated set); SVM is slightly lower. The conclusion is that noisy supervision from headed sections is effective for plan-sentence identification.
Significance. If the noisy labels are sufficiently clean, the work demonstrates a low-cost route to large-scale supervision for a clinically useful extraction task. The use of a held-out manually annotated test set (rather than only noisy CV) is a clear methodological strength and supports the claim of generalization beyond the headed sections. The approach could reduce annotation burden in other clinical NLP settings where section headings provide weak labels.
major comments (2)
- [Methods (data extraction)] Methods (rule-based section identification): No quantitative audit of section purity is reported (e.g., manual review of a sample of the 13,492 sentences to measure the fraction that are actually treatment plans versus other content). This is load-bearing for the central claim; without it, high F-measures on both CV and held-out data could arise from models exploiting section-level cues present in the noisy training distribution rather than learning sentence-level plan semantics.
- [Results] Results and evaluation description: The manuscript provides no error analysis on the held-out manual set, no inter-annotator agreement for the manual annotations, and no count of how many candidate sections were excluded by the heuristics. These omissions leave open the possibility that performance reflects selection of easier cases.
minor comments (2)
- [Abstract] Abstract and Methods: Clarify whether the held-out manual test set was drawn exclusively from notes lacking recognizable plan headings or from the full population; this directly affects the interpretation of generalization to 'all clinical notes.'
- [Methods] The paper should report the exact list of heading variations and the full set of rule-based heuristics so that the extraction procedure is reproducible.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond point-by-point to the major comments below.
- Referee: [Methods (data extraction)] Methods (rule-based section identification): No quantitative audit of section purity is reported (e.g., manual review of a sample of the 13,492 sentences to measure the fraction that are actually treatment plans versus other content). This is load-bearing for the central claim; without it, high F-measures on both CV and held-out data could arise from models exploiting section-level cues present in the noisy training distribution rather than learning sentence-level plan semantics.
  Authors: We agree that an explicit quantitative audit of label purity would strengthen the paper. However, the held-out evaluation provides direct evidence against the concern raised. The test sentences were drawn from arbitrary notes and manually annotated without reference to headings; the CNN nonetheless reaches 0.97 F-measure. If the models were primarily exploiting cues tied to the headed-section distribution in training, generalization to this independent test distribution would be unlikely. We will revise the manuscript to discuss this point explicitly and to clarify how the held-out design supports the noisy-supervision claim.
  Revision: partial
- Referee: [Results] Results and evaluation description: The manuscript provides no error analysis on the held-out manual set, no inter-annotator agreement for the manual annotations, and no count of how many candidate sections were excluded by the heuristics. These omissions leave open the possibility that performance reflects selection of easier cases.
  Authors: We will add an error analysis of predictions on the held-out set in the revision. The annotations were performed by a single clinical expert; inter-annotator agreement was therefore not computed and we will state this limitation. The count of candidate sections excluded by the heuristics cannot be supplied, as this information was not recorded during the original extraction.
  Revision: partial
- Left unresolved by the rebuttal: the count of how many candidate sections were excluded by the heuristics.
Circularity Check
No significant circularity; purely empirical evaluation on independent held-out annotations
The paper generates noisy training labels via rule-based heading heuristics, trains SVM and CNN classifiers, and reports F-measures on both 10-fold CV of the noisy data and a separate manually annotated held-out set. No equations, parameter fits presented as predictions, self-citations, or uniqueness theorems appear. The held-out manual evaluation is independent of the heuristic labels, so the central claim does not reduce to its inputs by construction. This is the expected non-finding for an empirical ML study.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Clinical notes contain sections identifiable by common heading variations that reliably indicate treatment plans.