Training Models to Extract Treatment Plans from Clinical Notes Using Contents of Sections with Headings

Ananya Poddar; Bharath Dandala; Murthy Devarakonda

arxiv: 1906.11930 · v1 · pith:RUAQVXOPnew · submitted 2019-06-27 · 💻 cs.CL

Pith reviewed 2026-05-25 14:39 UTC · model grok-4.3

classification 💻 cs.CL

keywords clinical notes · treatment plans · noisy training data · section headings · support vector machine · convolutional neural network · natural language processing · plan extraction

The pith

Models trained on sentences from plan-headed sections in clinical notes identify treatment plans across all notes with F-measures up to 0.97.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to automate extraction of treatment plan sentences from clinical notes by treating sections with recognizable plan headings as a source of noisy training labels. Rule-based heuristics identify those sections in about 13 percent of notes, yielding 13,492 sentences that are used to train SVM and CNN classifiers without any manual gold-standard annotation. The trained models then locate plan sentences even in notes that lack explicit headings. A sympathetic reader would care because the method sidesteps the high cost of creating labeled clinical data while supporting downstream tools for providers and care managers. The central demonstration is that the noisy data proves sufficient for high-accuracy classification on both cross-validation and held-out manually checked sets.

Core claim

By applying common variations of plan headings and rule-based heuristics to 117,730 clinical notes, the authors extract 13,492 plan sentences as noisy training data. SVM and CNN models trained on this data achieve F-measures of 0.89 and 0.91 respectively under ten-fold cross validation on the noisy set, and 0.96 and 0.97 on a separate manually annotated evaluation set. The results establish that sections with informal plan headings supply effective training data for identifying treatment plans in every clinical note.
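For readers who want the metric spelled out, here is a minimal sketch of how an F-measure can be computed from binary sentence labels. It assumes the balanced F1 form (beta = 1), which the abstract does not state explicitly; the function name `f_measure` and the toy labels are illustrative and not taken from the paper.

```python
from typing import Sequence


def f_measure(gold: Sequence[int], predicted: Sequence[int], beta: float = 1.0) -> float:
    """F-measure for binary labels (1 = plan sentence, 0 = other).

    Assumption: beta = 1 gives the balanced F1 score; the paper's exact
    variant is not stated in the text reproduced here.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # plan, predicted plan
    fp = sum(1 for g, p in pairs if not g and p)    # other, predicted plan
    fn = sum(1 for g, p in pairs if g and not p)    # plan, predicted other
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    gold = [1, 1, 1, 0, 0]
    predicted = [1, 1, 0, 1, 0]
    print(f_measure(gold, predicted))
```

In the toy example, precision and recall are both 2/3, so the balanced F-measure is also 2/3.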

What carries the argument

Rule-based location of sections via common plan heading variations, followed by sentence extraction to form noisy labeled training data for SVM and CNN classification of plan sentences.
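The heading-based labeling step can be sketched roughly as follows. This is a hedged illustration, not the authors' procedure: the heading variants in `PLAN_HEADING`, the rule that any short line ending in a colon starts a new section, and the naive sentence splitter are all assumptions, since the paper's actual heading list and heuristics are not reproduced here.

```python
import re
from typing import List

# Hypothetical plan-heading variants; the paper's real list is not given here.
PLAN_HEADING = re.compile(
    r"^\s*(assessment\s*(and|&|/)\s*plan|treatment\s+plan|plan|a\s*/\s*p)\s*:?\s*$",
    re.IGNORECASE,
)
# Assumption: any other short line ending in a colon opens a different section.
ANY_HEADING = re.compile(r"^\s*[A-Za-z][A-Za-z /&]{0,40}:\s*$")
# Naive splitter: break after ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def extract_plan_sentences(note: str) -> List[str]:
    """Collect sentences that sit under a plan-like heading in one note."""
    plan_lines: List[str] = []
    in_plan = False
    for line in note.splitlines():
        if PLAN_HEADING.match(line):
            in_plan = True            # entering a plan section
        elif ANY_HEADING.match(line):
            in_plan = False           # some other section begins
        elif in_plan and line.strip():
            plan_lines.append(line.strip())
    text = " ".join(plan_lines)
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


if __name__ == "__main__":
    # Invented note text, for illustration only.
    sample = (
        "History of Present Illness:\n"
        "Patient reports chest pain for two days.\n"
        "Assessment and Plan:\n"
        "Start aspirin 81 mg daily. Follow up in two weeks.\n"
        "Medications:\n"
        "Lisinopril 10 mg.\n"
    )
    for sentence in extract_plan_sentences(sample):
        print(sentence)
```

Sentences returned by such a function would serve as noisy positive examples; how negative examples are chosen is a separate design decision that this sketch leaves open.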

If this is right

Treatment plan sentences can be extracted from clinical notes that contain no explicit plan section.
CNN models slightly outperform SVM for classifying plan sentences when trained on this noisy data.
The approach eliminates the need to create expensive manually annotated gold-standard datasets for this task.
Sections with informal headings in clinical notes can generate training data for other supervised clinical NLP tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same heading-based noisy labeling could be tried for extracting other structured elements such as medication lists or follow-up instructions.
If the models maintain performance on notes from different institutions, they could support large-scale automated processing for care coordination without per-institution relabeling.
The 13 percent coverage rate implies that combining data from multiple institutions would quickly produce much larger noisy training sets.

Load-bearing premise

The rule-based heuristics using common heading variations accurately locate sections whose contents are predominantly treatment plans with limited contamination from other content.

What would settle it

A manual review finding that a substantial fraction of the extracted sentences from the identified plan sections do not describe treatment plans would show the training data is too contaminated for the reported model accuracy.
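One way such a review could be set up is sketched below: draw a reproducible random sample of extracted sentences, have a reviewer mark each as plan or not, and report the observed purity with a Wilson score interval. The sample size of 200 and the count of 180 are made-up placeholders, and the helper names are hypothetical; nothing here comes from the paper.

```python
import math
import random
from typing import List, Sequence, Tuple


def sample_for_review(sentences: Sequence[str], n: int, seed: int = 0) -> List[str]:
    """Draw a reproducible random sample of sentences for manual review."""
    rng = random.Random(seed)
    return rng.sample(list(sentences), min(n, len(sentences)))


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Placeholder corpus standing in for the extracted sentences.
    corpus = [f"sentence {i}" for i in range(13492)]
    batch = sample_for_review(corpus, 200)
    # Hypothetical outcome: a reviewer marks 180 of the 200 as true plans.
    low, high = wilson_interval(180, len(batch))
    print(f"estimated purity between {low:.3f} and {high:.3f}")
```

A narrow interval near 1.0 would support the premise; a low or wide interval would indicate that the labels need cleaning or that a larger sample is needed before drawing conclusions.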

Figures

Figures reproduced from arXiv: 1906.11930 by Ananya Poddar, Bharath Dandala, Murthy Devarakonda.

**Figure 2.** The elements of the method are described below. NLP Stack: As a part of a larger ongoing project, a separate set of algorithms and software have been developed to carry out the basic natural language processing (i.e. tokenization, segmentation, parsing) and clinical concept extraction. The output of this software stack is similar to contemporary packages such as cTAKES [10] and Metamap [11], including name…

Original abstract

Objective: Using natural language processing (NLP) to find sentences that state treatment plans in a clinical note, would automate plan extraction and would further enable their use in tools that help providers and care managers. However, as in the most NLP tasks on clinical text, creating gold standard to train and test NLP models is tedious and expensive. Fortuitously, sometimes but not always clinical notes contain sections with a heading that identifies the section as a plan. Leveraging contents of such labeled sections as a noisy training data, we assessed accuracy of NLP models trained with the data.

Methods: We used common variations of plan headings and rule-based heuristics to find plan sections with headings in clinical notes, and we extracted sentences from them and formed a noisy training data of plan sentences. We trained Support Vector Machine (SVM) and Convolutional Neural Network (CNN) models with the data. We measured accuracy of the trained models on the noisy dataset using ten-fold cross validation and separately on a set-aside manually annotated dataset.

Results: About 13% of 117,730 clinical notes contained treatment plans sections with recognizable headings in the 1001 longitudinal patient records that were obtained from Cleveland Clinic under an IRB approval. We were able to extract and create a noisy training data of 13,492 plan sentences from the clinical notes. CNN achieved best F measures, 0.91 and 0.97 in the cross-validation and set-aside evaluation experiments respectively. SVM slightly underperformed with F measures of 0.89 and 0.96 in the same experiments.

Conclusion: Our study showed that the training supervised learning models using noisy plan sentences was effective in identifying them in all clinical notes. More broadly, sections with informal headings in clinical notes can be a good source for generating effective training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Training on noisy plan sentences from headings gives strong numbers on manual data but the paper provides no check on how much noise is actually in the labels.

The main things to know are that this work gets high F-measures by training on sentences from headed plan sections, and that the lack of any purity audit on those sections leaves the interpretation open.

They used heading variations and heuristics to identify plan sections in clinical notes from Cleveland Clinic. Only 13% of the 117,730 notes had such sections, yielding 13,492 plan sentences for noisy training. SVM and CNN models were trained, with the CNN reaching an F-measure of 0.91 in 10-fold CV on the noisy data and 0.97 on a set-aside manually annotated set. The SVM was close behind at 0.89 and 0.96.

This is a practical way to generate training data without full manual labeling, which is the main contribution. The use of a held-out manual evaluation set strengthens the case that the approach works for identifying plans in notes.

The soft spot is label purity. No quantitative check is reported on how many sentences inside the headed sections are actually treatment plans versus other material. If the heuristics pull in mixed content, the models could achieve those scores by learning section context rather than plan semantics. The abstract also skips error analysis and details on how many notes were filtered out.

This is for clinical NLP teams looking for low-cost labeling methods. It is an incremental but grounded engineering paper with numbers on real data. It deserves a serious referee to sort out the label quality questions. Recommendation: send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper claims that rule-based heuristics based on common heading variations can reliably identify treatment-plan sections in clinical notes; sentences from these sections can then serve as noisy positive training data for SVM and CNN classifiers that identify plan sentences in arbitrary notes. On 13,492 sentences extracted from 13% of 117,730 Cleveland Clinic notes, the CNN reaches F-measures of 0.91 (10-fold CV on the noisy data) and 0.97 (held-out manually annotated set); SVM is slightly lower. The conclusion is that noisy supervision from headed sections is effective for plan-sentence identification.

Significance. If the noisy labels are sufficiently clean, the work demonstrates a low-cost route to large-scale supervision for a clinically useful extraction task. The use of a held-out manually annotated test set (rather than only noisy CV) is a clear methodological strength and supports the claim of generalization beyond the headed sections. The approach could reduce annotation burden in other clinical NLP settings where section headings provide weak labels.
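The ten-fold cross-validation mentioned in the summary can be illustrated with a small index-splitting helper. This is a generic sketch of k-fold splitting, not the authors' evaluation code; the function name `k_fold_splits` and the shuffling seed are assumptions. Note that splitting at the sentence level, as done here, lets sentences from the same note land in both training and test folds, which is one reason the separate held-out set matters.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(n_items: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Each item index appears in exactly one test fold.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


if __name__ == "__main__":
    # 13,492 is the number of noisy plan sentences reported in the abstract.
    for fold_no, (train, test) in enumerate(k_fold_splits(13492, k=10), start=1):
        print(fold_no, len(train), len(test))
```

A model would be trained on each `train` split and scored on the matching `test` split, with the per-fold F-measures averaged at the end.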

major comments (2)

[Methods (data extraction)] Methods (rule-based section identification): No quantitative audit of section purity is reported (e.g., manual review of a sample of the 13,492 sentences to measure the fraction that are actually treatment plans versus other content). This is load-bearing for the central claim; without it, high F-measures on both CV and held-out data could arise from models exploiting section-level cues present in the noisy training distribution rather than learning sentence-level plan semantics.
[Results] Results and evaluation description: The manuscript provides no error analysis on the held-out manual set, no inter-annotator agreement for the manual annotations, and no count of how many candidate sections were excluded by the heuristics. These omissions leave open the possibility that performance reflects selection of easier cases.

minor comments (2)

[Abstract] Abstract and Methods: Clarify whether the held-out manual test set was drawn exclusively from notes lacking recognizable plan headings or from the full population; this directly affects the interpretation of generalization to 'all clinical notes.'
[Methods] The paper should report the exact list of heading variations and the full set of rule-based heuristics so that the extraction procedure is reproducible.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We respond point-by-point to the major comments below.

Referee: [Methods (data extraction)] Methods (rule-based section identification): No quantitative audit of section purity is reported (e.g., manual review of a sample of the 13,492 sentences to measure the fraction that are actually treatment plans versus other content). This is load-bearing for the central claim; without it, high F-measures on both CV and held-out data could arise from models exploiting section-level cues present in the noisy training distribution rather than learning sentence-level plan semantics.

Authors: We agree that an explicit quantitative audit of label purity would strengthen the paper. However, the held-out evaluation provides direct evidence against the concern raised. The test sentences were drawn from arbitrary notes and manually annotated without reference to headings; the CNN nonetheless reaches 0.97 F-measure. If the models were primarily exploiting cues tied to the headed-section distribution in training, generalization to this independent test distribution would be unlikely. We will revise the manuscript to discuss this point explicitly and to clarify how the held-out design supports the noisy-supervision claim. revision: partial
Referee: [Results] Results and evaluation description: The manuscript provides no error analysis on the held-out manual set, no inter-annotator agreement for the manual annotations, and no count of how many candidate sections were excluded by the heuristics. These omissions leave open the possibility that performance reflects selection of easier cases.

Authors: We will add an error analysis of predictions on the held-out set in the revision. The annotations were performed by a single clinical expert; inter-annotator agreement was therefore not computed and we will state this limitation. The count of candidate sections excluded by the heuristics cannot be supplied, as this information was not recorded during the original extraction. revision: partial

standing simulated objections not resolved

The count of how many candidate sections were excluded by the heuristics

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation on independent held-out annotations

The paper generates noisy training labels via rule-based heading heuristics, trains SVM and CNN classifiers, and reports F-measures on both 10-fold CV of the noisy data and a separate manually annotated held-out set. No equations, parameter fits presented as predictions, self-citations, or uniqueness theorems appear. The held-out manual evaluation is independent of the heuristic labels, so the central claim does not reduce to its inputs by construction. This is the expected non-finding for an empirical ML study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that headed sections supply usable (if noisy) labels; no free parameters beyond standard classifier training and no new invented entities are introduced.

axioms (1)

domain assumption: Clinical notes contain sections identifiable by common heading variations that reliably indicate treatment plans.
Invoked to justify rule-based extraction of noisy training sentences from 13% of notes.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

Thumbs up? Sentiment Classification using Machine Learning Techniques,

B. Pang, L. Lee and S. Vaithyanathan, "Thumbs up? Sentiment Classification using Machine Learning Techniques," in Proc. of the ACL-02 conference on Empirical methods in natural language processing - Volume 10 (EMNLP '02), Stroudsburg, PA, USA, 2002. [4] M. Mintz, S. B. Bills, R. Snow and D. Jurafsky, "Distant supervision for relation extraction without la...

work page 2002
[2]

Section classification in clinical notes using supervised hidden markov model,

Y. Li, S. Lipsky Gorman and N. Elhadad, "Section classification in clinical notes using supervised hidden markov model," in Proceedings of the 1st ACM International Health Informatics Symposium, Arlington, VA. USA, 2010. [15] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, Sept 1995. [16] G. Sidorov, F. V...

work page 2010
[3]

A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts,

B. Pang and L. Lee, "A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts," in ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, Spain, 2004. [28] S. I. Wang and C. D. Manning, "Fast dropout training," in Proceedings of the 30th International Conference on...

work page 2004
