arxiv: 2602.17513 · v2 · submitted 2026-02-19 · 💻 cs.CL

Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

Pith reviewed 2026-05-15 20:51 UTC · model grok-4.3

keywords: clinical section segmentation · supervised models · zero-shot learning · domain adaptation · MIMIC-III · obstetrics · large language models · hallucinations

The pith

Supervised clinical section segmentation models drop in performance when moving from MIMIC-III to obstetrics notes, while zero-shot models remain robust after correcting for hallucinated headers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Clinical free-text notes are structured into sections that aid decision-making and downstream NLP tasks. This paper introduces a new labeled obstetrics dataset to expand beyond the MIMIC-III corpus where most models are trained. Supervised transformer models achieve strong results on in-domain MIMIC-III data but decline sharply on the new obstetrics notes. Zero-shot large language models show better out-of-domain adaptability once any hallucinated section headers are corrected. The work highlights the need for domain-specific clinical resources and positions zero-shot segmentation as a viable path for broader healthcare NLP use.

Core claim

While supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected.

What carries the argument

Head-to-head comparison of supervised transformer-based models and zero-shot large language models for clinical section segmentation, using a curated MIMIC-III subset and a new obstetrics dataset.

Load-bearing premise

That the new obstetrics dataset is representative of the broader domain, and that manual correction of hallucinations provides a fair, scalable basis for comparing model performance.

What would settle it

A test on the obstetrics dataset or another out-of-domain clinical corpus where zero-shot models continue to underperform supervised models even after hallucination correction.

Figures

Figures reproduced from arXiv: 2602.17513 by Barbara Di Eugenio, Baris Karacan, Patrick Thornton.

Figure 1: Assessment and Plan section from a sample obstetrics note (includes typographical errors and masked identifier tokens).
Figure 2: Zero-shot prompt snippet for Llama Instruct models. The candidate label set corresponds to the 30 section headers defined in the ONC dataset; MedSecId uses a larger schema with 51 headers. Prompt Engineering. We adopt an instruction-style prompt to assign section labels to each line in a clinical note, without any task-specific fine-tuning. All four models are chat-based and support system/user prompting…
Figure 3: Proportional distribution of section label
Figure 4: Zero-shot prompt snippet for Llama Instruct models
Original abstract

Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper curates a new de-identified obstetrics notes dataset, evaluates supervised transformer models on a MIMIC-III subset (in-domain) and the new dataset (out-of-domain), and performs the first head-to-head comparison against zero-shot LLMs for clinical section segmentation. It claims supervised models perform strongly in-domain but drop substantially out-of-domain, while zero-shot models show robust out-of-domain adaptability after manual correction of hallucinated section headers.

Significance. If the central comparison holds after clarification, the work provides a useful new domain-specific resource and initial evidence that zero-shot approaches may offer better cross-domain generalization in clinical NLP than supervised models trained on MIMIC-III, provided hallucinations are managed. The head-to-head empirical evaluation on a previously under-represented obstetrics domain is a clear strength.

major comments (2)
  1. [Abstract] Abstract and Results: the headline claim that zero-shot models 'demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected' is load-bearing for the supervised-vs-zero-shot comparison, yet no protocol is described for identifying hallucinations, the correction process, inter-annotator agreement, or the hallucination rate on the obstetrics set. Supervised models receive no equivalent post-processing, so any performance gap may reflect annotator effort rather than model capability.
  2. [Evaluation] Evaluation section: the abstract reports clear performance differences and robustness claims but omits exact metrics (e.g., F1, precision/recall), statistical tests, baseline implementation details, and how the obstetrics test set was constructed, leaving the quantitative support for the in-domain vs. out-of-domain drop only partially documented.
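The metrics the referee asks for can be illustrated with a small sketch. The extracted appendix text further down this page says the paper computes precision, recall, macro-F1, and weighted-F1; the code below is only one plausible way to compute them over line-level labels and is not taken from the paper. The function names, the choice to average over gold labels only, and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def per_label_prf(gold, pred):
    """Return {label: (precision, recall, f1, support)} for line-level labels.

    Assumption: only labels that occur in the gold annotation are scored.
    """
    assert len(gold) == len(pred), "one predicted label per gold line"
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong for this line
            fn[g] += 1  # gold label g was missed for this line
    scores = {}
    for label in sorted(set(gold)):
        predicted_count = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, support)
    return scores


def macro_and_weighted_f1(gold, pred):
    """Macro-F1 is the unweighted mean; weighted-F1 weights by gold support."""
    scores = per_label_prf(gold, pred)
    total = sum(s[3] for s in scores.values())
    macro = sum(s[2] for s in scores.values()) / len(scores)
    weighted = sum(s[2] * s[3] for s in scores.values()) / total
    return macro, weighted


# Toy example: four note lines, two section labels (hypothetical data).
gold = ["hpi", "hpi", "hpi", "assessment-and-plan"]
pred = ["hpi", "hpi", "assessment-and-plan", "assessment-and-plan"]
macro_f1, weighted_f1 = macro_and_weighted_f1(gold, pred)
```

Macro-F1 treats rare and frequent sections equally, while weighted-F1 tracks the frequent ones, so reporting both makes it easier to see whether an out-of-domain drop is concentrated in rare section types.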

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the clarity and rigor of our work. We will revise the manuscript to provide the requested protocol details and quantitative metrics while preserving the core empirical findings.

  1. Referee: [Abstract] Abstract and Results: the headline claim that zero-shot models 'demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected' is load-bearing for the supervised-vs-zero-shot comparison, yet no protocol is described for identifying hallucinations, the correction process, inter-annotator agreement, or the hallucination rate on the obstetrics set. Supervised models receive no equivalent post-processing, so any performance gap may reflect annotator effort rather than model capability.

    Authors: We agree that the hallucination identification and correction protocol requires explicit description. In the revised manuscript we will add a dedicated paragraph in the Evaluation section that defines hallucinations (model outputs containing section headers absent from the source note or with incorrect boundaries/content), outlines the manual review process performed by two clinical annotators, reports inter-annotator agreement (Cohen's kappa), and states the observed hallucination rate on the obstetrics set. We will also clarify that supervised models are trained on a closed label vocabulary and therefore cannot generate out-of-set headers, eliminating the need for equivalent post-processing; this reflects a fundamental methodological difference rather than unequal annotator effort. revision: yes

  2. Referee: [Evaluation] Evaluation section: the abstract reports clear performance differences and robustness claims but omits exact metrics (e.g., F1, precision/recall), statistical tests, baseline implementation details, and how the obstetrics test set was constructed, leaving the quantitative support for the in-domain vs. out-of-domain drop only partially documented.

    Authors: We acknowledge the need for fuller quantitative documentation. The revised Evaluation section will report exact F1, precision, and recall values for every model on both the MIMIC-III and obstetrics sets, include results of statistical significance tests (paired t-tests on F1 scores), provide complete baseline implementation details (model checkpoints, hyperparameters, training epochs), and describe the obstetrics test-set construction (random 80/20 split with no patient overlap). These additions will directly support the reported performance drops and robustness claims. revision: yes
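To make the closed-vocabulary point in the first response concrete, here is a minimal sketch of how out-of-set headers from a zero-shot model could be flagged for review. It is an illustration under stated assumptions, not the authors' protocol: the function name, the use of `difflib` to suggest a nearest valid label, the 0.6 similarity cutoff, and the three-label vocabulary are all hypothetical. The sketch only covers headers outside the candidate set; boundary or content errors would need a separate check.

```python
import difflib


def flag_hallucinated_headers(predicted, candidate_labels, cutoff=0.6):
    """Flag predicted labels outside the closed candidate set.

    Returns (flagged, rate), where flagged is a list of
    (line_index, predicted_label, suggested_label_or_None) and rate is the
    share of lines carrying an out-of-set label. The suggestion is only a
    hint for a human reviewer, not an automatic correction.
    """
    allowed = set(candidate_labels)
    flagged = []
    for index, label in enumerate(predicted):
        if label not in allowed:
            close = difflib.get_close_matches(
                label, candidate_labels, n=1, cutoff=cutoff
            )
            flagged.append((index, label, close[0] if close else None))
    rate = len(flagged) / len(predicted) if predicted else 0.0
    return flagged, rate


# Hypothetical three-label vocabulary; the paper's ONC schema has 30 headers.
candidates = ["assessment-and-plan", "24-hour-events", "<none>"]
predictions = ["assessment-and-plan", "assessment-plan", "<none>", "vital-signs"]
flagged, hallucination_rate = flag_hallucinated_headers(predictions, candidates)
```

A supervised classifier with a fixed output layer never produces labels outside `candidates`, which is why this check applies only to generative models.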

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on new data

The paper is an empirical study that curates a new obstetrics dataset and reports direct performance measurements of supervised and zero-shot models on in-domain (MIMIC-III) versus out-of-domain data. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All claims rest on observable metrics rather than any reduction to inputs by construction, satisfying the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The evaluation assumes standard NLP practices such as the validity of human-provided section labels and the operational definition of hallucinations as incorrect headers; no free parameters, ad-hoc axioms, or invented entities are introduced.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Implicit Framing in Obstetric Counseling Notes: A Grounded LLM Pipeline on a VBAC-Eligible Cohort

    cs.CL 2026-04 unverdicted novelty 5.0

    Physicians use substantially more risk-focused framing in counseling notes for repeat cesarean than for VBAC among patients clinically eligible for both.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

    Introduction Electronic Health Records (EHRs) are widely used in modern healthcare to provide detailed records of patient encounters and their interactions within the healthcare system (Holmes et al., 2021). EHR data often contain free-text clinical notes, which are typically organized into sections such as "Chief Complaint" and "History of Present Illnes...

  2. [2]

    ONC serves as a realistic benchmark for studying section segmentation in underexplored clinical subdomains and is intended for community reuse

    Obstetrics Notes Collection (ONC): We introduce the Obstetrics Notes Collection (ONC), a de-identified dataset of 100 History & Physical (H&P) obstetrics notes, annotated in collaboration with a domain expert. ONC serves as a realistic benchmark for studying section segmentation in underexplored clinical subdomains and is intended for community reuse

  3. [3]

    Domain-Specific Evaluation of Supervised Models: We assess whether transformer-based supervised models originally trained on public datasets can effectively generalize to obstetrics notes. By comparing them on in-domain (MedSecId (Landes et al., 2022)) and out-of-domain (ONC) data, we highlight the difficulties in transferring knowledge across clinical s...

  4. [4]

    Our experiments reveal challenges (e.g., hallucinated section headers) as well as the potential benefits of zero-shot strategies, especially when annotated data are scarce

    Systematic Comparison With Zero-Shot LLMs: We present the first head-to-head comparison of supervised transformer models and zero-shot LLMs (i.e., Llama, Mistral and Qwen) for clinical section segmentation. Our experiments reveal challenges (e.g., hallucinated section headers) as well as the potential benefits of zero-shot strategies, especially when annot...

  5. [5]

    history and physical

    Related Work Before the emergence of advanced machine learning and NLP techniques, early approaches to clinical section segmentation primarily relied on rule-based methods. Denny et al. (2008), for instance, extracted candidate section header strings from a large corpus of "history and physical" (H&P) notes through pattern-based matching (e.g., detec...

  6. [6]

    Pregnancy History,

    Data We use Landes et al. (2022)’s publicly available MedSecId corpus to train and evaluate our models. MedSecId comprises 2,002 fully annotated clinical notes from MIMIC-III, specifically designed for clinical section segmentation. Additionally, we introduce ONC, a novel, de-identified dataset of 100 H&P notes from 50 vaginal birth after cesarean (VBAC) a...

  7. [7]

    In this section, we provide an overview of both approaches

    Methodology We explore two approaches for clinical section segmentation: Supervised Learning and Zero-shot Learning via LLMs. In this section, we provide an overview of both approaches. 4.1. Supervised Learning Approach We first develop a supervised approach to clinical section segmentation using pre-trained transformer-based models, widely used in text classifi...

  8. [8]

    Transformer-based Classification: Each line (i.e., a newline-separated sentence span extracted from the clinical note) is treated as an independent input and classified according to predefined section headers

  9. [9]

    <none>") using an IO-like encoding scheme: lines within labeled sections are tagged as

    Transformer + CRF: A Conditional Random Field (CRF) layer is added on top of the transformer to model label dependencies between consecutive lines, framing the task as sequence labeling. (Footnote 4) Gravida: total pregnancies and para: the number of births reaching viability. Section Header MedSecId ONC Dataset <none>/check-circle /check-circle 24-hour-events/chec...

  10. [10]

    Flatten Input: We reshape (B, L, S) to (B x L, S) so each line can be processed independently by the transformer

  11. [11]

    Contextual Embeddings: We extract the [CLS] representation for each line

  12. [12]

    Logit Projection: We apply a linear layer to project contextual embeddings into logits of shape (B x L, num_labels) for each section label where num_labels = 51

  13. [13]

    CRF Reshaping: We reshape logits back to (B, L, num_labels), so the CRF can model line-level transitions across the entire note

  14. [14]

    clinical assistant specializing in segmenting clinical notes

    Viterbi Decoding: At evaluation, we apply Viterbi decoding (Viterbi, 1967) to obtain the most likely label sequence for each note. Training hyperparameters and evaluation details are provided in Appendix A.2. 4.2. Zero-Shot Learning via LLMs We explore zero-shot learning for clinical section segmentation using pre-trained LLMs. Our primary goal is to evalu...

  15. [15]

    Evaluation and Experimental Setup We evaluate the performance of our supervised models and zero-shot LLMs on two datasets: MedSecId and ONC

    Experiments 5.1. Evaluation and Experimental Setup We evaluate the performance of our supervised models and zero-shot LLMs on two datasets: MedSecId and ONC. For MedSecId, we evaluate on the test portion of the dataset (20%, or 401 notes). To maintain a tractable sequence length for evaluation, we exclude notes with more than 100 lines, resulting in a...

  16. [16]

    We evaluated supervised and zero-shot LLM segmentation approaches on this dataset and a widely used public corpus

    Conclusions and Future Work In this work, we addressed clinical section segmentation in a specialized domain by introducing the Obstetrics Notes Collection (ONC). We evaluated supervised and zero-shot LLM segmentation approaches on this dataset and a widely used public corpus. We found that while supervised models perform well in-domain, they struggle...

  17. [17]

    Ethical Considerations This study uses de-identified clinical narratives derived from electronic health records

    Ethics Statement and Limitations 7.1. Ethical Considerations This study uses de-identified clinical narratives derived from electronic health records. As described in Sec. 3, all notes were processed within a HIPAA-compliant secure research environment and underwent automated and manual de-identification prior to analysis. The study was conducted under...

  18. [18]

    References Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. Available only as preprint. Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and ...

  19. [19]

    Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23. Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. 2024. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608. Available only as preprint. ...

  20. [20]

    Qwen2.5 Technical Report

    Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Available only as preprint. Xi Yang, Aokun Chen, Nima PourNejatian, Hoo Chang Shin, Kaleb E Smith, Christopher Parisien, Colin Compas, Cheryl Martin, Anthony B Costa, Mona G Flores, et al. 2022. A large language model for electronic health records. NPJ digital medicine, 5(1):194. Fan Zhang, Itay ...

  21. [21]

    (updates Transformer + CRF parameters) • Max token length: 100 for BERT-base, BioBERT, BiomedBERT; 64 for GatorTron (due to higher memory consumption). Evaluation Metrics: As in Section A.1, we compute precision, recall, macro-F1, and weighted-F1 to evaluate note-level segmentation performance. A.3. Zero-Shot Learning via LLMs: Inference Details Inference Detail...

  22. [22]

    **Label Confusion** The model predicted a valid but clearly different label from the gold

  23. [23]

    **Valid Local Interpretation** The predicted label is different from gold, but makes semantic sense given the span alone

  24. [24]

    assessment-and-plan

    **Other** This case is ambiguous or doesn’t fit the above categories. Respond exactly in the following format: Label: <one of the 3 options above> Reason: <your brief explanation> <|eot_id|><|start_header_id|>assistant<|end_header_id|> Section Headers: Figure 4: Zero-shot prompt snippet for Llama Instruct models Text Span Error Type Explanation Review / M...
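Items 10 to 14 in the list above outline the Transformer + CRF data flow: flatten (B, L, S) to (B x L, S), encode each line, project to logits, reshape back to (B, L, num_labels), then Viterbi-decode each note. The sketch below illustrates only the two reshaping steps and the decoding step in plain Python. It is an illustration under assumptions, not the paper's implementation: the transformer encoder and linear layer are omitted, the function names are hypothetical, the transition scores are made up, and the start and end transition terms that CRF libraries usually include are left out.

```python
def flatten_lines(batch):
    """Reshape nested lists from (B, L, S) to (B*L, S): one row per note line."""
    return [line for note in batch for line in note]


def unflatten_logits(flat_logits, num_lines):
    """Reshape (B*L, num_labels) back to (B, L, num_labels), one block per note."""
    return [
        flat_logits[start:start + num_lines]
        for start in range(0, len(flat_logits), num_lines)
    ]


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label index sequence for one note.

    emissions[t][j]   : score of label j on line t, shape (L, num_labels)
    transitions[i][j] : score of moving from label i to label j
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_labels):
            # Best previous label for arriving at label j on this line.
            best_prev = max(
                range(num_labels), key=lambda i: score[i] + transitions[i][j]
            )
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    best = max(range(num_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Toy note with 3 lines and 2 labels; transitions penalize label switches.
emissions = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
transitions = [[1.0, -2.0], [-2.0, 1.0]]
decoded = viterbi_decode(emissions, transitions)
```

In this toy case the per-line argmax would give labels 0, 1, 0, while the decoded path stays on label 0 throughout. That is the behavior the CRF layer is meant to add: section labels tend to persist across consecutive lines, so isolated switches are discouraged.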