Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

Barbara Di Eugenio; Baris Karacan; Patrick Thornton

arxiv: 2602.17513 · v2 · submitted 2026-02-19 · 💻 cs.CL

Pith reviewed 2026-05-15 20:51 UTC · model grok-4.3

classification 💻 cs.CL

keywords clinical section segmentation · supervised models · zero-shot learning · domain adaptation · MIMIC-III · obstetrics · large language models · hallucinations

The pith

Supervised clinical section segmentation models drop in performance when moving from MIMIC-III to obstetrics notes, while zero-shot models remain robust after correcting for hallucinated headers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Clinical free-text notes are structured into sections that aid decision-making and downstream NLP tasks. This paper introduces a new labeled obstetrics dataset to expand beyond the MIMIC-III corpus where most models are trained. Supervised transformer models achieve strong results on in-domain MIMIC-III data but decline sharply on the new obstetrics notes. Zero-shot large language models show better out-of-domain adaptability once any hallucinated section headers are corrected. The work highlights the need for domain-specific clinical resources and positions zero-shot segmentation as a viable path for broader healthcare NLP use.

Core claim

While supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected.

What carries the argument

Head-to-head comparison of supervised transformer-based models and zero-shot large language models for clinical section segmentation, using a curated MIMIC-III subset and a new obstetrics dataset.

Load-bearing premise

That the new obstetrics dataset is representative of the broader domain, and that manual correction of hallucinations provides a fair, scalable basis for comparing model performance.

What would settle it

A test on the obstetrics dataset or another out-of-domain clinical corpus where zero-shot models continue to underperform supervised models even after hallucination correction.

Figures

Figures reproduced from arXiv: 2602.17513 by Barbara Di Eugenio, Baris Karacan, Patrick Thornton.

**Figure 1.** Assessment and Plan section from a sample obstetrics note (includes typographical errors and masked identifier tokens).

**Figure 2.** Zero-shot prompt snippet for Llama Instruct models. The candidate label set corresponds to the 30 section headers defined in the ONC dataset; MedSecId uses a larger schema with 51 headers. Prompt Engineering. We adopt an instruction-style prompt to assign section labels to each line in a clinical note, without any task-specific fine-tuning. All four models are chat-based and support system/user prompting…

**Figure 3.** Proportional distribution of section label

**Figure 4.** Zero-shot prompt snippet for Llama Instruct models

Original abstract

Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New obstetrics dataset is a useful addition for cross-domain tests, but the zero-shot robustness claim rests on undocumented manual fixes that make the comparison uneven.

The paper's clearest value is the new section-labeled obstetrics notes dataset. Most public clinical corpora are built on MIMIC-style data, so adding a distinct obstetrics set gives people a concrete way to measure domain shift on section segmentation. They also run the first direct supervised-versus-zero-shot comparison for this task, testing transformers trained on MIMIC against zero-shot LLMs on both the in-domain and out-of-domain sets. Supervised models drop when moved to obstetrics notes, which matches what we usually see with clinical text. The zero-shot side is presented as holding up better once hallucinated headers are corrected by hand. That basic empirical pattern is worth having on record.

The soft spot is the zero-shot pipeline itself. The abstract and stress-test note both flag that performance only looks robust after post-hoc human correction of hallucinations, yet no details appear on correction rate, protocol, or agreement between correctors. Supervised models receive no equivalent step, so any reported gap could trace to the extra human effort rather than model behavior. Without those numbers the central claim is only partly supported.

This work is aimed at clinical NLP groups that need labeled data for varied note types or that want to test domain adaptation. It deserves a serious referee to check the methods section on the hallucination handling and the exact metrics, but it should not be desk-rejected.

Referee Report

2 major / 0 minor

Summary. The paper curates a new de-identified obstetrics notes dataset, evaluates supervised transformer models on a MIMIC-III subset (in-domain) and the new dataset (out-of-domain), and performs the first head-to-head comparison against zero-shot LLMs for clinical section segmentation. It claims supervised models perform strongly in-domain but drop substantially out-of-domain, while zero-shot models show robust out-of-domain adaptability after manual correction of hallucinated section headers.

Significance. If the central comparison holds after clarification, the work provides a useful new domain-specific resource and initial evidence that zero-shot approaches may offer better cross-domain generalization in clinical NLP than supervised models trained on MIMIC-III, provided hallucinations are managed. The head-to-head empirical evaluation on a previously under-represented obstetrics domain is a clear strength.

major comments (2)

[Abstract] Abstract and Results: the headline claim that zero-shot models 'demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected' is load-bearing for the supervised-vs-zero-shot comparison, yet no protocol is described for identifying hallucinations, the correction process, inter-annotator agreement, or the hallucination rate on the obstetrics set. Supervised models receive no equivalent post-processing, so any performance gap may reflect annotator effort rather than model capability.
[Evaluation] Evaluation section: the abstract reports clear performance differences and robustness claims but omits exact metrics (e.g., F1, precision/recall), statistical tests, baseline implementation details, and how the obstetrics test set was constructed, leaving the quantitative support for the in-domain vs. out-of-domain drop only partially documented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the clarity and rigor of our work. We will revise the manuscript to provide the requested protocol details and quantitative metrics while preserving the core empirical findings.

Referee: [Abstract] Abstract and Results: the headline claim that zero-shot models 'demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected' is load-bearing for the supervised-vs-zero-shot comparison, yet no protocol is described for identifying hallucinations, the correction process, inter-annotator agreement, or the hallucination rate on the obstetrics set. Supervised models receive no equivalent post-processing, so any performance gap may reflect annotator effort rather than model capability.

Authors: We agree that the hallucination identification and correction protocol requires explicit description. In the revised manuscript we will add a dedicated paragraph in the Evaluation section that defines hallucinations (model outputs containing section headers absent from the source note or with incorrect boundaries/content), outlines the manual review process performed by two clinical annotators, reports inter-annotator agreement (Cohen's kappa), and states the observed hallucination rate on the obstetrics set. We will also clarify that supervised models are trained on a closed label vocabulary and therefore cannot generate out-of-set headers, eliminating the need for equivalent post-processing; this reflects a fundamental methodological difference rather than unequal annotator effort. revision: yes
Referee: [Evaluation] Evaluation section: the abstract reports clear performance differences and robustness claims but omits exact metrics (e.g., F1, precision/recall), statistical tests, baseline implementation details, and how the obstetrics test set was constructed, leaving the quantitative support for the in-domain vs. out-of-domain drop only partially documented.

Authors: We acknowledge the need for fuller quantitative documentation. The revised Evaluation section will report exact F1, precision, and recall values for every model on both the MIMIC-III and obstetrics sets, include results of statistical significance tests (paired t-tests on F1 scores), provide complete baseline implementation details (model checkpoints, hyperparameters, training epochs), and describe the obstetrics test-set construction (random 80/20 split with no patient overlap). These additions will directly support the reported performance drops and robustness claims. revision: yes
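The exchange above turns on how out-of-set headers produced by a zero-shot model are detected and corrected. The following is only a minimal sketch of one automated possibility, not the authors' protocol (the rebuttal describes a manual review by two annotators). It assumes a closed label vocabulary; `assessment-and-plan` and `<none>` appear in the paper excerpts, while the other label names, the function names, the similarity cutoff, and the `<none>` fallback are illustrative assumptions.

```python
from difflib import get_close_matches

# Hypothetical closed label set. The ONC schema has 30 headers; only a few
# illustrative names are listed here, and two of them are invented examples.
VALID_LABELS = {
    "<none>",
    "assessment-and-plan",
    "chief-complaint",              # assumed label name
    "history-of-present-illness",   # assumed label name
}


def hallucination_rate(predictions, valid_labels=VALID_LABELS):
    """Fraction of predicted line labels that fall outside the closed set."""
    if not predictions:
        return 0.0
    out_of_set = sum(label not in valid_labels for label in predictions)
    return out_of_set / len(predictions)


def correct_labels(predictions, valid_labels=VALID_LABELS,
                   cutoff=0.8, fallback="<none>"):
    """Map each out-of-set label to its closest valid label, else a fallback."""
    candidates = sorted(valid_labels)
    corrected = []
    for label in predictions:
        if label in valid_labels:
            corrected.append(label)
            continue
        # String similarity stands in for the manual review step.
        match = get_close_matches(label, candidates, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else fallback)
    return corrected


if __name__ == "__main__":
    preds = ["assessment-and-plan", "assesment-and-plan", "billing-codes", "<none>"]
    print(hallucination_rate(preds))
    print(correct_labels(preds))
```

Reporting `hallucination_rate` alongside the corrected scores is one way to make the cost of the correction step visible, which is the gap the referee points to.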

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on new data

The paper is an empirical study that curates a new obstetrics dataset and reports direct performance measurements of supervised and zero-shot models on in-domain (MIMIC-III) versus out-of-domain data. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All claims rest on observable metrics rather than any reduction to inputs by construction, satisfying the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The evaluation assumes standard NLP practices such as the validity of human-provided section labels and the operational definition of hallucinations as incorrect headers; no free parameters, ad-hoc axioms, or invented entities are introduced.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Implicit Framing in Obstetric Counseling Notes: A Grounded LLM Pipeline on a VBAC-Eligible Cohort
cs.CL · 2026-04 · unverdicted · novelty 5.0

Physicians use substantially more risk-focused framing in counseling notes for repeat cesarean than for VBAC among patients clinically eligible for both.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

Introduction Electronic Health Records (EHRs) are widely used in modern healthcare to provide detailed records of patient encounters and their interactions within the healthcare system (Holmes et al., 2021). EHR data often contain free-text clinical notes, which are typically organized into sections such as "Chief Complaint" and "History of Present Illnes...

[2]

ONC serves as a realistic benchmark for studying section segmentation in underexplored clinical subdomains and is intended for community reuse

Obstetrics Notes Collection (ONC): We introduce the Obstetrics Notes Collection (ONC), a de-identified dataset of 100 History & Physical (H&P) obstetrics notes, annotated in collaboration with a domain expert. ONC serves as a realistic benchmark for studying section segmentation in underexplored clinical subdomains and is intended for community reuse

[3]

Domain-Specific Evaluation of Supervised Models: We assess whether transformer-based supervised models originally trained on public datasets can effectively generalize to obstetrics notes. By comparing them on in-domain (MedSecId (Landes et al., 2022)) and out-of-domain (ONC) data, we highlight the difficulties in transferring knowledge across clinical s...

[4]

Our experiments reveal challenges (e.g., hallucinated section headers) as well as the potential benefits of zero-shot strategies, especially when annotated data are scarce

Systematic Comparison With Zero-Shot LLMs: We present the first head-to-head comparison of supervised transformer models and zero-shot LLMs (i.e., Llama, Mistral and Qwen) for clinical section segmentation. Our experiments reveal challenges (e.g., hallucinated section headers) as well as the potential benefits of zero-shot strategies, especially when annot...

[5]

history and physical

Related Work Before the emergence of advanced machine learning and NLP techniques, early approaches to clinical section segmentation primarily relied on rule-based methods. Denny et al. (2008), for instance, extracted candidate section header strings from a large corpus of "history and physical" (H&P) notes through pattern-based matching (e.g., detec...

[6]

Pregnancy History,

Data We use Landes et al. (2022)’s publicly available MedSecId corpus to train and evaluate our models. MedSecId comprises 2,002 fully annotated clinical notes from MIMIC-III, specifically designed for clinical section segmentation. Additionally, we introduce ONC, a novel, de-identified dataset of 100 H&P notes from 50 vaginal birth after cesarean (VBAC) a...

[7]

In this section, we provide an overview of both approaches

Methodology We explore two approaches for clinical section segmentation: Supervised Learning and Zero-shot Learning via LLMs. In this section, we provide an overview of both approaches. 4.1. Supervised Learning Approach We first develop a supervised approach to clinical section segmentation using pre-trained transformer-based models, widely used in text classifi...

[8]

Transformer-based Classification: Each line (i.e., a newline-separated sentence span extracted from the clinical note) is treated as an independent input and classified according to predefined section headers

[9]

<none>") using an IO-like encoding scheme: lines within labeled sections are tagged as

Transformer + CRF: A Conditional Random Field (CRF) layer is added on top of the transformer to model label dependencies between consecutive lines, framing the task as sequence labeling. Footnote 4: Gravida: total pregnancies and para: the number of births reaching viability. Section Header MedSecId ONC Dataset <none>/check-circle /check-circle 24-hour-events/chec...

[10]

Flatten Input: We reshape (B, L, S) to (B x L, S) so each line can be processed independently by the transformer

[11]

Contextual Embeddings: We extract the [CLS] representation for each line

[12]

Logit Projection: We apply a linear layer to project contextual embeddings into logits of shape (B x L, num_labels) for each section label where num_labels = 51

[13]

CRF Reshaping: We reshape logits back to (B, L, num_labels), so the CRF can model line-level transitions across the entire note

[14]

clinical assistant specializing in segmenting clinical notes

Viterbi Decoding: At evaluation, we apply Viterbi decoding (Viterbi, 1967) to obtain the most likely label sequence for each note. Training hyperparameters and evaluation details are provided in Appendix A.2. 4.2. Zero-Shot Learning via LLMs We explore zero-shot learning for clinical section segmentation using pre-trained LLMs. Our primary goal is to evalu...
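The supervised Transformer + CRF steps excerpted in entries [10] to [14] (flatten, per-line logits, reshape back to notes, Viterbi decoding) can be illustrated with a small dependency-free sketch. This is not the authors' code: nested Python lists stand in for tensors, the function names are hypothetical, and the decoder omits the start/end transition scores that many CRF implementations add.

```python
def flatten(batch):
    """(B, L, S) -> (B*L, S): every line becomes an independent encoder input."""
    return [line for note in batch for line in note]


def unflatten(rows, batch_size, lines_per_note):
    """(B*L, K) -> (B, L, K): regroup per-line rows into whole notes."""
    return [rows[i * lines_per_note:(i + 1) * lines_per_note]
            for i in range(batch_size)]


def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence for one note.

    emissions:   L x K list of per-line label scores (logits)
    transitions: K x K list, score of moving from label i to label j
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])      # best score of any path ending in each label
    backpointers = []               # best previous label for each (line, label)
    for emit in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(num_labels):
            best_i = max(range(num_labels),
                         key=lambda i: score[i] + transitions[i][j])
            step_back.append(best_i)
            step_scores.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = step_scores
        backpointers.append(step_back)
    # Trace back from the best final label to recover the full path.
    path = [max(range(num_labels), key=lambda j: score[j])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    emissions = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
    no_penalty = [[0.0, 0.0], [0.0, 0.0]]
    switch_penalty = [[0.0, -5.0], [-5.0, 0.0]]
    print(viterbi(emissions, no_penalty))
    print(viterbi(emissions, switch_penalty))
```

With zero transition scores the decoder reduces to a per-line argmax, which corresponds to the plain transformer classifier; a penalty on label switches shows how the CRF layer can smooth an isolated mislabeled line inside a section.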

[15]

Evaluation and Experimental Setup We evaluate the performance of our supervised models and zero-shot LLMs on two datasets: MedSecId and ONC

Experiments 5.1. Evaluation and Experimental Setup We evaluate the performance of our supervised models and zero-shot LLMs on two datasets: MedSecId and ONC. For MedSecId, we evaluate on the test portion of the dataset (20%, or 401 notes). To maintain a tractable sequence length for evaluation, we exclude notes with more than 100 lines, resulting in a...

[16]

We evaluated supervised and zero-shot LLM segmentation approaches on this dataset and a widely used public corpus

Conclusions and Future Work In this work, we addressed clinical section segmentation in a specialized domain by introducing the Obstetrics Notes Collection (ONC). We evaluated supervised and zero-shot LLM segmentation approaches on this dataset and a widely used public corpus. We found that while supervised models perform well in-domain, they struggle...

[17]

Ethical Considerations This study uses de-identified clinical narratives derived from electronic health records

Ethics Statement and Limitations 7.1. Ethical Considerations This study uses de-identified clinical narratives derived from electronic health records. As described in Sec. 3, all notes were processed within a HIPAA-compliant secure research environment and underwent automated and manual de-identification prior to analysis. The study was conducted und...

[18]

References Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. Available only as preprint. Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and ...

[19]

Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23. Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. 2024. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint arXiv:2403.14608. Available only as preprint. ...

[20]

Qwen2.5 Technical Report

Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Available only as preprint. Xi Yang, Aokun Chen, Nima PourNejatian, Hoo Chang Shin, Kaleb E Smith, Christopher Parisien, Colin Compas, Cheryl Martin, Anthony B Costa, Mona G Flores, et al. 2022. A large language model for electronic health records. NPJ digital medicine, 5(1):194. Fan Zhang, Itay ...

[21]

(updates Transformer + CRF parameters) • Max token length: 100 for BERT-base, BioBERT, BiomedBERT; 64 for GatorTron (due to higher memory consumption) Evaluation Metrics As in Section A.1, we compute precision, recall, macro-F1, and weighted-F1 to evaluate note-level segmentation performance. A.3. Zero-Shot Learning via LLMs: Inference Details Inference Detail...
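The excerpt above names macro-F1 and weighted-F1 as evaluation metrics. As a rough illustration of how the two differ, here is a minimal sketch over flat lists of per-line gold and predicted labels. The function names are hypothetical, the label set is assumed to be the union of gold and predicted labels, and the paper's exact note-level aggregation is not reproduced here.

```python
from collections import Counter


def per_label_f1(gold, pred):
    """F1 for every label that appears in either the gold or predicted lines."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1: rare sections count as much as common ones."""
    scores = per_label_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(gold, pred):
    """Per-label F1 averaged by gold support: frequent sections dominate."""
    scores = per_label_f1(gold, pred)
    support = Counter(gold)
    return sum(scores[label] * support[label] for label in scores) / len(gold)


if __name__ == "__main__":
    gold = ["a", "a", "a", "b"]
    pred = ["a", "a", "b", "b"]
    print(macro_f1(gold, pred), weighted_f1(gold, pred))
```

Because section labels in clinical notes are typically imbalanced, a gap between the two numbers is a quick signal that errors concentrate in rare sections.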

[22]

**Label Confusion** The model predicted a valid but clearly different label from the gold

[23]

**Valid Local Interpretation** The predicted label is different from gold, but makes semantic sense given the span alone

[24]

assessment-and-plan

**Other** This case is ambiguous or doesn’t fit the above categories. Respond exactly in the following format: Label: <one of the 3 options above> Reason: <your brief explanation> <|eot_id|><|start_header_id|>assistant<|end_header_id|> Section Headers: Figure 4: Zero-shot prompt snippet for Llama Instruct models Text Span Error Type Explanation Review / M...

