Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings

Gunjan Balde; Mainack Mondal; Niloy Ganguly; Soumyadeep Roy

arxiv: 2505.21242 · v1 · submitted 2025-05-27 · 💻 cs.CL

Pith reviewed 2026-05-19 13:03 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM evaluation · medical text summarization · vocabulary adaptation · out-of-vocabulary words · domain adaptation · OOV handling · medical domain

The pith

Adapting large language models' vocabularies with medical terms boosts their performance on summarization tasks involving many unfamiliar words.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models achieve good results in medical text summarization using in-context learning but show clear drops when faced with inputs heavy in specialized terms absent from their vocabulary. The paper benchmarks this issue across datasets and identifies high out-of-vocabulary concentrations and high novelty as key trouble spots. Vocabulary adaptation, by expanding the model's token set with domain-specific words, counters this mismatch and leads to better automatic and human-evaluated summaries. Experiments across multiple strategies, pretraining methods, and three datasets reveal practical ways to customize models for medicine. Medical experts confirm the adapted outputs are more relevant and accurate.
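To make the two difficulty criteria concrete, here is a minimal sketch of how a single data point might be scored. It assumes that OOV concentration means the share of source words missing from a reference vocabulary, and that novelty means the share of summary n-grams absent from the source. The paper's exact definitions may differ, and the function names, the regex word splitter, and the toy vocabulary are all illustrative.

```python
import re

def words(text):
    """Lowercase word tokens; a simple regex stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def oov_rate(source, vocab):
    """Share of source words not found in `vocab` (assumed definition)."""
    toks = words(source)
    if not toks:
        return 0.0
    return sum(t not in vocab for t in toks) / len(toks)

def novelty(source, summary, n=1):
    """Share of summary n-grams that never appear in the source."""
    def ngrams(toks):
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    src = set(ngrams(words(source)))
    summ = ngrams(words(summary))
    if not summ:
        return 0.0
    return sum(g not in src for g in summ) / len(summ)

# Hypothetical general-domain vocabulary and example, for illustration only.
vocab = {"the", "patient", "was", "given", "for", "pain"}
source = "The patient was given acetaminophen for pain"
summary = "Patient received acetaminophen"

print(oov_rate(source, vocab))    # 1 of 7 source words is out of vocabulary
print(novelty(source, summary))   # 1 of 3 summary unigrams is new
```

Sorting a dataset by either score and taking the top slice would give the kind of "difficult" subset the review refers to, under these assumed definitions.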

Core claim

The paper establishes that vocabulary adaptation improves LLM summarization performance even in difficult settings, namely inputs with a high concentration of out-of-vocabulary words or high novelty. It also finds that Llama-3.1 over-fragments medical words despite its vocabulary of around 128K tokens.

What carries the argument

Vocabulary adaptation, the process of updating the LLM vocabulary with medical domain words or subwords to reduce fragmentation and mismatch.
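The effect of adding domain words on fragmentation can be illustrated with a toy example. The sketch below uses a greedy longest-match splitter, which is an assumption made for brevity and not the BPE tokenizer that Llama-3.1 actually uses; the vocabularies and the `fragment_score` helper are likewise hypothetical.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces from `vocab`.

    Single characters are always accepted as a fallback, so every word splits.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces

def fragment_score(words, vocab):
    """Average number of subword pieces per word (lower is less fragmented)."""
    return sum(len(tokenize(w, vocab)) for w in words) / len(words)

# Toy base vocabulary and medical terms, for illustration only.
base_vocab = {"hyper", "tension", "cardio", "my", "path", "y", "o"}
medical_words = ["hypertension", "cardiomyopathy"]
adapted_vocab = base_vocab | set(medical_words)

print(tokenize("cardiomyopathy", base_vocab))        # five pieces
print(fragment_score(medical_words, base_vocab))     # 3.5
print(fragment_score(medical_words, adapted_vocab))  # 1.0
```

Adding whole medical words is only one way to build the extended vocabulary; adding frequent medical subwords instead would lower the score less sharply but with fewer new tokens.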

If this is right

Performance gains hold across different vocabulary adaptation strategies and continual pretraining approaches.
Human evaluations by medical experts show improved relevance and faithfulness in summaries from adapted models.
Even large-vocabulary models benefit from this customization in specialized domains.
The approach addresses issues not solved by scale alone in high-OOV medical data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method may apply to other technical fields with dense specialized terminology.
Combining vocabulary adaptation with other domain adaptation techniques could yield further gains.
Dynamic or on-the-fly adaptation during inference might reduce the need for full retraining.

Load-bearing premise

The performance drop on high-OOV data is due mainly to vocabulary mismatch and not to other factors like longer inputs or harder topics.

What would settle it

Measure if summarization metrics on high-OOV subsets improve after vocabulary adaptation while holding input length and topic constant.
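One way to hold input length roughly constant is to compare high-OOV and low-OOV examples only within the same length bucket. The sketch below is a minimal version of that idea; the record fields, the OOV threshold, and the bucket width are assumptions chosen for illustration, and topic matching is not handled.

```python
from collections import defaultdict
from statistics import mean

def length_matched_gap(examples, oov_threshold=0.2, bucket_size=100):
    """Mean (low-OOV minus high-OOV) score gap within input-length buckets.

    `examples` holds dicts with 'length', 'oov_rate', and 'score' keys.
    Buckets missing either group are skipped, so the two groups are
    compared only at similar input lengths.
    """
    buckets = defaultdict(lambda: {"high": [], "low": []})
    for ex in examples:
        group = "high" if ex["oov_rate"] >= oov_threshold else "low"
        buckets[ex["length"] // bucket_size][group].append(ex["score"])
    gaps = [mean(b["low"]) - mean(b["high"])
            for b in buckets.values() if b["low"] and b["high"]]
    return mean(gaps) if gaps else None

# Made-up scores for illustration; not results from the paper.
examples = [
    {"length": 120, "oov_rate": 0.05, "score": 0.40},
    {"length": 150, "oov_rate": 0.30, "score": 0.30},
    {"length": 420, "oov_rate": 0.10, "score": 0.35},
    {"length": 480, "oov_rate": 0.25, "score": 0.25},
    {"length": 900, "oov_rate": 0.40, "score": 0.10},  # no low-OOV partner
]
print(length_matched_gap(examples))
```

Running the same function on scores from the base model and from the adapted model would show whether the length-matched gap shrinks after adaptation, which is the pattern the vocabulary-mismatch reading predicts.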

Original abstract

Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail. They typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs show a significant performance drop for data points with high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue where the LLM vocabulary gets updated with certain expert domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces over-fragmentation issue with medical words. To that end, we show vocabulary adaptation helps improve the LLM summarization performance even in difficult settings. Through extensive experimentation of multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies for customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts where they found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at https://github.com/gb-kgp/LLM-MedicalSummarization-Benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Vocabulary adaptation improves medical summarization on high-OOV subsets for Llama-3.1, but the gains are not cleanly isolated from domain pretraining or subset differences.

The main point is that this paper shows vocabulary adaptation lifts LLM summarization performance in high-OOV medical settings, with supporting runs across three datasets, several adaptation methods, and expert human ratings that favor the adapted outputs for relevance and faithfulness. The observation that Llama-3.1 still over-fragments medical terms despite its 128k vocabulary is a concrete data point worth noting, and the public code makes the comparisons easy to check. They test multiple strategies and two continual pretraining variants, which gives a practical sense of what works better under those conditions. The human study adds a useful check beyond automatic metrics. The setup is straightforward empirical work with no invented claims or circular derivations.

The soft spot is the attribution. High-OOV subsets are not matched on length or topic, and no regression isolates OOV rate after controlling for those factors. Since adaptation runs through continued pretraining on medical text, the improvement could come from extra domain exposure rather than the added tokens themselves. That leaves the central mechanism correlational rather than tightly demonstrated.

Readers working on domain adaptation for clinical tools or on tokenization issues in LLMs will get the most out of the comparisons and the released artifacts. The question is relevant to deployment and the evidence is solid enough on the surface to justify referee time, even if the authors need to add controls or ablations to strengthen the causal story. I would send it for peer review with a request for those tighter checks.

Referee Report

2 major / 2 minor

Summary. The paper conducts an empirical benchmarking study of LLMs for medical text summarization, focusing on performance degradation in high-OOV or high-novelty subsets across three datasets. It evaluates multiple vocabulary adaptation strategies (including continued pretraining variants) and reports improvements in automatic metrics and human expert judgments of relevance and faithfulness, with public code release.

Significance. If the attribution to vocabulary mismatch holds, the work provides practical guidance on customizing LLMs for medical domains with high OOV rates, supported by convergent evidence from multiple datasets, adaptation methods, continual pretraining, and expert human evaluation. The public codebase strengthens reproducibility.

major comments (2)

[§4 and §5] §4 (subset analysis) and §5 (results): the performance drop on high-OOV/high-novelty partitions is presented as evidence of vocabulary mismatch, yet no length-matched or topic-matched controls are reported, nor is there a regression isolating OOV rate after conditioning on sequence length and domain entropy. This leaves open the possibility that confounders drive the observed drop rather than tokenization alone.
[§3.2] §3.2 (vocabulary adaptation via continual pretraining): gains from continued pretraining could arise from additional domain exposure rather than the specific addition of medical tokens. The paper does not include an ablation that holds domain exposure fixed while varying only the vocabulary extension to isolate the contribution of the new tokens.
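The regression requested in the first major comment can be sketched in a few lines. The version below partials out a single control (input length) using the Frisch-Waugh-Lovell result, under which the residual-on-residual slope equals the OOV coefficient in the regression of score on OOV rate and length. The data are synthetic and generated from a known linear rule; a second control such as domain entropy would need a full multiple regression.

```python
from statistics import mean

def slope(x, y):
    """Least-squares slope of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

def residuals(x, y):
    """Residuals of y after a simple linear fit on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def partial_slope(target, predictor, control):
    """Coefficient of `predictor` on `target` after partialling out `control`.

    By the Frisch-Waugh-Lovell theorem this equals the predictor's
    coefficient in the regression target ~ predictor + control.
    """
    return slope(residuals(control, predictor), residuals(control, target))

# Synthetic data built from score = 0.5 - 0.4 * oov - 0.0002 * length.
length = [100, 200, 300, 400, 500, 600]
oov = [0.05, 0.20, 0.10, 0.30, 0.15, 0.35]
score = [0.5 - 0.4 * o - 0.0002 * l for o, l in zip(oov, length)]

print(partial_slope(score, oov, length))  # recovers roughly -0.4
```

A clearly negative partial slope on real data would support the claim that OOV rate matters beyond input length, though it would still not rule out unmeasured confounders such as topic difficulty.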

minor comments (2)

[Table 2, Figure 3] Table 2 and Figure 3: clarify whether the reported ROUGE/BERTScore improvements on high-OOV subsets are statistically significant after multiple-comparison correction.
[§2] §2 (related work): add a brief discussion of prior vocabulary adaptation methods in non-medical domains to better contextualize the novelty of the medical-specific findings.
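For the significance question raised about Table 2 and Figure 3, one standard multiple-comparison correction is Holm's step-down procedure. The sketch below applies it to a list of p-values; the values shown are hypothetical placeholders, not numbers from the paper, and the report does not say which correction the referee has in mind.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject flag per hypothesis using Holm's step-down method."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical p-values for four metric/dataset comparisons.
p_values = [0.001, 0.04, 0.03, 0.012]
print(holm_bonferroni(p_values))  # [True, False, False, True]
```

Holm's method controls the family-wise error rate and is never less powerful than plain Bonferroni, which makes it a reasonable default when several metrics and datasets are tested together.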

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments have helped us clarify the attribution of performance drops to vocabulary mismatch and better isolate the role of vocabulary extension. We address each major comment below, indicating the revisions incorporated into the updated manuscript.

Referee: [§4 and §5] §4 (subset analysis) and §5 (results): the performance drop on high-OOV/high-novelty partitions is presented as evidence of vocabulary mismatch, yet no length-matched or topic-matched controls are reported, nor is there a regression isolating OOV rate after conditioning on sequence length and domain entropy. This leaves open the possibility that confounders drive the observed drop rather than tokenization alone.

Authors: We appreciate this observation and agree that additional controls would strengthen the causal link to vocabulary mismatch. In the revised manuscript, we have added length-matched controls by subsampling high-OOV and low-OOV partitions to equalize sequence length distributions. We have also included a multiple linear regression in the updated §4 that predicts performance while conditioning on sequence length and domain entropy (estimated via LDA topic entropy). The regression shows that OOV rate retains a statistically significant negative coefficient after these controls. These new analyses are reported in the revised §4 and §5.
revision: yes
Referee: [§3.2] §3.2 (vocabulary adaptation via continual pretraining): gains from continued pretraining could arise from additional domain exposure rather than the specific addition of medical tokens. The paper does not include an ablation that holds domain exposure fixed while varying only the vocabulary extension to isolate the contribution of the new tokens.

Authors: We concur that isolating the contribution of the newly added tokens is necessary. We have therefore added an ablation experiment in the revised §3.2: continued pretraining is performed on the exact same medical corpus and for the same number of steps, once with the original vocabulary and once with the extended vocabulary containing new medical tokens. The results indicate that vocabulary extension yields further gains in summarization metrics beyond those attributable to domain exposure alone. This ablation is now presented in §3.2.
revision: yes
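The first response mentions estimating domain entropy via LDA topic entropy. Assuming this means the Shannon entropy of a document's topic distribution, it could be computed as in the sketch below. The topic probabilities are made up, and fitting the LDA model that would produce them is out of scope here.

```python
import math

def topic_entropy(topic_probs):
    """Shannon entropy (in nats) of a document's topic distribution.

    Weights are normalized first, and zero-probability topics are ignored.
    """
    total = sum(topic_probs)
    probs = [p / total for p in topic_probs if p > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical topic distributions for two documents.
focused = [0.97, 0.01, 0.01, 0.01]   # dominated by one topic
mixed = [0.25, 0.25, 0.25, 0.25]     # spread evenly over four topics

print(topic_entropy(focused))
print(topic_entropy(mixed))  # equals log(4), the maximum for four topics
```

A document with higher topic entropy touches more themes, so including this value as a regressor is one way to separate topical breadth from OOV rate.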

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with direct measurements

The paper conducts an empirical benchmarking study on LLMs for medical summarization, partitioning data by OOV rate and novelty, testing vocabulary adaptation via continued pretraining, and reporting performance metrics plus human evaluation. No equations, derivations, or predictions are claimed; results are direct experimental outcomes on public datasets with released code. No self-citations serve as load-bearing premises, no fitted parameters are renamed as predictions, and no ansatzes or uniqueness theorems are invoked. The work is self-contained against external benchmarks and does not reduce any claim to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical and relies on standard NLP assumptions rather than new theoretical constructs.

axioms (1)

domain assumption: Standard automatic metrics (ROUGE, etc.) and human faithfulness/relevance judgments adequately capture summarization quality in the medical domain.
Used throughout the benchmarking and human evaluation sections.

pith-pipeline@v0.9.0 · 5791 in / 1114 out tokens · 37196 ms · 2026-05-19T13:03:20.765536+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "We show that LLMs show a significant performance drop for data points with high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue..."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: "ScafFix constructs the candidate set for added vocabulary by directly considering the medical words and ignores the tokenization step for forming candidate subwords."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.