Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings
Pith reviewed 2026-05-19 13:03 UTC · model grok-4.3
The pith
Adapting large language models' vocabularies with medical terms boosts their performance on summarization tasks involving many unfamiliar words.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that vocabulary adaptation improves LLM summarization performance even in difficult settings with a high concentration of out-of-vocabulary words or high novelty. Llama-3.1 over-fragments medical words despite its large vocabulary size.
What carries the argument
Vocabulary adaptation, the process of updating the LLM vocabulary with medical domain words or subwords to reduce fragmentation and mismatch.
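To make "fragmentation" concrete, the following is a minimal sketch that counts how many subword pieces a word is split into before and after medical words are added to a vocabulary. The greedy longest-match tokenizer, the toy vocabulary, and the function names are assumptions made for illustration; real LLM tokenizers such as Llama-3.1's use BPE merges, and this is not the paper's implementation.

```python
def greedy_tokenize(word, vocab):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def fragment_score(words, vocab):
    """Average number of subword pieces per word (1.0 means no fragmentation)."""
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


# Hypothetical toy vocabulary and word list, for illustration only.
base_vocab = {"hyper", "tension", "card", "io", "my", "opathy", "the", "patient"}
medical_words = ["hypertension", "cardiomyopathy"]

before = fragment_score(medical_words, base_vocab)

# "Adaptation" here simply adds the whole medical words as new vocabulary entries.
adapted_vocab = base_vocab | set(medical_words)
after = fragment_score(medical_words, adapted_vocab)

print(before, after)
```

In this toy setting the two words go from an average of three pieces each to one piece each. In a real model the new entries would also need embeddings, which is where continual pretraining comes in.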
If this is right
- Performance gains hold across different vocabulary adaptation strategies and continual pretraining approaches.
- Human evaluations by medical experts show improved relevance and faithfulness in summaries from adapted models.
- Even large-vocabulary models benefit from this customization in specialized domains.
- The approach addresses issues not solved by scale alone in high-OOV medical data.
Where Pith is reading between the lines
- This method may apply to other technical fields with dense specialized terminology.
- Combining vocabulary adaptation with other domain adaptation techniques could yield further gains.
- Dynamic or on-the-fly adaptation during inference might reduce the need for full retraining.
Load-bearing premise
The performance drop on high-OOV data is due mainly to vocabulary mismatch and not to other factors like longer inputs or harder topics.
What would settle it
Measure whether summarization metrics on high-OOV subsets improve after vocabulary adaptation while holding input length and topic constant.
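One way to hold input length roughly constant is to compare high-OOV and low-OOV examples only within the same length bin. The sketch below assumes hypothetical field names (`length`, `high_oov`, `score`) and an arbitrary bin width; it does not control for topic, which would need an additional stratification variable.

```python
from collections import defaultdict
from statistics import mean


def length_matched_gap(examples, bin_width=100):
    """Mean (low-OOV minus high-OOV) score gap within input-length bins.

    Each example is a dict with hypothetical keys: 'length' (input tokens),
    'high_oov' (bool), and 'score' (any summarization metric).
    """
    bins = defaultdict(lambda: {True: [], False: []})
    for ex in examples:
        bins[ex["length"] // bin_width][ex["high_oov"]].append(ex["score"])

    gaps = []
    for groups in bins.values():
        # Only bins containing both groups give a length-controlled comparison.
        if groups[True] and groups[False]:
            gaps.append(mean(groups[False]) - mean(groups[True]))
    return mean(gaps) if gaps else None


# Made-up scores, for illustration only (not results from the paper).
examples = [
    {"length": 120, "high_oov": False, "score": 0.40},
    {"length": 150, "high_oov": True, "score": 0.30},
    {"length": 320, "high_oov": False, "score": 0.35},
    {"length": 380, "high_oov": True, "score": 0.25},
    {"length": 910, "high_oov": True, "score": 0.10},  # no low-OOV match, ignored
]
gap = length_matched_gap(examples)
print(gap)
```

Running the same comparison before and after vocabulary adaptation would show whether the length-controlled gap shrinks.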
Original abstract
Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail. They typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs show a significant performance drop for data points with high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue where the LLM vocabulary gets updated with certain expert domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces over-fragmentation issue with medical words. To that end, we show vocabulary adaptation helps improve the LLM summarization performance even in difficult settings. Through extensive experimentation of multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies for customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts where they found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at https://github.com/gb-kgp/LLM-MedicalSummarization-Benchmark.
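The abstract partitions data points by OOV concentration and novelty but, as quoted here, does not give the exact formulas. The sketch below uses common stand-in definitions, which are assumptions: OOV rate as the fraction of whitespace-separated words missing from a reference vocabulary, and novelty as the fraction of summary n-grams that do not occur in the source.

```python
def oov_rate(text, vocab):
    """Fraction of whitespace-separated words that are missing from vocab."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w not in vocab for w in words) / len(words)


def novelty(source, summary, n=1):
    """Fraction of summary n-grams that never appear in the source."""
    def ngrams(text):
        toks = text.lower().split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    source_ngrams = set(ngrams(source))
    novel = sum(g not in source_ngrams for g in summary_ngrams)
    return novel / len(summary_ngrams)


# Hypothetical vocabulary and texts, for illustration only.
vocab = {"the", "patient", "has", "severe", "pain"}
source = "the patient has severe dyspnea"
summary = "patient reports dyspnea"

print(oov_rate(source, vocab))
print(novelty(source, summary))
```

Data points could then be bucketed by thresholds on these two values; the thresholds themselves would be a further design choice not specified here.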
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical benchmarking study of LLMs for medical text summarization, focusing on performance degradation in high-OOV or high-novelty subsets across three datasets. It evaluates multiple vocabulary adaptation strategies (including continued pretraining variants) and reports improvements in automatic metrics and human expert judgments of relevance and faithfulness, with public code release.
Significance. If the attribution to vocabulary mismatch holds, the work provides practical guidance on customizing LLMs for medical domains with high OOV rates, supported by convergent evidence from multiple datasets, adaptation methods, continual pretraining, and expert human evaluation. The public codebase strengthens reproducibility.
major comments (2)
- [§4 and §5] §4 (subset analysis) and §5 (results): the performance drop on high-OOV/high-novelty partitions is presented as evidence of vocabulary mismatch, yet no length-matched or topic-matched controls are reported, nor is there a regression isolating OOV rate after conditioning on sequence length and domain entropy. This leaves open the possibility that confounders drive the observed drop rather than tokenization alone.
- [§3.2] §3.2 (vocabulary adaptation via continual pretraining): gains from continued pretraining could arise from additional domain exposure rather than the specific addition of medical tokens. The paper does not include an ablation that holds domain exposure fixed while varying only the vocabulary extension to isolate the contribution of the new tokens.
minor comments (2)
- [Table 2, Figure 3] Table 2 and Figure 3: clarify whether the reported ROUGE/BERTScore improvements on high-OOV subsets are statistically significant after multiple-comparison correction.
- [§2] §2 (related work): add a brief discussion of prior vocabulary adaptation methods in non-medical domains to better contextualize the novelty of the medical-specific findings.
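For the multiple-comparison question raised about Table 2 and Figure 3, one standard option is the Holm-Bonferroni step-down procedure. The sketch below is a generic implementation; the p-values are hypothetical placeholders, and nothing here indicates which correction, if any, the authors used.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # Once one test fails, all larger p-values fail too.
    return reject


# Hypothetical p-values for four metric comparisons, for illustration only.
p_values = [0.001, 0.04, 0.03, 0.012]
decisions = holm_bonferroni(p_values)
print(decisions)
```

Holm's procedure controls the family-wise error rate and is never less powerful than a plain Bonferroni correction, which makes it a reasonable default when only a handful of metric comparisons are reported.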
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments have helped us clarify the attribution of performance drops to vocabulary mismatch and better isolate the role of vocabulary extension. We address each major comment below, indicating the revisions incorporated into the updated manuscript.
Point-by-point responses
- Referee: [§4 and §5] §4 (subset analysis) and §5 (results): the performance drop on high-OOV/high-novelty partitions is presented as evidence of vocabulary mismatch, yet no length-matched or topic-matched controls are reported, nor is there a regression isolating OOV rate after conditioning on sequence length and domain entropy. This leaves open the possibility that confounders drive the observed drop rather than tokenization alone.
  Authors: We appreciate this observation and agree that additional controls would strengthen the causal link to vocabulary mismatch. In the revised manuscript, we have added length-matched controls by subsampling high-OOV and low-OOV partitions to equalize sequence length distributions. We have also included a multiple linear regression in the updated §4 that predicts performance while conditioning on sequence length and domain entropy (estimated via LDA topic entropy). The regression shows that OOV rate retains a statistically significant negative coefficient after these controls. These new analyses are reported in the revised §4 and §5.
  Revision: yes
- Referee: [§3.2] §3.2 (vocabulary adaptation via continual pretraining): gains from continued pretraining could arise from additional domain exposure rather than the specific addition of medical tokens. The paper does not include an ablation that holds domain exposure fixed while varying only the vocabulary extension to isolate the contribution of the new tokens.
  Authors: We concur that isolating the contribution of the newly added tokens is necessary. We have therefore added an ablation experiment in the revised §3.2: continued pretraining is performed on the exact same medical corpus and for the same number of steps, once with the original vocabulary and once with the extended vocabulary containing new medical tokens. The results indicate that vocabulary extension yields further gains in summarization metrics beyond those attributable to domain exposure alone. This ablation is now presented in §3.2.
  Revision: yes
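The regression described in the first response (performance predicted from OOV rate while conditioning on sequence length and topic entropy) can be sketched with ordinary least squares. The version below is a dependency-free illustration under stated assumptions: the data are synthetic and noise-free, generated from known coefficients purely to show that the fit recovers them, and the variable names and units are hypothetical. It estimates coefficients only; testing significance would additionally require standard errors.

```python
def ols(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    An intercept column is prepended to every row.
    """
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic rows: (OOV rate, length in hundreds of tokens, topic entropy).
rows = [
    (0.05, 2.0, 1.0), (0.30, 2.5, 1.2), (0.10, 8.0, 2.0),
    (0.40, 7.0, 0.8), (0.20, 4.0, 1.5), (0.35, 3.0, 2.2),
]
# Noise-free toy scores generated from known coefficients (not paper results).
targets = [0.5 - 0.4 * o - 0.01 * l - 0.05 * e for o, l, e in rows]

intercept, b_oov, b_len, b_entropy = ols(rows, targets)
print(b_oov, b_len, b_entropy)
```

A negative `b_oov` that stays clearly below zero once length and entropy are in the model is the pattern the authors report; with real, noisy scores a statistics package that also reports confidence intervals would be the more appropriate tool.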
Circularity Check
No circularity: purely empirical benchmarking with direct measurements
The paper conducts an empirical benchmarking study on LLMs for medical summarization, partitioning data by OOV rate and novelty, testing vocabulary adaptation via continued pretraining, and reporting performance metrics plus human evaluation. No equations, derivations, or predictions are claimed; results are direct experimental outcomes on public datasets with released code. No self-citations serve as load-bearing premises, no fitted parameters are renamed as predictions, and no ansatzes or uniqueness theorems are invoked. The work is self-contained against external benchmarks and does not reduce any claim to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard automatic metrics (ROUGE, etc.) and human faithfulness/relevance judgments adequately capture summarization quality in the medical domain.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We show that LLMs show a significant performance drop for data points with high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue..."
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, embed_injective (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "ScafFix constructs the candidate set for added vocabulary by directly considering the medical words and ignores the tokenization step for forming candidate subwords."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.