Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

Gunjan Balde; Mainack Mondal; Niloy Ganguly; Soumyadeep Roy

arxiv: 2605.17379 · v1 · pith:NQUAIQAFnew · submitted 2026-05-17 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-20 13:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords vocabulary adaptation · parameter-efficient adaptation · text summarization · domain adaptation · large language models · legal summarization · medical summarization

The pith

Adapting vocabularies by adding domain tokens and replacing under-trained ones improves specialized text summarization while reducing training time and parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often tokenize specialized domain text inefficiently, splitting domain terms into many subword fragments, which hurts tasks like summarization. The paper proposes a parameter-efficient approach that expands the vocabulary with relevant domain-specific tokens while selectively replacing under-trained tokens to control model size. It is evaluated on legal and medical summarization using Llama-3.1-8B and Qwen2.5-7B on expert-driven texts. If the approach works, it yields higher-quality summaries that match their references more closely in meaning and use appropriate technical terms. This matters because it addresses a practical bottleneck in applying general models to professional domains without the high cost of full retraining.
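The tokenization inefficiency described above is often measured as fertility, the average number of subword pieces per word. The sketch below is a minimal illustration, not the paper's evaluation code: `greedy_tokenize` is a hypothetical longest-match stand-in for a real BPE tokenizer, and the tiny vocabularies are invented for the example.

```python
from typing import Iterable, List, Set


def greedy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Split a word by longest-prefix match, falling back to single characters.

    Assumption: this is a toy stand-in; real LLM tokenizers apply BPE merges.
    """
    pieces: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, down to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def fertility(words: Iterable[str], vocab: Set[str]) -> float:
    """Average number of subword pieces per word (lower means less fragmentation)."""
    words = list(words)
    total = sum(len(greedy_tokenize(w, vocab)) for w in words)
    return total / len(words)


# Invented example data: a general vocabulary that lacks a medical term.
general_vocab = {"the", "patient", "has", "hyper", "tension"}
adapted_vocab = general_vocab | {"hypertension"}
sentence = ["the", "patient", "has", "hypertension"]

before = fertility(sentence, general_vocab)
after = fertility(sentence, adapted_vocab)
print(before, after)
```

Here the domain word is split into two pieces under the general vocabulary and into one piece once the domain token is added, so fertility drops. The same measurement applied to a real tokenizer and corpus would show how fragmented a domain's terminology is.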

Core claim

The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. The method also reduces training time by 35-55% relative to continual pretraining and reduces parameter counts by up to 37% relative to expansion-only methods.

What carries the argument

The unified framework for vocabulary adaptation that augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth.
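The replace-then-expand idea can be sketched on a plain token-to-id map. This is a hedged illustration under stated assumptions, not the authors' implementation: the function name `replace_then_expand`, the placeholder tokens, and the policy of filling the lowest free slot first are all hypothetical, and a real BPE tokenizer would also need its merge rules updated.

```python
from typing import Dict, List, Set


def replace_then_expand(
    vocab: Dict[str, int],
    domain_tokens: List[str],
    replaceable_ids: Set[int],
) -> Dict[str, int]:
    """Return a new token->id map that reuses replaceable slots before growing.

    Assumption: `replaceable_ids` were already chosen by some criterion
    (for example, under-trained or unreachable tokens).
    """
    new_vocab = dict(vocab)
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    wanted = set(domain_tokens)
    # Slots we may overwrite, skipping any token we actually want to keep.
    free_slots = sorted(
        i for i in replaceable_ids
        if i in id_to_token and id_to_token[i] not in wanted
    )
    next_id = max(vocab.values()) + 1 if vocab else 0

    for token in domain_tokens:
        if token in new_vocab:
            continue  # already a single token, nothing to do
        if free_slots:
            slot = free_slots.pop(0)
            del new_vocab[id_to_token[slot]]  # drop the replaced token
            new_vocab[token] = slot           # reuse its existing embedding row
        else:
            new_vocab[token] = next_id        # expansion: needs a new embedding row
            next_id += 1
    return new_vocab


# Invented example: two replaceable slots, three domain tokens to add.
base_vocab = {"the": 0, "court": 1, "<junk_a>": 2, "<junk_b>": 3}
adapted = replace_then_expand(base_vocab, ["plaintiff", "habeas", "tort"], {2, 3})
print(adapted)
```

In this toy case two domain tokens take over existing ids and only one new id is appended, so the vocabulary grows by one entry instead of three. That is the mechanism by which replacement limits parameter growth.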

If this is right

The adapted summarization models achieve higher semantic similarity with reference summaries.
Summaries include more novel and domain-specific words.
Generated summaries exhibit improved coherence, relevance, and faithfulness.
Training time is reduced by 35-55% compared to continual pretraining.
Parameter counts are reduced by up to 37% relative to methods that only expand the vocabulary.
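To see why replacement limits parameter growth, a rough back-of-the-envelope calculation helps. The sketch below counts only the embedding rows that must be newly allocated. All numbers (10,000 domain tokens, 6,000 replaced slots, hidden size 4096, untied input and output embeddings) are illustrative assumptions and are not taken from the paper, so the result is not meant to reproduce the reported 37% figure, whose exact accounting is not described here.

```python
def added_embedding_params(n_new_rows: int, hidden_size: int, tied: bool = False) -> int:
    """Parameters added by allocating `n_new_rows` new vocabulary entries.

    Each new entry adds one row of size `hidden_size` to the input embedding,
    and a second row to the output projection when the two are not tied.
    """
    matrices = 1 if tied else 2
    return n_new_rows * hidden_size * matrices


# Illustrative assumptions only.
hidden_size = 4096
domain_tokens = 10_000
replaced_slots = 6_000  # domain tokens that reuse existing rows

expansion_only = added_embedding_params(domain_tokens, hidden_size)
replace_then_expand_cost = added_embedding_params(domain_tokens - replaced_slots, hidden_size)
print(expansion_only, replace_then_expand_cost)
```

Under these assumed numbers, expansion-only allocates 81,920,000 new embedding parameters while replace-then-expand allocates 32,768,000, because replaced slots reuse rows that already exist.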

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar vocabulary replacement strategies could benefit other domain-specific NLP tasks like classification or generation in scientific literature.
The efficiency gains might allow for quicker iteration in adapting models to new specialized areas without requiring extensive computational resources.
Testing the impact on the model's performance on general-domain tasks after adaptation would reveal any trade-offs not explored in the specialized focus.

Load-bearing premise

Replacing some under-trained tokens with domain ones keeps the model effective on both general and specialized texts without significant degradation.

What would settle it

If evaluations on the legal and medical summarization tasks show no gains in semantic similarity or no reductions in training time and parameters compared to baselines, the benefits of the vocabulary adaptation approach would be called into question.

Figures

Figures reproduced from arXiv: 2605.17379 by Gunjan Balde, Mainack Mondal, Niloy Ganguly, Soumyadeep Roy.

**Figure 2.** We report the score distribution as obtained…

**Figure 3.** We report the score distribution as obtained…

**Figure 4.** Annotation instructions shown to the partici…

Original abstract

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviate performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks on a challenge-oriented evaluation protocol focused on expert-driven text and summaries which typically has higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduce training time by $35-55\%$ over continual pretraining and reduce parameter counts up to $37\%$ w.r.t expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a practical tokenizer tweak for domain summarization by adding needed tokens and swapping out under-trained ones, which cuts training time and parameters while lifting output quality on legal and medical tasks.

The main thing here is a method that adapts LLM vocabularies for specialized summarization without the full cost of continual pretraining. They augment the tokenizer with domain tokens and selectively replace under-trained or unreachable ones to limit parameter growth, then fine-tune for summarization on Llama-3.1-8B and Qwen2.5-7B. The reported gains include better semantic similarity to references, more use of appropriate domain words, and training time reductions of 35-55% versus pretraining baselines plus up to 37% fewer parameters than expansion-only approaches. The public codebase helps anyone wanting to inspect or reuse the token selection logic and evaluation setup on expert-driven legal and medical documents.

Referee Report

2 major / 2 minor

Summary. The paper proposes a parameter-efficient vocabulary adaptation method for LLMs applied to specialized text summarization in legal and medical domains. It augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to constrain parameter growth, then combines this with continued pretraining. Experiments on Llama-3.1-8B and Qwen2.5-7B report gains in semantic similarity to references, increased use of novel/domain-appropriate words, better coherence/relevance/faithfulness, 35-55% reductions in training time versus continual pretraining, and up to 37% fewer parameters versus expansion-only baselines. The codebase is released publicly.

Significance. If the empirical claims hold after verification, the work would be moderately significant for practical domain adaptation of LLMs, as it targets the vocabulary mismatch that continual pretraining alone does not fix while controlling compute and parameter costs. The public release of the codebase supports reproducibility and is a clear strength. However, the central efficiency and quality claims rest on comparisons whose robustness depends on controls that are not yet visible in the provided description.

major comments (2)

Abstract and Evaluation section: The central claim that selective replacement improves specialized summarization quality without materially harming general-domain handling is load-bearing, yet no perplexity or summarization metrics on general-domain data (e.g., CNN/DailyMail) or ablation of replacement versus pure expansion are reported. This omission leaves open the possibility that efficiency gains mask capability trade-offs.
Abstract: Claims of improved semantic similarity, domain-word usage, coherence, relevance, and faithfulness are stated without any numerical values, specific metrics (ROUGE, BERTScore, etc.), baselines, or statistical tests. These quantities are required to assess whether the observed differences are meaningful and reproducible.

minor comments (2)

Abstract: Grammatical issue in 'significantly reduce training time' (should be 'significantly reduces').
Abstract: The phrase 'challenge-oriented evaluation protocol focused on expert-driven text' is underspecified; a brief definition or reference to the protocol would improve clarity for readers unfamiliar with the datasets.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects for strengthening the presentation of our results on general-domain preservation and the concreteness of quantitative claims. We address each major comment below and outline the revisions we will make.

Referee: Abstract and Evaluation section: The central claim that selective replacement improves specialized summarization quality without materially harming general-domain handling is load-bearing, yet no perplexity or summarization metrics on general-domain data (e.g., CNN/DailyMail) or ablation of replacement versus pure expansion are reported. This omission leaves open the possibility that efficiency gains mask capability trade-offs.

Authors: We agree that explicit verification of preserved general-domain capabilities strengthens the central claim. The current experiments prioritize domain-specific summarization under a challenge-oriented protocol with high OOV concentration, as this is the primary use case. To directly address the concern, we will add perplexity and summarization evaluations on a general-domain benchmark such as CNN/DailyMail, along with an ablation comparing selective replacement to pure expansion. These results will be included in the revised Evaluation section and referenced in the abstract.

Revision: yes

Referee: Abstract: Claims of improved semantic similarity, domain-word usage, coherence, relevance, and faithfulness are stated without any numerical values, specific metrics (ROUGE, BERTScore, etc.), baselines, or statistical tests. These quantities are required to assess whether the observed differences are meaningful and reproducible.

Authors: The abstract is intentionally concise and summarizes the direction of improvements. Concrete numerical results (including ROUGE, BERTScore, domain-word usage statistics, coherence/relevance/faithfulness scores, baselines, and statistical significance) are reported in the Experiments section with tables and figures. We will revise the abstract to incorporate a small number of key quantitative highlights (e.g., relative gains and efficiency percentages already stated) while maintaining brevity, thereby making the claims more self-contained.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method with external baselines

The paper proposes a vocabulary adaptation method (augment + selective replacement) and evaluates it empirically against continual pretraining and expansion-only baselines on legal/medical summarization tasks using Llama-3.1-8B and Qwen2.5-7B. No equations, self-definitional loops, or load-bearing self-citations appear in the provided abstract or description; performance claims rest on measured improvements in semantic similarity, coherence, and efficiency metrics rather than quantities defined in terms of the fitted outcomes themselves. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The approach depends on empirical choices for which tokens to add or replace and on the representativeness of the legal/medical evaluation sets; these are not derived from first principles.

free parameters (1)

Token replacement selection criteria
The method requires deciding which under-trained or unreachable tokens to drop; this choice is a modeling hyperparameter that affects parameter count and performance.
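One way to make the "unreachable" part of this selection concrete is a round-trip check: a token is treated as unreachable if encoding its own string does not yield that single token. The sketch below is an assumption-laden illustration, not the paper's criterion. `bpe_encode` is a simplified toy encoder that applies merges in priority order, and the vocabulary and merge list are invented. Detecting under-trained tokens would additionally need model weights or frequency statistics, which this sketch does not cover.

```python
from typing import List, Tuple


def bpe_encode(text: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Simplified BPE: start from characters, apply each merge rule in order."""
    pieces = list(text)
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)  # fuse the adjacent pair
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


def unreachable_tokens(vocab: List[str], merges: List[Tuple[str, str]]) -> List[str]:
    """Tokens whose own string does not encode back to exactly that token."""
    return [tok for tok in vocab if bpe_encode(tok, merges) != [tok]]


# Invented example: "abc" is in the vocabulary, but no merge ever produces it.
toy_vocab = ["a", "b", "c", "ab", "bc", "abc"]
toy_merges = [("a", "b"), ("b", "c")]
candidates = unreachable_tokens(toy_vocab, toy_merges)
print(candidates)
```

In the toy vocabulary only "abc" fails the round trip, since the merges turn it into "ab" followed by "c". Such tokens occupy an embedding row that the tokenizer never emits, which makes them natural candidates for replacement.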

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · 5 internal anchors

[1]

BMC bioinformatics , volume=

An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition , author=. BMC bioinformatics , volume=. 2015 , publisher=

work page 2015
[2]

Proceedings of the Australasian Language Technology Association Workshop 2011 , pages=

Development of a corpus for evidence based medicine summarisation , author=. Proceedings of the Australasian Language Technology Association Workshop 2011 , pages=. 2011 , organization=

work page 2011
[3]

PubMedQA: A Dataset for Biomedical Research Question Answering , author=. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pages=

work page 2019
[4]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Mistral 7B

Mistral 7B , author=. arXiv preprint arXiv:2310.06825 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

work page
[7]

Text summarization branches out , pages=

Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=

work page
[8]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

work page 2023
[9]

Nature medicine , volume=

Adapted large language models can outperform medical experts in clinical text summarization , author=. Nature medicine , volume=. 2024 , publisher=

work page 2024
[10]

Computers in biology and medicine , volume=

A comprehensive evaluation of large language models on benchmark biomedical text processing tasks , author=. Computers in biology and medicine , volume=. 2024 , publisher=

work page 2024
[11]

2024 , eprint=

A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry , author=. 2024 , eprint=

work page 2024
[12]

Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLM s

Liu, Chengyuan and Wang, Shihang and others. Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLM s. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.424

work page doi:10.18653/v1/2024.emnlp-main.424 2024
[13]

Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond

Liu, Siyang and Deng, Naihao and others. Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.944

work page doi:10.18653/v1/2023.emnlp-main.944 2023
[14]

Neural Machine Translation of Rare Words with Subword Units

Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016. doi:10.18653/v1/P16-1162

work page doi:10.18653/v1/p16-1162 2016
[15]

Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models

Balde, Gunjan and Roy, Soumyadeep and others. Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.863

work page doi:10.18653/v1/2024.findings-emnlp.863 2024
[16]

2024 , month =

Balde, Gunjan and Roy, Soumyadeep and others , booktitle =. 2024 , month =

work page 2024
[17]

BART : Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Lewis, Mike and Liu, Yinhan and others. BART : Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.703

work page doi:10.18653/v1/2020.acl-main.703 2020
[18]

Proceedings of the 37th International Conference on Machine Learning , articleno =

Zhang, Jingqing and Zhao, Yao and others , title =. Proceedings of the 37th International Conference on Machine Learning , articleno =. 2020 , publisher =

work page 2020
[19]

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Chizhov, Pavel and Arnett, Catherine and others. BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.925

work page doi:10.18653/v1/2024.emnlp-main.925 2024
[20]

BPE -knockout: Pruning Pre-existing BPE Tokenisers with Backwards-compatible Morphological Semi-supervision

Bauwens, Thomas and Delobelle, Pieter. BPE -knockout: Pruning Pre-existing BPE Tokenisers with Backwards-compatible Morphological Semi-supervision. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.324

work page doi:10.18653/v1/2024.naacl-long.324 2024
[21]

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

Cognetta, Marco and Hiraoka, Tatsuya and others. An Analysis of BPE Vocabulary Trimming in Neural Machine Translation. Proceedings of the Fifth Workshop on Insights from Negative Results in NLP. 2024. doi:10.18653/v1/2024.insights-1.7

work page doi:10.18653/v1/2024.insights-1.7 2024
[22]

PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods , author =

work page
[23]

2024 , eprint=

Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca , author=. 2024 , eprint=

work page 2024
[24]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

VE-KD: Vocabulary-Expansion Knowledge-Distillation for Training Smaller Domain-Specific Language Models , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

work page 2024
[25]

2024 , eprint=

Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model , author=. 2024 , eprint=

work page 2024
[26]

Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLM s

Nag, Arijit and Mukherjee, Animesh and others. Cost-Performance Optimization for Processing Low-Resource Language Tasks Using Commercial LLM s. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.920

work page doi:10.18653/v1/2024.findings-emnlp.920 2024
[27]

Text Summarization with Pretrained Encoders

Liu, Yang and Lapata, Mirella. Text Summarization with Pretrained Encoders. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1387

work page doi:10.18653/v1/d19-1387 2019
[28]

Don ' t Stop Pretraining: Adapt Language Models to Domains and Tasks

Gururangan, Suchin and Marasovi \'c , Ana and others. Don ' t Stop Pretraining: Adapt Language Models to Domains and Tasks. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.740

work page doi:10.18653/v1/2020.acl-main.740 2020
[29]

2022 , journal =

Hofmann, Valentin and Schuetze, Hinrich and Pierrehumbert, Janet. An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2022. doi:10.18653/v1/2022.acl-short.43

work page doi:10.18653/v1/2022.acl-short.43 2022
[30]

Language Model Tokenizers Introduce Unfairness Between Languages , volume =

Petrov, Aleksandar and La Malfa, Emanuele and others , booktitle =. Language Model Tokenizers Introduce Unfairness Between Languages , volume =

work page
[31]

Edward J Hu and yelong shen and others , booktitle=. Lo

work page
[32]

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

Rust, Phillip and Pfeiffer, Jonas and others. How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.243

work page doi:10.18653/v1/2021.acl-long.243 2021
[33]

arXiv preprint arXiv:2406.11477 , year=

How Can We Effectively Expand the Vocabulary of LLMs with 0.01 GB of Target Language Text? , author=. arXiv preprint arXiv:2406.11477 , year=

work page arXiv
[34]

Exploring Design Choices for Building Language-Specific LLM s

Tejaswi, Atula and Gupta, Nilesh and Choi, Eunsol. Exploring Design Choices for Building Language-Specific LLM s. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.614

work page doi:10.18653/v1/2024.findings-emnlp.614 2024
[35]

arXiv preprint arXiv:2409.00133 , year=

A survey for large language models in biomedicine , author=. arXiv preprint arXiv:2409.00133 , year=

work page arXiv
[36]

Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =

Dettmers, Tim and Pagnoni, Artidoro and others , title =. Proceedings of the 37th International Conference on Neural Information Processing Systems , articleno =. 2024 , publisher =

work page 2024
[37]

Can language models learn from explanations in context?

Lampinen, Andrew and Dasgupta, Ishita and others. Can language models learn from explanations in context?. Findings of the Association for Computational Linguistics: EMNLP 2022. 2022. doi:10.18653/v1/2022.findings-emnlp.38

work page doi:10.18653/v1/2022.findings-emnlp.38 2022
[38]

MedIR workshop, SIGIR , year=

Quickumls: a fast, unsupervised approach for medical concept extraction , author=. MedIR workshop, SIGIR , year=

work page
[39]

Deeper Insights Without Updates: The Power of In-Context Learning Over Fine-Tuning

Yin, Qingyu and He, Xuzheng and others. Deeper Insights Without Updates: The Power of In-Context Learning Over Fine-Tuning. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.239

work page doi:10.18653/v1/2024.findings-emnlp.239 2024
[40]

AV oca D o: Strategy for Adapting Vocabulary to Downstream Domain

Hong, Jimin and Kim, TaeHee and others. AV oca D o: Strategy for Adapting Vocabulary to Downstream Domain. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.385

work page doi:10.18653/v1/2021.emnlp-main.385 2021
[41]

and Kry \'s ci \'n ski, Wojciech and others

Fabbri, Alexander R. and Kry \'s ci \'n ski, Wojciech and others. S umm E val: Re-evaluating Summarization Evaluation. Transactions of the Association for Computational Linguistics. 2021. doi:10.1162/tacl_a_00373

work page doi:10.1162/tacl_a_00373 2021
[42]

Proceedings of the AAAI Conference on Artificial Intelligence , author=

Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2025 , month=. doi:10.1609/aaai.v39i23.34633 , abstractNote=

work page doi:10.1609/aaai.v39i23.34633 2025
[43]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

and Manning, Christopher D

See, Abigail and Liu, Peter J. and Manning, Christopher D. Get To The Point: Summarization with Pointer-Generator Networks. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1099

work page doi:10.18653/v1/p17-1099 2017
[45]

Vocabulary Learning via Optimal Transport for Neural Machine Translation

Xu, Jingjing and Zhou, Hao and others. Vocabulary Learning via Optimal Transport for Neural Machine Translation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.571

work page doi:10.18653/v1/2021.acl-long.571 2021
[46]

Entropy-guided Vocabulary Augmentation of Multilingual Language Models for Low-resource Tasks

Nag, Arijit and Samanta, Bidisha and others. Entropy-guided Vocabulary Augmentation of Multilingual Language Models for Low-resource Tasks. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.548

work page doi:10.18653/v1/2023.findings-acl.548 2023
[47]

Retrieval-Augmented Domain Adaptation of Language Models

Xu, Benfeng and Zhao, Chunxu and others. Retrieval-Augmented Domain Adaptation of Language Models. Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023). 2023. doi:10.18653/v1/2023.repl4nlp-1.5

work page doi:10.18653/v1/2023.repl4nlp-1.5 2023
[48]

Tai, Wen and Kung, H. T. and others. ex BERT : Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.129

work page doi:10.18653/v1/2020.findings-emnlp.129 2020
[49]

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

Sachidananda, Vin and Kessler, Jason and Lai, Yi-An. Efficient Domain Adaptation of Language Models via Adaptive Tokenization. Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing. 2021. doi:10.18653/v1/2021.sustainlp-1.16

work page doi:10.18653/v1/2021.sustainlp-1.16 2021
[50]

Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022) - HEALTHINF , year=

Anastasios Lamproudis and Aron Henriksson and others , title=. Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022) - HEALTHINF , year=. doi:10.5220/0010893800003123 , isbn=

work page doi:10.5220/0010893800003123 2022
[51]

Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation

Diao, Shizhe and Xu, Ruijia and others. Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl...

work page doi:10.18653/v1/2021.acl-long.259 2021
[52]

B io BART : Pretraining and Evaluation of A Biomedical Generative Language Model

Yuan, Hongyi and Yuan, Zheng and others. B io BART : Pretraining and Evaluation of A Biomedical Generative Language Model. Proceedings of the 21st Workshop on Biomedical Language Processing. 2022. doi:10.18653/v1/2022.bionlp-1.9

work page doi:10.18653/v1/2022.bionlp-1.9 2022
[53]

Journal of the American Medical Informatics Association , volume=

Enhancing clinical concept extraction with contextual embeddings , author=. Journal of the American Medical Informatics Association , volume=. 2019 , publisher=

work page 2019
[54]

Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association , pages=

Investigating the effect of lexical segmentation in transformer-based models on medical datasets , author=. Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association , pages=

work page
[55]

B io M istral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains

Labrak, Yanis and Bazoge, Adrien and others. B io M istral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.348

work page doi:10.18653/v1/2024.findings-acl.348 2024
[56]

2025 , eprint=

Qwen2.5 Technical Report , author=. 2025 , eprint=

work page 2025
[57]

International Conference on Pattern Recognition , pages=

Beyond relevant documents: A knowledge-intensive approach for query-focused summarization using large language models , author=. International Conference on Pattern Recognition , pages=. 2025 , organization=

work page 2025
[58]

On the Summarization of Consumer Health Questions

Ben Abacha, Asma and Demner-Fushman, Dina. On the Summarization of Consumer Health Questions. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019

work page 2019
[59]

Overview of the MEDIQA -Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient Conversations

Ben Abacha, Asma and Yim, Wen-wai and others. Overview of the MEDIQA -Chat 2023 Shared Tasks on the Summarization & Generation of Doctor-Patient Conversations. Proceedings of the 5th Clinical Natural Language Processing Workshop. 2023. doi:10.18653/v1/2023.clinicalnlp-1.52

work page doi:10.18653/v1/2023.clinicalnlp-1.52 2023
[60]

doi:10.5281/zenodo.15517617 , url =

Rodríguez Ortega, Miguel and Rodríguez López, Eduard and others , title =. doi:10.5281/zenodo.15517617 , url =

work page doi:10.5281/zenodo.15517617
[61]

Balde, Gunjan and Roy, Soumyadeep and others. Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings. Findings of the Association for Computational Linguistics: ACL 2025. 2025.
[62]

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. 2023. arXiv:2311.16079
[63]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[64]

Land, Sander and Bartolo, Max. Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.649
[65]

Not all tokens are what you need for pretraining. Advances in Neural Information Processing Systems.
[66]

Token Reduction Should Go Beyond Efficiency in Generative Models--From Vision, Language to Multimodality. arXiv preprint arXiv:2505.18227.
[67]

Gemma 3 Technical Report. 2025.
[68]

Phi-4 Technical Report. arXiv preprint arXiv:2412.08905.
[69]

Luo, Guanting and Arase, Yuki. MedSummRAG: Domain-Specific Retrieval for Medical Summarization. Proceedings of the 24th Workshop on Biomedical Language Processing. 2025.
[70]

Byte Latent Transformer: Patches Scale Better Than Tokens. 2024.
[71]

PMC-LLaMA: toward building open-source language models for medicine. Journal of the American Medical Informatics Association. 2024.
[72]

Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH). 2021.
[73]

Ahia, Orevaoghene and Kumar, Sachin and others. Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.614
[74]

Guy Kaplan and Matanel Oren and others. From Tokens to Words: On the Inner Lexicon of. 2025.
[75]

BERTScore: Evaluating Text Generation with BERT. International Conference on Learning Representations.
[76]

BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020.
[77]

Paul, Shounak and Mandal, Arpan and others. Proceedings of the Nineteenth International Conference on Artificial Intelligence and Law. 2023. doi:10.1145/3594536.3595165
[78]

Beltagy, Iz and Lo, Kyle and Cohan, Arman. SciBERT: A Pretrained Language Model for Scientific Text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1371
[79]

Purason, Taido and Chizhov, Pavel and others. Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pretrained Models. Findings of the Association for Computational Linguistics: EACL 2026. 2026. doi:10.18653/v1/2026.findings-eacl.341
[80]

Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic. 2025.
