
arxiv: 2605.17379 · v1 · pith:NQUAIQAFnew · submitted 2026-05-17 · 💻 cs.CL · cs.AI

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

Pith reviewed 2026-05-20 13:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords vocabulary adaptation · parameter-efficient adaptation · text summarization · domain adaptation · large language models · legal summarization · medical summarization

The pith

Adapting vocabularies by adding domain tokens and replacing under-trained ones improves specialized text summarization while reducing training time and parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often tokenize specialized-domain text inefficiently, which degrades performance on tasks such as summarization. The paper proposes a parameter-efficient approach that expands the vocabulary with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit model growth. The approach is evaluated on legal and medical summarization with Llama-3.1-8B and Qwen2.5-7B, using expert-driven texts and summaries. If it works, it yields summaries that match their references more closely in meaning and use appropriate technical terms. This matters because it addresses a practical bottleneck in applying general-purpose models to professional domains without the cost of full retraining.
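To make "inefficient tokenization" concrete, the sketch below measures fragmentation as the average number of pieces per word (often called fertility) with a toy greedy longest-match tokenizer. The vocabularies, the example word, and the helper names `greedy_tokenize` and `fertility` are illustrative assumptions. Real LLM tokenizers apply BPE merges rather than this greedy rule, and this is not the paper's measurement code.

```python
def greedy_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right.

    Toy stand-in for a subword tokenizer: a character that starts no
    known piece is emitted as a single-character piece.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def fertility(words, vocab):
    """Average number of pieces per word (lower means less fragmentation)."""
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


# Hypothetical general-domain vocabulary, and the same vocabulary after
# one domain token has been added.
general_vocab = {"hep", "ato", "tox", "ic", "ity", "the", "of"}
adapted_vocab = general_vocab | {"hepatotoxicity"}

words = ["hepatotoxicity"]
print(greedy_tokenize("hepatotoxicity", general_vocab))
print(fertility(words, general_vocab), fertility(words, adapted_vocab))
```

In this toy setting the term is split into five pieces under the general vocabulary and one piece after adaptation, which is the kind of over-fragmentation the paper targets.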

Core claim

The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. The method also reduces training time by 35-55% relative to continual pretraining and reduces parameter counts by up to 37% relative to expansion-only methods.

What carries the argument

The unified framework for vocabulary adaptation that augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth.
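The replace-then-expand idea can be sketched on a toy vocabulary as follows. This is a minimal illustration, not the authors' implementation: the function name `adapt_vocabulary`, the placeholder tokens `<junk_a>` and `<junk_b>`, and the assumption that the list of replaceable ids is already known are all hypothetical.

```python
def adapt_vocabulary(vocab, replaceable_ids, domain_tokens):
    """Reuse replaceable slots for domain tokens, then append the rest.

    vocab: dict token -> id, with ids covering 0..len(vocab)-1.
    replaceable_ids: ids assumed to be under-trained or unreachable.
    domain_tokens: candidate tokens, most useful first.
    Returns (new_vocab, reused_ids, appended_ids).
    """
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    wanted = set(domain_tokens)
    # Never free a slot whose current token is itself a wanted domain token.
    free_slots = sorted(
        i for i in set(replaceable_ids) if id_to_token[i] not in wanted
    )
    new_vocab = dict(vocab)
    reused, appended = [], []

    for token in domain_tokens:
        if token in new_vocab:  # already present, or a duplicate candidate
            continue
        if free_slots:
            slot = free_slots.pop(0)
            del new_vocab[id_to_token[slot]]  # drop the old token
            reused.append(slot)               # its embedding row is reused
        else:
            slot = len(new_vocab)             # append a new embedding row
            appended.append(slot)
        new_vocab[token] = slot
    return new_vocab, reused, appended


# Hypothetical toy vocabulary with two replaceable entries.
vocab = {"the": 0, "of": 1, "<junk_a>": 2, "court": 3, "<junk_b>": 4}
new_vocab, reused, appended = adapt_vocabulary(
    vocab, [2, 4], ["appellant", "respondent", "certiorari"]
)
print(reused, appended, len(new_vocab))
```

Only the tokens that do not fit into a reused slot enlarge the embedding table, which is where the limit on parameter growth comes from. In a real model the reused rows would presumably need to be re-initialized before further training.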

If this is right

  • The adapted summarization models achieve higher semantic similarity with reference summaries.
  • Summaries include more novel and domain-specific words.
  • Generated summaries exhibit improved coherence, relevance, and faithfulness.
  • Training time is reduced by 35-55% compared to continual pretraining.
  • Parameter counts are reduced by up to 37% relative to methods that only expand the vocabulary.
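The parameter argument comes down to simple arithmetic on embedding rows, sketched below. All numbers are hypothetical and chosen only for illustration; they are not taken from the paper, and the reported 37% figure may be computed on a different basis. The helper `added_parameters` and the untied input/output embedding assumption are likewise assumptions.

```python
def added_parameters(new_rows, hidden_size, tied_embeddings=False):
    """Parameters added by new vocabulary rows.

    Each new token adds one row to the input embedding matrix and, when
    the output projection is not tied to it, one row there as well.
    """
    matrices = 1 if tied_embeddings else 2
    return new_rows * hidden_size * matrices


# Hypothetical setting: 10,000 domain tokens, 6,000 of which can be
# placed into replaced slots, with a hidden size of 4096.
hidden_size = 4096
domain_tokens = 10_000
replaced_slots = 6_000

expansion_only = added_parameters(domain_tokens, hidden_size)
replace_then_expand = added_parameters(domain_tokens - replaced_slots, hidden_size)
saving = 1 - replace_then_expand / expansion_only

print(expansion_only, replace_then_expand, round(saving, 2))
```

The saving on added parameters equals the fraction of domain tokens that land in reused slots, so it depends directly on how many existing tokens are judged replaceable.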

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar vocabulary replacement strategies could benefit other domain-specific NLP tasks like classification or generation in scientific literature.
  • The efficiency gains might allow for quicker iteration in adapting models to new specialized areas without requiring extensive computational resources.
  • Testing the impact on the model's performance on general-domain tasks after adaptation would reveal any trade-offs not explored in the specialized focus.

Load-bearing premise

Replacing some under-trained tokens with domain ones keeps the model effective on both general and specialized texts without significant degradation.

What would settle it

If evaluations on the legal and medical summarization tasks show no gains in semantic similarity or no reductions in training time and parameters compared to baselines, the benefits of the vocabulary adaptation approach would be called into question.

Figures

Figures reproduced from arXiv: 2605.17379 by Gunjan Balde, Mainack Mondal, Niloy Ganguly, Soumyadeep Roy.

Figure 1: Median novel unigram concentration observed…
Figure 2: We report the score distribution as obtained…
Figure 3: We report the score distribution as obtained…
Figure 4: Annotation instructions shown to the partici…
Original abstract

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviate performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks on a challenge-oriented evaluation protocol focused on expert-driven text and summaries which typically has higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduce training time by $35-55\%$ over continual pretraining and reduce parameter counts up to $37\%$ w.r.t expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a parameter-efficient vocabulary adaptation method for LLMs applied to specialized text summarization in legal and medical domains. It augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to constrain parameter growth, then combines this with continued pretraining. Experiments on Llama-3.1-8B and Qwen2.5-7B report gains in semantic similarity to references, increased use of novel/domain-appropriate words, better coherence/relevance/faithfulness, 35-55% reductions in training time versus continual pretraining, and up to 37% fewer parameters versus expansion-only baselines. The codebase is released publicly.

Significance. If the empirical claims hold after verification, the work would be moderately significant for practical domain adaptation of LLMs, as it targets the vocabulary mismatch that continual pretraining alone does not fix while controlling compute and parameter costs. The public release of the codebase supports reproducibility and is a clear strength. However, the central efficiency and quality claims rest on comparisons whose robustness depends on controls that are not yet visible in the provided description.

major comments (2)
  1. Abstract and Evaluation section: The central claim that selective replacement improves specialized summarization quality without materially harming general-domain handling is load-bearing, yet no perplexity or summarization metrics on general-domain data (e.g., CNN/DailyMail) or ablation of replacement versus pure expansion are reported. This omission leaves open the possibility that efficiency gains mask capability trade-offs.
  2. Abstract: Claims of improved semantic similarity, domain-word usage, coherence, relevance, and faithfulness are stated without any numerical values, specific metrics (ROUGE, BERTScore, etc.), baselines, or statistical tests. These quantities are required to assess whether the observed differences are meaningful and reproducible.
minor comments (2)
  1. Abstract: Grammatical issue in 'significantly reduce training time' (should be 'significantly reduces').
  2. Abstract: The phrase 'challenge-oriented evaluation protocol focused on expert-driven text' is underspecified; a brief definition or reference to the protocol would improve clarity for readers unfamiliar with the datasets.
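On the general-domain perplexity check requested in major comment 1: per-token perplexity is not directly comparable between models whose tokenizers differ, because the same text is split into different numbers of tokens. One common workaround is to normalize the total negative log-likelihood by a tokenizer-independent unit such as bytes. The sketch below illustrates this with made-up log-probabilities; the helper names and the numbers are assumptions, and nothing here reflects results from the paper.

```python
import math


def perplexity(token_logprobs):
    """Per-token perplexity from natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bits_per_byte(token_logprobs, text):
    """Total negative log-likelihood in bits per UTF-8 byte of the text."""
    total_nll_bits = -sum(token_logprobs) / math.log(2)
    return total_nll_bits / len(text.encode("utf-8"))


text = "hepatotoxicity"

# Hypothetical base model: five pieces, each predicted with probability 1/2.
base_logprobs = [math.log(0.5)] * 5
# Hypothetical adapted model: one token predicted with probability 1/32.
adapted_logprobs = [math.log(1 / 32)]

print(perplexity(base_logprobs), perplexity(adapted_logprobs))
print(bits_per_byte(base_logprobs, text), bits_per_byte(adapted_logprobs, text))
```

Both hypothetical models assign the same total probability to the text, yet their per-token perplexities differ by a factor of 16, while bits per byte are identical. A comparison of replacement against pure expansion would need this kind of normalization to be meaningful.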

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects for strengthening the presentation of our results on general-domain preservation and the concreteness of quantitative claims. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: Abstract and Evaluation section: The central claim that selective replacement improves specialized summarization quality without materially harming general-domain handling is load-bearing, yet no perplexity or summarization metrics on general-domain data (e.g., CNN/DailyMail) or ablation of replacement versus pure expansion are reported. This omission leaves open the possibility that efficiency gains mask capability trade-offs.

    Authors: We agree that explicit verification of preserved general-domain capabilities strengthens the central claim. The current experiments prioritize domain-specific summarization under a challenge-oriented protocol with high OOV concentration, as this is the primary use case. To directly address the concern, we will add perplexity and summarization evaluations on a general-domain benchmark such as CNN/DailyMail, along with an ablation comparing selective replacement to pure expansion. These results will be included in the revised Evaluation section and referenced in the abstract.
    revision: yes

  2. Referee: Abstract: Claims of improved semantic similarity, domain-word usage, coherence, relevance, and faithfulness are stated without any numerical values, specific metrics (ROUGE, BERTScore, etc.), baselines, or statistical tests. These quantities are required to assess whether the observed differences are meaningful and reproducible.

    Authors: The abstract is intentionally concise and summarizes the direction of improvements. Concrete numerical results (including ROUGE, BERTScore, domain-word usage statistics, coherence/relevance/faithfulness scores, baselines, and statistical significance) are reported in the Experiments section with tables and figures. We will revise the abstract to incorporate a small number of key quantitative highlights (e.g., relative gains and efficiency percentages already stated) while maintaining brevity, thereby making the claims more self-contained.
    revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method with external baselines

Full rationale

The paper proposes a vocabulary adaptation method (augment + selective replacement) and evaluates it empirically against continual pretraining and expansion-only baselines on legal/medical summarization tasks using Llama-3.1-8B and Qwen2.5-7B. No equations, self-definitional loops, or load-bearing self-citations appear in the provided abstract or description; performance claims rest on measured improvements in semantic similarity, coherence, and efficiency metrics rather than quantities defined in terms of the fitted outcomes themselves. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The approach depends on empirical choices for which tokens to add or replace and on the representativeness of the legal/medical evaluation sets; these are not derived from first principles.

free parameters (1)
  • Token replacement selection criteria
    The method requires deciding which under-trained or unreachable tokens to drop; this choice is a modeling hyperparameter that affects parameter count and performance.
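    To show why the selection criterion acts as a free parameter, here is a minimal sketch that flags tokens as replaceable when they occur fewer than `min_count` times in a reference corpus. The frequency-based rule, the function name `select_replaceable`, the `protected` set, and the toy data are all assumptions made for illustration; the page does not state which criterion the paper actually uses.

```python
from collections import Counter


def select_replaceable(vocab, tokenized_corpus, min_count=1, protected=()):
    """Return ids of tokens seen fewer than min_count times in the corpus.

    vocab: dict token -> id.
    tokenized_corpus: iterable of token lists.
    protected: tokens that must never be replaced (e.g. special tokens).
    """
    counts = Counter(tok for doc in tokenized_corpus for tok in doc)
    return sorted(
        idx
        for tok, idx in vocab.items()
        if counts[tok] < min_count and tok not in protected
    )


# Hypothetical toy vocabulary and corpus.
toy_vocab = {"the": 0, "of": 1, "<junk_a>": 2, "court": 3, "<junk_b>": 4, "<eos>": 5}
toy_corpus = [["the", "court", "of"], ["the", "of"]]

strict = select_replaceable(toy_vocab, toy_corpus, min_count=1, protected={"<eos>"})
loose = select_replaceable(toy_vocab, toy_corpus, min_count=2, protected={"<eos>"})
print(strict, loose)
```

    Raising the threshold from 1 to 2 also marks the rarely seen token `court` as replaceable, so the threshold trades parameter savings against the risk of dropping tokens that general-domain text still needs.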


