Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization
Pith reviewed 2026-05-20 13:20 UTC · model grok-4.3
The pith
Adapting vocabularies by adding domain tokens and replacing under-trained ones improves specialized text summarization while reducing training time and parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. The method also significantly reduces training time, by 35-55% relative to continual pretraining, and reduces parameter counts by up to 37% relative to expansion-only methods.
What carries the argument
The unified framework for vocabulary adaptation that augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth.
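The replace-then-expand idea can be sketched on a toy vocabulary. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `adapt_vocab`, the toy token strings, and the set of ids treated as replaceable are all hypothetical, and the sketch assumes the domain tokens are not already in the vocabulary.

```python
def adapt_vocab(vocab, replaceable_ids, domain_tokens):
    """Return a new token->id map: reuse replaceable slots first, then append.

    vocab           -- dict mapping token string to integer id (ids 0..n-1)
    replaceable_ids -- ids judged safe to overwrite (hypothetical selection)
    domain_tokens   -- new domain-specific token strings to add
    """
    id_to_token = {i: t for t, i in vocab.items()}
    seen = set(vocab)
    free_slots = sorted(replaceable_ids)
    next_id = len(vocab)
    for token in domain_tokens:
        if token in seen:
            continue                      # skip duplicates and existing tokens
        if free_slots:
            slot = free_slots.pop(0)      # overwrite an existing slot: no growth
        else:
            slot = next_id                # no slot left: grow the vocabulary
            next_id += 1
        id_to_token[slot] = token
        seen.add(token)
    return {t: i for i, t in id_to_token.items()}


# Toy example: two placeholder slots are reused, the third token is appended.
toy_vocab = {"the": 0, "law": 1, "<junk_a>": 2, "<junk_b>": 3}
new_vocab = adapt_vocab(toy_vocab, {2, 3}, ["habeas", "tort", "estoppel"])
```

In this toy run the vocabulary grows by one entry instead of three, which is the sense in which replacement limits parameter growth. In a real model the embedding rows behind the reused ids would also need to be re-initialized, which the sketch does not cover.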
If this is right
- The adapted summarization models achieve higher semantic similarity with reference summaries.
- Summaries include more novel and domain-specific words.
- Generated summaries exhibit improved coherence, relevance, and faithfulness.
- Training time is reduced by 35-55% compared to continual pretraining.
- Parameter counts are reduced by up to 37% relative to methods that only expand the vocabulary.
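The parameter claim rests on simple arithmetic: every vocabulary entry that is appended adds one row to each embedding matrix, while an entry that reuses an existing slot adds none. The sketch below shows that bookkeeping. All numbers are hypothetical, are not taken from the paper, and are not meant to reproduce the reported 37% figure; the helper name `added_embedding_params` is likewise an assumption.

```python
def added_embedding_params(num_new_tokens, num_replaced, hidden_size, tied=False):
    """Parameters added to the embedding matrices by a vocabulary change."""
    grown_rows = max(num_new_tokens - num_replaced, 0)
    matrices = 1 if tied else 2   # input embeddings plus output projection
    return grown_rows * hidden_size * matrices


HIDDEN = 4096         # hypothetical hidden size
NEW_TOKENS = 10_000   # hypothetical number of domain tokens
REPLACED = 2_500      # hypothetical number of reused slots

expansion_only = added_embedding_params(NEW_TOKENS, 0, HIDDEN)
replace_then_expand = added_embedding_params(NEW_TOKENS, REPLACED, HIDDEN)
relative_saving = 1 - replace_then_expand / expansion_only
```

With these made-up values, a quarter of the new tokens reuse existing slots, so the added embedding parameters drop by a quarter. The saving scales with the number of slots that can safely be replaced.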
Where Pith is reading between the lines
- Similar vocabulary replacement strategies could benefit other domain-specific NLP tasks like classification or generation in scientific literature.
- The efficiency gains might allow for quicker iteration in adapting models to new specialized areas without requiring extensive computational resources.
- Testing the impact on the model's performance on general-domain tasks after adaptation would reveal any trade-offs not explored in the specialized focus.
Load-bearing premise
Replacing some under-trained tokens with domain ones keeps the model effective on both general and specialized texts without significant degradation.
What would settle it
If evaluations on the legal and medical summarization tasks show no gains in semantic similarity or no reductions in training time and parameters compared to baselines, the benefits of the vocabulary adaptation approach would be called into question.
Original abstract
Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviate performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks on a challenge-oriented evaluation protocol focused on expert-driven text and summaries which typically has higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduce training time by $35-55\%$ over continual pretraining and reduce parameter counts up to $37\%$ w.r.t expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.
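The over-fragmentation the abstract refers to can be made concrete by counting how many tokens a word is split into. The sketch below is a simplification under stated assumptions: it uses a greedy longest-match splitter rather than the BPE merge procedure that the evaluated models' tokenizers actually use, and the vocabularies and words are invented for illustration.

```python
def greedy_tokenize(word, vocab):
    """Split a word by greedy longest match; unknown characters become single tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is in the vocabulary or one character long.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def fertility(words, vocab):
    """Average number of tokens per word (lower means less fragmentation)."""
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


# Hypothetical general-domain vocabulary and a small medical word list.
base_vocab = {"hep", "ato", "tox", "icity", "liver"}
adapted_vocab = base_vocab | {"hepatotoxicity"}
words = ["hepatotoxicity", "liver"]

before = fertility(words, base_vocab)
after = fertility(words, adapted_vocab)
```

Here the medical term falls from four pieces to one once it is added as a single token, so the average tokens per word drops accordingly.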
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a parameter-efficient vocabulary adaptation method for LLMs applied to specialized text summarization in legal and medical domains. It augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to constrain parameter growth, then combines this with continued pretraining. Experiments on Llama-3.1-8B and Qwen2.5-7B report gains in semantic similarity to references, increased use of novel/domain-appropriate words, better coherence/relevance/faithfulness, 35-55% reductions in training time versus continual pretraining, and up to 37% fewer parameters versus expansion-only baselines. The codebase is released publicly.
Significance. If the empirical claims hold after verification, the work would be moderately significant for practical domain adaptation of LLMs, as it targets the vocabulary mismatch that continual pretraining alone does not fix while controlling compute and parameter costs. The public release of the codebase supports reproducibility and is a clear strength. However, the central efficiency and quality claims rest on comparisons whose robustness depends on controls that are not yet visible in the provided description.
major comments (2)
- Abstract and Evaluation section: The central claim that selective replacement improves specialized summarization quality without materially harming general-domain handling is load-bearing, yet no perplexity or summarization metrics on general-domain data (e.g., CNN/DailyMail) or ablation of replacement versus pure expansion are reported. This omission leaves open the possibility that efficiency gains mask capability trade-offs.
- Abstract: Claims of improved semantic similarity, domain-word usage, coherence, relevance, and faithfulness are stated without any numerical values, specific metrics (ROUGE, BERTScore, etc.), baselines, or statistical tests. These quantities are required to assess whether the observed differences are meaningful and reproducible.
minor comments (2)
- Abstract: Grammatical issue in 'significantly reduce training time' (should be 'significantly reduces').
- Abstract: The phrase 'challenge-oriented evaluation protocol focused on expert-driven text' is underspecified; a brief definition or reference to the protocol would improve clarity for readers unfamiliar with the datasets.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects for strengthening the presentation of our results on general-domain preservation and the concreteness of quantitative claims. We address each major comment below and outline the revisions we will make.
read point-by-point responses
- Referee: Abstract and Evaluation section: The central claim that selective replacement improves specialized summarization quality without materially harming general-domain handling is load-bearing, yet no perplexity or summarization metrics on general-domain data (e.g., CNN/DailyMail) or ablation of replacement versus pure expansion are reported. This omission leaves open the possibility that efficiency gains mask capability trade-offs.
Authors: We agree that explicit verification of preserved general-domain capabilities strengthens the central claim. The current experiments prioritize domain-specific summarization under a challenge-oriented protocol with high OOV concentration, as this is the primary use case. To directly address the concern, we will add perplexity and summarization evaluations on a general-domain benchmark such as CNN/DailyMail, along with an ablation comparing selective replacement to pure expansion. These results will be included in the revised Evaluation section and referenced in the abstract.
Revision: yes
- Referee: Abstract: Claims of improved semantic similarity, domain-word usage, coherence, relevance, and faithfulness are stated without any numerical values, specific metrics (ROUGE, BERTScore, etc.), baselines, or statistical tests. These quantities are required to assess whether the observed differences are meaningful and reproducible.
Authors: The abstract is intentionally concise and summarizes the direction of improvements. Concrete numerical results (including ROUGE, BERTScore, domain-word usage statistics, coherence/relevance/faithfulness scores, baselines, and statistical significance) are reported in the Experiments section with tables and figures. We will revise the abstract to incorporate a small number of key quantitative highlights (e.g., relative gains and efficiency percentages already stated) while maintaining brevity, thereby making the claims more self-contained.
Revision: partial
Circularity Check
No circularity: empirical method with external baselines
The paper proposes a vocabulary adaptation method (augment + selective replacement) and evaluates it empirically against continual pretraining and expansion-only baselines on legal/medical summarization tasks using Llama-3.1-8B and Qwen2.5-7B. No equations, self-definitional loops, or load-bearing self-citations appear in the provided abstract or description; performance claims rest on measured improvements in semantic similarity, coherence, and efficiency metrics rather than quantities defined in terms of the fitted outcomes themselves. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- Token replacement selection criteria
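How tokens are chosen for replacement is the one free parameter listed here, and the abstract does not spell out the criteria. One simple stand-in, shown below purely as an illustration, is to flag ids that rarely or never occur when a reference corpus is tokenized. This frequency proxy is an assumption made for the sketch and is not the paper's selection rule; the function name, the threshold `min_count`, and the toy id stream are hypothetical.

```python
from collections import Counter


def replacement_candidates(token_id_stream, vocab_size, min_count=1, protected=()):
    """Return ids seen fewer than min_count times, excluding protected ids.

    token_id_stream -- iterable of token ids from a tokenized reference corpus
    protected       -- ids that must never be replaced (e.g. special tokens)
    """
    counts = Counter(token_id_stream)
    keep = set(protected)
    return [i for i in range(vocab_size) if counts[i] < min_count and i not in keep]


# Toy stream over a six-entry vocabulary; id 4 stands in for a special token.
stream = [0, 1, 1, 2, 0, 5]
never_seen = replacement_candidates(stream, vocab_size=6, protected={4})
rarely_seen = replacement_candidates(stream, vocab_size=6, min_count=2, protected={4})
```

Raising the threshold frees more slots but also increases the risk of overwriting tokens that general-domain text still needs, which is the trade-off the load-bearing premise depends on.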