pith. sign in

arxiv: 2510.17001 · v2 · submitted 2025-10-19 · 💻 cs.CL

Vocab Diet: Reshaping the Vocabulary of LLMs via Vector Arithmetic

Pith reviewed 2026-05-18 05:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords vocabulary reshapingtransformation vectorsLLM embeddingsmorphological variationvector arithmeticmultilingual coveragemodel adaptationtokenization efficiency
0
0 comments X

The pith

LLMs can represent word-form variants like walk and walked as additive offsets to a shared base embedding instead of separate tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that many surface-form variations are already encoded as linear directions in the embedding space of large language models. Rather than allocating distinct vocabulary entries to each variant, it composes representations from a base form plus a small transformation vector applied in both input and output spaces. This reshaping frees 10-40% of vocabulary capacity for greater diversity and multilingual coverage while the core model stays frozen and downstream performance holds steady. The method works in pretraining and post-hoc adaptation across five languages and multiple models, and it also improves handling of out-of-vocabulary words.

Core claim

Many word-form variations can be captured by transformation vectors, which are additive offsets that produce the correct embedding for a surface form when added to a base-form embedding; composing vocabulary entries this way replaces unique tokens for each variant with shared bases plus these offsets.

What carries the argument

Transformation vectors: additive offsets learned on adaptation data that generate variant embeddings from base forms in both input and output spaces.

If this is right

  • 10-40% of vocabulary slots become available for reallocation to rarer or cross-lingual tokens.
  • Out-of-vocabulary words gain coverage through composition rather than new entries.
  • The same approach works for both continued pretraining and lightweight post-hoc adaptation.
  • Downstream task performance changes little across the tested languages and models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models with fixed vocabulary budgets could support more languages or domains by shifting capacity away from morphological repetition.
  • Similar vector-based composition might extend to other regular linguistic patterns such as tense or number agreement across sentences.
  • If the offsets prove robust, future tokenizers could be redesigned around base forms and a small set of reusable transformations rather than exhaustive surface-form lists.

Load-bearing premise

The learned transformation vectors stay stable and generalize to new contexts, languages, and tasks without hurting the frozen backbone or creating new errors.

What would settle it

Evaluate the adapted model on a held-out language or downstream task after training vectors on limited data; if accuracy drops more than a few points relative to the original model, the claim fails.

Figures

Figures reproduced from arXiv: 2510.17001 by Guy Kaplan, Roy Schwartz, Yuval Reif.

Figure 1
Figure 1. Figure 1: Compositional vocabulary for LLMs. Top: Input tokens are represented by (a) decomposing them into base words (Vb) and transformations (Vt), and (b) feeding the composite embeddings to the model. For example, “cats” becomes “cat" + plural. Bottom: The next token is predicted by (c) computing logits inde￾pendently over base words and transformations, and (d) combining them into next-token probabilities. Our … view at source ↗
Figure 2
Figure 2. Figure 2: Structure in LLM vocabularies and po￾tential for compositional design. Left: Many in￾vocabulary English word tokens in the GPT-4 tokenizer are surface variants of other tokens—differing only by case, inflection, or derivation—reducing from 24k to￾kens to just 14k base form words. Right: The existing set of base forms and transformations can be used to compose over 98k currently out-of-vocabulary words, hig… view at source ↗
Figure 3
Figure 3. Figure 3: Linear representation of morphology in embeddings weakens as vocabulary size increases. Accuracy of Patchscopes interpretations of compositional word representations across models, in order of increasing English vocabulary size (English tokens present in UniMorph), separated by embedding architecture. Scaling vocabulary size leads models to represent inflection as individual lexical units, rather than thro… view at source ↗
read the original abstract

Large language models (LLMs) often encode word-form variation (e.g., walk vs. walked) as linear directions in the embedding space. However, standard tokenization algorithms treat such variants as distinct words with different vocabulary entries, quickly filling the size-capped token vocabulary with surface-form variation (e.g., walk, walking, Walk) at the expense of diversity and multilingual coverage. We show that many of these variations can be captured by transformation vectors: additive offsets that yield the appropriate word representation when applied to a base form embedding, in both the input and output spaces. Building on this, we propose a compact reshaping of the vocabulary: instead of assigning unique tokens to each surface form, we compose them from shared base form and transformation vectors (e.g., walked is walk+past tense). Our approach is lightweight, keeping the pretrained backbone frozen and only training small adaptation modules. We apply it across five languages and multiple LLMs in both pretraining and post-hoc adaptation, freeing 10-40% of vocabulary slots to be reallocated where tokenization is inefficient. Importantly, we do so while also expanding vocabulary coverage to out-of-vocabulary words, and with minimal impact on downstream performance. Our findings motivate a rethinking of vocabulary design, towards a representation that better matches the underlying structure of language and the practical needs of multilingual coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes 'Vocab Diet,' a method to reshape LLM vocabularies by representing morphological variants (e.g., walk vs. walked) as additive transformation vectors applied to base-form embeddings. These vectors operate in both input embedding and output LM-head spaces, enabling a compact vocabulary that composes surface forms from shared bases plus offsets. The approach keeps the pretrained backbone frozen, trains only small adaptation modules, and is evaluated across five languages and multiple models in pretraining and post-hoc settings. It claims to free 10-40% of vocabulary slots for reallocation to improve multilingual coverage and OOV handling while maintaining downstream performance.

Significance. If the transformation vectors prove stable and generalizable, the work offers a lightweight, practical route to more linguistically aligned vocabularies that reduce surface-form redundancy and improve efficiency in multilingual LLMs. The frozen-backbone design and reported slot reallocation are concrete strengths that could influence future tokenizer and embedding design.

major comments (3)
  1. The central claim that additive transformation vectors accurately reconstruct surface forms in the output space (including for irregular morphology) is load-bearing but rests on limited adaptation data. The manuscript must demonstrate that post-hoc addition to the LM head preserves logit distributions without introducing new failure modes for cases like go/went, as linear offsets observed in regular forms may not extend reliably.
  2. Experimental section: quantitative metrics, ablation details on adaptation module capacity, and controls for the adaptation process are absent from the reported results. Without these, it is not possible to verify the claimed 'consistent gains' or 'minimal impact' across languages and models.
  3. The weakest assumption—that learned vectors remain stable and generalizable across unseen contexts, languages, and downstream tasks without backbone changes—requires explicit testing. The current evidence does not address whether irregular forms or out-of-distribution contexts degrade output-space predictions.
minor comments (2)
  1. Clarify the precise mathematical definition of the transformation vectors (additive offsets) for input versus output spaces, ideally with an equation early in the methods.
  2. Figure or table presenting the 10-40% slot reallocation should include per-language and per-model breakdowns to support the aggregate claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. The comments raise important points about the robustness of our output-space reconstructions and the completeness of our experimental reporting. We address each major comment below and commit to targeted revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: The central claim that additive transformation vectors accurately reconstruct surface forms in the output space (including for irregular morphology) is load-bearing but rests on limited adaptation data. The manuscript must demonstrate that post-hoc addition to the LM head preserves logit distributions without introducing new failure modes for cases like go/went, as linear offsets observed in regular forms may not extend reliably.

    Authors: We agree that explicit verification for irregular forms is valuable. While our evaluations across five languages demonstrate that the learned vectors support downstream performance with minimal degradation, we did not isolate highly irregular cases such as go/went or quantify logit-distribution shifts in the post-hoc LM-head setting. In the revision we will add targeted experiments that measure reconstruction accuracy and logit KL divergence for irregular morphology, directly testing whether linear offsets introduce new failure modes. revision: yes

  2. Referee: Experimental section: quantitative metrics, ablation details on adaptation module capacity, and controls for the adaptation process are absent from the reported results. Without these, it is not possible to verify the claimed 'consistent gains' or 'minimal impact' across languages and models.

    Authors: We acknowledge the need for greater experimental transparency. The current manuscript reports aggregate downstream performance and vocabulary-size reductions, but does not include per-language reconstruction accuracies, perplexity deltas, or ablations on module capacity. In the revised version we will insert a dedicated experimental subsection containing these quantitative metrics, capacity ablations (varying adapter hidden size and depth), and controls (random-vector and frozen-adapter baselines) to substantiate the claims of consistent gains and minimal impact. revision: yes

  3. Referee: The weakest assumption—that learned vectors remain stable and generalizable across unseen contexts, languages, and downstream tasks without backbone changes—requires explicit testing. The current evidence does not address whether irregular forms or out-of-distribution contexts degrade output-space predictions.

    Authors: Our multi-language and multi-task results already provide evidence that the vectors transfer without backbone updates, yet we accept that more granular stability tests are warranted. We will augment the revision with held-out context evaluations and explicit irregular-form probes that measure output-space prediction degradation, thereby directly addressing generalizability concerns. revision: yes

Circularity Check

0 steps flagged

Empirical training of adaptation modules with no definitional or self-citation circularity

full rationale

The paper presents an empirical method: it trains small adaptation modules on limited data to learn additive transformation vectors that approximate surface-form variations in both input embeddings and output LM-head space, then evaluates the resulting vocabulary reshaping on downstream tasks across languages and models. No load-bearing step reduces to a self-definition, a fitted parameter renamed as a prediction, or a uniqueness theorem imported from the authors' prior work. The central claim is validated through measurable performance metrics rather than any closed mathematical derivation, making the approach externally falsifiable and self-contained against benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach depends on the empirical observation that morphological variations align with linear directions in embedding space; it introduces transformation vectors as derived entities whose generality is tested only within the reported experiments.

free parameters (1)
  • adaptation module capacity
    Size and architecture of the small modules trained to produce the transformation vectors.
axioms (1)
  • domain assumption Word-form variations are approximately linear offsets in the embedding space of pretrained LLMs
    Stated as the starting observation that motivates the vector-arithmetic approach.
invented entities (1)
  • transformation vectors no independent evidence
    purpose: Additive offsets that produce surface-form embeddings from base-form embeddings
    New representational unit introduced to replace distinct vocabulary entries for morphological variants.

pith-pipeline@v0.9.0 · 5771 in / 1346 out tokens · 34439 ms · 2026-05-18T05:50:57.578253+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors

  1. [1]

    Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker

    On the cross-lingual transferability of mono- lingual representations. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics. Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Jon Ande...

  2. [2]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    A thorough examination of the CNN/Daily Mail reading comprehension task. InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2358–2367, Berlin, Germany. Association for Computational Linguistics. Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristin...

  3. [3]

    Gautier Dagan, Gabriele Synnaeve, and Baptiste Roz- ière

    Lert: A linguistically-motivated pre-trained language model.arXiv preprint arXiv:2211.05344. Gautier Dagan, Gabriele Synnaeve, and Baptiste Roz- ière. 2024. Getting the most out of your tokenizer for pre-training and domain adaptation.arXiv preprint arXiv:2402.01035. John Dang, Shivalika Singh, Daniel D’souza, Arash Ahmadian, Alejandro Salamanca, Madeline...

  4. [4]

    10 HyoJung Han, Akiko Eriguchi, Haoran Xu, Hieu Hoang, Marine Carpuat, and Huda Khayrallah

    Finding neurons in a haystack: Case stud- ies with sparse probing.Transactions on Machine Learning Research. 10 HyoJung Han, Akiko Eriguchi, Haoran Xu, Hieu Hoang, Marine Carpuat, and Huda Khayrallah. 2025. Adapters for altering LLM vocabularies: What lan- guages benefit the most? InThe Thirteenth Interna- tional Conference on Learning Representations. Ro...

  5. [5]

    Distilling the Knowledge in a Neural Network

    Inspecting and editing knowledge represen- tations in language models. InFirst Conference on Language Modeling. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531. Valentin Hofmann, Janet Pierrehumbert, and Hinrich Schütze. 2020. DagoBERT: Generating derivational morphology wit...

  6. [6]

    likely”, “unlike

    Efficient and effective vocabulary expansion towards multilingual large language models.arXiv preprint arXiv:2402.14714. Stav Klein and Reut Tsarfaty. 2020. Getting the ##life out of living: How adequate are word-pieces for mod- elling complex morphology? InProceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and ...

  7. [7]

    InAnnual Meeting of the Asso- ciation for Computational Linguistics

    Tokenization impacts multilingual language modeling: Assessing vocabulary allocation and over- lap across languages. InAnnual Meeting of the Asso- ciation for Computational Linguistics. Alisa Liu, Jonathan Hayase, Valentin Hofmann, Se- woong Oh, Noah A. Smith, and Yejin Choi. 2025. SuperBPE: Space travel for language models.ArXiv, abs/2503.13423. Chengyua...

  8. [8]

    InProceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 34303–34326

    tinyBenchmarks: evaluating LLMs with fewer examples. InProceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 34303–34326. PMLR. Marion Di Marco and Alexander Fraser. 2024. Sub- word segmentation in LLMs: Looking at inflection and consistency. InProceedings of the 2024 Confer- en...

  9. [9]

    Using morphological knowledge in open- vocabulary neural language models. InProceedings of the 2018 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Pa- pers), pages 1435–1445. Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. 2023. Language models implement simple wo...

  10. [10]

    2 OLMo 2 Furious

    Zero-shot tokenizer transfer. InThe Thirty- eighth Annual Conference on Neural Information Processing Systems. Itay Nakash, Nitay Calderon, Eyal Ben-David, Elad Hoffer, and Roi Reichart. 2025. Adaptivocab: En- hancing LLM efficiency in focused domains through lightweight vocabulary adaptation. InSecond Con- ference on Language Modeling. Team OLMo, Pete Wa...

  11. [11]

    Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch

    Morphology matters: A multilingual language modeling analysis.Transactions of the Association for Computational Linguistics, 9:261–276. Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch. 2025. The geometry of categorical and hi- erarchical concepts in large language models. In The Thirteenth International Conference on Learn- ing Representations. Ki...

  12. [12]

    In Proceedings of the 60th Annual Meeting of the Associ- ation for Computational Linguistics (Volume 1: Long Papers), pages 46–56, Dublin, Ireland

    AlephBERT: Language model pre-training and evaluation from sub-word to sentence level. In Proceedings of the 60th Annual Meeting of the Associ- ation for Computational Linguistics (Volume 1: Long Papers), pages 46–56, Dublin, Ireland. Association for Computational Linguistics. Rico Sennrich, Barry Haddow, and Alexandra Birch

  13. [13]

    InProceedings of the 54th Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany

    Neural machine translation of rare words with subword units. InProceedings of the 54th Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Lin- guistics. Shivalika Singh, Angelika Romanou, Clémentine Four- rier, David Ifeoluwa Adelani, Jian Gang Ngui, Da...

  14. [14]

    InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 18761–18799, Vienna, Austria

    Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evalua- tion. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 18761–18799, Vienna, Austria. Association for Computational Linguistics. Nishant Subramani, Nivedita Suresh, and Matthew Pe- ters. ...

  15. [15]

    Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muen- nighoff, Zhongwei Wan, Ping Luo, Min Lin, and Ngai Wong

    Large vocabulary size improves large language models.arXiv preprint arXiv:2406.16508. Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muen- nighoff, Zhongwei Wan, Ping Luo, Min Lin, and Ngai Wong. 2024. Scaling laws with vocabulary: Larger models deserve larger vocabularies. InThe Thirty-eighth Annual Conference on Neural Informa- tion Processing Systems. Falco...