Model-Aware Tokenizer Transfer

Aleksander Smywinski-Pohl; Mykola Haltiuk

arxiv: 2510.21954 · v2 · submitted 2025-10-24 · 💻 cs.CL

Pith reviewed 2026-05-18 04:06 UTC · model grok-4.3

classification 💻 cs.CL

keywords tokenizer transfer · multilingual LLMs · attention modeling · embedding initialization · model adaptation · low-resource languages · efficient fine-tuning

The pith

Model-Aware Tokenizer Transfer distills attention patterns from a source model to initialize embeddings for a new tokenizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models face a bottleneck when adapting to new languages or scripts because their tokenizers and embeddings are fixed during pretraining. The paper introduces Model-Aware Tokenizer Transfer (MATT) to address this by incorporating internal model signals rather than relying on surface-level semantic matching for new token embeddings. It defines an Attention Influence Modeling objective that transfers how tokens communicate through attention layers in the original model, creating a better starting point for the target setup. This warm-up step precedes ordinary language modeling training. Experiments across varied languages indicate that the approach restores a substantial share of original performance after only a few hours of GPU work and exceeds heuristic baselines.

Core claim

Model-Aware Tokenizer Transfer incorporates model internals through an Attention Influence Modeling objective that distills inter-token communication patterns from the source model into embeddings for a new tokenizer, providing an efficient initialization before standard language modeling and yielding stronger recovery of model performance than embedding-similarity methods alone.

What carries the argument

Attention Influence Modeling (AIM) objective that distills attention influence patterns between tokens to guide embedding initialization and adaptation for the target tokenizer.

If this is right

Tokenizer adaptation for distinct scripts becomes feasible with limited compute instead of full pretraining.
Multilingual LLMs can add support for low-resource languages by leveraging attention behavior rather than semantic heuristics.
A short AIM-based warm-up step improves final performance after standard training across diverse linguistic settings.
The method outperforms heuristic baselines that initialize embeddings using only token similarity.
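For orientation, a similarity-style heuristic of the kind MATT is compared against can be sketched as follows. This is a minimal illustration, not the paper's baseline: it assumes each new token is initialized to the mean of the source embeddings of the pieces the source tokenizer produces for it. The function name `init_by_subtoken_mean`, the character-level toy tokenizer, and the 2-dimensional embeddings are all hypothetical.

```python
from statistics import fmean


def init_by_subtoken_mean(new_vocab, source_tokenize, source_emb):
    """Return {new_token: vector} by averaging source embeddings of its pieces.

    Hypothetical heuristic baseline; not the method evaluated in the paper.
    """
    dim = len(next(iter(source_emb.values())))
    # Fallback for tokens with no known pieces: mean of all source embeddings.
    fallback = [fmean(v[i] for v in source_emb.values()) for i in range(dim)]
    new_emb = {}
    for token in new_vocab:
        pieces = [p for p in source_tokenize(token) if p in source_emb]
        if pieces:
            new_emb[token] = [
                fmean(source_emb[p][i] for p in pieces) for i in range(dim)
            ]
        else:
            new_emb[token] = list(fallback)
    return new_emb


# Toy data (assumed): character-level "source tokenizer", 2-d embeddings.
source_emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_tokenize = list  # splits a string into its characters
new_emb = init_by_subtoken_mean(["ab", "c", "zz"], toy_tokenize, source_emb)
```

A heuristic like this only looks at the embedding table; it never consults what the attention layers do with those vectors, which is the gap the AIM warm-up is meant to address.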

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same attention-distillation approach could be tested on other internal signals such as residual streams or feed-forward activations to see if they add further gains.
MATT might reduce the data requirements for cross-lingual adaptation by supplying a stronger prior from the source model.
If attention patterns prove more transferable than embeddings, similar distillation could apply to other model components like layer norms during tokenizer changes.

Load-bearing premise

Distilling attention influence patterns from the source model will produce useful initialization for the target tokenizer even when the new vocabulary and script differ substantially from the original training data.

What would settle it

Measure whether MATT still recovers a large fraction of performance faster than baselines when the source and target tokenizers share no tokens and use unrelated scripts, such as transferring from English to a language using Devanagari.
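Before running such a test, it helps to confirm that the two vocabularies really are disjoint. The sketch below is one hypothetical way to quantify that; the function name `vocab_overlap` and the toy vocabularies are assumptions for illustration, not data from the paper.

```python
def vocab_overlap(source_vocab, target_vocab):
    """Count shared tokens and report two simple overlap ratios."""
    source, target = set(source_vocab), set(target_vocab)
    shared = source & target
    union = source | target
    return {
        "shared": len(shared),
        # Fraction of target tokens that already exist in the source vocabulary.
        "target_coverage": len(shared) / len(target) if target else 0.0,
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy vocabularies (assumed): a Latin-script source and a Devanagari target.
latin = ["the", "ing", "er", "a"]
devanagari = ["क", "का", "है", "में"]
disjoint_stats = vocab_overlap(latin, devanagari)

# Same target with one shared token added, for contrast.
partial_stats = vocab_overlap(latin, devanagari + ["a"])
```

A `target_coverage` of zero corresponds to the hard case described above, where no embedding can be copied across and any benefit has to come from the distilled signal or the later language-modeling stage.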

Original abstract

Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer methods typically rely on semantic heuristics to initialize new embeddings, ignoring higher-layer model dynamics and limiting transfer quality. We propose Model-Aware Tokenizer Transfer (MATT), a method that incorporates model internals into the tokenizer transfer process. MATT introduces an Attention Influence Modeling (AIM) objective that distills inter-token communication patterns from a source model into a target model with a new tokenizer, providing an efficient warm-up before standard language modeling. Unlike approaches that focus solely on embedding similarity, MATT leverages attention behavior to guide embedding initialization and adaptation. Experiments across diverse linguistic settings show that MATT recovers a large fraction of the original model's performance within a few GPU hours, outperforming heuristic baselines. These results demonstrate that incorporating model-level signals offers a practical and effective path toward robust tokenizer transfer in multilingual LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MATT adds an AIM objective to distill attention patterns for new tokenizer embeddings, but the transfer may add little when vocabularies and scripts barely overlap.

The main thing to know is that this paper proposes Model-Aware Tokenizer Transfer (MATT) with an Attention Influence Modeling (AIM) objective. Instead of initializing new embeddings purely from semantic similarity, it tries to distill inter-token attention patterns from the source model to warm up the target model before normal language modeling. The experiments report that this recovers a large part of original performance in a few GPU hours across different languages and scripts, beating the heuristic baselines they compared against. That efficiency angle is the practical hook.

What the work does reasonably well is flag a real bottleneck in multilingual adaptation and show that looking at model internals can be a useful signal beyond surface-level heuristics. The idea of conditioning the transfer on attention behavior rather than embeddings alone is a clear step past the cited prior methods. The results, if they hold, would be useful for anyone who needs to support new scripts without full retraining.

The soft spots are in the mapping step and the evidence. When source and target vocabularies have minimal overlap and the scripts are unrelated, attention scores live in a token space with no direct counterpart in the new embeddings. The paper does not appear to supply an explicit alignment or shared latent structure that would preserve the relevant communication patterns, so the distilled signal could easily become noise. In that case the observed recovery would likely trace to the later language-modeling stage rather than the AIM warm-up, which undercuts the claim that model-aware transfer is what drives the gains. The abstract also gives little on baseline details, dataset sizes, ablations, or statistical checks, so it is hard to tell how robust the outperformance really is.

This paper is for researchers and engineers working on tokenizer adaptation and low-resource multilingual LLMs. A reader in that niche would get concrete ideas to try and some speed numbers to benchmark against. It is not a foundational result but a targeted engineering contribution. I would send it to peer review. The core idea is worth proper testing and the practical motivation is sound, even though the current evidence needs tighter controls on the transfer mechanism and more transparent experimental reporting.

Referee Report

1 major / 2 minor

Summary. The paper proposes Model-Aware Tokenizer Transfer (MATT) to adapt pretrained LLMs to new tokenizers for lower-resource or distinct-script languages. It introduces an Attention Influence Modeling (AIM) objective that distills inter-token attention patterns from the source model to initialize embeddings in the target model with a new vocabulary, followed by standard language-modeling fine-tuning. Experiments across diverse linguistic settings are reported to recover a large fraction of original performance within a few GPU hours while outperforming heuristic baselines that rely only on embedding similarity.

Significance. If the reported gains prove robust, MATT would offer a practical advance for multilingual LLM adaptation by incorporating higher-layer model dynamics rather than semantic heuristics alone. The approach directly targets the tokenizer bottleneck and could reduce compute needed for script or language transfer, with potential value for low-resource settings.

major comments (1)

[AIM objective description and experimental setup] The central claim that AIM distillation provides a useful warm-up initialization rests on the assumption that attention influence patterns transfer across non-overlapping vocabularies and scripts. The manuscript does not describe an explicit alignment mechanism or shared latent space that would preserve communication structure when source tokens have no direct correspondence to target tokens (e.g., Latin-to-Brahmic or logographic cases). Without this, the distilled signal risks being noise, and observed recovery could be attributable to the subsequent language-modeling stage rather than the model-aware component.

minor comments (2)

[Experiments] The abstract and results section should report concrete baseline methods, dataset sizes, number of runs, and statistical significance tests to allow verification that gains are not due to post-hoc choices.
[Method] Notation for the AIM loss and how attention scores are aggregated or projected onto the new vocabulary should be clarified with an equation or pseudocode.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. The major comment raises an important point about the transfer mechanism in the AIM objective, which we address in detail below. We believe the response clarifies the approach and we will incorporate additional details in a revised version.

Referee: The central claim that AIM distillation provides a useful warm-up initialization rests on the assumption that attention influence patterns transfer across non-overlapping vocabularies and scripts. The manuscript does not describe an explicit alignment mechanism or shared latent space that would preserve communication structure when source tokens have no direct correspondence to target tokens (e.g., Latin-to-Brahmic or logographic cases). Without this, the distilled signal risks being noise, and observed recovery could be attributable to the subsequent language-modeling stage rather than the model-aware component.

Authors: We appreciate the referee highlighting this aspect of the method. The AIM objective, as formulated in Section 3.2 of the manuscript, distills inter-token attention influence patterns at the sequence level rather than through direct token-to-token alignment. Specifically, we compute influence scores from the source model on text sequences in the target language (using the source tokenizer only for the initial computation pass), then optimize the target model's new embeddings and early layers via a distillation loss to reproduce comparable attention communication structures on the same underlying content. No explicit shared latent space or token correspondence is assumed or required; the transfer relies on the fact that higher-layer dynamics reflect semantic and syntactic roles that can be approximated by the new vocabulary through optimization on parallel data. The manuscript reports results on diverse settings including script transfers (e.g., Latin to Brahmic and logographic cases) in Section 4, where MATT outperforms embedding-similarity baselines, supporting that the distilled signal is not merely noise. Regarding attribution to the subsequent language-modeling stage, the experiments include controls showing that AIM initialization yields faster convergence and higher final performance than random or heuristic initializations followed by the same LM fine-tuning. We acknowledge that the current description of the cross-vocabulary transfer could be expanded for clarity and will revise the manuscript to add an explicit paragraph and illustrative diagram in Section 3 explaining the sequence-level distillation process.

revision: yes
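The exchange above turns on how attention from two different tokenizations of the same text can be compared at all. The paper's actual AIM loss is not reproduced on this page, so the following is only a hypothetical sketch of one way to do it: pool each token-level attention matrix onto a shared word segmentation, then take a KL divergence between the pooled teacher and student rows. The word-level pooling, the function names `pool_to_words` and `kl_rows`, and the toy matrices are all assumptions, not the authors' formulation.

```python
import math


def pool_to_words(attn, word_of_token, n_words):
    """Pool token-level attention to word level (assumed alignment unit).

    attn[i][j] is the weight query token i puts on key token j (rows sum to 1).
    Key weights are summed within a word and query rows are averaged within a
    word, so every pooled row is still a probability distribution.
    Every word must contain at least one token.
    """
    pooled = [[0.0] * n_words for _ in range(n_words)]
    counts = [0] * n_words
    for i, row in enumerate(attn):
        q = word_of_token[i]
        counts[q] += 1
        for j, weight in enumerate(row):
            pooled[q][word_of_token[j]] += weight
    return [[v / counts[q] for v in pooled[q]] for q in range(n_words)]


def kl_rows(teacher, student, eps=1e-9):
    """Mean over rows of KL(teacher_row || student_row)."""
    total = 0.0
    for t_row, s_row in zip(teacher, student):
        total += sum(
            t * math.log((t + eps) / (s + eps))
            for t, s in zip(t_row, s_row)
            if t > 0
        )
    return total / len(teacher)


# Toy two-word sentence. The source tokenizer yields 3 tokens, the target 2.
src_attn = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
src_words = [0, 1, 1]  # token index -> word index
tgt_words = [0, 1]

teacher = pool_to_words(src_attn, src_words, n_words=2)
matching = pool_to_words([[1.0, 0.0], [0.35, 0.65]], tgt_words, n_words=2)
mismatched = pool_to_words([[1.0, 0.0], [0.9, 0.1]], tgt_words, n_words=2)

loss_matching = kl_rows(teacher, matching)
loss_mismatched = kl_rows(teacher, mismatched)
```

In this toy setup the loss is near zero when the student reproduces the teacher's word-level pattern and positive otherwise. Whether a pooling of this kind preserves enough structure across unrelated scripts is exactly the point the referee asks the authors to demonstrate.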

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines the AIM objective as an independent distillation of attention influence patterns from the source model to initialize target embeddings, followed by standard language modeling; this is not constructed from or fitted to the final performance metric. No equations reduce the claimed prediction to the input by definition, no load-bearing self-citations are invoked for uniqueness or ansatz, and the experimental comparison to heuristic baselines provides an external check. The central claim therefore rests on the empirical transferability of attention patterns rather than on any tautological redefinition or self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard transformer assumptions plus the untested premise that attention patterns remain informative across tokenizer changes. No new physical entities or free parameters are introduced in the abstract.

axioms (2)

domain assumption: Attention weights in the source model encode transferable inter-token communication patterns that remain useful after tokenizer replacement.
  Invoked when the AIM objective is defined to distill these patterns into the target model.
domain assumption: A short warm-up phase using the AIM loss is sufficient to stabilize the new embeddings before full language-model training.
  Stated as the practical training schedule that precedes standard language modeling.

pith-pipeline@v0.9.0 · 5695 in / 1227 out tokens · 23659 ms · 2026-05-18T04:06:26.319707+00:00 · methodology
