pith. sign in

arxiv: 2512.14954 · v2 · submitted 2025-12-16 · 💻 cs.CL · cs.LG

Cross-Tokenizer Likelihood Scoring Algorithms for Language Model Distillation

Pith reviewed 2026-05-16 21:18 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords cross-tokenizer likelihoodBPE recursive structurelanguage model distillationvocabulary misalignmentknowledge distillationsequence likelihoodGSM8K
0
0 comments X

The pith

A probabilistic framework computes exact likelihoods between language models using different tokenizers by exploiting BPE recursive structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for computing sequence likelihoods and next-token probabilities when teacher and student language models employ mismatched vocabularies. It uncovers an implicit recursive structure in the BPE algorithm to create a probabilistic mapping that works in two regimes. When the student vocabulary is a subset of the teacher's, the framework delivers exact likelihoods using only O(1) model evaluations per token. This supports distillation with up to 12% lower memory use and up to 4% higher performance on tested tasks. For arbitrary vocabularies the method supplies a lossless procedure plus a fast approximation, shown to raise accuracy more than 2% on GSM8K mathematical reasoning.

Core claim

The central claim is that BPE's implicit recursive structure can be used to build a probabilistic framework for cross-tokenizer likelihood scoring, enabling exact sequence likelihood evaluation when the student vocabulary is a subset of the teacher vocabulary with O(1) model evaluations per token, and providing a rigorous lossless procedure for arbitrary vocabularies.

What carries the argument

The probabilistic framework that exploits BPE's recursive merging structure to map likelihoods and next-token probabilities between different token vocabularies.

If this is right

  • Up to 12% reduction in memory footprint for the Qwen2.5-1.5B model during distillation.
  • Up to 4% improvement over baseline performance on the evaluated tasks.
  • Over 2% accuracy gain on GSM8K mathematical reasoning distillation compared with the prior state of the art.
  • Next-token probabilities for sequential sampling become available with only O(1) model evaluations per token in the subset regime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same recursive-mapping idea might apply to other hierarchical tokenization schemes that contain merge structure.
  • Smaller student vocabularies enabled by this scoring could support more efficient edge-device deployment of distilled models.
  • Further tests on non-mathematical reasoning benchmarks could show whether the accuracy gains generalize beyond GSM8K.

Load-bearing premise

BPE's merge operations create a recursive structure that permits an exact probabilistic transfer of likelihoods between tokenizers.

What would settle it

Direct computation of likelihoods on a dataset using both the original and mapped tokenizers, followed by finding consistent discrepancies in the subset-vocabulary case, would disprove the exactness claim.

Figures

Figures reproduced from arXiv: 2512.14954 by Ashish Khisti, Buu Phan, Karen Ullrich.

Figure 1
Figure 1. Figure 1: Cross-Tokenizer Scoring: given a language model trained on a fixed tokenizer, for example [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Progressive construction of BPE vocabulary and pseudocode for the [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Left: Prior work (Phan et al., 2025) supports tractable conversion only from the full BPE vocabulary to the byte-level alphabet. Right: Our relative alphabet framework in Definition 3 enables conversion between any pair of BPE vocabularies by first mapping tokens to their byte-level representation, and then recursively applying reverse conversions between adjacent vocabularies (see Section 4.3 and Lemma 2)… view at source ↗
Figure 4
Figure 4. Figure 4: Empirical approximation of next-token probabilities. Method Accuracy (5-shot) Gemma2-2B-Instruct (Student) 52.3 Qwen2.5-Math-7B-Instruct (Teacher) 88.4 SFT 47.9 ULD (Boizard et al., 2025) 47.1 ULD + SFT (Boizard et al., 2025) 48.2 DSKD (Zhang et al., 2024) 51.5 DSKD + SFT (Zhang et al., 2024) 52.8 ALM 53.2 ALM + SFT 53.5 PKL (Ours) 54.6 SFT + PKL (Ours) 55.6 [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
read the original abstract

Computing next-token likelihood ratios between two language models (LMs) is a standard task in training paradigms such as knowledge distillation. Since this requires both models to share the same probability space, it becomes challenging when the teacher and student LMs use different tokenizers, for instance, when edge-device deployment necessitates a smaller vocabulary size to lower memory overhead. This work addresses this vocabulary misalignment problem by uncovering an implicit recursive structure in the commonly deployed Byte-Pair Encoding (BPE) algorithm and utilizing it to create a probabilistic framework for cross-tokenizer likelihood scoring. Our method enables sequence likelihood evaluation for vocabularies different from the teacher model native tokenizer, addressing two specific scenarios: when the student vocabulary is a subset of the teacher vocabulary, and the general case where it is arbitrary. In the subset regime, our framework computes exact likelihoods and provides next-token probabilities for sequential sampling with only ${O}(1)$ model evaluations per token. When used for distillation, this yields up to a $12\%$ reduction in memory footprint for the Qwen2.5-1.5B model while also improving baseline performance up to $4\%$ on the evaluated tasks. For the general case, we introduce a rigorous lossless procedure that leverages BPE recursive structure, complemented by a fast approximation that keeps large-vocabulary settings practical. Applied to GSM8K mathematical reasoning distillation, our method improves accuracy by over $2\%$ the current state of the art. Code: github.com/truongbuu/cross-tokenizer-scoring

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces algorithms for cross-tokenizer sequence likelihood scoring in language model distillation, exploiting an implicit recursive structure in BPE. It claims exact likelihood computation with O(1) model evaluations per token when the student vocabulary is a subset of the teacher's, enabling memory-efficient distillation (up to 12% reduction for Qwen2.5-1.5B with up to 4% accuracy gains). For arbitrary vocabularies, it proposes a rigorous lossless procedure based on BPE recursion plus a fast approximation, yielding over 2% accuracy improvement on GSM8K mathematical reasoning distillation over prior state-of-the-art.

Significance. If the lossless general-case procedure and O(1) subset-case claims hold with full rigor, the work would enable practical distillation across tokenizer boundaries, directly supporting smaller-vocabulary student models for edge deployment without sacrificing likelihood accuracy. The open-source code release aids reproducibility, and the quantitative gains on standard benchmarks like GSM8K indicate potential impact on efficient LM training pipelines.

major comments (2)
  1. [Abstract / General Case] Abstract and general-case procedure: The lossless claim for arbitrary student vocabularies rests on BPE's recursive merge structure yielding an exact probabilistic mapping. However, because BPE merges are greedy and tokenizer-specific, an arbitrary student vocabulary can produce tokens whose byte strings admit no canonical decomposition under the teacher's rules; in such cases the required segmentation step introduces a probability not guaranteed to equal the teacher's autoregressive likelihood, undermining losslessness. A concrete counter-example or proof of uniqueness under the stated assumptions is needed.
  2. [Subset Regime] Subset-regime derivation: The assertion of exact next-token probabilities and O(1) model evaluations per token for sequential sampling requires an explicit derivation showing that the likelihood ratio remains exact without error accumulation over sequence length; the current description does not specify how the recursive structure is inverted to obtain the student's conditional probabilities from the teacher's forward pass.
minor comments (2)
  1. [Abstract] Abstract contains a minor phrasing error: 'improves accuracy by over 2% the current state of the art' should read 'over 2% over the current state of the art'.
  2. [Experiments] The experimental section should explicitly list the baseline distillation methods, exact evaluation tasks, and number of runs for the reported 4% and 2% gains to allow direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and valuable feedback on the rigor of our lossless claims and derivations. We address each major comment below and will revise the manuscript to incorporate additional proofs, clarifications, and explicit derivations.

read point-by-point responses
  1. Referee: [Abstract / General Case] Abstract and general-case procedure: The lossless claim for arbitrary student vocabularies rests on BPE's recursive merge structure yielding an exact probabilistic mapping. However, because BPE merges are greedy and tokenizer-specific, an arbitrary student vocabulary can produce tokens whose byte strings admit no canonical decomposition under the teacher's rules; in such cases the required segmentation step introduces a probability not guaranteed to equal the teacher's autoregressive likelihood, undermining losslessness. A concrete counter-example or proof of uniqueness under the stated assumptions is needed.

    Authors: We appreciate this point on potential non-canonical decompositions. Our general-case procedure defines the lossless mapping by recursively decomposing each student token's byte string according to the teacher's fixed BPE merge rules, applying the chain rule to equate the joint probabilities exactly. Under the standard BPE assumptions (deterministic merges on byte sequences), every valid student token admits a unique decomposition that aligns with the teacher's segmentation. We will revise the manuscript to include a formal proof of uniqueness and exactness in the general-case section, along with a discussion of the assumptions and a concrete example illustrating the recursion. If edge cases outside these assumptions arise, they will be noted as limitations, but our current analysis and experiments support losslessness within the stated scope. revision: yes

  2. Referee: [Subset Regime] Subset-regime derivation: The assertion of exact next-token probabilities and O(1) model evaluations per token for sequential sampling requires an explicit derivation showing that the likelihood ratio remains exact without error accumulation over sequence length; the current description does not specify how the recursive structure is inverted to obtain the student's conditional probabilities from the teacher's forward pass.

    Authors: We agree that an explicit derivation is needed for clarity. In the subset regime, each student token is either a teacher token or a merge of consecutive teacher tokens. The recursive BPE structure is inverted by computing the student's conditional probability as the normalized sum of the teacher's autoregressive probabilities over all sub-sequences that complete the student token, using a single forward pass per student token boundary. This yields exact ratios with no error accumulation because the recursion is applied exactly at each step via the chain rule, independent of sequence length. We will add a detailed step-by-step mathematical derivation (including the inversion equations) to Section 3 in the revision, explicitly showing the O(1) evaluation property and confirming exactness for sequential sampling. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained from external BPE properties; no reduction to inputs by construction

full rationale

The paper derives cross-tokenizer likelihood scoring directly from the standard recursive merge structure of BPE, an external algorithm whose properties are not defined in terms of the paper's outputs or fitted values. The subset-regime exact computation and general-case lossless decomposition are presented as consequences of BPE byte-level operations rather than any self-definitional loop, fitted-parameter renaming, or self-citation chain. No equations reduce a claimed prediction to a prior fit, and the central claims remain independent of the paper's own results. This is the normal case of an algorithmic paper whose derivation is externally grounded.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the domain assumption regarding BPE's recursive structure, with no free parameters or invented entities specified in the abstract.

axioms (1)
  • domain assumption Byte-Pair Encoding possesses an implicit recursive structure that supports probabilistic mapping between different token vocabularies
    This structure is uncovered and directly utilized to enable exact and lossless cross-tokenizer scoring as described.

pith-pipeline@v0.9.0 · 5574 in / 1398 out tokens · 35749 ms · 2026-05-16T21:18:01.509647+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

  1. [1]

    There exists a sequence ⃗xsuch that ⃗eVM = encode(⃗x;M)[:| ⃗eVM |]

  2. [2]

    the last token of ⃗eVM starts from some location within ⃗eVM ′

    There exists an indexi≤ | ⃗eVM ′ |such that: decodeM→M ′(⃗eVM [1:|⃗eVM | −1]) = ⃗eVM ′ [1:i−1], ⃗eVM ′ [i:]∈prefix (decode M→M ′(⃗eVM [−1])), i.e. the last token of ⃗eVM starts from some location within ⃗eVM ′ . In BPE, any ⃗eVM must be a valid encoding according to the first condition. Algorithm 2 leverages this by first converting⃗eVM ′ to its string fo...

  3. [3]

    exclude the last token, obtained either from Algorithm 2 or from previous sampling call (to be shown)

    The set of (relative) cover encoding and the corresponding probabilities of⃗eM ′[:| ⃗eM ′|−1], i.e. exclude the last token, obtained either from Algorithm 2 or from previous sampling call (to be shown)

  4. [4]

    13 Preprint We define the following indicator function for the event that the tokent∈ V M begins with the sub-tokent ′ ∈ V ′ M through thedecode M→M ′(·)operator

    The conditional next-token probability distributionP M(.|encodeM ′→M(⃗eM ′[:| ⃗eM ′| − 1])), obtained via previous model inference call. 13 Preprint We define the following indicator function for the event that the tokent∈ V M begins with the sub-tokent ′ ∈ V ′ M through thedecode M→M ′(·)operator. τ(t, t ′) :V M × VM ′ → {0,1}, τ(t, t ′)≜ ( 1,ifdecode M→...

  5. [5]

    We call this setC new

    Remove the encodings incover( ⃗eM ′[:| ⃗eM ′| −1])whose last tokentdoes not begins with the sampled sub-token ⃗eM ′[−1], i.e.τ(t, ⃗eM ′[−1]) = 0. We call this setC new

  6. [6]

    The resulting setC new thus is the desiredcover( ⃗eM ′)

    Add the valid encodingsencode M ′→M(⃗eM ′[:| ⃗eM ′| −1])·tfor any tokenstinV M that beginswith the sampled sub-token ⃗eM ′[−1], i.e.τ(t, ⃗eM ′[−1]) = 1, inC new. The resulting setC new thus is the desiredcover( ⃗eM ′). Step 2 (Compute Conditional Distribution).To sample the next tokent ′ ∈ V M ′ according PM ′(.|⃗eM ′), we note that: PM ′(⃗eM ′ ·t ′) =P M...