Cross-Tokenizer Likelihood Scoring Algorithms for Language Model Distillation
Pith reviewed 2026-05-16 21:18 UTC · model grok-4.3
The pith
A probabilistic framework computes exact likelihoods between language models using different tokenizers by exploiting BPE recursive structure.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that BPE's implicit recursive structure can be used to build a probabilistic framework for cross-tokenizer likelihood scoring, enabling exact sequence likelihood evaluation when the student vocabulary is a subset of the teacher vocabulary with O(1) model evaluations per token, and providing a rigorous lossless procedure for arbitrary vocabularies.
What carries the argument
The probabilistic framework that exploits BPE's recursive merging structure to map likelihoods and next-token probabilities between different token vocabularies.
If this is right
- Up to 12% reduction in memory footprint for the Qwen2.5-1.5B model during distillation.
- Up to 4% improvement over baseline performance on the evaluated tasks.
- Over 2% accuracy gain on GSM8K mathematical reasoning distillation compared with the prior state of the art.
- Next-token probabilities for sequential sampling become available with only O(1) model evaluations per token in the subset regime.
Where Pith is reading between the lines
- The same recursive-mapping idea might apply to other hierarchical tokenization schemes that contain merge structure.
- Smaller student vocabularies enabled by this scoring could support more efficient edge-device deployment of distilled models.
- Further tests on non-mathematical reasoning benchmarks could show whether the accuracy gains generalize beyond GSM8K.
Load-bearing premise
BPE's merge operations create a recursive structure that permits an exact probabilistic transfer of likelihoods between tokenizers.
What would settle it
Direct computation of likelihoods on a dataset using both the original and mapped tokenizers, followed by finding consistent discrepancies in the subset-vocabulary case, would disprove the exactness claim.
Figures
read the original abstract
Computing next-token likelihood ratios between two language models (LMs) is a standard task in training paradigms such as knowledge distillation. Since this requires both models to share the same probability space, it becomes challenging when the teacher and student LMs use different tokenizers, for instance, when edge-device deployment necessitates a smaller vocabulary size to lower memory overhead. This work addresses this vocabulary misalignment problem by uncovering an implicit recursive structure in the commonly deployed Byte-Pair Encoding (BPE) algorithm and utilizing it to create a probabilistic framework for cross-tokenizer likelihood scoring. Our method enables sequence likelihood evaluation for vocabularies different from the teacher model native tokenizer, addressing two specific scenarios: when the student vocabulary is a subset of the teacher vocabulary, and the general case where it is arbitrary. In the subset regime, our framework computes exact likelihoods and provides next-token probabilities for sequential sampling with only ${O}(1)$ model evaluations per token. When used for distillation, this yields up to a $12\%$ reduction in memory footprint for the Qwen2.5-1.5B model while also improving baseline performance up to $4\%$ on the evaluated tasks. For the general case, we introduce a rigorous lossless procedure that leverages BPE recursive structure, complemented by a fast approximation that keeps large-vocabulary settings practical. Applied to GSM8K mathematical reasoning distillation, our method improves accuracy by over $2\%$ the current state of the art. Code: github.com/truongbuu/cross-tokenizer-scoring
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces algorithms for cross-tokenizer sequence likelihood scoring in language model distillation, exploiting an implicit recursive structure in BPE. It claims exact likelihood computation with O(1) model evaluations per token when the student vocabulary is a subset of the teacher's, enabling memory-efficient distillation (up to 12% reduction for Qwen2.5-1.5B with up to 4% accuracy gains). For arbitrary vocabularies, it proposes a rigorous lossless procedure based on BPE recursion plus a fast approximation, yielding over 2% accuracy improvement on GSM8K mathematical reasoning distillation over prior state-of-the-art.
Significance. If the lossless general-case procedure and O(1) subset-case claims hold with full rigor, the work would enable practical distillation across tokenizer boundaries, directly supporting smaller-vocabulary student models for edge deployment without sacrificing likelihood accuracy. The open-source code release aids reproducibility, and the quantitative gains on standard benchmarks like GSM8K indicate potential impact on efficient LM training pipelines.
major comments (2)
- [Abstract / General Case] Abstract and general-case procedure: The lossless claim for arbitrary student vocabularies rests on BPE's recursive merge structure yielding an exact probabilistic mapping. However, because BPE merges are greedy and tokenizer-specific, an arbitrary student vocabulary can produce tokens whose byte strings admit no canonical decomposition under the teacher's rules; in such cases the required segmentation step introduces a probability not guaranteed to equal the teacher's autoregressive likelihood, undermining losslessness. A concrete counter-example or proof of uniqueness under the stated assumptions is needed.
- [Subset Regime] Subset-regime derivation: The assertion of exact next-token probabilities and O(1) model evaluations per token for sequential sampling requires an explicit derivation showing that the likelihood ratio remains exact without error accumulation over sequence length; the current description does not specify how the recursive structure is inverted to obtain the student's conditional probabilities from the teacher's forward pass.
minor comments (2)
- [Abstract] Abstract contains a minor phrasing error: 'improves accuracy by over 2% the current state of the art' should read 'over 2% over the current state of the art'.
- [Experiments] The experimental section should explicitly list the baseline distillation methods, exact evaluation tasks, and number of runs for the reported 4% and 2% gains to allow direct comparison.
Simulated Author's Rebuttal
We thank the referee for the careful review and valuable feedback on the rigor of our lossless claims and derivations. We address each major comment below and will revise the manuscript to incorporate additional proofs, clarifications, and explicit derivations.
read point-by-point responses
-
Referee: [Abstract / General Case] Abstract and general-case procedure: The lossless claim for arbitrary student vocabularies rests on BPE's recursive merge structure yielding an exact probabilistic mapping. However, because BPE merges are greedy and tokenizer-specific, an arbitrary student vocabulary can produce tokens whose byte strings admit no canonical decomposition under the teacher's rules; in such cases the required segmentation step introduces a probability not guaranteed to equal the teacher's autoregressive likelihood, undermining losslessness. A concrete counter-example or proof of uniqueness under the stated assumptions is needed.
Authors: We appreciate this point on potential non-canonical decompositions. Our general-case procedure defines the lossless mapping by recursively decomposing each student token's byte string according to the teacher's fixed BPE merge rules, applying the chain rule to equate the joint probabilities exactly. Under the standard BPE assumptions (deterministic merges on byte sequences), every valid student token admits a unique decomposition that aligns with the teacher's segmentation. We will revise the manuscript to include a formal proof of uniqueness and exactness in the general-case section, along with a discussion of the assumptions and a concrete example illustrating the recursion. If edge cases outside these assumptions arise, they will be noted as limitations, but our current analysis and experiments support losslessness within the stated scope. revision: yes
-
Referee: [Subset Regime] Subset-regime derivation: The assertion of exact next-token probabilities and O(1) model evaluations per token for sequential sampling requires an explicit derivation showing that the likelihood ratio remains exact without error accumulation over sequence length; the current description does not specify how the recursive structure is inverted to obtain the student's conditional probabilities from the teacher's forward pass.
Authors: We agree that an explicit derivation is needed for clarity. In the subset regime, each student token is either a teacher token or a merge of consecutive teacher tokens. The recursive BPE structure is inverted by computing the student's conditional probability as the normalized sum of the teacher's autoregressive probabilities over all sub-sequences that complete the student token, using a single forward pass per student token boundary. This yields exact ratios with no error accumulation because the recursion is applied exactly at each step via the chain rule, independent of sequence length. We will add a detailed step-by-step mathematical derivation (including the inversion equations) to Section 3 in the revision, explicitly showing the O(1) evaluation property and confirming exactness for sequential sampling. revision: yes
Circularity Check
Derivation self-contained from external BPE properties; no reduction to inputs by construction
full rationale
The paper derives cross-tokenizer likelihood scoring directly from the standard recursive merge structure of BPE, an external algorithm whose properties are not defined in terms of the paper's outputs or fitted values. The subset-regime exact computation and general-case lossless decomposition are presented as consequences of BPE byte-level operations rather than any self-definitional loop, fitted-parameter renaming, or self-citation chain. No equations reduce a claimed prediction to a prior fit, and the central claims remain independent of the paper's own results. This is the normal case of an algorithmic paper whose derivation is externally grounded.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Byte-Pair Encoding possesses an implicit recursive structure that supports probabilistic mapping between different token vocabularies
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We uncover an implicit recursive structure in the BPE algorithm and utilize it to create a probabilistic framework for cross-tokenizer likelihood scoring... lossless recursive algorithm (Section 4.3, Algorithm 1)
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
relative alphabet... merge and demerge operations (Definition 3, Figure 2)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
There exists a sequence ⃗xsuch that ⃗eVM = encode(⃗x;M)[:| ⃗eVM |]
-
[2]
the last token of ⃗eVM starts from some location within ⃗eVM ′
There exists an indexi≤ | ⃗eVM ′ |such that: decodeM→M ′(⃗eVM [1:|⃗eVM | −1]) = ⃗eVM ′ [1:i−1], ⃗eVM ′ [i:]∈prefix (decode M→M ′(⃗eVM [−1])), i.e. the last token of ⃗eVM starts from some location within ⃗eVM ′ . In BPE, any ⃗eVM must be a valid encoding according to the first condition. Algorithm 2 leverages this by first converting⃗eVM ′ to its string fo...
work page 2025
-
[3]
The set of (relative) cover encoding and the corresponding probabilities of⃗eM ′[:| ⃗eM ′|−1], i.e. exclude the last token, obtained either from Algorithm 2 or from previous sampling call (to be shown)
-
[4]
The conditional next-token probability distributionP M(.|encodeM ′→M(⃗eM ′[:| ⃗eM ′| − 1])), obtained via previous model inference call. 13 Preprint We define the following indicator function for the event that the tokent∈ V M begins with the sub-tokent ′ ∈ V ′ M through thedecode M→M ′(·)operator. τ(t, t ′) :V M × VM ′ → {0,1}, τ(t, t ′)≜ ( 1,ifdecode M→...
-
[5]
Remove the encodings incover( ⃗eM ′[:| ⃗eM ′| −1])whose last tokentdoes not begins with the sampled sub-token ⃗eM ′[−1], i.e.τ(t, ⃗eM ′[−1]) = 0. We call this setC new
-
[6]
The resulting setC new thus is the desiredcover( ⃗eM ′)
Add the valid encodingsencode M ′→M(⃗eM ′[:| ⃗eM ′| −1])·tfor any tokenstinV M that beginswith the sampled sub-token ⃗eM ′[−1], i.e.τ(t, ⃗eM ′[−1]) = 1, inC new. The resulting setC new thus is the desiredcover( ⃗eM ′). Step 2 (Compute Conditional Distribution).To sample the next tokent ′ ∈ V M ′ according PM ′(.|⃗eM ′), we note that: PM ′(⃗eM ′ ·t ′) =P M...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.