ToaST uses vocabulary-independent split trees and integer programming to produce tokenizers with over 11% fewer tokens than BPE, WordPiece, and UnigramLM while improving 1.5B-parameter LM scores on CORE.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 3years
2026 3representative citing papers
Frequency aggregation of supermerge candidates and a two-phase formulation make BoundlessBPE and SuperBPE training over 600x faster on 1GB data while preserving identical results, with open-source Python and Rust code.
Performance gaps in multilingual LMs frequently arise from modeling choices such as tokenization and data exposure rather than intrinsic linguistic complexity.
citing papers explorer
-
Faster Superword Tokenization
Frequency aggregation of supermerge candidates and a two-phase formulation make BoundlessBPE and SuperBPE training over 600x faster on 1GB data while preserving identical results, with open-source Python and Rust code.
-
The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?
Performance gaps in multilingual LMs frequently arise from modeling choices such as tokenization and data exposure rather than intrinsic linguistic complexity.