ConvexTok uses convex relaxation of tokenization to a linear program, improving intrinsic metrics, bits-per-byte, and some downstream tasks while certifying near-optimality within 1% at typical vocabulary sizes.
Tokenization and the Noiseless Channel
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
fields
cs.CL 3years
2026 3representative citing papers
Frequency aggregation of supermerge candidates and a two-phase formulation make BoundlessBPE and SuperBPE training over 600x faster on 1GB data while preserving identical results, with open-source Python and Rust code.
citing papers explorer
-
Tokenisation via Convex Relaxations
ConvexTok uses convex relaxation of tokenization to a linear program, improving intrinsic metrics, bits-per-byte, and some downstream tasks while certifying near-optimality within 1% at typical vocabulary sizes.
-
Faster Superword Tokenization
Frequency aggregation of supermerge candidates and a two-phase formulation make BoundlessBPE and SuperBPE training over 600x faster on 1GB data while preserving identical results, with open-source Python and Rust code.
- Tokenization with Split Trees