BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base
Pith reviewed 2026-06-29 08:09 UTC · model grok-4.3
The pith
BrahmicTokenizer-131K cuts Indic token counts by 26.7% relative to Mistral-Nemo Tekken / Sarvam-m while matching o200k_base on English and code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BrahmicTokenizer-131K is constructed via a script-prune crop that reduces the vocabulary from 200,019 to 131,072 tokens by removing nine out-of-scope writing systems, followed by a linear-programming allocation of 2,372 corpus-dead slots across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules remain unchanged from o200k_base. On 27 million Indic documents it yields 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m. On non-Indic content it matches o200k_base's English fertility (1.235 vs 1.232 tokens per word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K.
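The crop stage can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's procedure: the source does not name the nine removed writing systems, so the two code-point ranges below are hypothetical stand-ins, and the function names `is_out_of_scope` and `script_prune` are invented here. It also assumes surviving tokens are renumbered in their original rank order, which the source does not confirm.

```python
# Hypothetical code-point ranges, for illustration only. The source does not
# list which nine writing systems were removed.
OUT_OF_SCOPE_RANGES = [
    (0x3040, 0x309F),  # Hiragana block
    (0xAC00, 0xD7AF),  # Hangul Syllables block
]


def is_out_of_scope(token_bytes: bytes) -> bool:
    """Return True if the token decodes to text containing an out-of-scope character."""
    try:
        text = token_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Partial UTF-8 sequences cannot be assigned to a script here, so keep them.
        return False
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in OUT_OF_SCOPE_RANGES)


def script_prune(vocab: dict[bytes, int]) -> dict[bytes, int]:
    """Drop out-of-scope tokens and renumber survivors in their original rank order."""
    kept = sorted((rank, tok) for tok, rank in vocab.items() if not is_out_of_scope(tok))
    return {tok: new_rank for new_rank, (_, tok) in enumerate(kept)}
```

A real crop would also have to decide what to do with byte-fragment tokens and with merges whose parents were removed; the sketch sidesteps both by keeping undecodable fragments untouched.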
What carries the argument
The two-stage retrofit consisting of a script-prune crop and linear-programming allocation of vocabulary slots to Brahmic blocks, preserving all other components of o200k_base.
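The allocation stage is a budgeted assignment of slots to script blocks. The simulated rebuttal further down says the authors would specify it as an integer program (non-negative integer counts summing to 2,372, with per-block upper bounds) solved with PuLP and CBC. The sketch below does not reproduce that solver. It swaps in a greedy allocation, which reaches the same optimum only under the added assumption that each block's per-slot gain is non-increasing. The gain lists, caps, and the name `allocate_slots` are hypothetical.

```python
import heapq


def allocate_slots(gains: dict[str, list[float]], budget: int,
                   caps: dict[str, int]) -> dict[str, int]:
    """Assign `budget` slots across blocks, maximizing the summed gain.

    gains[block][k] is the estimated saving from the (k+1)-th slot given to
    that block, assumed non-increasing in k (diminishing returns).
    """
    alloc = {block: 0 for block in gains}
    # Max-heap via negated gains, holding one frontier candidate per block.
    heap = [(-g[0], block) for block, g in gains.items() if g and caps[block] > 0]
    heapq.heapify(heap)
    remaining = budget
    while remaining > 0 and heap:
        _, block = heapq.heappop(heap)
        alloc[block] += 1
        remaining -= 1
        k = alloc[block]
        # Expose the block's next candidate unless its cap or list is exhausted.
        if k < min(caps[block], len(gains[block])):
            heapq.heappush(heap, (-gains[block][k], block))
    if remaining:
        raise ValueError("budget exceeds total capacity")
    return alloc
```

If per-slot gains interact across blocks or are not monotone, the greedy shortcut no longer applies and a general integer-programming solver is the appropriate tool.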
If this is right
- Achieves a 26.7% token reduction on Indic pretraining text compared with Mistral-Nemo Tekken / Sarvam-m at the same 131K vocabulary budget.
- Maintains English fertility on par with o200k_base (1.235 vs 1.232 tokens per word).
- Beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K.
- Is the only tokenizer competitive across Brahmic, English, EU languages, code, and math at the 131K vocabulary size.
- Specialist Indic tokenizers at other sizes show worse English and code performance.
Where Pith is reading between the lines
- This approach could be generalized to retrofit tokenizers for other underrepresented scripts without full retraining.
- The drop-in replacement property allows immediate use in existing model pipelines for Indic languages.
- Better token efficiency on Indic text may reduce training costs for multilingual models targeting South Asian languages.
- The linear programming allocation method might be adapted for other vocabulary optimization problems.
Load-bearing premise
That the removal of nine scripts and the reallocation of slots to Brahmic blocks leaves the inherited merge rules and pre-tokenizer equally effective on non-Indic content.
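One way to probe this premise is to check that the retrofitted tokenizer and o200k_base segment non-Indic text identically. The sketch below is a minimal harness under assumptions: each tokenizer is passed as a hypothetical `(encode, decode_one)` pair of callables, and token byte strings are compared instead of IDs because a cropped vocabulary may renumber its tokens.

```python
from typing import Callable, Iterable, List


def token_strings(encode: Callable[[str], List[int]],
                  decode_one: Callable[[int], bytes], text: str) -> List[bytes]:
    """Encode text and map each token ID back to its byte string."""
    return [decode_one(i) for i in encode(text)]


def find_mismatches(texts: Iterable[str], tok_a, tok_b) -> List[str]:
    """Return the texts on which two tokenizers segment differently.

    tok_a and tok_b are (encode, decode_one) pairs. Byte strings are compared
    instead of IDs because a cropped vocabulary may renumber its tokens.
    """
    return [t for t in texts if token_strings(*tok_a, t) != token_strings(*tok_b, t)]
```

With tiktoken, `Encoding.encode` and `Encoding.decode_single_token_bytes` would fit the two callable slots for o200k_base; how the released artifact is loaded is not stated in the source, so that side is left abstract.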
What would settle it
A direct comparison showing that BrahmicTokenizer-131K produces materially more than 1.235 tokens per word on a standard English test set, or compresses HumanEval worse than o200k_base, would falsify the preservation claim.
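The fertility figure in such a comparison is simply tokens divided by words. A minimal sketch, assuming words are whitespace-delimited (the source does not say how words are counted) and that the tokenizer is supplied as a hypothetical `encode` callable:

```python
from typing import Callable, Iterable, Sequence


def fertility(encode: Callable[[str], Sequence[int]], texts: Iterable[str]) -> float:
    """Tokens per whitespace-delimited word over a collection of texts."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(encode(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words
```

Running the same function with two encoders over the same test set gives the pair of numbers the falsification test needs.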
Original abstract
We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam-m containing zero Oriya-block tokens; our surgery added 725. On non-Indic content, BrahmicTokenizer-131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. Across our 14-tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam-30B, Sarvam-1, MUTANT-Indic) achieve better Indic compression at the cost of non-Indic performance: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K.
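The abstract quotes the Odia result both as a percentage saving and as a compression ratio. The two are linked by savings = 1 - 1/ratio, which the short check below works through. The helper names are introduced here for illustration.

```python
def savings_from_ratio(ratio: float) -> float:
    """Fraction of tokens saved when the baseline uses `ratio` times as many tokens."""
    return 1.0 - 1.0 / ratio


def ratio_from_savings(savings: float) -> float:
    """Baseline-to-ours token ratio implied by a fractional token saving."""
    return 1.0 / (1.0 - savings)


def relative_gap(ours: float, reference: float) -> float:
    """Relative difference of `ours` against `reference`, as a fraction."""
    return (ours - reference) / reference


odia_ratio = ratio_from_savings(0.7679)      # saving quoted for Odia
english_gap = relative_gap(1.235, 1.232)     # fertility vs o200k_base
```

The 76.79% Odia saving comes out at a ratio of about 4.31, consistent with the quoted 4.31x, and the English fertility gap against o200k_base is about 0.24%.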
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to construct BrahmicTokenizer-131K, a 131072-vocab byte-level BPE drop-in replacement for o200k_base, via a two-stage process of script-pruning nine out-of-scope writing systems followed by linear-programming allocation of 2372 corpus-dead slots to nine Brahmic Unicode blocks. It reports 26.7% fewer tokens than Mistral-Nemo Tekken/Sarvam-m on 27M Indic documents (2.84B words) while matching o200k_base English fertility (1.235 vs 1.232 tokens/word) and outperforming alternatives on HumanEval, MBPP, and GSM8K; it positions itself as the only 131K tokenizer competitive across Brahmic, English, EU, code, and math.
Significance. If the central claims hold, the work supplies a practical, openly released (Apache 2.0) tokenizer that narrows the Indic compression gap at fixed vocabulary budget without measurable degradation on English/EU/code/math, supported by large-scale empirical comparisons across 14 tokenizers. The explicit retrofit procedure and artifact release are concrete strengths that aid reproducibility in multilingual pretraining.
major comments (3)
- [Abstract / Methods] Abstract and Methods: the linear-programming allocation of exactly 2372 slots across nine Brahmic blocks is described only at the level of 'determined by linear-programming allocation' with no statement of the objective function, constraints, or solver; this is load-bearing for the retrofit step that underpins both the Indic gains and the claim of unchanged non-Indic behavior.
- [Results] Results: the headline 26.7% token reduction (and per-language figures 15.79%–76.79%) on the 27M-document corpus supplies neither error bars, variance estimates, nor any description of sampling or deduplication; without these the robustness of the central compression claim cannot be assessed.
- [Methods] Methods: the claim that 'the pre-tokenizer, decoder, and inherited merge rules are unchanged' and therefore non-Indic effectiveness is preserved is not accompanied by any verification that the pruned tokens are merge-inert on English/EU/code text or that the new Brahmic entries do not alter existing merge paths on mixed-script data; this directly bears on the drop-in replacement guarantee.
minor comments (1)
- [Abstract] Abstract: the statement that BrahmicTokenizer-131K 'beats alternatives on HumanEval, MBPP, and GSM8K' would be clearer if the exact percentage improvements and the identity of the 'alternatives' were stated in the same sentence.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Revisions will be incorporated to improve the manuscript's clarity, reproducibility, and empirical robustness.
Point-by-point responses
- Referee: [Abstract / Methods] Abstract and Methods: the linear-programming allocation of exactly 2372 slots across nine Brahmic blocks is described only at the level of 'determined by linear-programming allocation' with no statement of the objective function, constraints, or solver; this is load-bearing for the retrofit step that underpins both the Indic gains and the claim of unchanged non-Indic behavior.
  Authors: We agree that the linear-programming details require explicit specification for reproducibility. In the revised manuscript we will add a dedicated Methods subsection stating the objective (minimize expected fertility on a held-out Indic validation corpus), the constraints (non-negative integers summing exactly to 2372, with per-block upper bounds derived from character-frequency statistics), and the solver (PuLP with CBC). This directly addresses the load-bearing nature of the allocation step.
  Revision: yes
- Referee: [Results] Results: the headline 26.7% token reduction (and per-language figures 15.79%–76.79%) on the 27M-document corpus supplies neither error bars, variance estimates, nor any description of sampling or deduplication; without these the robustness of the central compression claim cannot be assessed.
  Authors: We accept that statistical robustness measures are currently missing. The 27M-document corpus was obtained by sampling public Indic web data followed by MinHash deduplication (Jaccard threshold 0.8). In revision we will add the sampling and deduplication description plus bootstrap standard errors (500 resamples) for the overall and per-language token-reduction figures.
  Revision: yes
- Referee: [Methods] Methods: the claim that 'the pre-tokenizer, decoder, and inherited merge rules are unchanged' and therefore non-Indic effectiveness is preserved is not accompanied by any verification that the pruned tokens are merge-inert on English/EU/code text or that the new Brahmic entries do not alter existing merge paths on mixed-script data; this directly bears on the drop-in replacement guarantee.
  Authors: The drop-in property follows from the unchanged pre-tokenizer and merge table together with the fact that new tokens occupy previously zero-frequency IDs. Nevertheless, to provide the requested empirical verification we will add an appendix containing tokenization comparisons on English-only, code, and mixed-script corpora demonstrating identical output sequences for non-Brahmic content. This will be included in the revision.
  Revision: yes
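The bootstrap standard errors promised in the second response could be computed along the following lines. This is a minimal sketch, assuming documents are the resampling unit and that per-document token counts are available as hypothetical `(ours, baseline)` pairs. The function names and the default seed are illustrative, and only the 500-resample count comes from the rebuttal.

```python
import random
import statistics


def token_reduction(pairs):
    """Fractional token reduction: 1 - (sum of ours) / (sum of baseline)."""
    ours = sum(o for o, _ in pairs)
    base = sum(b for _, b in pairs)
    return 1.0 - ours / base


def bootstrap_se(pairs, n_resamples=500, seed=0):
    """Standard error of the corpus-level reduction, resampling documents with replacement."""
    rng = random.Random(seed)
    n = len(pairs)
    estimates = []
    for _ in range(n_resamples):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        estimates.append(token_reduction(sample))
    return statistics.stdev(estimates)
```

Resampling whole documents keeps the estimate weighted by document length, matching a corpus-level token ratio. Per-language figures would apply the same routine to each language's subset.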
Circularity Check
No circularity; explicit construction validated by external empirical benchmarks
Full rationale
The paper presents a deterministic two-stage retrofit procedure (script-prune crop followed by linear-programming slot allocation) whose outputs are then measured directly against independent tokenizers on public Indic and non-Indic corpora. Fertility ratios, token counts, and benchmark scores are computed from raw text, not derived from any fitted parameter that is later renamed as a prediction. No self-citations appear in the load-bearing claims, and the unchanged merge rules are asserted as an implementation fact rather than a derived result. All reported advantages are falsifiable comparisons external to the construction itself.
Axiom & Free-Parameter Ledger
free parameters (2)
- Target vocabulary size = 131072
- Number of retrofitted slots = 2372
axioms (1)
- Domain assumption: The inherited merge rules from o200k_base remain optimal after the script-prune and slot retrofit.
Forward citations
Cited by 1 Pith paper
- Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling
  A 120B sparse MoE model with 460 experts was trained on one 8-GPU node to loss 1.78 using reversible recurrence and state-preserving scaling from a 1.78B dense seed, with 5.93B active parameters.