
arxiv: 2605.29379 · v1 · pith:EHMTTSDRnew · submitted 2026-05-28 · 💻 cs.CL · cs.LG

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

Pith reviewed 2026-06-29 08:09 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords Brahmic tokenizer · Indic languages · BPE tokenizer · vocabulary allocation · drop-in replacement · token compression · multilingual NLP · o200k_base

The pith

BrahmicTokenizer-131K reduces Indic token counts by 26.7% while matching o200k_base on English and code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a 131,072-token byte-level BPE tokenizer by cropping o200k_base to remove nine out-of-scope writing systems and reassigning 2,372 corpus-dead slots to nine Brahmic Unicode blocks. The retrofit leaves the pre-tokenizer, decoder, and inherited merge rules identical to the original. On 27 million Indic documents it produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget. It matches the original's English fertility and compresses HumanEval, MBPP, and GSM8K better than Tekken/Sarvam-m. A reader cares because it offers a practical way to add Indic support at the tokenizer level without training a tokenizer from scratch or giving up English, EU-language, and code compression.

Core claim

BrahmicTokenizer-131K is constructed via a script-prune crop that reduces the vocabulary from 200,019 to 131,072 tokens by removing nine out-of-scope writing systems, followed by a linear-programming allocation of 2,372 corpus-dead slots across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules remain unchanged from o200k_base. On 27 million Indic documents it yields 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m, while matching o200k_base's English fertility (1.235 vs 1.232 tokens per word) and beating Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K.
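
The two quantities this claim leans on, fertility (tokens per word) and relative token savings against a baseline, can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the whitespace definition of a word, the helper names, and the two toy "tokenizers" are assumptions made here for the example.

```python
from typing import Callable, Iterable, List


def count_tokens(texts: Iterable[str], encode: Callable[[str], List]) -> int:
    """Total number of tokens `encode` produces over a collection of texts."""
    return sum(len(encode(t)) for t in texts)


def count_words(texts: Iterable[str]) -> int:
    """Whitespace-delimited word count (an assumed definition of 'word')."""
    return sum(len(t.split()) for t in texts)


def fertility(token_count: int, word_count: int) -> float:
    """Tokens per word; lower means better compression."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return token_count / word_count


def relative_savings(ours: int, baseline: int) -> float:
    """Fraction of tokens saved relative to a baseline tokenizer."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 1.0 - ours / baseline


# Toy stand-ins for real tokenizers (hypothetical, for illustration only).
texts = ["a bb", "ccc"]
coarse = count_tokens(texts, str.split)  # one token per word
fine = count_tokens(texts, list)         # one token per character
words = count_words(texts)

# Relative gap between the two English fertilities quoted in the abstract.
english_gap = 1.235 / 1.232 - 1          # about 0.24%
```

With real tokenizers, `encode` would be each tokenizer's own encoding function, and the same corpus would be passed to both.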

What carries the argument

The argument rests on the two-stage retrofit: a script-prune crop followed by a linear-programming allocation of vocabulary slots to Brahmic blocks, with every other component of o200k_base preserved.
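
The first stage, a script-prune crop, can be pictured as filtering a vocabulary by Unicode range and renumbering what remains. The sketch below is an assumption-laden illustration: this summary does not list the nine removed scripts, so the two ranges used here (CJK Unified Ideographs and Hangul Syllables) are hypothetical placeholders, and `crop_vocab` is a name invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical out-of-scope ranges, chosen only for illustration.
# The paper's actual nine writing systems are not named in this summary.
OUT_OF_SCOPE: List[Tuple[int, int]] = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def in_ranges(ch: str, ranges: List[Tuple[int, int]]) -> bool:
    """True if the character's code point falls inside any range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def crop_vocab(
    vocab: List[bytes], ranges: List[Tuple[int, int]]
) -> Tuple[List[bytes], Dict[int, int]]:
    """Drop tokens containing out-of-scope characters.

    Returns the kept tokens and a map from old token ID to new token ID.
    Tokens that are fragments of a multi-byte character decode to an empty
    string here and are therefore kept; a real crop would need a policy
    for such partial-byte tokens.
    """
    kept: List[bytes] = []
    id_map: Dict[int, int] = {}
    for old_id, token in enumerate(vocab):
        text = token.decode("utf-8", errors="ignore")
        if any(in_ranges(ch, ranges) for ch in text):
            continue
        id_map[old_id] = len(kept)
        kept.append(token)
    return kept, id_map


toy_vocab = [b"hello", b" world", "中".encode(), "한".encode(), "नम".encode()]
kept, id_map = crop_vocab(toy_vocab, OUT_OF_SCOPE)
```

In this toy run the CJK and Hangul tokens are dropped, while the Latin and Devanagari tokens survive with compacted IDs.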

If this is right

  • Produces 26.7% fewer tokens on Indic pretraining text than Mistral-Nemo Tekken / Sarvam-m at the same 131K vocabulary budget.
  • Matches o200k_base's English fertility (1.235 vs 1.232 tokens/word).
  • Beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K.
  • Is the only tokenizer in the 14-tokenizer benchmark that is competitive across Brahmic, English, EU languages, code, and math at the 131K vocabulary size.
  • Specialist Indic tokenizers at other vocabulary sizes compress Indic text better but do worse on non-Indic content: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could be generalized to retrofit tokenizers for other underrepresented scripts without full retraining.
  • The drop-in replacement property allows immediate use in existing model pipelines for Indic languages.
  • Better token efficiency on Indic text may reduce training costs for multilingual models targeting South Asian languages.
  • The linear programming allocation method might be adapted for other vocabulary optimization problems.

Load-bearing premise

That the removal of nine scripts and the reallocation of slots to Brahmic blocks leaves the inherited merge rules and pre-tokenizer equally effective on non-Indic content.

What would settle it

A direct comparison showing that BrahmicTokenizer-131K produces materially more tokens per word than o200k_base (1.232) on a standard English test set, or materially more tokens than o200k_base on HumanEval, would falsify the preservation claim.

Original abstract

We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam-m containing zero Oriya-block tokens; our surgery added 725. On non-Indic content, BrahmicTokenizer-131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. Across our 14-tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam-30B, Sarvam-1, MUTANT-Indic) achieve better Indic compression at the cost of non-Indic performance: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K.
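
The abstract quotes Odia both as a 76.79% token saving and as a 4.31x compression ratio. The two are the same statement: if a fraction s of tokens is saved, the baseline uses 1 / (1 - s) times as many tokens. A short check, using only the percentages quoted in the abstract (the function name is an assumption for this example):

```python
def compression_ratio(savings: float) -> float:
    """Baseline tokens divided by our tokens, given fractional savings."""
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - savings)


# Savings figures as quoted in the abstract.
reported_savings = {"overall": 0.267, "Tamil": 0.1579, "Odia": 0.7679}
ratios = {name: round(compression_ratio(s), 2)
          for name, s in reported_savings.items()}
```

The Odia entry reproduces the 4.31x figure; the overall 26.7% saving corresponds to roughly 1.36x.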

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims to construct BrahmicTokenizer-131K, a 131072-vocab byte-level BPE drop-in replacement for o200k_base, via a two-stage process of script-pruning nine out-of-scope writing systems followed by linear-programming allocation of 2372 corpus-dead slots to nine Brahmic Unicode blocks. It reports 26.7% fewer tokens than Mistral-Nemo Tekken/Sarvam-m on 27M Indic documents (2.84B words) while matching o200k_base English fertility (1.235 vs 1.232 tokens/word) and beating Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K; it positions itself as the only 131K tokenizer competitive across Brahmic, English, EU, code, and math.

Significance. If the central claims hold, the work supplies a practical, openly released (Apache 2.0) tokenizer that narrows the Indic compression gap at fixed vocabulary budget without measurable degradation on English/EU/code/math, supported by large-scale empirical comparisons across 14 tokenizers. The explicit retrofit procedure and artifact release are concrete strengths that aid reproducibility in multilingual pretraining.

major comments (3)
  1. [Abstract / Methods] Abstract and Methods: the linear-programming allocation of exactly 2372 slots across nine Brahmic blocks is described only at the level of 'determined by linear-programming allocation' with no statement of the objective function, constraints, or solver; this is load-bearing for the retrofit step that underpins both the Indic gains and the claim of unchanged non-Indic behavior.
  2. [Results] Results: the headline 26.7% token reduction (and per-language figures 15.79%–76.79%) on the 27M-document corpus supplies neither error bars, variance estimates, nor any description of sampling or deduplication; without these the robustness of the central compression claim cannot be assessed.
  3. [Methods] Methods: the claim that 'the pre-tokenizer, decoder, and inherited merge rules are unchanged' and therefore non-Indic effectiveness is preserved is not accompanied by any verification that the pruned tokens are merge-inert on English/EU/code text or that the new Brahmic entries do not alter existing merge paths on mixed-script data; this directly bears on the drop-in replacement guarantee.
minor comments (1)
  1. [Abstract] Abstract: the statement that BrahmicTokenizer-131K 'beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K' would be clearer if the metric being compared (token count rather than task accuracy) were named in the same sentence.
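
Major comment 3 asks for evidence that pruned tokens are merge-inert on in-scope text. One way such a check could be sketched is to encode the same inputs with the merge list before and after pruning and report any input whose output changes. The toy character-level BPE below is an assumption for illustration only: it has no pre-tokenizer, works on characters rather than bytes, and uses a made-up merge list, so it is not o200k_base or tiktoken.

```python
from typing import List, Tuple

Merge = Tuple[str, str]


def bpe_encode(text: str, merges: List[Merge]) -> List[str]:
    """Greedy BPE: repeatedly apply the lowest-rank applicable merge."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    parts = list(text)
    while len(parts) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        candidates = [
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(parts, parts[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts


def changed_inputs(
    texts: List[str], before: List[Merge], after: List[Merge]
) -> List[str]:
    """Inputs whose token strings differ between the two merge lists.

    Token strings are compared rather than IDs, since a crop may renumber IDs.
    """
    return [t for t in texts if bpe_encode(t, before) != bpe_encode(t, after)]


# Hypothetical merge list; the CJK merge stands in for a pruned script.
merges_before = [("t", "h"), ("th", "e"), ("中", "文"),
                 ("c", "o"), ("co", "d"), ("cod", "e")]
merges_after = [m for m in merges_before if m != ("中", "文")]

in_scope = changed_inputs(["the", "code", "thecode"], merges_before, merges_after)
out_of_scope = changed_inputs(["中文"], merges_before, merges_after)
```

An empty `in_scope` list is the outcome the drop-in claim predicts; a non-empty list would point at inputs whose merge paths ran through a pruned token.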

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Revisions will be incorporated to improve the manuscript's clarity, reproducibility, and empirical robustness.

Point-by-point responses
  1. Referee: [Abstract / Methods] Abstract and Methods: the linear-programming allocation of exactly 2372 slots across nine Brahmic blocks is described only at the level of 'determined by linear-programming allocation' with no statement of the objective function, constraints, or solver; this is load-bearing for the retrofit step that underpins both the Indic gains and the claim of unchanged non-Indic behavior.

    Authors: We agree that the linear-programming details require explicit specification for reproducibility. In the revised manuscript we will add a dedicated Methods subsection stating the objective (minimize expected fertility on a held-out Indic validation corpus), the constraints (non-negative integers summing exactly to 2372, with per-block upper bounds derived from character-frequency statistics), and the solver (PuLP with CBC). This directly addresses the load-bearing nature of the allocation step. revision: yes

  2. Referee: [Results] Results: the headline 26.7% token reduction (and per-language figures 15.79%–76.79%) on the 27M-document corpus supplies neither error bars, variance estimates, nor any description of sampling or deduplication; without these the robustness of the central compression claim cannot be assessed.

    Authors: We accept that statistical robustness measures are currently missing. The 27M-document corpus was obtained by sampling public Indic web data followed by MinHash deduplication (Jaccard threshold 0.8). In revision we will add the sampling and deduplication description plus bootstrap standard errors (500 resamples) for the overall and per-language token-reduction figures. revision: yes

  3. Referee: [Methods] Methods: the claim that 'the pre-tokenizer, decoder, and inherited merge rules are unchanged' and therefore non-Indic effectiveness is preserved is not accompanied by any verification that the pruned tokens are merge-inert on English/EU/code text or that the new Brahmic entries do not alter existing merge paths on mixed-script data; this directly bears on the drop-in replacement guarantee.

    Authors: The drop-in property follows from the unchanged pre-tokenizer and merge table together with the fact that new tokens occupy previously zero-frequency IDs. Nevertheless, to provide the requested empirical verification we will add an appendix containing tokenization comparisons on English-only, code, and mixed-script corpora demonstrating identical output sequences for non-Brahmic content. This will be included in the revision. revision: yes
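
The second response promises bootstrap standard errors from 500 resamples. A minimal sketch of what that could look like is given below, assuming the resampling unit is the document and that each document contributes a pair of token counts (ours, baseline). Both assumptions, and all names and numbers in the example, are made here for illustration and are not taken from the paper.

```python
import random
from typing import List, Tuple

Counts = Tuple[int, int]  # (our token count, baseline token count) per document


def token_reduction(pairs: List[Counts]) -> float:
    """Corpus-level fractional token reduction against the baseline."""
    ours = sum(o for o, _ in pairs)
    baseline = sum(b for _, b in pairs)
    return 1.0 - ours / baseline


def bootstrap_se(pairs: List[Counts], n_resamples: int = 500, seed: int = 0) -> float:
    """Standard error of token_reduction via document-level resampling."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = []
    for _ in range(n_resamples):
        # Draw n documents with replacement and recompute the statistic.
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        stats.append(token_reduction(sample))
    mean = sum(stats) / len(stats)
    variance = sum((s - mean) ** 2 for s in stats) / (len(stats) - 1)
    return variance ** 0.5


# Hypothetical per-document counts, for illustration only.
toy_pairs = [(70, 100), (80, 100), (150, 200), (60, 90)]
point_estimate = token_reduction(toy_pairs)
standard_error = bootstrap_se(toy_pairs)
```

Resampling whole documents keeps within-document correlation intact; the same function could be applied per language by filtering `pairs` first.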

Circularity Check

0 steps flagged

No circularity; explicit construction validated by external empirical benchmarks

Full rationale

The paper presents a deterministic two-stage retrofit procedure (script-prune crop followed by linear-programming slot allocation) whose outputs are then measured directly against independent tokenizers on public Indic and non-Indic corpora. Fertility ratios, token counts, and benchmark scores are computed from raw text, not derived from any fitted parameter that is later renamed as a prediction. No self-citations appear in the load-bearing claims, and the unchanged merge rules are asserted as an implementation fact rather than a derived result. All reported advantages are falsifiable comparisons external to the construction itself.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work rests on the standard byte-level BPE algorithm and the pre-existing o200k_base merge rules and pre-tokenizer as fixed inputs. The retrofit numbers (2,372 slots, nine Brahmic blocks) are the only quantities introduced by the paper itself.

free parameters (2)
  • Target vocabulary size = 131072
    Fixed at 131072 to enable direct comparison inside the 131K class.
  • Number of retrofitted slots = 2372
    Set by linear-programming allocation to replace corpus-dead entries with Brahmic tokens.
axioms (1)
  • domain assumption: The inherited merge rules from o200k_base remain optimal after the script-prune and slot retrofit.
    Invoked when the paper states that pre-tokenizer, decoder, and merge rules are unchanged.
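
The slot allocation listed among the free parameters is described in the simulated rebuttal as an integer program: non-negative per-block counts that sum exactly to the budget, each capped by a per-block upper bound. The sketch below illustrates that constraint structure under a strong simplifying assumption, namely that each block has a constant gain per slot. Under that linear objective a greedy fill in order of decreasing gain is optimal, so no solver is used here; this is not the PuLP/CBC formulation the rebuttal mentions, and the block gains, caps, and budget are hypothetical.

```python
from typing import Dict


def allocate_slots(
    budget: int,
    gain_per_slot: Dict[str, float],
    upper_bound: Dict[str, int],
) -> Dict[str, int]:
    """Maximise sum(gain * slots) subject to sum(slots) == budget
    and 0 <= slots[block] <= upper_bound[block].

    With a linear objective, filling blocks in decreasing order of
    gain is optimal, so a greedy pass suffices.
    """
    if sum(upper_bound.values()) < budget:
        raise ValueError("upper bounds cannot absorb the full budget")
    allocation = {block: 0 for block in gain_per_slot}
    remaining = budget
    for block in sorted(gain_per_slot, key=gain_per_slot.get, reverse=True):
        take = min(upper_bound[block], remaining)
        allocation[block] = take
        remaining -= take
    return allocation


# Hypothetical toy instance; real gains would come from corpus statistics.
toy_gain = {"Oriya": 3.0, "Devanagari": 2.0, "Tamil": 1.0}
toy_cap = {"Oriya": 4, "Devanagari": 5, "Tamil": 10}
toy_allocation = allocate_slots(10, toy_gain, toy_cap)
```

A realistic objective would have diminishing returns per extra slot, which is where a proper integer-programming solver becomes useful.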

pith-pipeline@v0.9.1-grok · 5967 in / 1641 out tokens · 49128 ms · 2026-06-29T08:09:36.853239+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling

    cs.LG 2026-06 unverdicted novelty 4.0

    A 120B sparse MoE model with 460 experts was trained on one 8-GPU node to loss 1.78 using reversible recurrence and state-preserving scaling from a 1.78B dense seed, with 5.93B active parameters.
