Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series
Pith reviewed 2026-05-10 14:58 UTC · model grok-4.3
The pith
Bielik v3 models use a Polish-optimized tokenizer to reduce token fertility and inference costs for Polish text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Bielik v3 models advance Polish language modeling by transitioning from universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary. This is combined with FOCUS-based embedding initialization, multi-stage pretraining, and post-training alignment using SFT, DPO, and GRPO with verifiable rewards. The result is models that better capture Polish morphological nuances, leading to lower fertility ratios, reduced inference costs, and extended effective context windows.
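The last post-training stage named above, GRPO with verifiable rewards, can be illustrated with a minimal sketch of its group-relative advantage step. This is not the authors' implementation: the exact-match reward, the function names, the use of population standard deviation, and the epsilon value are all assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def verifiable_reward(completion: str, expected: str) -> float:
    """Hypothetical verifiable reward: 1.0 on exact match after trimming, else 0.0."""
    return 1.0 if completion.strip() == expected.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its own sampling group.

    Assumption: population standard deviation plus a small epsilon;
    real implementations may differ in these details.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt whose checkable answer is "42".
completions = ["42", "41", "forty-two", " 42 "]
rewards = [verifiable_reward(c, "42") for c in completions]
print(group_relative_advantages(rewards))
```

Completions that beat their group's average get a positive advantage and the rest a negative one, so no separate value model is needed to score them.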
What carries the argument
The Polish-optimized tokenizer vocabulary, which replaces the universal Mistral tokenizer to better handle Polish morphology and reduce tokenization inefficiency.
Load-bearing premise
Replacing the universal tokenizer with a Polish-optimized one along with the initialization and training stages will lead to meaningful improvements in fertility, inference cost, and context length for Polish without hidden data biases.
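The link this premise draws between fertility and context length is simple arithmetic, sketched below. The context size and fertility values are hypothetical placeholders, not figures from the paper, and fertility is assumed to mean average tokens per word.

```python
def effective_words(context_tokens: int, fertility: float) -> float:
    """Approximate number of words that fit in a fixed token budget."""
    if fertility <= 0:
        raise ValueError("fertility must be positive")
    return context_tokens / fertility


def tokens_needed(word_count: int, fertility: float) -> float:
    """Approximate number of tokens processed for a text of word_count words."""
    if fertility <= 0:
        raise ValueError("fertility must be positive")
    return word_count * fertility


# Hypothetical numbers for illustration only.
CONTEXT_TOKENS = 8192
for label, f in [("universal tokenizer", 2.0), ("language-specific tokenizer", 1.6)]:
    print(label, effective_words(CONTEXT_TOKENS, f), tokens_needed(1000, f))
```

Under these assumed values, the same token window holds more words and the same text costs fewer tokens, which is the mechanism behind the claimed inference savings.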
What would settle it
Run the same Polish sentences through both the new tokenizer and the Mistral tokenizer and compare the token counts. A consistent reduction with the new tokenizer would support the claim; finding no difference on a standard Polish dataset would falsify the claimed efficiency gains.
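A minimal sketch of such a check follows, assuming fertility is measured as tokens per whitespace-separated word. The two tokenizers are toy stand-ins, not the Bielik or Mistral tokenizers, and all function names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Tokenizer = Callable[[str], List[str]]


def fertility(tokenize: Tokenizer, sentences: Iterable[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        total_tokens += len(tokenize(sentence))
        total_words += len(sentence.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


def compare(baseline: Tokenizer, candidate: Tokenizer, sentences: List[str]) -> Dict[str, float]:
    """Fertility of both tokenizers and the candidate's relative reduction."""
    base = fertility(baseline, sentences)
    cand = fertility(candidate, sentences)
    return {"baseline": base, "candidate": cand, "relative_reduction": 1.0 - cand / base}


def char_pair_tokenizer(text: str) -> List[str]:
    """Toy stand-in for a poorly fitting tokenizer: 2-character chunks per word."""
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


def word_tokenizer(text: str) -> List[str]:
    """Toy stand-in for a well-fitting tokenizer: one token per word."""
    return text.split()


SENTENCES = ["Zażółć gęślą jaźń", "Ala ma kota"]
print(compare(char_pair_tokenizer, word_tokenizer, SENTENCES))
```

For a real comparison, the `tokenize` method of a tokenizer loaded with Hugging Face `transformers` could be passed in place of the toy functions, and `SENTENCES` replaced with a held-out Polish corpus.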
Original abstract
The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal tokenizers. These tokenizers, typically designed to cover a broad spectrum of languages, often fail to capture the morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows. This report details the transition from the universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models, exploring the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment involving Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the Bielik v3 7B and 11B Polish language models. It details the switch from a Mistral-based universal tokenizer to a dedicated Polish-optimized vocabulary, along with FOCUS-based embedding initialization, a multi-stage pretraining curriculum, and post-training alignment via Supervised Fine-Tuning, Direct Preference Optimization, and Group Relative Policy Optimization with verifiable rewards, claiming these changes advance Polish modeling through lower fertility ratios, reduced inference costs, and larger effective context windows.
Significance. If the efficiency gains are empirically validated, the work would contribute to language-specific LLM development by illustrating the value of tokenizer optimization for morphologically rich languages, potentially informing more efficient training and inference pipelines for non-English models.
major comments (1)
- [Abstract] Abstract: the central claim that the Polish-optimized tokenizer and associated pipeline produce meaningful gains in fertility, inference cost, and context length is unsupported, as the manuscript supplies no quantitative results, baseline comparisons, fertility ratios, throughput numbers, or ablation studies.
Simulated Author's Rebuttal
We thank the referee for their careful review and constructive feedback on our manuscript. We address the major comment point by point below and will make the necessary revisions to strengthen the presentation of our results.
- Referee: [Abstract] Abstract: the central claim that the Polish-optimized tokenizer and associated pipeline produce meaningful gains in fertility, inference cost, and context length is unsupported, as the manuscript supplies no quantitative results, baseline comparisons, fertility ratios, throughput numbers, or ablation studies.
  Authors: We agree that the abstract, as currently drafted, presents the central claims at a high level without embedding the supporting quantitative evidence. The body of the manuscript does contain the relevant experimental results, including fertility ratio comparisons against the Mistral baseline, inference throughput measurements, effective context window evaluations, and ablation studies on the tokenizer and initialization choices. To directly address the concern, we will revise the abstract to explicitly include the key metrics (e.g., fertility reduction percentages, inference cost savings, and context length gains) with brief references to the baseline comparisons and ablations. This change will ensure the claims are immediately supported by numbers while preserving the abstract's conciseness.
  Revision: yes
Circularity Check
No circularity detected; the paper is an empirical engineering report with no derivations, equations, or self-referential predictions.
The provided manuscript text consists of an abstract and description of a model development pipeline: switching from Mistral tokenizer to Polish-optimized vocabulary, FOCUS embedding initialization, multi-stage pretraining, and post-training with SFT/DPO/GRPO. No equations are present, no parameters are fitted and then renamed as predictions, no uniqueness theorems or ansatzes are imported via self-citation, and no known results are renamed as new unifications. The central narrative is a factual recounting of choices made and intended benefits (lower fertility, etc.), without any load-bearing step that reduces by construction to its own inputs. Per the hard rules, absent any quotable reduction (e.g., Eq. X = Eq. Y), the score is 0 and steps array is empty. Lack of quantitative benchmarks is a separate evidence issue, not circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Universal tokenizers fail to capture Polish morphological nuances, leading to higher fertility ratios and restricted context windows.
- Domain assumption: FOCUS-based embedding initialization enables effective transfer to the new Polish vocabulary.
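The second assumption can be made concrete with a simplified sketch of overlap-based embedding initialization in the spirit of FOCUS. Tokens shared by the old and new vocabularies keep their embeddings, and each new token is built as a similarity-weighted mix of the shared ones. Everything here is an assumption for illustration: the auxiliary embeddings are hypothetical, a dense softmax is swapped in for simplicity, and the published method differs in details.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def softmax(scores: List[float], temperature: float = 0.1) -> List[float]:
    """Numerically stable softmax (a simplification chosen for this sketch)."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def init_new_embeddings(
    source_emb: Dict[str, Vector],  # embeddings under the old vocabulary
    aux_emb: Dict[str, Vector],     # hypothetical auxiliary embeddings covering target_vocab
    target_vocab: List[str],
) -> Dict[str, Vector]:
    """Copy overlapping tokens; build the rest as weighted mixes of overlapping ones."""
    overlap = [t for t in target_vocab if t in source_emb]
    if not overlap:
        raise ValueError("vocabularies share no tokens")
    dim = len(source_emb[overlap[0]])
    result: Dict[str, Vector] = {}
    for token in target_vocab:
        if token in source_emb:
            result[token] = list(source_emb[token])
            continue
        sims = [cosine(aux_emb[token], aux_emb[o]) for o in overlap]
        weights = softmax(sims)
        result[token] = [
            sum(w * source_emb[o][d] for w, o in zip(weights, overlap))
            for d in range(dim)
        ]
    return result


source = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
aux = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.1]}
print(init_new_embeddings(source, aux, ["a", "b", "c"]))
```

In this toy case the new token "c" sits close to "a" in the auxiliary space, so its initial embedding is dominated by the embedding of "a".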