Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series

Adrian Gwoździej; Krzysztof Ociepa; Krzysztof Wróbel; Łukasz Flis; Remigiusz Kinas

arxiv: 2604.10799 · v1 · submitted 2026-04-12 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-10 14:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords tokenizer optimization · Polish LLMs · Bielik models · language-specific modeling · FOCUS initialization · pretraining curriculum · post-training alignment · morphological efficiency

The pith

Bielik v3 models use a Polish-optimized tokenizer to reduce token fertility (the average number of tokens per word) and inference costs for Polish text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents the Bielik v3 7B and 11B models as an effort to make language models more efficient for Polish by optimizing the tokenizer. The general-purpose Mistral tokenizer, designed to cover many languages, fragments Polish words into many tokens; the authors replace it with a dedicated vocabulary built around Polish morphology. They initialize the new embeddings with the FOCUS method, follow a multi-stage pretraining curriculum, and align the model with supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning via group relative policy optimization (GRPO) with verifiable rewards. This matters because multilingual models spend more tokens on Polish text than necessary, which raises inference cost and shortens the effective context. The claim is that a language-specific tokenizer yields practical gains in efficiency and capability for Polish users.
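The link between fertility, cost, and effective context can be made concrete with a little arithmetic. The sketch below assumes fertility means the average number of tokens per word. The function names and the numbers (an 8192-token window, fertilities of 2.0 and 1.25) are illustrative placeholders, not figures reported in the paper.

```python
def effective_words(context_tokens: int, fertility: float) -> float:
    """Approximate number of words that fit in a context window.

    `fertility` is the average number of tokens per word.
    """
    if fertility <= 0:
        raise ValueError("fertility must be positive")
    return context_tokens / fertility


def relative_token_saving(old_fertility: float, new_fertility: float) -> float:
    """Fraction of tokens saved on the same text when fertility drops."""
    if old_fertility <= 0:
        raise ValueError("old_fertility must be positive")
    return 1.0 - new_fertility / old_fertility


# Placeholder values chosen only to illustrate the arithmetic.
# They are NOT measurements from the paper.
CONTEXT_TOKENS = 8192
OLD_FERTILITY = 2.0
NEW_FERTILITY = 1.25

print(effective_words(CONTEXT_TOKENS, OLD_FERTILITY))
print(effective_words(CONTEXT_TOKENS, NEW_FERTILITY))
print(relative_token_saving(OLD_FERTILITY, NEW_FERTILITY))
```

If per-token inference cost is roughly constant, the fraction of tokens saved is also a first-order estimate of the cost saved on the same text.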

Core claim

The Bielik v3 models advance Polish language modeling by transitioning from universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary. This is combined with FOCUS-based embedding initialization, multi-stage pretraining, and post-training alignment using SFT, DPO, and GRPO with verifiable rewards. The result is models that better capture Polish morphological nuances, leading to lower fertility ratios, reduced inference costs, and extended effective context windows.
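For readers unfamiliar with GRPO with verifiable rewards, the core idea can be sketched in a few lines: several completions are sampled for one prompt, each is scored by a programmatic check, and each score is normalized against its own group. The exact-match reward, the use of the population standard deviation, and all names below are assumptions made for illustration; the paper's actual reward functions and training setup are not described in the text above.

```python
from statistics import mean, pstdev
from typing import List


def verifiable_reward(answer: str, reference: str) -> float:
    """Toy verifiable reward: 1.0 on exact match after trimming, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of sampled completions for a single prompt.
completions = ["42", "41", " 42 ", "forty-two"]
rewards = [verifiable_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)

for completion, advantage in zip(completions, advantages):
    print(repr(completion), round(advantage, 3))
```

Completions that pass the check receive a positive advantage and the rest a negative one, so no learned value model is needed to provide a baseline.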

What carries the argument

The Polish-optimized tokenizer vocabulary, which replaces the universal Mistral tokenizer to better handle Polish morphology and reduce tokenization inefficiency.

Load-bearing premise

Replacing the universal tokenizer with a Polish-optimized one, together with the FOCUS initialization and the training stages, yields meaningful improvements in fertility, inference cost, and effective context length for Polish, and those improvements are not artifacts of hidden data biases.

What would settle it

Run the same Polish sentences through both the new tokenizer and the Mistral tokenizer and count the tokens, checking that the new one consistently produces fewer. Finding no difference on a standard Polish dataset would falsify the claimed efficiency gains.
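The check described above amounts to measuring fertility for two tokenizers on the same text. A minimal sketch follows, assuming fertility is tokens per whitespace-delimited word. The two tokenizers are toy stand-ins written for this example (a two-character chunker and a whole-word splitter), not the Mistral or Bielik tokenizers, and the sentences are arbitrary Polish samples rather than a benchmark.

```python
from typing import Callable, Iterable, List

Tokenizer = Callable[[str], List[str]]


def fertility(tokenize: Tokenizer, texts: Iterable[str]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def two_char_tokenizer(text: str) -> List[str]:
    """Toy stand-in for a tokenizer that fragments words heavily."""
    tokens: List[str] = []
    for word in text.split():
        tokens.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return tokens


def whole_word_tokenizer(text: str) -> List[str]:
    """Toy stand-in for a tokenizer with one token per word."""
    return text.split()


# Arbitrary sample sentences, used only to exercise the code.
sentences = ["Zażółć gęślą jaźń", "Litwo, ojczyzno moja"]

baseline = fertility(two_char_tokenizer, sentences)
candidate = fertility(whole_word_tokenizer, sentences)
print(f"baseline fertility:  {baseline:.2f}")
print(f"candidate fertility: {candidate:.2f}")
print(f"token reduction:     {1 - candidate / baseline:.1%}")
```

To run the real comparison, each toy function would be swapped for a callable that wraps an actual tokenizer and returns its token list for a string; the rest of the measurement stays the same.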

Figures

Figures reproduced from arXiv: 2604.10799 by Adrian Gwoździej, Krzysztof Ociepa, Krzysztof Wróbel, Łukasz Flis, Remigiusz Kinas.

Original abstract

The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal tokenizers. These tokenizers, typically designed to cover a broad spectrum of languages, often fail to capture the morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows. This report details the transition from the universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models, exploring the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment involving Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript describes the Bielik v3 7B and 11B Polish language models. It details the switch from a Mistral-based universal tokenizer to a dedicated Polish-optimized vocabulary, along with FOCUS-based embedding initialization, a multi-stage pretraining curriculum, and post-training alignment via Supervised Fine-Tuning, Direct Preference Optimization, and Group Relative Policy Optimization with verifiable rewards, claiming these changes advance Polish modeling through lower fertility ratios, reduced inference costs, and larger effective context windows.

Significance. If the efficiency gains are empirically validated, the work would contribute to language-specific LLM development by illustrating the value of tokenizer optimization for morphologically rich languages, potentially informing more efficient training and inference pipelines for non-English models.

major comments (1)

[Abstract] The central claim that the Polish-optimized tokenizer and associated pipeline produce meaningful gains in fertility, inference cost, and context length is unsupported, as the manuscript supplies no quantitative results, baseline comparisons, fertility ratios, throughput numbers, or ablation studies.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and constructive feedback on our manuscript. We address the major comment point by point below and will make the necessary revisions to strengthen the presentation of our results.

Referee: [Abstract] The central claim that the Polish-optimized tokenizer and associated pipeline produce meaningful gains in fertility, inference cost, and context length is unsupported, as the manuscript supplies no quantitative results, baseline comparisons, fertility ratios, throughput numbers, or ablation studies.

Authors: We agree that the abstract, as currently drafted, presents the central claims at a high level without embedding the supporting quantitative evidence. The body of the manuscript does contain the relevant experimental results, including fertility ratio comparisons against the Mistral baseline, inference throughput measurements, effective context window evaluations, and ablation studies on the tokenizer and initialization choices. To directly address the concern, we will revise the abstract to explicitly include the key metrics (e.g., fertility reduction percentages, inference cost savings, and context length gains) with brief references to the baseline comparisons and ablations. This change will ensure the claims are immediately supported by numbers while preserving the abstract's conciseness.

Revision: yes

Circularity Check

0 steps flagged

No circularity detected; paper is an empirical engineering report with no derivations, equations, or self-referential predictions

The provided manuscript text consists of an abstract and description of a model development pipeline: switching from Mistral tokenizer to Polish-optimized vocabulary, FOCUS embedding initialization, multi-stage pretraining, and post-training with SFT/DPO/GRPO. No equations are present, no parameters are fitted and then renamed as predictions, no uniqueness theorems or ansatzes are imported via self-citation, and no known results are renamed as new unifications. The central narrative is a factual recounting of choices made and intended benefits (lower fertility, etc.), without any load-bearing step that reduces by construction to its own inputs. Per the hard rules, absent any quotable reduction (e.g., Eq. X = Eq. Y), the score is 0 and steps array is empty. Lack of quantitative benchmarks is a separate evidence issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that Polish morphology causes high fertility in universal tokenizers and that the listed training stages will mitigate it. No free parameters or invented entities are explicitly quantified in the abstract.

axioms (2)

domain assumption: Universal tokenizers fail to capture Polish morphological nuances, leading to higher fertility ratios and restricted context windows.
Directly stated in the abstract as the motivation for the tokenizer change.
domain assumption: FOCUS-based embedding initialization enables effective transfer to the new Polish vocabulary.
Invoked as part of the transition method without further justification in the abstract.
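To make the second assumption more concrete, here is a simplified sketch in the spirit of FOCUS-style initialization: tokens shared by the old and new vocabularies keep their trained embeddings, and each genuinely new token starts as a weighted average of the shared tokens' embeddings, with weights derived from similarity in an auxiliary embedding space. This sketch swaps in a plain softmax over cosine similarities for the weighting, which is a simplification and should not be read as the published FOCUS procedure or as the paper's implementation. All names and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_new_embeddings(
    old_emb: Dict[str, Vector],
    aux_emb: Dict[str, Vector],
    new_vocab: List[str],
) -> Dict[str, Vector]:
    """Copy embeddings for shared tokens; mix them for new tokens.

    `aux_emb` must hold an auxiliary vector for every token in `new_vocab`.
    """
    overlap = [t for t in new_vocab if t in old_emb]
    if not overlap:
        raise ValueError("old and new vocabularies share no tokens")
    dim = len(old_emb[overlap[0]])
    new_emb: Dict[str, Vector] = {}
    for token in new_vocab:
        if token in old_emb:
            new_emb[token] = list(old_emb[token])  # keep trained embedding
            continue
        sims = [cosine(aux_emb[token], aux_emb[o]) for o in overlap]
        weights = softmax(sims)  # simplification assumed by this sketch
        new_emb[token] = [
            sum(w * old_emb[o][d] for w, o in zip(weights, overlap))
            for d in range(dim)
        ]
    return new_emb


# Toy vectors, invented purely to exercise the function.
old = {"dom": [1.0, 0.0], "kot": [0.0, 1.0]}
aux = {"dom": [1.0, 0.0], "kot": [0.0, 1.0], "domek": [0.9, 0.1]}
result = init_new_embeddings(old, aux, ["dom", "kot", "domek"])
print(result["domek"])
```

Because each new vector is a convex combination of existing ones, it starts inside the region the pretrained model already uses, which is the intuition behind expecting the transfer to work.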

pith-pipeline@v0.9.0 · 5483 in / 1365 out tokens · 57361 ms · 2026-05-10T14:58:38.057346+00:00 · methodology
