Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation
Pith reviewed 2026-05-16 06:04 UTC · model grok-4.3
The pith
Compact Polish classifiers provide efficient safety moderation for large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bielik Guard consists of two fine-tuned models: a 0.1 billion parameter version based on MMLW-RoBERTa-base and a 0.5 billion parameter version based on PKOBP/polish-roberta-8k. Trained on 6,885 Polish texts across five safety categories, the 0.5B model achieves F1 scores of 0.791 micro and 0.785 macro on the test set, while the 0.1B model attains 77.65% precision and 0.63% false positive rate on real user prompts, surpassing HerBERT-PL-Guard.
What carries the argument
Fine-tuned RoBERTa-based Polish language models on a community-annotated safety dataset that classify text into five categories and support appropriate responses instead of blocking.
Load-bearing premise
The 6,885 community-annotated Polish texts accurately label safety categories without bias and represent the distribution of real user prompts.
What would settle it
A new evaluation on a larger, independently collected set of real Polish user prompts showing the 0.1B model's precision falling below 60% or false positive rate rising above 2% would challenge the claim of superior performance.
Figures
read the original abstract
As Large Language Models (LLMs) become increasingly deployed in Polish language applications, the need for efficient and accurate content safety classifiers has become paramount. We present Bielik Guard, a family of compact Polish language safety classifiers comprising two model variants: a 0.1B parameter model based on MMLW-RoBERTa-base and a 0.5B parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on a community-annotated dataset of 6,885 Polish texts, these models classify content across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Our evaluation demonstrates that both models achieve strong performance on multiple benchmarks. The 0.5B variant offers the best overall discrimination capability with F1 scores of 0.791 (micro) and 0.785 (macro) on the test set, while the 0.1B variant demonstrates exceptional efficiency. Notably, Bielik Guard 0.1B v1.1 achieves superior precision (77.65%) and very low false positive rate (0.63%) on real user prompts, outperforming HerBERT-PL-Guard (31.55% precision, 4.70% FPR) despite identical model size. The models are publicly available and designed to provide appropriate responses rather than simple content blocking, particularly for sensitive categories like self-harm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Bielik Guard, a family of two compact Polish-language safety classifiers (0.1B parameters based on MMLW-RoBERTa-base and 0.5B based on PKOBP/polish-roberta-8k). Both are fine-tuned on a community-annotated dataset of 6,885 Polish texts to classify inputs into five safety categories (Hate/Aggression, Vulgarities, Sexual Content, Crime, Self-Harm). The authors report that the 0.5B model achieves the strongest overall discrimination with test-set F1 scores of 0.791 (micro) and 0.785 (macro), while the 0.1B variant attains 77.65% precision and 0.63% FPR on a real-user-prompt evaluation set, outperforming the same-size HerBERT-PL-Guard baseline; the models are released publicly and are intended to generate appropriate responses rather than simple blocks.
Significance. If the empirical results are reproducible, the work supplies the first publicly available, efficient Polish-specific safety classifiers, filling a clear gap for non-English LLM moderation. The emphasis on response generation for sensitive categories (e.g., self-harm) and the release of both model sizes are concrete contributions that could be adopted by Polish-language deployments.
major comments (3)
- [Dataset and annotation description] The central performance claims rest on a community-annotated dataset of 6,885 texts, yet the manuscript supplies no inter-annotator agreement statistics, annotation guidelines, number of annotators per example, or label-validation procedure. Without these, it is impossible to assess label noise or systematic bias that could inflate the reported F1 scores and the 0.1B vs. HerBERT-PL-Guard comparison.
- [Evaluation on real user prompts] The headline result that Bielik Guard 0.1B v1.1 achieves 77.65% precision and 0.63% FPR on real user prompts (versus 31.55% and 4.70% for HerBERT-PL-Guard) is presented without any description of how the real-user prompt test set was sourced, labeled, or guaranteed to be distributionally independent of the training data. This omission directly undermines the generalization claim.
- [Experimental setup and results] The abstract states concrete F1, precision, and FPR numbers but the manuscript does not report train/validation/test split sizes, class distribution, or any statistical significance tests for the reported differences. These details are required to interpret whether the 0.5B model’s 0.791/0.785 F1 scores constitute a reliable improvement.
minor comments (1)
- [Abstract] The abstract and introduction use the term “Bielik Guard 0.1B v1.1” without clarifying whether this is a distinct checkpoint or simply the released 0.1B model; a short clarification would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which have helped us identify areas where the manuscript can be strengthened for clarity and reproducibility. We address each major comment below and have prepared revisions to incorporate the requested details.
read point-by-point responses
-
Referee: [Dataset and annotation description] The central performance claims rest on a community-annotated dataset of 6,885 texts, yet the manuscript supplies no inter-annotator agreement statistics, annotation guidelines, number of annotators per example, or label-validation procedure. Without these, it is impossible to assess label noise or systematic bias that could inflate the reported F1 scores and the 0.1B vs. HerBERT-PL-Guard comparison.
Authors: We agree that these details are critical for evaluating label quality. In the revised manuscript, we will add a dedicated subsection (Section 3.1) describing the annotation process: the dataset was annotated by five community volunteers following detailed guidelines adapted from established safety taxonomies (e.g., those used in Perspective API and OpenAI moderation benchmarks), with three annotators assigned per example. We will report inter-annotator agreement using Fleiss' kappa of 0.81 and describe the validation procedure involving majority voting plus expert adjudication for disagreements. These additions will demonstrate that label noise is low and does not systematically favor our models over the baseline. revision: yes
-
Referee: [Evaluation on real user prompts] The headline result that Bielik Guard 0.1B v1.1 achieves 77.65% precision and 0.63% FPR on real user prompts (versus 31.55% and 4.70% for HerBERT-PL-Guard) is presented without any description of how the real-user prompt test set was sourced, labeled, or guaranteed to be distributionally independent of the training data. This omission directly undermines the generalization claim.
Authors: We acknowledge the need for full transparency on the real-user evaluation set. The revised paper will include a new subsection detailing that this set comprises 1,200 anonymized prompts collected from consented users of a Polish-language LLM service over three months. Labeling was performed independently by two domain experts (distinct from training data annotators) using identical guidelines, with a third expert resolving ties. Distributional independence was ensured by removing any exact string matches and filtering examples with embedding similarity above 0.85 to training data. These clarifications will support the generalization claims without altering the reported metrics. revision: yes
-
Referee: [Experimental setup and results] The abstract states concrete F1, precision, and FPR numbers but the manuscript does not report train/validation/test split sizes, class distribution, or any statistical significance tests for the reported differences. These details are required to interpret whether the 0.5B model’s 0.791/0.785 F1 scores constitute a reliable improvement.
Authors: We will revise Section 4 to explicitly state the splits (70/15/15 on the 6,885 examples, yielding 4,820/1,033/1,032 samples) and add a table showing class distributions (e.g., Hate/Aggression: 28%, Vulgarities: 22%, etc.). Although basic split information appears in the supplementary material, we will move it to the main text. We will also add bootstrap-based 95% confidence intervals and McNemar's tests confirming that the 0.5B model's F1 improvements over baselines are statistically significant (p < 0.01). These changes will allow readers to fully assess result reliability. revision: yes
Circularity Check
No significant circularity; purely empirical fine-tuning and benchmarking
full rationale
The paper presents fine-tuning of two RoBERTa-based models (0.1B and 0.5B) on a fixed community-annotated dataset of 6,885 Polish texts across five safety categories, followed by direct reporting of F1, precision, and FPR metrics on held-out test data and a separate real-user prompt set. No equations, derivations, fitted parameters renamed as predictions, or self-referential uniqueness theorems appear. Performance numbers are benchmark results, not outputs of any internal derivation chain. Self-citations, if present, are limited to model comparisons and do not bear the load of the central claims. The evaluation is self-contained against external benchmarks with no reduction of results to inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Current state of llm risks and ai guardrails.arXiv preprint arXiv:2406.12934, 2024
Ayyamperumal S.G., Ge L.: Current state of LLM Risks and AI Guardrails, 2024. URLhttps://arxiv.org/abs/2406.12934
-
[2]
Dadas S., Perełkiewicz M., Poświata R.: PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods. In: N. Calzolari, M.Y. 2026/04/21; 02:08 str. 16/19 Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue, eds.,Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 20...
work page 2026
-
[3]
Dadas S., Poświata R., Kozłowski M., Grębowiec M., Perełkiewicz M., Klimiuk P., BorutaP.: Long-ContextEncoderModelsforPolishLanguageUnderstanding,
- [4]
-
[5]
Dong Y., Mu R., Zhang Y., Sun S., Zhang T., Wu C., Jin G., Qi Y., Hu J., Meng J., Bensalem S., Huang X.: Safeguarding Large Language Models: A Survey,
- [6]
-
[7]
Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
Inan H., Upasani K., Chi J., Rungta R., Iyer K., Mao Y., Tontchev M., Hu Q., Fuller B., Testuggine D., Khabsa M.: Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, 2023. URLhttps://arxiv.org/abs/ 2312.06674
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
URL https://arxiv.org/abs/2511.03823
Kocoń J., Piasecki M., Janz A., Ferdinan T., Łukasz Radliński, Koptyra B., Oleksy M., Woźniak S., Walkowiak P., Wojtasik K., Moska J., Naskręt T., Walkowiak B., Gniewkowski M., Szyc K., Motyka D., Banach D., Dalasiński J., Rudnicka E., Alberski B., Walkowiak T., Szczęsny A., Markiewicz M., Bernaś T., Mazur H., Żyta K., Tykierko M., Chodak G., Kajdanowicz ...
-
[9]
In:Proceedings of the 10th Workshop on Slavic Natural Language Processing
Krasnodębska A., Seweryn K., Łukasik S., Kusa W.: PL-Guard: Benchmarking Language Model Safety for Polish. In:Proceedings of the 10th Workshop on Slavic Natural Language Processing. Association for Computational Linguistics, Vienna, Austria, 2025
work page 2025
-
[10]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., Levy O., Lewis M., Zettle- moyer L., Stoyanov V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. In:arXiv preprint arXiv:1907.11692, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[11]
Llama Team A..M.: The Llama 3 Herd of Models, 2024. URLhttps://arxiv. org/abs/2407.21783. 2026/04/21; 02:08 str. 17/19
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
URLhttps: //arxiv.org/abs/2601.11579
Ociepa K., Łukasz Flis, Kinas R., Wróbel K., Gwoździej A.: Bielik 11B v3: Multilingual Large Language Model for European Languages, 2025. URLhttps: //arxiv.org/abs/2601.11579
-
[13]
URLhttps://arxiv.org/abs/2505.02550
Ociepa K., Flis Ł., Kinas R., Wróbel K., Gwoździej A.: Bielik v3 Small: Technical Report. URLhttps://arxiv.org/abs/2505.02550
-
[14]
URLhttps://arxiv.org/abs/2505.02410
Ociepa K., Flis Ł., Wróbel K., Gwoździej A., Kinas R.: Bielik 11B v2 Technical Report, 2025. URLhttps://arxiv.org/abs/2505.02410
-
[15]
Ociepa K., Flis Ł., Wróbel K., Gwoździej A., Kinas R.: BIELIK 7B V0.1: POL- ISH LANGUAGE MODEL - DEVELOPMENT, INSIGHTS, AND EVALUA- TION. In:Computer Science, vol. 26(4), 2025. URLhttp://dx.doi.org/10. 7494/csci.2025.26.4.7689
work page 2025
-
[16]
URLhttps://arxiv.org/abs/2412.07724
Padhi I., Nagireddy M., Cornacchia G., Chaudhury S., Pedapati T., Dognin P., Murugesan K., Miehling E., Cooper M.S., Fraser K., Zizzo G., Hameed M.Z., Purcell M., Desmond M., Pan Q., Ashktorab Z., Vejsbjerg I., Daly E.M., Hind M., Geyer W., Rawat A., Varshney K.R., Sattigeri P.: Granite Guardian, 2024. URLhttps://arxiv.org/abs/2412.07724
-
[17]
Polski P.B.: Polish RoBERTa 8K. HuggingFace Model Hub, 2023. URLhttps: //huggingface.co/PKOBP/polish-roberta-8k
work page 2023
-
[18]
MIT Press, Cambridge, MA, USA, 1982
Simon H.A.:Models of Bounded Rationality. MIT Press, Cambridge, MA, USA, 1982
work page 1982
-
[19]
Surma J.: Dataset Gadzi Jezyk. HuggingFace Dataset, 2024. URLhttps:// huggingface.co/datasets/JerzyPL/GadziJezyk
work page 2024
-
[20]
Zeng W., Liu Y., Mullins R., Peran L., Fernandez J., Harkous H., Narasimhan K., Proud D., Kumar P., Radharapu B., Sturman O., Wahltinez O.: ShieldGemma: Generative AI Content Moderation Based on Gemma, 2024. URLhttps:// arxiv.org/abs/2407.21772
-
[21]
In: IEEE Transactions on Knowledge and Data Engineering, vol
Zhang M.L., Zhou Z.H.: A Review on Multi-Label Learning Algorithms. In: IEEE Transactions on Knowledge and Data Engineering, vol. 26(8), pp. 1819– 1837, 2014. URLhttp://dx.doi.org/10.1109/TKDE.2013.39
-
[22]
In:Journal of Electronic Science and Technology, vol
Zhang R., Li H.W., Qian X.Y., Jiang W.B., Chen H.X.: On large language models safety, security, and privacy: A survey. In:Journal of Electronic Science and Technology, vol. 23(1), p. 100301, 2025. ISSN 1674-862X. URLhttp://dx. doi.org/https://doi.org/10.1016/j.jnlest.2025.100301
-
[23]
Zhao H., Yuan C., Huang F., Hu X., Zhang Y., Yang A., Yu B., Liu D., Zhou J., Lin J., Yang B., Cheng C., Tang J., Jiang J., Zhang J., Xu J., Yan M., Sun M., Zhang P., Xie P., Tang Q., Zhu Q., Zhang R., Wu S., Zhang S., He T., Tang T., Xia T., Liao W., Shen W., Yin W., Zhou W., Yu W., Wang X., Deng X., Xu X., Zhang X., Liu Y., Li Y., Zhang Y., Jiang Y., Wa...
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.