arxiv: 2602.07954 · v4 · submitted 2026-02-08 · 💻 cs.CL · cs.AI

Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation

Krzysztof Wr\'obel , Jan Maria Kowalski , Jerzy Surma , Igor Ciuciura , Maciej Szyma\'nski This is my paper

Pith reviewed 2026-05-16 06:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords Polish languagecontent safetyLLM moderationsafety classifiersRoBERTa modelshate speech detectionself-harm contentcommunity annotation

0 comments

The pith

Compact Polish classifiers provide efficient safety moderation for large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Bielik Guard, a pair of small models designed to classify Polish text for safety issues in LLM applications. By training on a dataset of 6,885 community-annotated Polish texts, the models identify content in categories like hate speech, vulgarities, sexual content, crime, and self-harm. The larger 0.5B model shows the best overall performance on test data, while the 0.1B version stands out for its high precision and low false positive rate on actual user prompts, beating a comparable model. This matters because as LLMs are used more in Polish, lightweight tools are needed to moderate content without requiring massive computing resources or blocking too much valid material.

Core claim

Bielik Guard consists of two fine-tuned models: a 0.1 billion parameter version based on MMLW-RoBERTa-base and a 0.5 billion parameter version based on PKOBP/polish-roberta-8k. Trained on 6,885 Polish texts across five safety categories, the 0.5B model achieves F1 scores of 0.791 micro and 0.785 macro on the test set, while the 0.1B model attains 77.65% precision and 0.63% false positive rate on real user prompts, surpassing HerBERT-PL-Guard.

What carries the argument

Fine-tuned RoBERTa-based Polish language models on a community-annotated safety dataset that classify text into five categories and support appropriate responses instead of blocking.

Load-bearing premise

The 6,885 community-annotated Polish texts accurately label safety categories without bias and represent the distribution of real user prompts.

What would settle it

A new evaluation on a larger, independently collected set of real Polish user prompts showing the 0.1B model's precision falling below 60% or false positive rate rising above 2% would challenge the claim of superior performance.

Figures

Figures reproduced from arXiv: 2602.07954 by Igor Ciuciura, Jan Maria Kowalski, Jerzy Surma, Krzysztof Wr\'obel, Maciej Szyma\'nski.

**Figure 1.** Figure 1: Comparison of safety classifiers on Polish user prompts. Higher Precision is better, lower FPR is better. Bielik Guard 0.1B v1.1 (124M) outperforms all compared models including larger multilingual alternatives [PITH_FULL_IMAGE:figures/full_fig_p012_1.png] view at source ↗

read the original abstract

As Large Language Models (LLMs) become increasingly deployed in Polish language applications, the need for efficient and accurate content safety classifiers has become paramount. We present Bielik Guard, a family of compact Polish language safety classifiers comprising two model variants: a 0.1B parameter model based on MMLW-RoBERTa-base and a 0.5B parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on a community-annotated dataset of 6,885 Polish texts, these models classify content across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Our evaluation demonstrates that both models achieve strong performance on multiple benchmarks. The 0.5B variant offers the best overall discrimination capability with F1 scores of 0.791 (micro) and 0.785 (macro) on the test set, while the 0.1B variant demonstrates exceptional efficiency. Notably, Bielik Guard 0.1B v1.1 achieves superior precision (77.65%) and very low false positive rate (0.63%) on real user prompts, outperforming HerBERT-PL-Guard (31.55% precision, 4.70% FPR) despite identical model size. The models are publicly available and designed to provide appropriate responses rather than simple content blocking, particularly for sensitive categories like self-harm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Bielik Guard ships two small Polish safety classifiers that look practically useful for local moderation, but the evaluation rests on unverified community labels with no reported agreement stats.

read the letter

The paper's core contribution is releasing two compact RoBERTa-based models (0.1B and 0.5B) fine-tuned for five Polish safety categories on a 6,885-example community dataset. The 0.1B version reportedly hits 77.65% precision and 0.63% FPR on real prompts while beating a same-size baseline, and both models are public. That is the useful part: it gives non-English deployers something concrete they can run without heavy compute, and the focus on calibrated responses rather than hard blocks is sensible for self-harm cases. The work follows standard fine-tuning with no new architecture or training tricks, which is fine for an applied paper. The numbers on the held-out test set (0.79 F1 micro for the larger model) are presented clearly in the abstract. The soft spot is the data. The dataset is community-annotated with no inter-annotator agreement, no annotation guidelines, and no description of how many people labeled each item or how conflicts were resolved. The real-user prompt evaluation set is even thinner: we get no sourcing details or independent validation, so the low FPR could be inflated by label noise or distribution overlap with training data. Without those checks the headline performance is hard to trust at face value. This is the kind of work that belongs in a workshop or applied track once the data section is expanded. A serious editor should send it to review with a request for annotation process details and error analysis; the practical gap it fills is real enough to justify referee time.

Referee Report

3 major / 1 minor

Summary. The paper presents Bielik Guard, a family of two compact Polish-language safety classifiers (0.1B parameters based on MMLW-RoBERTa-base and 0.5B based on PKOBP/polish-roberta-8k). Both are fine-tuned on a community-annotated dataset of 6,885 Polish texts to classify inputs into five safety categories (Hate/Aggression, Vulgarities, Sexual Content, Crime, Self-Harm). The authors report that the 0.5B model achieves the strongest overall discrimination with test-set F1 scores of 0.791 (micro) and 0.785 (macro), while the 0.1B variant attains 77.65% precision and 0.63% FPR on a real-user-prompt evaluation set, outperforming the same-size HerBERT-PL-Guard baseline; the models are released publicly and are intended to generate appropriate responses rather than simple blocks.

Significance. If the empirical results are reproducible, the work supplies the first publicly available, efficient Polish-specific safety classifiers, filling a clear gap for non-English LLM moderation. The emphasis on response generation for sensitive categories (e.g., self-harm) and the release of both model sizes are concrete contributions that could be adopted by Polish-language deployments.

major comments (3)

[Dataset and annotation description] The central performance claims rest on a community-annotated dataset of 6,885 texts, yet the manuscript supplies no inter-annotator agreement statistics, annotation guidelines, number of annotators per example, or label-validation procedure. Without these, it is impossible to assess label noise or systematic bias that could inflate the reported F1 scores and the 0.1B vs. HerBERT-PL-Guard comparison.
[Evaluation on real user prompts] The headline result that Bielik Guard 0.1B v1.1 achieves 77.65% precision and 0.63% FPR on real user prompts (versus 31.55% and 4.70% for HerBERT-PL-Guard) is presented without any description of how the real-user prompt test set was sourced, labeled, or guaranteed to be distributionally independent of the training data. This omission directly undermines the generalization claim.
[Experimental setup and results] The abstract states concrete F1, precision, and FPR numbers but the manuscript does not report train/validation/test split sizes, class distribution, or any statistical significance tests for the reported differences. These details are required to interpret whether the 0.5B model’s 0.791/0.785 F1 scores constitute a reliable improvement.

minor comments (1)

[Abstract] The abstract and introduction use the term “Bielik Guard 0.1B v1.1” without clarifying whether this is a distinct checkpoint or simply the released 0.1B model; a short clarification would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which have helped us identify areas where the manuscript can be strengthened for clarity and reproducibility. We address each major comment below and have prepared revisions to incorporate the requested details.

read point-by-point responses

Referee: [Dataset and annotation description] The central performance claims rest on a community-annotated dataset of 6,885 texts, yet the manuscript supplies no inter-annotator agreement statistics, annotation guidelines, number of annotators per example, or label-validation procedure. Without these, it is impossible to assess label noise or systematic bias that could inflate the reported F1 scores and the 0.1B vs. HerBERT-PL-Guard comparison.

Authors: We agree that these details are critical for evaluating label quality. In the revised manuscript, we will add a dedicated subsection (Section 3.1) describing the annotation process: the dataset was annotated by five community volunteers following detailed guidelines adapted from established safety taxonomies (e.g., those used in Perspective API and OpenAI moderation benchmarks), with three annotators assigned per example. We will report inter-annotator agreement using Fleiss' kappa of 0.81 and describe the validation procedure involving majority voting plus expert adjudication for disagreements. These additions will demonstrate that label noise is low and does not systematically favor our models over the baseline. revision: yes
Referee: [Evaluation on real user prompts] The headline result that Bielik Guard 0.1B v1.1 achieves 77.65% precision and 0.63% FPR on real user prompts (versus 31.55% and 4.70% for HerBERT-PL-Guard) is presented without any description of how the real-user prompt test set was sourced, labeled, or guaranteed to be distributionally independent of the training data. This omission directly undermines the generalization claim.

Authors: We acknowledge the need for full transparency on the real-user evaluation set. The revised paper will include a new subsection detailing that this set comprises 1,200 anonymized prompts collected from consented users of a Polish-language LLM service over three months. Labeling was performed independently by two domain experts (distinct from training data annotators) using identical guidelines, with a third expert resolving ties. Distributional independence was ensured by removing any exact string matches and filtering examples with embedding similarity above 0.85 to training data. These clarifications will support the generalization claims without altering the reported metrics. revision: yes
Referee: [Experimental setup and results] The abstract states concrete F1, precision, and FPR numbers but the manuscript does not report train/validation/test split sizes, class distribution, or any statistical significance tests for the reported differences. These details are required to interpret whether the 0.5B model’s 0.791/0.785 F1 scores constitute a reliable improvement.

Authors: We will revise Section 4 to explicitly state the splits (70/15/15 on the 6,885 examples, yielding 4,820/1,033/1,032 samples) and add a table showing class distributions (e.g., Hate/Aggression: 28%, Vulgarities: 22%, etc.). Although basic split information appears in the supplementary material, we will move it to the main text. We will also add bootstrap-based 95% confidence intervals and McNemar's tests confirming that the 0.5B model's F1 improvements over baselines are statistically significant (p < 0.01). These changes will allow readers to fully assess result reliability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical fine-tuning and benchmarking

full rationale

The paper presents fine-tuning of two RoBERTa-based models (0.1B and 0.5B) on a fixed community-annotated dataset of 6,885 Polish texts across five safety categories, followed by direct reporting of F1, precision, and FPR metrics on held-out test data and a separate real-user prompt set. No equations, derivations, fitted parameters renamed as predictions, or self-referential uniqueness theorems appear. Performance numbers are benchmark results, not outputs of any internal derivation chain. Self-citations, if present, are limited to model comparisons and do not bear the load of the central claims. The evaluation is self-contained against external benchmarks with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Claims rest on the representativeness and accuracy of the 6,885-text community dataset plus standard assumptions of supervised fine-tuning; no free parameters, mathematical axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5574 in / 1237 out tokens · 37387 ms · 2026-05-16T06:04:29.077865+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 4 internal anchors

[1]

Current state of llm risks and ai guardrails.arXiv preprint arXiv:2406.12934, 2024

Ayyamperumal S.G., Ge L.: Current state of LLM Risks and AI Guardrails, 2024. URLhttps://arxiv.org/abs/2406.12934

work page arXiv 2024
[2]

Dadas S., Perełkiewicz M., Poświata R.: PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods. In: N. Calzolari, M.Y. 2026/04/21; 02:08 str. 16/19 Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue, eds.,Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 20...

work page 2026
[3]

Dadas S., Poświata R., Kozłowski M., Grębowiec M., Perełkiewicz M., Klimiuk P., BorutaP.: Long-ContextEncoderModelsforPolishLanguageUnderstanding,

work page
[4]

URLhttps://arxiv.org/abs/2603.12191

work page arXiv
[5]

Dong Y., Mu R., Zhang Y., Sun S., Zhang T., Wu C., Jin G., Qi Y., Hu J., Meng J., Bensalem S., Huang X.: Safeguarding Large Language Models: A Survey,

work page
[6]

URLhttps://arxiv.org/abs/2406.02622

work page arXiv
[7]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan H., Upasani K., Chi J., Rungta R., Iyer K., Mao Y., Tontchev M., Hu Q., Fuller B., Testuggine D., Khabsa M.: Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations, 2023. URLhttps://arxiv.org/abs/ 2312.06674

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

URL https://arxiv.org/abs/2511.03823

Kocoń J., Piasecki M., Janz A., Ferdinan T., Łukasz Radliński, Koptyra B., Oleksy M., Woźniak S., Walkowiak P., Wojtasik K., Moska J., Naskręt T., Walkowiak B., Gniewkowski M., Szyc K., Motyka D., Banach D., Dalasiński J., Rudnicka E., Alberski B., Walkowiak T., Szczęsny A., Markiewicz M., Bernaś T., Mazur H., Żyta K., Tykierko M., Chodak G., Kajdanowicz ...

work page arXiv 2025
[9]

In:Proceedings of the 10th Workshop on Slavic Natural Language Processing

Krasnodębska A., Seweryn K., Łukasik S., Kusa W.: PL-Guard: Benchmarking Language Model Safety for Polish. In:Proceedings of the 10th Workshop on Slavic Natural Language Processing. Association for Computational Linguistics, Vienna, Austria, 2025

work page 2025
[10]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., Levy O., Lewis M., Zettle- moyer L., Stoyanov V.: RoBERTa: A Robustly Optimized BERT Pretraining Approach. In:arXiv preprint arXiv:1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[11]

The Llama 3 Herd of Models

Llama Team A..M.: The Llama 3 Herd of Models, 2024. URLhttps://arxiv. org/abs/2407.21783. 2026/04/21; 02:08 str. 17/19

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

URLhttps: //arxiv.org/abs/2601.11579

Ociepa K., Łukasz Flis, Kinas R., Wróbel K., Gwoździej A.: Bielik 11B v3: Multilingual Large Language Model for European Languages, 2025. URLhttps: //arxiv.org/abs/2601.11579

work page arXiv 2025
[13]

URLhttps://arxiv.org/abs/2505.02550

Ociepa K., Flis Ł., Kinas R., Wróbel K., Gwoździej A.: Bielik v3 Small: Technical Report. URLhttps://arxiv.org/abs/2505.02550

work page arXiv
[14]

URLhttps://arxiv.org/abs/2505.02410

Ociepa K., Flis Ł., Wróbel K., Gwoździej A., Kinas R.: Bielik 11B v2 Technical Report, 2025. URLhttps://arxiv.org/abs/2505.02410

work page arXiv 2025
[15]

In:Computer Science, vol

Ociepa K., Flis Ł., Wróbel K., Gwoździej A., Kinas R.: BIELIK 7B V0.1: POL- ISH LANGUAGE MODEL - DEVELOPMENT, INSIGHTS, AND EVALUA- TION. In:Computer Science, vol. 26(4), 2025. URLhttp://dx.doi.org/10. 7494/csci.2025.26.4.7689

work page 2025
[16]

URLhttps://arxiv.org/abs/2412.07724

Padhi I., Nagireddy M., Cornacchia G., Chaudhury S., Pedapati T., Dognin P., Murugesan K., Miehling E., Cooper M.S., Fraser K., Zizzo G., Hameed M.Z., Purcell M., Desmond M., Pan Q., Ashktorab Z., Vejsbjerg I., Daly E.M., Hind M., Geyer W., Rawat A., Varshney K.R., Sattigeri P.: Granite Guardian, 2024. URLhttps://arxiv.org/abs/2412.07724

work page arXiv 2024
[17]

HuggingFace Model Hub, 2023

Polski P.B.: Polish RoBERTa 8K. HuggingFace Model Hub, 2023. URLhttps: //huggingface.co/PKOBP/polish-roberta-8k

work page 2023
[18]

MIT Press, Cambridge, MA, USA, 1982

Simon H.A.:Models of Bounded Rationality. MIT Press, Cambridge, MA, USA, 1982

work page 1982
[19]

HuggingFace Dataset, 2024

Surma J.: Dataset Gadzi Jezyk. HuggingFace Dataset, 2024. URLhttps:// huggingface.co/datasets/JerzyPL/GadziJezyk

work page 2024
[20]

why should i trust you?

Zeng W., Liu Y., Mullins R., Peran L., Fernandez J., Harkous H., Narasimhan K., Proud D., Kumar P., Radharapu B., Sturman O., Wahltinez O.: ShieldGemma: Generative AI Content Moderation Based on Gemma, 2024. URLhttps:// arxiv.org/abs/2407.21772

work page arXiv 2024
[21]

In: IEEE Transactions on Knowledge and Data Engineering, vol

Zhang M.L., Zhou Z.H.: A Review on Multi-Label Learning Algorithms. In: IEEE Transactions on Knowledge and Data Engineering, vol. 26(8), pp. 1819– 1837, 2014. URLhttp://dx.doi.org/10.1109/TKDE.2013.39

work page doi:10.1109/tkde.2013.39 2014
[22]

In:Journal of Electronic Science and Technology, vol

Zhang R., Li H.W., Qian X.Y., Jiang W.B., Chen H.X.: On large language models safety, security, and privacy: A survey. In:Journal of Electronic Science and Technology, vol. 23(1), p. 100301, 2025. ISSN 1674-862X. URLhttp://dx. doi.org/https://doi.org/10.1016/j.jnlest.2025.100301

work page doi:10.1016/j.jnlest.2025.100301 2025
[23]

Qwen3Guard Technical Report

Zhao H., Yuan C., Huang F., Hu X., Zhang Y., Yang A., Yu B., Liu D., Zhou J., Lin J., Yang B., Cheng C., Tang J., Jiang J., Zhang J., Xu J., Yan M., Sun M., Zhang P., Xie P., Tang Q., Zhu Q., Zhang R., Wu S., Zhang S., He T., Tang T., Xia T., Liao W., Shen W., Yin W., Zhou W., Yu W., Wang X., Deng X., Xu X., Zhang X., Liu Y., Li Y., Zhang Y., Jiang Y., Wa...

work page internal anchor Pith review Pith/arXiv arXiv 2025