Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Aleksandr Smechov; Ihor Stepanov

arXiv: 2605.29659 · v1 · pith:3TLVIB7Jnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-06-29 09:09 UTC · model grok-4.3

keywords: LLM safety, guardrail models, toxicity classification, jailbreak detection, multi-task classification, encoder models, safety taxonomy, harmful content detection

The pith

Opir encoder models match or exceed open-weight guardrails on safety benchmarks with far fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Opir, a family of encoder-based guardrail models built on the GLiClass architecture for detecting unsafe prompts, toxic language, jailbreaks, and harmful responses. These models are trained on a three-level taxonomy spanning 996 categories using a mix of taxonomy-grounded examples, adversarially mined hard negatives, benign safety examples, generated responses, multilingual data, and subsets from Aegis2 and WildGuard. They support binary safe/unsafe classification, multi-label toxicity, jailbreak detection, and zero-shot unsafe categorization, with edge variants under 100M parameters. An open evaluation harness covers 12 safety tasks and 17 category tasks across public benchmarks. The central claim is that Opir variants are competitive with or ahead of the strongest open-weight baselines on most datasets while requiring a substantially smaller deployment footprint.

Core claim

Opir is a family of encoder-based guardrail models built on the GLiClass architecture and trained on a three-level taxonomy of 996 categories. The training data includes taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. Variants handle binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization, with edge variants under 100M parameters. Across 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems, Opir var…

What carries the argument

Encoder models built on the GLiClass architecture, trained for multi-task safety classification on a three-level taxonomy of 996 categories, using taxonomy-grounded data combined with adversarially mined hard negatives.
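To make the idea of classifying against a candidate label set concrete, here is a minimal sketch of label-conditioned multi-label scoring. It is a generic illustration, not Opir's or GLiClass's actual pooling or scoring modules: the vectors are hand-written stand-ins for encoder outputs, and the dot-product logit, the `score_labels` and `predict` helpers, and the 0.5 threshold are all assumptions made for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def score_labels(text_vec, label_vecs):
    """Score each candidate label independently (multi-label setting).

    Assumption: the logit is a plain dot product between a pooled text
    vector and a pooled label vector. A real model may score differently.
    """
    scores = {}
    for label, vec in label_vecs.items():
        logit = sum(t * l for t, l in zip(text_vec, vec))
        scores[label] = sigmoid(logit)
    return scores


def predict(text_vec, label_vecs, threshold=0.5):
    """Return every label whose probability reaches the threshold."""
    probs = score_labels(text_vec, label_vecs)
    return sorted(label for label, p in probs.items() if p >= threshold)


# Toy vectors standing in for encoder outputs (hypothetical values).
text_vec = [1.0, -0.5, 2.0]
label_vecs = {
    "toxicity": [0.5, 0.0, 1.0],    # logit = 2.5
    "jailbreak": [-1.0, 1.0, 0.0],  # logit = -1.5
    "self_harm": [0.0, 0.0, -1.0],  # logit = -2.0
}

print(predict(text_vec, label_vecs))
```

Because each label is scored on its own, new candidate labels can be added at inference time without changing the scoring code, which is the property a zero-shot categorization setup relies on.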

If this is right

Real-time safety filtering for LLM applications becomes possible without the inference cost of large guardrail models.
Multi-task handling allows simultaneous binary classification, toxicity labeling, jailbreak detection, and zero-shot subcategory assignment.
Edge variants under 100M parameters enable deployment in resource-limited environments for binary safe/unsafe decisions.
The open evaluation harness supports consistent testing of GLiClass, GLiNER2, and decoder-based models on prompt and response safety.
Distinction between benign sensitive text and covert harmful content improves across multilingual and response-level tasks.
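The filtering pattern these points describe can be sketched as a thin wrapper that checks both the prompt and the response. This is an illustrative sketch only: `guarded_generate`, `toy_classify`, and `toy_generate` are hypothetical names, and the keyword-based classifier merely stands in for a real guardrail model.

```python
from typing import Callable

REFUSAL = "Request blocked by safety filter."


def guarded_generate(
    prompt: str,
    classify: Callable[[str], str],
    generate: Callable[[str], str],
    refusal: str = REFUSAL,
) -> str:
    """Run a safe/unsafe check on the prompt, then on the response."""
    if classify(prompt) == "unsafe":
        return refusal  # block before spending any generation cost
    response = generate(prompt)
    if classify(response) == "unsafe":
        return refusal  # block unsafe output from a benign-looking prompt
    return response


def toy_classify(text: str) -> str:
    """Hypothetical stand-in for a guardrail model: flags a marker token."""
    return "unsafe" if "[unsafe]" in text else "safe"


def toy_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    if prompt == "trigger":
        return "[unsafe] output"
    return f"Echo: {prompt}"


print(guarded_generate("hello", toy_classify, toy_generate))
```

The classifier runs up to twice per request in this layout, which is why its latency and parameter count matter for real-time use.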

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Compact size could allow direct embedding of safety checks inside LLM serving stacks rather than separate API calls.
The adversarial mining technique might transfer to training classifiers for other high-stakes detection domains.
Zero-shot capability on the leaf labels could reduce the need for retraining when new harmful categories emerge.
Smaller footprint lowers the barrier for independent developers to run custom safety layers on consumer hardware.

Load-bearing premise

The specific mix of taxonomy-grounded data, adversarially mined negatives, and benchmark subsets produces models that generalize beyond the 12 safety and 17 category tasks evaluated.

What would settle it

A new test set of jailbreak attempts or harmful categories outside the 996-category taxonomy where Opir variants show large performance drops compared to the reported baselines.

Figures

Figures reproduced from arXiv: 2605.29659 by Aleksandr Smechov, Ihor Stepanov.

**Figure 1.** Overview of Opir prediction tasks. Safe/unsafe classification is modeled as binary classification, while toxicity, jailbreak, and unsafe-category prediction are modeled as multi-label classification heads over task-specific label schemas. … jailbreaks, zero-shot categorization), supports 23 languages, and is trained on a 996-label taxonomy that explicitly includes benign safety-preserving categories to sup…

**Figure 2.** Model architecture of Opir. Candidate labels and the input text are jointly encoded by a GLiClass-style bidirectional encoder. Task-specific pooling and scoring modules then produce logits for safe/unsafe classification, toxicity detection, jailbreak detection, and taxonomy-category prediction. Formally, given an input text t and a candidate label set L = {ℓ1, …, ℓk}, the encoder produces contextual r…

**Figure 3.** Data construction pipeline for Opir. Taxonomy nodes seed unsafe prompt generation, hard-negative mining, benign-sensitive contrast construction, response generation and judging, multilingual translation, and final task-view formatting for training and evaluation. Data construction also includes benign or safety-preserving contrast examples drawn from the taxonomy’s safe_and_benign branch. These examples co…

**Figure 4.** Latency–macro-F1 efficiency comparison. The figure summarizes the trade-off between classification quality and serving cost across Opir variants and baseline guardrail systems.
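Since Figure 4 plots macro-F1, a short sketch of how that metric is commonly computed for multi-label predictions may help. The helper names and toy data are hypothetical, and the convention of scoring a label as 0.0 when it has no true or predicted positives is an assumption; the paper's harness may average differently.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Per-label F1; returns 0.0 when the label never occurs (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 over the given label set."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Toy gold and predicted label sets (hypothetical data).
y_true = [{"toxicity"}, {"jailbreak"}, {"toxicity", "jailbreak"}, set()]
y_pred = [{"toxicity"}, set(), {"toxicity"}, {"jailbreak"}]

print(macro_f1(y_true, y_pred, ["toxicity", "jailbreak"]))
```

Macro averaging weights every label equally, so a rare category that the model misses entirely pulls the score down as much as a common one would.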

Original abstract

Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models, and that can distinguish benign sensitive text from genuinely covert harmful content. In this paper, we introduce Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. We also release edge variants with fewer than 100M parameters dedicated to binary safe/unsafe categorization. The models are trained on a three-level taxonomy containing 996 categories across 16 top-level labels, 126 mid-level labels, and 854 leaf labels. Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. We also open-sourced an evaluation harness that supports GLiClass and GLiNER2 backends as well as decoder-based models, and covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. Across an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems -- including both GLiNER2-based and generative guardrail models -- Opir variants are competitive with or ahead of the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Opir gives smaller GLiClass-based guardrails that match or beat larger baselines on most tested safety tasks, plus an open harness, but the gains are incremental.

The main takeaway is that Opir supplies a family of encoder models for multi-task safety work that run lighter than typical guardrails while staying competitive on the benchmarks they report. They extend an existing architecture to a 996-category three-level taxonomy, train on taxonomy prompts plus adversarial negatives and Aegis2/WildGuard subsets, and include edge versions under 100M parameters.

The concrete parts stand out. They describe the data mix explicitly, run head-to-head numbers against eight other systems across 12 safety-classification tasks and 17 category tasks, and release an evaluation harness that supports GLiClass, GLiNER2, and decoder models. That harness and the smaller footprint are the parts that could actually get used in production settings.

The soft spots are limited. Performance is framed as competitive on the majority of datasets rather than dominant across the board, so the real selling point is size and multi-task coverage rather than large accuracy lifts. The fine taxonomy is ambitious but lacks visible ablations showing whether the leaf-level granularity drives results or mainly adds labeling overhead. Claims stay scoped to the harness, which avoids overreach.

This is aimed at people shipping real-time LLM filters who need lower inference cost than full generative guardrails. The open components and explicit training details give it enough substance for review. I would send it to referees.

Referee Report

0 major / 3 minor

Summary. The paper introduces Opir, a family of encoder-based guardrail models built on the GLiClass architecture for multi-task safety classification. This includes binary safe/unsafe detection, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt/response categorization into a three-level taxonomy (996 categories). Models are trained on taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign examples, generated responses, multilingual data, and subsets from Aegis2 and WildGuard. Edge variants (<100M parameters) target binary classification. The authors also release an open evaluation harness supporting GLiClass/GLiNER2 and decoder models across binary safety, multi-label, toxicity, jailbreak, prompt/response safety, and subcategory tasks. The central claim is that Opir variants are competitive with or ahead of the strongest open-weight baselines on the majority of 12 safety-classification tasks and 17 category tasks while using a substantially smaller deployment footprint.

Significance. If the reported results hold, the work is significant for enabling efficient, real-time safety filtering in LLM applications at lower computational cost than large guardrail models. The multi-task and zero-shot design, combined with the detailed taxonomy, supports practical deployment. The open-sourced evaluation harness is a clear strength for reproducibility and standardized benchmarking in the field.

minor comments (3)

[Abstract] The abstract asserts competitive performance across tasks but does not report any quantitative metrics (e.g., F1, accuracy) or specific baseline comparisons; adding 1-2 key headline numbers would make the central claim immediately verifiable.
The manuscript would benefit from a dedicated limitations or error-analysis subsection (e.g., failure modes on particular taxonomy leaves or multilingual cases) to complement the benchmark results.
Ensure the released evaluation harness repository link and exact version/commit are stated explicitly in the main text and reproducibility statement.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the work's significance for efficient safety filtering, and recommendation of minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity identified

The manuscript describes an empirical ML contribution: encoder-based guardrail models trained on an explicitly enumerated data mixture (taxonomy-grounded prompts, adversarial negatives, Aegis2/WildGuard subsets, etc.) and evaluated on 12 safety-classification plus 17 category tasks against eight external baselines. No derivation chain, equations, fitted-parameter predictions, or self-citation load-bearing steps appear. All performance claims are scoped to the stated open evaluation harness and rest on direct comparisons to public benchmarks and prior systems, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical training success; many training hyperparameters and data selection choices function as implicit free parameters not detailed in the abstract. No new physical entities are postulated.

free parameters (2)

Edge variant parameter count = <100M
Selected to achieve efficiency targets while supporting binary classification performance.
Taxonomy granularity = 16 top-level, 126 mid-level, 854 leaf
Defined to structure the 996 safety categories for training.
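The level counts above are easy to sanity-check, and a nested mapping is one natural way to hold a three-level taxonomy. The sketch below is illustrative only: `count_levels` and the toy taxonomy are hypothetical, and the placeholder leaf and subcategory names are not taken from the paper's label set.

```python
# Level counts as stated in the abstract.
LEVEL_COUNTS = {"top": 16, "mid": 126, "leaf": 854}
TOTAL_CATEGORIES = sum(LEVEL_COUNTS.values())  # 16 + 126 + 854


def count_levels(taxonomy):
    """Count nodes per level in a {top: {mid: [leaf, ...]}} mapping."""
    top = len(taxonomy)
    mid = sum(len(subs) for subs in taxonomy.values())
    leaf = sum(len(leaves) for subs in taxonomy.values() for leaves in subs.values())
    return top, mid, leaf


# Toy taxonomy with placeholder names (hypothetical, not the real labels).
toy_taxonomy = {
    "toxicity": {"harassment_and_abuse": ["leaf_a", "leaf_b"]},
    "safe_and_benign": {"placeholder_subcategory": ["leaf_c"]},
}

print(TOTAL_CATEGORIES)
print(count_levels(toy_taxonomy))
```

The 996 figure therefore counts nodes at all three levels together, not leaf labels alone.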

axioms (2)

Domain assumption: GLiClass encoder architecture supports effective multi-task adaptation for safety classification.
Invoked as the base for all Opir variants without further justification in the abstract.
Domain assumption: The described data sources produce generalizable safety classifiers.
Assumed when combining taxonomy examples, hard negatives, and external dataset subsets.


[1] [1]

Are you still on track!? catching LLM task drift with activations.arXiv preprint arXiv:2406.00799,

Sahar Abdelnabi, Aideen Fay, Giovanni Cherubin, Ahmed Salem, Mario Fritz, and Andrew Paverd. Are you still on track!? catching LLM task drift with activations.arXiv preprint arXiv:2406.00799,

work page arXiv

[2] [2]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

NuNER: Entity recognition encoder pre-training via LLM-annotated data.arXiv preprint arXiv:2402.15343,

Sergei Bogdanov, Alexandre Constantin, Timothée Bernard, Benoit Crabbé, and Etienne Bernard. NuNER: Entity recognition encoder pre-training via LLM-annotated data.arXiv preprint arXiv:2402.15343,

work page arXiv

[4] [4]

GLiREL: Generalist lightweight model for zero-shot relation extraction.arXiv preprint arXiv:2501.03172,

Jack Boylan et al. GLiREL: Generalist lightweight model for zero-shot relation extraction.arXiv preprint arXiv:2501.03172,

work page arXiv

[5] [5]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An open robustness benchmark for jailbreaking large language models.arXiv preprint arXiv:2404.01318, 2024a. Patrick Chao, Alexander Ro...

work page internal anchor Pith review Pith/arXiv arXiv 2003

[6] [6]

URL:https://arxiv.org/abs/2405.20947, doi:10.48550/arXiv.2405.20947, arXiv:2405.20947

Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. OR-Bench: An over-refusal bench- mark for large language models.arXiv preprint arXiv:2405.20947,

work page arXiv

[7]

RTP-LX: Can LLMs evaluate toxicity in multilingual scenarios?

Adrian de Wynter, Ishaan Watts, Tua Altintoprak, Chiao-Wen Wang, Lena Stevens, et al. RTP-LX: Can LLMs evaluate toxicity in multilingual scenarios? arXiv preprint arXiv:2404.14397.

[8]

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate attacks and defenses for LLM agents. arXiv preprint arXiv:2406.13352.

[9]

Multilingual jailbreak challenges in large language models

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474.

[10]

garak: A framework for security probing large language models

Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. garak: A framework for security probing large language models. arXiv preprint arXiv:2406.11036.

[11]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.

[12]

AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts

Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien. AEGIS: Online adaptive AI content safety moderation with ensemble of LLM experts. arXiv preprint arXiv:2404.05993.

[13]

Aegis2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails

Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. Aegis2.0: A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. arXiv preprint arXiv:2501.09004.

[14]

The Llama 3 Herd of Models

Model card. https://huggingface.co/google/shieldgemma-2-4b-it. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[15]

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. arXiv preprint arXiv:2406.18495.

[16]

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654.

[17]

DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

Pengcheng He, Jianfeng Gao, and Weizhu Chen. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543.

[18]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

IBM Granite Team. Granite 3.0 language models, 2024. https://www.ibm.com/granite. Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.

[19]

PAN 2012: Sexual predator identification task

Giacomo Inches and Fabio Crestani. PAN 2012: Sexual predator identification task, 2012. Working Notes Papers of the CLEF 2012 Evaluation Labs.

[20]

PolyglotToxicityPrompts: Multilingual evaluation of neural toxic degeneration in large language models

Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, and Maarten Sap. PolyglotToxicityPrompts: Multilingual evaluation of neural toxic degeneration in large language models. arXiv preprint arXiv:2405.09373.

[21]

Ettin: A compact encoder family for edge deployment

JHU-CLSP. Ettin: A compact encoder family for edge deployment, 2025a. Model card. https://huggingface.co/jhu-clsp/ettin-encoder-32m. JHU-CLSP. mmBERT: Multilingual compact encoders for edge deployment, 2025b. Model card. https://huggingface.co/jhu-clsp/mmBERT-small. Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Ruiyang Sun, Yizhou Wa...

[22]

PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference

Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, and Yaodong Yang. PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference. arXiv preprint arXiv:2406.15513.

[23]

PolyGuard: A multilingual safety moderation tool for 17 languages

Priyanshu Kumar, Devansh Jain, Akhila Yerukola, Liwei Jiang, Himanshu Beniwal, Thomas Hartvigsen, and Maarten Sap. PolyGuard: A multilingual safety moderation tool for 17 languages. arXiv preprint arXiv:2504.04377.

[24]

SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models

Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv preprint arXiv:2402.05044.

[25]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.

[26]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249.

[27]

Tree of attacks: Jailbreaking black-box LLMs automatically

Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. arXiv preprint arXiv:2312.02119.

[28]

URL https://arxiv.org/abs/2605.05277. NVIDIA. Nemotron-content-safety-reasoning-4b,

[29]

OWASP top 10 for LLM applications 2025

Model card. https://huggingface.co/nvidia/Nemotron-Content-Safety-Reasoning-4B. OWASP Foundation. OWASP top 10 for LLM applications 2025, 2025. https://owasp.org/www-project-top-10-for-large-language-model-applications/. Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik M...

[30]

Red Teaming Language Models with Language Models

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models. arXiv preprint arXiv:2202.03286.

[31]

“Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models

arXiv preprint arXiv:2512.20293. Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. “Do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825.

[32]

GLiNER multi-task: Generalist lightweight model for various information extraction tasks

Ihor Stepanov and Mykhailo Shtopko. GLiNER multi-task: Generalist lightweight model for various information extraction tasks. arXiv preprint arXiv:2406.12925.

[33]

GLiClass: Generalist lightweight model for sequence classification tasks

Ihor Stepanov, Mykhailo Shtopko, Dmytro Vodianytskyi, Oleksandr Lukashov, Alexander Yavorskyi, and Mykyta Yaroshenko. GLiClass: Generalist lightweight model for sequence classification tasks. arXiv preprint arXiv:2508.07662.

[34]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[35]

Efficient few-shot learning without prompts

Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, and Oren Pereg. Efficient few-shot learning without prompts. arXiv preprint arXiv:2209.11055.

[36]

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models

Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. Replacing judges with juries: Evaluating LLM generations with a panel of diverse models. arXiv preprint arXiv:2404.18796.

[37]

SimpleSafetyTests: A test suite for identifying critical safety risks in large language models

Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, and Paul Röttger. SimpleSafetyTests: A test suite for identifying critical safety risks in large language models. arXiv preprint arXiv:2311.08370.

[38]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. arXiv preprint arXiv:2404.13208.

[39]

Sorry-Bench: Systematically evaluating large language model safety refusal behaviors

Tinghao Wang, Shasha Xie, Jingyi Mu, Vishal Asnani, et al. Sorry-Bench: Systematically evaluating large language model safety refusal behaviors. arXiv preprint arXiv:2406.14598, 2024a. Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael R. Lyu. All languages matter: On the multilingual safety of large language mo...

[40]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Blog post. https://simonwillison.net/2022/Sep/12/prompt-injection/. Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.

[41]

Qwen3 Technical Report

An Yang, Baosong Yang, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[42]

GLiNER: Generalist model for named entity recognition using bidirectional transformer

Urchade Zaratiana, Nadi Tomeh, Pierre Holat, and Thierry Charnois. GLiNER: Generalist model for named entity recognition using bidirectional transformer. arXiv preprint arXiv:2311.08526.

[43]

GLiGuard: Schema-Conditioned Classification for LLM Safeguard

URL https://arxiv.org/abs/2605.07982. Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harvey, Karthik Chitre, Jeremy Brunner, Steven Dean, and Andrew Wang. ShieldGemma: Generative AI content moderation based on Gemma. arXiv preprint arXiv:2407.21772.

[44]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.

[45]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

A Taxonomy Detail

This appendix lists the Level 2 subcategories and representative Level 3 leaf labels under each Level 1 category. Counts in the “Leaves” column give the number of Level 3 labels in the subcategory; the “Representative leaves” column shows a non-exhaustive sample.

toxicity

Subcategory Leaves Representative leaves

harassment_and_abuse ...