Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation

Han Jeon; Joseph Voyles; Matt Wood; Shiv Medler

arxiv: 2606.25782 · v1 · pith:QP5TDKJVnew · submitted 2026-06-24 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-25 20:53 UTC · model grok-4.3

keywords: encoder classifiers · LLM safety evaluation · adversarial robustness · harm detection · safety judges · ModernBERT · majority voting

The pith

Fine-tuned encoder classifiers match LLM judges in detecting harmful outputs without major performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether modern encoder models such as ModernBERT and Ettin can act as reliable safety judges for spotting harmful LLM responses in conversations. It benchmarks these encoders, after fine-tuning on majority-voted labels from LLM judges, against rule-based methods, fine-tuned LLM classifiers, and multiple LLM judge setups on open-source adversarial datasets. Performance is measured with F1, false negative rate, and precision-recall, with breakdowns by attack type including single-turn prompts, decomposition, escalation, and context manipulation. The central finding is that the encoders deliver comparable results at substantially lower cost and latency, supporting their use as practical alternatives for scaled safety evaluation.

Core claim

Fine-tuned ModernBERT-family encoders, trained on majority-vote labels from LLM judges such as StrongReject and ShieldGemma, achieve F1 scores, false negative rates, and precision-recall values comparable to those of the LLM judges themselves across adversarial datasets and attack techniques.

What carries the argument

Majority-voting label aggregation from multiple LLM judges to create training data and holdout sets for binary harm classification by fine-tuned encoder models.
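The aggregation step can be illustrated with a minimal sketch. It assumes each judge emits a binary vote (1 = harmful, 0 = safe); the tie-breaking rule, the conversation IDs, and the vote values are hypothetical, since the abstract does not specify how ties or judge disagreements are handled.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(votes: Sequence[int], tie_label: Optional[int] = 1) -> Optional[int]:
    """Aggregate binary judge votes (1 = harmful, 0 = safe) into one label.

    tie_label is an assumption: the abstract does not say how ties are broken.
    The default of 1 treats ties as harmful; pass None to leave ties unlabeled.
    """
    if not votes:
        raise ValueError("need at least one judge vote")
    counts = Counter(votes)
    harmful, safe = counts[1], counts[0]
    if harmful == safe:
        return tie_label
    return 1 if harmful > safe else 0


# Hypothetical votes from several judges on three conversations.
judge_votes = {
    "conv-001": [1, 1, 0, 1, 0],  # 3 of 5 judges say harmful
    "conv-002": [0, 0, 0, 1, 0],  # 1 of 5 judges says harmful
    "conv-003": [1, 0],           # tie between two judges
}
labels = {cid: majority_vote(v) for cid, v in judge_votes.items()}
```

Treating ties as harmful is one conservative choice; a pipeline could equally drop tied items from the training set, which is what `tie_label=None` would support.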

If this is right

Encoder classifiers enable lower-latency and lower-cost safety filtering at production scale.
Performance remains aligned with LLM judges across single-turn, decomposition, escalation, and context-manipulation attacks.
Organizations can substitute encoders for LLM judges in routine evaluation pipelines while retaining similar detection rates.
The approach supplies concrete metrics for deciding when encoders are sufficient versus when LLM judges remain necessary.
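A per-technique breakdown of the kind described above could be computed along these lines. This is a sketch under assumptions: the row layout, the technique names used as keys, and the sample rows are illustrative and are not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def fnr_by_technique(rows: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Compute the false negative rate per attack technique.

    rows holds (technique, reference_label, predicted_label) with 1 = harmful.
    Techniques with no harmful reference items are omitted, because their
    false negative rate is undefined.
    """
    missed = defaultdict(int)   # harmful items the classifier labeled safe
    harmful = defaultdict(int)  # all harmful items per technique
    for technique, reference, predicted in rows:
        if reference == 1:
            harmful[technique] += 1
            if predicted == 0:
                missed[technique] += 1
    return {t: missed[t] / harmful[t] for t in harmful}


# Illustrative rows only; not results from the paper.
sample_rows = [
    ("single", 1, 1),
    ("single", 1, 1),
    ("single", 0, 0),
    ("escalation", 1, 0),
    ("escalation", 1, 1),
    ("decomposition", 1, 0),
    ("context", 0, 1),
]
breakdown = fnr_by_technique(sample_rows)
```

Reporting the rate per technique rather than only in aggregate is what lets a reader see whether an encoder misses, say, escalation attacks more often than single-turn ones.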

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Production systems could adopt encoder judges for initial filtering and reserve LLM judges for edge cases to cut overall compute.
The same fine-tuning recipe might extend to other safety tasks such as detecting misinformation or bias.
Long-term deployment would benefit from periodic retraining on fresh conversation logs to track distribution shift.
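The first extension, encoder filtering with LLM escalation for edge cases, might look like the sketch below. Everything here is hypothetical: the threshold values, the `encoder_score` and `llm_judge` callables, and the stand-in scores are illustrative and do not come from the paper.

```python
from typing import Callable, Tuple


def cascade_judge(
    text: str,
    encoder_score: Callable[[str], float],
    llm_judge: Callable[[str], int],
    low: float = 0.2,
    high: float = 0.8,
) -> Tuple[int, str]:
    """Route a response through the encoder first, then the LLM judge if unsure.

    encoder_score returns an estimated probability that the text is harmful.
    Scores at or above `high` or at or below `low` are decided by the encoder;
    anything in between is escalated to the slower LLM judge.
    """
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low < high <= 1")
    score = encoder_score(text)
    if score >= high:
        return 1, "encoder"
    if score <= low:
        return 0, "encoder"
    return llm_judge(text), "llm_judge"


# Stand-ins for real models, used only to exercise the routing logic.
FAKE_SCORES = {
    "clear refusal": 0.03,
    "step-by-step exploit": 0.97,
    "ambiguous roleplay": 0.55,
}


def fake_encoder(text: str) -> float:
    return FAKE_SCORES[text]


def fake_llm_judge(text: str) -> int:
    return 1


routes = {t: cascade_judge(t, fake_encoder, fake_llm_judge) for t in FAKE_SCORES}
```

The width of the band between `low` and `high` sets the trade-off: a wider band sends more traffic to the LLM judge and costs more, while a narrower band relies more heavily on the encoder.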

Load-bearing premise

Majority votes among LLM judges supply a sufficiently accurate training signal and evaluation standard that will hold for real user conversations.

What would settle it

A substantial drop in encoder F1 or rise in false negatives when the same models are tested on a holdout set labeled by human annotators rather than LLM judges.
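As a sketch of that test, the same encoder predictions can be scored twice: once against judge-derived labels and once against human labels. The label vectors below are made up for illustration, and the F1 and false negative rate formulas are the standard binary-classification definitions with harmful as the positive class.

```python
from typing import Sequence, Tuple


def f1_and_fnr(reference: Sequence[int], predicted: Sequence[int]) -> Tuple[float, float]:
    """Return (F1, false negative rate) with 1 = harmful as the positive class."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fnr = fn / (tp + fn) if tp + fn else 0.0
    return f1, fnr


# Hypothetical labels for eight holdout conversations.
encoder_pred = [1, 1, 0, 0, 1, 0, 0, 1]
judge_labels = [1, 1, 0, 0, 1, 0, 1, 1]  # majority vote of LLM judges
human_labels = [1, 1, 1, 0, 1, 1, 1, 0]  # independent human annotation

f1_judge, fnr_judge = f1_and_fnr(judge_labels, encoder_pred)
f1_human, fnr_human = f1_and_fnr(human_labels, encoder_pred)
```

In this made-up example the encoder looks strong against the judge labels and noticeably weaker against the human labels, which is the pattern that would undercut the paper's claim if it appeared in real data.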

Original abstract

With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM outputs has generally relied on LLM-based judges, which can be effective but are often slow and expensive to deploy at scale. In this paper, we evaluate whether fine-tuned modern encoder classifiers from the ModernBERT family, including ModernBERT and Ettin, can reliably identify harmful LLM outputs in user-model conversations without substantial performance loss relative to LLM-based judges. We benchmark these encoder classifiers against rule-based prefix matching, fine-tuned LLM classifiers, and LLM judges using a range of judge-prompting strategies across open-source adversarial datasets. The LLM judges include evaluation methodologies from StrongReject, ShieldGemma, JailbreakBench, AILuminate, SorryBench, and a Claude-as-a-judge setup, as well as fine-tuned safety classifiers such as LlamaGuard 3 and LlamaGuard 4. The encoder classifiers are fine-tuned on judge-labeled data using a majority-voting label strategy and are then evaluated on a gold-standard holdout dataset to assess their performance relative to LLM judges. We report absolute performance using F1 score, false negative rate, and precision-recall metrics. We also break down results by attack technique, including single-turn prompting, decomposition, escalation, and context manipulation, to identify where encoder classifiers align with or diverge from LLM-based judges. Our findings provide guidance on when encoder classifiers can serve as cost- and latency-efficient alternatives to LLM-based safety evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Encoders fine-tuned on LLM-judge majority votes match those judges on the same data, which is useful for efficiency but measures agreement rather than independent detection of harm.

The main thing to know is that ModernBERT-family encoders, once trained on majority votes from judges like StrongReject and LlamaGuard, reach similar F1 and false-negative rates to the LLM judges themselves on the holdout sets. This supports the practical goal of cheaper, faster safety checks, but the results stay relative to the original judges.

The paper does a clear job running the side-by-side. It covers multiple attack categories (decomposition, escalation, context manipulation) and reports breakdowns plus standard metrics across public adversarial datasets. That controlled scope is a step up from scattered prior comparisons.

The soft spot is the label source. The encoders are trained and scored against the same LLM-vote process, so the parity shows faithful mimicry. The abstract labels the holdout "gold-standard" yet ties it to the judges, with no independent human annotation mentioned. This does not invalidate the efficiency angle, but it limits how far the "reliably identify harmful" claim travels.

The work is for teams that already use LLM judges and want to swap in lighter encoders for volume. A reader focused on deployment tradeoffs will find the numbers and attack-type splits directly usable.

It deserves peer review. The experimental plan is straightforward and the question is concrete; referees can push on the independence of the labels without the paper falling apart.

Referee Report

1 major / 2 minor

Summary. The paper claims that fine-tuned encoder classifiers from the ModernBERT family (including ModernBERT and Ettin) can identify harmful LLM outputs in user-model conversations with performance comparable to LLM-based judges (StrongReject, ShieldGemma, LlamaGuard, etc.), using majority-vote labels from those judges on open-source adversarial datasets. Encoders are fine-tuned on these labels and evaluated on a gold-standard holdout via F1, false-negative rate, and precision-recall, with breakdowns by attack type (single-turn, decomposition, escalation, context manipulation). The conclusion is that encoders offer cost- and latency-efficient alternatives to LLM judges without substantial performance loss.

Significance. If the central claim holds under independent validation, the work would establish practical, scalable encoder-based guardrails for LLM safety evaluation, directly addressing deployment constraints of LLM judges. The per-attack-type analysis would further guide when encoders suffice versus when LLM judges remain necessary.

major comments (1)

[Abstract / evaluation methodology] The pipeline trains and evaluates encoders exclusively against majority-vote labels from the same set of LLM judges (StrongReject, ShieldGemma, LlamaGuard 3/4, etc.) on both training and holdout data. This produces a closed loop in which reported F1/FNR metrics quantify mimicry of the LLM judges rather than independent detection of harm. No external criterion (human annotations on the holdout or comparison to a non-LLM-derived gold standard) is described that would break the loop, so parity with LLM judges does not establish that encoders reliably identify harmful outputs in absolute terms or generalize beyond the judges' own biases and noise.

minor comments (2)

[Methods] Clarify the exact data splits, prompting templates for each LLM judge, and any post-hoc filtering steps used to construct the majority-vote labels and holdout set.
[Experimental setup] Specify the precise ModernBERT and Ettin model variants, fine-tuning hyperparameters, and whether any encoder-specific adaptations (e.g., classification head) were required.

Simulated Authors' Rebuttal

1 response · 0 unresolved

Thank you for the constructive feedback on our evaluation methodology. We address the concern point by point below.

Referee: [Abstract / evaluation methodology] The pipeline trains and evaluates encoders exclusively against majority-vote labels from the same set of LLM judges (StrongReject, ShieldGemma, LlamaGuard 3/4, etc.) on both training and holdout data. This produces a closed loop in which reported F1/FNR metrics quantify mimicry of the LLM judges rather than independent detection of harm. No external criterion (human annotations on the holdout or comparison to a non-LLM-derived gold standard) is described that would break the loop, so parity with LLM judges does not establish that encoders reliably identify harmful outputs in absolute terms or generalize beyond the judges' own biases and noise.

Authors: We agree that the reported metrics measure agreement with the LLM judges' majority-vote labels on both training and holdout sets, rather than independent human-annotated ground truth. The manuscript's stated goal is a systematic comparison to determine whether encoders can serve as practical, lower-cost alternatives to LLM judges within current safety evaluation pipelines; LLM judges are the prevailing standard for labeling adversarial datasets, so demonstrating comparable F1, false-negative rate, and precision-recall on the same labels directly answers whether encoders can substitute for them without substantial performance loss. The paper consistently frames results as relative to LLM judges and does not claim absolute harm detection independent of the labeling source. Human annotations on the holdout would be a valuable extension but fall outside the scope of this comparative study.

Revision: no

Circularity Check

0 steps flagged

No significant circularity

The paper performs an empirical comparison of fine-tuned ModernBERT-family encoders against external LLM judges (StrongReject, ShieldGemma, LlamaGuard, etc.) and rule-based baselines on public adversarial datasets. Labels are generated via majority vote from those external judges, and performance (F1, FNR, precision-recall) is measured on a holdout drawn from the same external labeling process. This directly quantifies relative agreement with the external judges rather than claiming or measuring absolute harm detection. No equations, derivations, or self-citations reduce any reported metric back to quantities defined by the authors' own fitted parameters. The evaluation is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; the paper implicitly relies on standard supervised classification assumptions and on the reliability of LLM-generated labels.

axioms (1)

Domain assumption: Majority vote across multiple LLM judges yields sufficiently accurate labels for training and gold-standard evaluation.
Basis: The abstract states that encoders are fine-tuned on judge-labeled data using majority voting and evaluated on a gold-standard holdout.

pith-pipeline@v0.9.1-grok · 5828 in / 1325 out tokens · 19210 ms · 2026-06-25T20:53:45.391476+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 8 linked inside Pith

[1]

arXiv:2412.13663 (2024)

Warner, B., et al.: Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. arXiv:2412.13663 (2024)

Pith/arXiv arXiv 2024
[2]

arXiv:2402.10260 (2024)

Souly, A., et al.: A StrongREJECT for Empty Jailbreaks. arXiv:2402.10260 (2024)

Pith/arXiv arXiv 2024
[3]

arXiv:2402.04249 (2024)

Mazeika, M., et al.: HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249 (2024)

Pith/arXiv arXiv 2024
[4]

https://ai.meta.com/research/publications/llama-guard-3/ (2024)

Meta AI: Llama Guard 3. https://ai.meta.com/research/publications/llama-guard-3/ (2024)

2024
[5]

arXiv:2504.00441 (2025)

Kumar, A., et al.: No Free Lunch with Guardrails. arXiv:2504.00441 (2025)

arXiv 2025
[6]

arXiv:2411.14398 (2024)

Zheng, Y., et al.: Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings. arXiv:2411.14398 (2024)

arXiv 2024
[7]

arXiv:2506.11938 (2025)

Soares, F., et al.: Improving Large Language Model Safety with Contrastive Representation Learning. arXiv:2506.11938 (2025)

arXiv 2025
[8]

arXiv:2406.04313 (2024)

Zou, A., et al.: Improving Alignment and Robustness with Circuit Breakers. arXiv:2406.04313 (2024)

arXiv 2024
[9]

arXiv preprint arXiv:2311.04205 (2023)

Deng, Y., Zhang, W., Chen, Z., Gu, Q.: Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves. arXiv preprint arXiv:2311.04205 (2023)

arXiv 2023
[10]

In: International Conference on Learning Representations (ICLR)

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-Consistency Improves Chain of Thought Reasoning in Language Models. In: International Conference on Learning Representations (ICLR). arXiv:2203.11171 (2023a)

Pith/arXiv arXiv
[11]

In: NeurIPS, Datasets and Benchmarks Track

Wang, B., et al.: DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In: NeurIPS, Datasets and Benchmarks Track. arXiv:2306.11698 (2023b)

arXiv
[12]

In: NeurIPS, Datasets and Benchmarks Track

Chao, P., et al.: JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. In: NeurIPS, Datasets and Benchmarks Track. arXiv:2404.01318 (2024)

Pith/arXiv arXiv 2024
[13]

In: NeurIPS, Datasets and Benchmarks Track

Zheng, L., et al.: Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In: NeurIPS, Datasets and Benchmarks Track. arXiv:2306.05685 (2023)

Pith/arXiv arXiv 2023
[14]

In: International Conference on Learning Representations (ICLR)

Weller, O., Ricci, K., Marone, M., Chaffin, A., Lawrie, D., Van Durme, B.: Seq vs Seq: An Open Suite of Paired Encoders and Decoders. In: International Conference on Learning Representations (ICLR). arXiv:2507.11412 (2026)

arXiv 2026
[15]

arXiv preprint arXiv:2603.06594 (2026)

Schwinn, L., Ladenburger, M., Beyer, T., Mofakhami, M., Gidel, G., Günnemann, S.: A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness. arXiv preprint arXiv:2603.06594 (2026)

arXiv 2026
[16]

arXiv:2406.14598 (2025)

Xie, Z., Zhang, Z., Neubig, G.: SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal. arXiv:2406.14598 (2025)

arXiv 2025
[17]

arXiv preprint arXiv:2310.1738 (2023)

Lin, Z., Wen, Q., Sun, L.: ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversations. arXiv preprint arXiv:2310.1738 (2023)

arXiv 2023
[18]

arXiv preprint arXiv:2307.04657 (2023)

Ji, J., Sun, Y., Qiu, X.: BeaverTails: Towards Improved Safety Alignment of LLM via a Human Preference Dataset. arXiv preprint arXiv:2307.04657 (2023)

arXiv 2023
[19]

arXiv preprint (2023)

Röttger, P., Vidgen, B., Hovy, D.: XSTest: A Test Suite for Identifying Safety Failures in Large Language Models. arXiv preprint (2023)

2023
[20]

arXiv preprint (2023)

Sun, H., Yang, Y., Yang, D.: SafetyBench: Evaluating Safety of Large Language Models. arXiv preprint (2023)

2023
[21]

arXiv preprint arXiv:2406.18510 (2024)

Zhang, Y., Li, Z., Liang, P.: WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models. arXiv preprint arXiv:2406.18510 (2024)

arXiv 2024
[22]

arXiv preprint arXiv:2308.13387 (2023c)

Wang, A., Hendrycks, D., Burns, C.: Do-Not-Answer: A Dataset for Evaluating Safety Alignment in Language Models. arXiv preprint arXiv:2308.13387 (2023c)

arXiv
[23]

arXiv preprint arXiv:2404.01833 (2024)

Microsoft Security Research: Prompt Crescendo: Multi-Turn Jailbreak Attacks on Large Language Models. arXiv preprint arXiv:2404.01833 (2024)

Pith/arXiv arXiv 2024
[24]

arXiv preprint arXiv:2402.17262 (2024)

Zhou, Z., Xiang, J., Chen, H., Liu, Q., Li, Z., Su, S.: Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue. arXiv preprint arXiv:2402.17262 (2024)

arXiv 2024
[25]

In: USENIX Security Symposium (2024)

Yu, Z., et al.: Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of LLMs. In: USENIX Security Symposium (2024)

2024
[26]

arXiv preprint arXiv:2307.15043 (2023)

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., Fredrikson, M.: Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043 (2023)

Pith/arXiv arXiv 2023
[27]

arXiv preprint (2025)

Zhang, B., Cheng, Y., Shakeri, S., Wang, X., Ma, M., Firat, O.: Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model. arXiv preprint (2025)

2025
[28]

Technical report / blog (2024)

PangolinGuard: PangolinGuard: Prompt Injection Detection with Modern Encoder Models. Technical report / blog (2024)

2024
[29]

Meta AI: Llama Guard 4. (2025)

2025
[30]

arXiv preprint (2024)

Cui, X., et al.: OR-Bench: A Benchmark for Over-Refusal in Large Language Models. arXiv preprint (2024)

2024
[31]

Hugging Face Dataset Repository (2022)

Anthropic: Helpful and Harmless Reinforcement Learning from Human Feedback (HH-RLHF): Red Team Attempts Dataset. Hugging Face Dataset Repository (2022). Available at: https://huggingface.co/datasets/Anthropic/hh-rlhf

2022
[32]

Microsoft Azure: Azure Machine Learning Pricing. Microsoft Azure Documentation (2026). Available at: https://azure.microsoft.com/en-us/pricing/details/machine-learning/

A Evaluation Results

F1 and FNR by Technique (columns: Judge; F1 for All, Ctx, Dec, Esc, Single; FNR for All, Ctx, Dec, Esc, Single)

AILuminate: F1 0.767 (All), 0.795 (Ctx), 0.764 (Dec), 0.729 (Esc), 0.729 (Single); FNR 0.208 (All) ...

2026
