Do Encoders Suffice? A Systematic Comparison of Encoder and Decoder Safety Judges for LLM Adversarial Evaluation
Pith reviewed 2026-06-25 20:53 UTC · model grok-4.3
The pith
Fine-tuned encoder classifiers match LLM judges in detecting harmful outputs without major performance loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuned ModernBERT-family encoders trained via majority voting on labels from LLM judges such as StrongReject and ShieldGemma achieve F1 scores, false negative rates, and precision-recall values comparable to those of the LLM judges themselves across adversarial datasets and attack techniques.
What carries the argument
Majority-voting label aggregation from multiple LLM judges to create training data and holdout sets for binary harm classification by fine-tuned encoder models.
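A minimal sketch of the aggregation step, assuming each judge emits a binary label (1 = harmful, 0 = safe). The function name `majority_vote`, the judge keys, and the tie-breaking rule are illustrative assumptions; the source does not say how ties are resolved.

```python
def majority_vote(judge_labels, tie_label=1):
    """Collapse binary judge labels (1 = harmful, 0 = safe) into one label.

    tie_label is an assumption: ties default to "harmful" here, which
    favors a lower false negative rate. The paper's rule is not stated.
    """
    if not judge_labels:
        raise ValueError("at least one judge label is required")
    harmful = sum(1 for label in judge_labels if label == 1)
    safe = len(judge_labels) - harmful
    if harmful == safe:
        return tie_label
    return 1 if harmful > safe else 0


# Hypothetical verdicts for a single conversation.
verdicts = {"strongreject": 1, "shieldgemma": 0, "llamaguard3": 1}
training_label = majority_vote(list(verdicts.values()))
print(training_label)
```

Using an odd number of judges avoids ties altogether; with an even number, the choice of `tie_label` becomes part of the labeling policy and would need to be reported.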
If this is right
- Encoder classifiers enable lower-latency and lower-cost safety filtering at production scale.
- Performance remains aligned with LLM judges across single-turn, decomposition, escalation, and context-manipulation attacks.
- Organizations can substitute encoders for LLM judges in routine evaluation pipelines while retaining similar detection rates.
- The approach supplies concrete metrics for deciding when encoders are sufficient versus when LLM judges remain necessary.
Where Pith is reading between the lines
- Production systems could adopt encoder judges for initial filtering and reserve LLM judges for edge cases to cut overall compute.
- The same fine-tuning recipe might extend to other safety tasks such as detecting misinformation or bias.
- Long-term deployment would benefit from periodic retraining on fresh conversation logs to track distribution shift.
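The two-stage idea in the first bullet above could be sketched as a confidence-banded cascade. This is an illustration under stated assumptions, not something the paper describes: `cascade_judge`, the `low` and `high` thresholds, and the stub scorers are all hypothetical.

```python
from typing import Callable, Tuple


def cascade_judge(
    text: str,
    encoder_score: Callable[[str], float],
    llm_judge: Callable[[str], bool],
    low: float = 0.2,
    high: float = 0.8,
) -> Tuple[bool, str]:
    """Return (is_harmful, decided_by) for one model response."""
    p_harmful = encoder_score(text)
    if p_harmful >= high:
        return True, "encoder"
    if p_harmful <= low:
        return False, "encoder"
    # Uncertain band: defer to the slower, costlier LLM judge.
    return llm_judge(text), "llm"


# Stub scorers standing in for real models (illustration only).
def fake_encoder(text: str) -> float:
    return {"clear refusal": 0.05, "borderline": 0.5}.get(text, 0.95)


def fake_llm(text: str) -> bool:
    return True


print(cascade_judge("clear refusal", fake_encoder, fake_llm))
print(cascade_judge("borderline", fake_encoder, fake_llm))
```

The width of the uncertain band sets the cost trade-off: a wider band sends more traffic to the LLM judge, a narrower one relies more heavily on the encoder.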
Load-bearing premise
Majority votes among LLM judges supply a sufficiently accurate training signal and evaluation standard that will hold for real user conversations.
What would settle it
A substantial drop in encoder F1 or rise in false negatives when the same models are tested on a holdout set labeled by human annotators rather than LLM judges.
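The check described above amounts to scoring the same encoder predictions against two label sources and comparing the results. A minimal sketch, assuming binary labels with 1 = harmful as the positive class; the helper name `f1_and_fnr` and all data below are invented for illustration and are not results from the paper.

```python
def f1_and_fnr(y_true, y_pred):
    """Return (F1, false negative rate) with 1 = harmful as the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    return f1, fnr


# Toy data, invented for illustration; not results from the paper.
encoder_preds = [1, 1, 0, 0, 1, 0]
judge_labels = [1, 1, 0, 0, 1, 0]  # majority-vote labels from LLM judges
human_labels = [1, 1, 1, 0, 0, 0]  # hypothetical human annotations

f1_judge, fnr_judge = f1_and_fnr(judge_labels, encoder_preds)
f1_human, fnr_human = f1_and_fnr(human_labels, encoder_preds)
print(f"vs judges: F1={f1_judge:.2f} FNR={fnr_judge:.2f}")
print(f"vs humans: F1={f1_human:.2f} FNR={fnr_human:.2f}")
```

In this toy case the encoder agrees perfectly with the judge labels yet misses one harmful response under the human labels, which is the kind of gap the test is meant to expose.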
Original abstract
With the widespread adoption of large language models (LLMs) in chatbots and everyday applications, companies increasingly need guardrails that are effective while remaining low-cost and low-latency. Safety evaluation of LLM outputs has generally relied on LLM-based judges, which can be effective but are often slow and expensive to deploy at scale. In this paper, we evaluate whether fine-tuned modern encoder classifiers from the ModernBERT family, including ModernBERT and Ettin, can reliably identify harmful LLM outputs in user-model conversations without substantial performance loss relative to LLM-based judges. We benchmark these encoder classifiers against rule-based prefix matching, fine-tuned LLM classifiers, and LLM judges using a range of judge-prompting strategies across open-source adversarial datasets. The LLM judges include evaluation methodologies from StrongReject, ShieldGemma, JailbreakBench, AILuminate, SorryBench, and a Claude-as-a-judge setup, as well as fine-tuned safety classifiers such as LlamaGuard 3 and LlamaGuard 4. The encoder classifiers are fine-tuned on judge-labeled data using a majority-voting label strategy and are then evaluated on a gold-standard holdout dataset to assess their performance relative to LLM judges. We report absolute performance using F1 score, false negative rate, and precision-recall metrics. We also break down results by attack technique, including single-turn prompting, decomposition, escalation, and context manipulation, to identify where encoder classifiers align with or diverge from LLM-based judges. Our findings provide guidance on when encoder classifiers can serve as cost- and latency-efficient alternatives to LLM-based safety evaluation.
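For context on the weakest baseline named in the abstract, one common form of rule-based prefix matching treats a response as harmful unless it opens with a refusal phrase. The sketch below assumes that form; the prefix list and function names are hypothetical, and the paper's exact rule is not given in the abstract.

```python
# Hypothetical refusal prefixes; real baselines use their own lists.
REFUSAL_PREFIXES = (
    "i'm sorry",
    "i am sorry",
    "i cannot",
    "i can't",
    "as an ai",
)


def is_refusal(response: str) -> bool:
    """True if the response opens with a known refusal phrase."""
    # Normalize case, surrounding whitespace, and curly apostrophes.
    normalized = response.strip().lower().replace("\u2019", "'")
    return normalized.startswith(REFUSAL_PREFIXES)


def prefix_match_harmful(response: str) -> bool:
    """Flag a response as harmful when it does not open with a refusal."""
    return not is_refusal(response)


print(prefix_match_harmful("I'm sorry, but I can't help with that."))
print(prefix_match_harmful("Sure, here is an overview."))
```

A rule of this kind flags any non-refusing answer, including harmless ones, which is one reason learned classifiers and LLM judges are compared against it.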
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that fine-tuned encoder classifiers from the ModernBERT family (including ModernBERT and Ettin) can identify harmful LLM outputs in user-model conversations with performance comparable to LLM-based judges (StrongReject, ShieldGemma, LlamaGuard, etc.), using majority-vote labels from those judges on open-source adversarial datasets. Encoders are fine-tuned on these labels and evaluated on a gold-standard holdout via F1, false-negative rate, and precision-recall, with breakdowns by attack type (single-turn, decomposition, escalation, context manipulation). The conclusion is that encoders offer cost- and latency-efficient alternatives to LLM judges without substantial performance loss.
Significance. If the central claim holds under independent validation, the work would establish practical, scalable encoder-based guardrails for LLM safety evaluation, directly addressing deployment constraints of LLM judges. The per-attack-type analysis would further guide when encoders suffice versus when LLM judges remain necessary.
major comments (1)
- [Abstract / evaluation methodology] The pipeline trains and evaluates encoders exclusively against majority-vote labels from the same set of LLM judges (StrongReject, ShieldGemma, LlamaGuard 3/4, etc.) on both training and holdout data. This produces a closed loop in which reported F1/FNR metrics quantify mimicry of the LLM judges rather than independent detection of harm. No external criterion (human annotations on the holdout or comparison to a non-LLM-derived gold standard) is described that would break the loop, so parity with LLM judges does not establish that encoders reliably identify harmful outputs in absolute terms or generalize beyond the judges' own biases and noise.
minor comments (2)
- [Methods] Clarify the exact data splits, prompting templates for each LLM judge, and any post-hoc filtering steps used to construct the majority-vote labels and holdout set.
- [Experimental setup] Specify the precise ModernBERT and Ettin model variants, fine-tuning hyperparameters, and whether any encoder-specific adaptations (e.g., classification head) were required.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our evaluation methodology. We address the concern point by point below.
Point-by-point responses
- Referee: [Abstract / evaluation methodology] The pipeline trains and evaluates encoders exclusively against majority-vote labels from the same set of LLM judges (StrongReject, ShieldGemma, LlamaGuard 3/4, etc.) on both training and holdout data. This produces a closed loop in which reported F1/FNR metrics quantify mimicry of the LLM judges rather than independent detection of harm. No external criterion (human annotations on the holdout or comparison to a non-LLM-derived gold standard) is described that would break the loop, so parity with LLM judges does not establish that encoders reliably identify harmful outputs in absolute terms or generalize beyond the judges' own biases and noise.
Authors: We agree that the reported metrics measure agreement with the LLM judges' majority-vote labels on both training and holdout sets, rather than independent human-annotated ground truth. The manuscript's stated goal is a systematic comparison to determine whether encoders can serve as practical, lower-cost alternatives to LLM judges within current safety evaluation pipelines; LLM judges are the prevailing standard for labeling adversarial datasets, so demonstrating comparable F1, false-negative rate, and precision-recall on the same labels directly answers whether encoders can substitute for them without substantial performance loss. The paper consistently frames results as relative to LLM judges and does not claim absolute harm detection independent of the labeling source. Human annotations on the holdout would be a valuable extension but fall outside the scope of this comparative study.
Revision: no
Circularity Check
No significant circularity
Full rationale
The paper performs an empirical comparison of fine-tuned ModernBERT-family encoders against external LLM judges (StrongReject, ShieldGemma, LlamaGuard, etc.) and rule-based baselines on public adversarial datasets. Labels are generated via majority vote from those external judges, and performance (F1, FNR, precision-recall) is measured on a holdout drawn from the same external labeling process. This directly quantifies relative agreement with the external judges rather than claiming or measuring absolute harm detection. No equations, derivations, or self-citations reduce any reported metric back to quantities defined by the authors' own fitted parameters. The evaluation is self-contained against external benchmarks with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Majority vote across multiple LLM judges yields sufficiently accurate labels for training and gold-standard evaluation.
A Evaluation Results
F1 and FNR by Technique
Judge: AILuminate
F1: All 0.767, Ctx 0.795, Dec 0.764, Esc 0.729, Single 0.729
FNR: All 0.208 ...