Gate AI: LLM Security Benchmark Evaluation Methodology and Results

Marcus Sousa; Ryle Goehausen

arxiv: 2606.02959 · v1 · pith:EEQNJJ24new · submitted 2026-06-01 · 💻 cs.LG · cs.CR

Gate AI: LLM Security Benchmark Evaluation Methodology and Results

Ryle Goehausen , Marcus Sousa This is my paper

Pith reviewed 2026-06-28 15:03 UTC · model grok-4.3

classification 💻 cs.LG cs.CR

keywords LLM securityprompt injection detectionjailbreak detectionevaluation methodologycross-validationoperating pointfalse positive ratebenchmarking

0 comments

The pith

Evaluation harness for LLM detectors selects one global operating point and applies it uniformly across 16 benchmarks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Published evaluations of prompt-injection and jailbreak detectors for large language models often tune thresholds separately on each dataset and leave the chosen operating points undisclosed. The paper introduces an evaluation harness that runs 5-fold cross-validation over a pooled collection of 16 public benchmarks containing 12,111 samples. On the held-out folds it selects a single global threshold that maximizes F1 subject to a false-positive rate of at most 1 percent and then applies that same threshold to every dataset. Additional diagnostic passes, including a group-fold leakage check that clusters near-duplicates, test whether the chosen point generalizes rather than overfits to individual benchmarks.

Core claim

The harness scores any detector across the 16 benchmarks with 5-fold cross-validation, selects one global operating point on the held-out folds by maximizing F1 while constraining FPR to ≤1 percent, and applies that operating point uniformly so that per-dataset scores reflect a consistent threshold rather than per-benchmark optimization; a parallel group-fold pass over prompt-ID and MinHash clusters provides a leakage diagnostic.

What carries the argument

The 5-fold cross-validation procedure that selects a single global max-F1 operating point at FPR ≤1 percent on held-out folds and applies it uniformly, with a parallel StratifiedGroupKFold leakage diagnostic over composite near-duplicate keys.

If this is right

Per-dataset results now reflect performance under one fixed operating point instead of benchmark-specific tuning.
Head-to-head comparisons with external detectors are performed after re-tuning the harness threshold to match each competitor's published false-positive rate.
A battery of diagnostics (leave-one-dataset-out, random-label control, length-bias correlation, cross-source duplicate detection) quantifies generalization beyond the main cross-validation.
The group-fold leakage diagnostic runs in parallel with the row-stratified pass to surface hidden prompt overlap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use of the harness would make it harder for published detector papers to report inflated numbers obtained through hidden per-dataset tuning.
The approach could be extended by adding private or adversarially generated benchmarks to test whether the global threshold still holds outside the current public set.
If the near-duplicate clustering at Jaccard 0.8 misses semantically equivalent prompts that differ in wording, the leakage diagnostic may understate the risk of train-test contamination.

Load-bearing premise

The 16 public benchmarks together with the chosen near-duplicate clustering supply a representative and leakage-controlled distribution on which one global threshold remains meaningful.

What would settle it

If the globally selected threshold produces markedly lower F1 on held-out data than the best per-dataset tuned thresholds, or if the generalization diagnostics (leave-one-dataset-out, paraphrase invariance, threshold transferability) consistently fail their quantitative thresholds, the claim that the harness yields more reliable comparisons would be falsified.

Figures

Figures reproduced from arXiv: 2606.02959 by Marcus Sousa, Ryle Goehausen.

**Figure 2.** Figure 2: Adversarial-validation AUC per fold (target [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Per-fold train vs OOF F1. 2.4 Per-chunk to per-prompt aggregation When the pipeline chunks long inputs, training operates per chunk and aggregates at metric time. Let chunks of parent prompt p be indexed by c ∈ C(p). Continuous probabilities are max-pooled, and the hard label is the single global threshold θ˜ = mediank θ ∗ k (from §2.3) applied to the max-pooled probability: pˆp = max c∈C(p) pˆp,c, yˆp = … view at source ↗

**Figure 6.** Figure 6: Per-source θ ∗ s at matched FPR vs the global θ op . Each dot is a source. 2.7 Matched-FPR per-dataset comparisons Comparing detectors at the same global threshold can mislead when competitors publish their numbers at very different operating points. To remove that confound, every perdataset competitor entry that publishes both a primary metric and an FPR is also surfaced with our value re-evaluated at … view at source ↗

**Figure 8.** Figure 8: Random-label control: predicted hard labels scored [PITH_FULL_IMAGE:figures/full_fig_p005_8.png] view at source ↗

**Figure 7.** Figure 7: Leave-one-dataset-out F1 delta from the macro [PITH_FULL_IMAGE:figures/full_fig_p005_7.png] view at source ↗

**Figure 9.** Figure 9: Length-bias Pearson correlation between input [PITH_FULL_IMAGE:figures/full_fig_p005_9.png] view at source ↗

**Figure 10.** Figure 10: Top permutation-importance features by held-out [PITH_FULL_IMAGE:figures/full_fig_p006_10.png] view at source ↗

**Figure 11.** Figure 11: Pairwise Cohen’s κ between ensemble heads. Other checks (summary only). Cross-source near-duplicate hashing catches prompts that appear in multiple datasets under conflicting labels (the most common form of implicit data-quality leak when stitching public benchmarks), and surfaces them for manual relabel; determinism replay diffs two runs with identical seed for byte-equal OOF probabilities; a paraphras… view at source ↗

**Figure 12.** Figure 12: Reliability diagram: predicted probability bin [PITH_FULL_IMAGE:figures/full_fig_p007_12.png] view at source ↗

**Figure 13.** Figure 13: Per-dataset primary metric with nested bootstrap CIs. [PITH_FULL_IMAGE:figures/full_fig_p008_13.png] view at source ↗

**Figure 14.** Figure 14: Metric availability across published competitor [PITH_FULL_IMAGE:figures/full_fig_p009_14.png] view at source ↗

**Figure 15.** Figure 15: Global Pareto: F1 vs FPR averaged across each system’s independently verified rows on the benchmarks evaluated [PITH_FULL_IMAGE:figures/full_fig_p011_15.png] view at source ↗

**Figure 16.** Figure 16: Detection Error Tradeoff on probit axes. Probit [PITH_FULL_IMAGE:figures/full_fig_p011_16.png] view at source ↗

**Figure 17.** Figure 17: Maximum TPR at each FPR budget. Gate sweeps [PITH_FULL_IMAGE:figures/full_fig_p012_17.png] view at source ↗

**Figure 18.** Figure 18: Every (system × dataset) operating point. Gate at natural and at FPR-1%; Lakera Guard in red; second-mostpublished competitor in orange; everything else grey. Head-to-head: Lakera Guard Lakera Guard is the most-cited commercial competitor in this space and publishes per-dataset numbers on the broadest set of benchmarks. The figures below report the head-to-head comparison at per-dataset matched FPR so th… view at source ↗

**Figure 20.** Figure 20: WildGuard-benign: false-positive rate against ev [PITH_FULL_IMAGE:figures/full_fig_p013_20.png] view at source ↗

**Figure 19.** Figure 19: NotInject: false-positive rate against every pub [PITH_FULL_IMAGE:figures/full_fig_p013_19.png] view at source ↗

**Figure 21.** Figure 21: Gate vs Lakera Guard: per-dataset values at matched FPR. Bars are the per-dataset primary metric (F1 for balanced [PITH_FULL_IMAGE:figures/full_fig_p014_21.png] view at source ↗

**Figure 22.** Figure 22: Latency density distribution: density histogram [PITH_FULL_IMAGE:figures/full_fig_p014_22.png] view at source ↗

**Figure 23.** Figure 23: End-to-end detection latency: Gate (blue, p50 [PITH_FULL_IMAGE:figures/full_fig_p015_23.png] view at source ↗

read the original abstract

Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation harness that addresses both. The detector under evaluation is scored across 16 public benchmarks (12,111 samples) using 5-fold cross-validation. StratifiedKFold (by row) is the headline pass; a parallel StratifiedGroupKFold pass over a composite key (parent-prompt id plus MinHash + LSH near-duplicate clusters at Jaccard $\gtrsim 0.8$) runs alongside it as a leakage-premium diagnostic. A single global operating point is selected on the held-out folds (max F1 subject to FPR $\leq 1\%$) and applied uniformly to every dataset, so per-dataset results reflect one threshold rather than per-benchmark optimisation. Generalisation is examined through a battery of diagnostics (leave-one-dataset-out cross-validation, a random-label control, adversarial validation, permutation feature importance, length-bias correlation, classifier-head agreement, cross-source near-duplicate detection, threshold transferability, train-vs-OOF agreement, and a paraphrase-invariance probe), most with a quantitative pass threshold and the remainder with a stated failure mode. For every external comparison, the detector's threshold is re-tuned to the competitor's published false-positive rate so head-to-head values are evaluated at matched operating points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The headline split lets near-duplicates leak across folds, so the global-threshold claim does not hold up as stated.

read the letter

The main thing here is that the headline results rest on a row-wise StratifiedKFold split. That split allows semantically similar or near-duplicate prompts to appear in both training and test folds, which undercuts the goal of a leakage-controlled global operating point.

The paper does lay out a usable protocol. It runs 5-fold CV across 16 benchmarks, picks one threshold by max F1 at FPR ≤1% on held-out folds, and then re-tunes external detectors to the same FPR for matched comparisons. The diagnostic battery (leave-one-dataset-out, random-label control, adversarial validation, length bias, paraphrase invariance, etc.) is more complete than most detector papers provide, and most checks have explicit pass thresholds.

The soft spot is exactly the split. The grouped version that uses parent-prompt IDs plus MinHash/LSH clusters at Jaccard ≳0.8 is run only as a diagnostic, not as the reported numbers. Because the primary results can still overfit to repeated content, the uniform threshold does not demonstrably solve the per-dataset tuning problem the abstract targets. The clustering itself looks reasonable, but it is not applied where it would matter most.

This is for groups that evaluate prompt-injection or jailbreak detectors and want a documented harness. Anyone who cares about leakage will have to rerun the grouped analysis themselves to trust the numbers.

I would not send it to peer review in this form. The methodological choice needs to be fixed or justified before the central claim can be evaluated.

Referee Report

1 major / 1 minor

Summary. The paper claims to introduce an evaluation harness for prompt-injection and jailbreak detectors that mitigates per-dataset threshold tuning and undisclosed operating points. It scores detectors on 16 public benchmarks (12,111 samples) via 5-fold cross-validation, selects a single global operating point (max F1 subject to FPR ≤ 1%) on held-out folds, and applies it uniformly; StratifiedKFold by row is the headline protocol while StratifiedGroupKFold (parent-prompt + MinHash/LSH clusters at Jaccard ≳ 0.8) serves as a leakage diagnostic. A suite of generalization diagnostics (leave-one-dataset-out, random-label control, adversarial validation, etc.) is applied, most with quantitative pass thresholds, and external comparisons are performed at matched FPRs.

Significance. If the central claim holds, the work would provide a reproducible, standardized protocol for LLM security detector evaluation that reduces the common practice of per-benchmark optimization. The battery of diagnostics with stated quantitative thresholds and the explicit handling of operating-point matching are positive features that could improve comparability across papers.

major comments (1)

[Evaluation Protocol (abstract and § on cross-validation)] The headline protocol uses StratifiedKFold by row for threshold selection and reporting, while the StratifiedGroupKFold (parent-prompt id plus MinHash/LSH near-duplicate clusters) is described only as a parallel diagnostic. Because near-duplicates can cross folds under the row-wise split, the selected global operating point may overfit to leaked examples; this directly undermines the claim that the harness produces a leakage-controlled uniform threshold applicable across the 16 benchmarks.

minor comments (1)

[Abstract] Clarify whether the Jaccard ≳ 0.8 threshold for LSH clustering is fixed or tuned, and report the exact number of clusters formed.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and for identifying a key point about our evaluation protocol. We respond to the major comment below and outline the changes we will make.

read point-by-point responses

Referee: [Evaluation Protocol (abstract and § on cross-validation)] The headline protocol uses StratifiedKFold by row for threshold selection and reporting, while the StratifiedGroupKFold (parent-prompt id plus MinHash/LSH near-duplicate clusters) is described only as a parallel diagnostic. Because near-duplicates can cross folds under the row-wise split, the selected global operating point may overfit to leaked examples; this directly undermines the claim that the harness produces a leakage-controlled uniform threshold applicable across the 16 benchmarks.

Authors: We agree that the distinction between protocols requires clarification to support the leakage-control claim. The manuscript presents StratifiedKFold (by row) as the headline protocol because it follows conventional cross-validation practice and allows direct comparison with prior work, while the composite-key StratifiedGroupKFold serves as an explicit leakage diagnostic. However, the referee correctly notes that near-duplicates may still cross row-wise folds, potentially allowing the global threshold (max F1 at FPR ≤ 1%) to benefit from leakage. To strengthen the central claim, we will revise the abstract and the cross-validation section to designate the StratifiedGroupKFold results as the primary, leakage-controlled protocol for threshold selection and reporting. The row-wise results will be retained as a secondary comparison to quantify the leakage premium. This change directly addresses the risk of overfitting to leaked examples while preserving the uniform-threshold objective. revision: yes

Circularity Check

0 steps flagged

No significant circularity; methodology is self-contained

full rationale

The paper describes an evaluation harness using 5-fold cross-validation (StratifiedKFold headline, StratifiedGroupKFold diagnostic) to select a single global operating point (max F1 at FPR ≤1% on held-out folds) applied uniformly across 16 benchmarks. No equations, fitted parameters, or derivations are presented that reduce the claimed generalization or threshold selection to inputs by construction. No self-citations are load-bearing for uniqueness or ansatz, and the method is proposed independently without renaming known results or smuggling assumptions via citation. The central claim rests on the described procedure itself rather than reducing to its own outputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The methodology rests on standard cross-validation assumptions plus domain-specific choices for clustering and threshold selection; no new physical entities are introduced.

free parameters (2)

FPR cap = 1%
Maximum false-positive rate of 1% used to select the operating point
Jaccard threshold = 0.8
Similarity cutoff for defining near-duplicate clusters in the group fold

axioms (2)

standard math StratifiedKFold yields unbiased performance estimates when applied to the composite benchmark collection
Invoked by the headline 5-fold pass
domain assumption The 16 public benchmarks collectively represent the distribution of prompt-injection attacks
Required for the claim that uniform-threshold results generalize

pith-pipeline@v0.9.1-grok · 5777 in / 1417 out tokens · 26145 ms · 2026-06-28T15:03:18.394954+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 14 canonical work pages · 4 internal anchors

[1]

Palit, D

S. Palit, D. Woods. Evaluating the efficacy of LLM Safety Solutions : The Palit Benchmark Dataset. arXiv:2505.13028. 2025.https://arxiv.org/abs/2505.13028

work page arXiv 2025
[2]

Datta, S

Y . Datta, S. Rajasekar. JavelinGuard: Low-Cost Transformer Architectures for LLM Security. arXiv:2506.07330. 2025. https://arxiv.org/abs/2506.07330

work page arXiv 2025
[3]

F. A. Chitan. ILION: Deterministic Pre-Execution Safety Gates for Agentic AI Systems. arXiv:2603.13247. 2026. https://arxiv.org/abs/2603.13247 16

work page arXiv 2026
[4]

V . García. Which firewall best prevents prompt injection attacks? NeuralTrust blog. 2025.https://neuraltrust. ai/blog/prevent-prompt-injection-attacks-firewall-comparison

2025
[5]

deepset/prompt-injections (community-labelled prompt-injection dataset)

deepset. deepset/prompt-injections (community-labelled prompt-injection dataset). Hugging Face Datasets. 2023. https://huggingface.co/datasets/deepset/prompt-injections

2023
[6]

prompt-injection-jailbreak-sentinel-v2 (model card)

Rogue Security. prompt-injection-jailbreak-sentinel-v2 (model card). Hugging Face. 2025. https://huggingface. co/rogue-security/prompt-injection-jailbreak-sentinel-v2

2025
[7]

Schulhoff et al

S. Schulhoff et al. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. EMNLP 2023; project site. 2023.https://www.hackaprompt.com

2023
[8]

jackhhao/jailbreak-classification (binary jailbreak vs benign classification dataset)

jackhhao. jackhhao/jailbreak-classification (binary jailbreak vs benign classification dataset). Hugging Face Datasets. 2023.https://huggingface.co/datasets/jackhhao/jailbreak-classification

2023
[9]

Abdelnabi et al

S. Abdelnabi et al. LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge. arXiv:2506.09956. 2025.https://arxiv.org/abs/2506.09956

work page arXiv 2025
[10]

I. Wu, M. Maslowski. CourtGuard: A Local, Multiagent Prompt Injection Classifier. arXiv:2510.19844. 2025. https: //arxiv.org/abs/2510.19844

work page arXiv 2025
[11]

H. Li, X. Liu. InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models. arXiv:2410.22770. 2024.https://arxiv.org/abs/2410.22770

work page arXiv 2024
[12]

L. E. Erdogan et al. safe-guard-prompt-injection (synthetic prompt-injection dataset, n=10,296). Hugging Face Datasets. 2024.https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection

2024
[13]

Kasundra et al

J. Kasundra et al. AprielGuard. arXiv:2512.20293. 2025.https://arxiv.org/abs/2512.20293

work page arXiv 2025
[14]

Universal and Transferable Adversarial Attacks on Aligned Language Models

A. Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043. 2023. https://arxiv.org/abs/2307.15043

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

A. Robey et al. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. arXiv:2310.03684. 2023. https://arxiv.org/abs/2310.03684

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Benchmarking and defending against indi- rect prompt injection attacks on large language models

J. Yi et al. Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. arXiv:2312.14197. 2023.https://arxiv.org/abs/2312.14197

work page arXiv 2023
[17]

Lakera/gandalf_ignore_instructions (embedding-filtered Gandalf RCT subset)

Lakera AI. Lakera/gandalf_ignore_instructions (embedding-filtered Gandalf RCT subset). Hugging Face Datasets. 2023. https://huggingface.co/datasets/Lakera/gandalf_ignore_instructions

2023
[18]

Li et al

R. Li et al. GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks. arXiv:2409.19521. 2024.https://arxiv.org/abs/2409.19521

work page arXiv 2024
[19]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

M. Mazeika et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249. 2024.https://arxiv.org/abs/2402.04249

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

S. Han et al. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. arXiv:2406.18495. 2024.https://arxiv.org/abs/2406.18495

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

arXiv preprint arXiv:2402.05044 , year=

L. Li et al. SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. arXiv:2402.05044. 2024.https://arxiv.org/abs/2402.05044

work page arXiv 2024
[22]

G. C. Cawley, N. L. C. Talbot. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research 11. 2010. https://jmlr.org/papers/v11/cawley10a. html 17

2010

[1] [1]

Palit, D

S. Palit, D. Woods. Evaluating the efficacy of LLM Safety Solutions : The Palit Benchmark Dataset. arXiv:2505.13028. 2025.https://arxiv.org/abs/2505.13028

work page arXiv 2025

[2] [2]

Datta, S

Y . Datta, S. Rajasekar. JavelinGuard: Low-Cost Transformer Architectures for LLM Security. arXiv:2506.07330. 2025. https://arxiv.org/abs/2506.07330

work page arXiv 2025

[3] [3]

F. A. Chitan. ILION: Deterministic Pre-Execution Safety Gates for Agentic AI Systems. arXiv:2603.13247. 2026. https://arxiv.org/abs/2603.13247 16

work page arXiv 2026

[4] [4]

V . García. Which firewall best prevents prompt injection attacks? NeuralTrust blog. 2025.https://neuraltrust. ai/blog/prevent-prompt-injection-attacks-firewall-comparison

2025

[5] [5]

deepset/prompt-injections (community-labelled prompt-injection dataset)

deepset. deepset/prompt-injections (community-labelled prompt-injection dataset). Hugging Face Datasets. 2023. https://huggingface.co/datasets/deepset/prompt-injections

2023

[6] [6]

prompt-injection-jailbreak-sentinel-v2 (model card)

Rogue Security. prompt-injection-jailbreak-sentinel-v2 (model card). Hugging Face. 2025. https://huggingface. co/rogue-security/prompt-injection-jailbreak-sentinel-v2

2025

[7] [7]

Schulhoff et al

S. Schulhoff et al. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. EMNLP 2023; project site. 2023.https://www.hackaprompt.com

2023

[8] [8]

jackhhao/jailbreak-classification (binary jailbreak vs benign classification dataset)

jackhhao. jackhhao/jailbreak-classification (binary jailbreak vs benign classification dataset). Hugging Face Datasets. 2023.https://huggingface.co/datasets/jackhhao/jailbreak-classification

2023

[9] [9]

Abdelnabi et al

S. Abdelnabi et al. LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge. arXiv:2506.09956. 2025.https://arxiv.org/abs/2506.09956

work page arXiv 2025

[10] [10]

I. Wu, M. Maslowski. CourtGuard: A Local, Multiagent Prompt Injection Classifier. arXiv:2510.19844. 2025. https: //arxiv.org/abs/2510.19844

work page arXiv 2025

[11] [11]

H. Li, X. Liu. InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models. arXiv:2410.22770. 2024.https://arxiv.org/abs/2410.22770

work page arXiv 2024

[12] [12]

L. E. Erdogan et al. safe-guard-prompt-injection (synthetic prompt-injection dataset, n=10,296). Hugging Face Datasets. 2024.https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection

2024

[13] [13]

Kasundra et al

J. Kasundra et al. AprielGuard. arXiv:2512.20293. 2025.https://arxiv.org/abs/2512.20293

work page arXiv 2025

[14] [14]

Universal and Transferable Adversarial Attacks on Aligned Language Models

A. Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043. 2023. https://arxiv.org/abs/2307.15043

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

A. Robey et al. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. arXiv:2310.03684. 2023. https://arxiv.org/abs/2310.03684

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Benchmarking and defending against indi- rect prompt injection attacks on large language models

J. Yi et al. Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models. arXiv:2312.14197. 2023.https://arxiv.org/abs/2312.14197

work page arXiv 2023

[17] [17]

Lakera/gandalf_ignore_instructions (embedding-filtered Gandalf RCT subset)

Lakera AI. Lakera/gandalf_ignore_instructions (embedding-filtered Gandalf RCT subset). Hugging Face Datasets. 2023. https://huggingface.co/datasets/Lakera/gandalf_ignore_instructions

2023

[18] [18]

Li et al

R. Li et al. GenTel-Safe: A Unified Benchmark and Shielding Framework for Defending Against Prompt Injection Attacks. arXiv:2409.19521. 2024.https://arxiv.org/abs/2409.19521

work page arXiv 2024

[19] [19]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

M. Mazeika et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249. 2024.https://arxiv.org/abs/2402.04249

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

S. Han et al. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. arXiv:2406.18495. 2024.https://arxiv.org/abs/2406.18495

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

arXiv preprint arXiv:2402.05044 , year=

L. Li et al. SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. arXiv:2402.05044. 2024.https://arxiv.org/abs/2402.05044

work page arXiv 2024

[22] [22]

G. C. Cawley, N. L. C. Talbot. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research 11. 2010. https://jmlr.org/papers/v11/cawley10a. html 17

2010