GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

Edson Rodrigues da Cruz Filho; Guilherme Nielsen Dias; Gustavo Voltani Von Atzingen; Henrique Vieira Laturrague; Ian Degaspari; Jo\~ao Vitor Pavan; Marccello Wilson Perez Berto; Patrick Vieira Laturrague; Paulo Henrique Eleuterio Falsetti; Paulo Ricardo Ferreira Neves

arxiv: 2606.05566 · v1 · pith:XLJQQT7Snew · submitted 2026-06-04 · 💻 cs.AI · cs.CR

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

Paulo Ricardo Ferreira Neves , Edson Rodrigues da Cruz Filho , Paulo Henrique Eleuterio Falsetti , Jo\~ao Vitor Pavan , Ian Degaspari , Henrique Vieira Laturrague , Patrick Vieira Laturrague , Guilherme Nielsen Dias

show 2 more authors

Marccello Wilson Perez Berto Gustavo Voltani Von Atzingen

This is my paper

Pith reviewed 2026-06-28 02:06 UTC · model grok-4.3

classification 💻 cs.AI cs.CR

keywords prompt injectionjailbreak detectionBiLSTMensemble neural networksLLM safetyadversarial attackslow-latency detectionguardrail systems

0 comments

The pith

GuardNet demonstrates that an ensemble of shallow BiLSTMs can provide competitive detection of prompt injections and jailbreaks by focusing on example diversity and calibration rather than model scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests the idea that effective guardrails against prompt injection and jailbreak attacks on large language models can be built using ensembles of small neural networks instead of relying on massive models. GuardNet combines multiple BiLSTMs with a total of roughly 47 million parameters and evaluates them on a blind dataset and a proprietary benchmark. The system reaches an AUROC of 0.747 on the blind set and an F1 score of 0.92 on the proprietary one, while running at about 50 milliseconds per inference on standard CPU hardware. A sympathetic reader would care because this suggests a path to lightweight, deployable protection for applications that cannot afford the compute of full-scale LLMs.

Core claim

The authors claim that GuardNet, an ensemble of shallow BiLSTMs, achieves competitive performance in detecting prompt injection and jailbreak attacks through the diversity of its training examples and careful threshold calibration, attaining an AUROC of 0.747 on the blind JBB-Behaviors benchmark and an F1 score of 0.92 on a proprietary benchmark at an average latency of 50 ms on CPU, even as larger models like Mistral-7B and Llama-3.1-8B outperform it on the blind set.

What carries the argument

Ensemble of shallow BiLSTMs that leverages diversity in example coverage and threshold calibration to detect adversarial prompts.

If this is right

Guardrails for LLMs can achieve useful performance without requiring large model scales.
Threshold calibration and broad example coverage are critical for robustness in adversarial detection.
The low latency of 50 ms makes the system viable for real-time production use under resource limits.
Larger LLMs may provide higher accuracy but at the expense of speed and cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Curating more diverse sets of adversarial examples could further improve such shallow ensembles.
GuardNet-style detectors might be combined with larger models in a cascaded system for better overall efficiency.
Similar techniques could be applied to other security-related tasks in natural language processing.
Periodic retraining on new attack patterns would likely be necessary to maintain performance over time.

Load-bearing premise

The blind and proprietary benchmarks used are representative of actual adversarial inputs encountered in practice, and any partial information leakage does not significantly affect the reported performance metrics.

What would settle it

A test on an independent collection of 200 fresh adversarial prompts, created without any overlap with the training or evaluation data, that results in an AUROC substantially lower than 0.747.

read the original abstract

Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial information leakage, compromising performance estimates. This work presents GuardNet, a guardrail system based on an ensemble of shallow neural networks (BiLSTMs) with approximately 47 million parameters. We investigate the hypothesis that robustness in adversarial scenarios depends more on the diversity of example coverage and threshold calibration than on model scale. The results indicate that GuardNet achieves competitive performance compared with lightweight detectors and high efficiency at low latency, although larger LLMs such as Mistral-7B and Llama-3.1-8B still achieve superior performance in terms of F1 score and AUROC on the blind JBB-Behaviors benchmark. Nevertheless, GuardNet achieves an AUROC of 0.747 on the blind dataset (n = 200) and an F1 score of 0.92 on a proprietary benchmark (n = 50), under threshold calibration and evaluation with declared partial information leakage. The system operates with an average latency of approximately 50 ms on CPU, making it suitable for deployment in production environments with cost and infrastructure constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GuardNet is a standard BiLSTM ensemble for prompt injection detection whose reported AUROC and F1 rest on two tiny sets with unquantified partial leakage, so the numbers are hard to read as evidence.

read the letter

The paper applies an ensemble of shallow BiLSTMs (47M parameters total) to prompt injection and jailbreak detection. It tests whether diversity of examples plus threshold calibration can deliver usable performance without large models, and it measures CPU latency around 50 ms. Those two points are the only concrete things it adds.

The latency number is straightforward and could matter for constrained deployments. The authors also correctly note that larger models still beat their system on the blind JBB set, which keeps the claim modest.

The evaluations are the problem. One test set has 200 examples, the other only 50, and both carry declared partial information leakage that is never quantified or removed in an ablation. An F1 of 0.92 on the proprietary set and AUROC of 0.747 on the blind set therefore cannot be taken as reliable measures of generalization. No training protocol, baseline comparisons, error bars, or diversity ablations appear in the supplied text. The n=50 set is simply too small for stable F1 estimates.

Nothing in the method is new; ensembles of BiLSTMs for text classification are textbook. The work is an application paper that tries to address a practical constraint, but the evaluation gaps prevent it from showing whether the approach actually works.

This is for engineers who need a quick, low-resource guardrail and are willing to run their own tests. It does not contain enough clean evidence to justify referee time.

Referee Report

3 major / 1 minor

Summary. The manuscript presents GuardNet, an ensemble of shallow BiLSTM neural networks (~47M parameters) for detecting prompt injection and jailbreak attacks. It hypothesizes that robustness depends more on example diversity and threshold calibration than on model scale. It reports an AUROC of 0.747 on the blind JBB-Behaviors benchmark (n=200) and an F1 score of 0.92 on a proprietary benchmark (n=50) under threshold calibration with declared partial information leakage, while noting that larger LLMs (Mistral-7B, Llama-3.1-8B) still outperform on the blind set. The system achieves ~50 ms average latency on CPU.

Significance. If the reported metrics prove robust after addressing leakage quantification and sample-size limitations, the work would indicate that lightweight ensembles can deliver usable detection performance with clear efficiency advantages for constrained deployments. The explicit acknowledgment that larger models remain superior on the blind benchmark is a constructive element that frames the contribution appropriately.

major comments (3)

[Abstract] Abstract: The F1 score of 0.92 on the proprietary benchmark (n=50) is reported without error bars, confidence intervals, or binomial standard-error estimates; given the small sample, this metric has limited stability and cannot reliably support performance claims.
[Abstract] Abstract: Partial information leakage is declared but neither quantified (e.g., which prompts, labels, or outputs leaked) nor bounded by an ablation that removes the leaked information, leaving the AUROC of 0.747 on the blind set (n=200) uninterpretable as evidence of generalization.
[Abstract] Abstract: No training details, baseline comparisons, or ablation studies on the diversity/calibration hypothesis are supplied, so the central claim that the ensemble strategy achieves robustness via these factors rather than scale cannot be evaluated.

minor comments (1)

[Abstract] The abstract states results 'under threshold calibration and evaluation with declared partial information leakage' but does not indicate where in the main text the leakage details or calibration procedure are described.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate planned revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The F1 score of 0.92 on the proprietary benchmark (n=50) is reported without error bars, confidence intervals, or binomial standard-error estimates; given the small sample, this metric has limited stability and cannot reliably support performance claims.

Authors: We agree that the small sample size limits stability. In the revised manuscript we will add binomial standard-error estimates and confidence intervals for the F1 score on the proprietary benchmark. revision: yes
Referee: [Abstract] Abstract: Partial information leakage is declared but neither quantified (e.g., which prompts, labels, or outputs leaked) nor bounded by an ablation that removes the leaked information, leaving the AUROC of 0.747 on the blind set (n=200) uninterpretable as evidence of generalization.

Authors: The declared partial leakage applies to the proprietary benchmark; the reported AUROC is on the separate blind JBB-Behaviors set. We will expand the description of the leakage and its scope in the revision. A complete ablation removing leaked elements is not feasible without violating confidentiality of the proprietary data. revision: partial
Referee: [Abstract] Abstract: No training details, baseline comparisons, or ablation studies on the diversity/calibration hypothesis are supplied, so the central claim that the ensemble strategy achieves robustness via these factors rather than scale cannot be evaluated.

Authors: Training details and baseline comparisons appear in the methods and results sections of the full manuscript. We will add explicit ablation studies on example diversity and threshold calibration to directly test the central hypothesis. revision: yes

standing simulated objections not resolved

Full quantification and ablation of partial information leakage on the proprietary benchmark due to confidentiality constraints.

Circularity Check

0 steps flagged

No circularity; purely empirical performance reporting with no derivation chain

full rationale

The paper describes an ensemble of shallow BiLSTM networks for prompt injection/jailbreak detection and reports empirical metrics (AUROC 0.747 on n=200 blind set, F1 0.92 on n=50 proprietary set) under declared partial leakage and threshold calibration. No mathematical derivation, first-principles result, or predictive claim is advanced that could reduce to its own inputs by construction. No equations, ansatzes, or uniqueness theorems are invoked. Self-citations, if present, are irrelevant because no load-bearing theoretical step exists. The reader's assessment correctly identifies the absence of any derivation, confirming this is standard empirical ML evaluation on held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information on training data, loss functions, hyper-parameters, or benchmark construction is provided, preventing enumeration of free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5811 in / 1028 out tokens · 28631 ms · 2026-06-28T02:06:25.276815+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

3 extracted references · 2 canonical work pages · 2 internal anchors

[1]

The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., & Askell, A. (2020). Language models are few -shot learners. Advances in neural information processing systems, 33, 1877–1901. 19 Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D.,...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2011.03395 2020
[2]

Kurakin, A., Goodfellow, I., & Bengio, S. (2016). Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Kushnerov, O., Shevchuk, R., Yevseiev, S., & Karpiński, M. (2026). Comparative Benchmarking of Deep Learning Architectures for Detecting Adversarial Attacks on Large Language Models. Information, 17(2),

Pith/arXiv arXiv 2016
[3]

Ignore Previous Prompt: Attack Techniques For Language Models

Markov, T., Zhang, C., Agarwal, S., Nekoul, F. E., Lee, T., Adler, S., Jiang, A., & Weng, L. (2023). A holistic approach to undesired content detection in the real world . 37(12), 15009–15018. Meta-llama/Llama-Prompt-Guard-2-86M · Hugging Face . (2025, abril 29). https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M Microservices. (s.d.). martinfowle...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.09527 2023

[1] [1]

The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., & Askell, A. (2020). Language models are few -shot learners. Advances in neural information processing systems, 33, 1877–1901. 19 Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D.,...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2011.03395 2020

[2] [2]

Kurakin, A., Goodfellow, I., & Bengio, S. (2016). Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Kushnerov, O., Shevchuk, R., Yevseiev, S., & Karpiński, M. (2026). Comparative Benchmarking of Deep Learning Architectures for Detecting Adversarial Attacks on Large Language Models. Information, 17(2),

Pith/arXiv arXiv 2016

[3] [3]

Ignore Previous Prompt: Attack Techniques For Language Models

Markov, T., Zhang, C., Agarwal, S., Nekoul, F. E., Lee, T., Adler, S., Jiang, A., & Weng, L. (2023). A holistic approach to undesired content detection in the real world . 37(12), 15009–15018. Meta-llama/Llama-Prompt-Guard-2-86M · Hugging Face . (2025, abril 29). https://huggingface.co/meta-llama/Llama-Prompt-Guard-2-86M Microservices. (s.d.). martinfowle...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2211.09527 2023