pith · machine review for the scientific record

arxiv: 2604.24826 · v1 · submitted 2026-04-27 · 💻 cs.CR · cs.AI


A Comparative Evaluation of AI Agent Security Guardrails


Pith reviewed 2026-05-08 02:41 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords AI agent security · guardrails · comparative evaluation · recall rate · true negative rate · threat detection · harmful content · instruction override

The pith

DKnownAI Guard delivers the highest recall and true negative rate in detecting AI agent security risks compared to three other guardrails.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper conducts a side-by-side test of four AI guardrail products to see how well they catch risks when used with AI agents. The risks include attempts to override the agent's instructions or abuse its tools, as well as user requests that ask for harmful material such as violent or hateful content. Human reviewers labeled the test cases to create a reliable standard, and the results show DKnownAI Guard catching more of the actual risks while keeping false alarms low.

Core claim

This report presents a comparative evaluation of DKnownAI Guard in AI agent security scenarios, benchmarked against three competing products: AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard. Using human annotation as the ground truth, we assess each guardrail's ability to detect two categories of risks: threats to the agent itself (e.g., instruction override, indirect injection, tool abuse) and requests intended to elicit harmful content (e.g., hate speech, pornography, violence). Evaluation results demonstrate that DKnownAI Guard achieves the highest recall rate at 96.5% and ranks first in true negative rate (TNR) at 90.4%, delivering the best overall performance among all evaluated guardrails.

What carries the argument

Benchmark comparison of guardrail products using human-annotated ground truth data for agent threats and harmful content requests.
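To make the headline metrics concrete: recall is the share of human-labeled risky cases the guardrail flags, and TNR is the share of benign cases it correctly lets through. A minimal sketch of that computation follows; the labels and verdicts are entirely hypothetical, since the paper does not publish its test data.

    # Sketch: recall and true negative rate (TNR) for one guardrail, computed
    # against human-annotated ground truth. All data below is hypothetical.

    def recall_and_tnr(y_true, y_pred):
        """y_true: human labels, y_pred: guardrail verdicts (True = risky)."""
        tp = sum(t and p for t, p in zip(y_true, y_pred))          # risks caught
        fn = sum(t and not p for t, p in zip(y_true, y_pred))      # risks missed
        tn = sum(not t and not p for t, p in zip(y_true, y_pred))  # benign passed
        fp = sum(not t and p for t, p in zip(y_true, y_pred))      # false alarms
        recall = tp / (tp + fn) if (tp + fn) else float("nan")
        tnr = tn / (tn + fp) if (tn + fp) else float("nan")
        return recall, tnr

    # Hypothetical: six test cases, two guardrails with different trade-offs.
    human   = [True, True, True, False, False, False]
    guard_a = [True, True, False, False, False, True]
    guard_b = [True, False, False, False, False, False]
    for name, preds in (("guard_a", guard_a), ("guard_b", guard_b)):
        r, t = recall_and_tnr(human, preds)
        print(f"{name}: recall={r:.2f}  TNR={t:.2f}")

A guardrail can trivially maximize either metric alone (flag everything, or flag nothing); the paper's claim is about leading on both at once.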

If this is right

  • AI agent developers can expect better protection from DKnownAI Guard against instruction overrides and tool abuse.
  • The evaluation framework allows direct comparison of how well each product balances threat detection with low false positives.
  • Guardrails that perform well on both agent-internal and harmful content risks are preferable for secure AI deployments.
  • These metrics can inform choices when integrating guardrails into production AI agent systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the human annotations prove consistent across reviewers, DKnownAI Guard would be a strong candidate for high-stakes agent applications.
  • Similar benchmarks could be run periodically as new guardrail versions are released to track improvements.
  • Extending the test set to include more edge cases might change the relative standings of the products.

Load-bearing premise

Human annotations provide reliable ground truth for both agent-internal threats and harmful content requests, and the chosen test cases adequately represent real deployment risks.

What would settle it

If a new set of human annotators labels the test cases differently and the recall and TNR rankings reverse, the superiority claim for DKnownAI Guard would be falsified.

read the original abstract

This report presents a comparative evaluation of DKnownAI Guard in AI agent security scenarios, benchmarked against three competing products: AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard. Using human annotation as the ground truth, we assess each guardrail's ability to detect two categories of risks: threats to the agent itself (e.g., instruction override, indirect injection, tool abuse) and requests intended to elicit harmful content (e.g., hate speech, pornography, violence). Evaluation results demonstrate that DKnownAI Guard achieves the highest recall rate at 96.5% and ranks first in true negative rate (TNR) at 90.4%, delivering the best overall performance among all evaluated guardrails.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a comparative evaluation of DKnownAI Guard against AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard for detecting two categories of risks in AI agent scenarios: internal threats (instruction override, indirect injection, tool abuse) and harmful content requests (hate speech, pornography, violence). Using human annotations as ground truth, it claims DKnownAI Guard achieves the highest recall at 96.5% and the best true negative rate (TNR) at 90.4%, ranking first overall.

Significance. If the evaluation methodology is sound, the results would provide actionable guidance for deploying guardrails in AI agent systems, particularly by demonstrating superior performance on agent-internal threats that are underrepresented in standard content-safety benchmarks. The work addresses a timely gap in empirical comparisons of commercial tools for prompt injection and tool-abuse detection.

major comments (2)
  1. [Abstract] The headline performance figures (96.5% recall and 90.4% TNR for DKnownAI Guard) are reported without any information on dataset size, number of test cases, annotation protocol, number of annotators, inter-annotator agreement, or statistical significance tests. This omission is load-bearing for the central ranking claim: readers cannot verify whether the gaps reflect genuine differences rather than annotation variance or sampling effects.
  2. [Evaluation] Human annotations are treated as fixed ground truth for both agent-internal threats and harmful-content requests, yet no inter-annotator agreement metrics (Cohen's or Fleiss' kappa), blinding procedures, or handling of ambiguous cases (e.g., indirect prompt injection or borderline violence queries) are described. Without these, the reported recall and TNR values cannot be assessed for reliability.
minor comments (2)
  1. [Abstract] The abstract states the evaluation uses 'human annotation as the ground truth' but does not clarify whether the same annotators labeled both threat categories or whether separate protocols were used; a short clarification would improve reproducibility.
  2. [Results] No error analysis or per-category breakdown (e.g., false-negative rates on tool-abuse vs. violence queries) is provided, which would help readers understand the practical implications of the ranking.
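The referee's second major comment asks for inter-annotator agreement. For reference, here is a minimal sketch of Cohen's kappa for two annotators assigning binary risk labels; the label sequences are invented for illustration, since the paper reports no multi-annotator data.

    # Sketch: Cohen's kappa for two annotators with binary labels (1 = risky).
    # Invented labels; the paper reports no multi-annotator data.

    def cohens_kappa(a, b):
        n = len(a)
        p_obs = sum(x == y for x, y in zip(a, b)) / n     # observed agreement
        p_a, p_b = sum(a) / n, sum(b) / n                 # marginal label rates
        p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)      # chance agreement
        return (p_obs - p_chance) / (1 - p_chance)

    annotator_1 = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
    annotator_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
    print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")  # ~0.62

Agreement near or below chance would undercut the ground-truth assumption the whole ranking rests on.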

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional methodological transparency is needed to strengthen the presentation of our results. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] The headline performance figures (96.5% recall and 90.4% TNR for DKnownAI Guard) are reported without any information on dataset size, number of test cases, annotation protocol, number of annotators, inter-annotator agreement, or statistical significance tests. This omission is load-bearing for the central ranking claim: readers cannot verify whether the gaps reflect genuine differences rather than annotation variance or sampling effects.

    Authors: We agree that the abstract should provide sufficient context for the headline metrics. In the revised manuscript we will expand the abstract to include the total dataset size, the number of test cases per risk category, a concise description of the annotation protocol, the number of annotators involved, and a statement on the statistical tests used to assess significance of performance differences. revision: yes

  2. Referee: [Evaluation] Human annotations are treated as fixed ground truth for both agent-internal threats and harmful-content requests, yet no inter-annotator agreement metrics (Cohen's or Fleiss' kappa), blinding procedures, or handling of ambiguous cases (e.g., indirect prompt injection or borderline violence queries) are described. Without these, the reported recall and TNR values cannot be assessed for reliability.

    Authors: The current Evaluation section does not report inter-annotator agreement, blinding procedures, or explicit handling of ambiguous cases. We will revise this section to add these details: the number of annotators, any blinding applied, how ambiguous cases were resolved (e.g., discussion or adjudication), and inter-annotator agreement metrics if multiple independent annotations exist. If the original process used a single annotator, we will state this explicitly and discuss it as a limitation of the study. revision: partial
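On the statistical-testing commitment in the first response, one standard option is a paired bootstrap over test cases: resample the test set with replacement and check whether the recall gap between two guardrails survives. A sketch with synthetic data standing in for the unpublished per-case results:

    # Sketch: paired bootstrap CI for the recall gap between two guardrails.
    # Synthetic data; the paper does not release per-case verdicts.
    import random

    random.seed(0)
    n = 300
    y_true = [random.random() < 0.5 for _ in range(n)]

    def verdict(is_risky, recall_p, fpr):
        """Simulate a guardrail with a given recall and false-positive rate."""
        return random.random() < (recall_p if is_risky else fpr)

    pred_a = [verdict(t, 0.95, 0.10) for t in y_true]  # stronger guardrail
    pred_b = [verdict(t, 0.88, 0.10) for t in y_true]  # weaker guardrail

    def recall(yt, yp):
        pos = sum(yt)
        return sum(t and p for t, p in zip(yt, yp)) / pos if pos else float("nan")

    # Resample cases with replacement; recompute the recall gap each time.
    deltas = []
    for _ in range(2000):
        idx = [random.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        deltas.append(recall(yt, [pred_a[i] for i in idx])
                      - recall(yt, [pred_b[i] for i in idx]))
    deltas.sort()
    lo, hi = deltas[int(0.025 * len(deltas))], deltas[int(0.975 * len(deltas))]
    print(f"95% CI for recall gap: [{lo:.3f}, {hi:.3f}]")

If the interval excludes zero, the ranking is unlikely to be a sampling artifact; the same procedure applies to the TNR gap.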

Circularity Check

0 steps flagged

No circularity: direct empirical ranking from human-labeled data

full rationale

The paper is a comparative benchmark of four guardrails (DKnownAI Guard, AWS Bedrock, Azure Content Safety, Lakera) on two risk categories. Performance numbers (96.5% recall, 90.4% TNR) are computed directly from human annotations treated as fixed ground truth. No equations, derivations, fitted parameters, predictions, or self-citations appear in the provided text. The evaluation chain does not reduce any claim to its own inputs by construction; it is a straightforward ranking against an external labeling process. Annotation-quality concerns (inter-annotator agreement, protocol details) affect validity but do not constitute circularity under the defined criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation; the claim rests entirely on empirical comparison using human labels.

pith-pipeline@v0.9.0 · 5442 in / 1117 out tokens · 35632 ms · 2026-05-08T02:41:27.744122+00:00 · methodology


Reference graph

Works this paper leans on

9 extracted references · 4 canonical work pages

  1. [1] Q. Li, J. Xu, P. Wei, J. Li, P. Zhao, J. Shi, X. Zhang, Y. Yang, X. Hui, P. Xu, and W. Shao. DeepKnown-Guard: A Proprietary Model-Based Safety Response Framework for AI Agents. arXiv preprint arXiv:2511.03138, 2025.

  2. [2] S. Tedeschi, F. Friedrich, P. Schramowski, K. Kersting, R. Navigli, H. Nguyen, and B. Li. ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming. arXiv preprint arXiv:2404.08676, 2024.

  3. [3] L. Li, B. Dong, R. Wang, X. Hu, W. Zuo, D. Lin, Y. Qiao, and J. Shao. SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. In Findings of ACL, 2024.

  4. [4] S. Toyer, O. Watkins, E. A. Mendes, J. Svegliato, L. Bailey, T. Wang, I. Ong, K. Elmaaroufi, P. Abbeel, T. Darrell, A. Ritter, and S. Russell. Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game. arXiv preprint arXiv:2311.01011, 2023.

  5. [5] H. Choubey. PromptWall: A Cascading Multi-Layer Firewall for Real-Time Prompt Injection Detection. GitHub, 2025. https://github.com/A73r0id/promptwall

  6. [6] Z. Zhou, S. Yan, C. Liu, Q. Li, K. Wang, and Z. Zeng. CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns. arXiv preprint arXiv:2601.00588, 2026.

  7. [7] Y. Guo, G. Cui, L. Yuan, N. Ding, J. Wang, H. Chen, B. Sun, R. Xie, J. Zhou, Y. Lin, Z. Liu, and M. Sun. Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment. In EMNLP, 2024.

  8. [8] NobodyExistsOnTheInternet. ToxicQAFinal: Toxic Question Answering Dataset. Hugging Face, 2024. https://huggingface.co/datasets/NobodyExistsOnTheInternet/ToxicQAFinal

  9. [9] Necent. LLM Jailbreak & Prompt-Injection Dataset. Hugging Face, 2026. https://huggingface.co/datasets/Necent/llm-jailbreak-prompt-injection-dataset
