A Comparative Evaluation of AI Agent Security Guardrails
Pith reviewed 2026-05-08 02:41 UTC · model grok-4.3
The pith
DKnownAI Guard delivers the highest recall (96.5%) and true negative rate (90.4%) in detecting AI agent security risks among the four guardrails evaluated.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This report presents a comparative evaluation of DKnownAI Guard in AI agent security scenarios, benchmarked against three competing products: AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard. Using human annotation as the ground truth, we assess each guardrail's ability to detect two categories of risks: threats to the agent itself (e.g., instruction override, indirect injection, tool abuse) and requests intended to elicit harmful content (e.g., hate speech, pornography, violence). Evaluation results demonstrate that DKnownAI Guard achieves the highest recall rate at 96.5% and ranks first in true negative rate (TNR) at 90.4%, delivering the best overall performance among all evaluated guardrails.
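The two headline metrics follow mechanically from the human labels: recall is the share of annotated risks the guardrail flags, and TNR is the share of annotated benign inputs it lets through. A minimal sketch, using illustrative placeholder labels rather than data from the report:

```python
# Recall and TNR from human-annotated ground truth.
# Labels: 1 = risk (agent threat or harmful request), 0 = benign.

def recall_and_tnr(y_true, y_pred):
    """Compute (recall, TNR) for one guardrail's verdicts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn)  # share of true risks flagged
    tnr = tn / (tn + fp)     # share of benign inputs passed through
    return recall, tnr

# Illustrative annotations and one guardrail's verdicts (not the report's data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(recall_and_tnr(y_true, y_pred))  # → (0.75, 0.75)
```

Reporting both numbers together is what makes the comparison meaningful: a guardrail that flags everything trivially maximizes recall but collapses TNR.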
What carries the argument
Benchmark comparison of guardrail products using human-annotated ground truth data for agent threats and harmful content requests.
If this is right
- AI agent developers can expect better protection from DKnownAI Guard against instruction overrides and tool abuse.
- The evaluation framework allows direct comparison of how well each product balances threat detection with low false positives.
- Guardrails that perform well on both agent-internal and harmful content risks are preferable for secure AI deployments.
- These metrics can inform choices when integrating guardrails into production AI agent systems.
Where Pith is reading between the lines
- If the human annotations prove consistent across reviewers, DKnownAI Guard would be a strong candidate for high-stakes agent applications.
- Similar benchmarks could be run periodically as new guardrail versions are released to track improvements.
- Extending the test set to include more edge cases might change the relative standings of the products.
Load-bearing premise
Human annotations provide reliable ground truth for both agent-internal threats and harmful content requests, and the chosen test cases adequately represent real deployment risks.
What would settle it
If a new set of human annotators labels the test cases differently and the recall and TNR rankings reverse, the superiority claim for DKnownAI Guard would be falsified.
Original abstract
This report presents a comparative evaluation of DKnownAI Guard in AI agent security scenarios, benchmarked against three competing products: AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard. Using human annotation as the ground truth, we assess each guardrail's ability to detect two categories of risks: threats to the agent itself (e.g., instruction override, indirect injection, tool abuse) and requests intended to elicit harmful content (e.g., hate speech, pornography, violence). Evaluation results demonstrate that DKnownAI Guard achieves the highest recall rate at 96.5% and ranks first in true negative rate (TNR) at 90.4%, delivering the best overall performance among all evaluated guardrails.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a comparative evaluation of DKnownAI Guard against AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard for detecting two categories of risks in AI agent scenarios: internal threats (instruction override, indirect injection, tool abuse) and harmful content requests (hate speech, pornography, violence). Using human annotations as ground truth, it claims DKnownAI Guard achieves the highest recall at 96.5% and the best true negative rate (TNR) at 90.4%, ranking first overall.
Significance. If the evaluation methodology is sound, the results would provide actionable guidance for deploying guardrails in AI agent systems, particularly by demonstrating superior performance on agent-internal threats that are underrepresented in standard content-safety benchmarks. The work addresses a timely gap in empirical comparisons of commercial tools for prompt injection and tool-abuse detection.
major comments (2)
- [Abstract] The headline performance figures (96.5% recall and 90.4% TNR for DKnownAI Guard) are reported without any information on dataset size, number of test cases, annotation protocol, number of annotators, inter-annotator agreement, or statistical significance tests. This is load-bearing for the central ranking claim: without that information, readers cannot tell whether the gaps reflect genuine differences rather than annotation variance or sampling effects.
- [Evaluation] Human annotations are treated as fixed ground truth for both agent-internal threats and harmful-content requests, yet no inter-annotator agreement metrics (Cohen's or Fleiss' kappa), blinding procedures, or handling of ambiguous cases (e.g., indirect prompt injection or borderline violence queries) are described. Without these, the reported recall and TNR values cannot be assessed for reliability.
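For context, agreement between two annotators is typically quantified with Cohen's kappa, which discounts the agreement expected by chance. A minimal sketch on hypothetical binary risk labels:

```python
# Cohen's kappa for two annotators on binary risk labels.
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label at random.
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical labels from two independent annotators (1 = risk, 0 = benign).
ann1 = [1, 1, 1, 0, 0, 0, 1, 0]
ann2 = [1, 1, 0, 0, 0, 0, 1, 1]
print(cohens_kappa(ann1, ann2))  # → 0.5
```

For more than two annotators, Fleiss' kappa generalizes the same chance-correction idea.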
minor comments (2)
- [Abstract] The abstract states the evaluation uses 'human annotation as the ground truth' but does not clarify whether the same annotators labeled both threat categories or whether separate protocols were used; a short clarification would improve reproducibility.
- [Results] No error analysis or per-category breakdown (e.g., false-negative rates on tool-abuse vs. violence queries) is provided, which would help readers understand the practical implications of the ranking.
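The per-category breakdown the comment asks for could be tallied along these lines; the categories and verdicts below are hypothetical, not the report's data:

```python
# Hypothetical per-category false-negative tally for one guardrail.
from collections import defaultdict

# (category, ground_truth_risk, guardrail_flagged) -- illustrative only.
cases = [
    ("tool-abuse", 1, 1), ("tool-abuse", 1, 0), ("tool-abuse", 1, 1),
    ("violence", 1, 1), ("violence", 1, 1), ("violence", 1, 0),
    ("injection", 1, 0), ("injection", 1, 1),
]

fn = defaultdict(int)     # missed risks per category
total = defaultdict(int)  # annotated risks per category
for cat, truth, flagged in cases:
    if truth:
        total[cat] += 1
        fn[cat] += not flagged  # count unflagged risks as false negatives

for cat in sorted(total):
    print(f"{cat}: FN rate {fn[cat] / total[cat]:.2f}")
```

A table of this shape would show, for instance, whether a guardrail's misses concentrate in agent-internal threats or in harmful-content requests.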
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments correctly identify areas where additional methodological transparency is needed to strengthen the presentation of our results. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] The headline performance figures (96.5% recall and 90.4% TNR for DKnownAI Guard) are reported without any information on dataset size, number of test cases, annotation protocol, number of annotators, inter-annotator agreement, or statistical significance tests. This is load-bearing for the central ranking claim: without that information, readers cannot tell whether the gaps reflect genuine differences rather than annotation variance or sampling effects.
Authors: We agree that the abstract should provide sufficient context for the headline metrics. In the revised manuscript we will expand the abstract to include the total dataset size, the number of test cases per risk category, a concise description of the annotation protocol, the number of annotators involved, and a statement on the statistical tests used to assess significance of performance differences. revision: yes
- Referee: [Evaluation] Human annotations are treated as fixed ground truth for both agent-internal threats and harmful-content requests, yet no inter-annotator agreement metrics (Cohen's or Fleiss' kappa), blinding procedures, or handling of ambiguous cases (e.g., indirect prompt injection or borderline violence queries) are described. Without these, the reported recall and TNR values cannot be assessed for reliability.
Authors: The current Evaluation section does not report inter-annotator agreement, blinding procedures, or explicit handling of ambiguous cases. We will revise this section to add these details: the number of annotators, any blinding applied, how ambiguous cases were resolved (e.g., discussion or adjudication), and inter-annotator agreement metrics if multiple independent annotations exist. If the original process used a single annotator, we will state this explicitly and discuss it as a limitation of the study. revision: partial
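One standard way to back the promised significance statement is a paired bootstrap confidence interval on the recall gap between two guardrails, resampling test cases with replacement. A sketch on synthetic verdicts (the detection rates below are assumptions, not the report's):

```python
# Paired bootstrap CI for the recall difference between two guardrails.
import random

def recall(y_true, y_pred):
    return sum(t and p for t, p in zip(y_true, y_pred)) / sum(y_true)

random.seed(0)
n = 400
y_true = [1] * n  # recall is computed over risk-labeled cases only
# Synthetic verdicts: guard A flags ~96% of risks, guard B ~90%.
guard_a = [1 if random.random() < 0.96 else 0 for _ in range(n)]
guard_b = [1 if random.random() < 0.90 else 0 for _ in range(n)]

diffs = []
for _ in range(2000):
    idx = [random.randrange(n) for _ in range(n)]  # resample the same cases for both
    yt = [y_true[i] for i in idx]
    ya = [guard_a[i] for i in idx]
    yb = [guard_b[i] for i in idx]
    diffs.append(recall(yt, ya) - recall(yt, yb))

diffs.sort()
lo, hi = diffs[int(0.025 * len(diffs))], diffs[int(0.975 * len(diffs))]
print(f"95% CI for recall gap: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the ranking gap is unlikely to be a sampling artifact; pairing the resamples keeps the comparison on identical test cases.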
Circularity Check
No circularity: direct empirical ranking from human-labeled data
Full rationale
The paper is a comparative benchmark of four guardrails (DKnownAI Guard, AWS Bedrock, Azure Content Safety, Lakera) on two risk categories. Performance numbers (96.5% recall, 90.4% TNR) are computed directly from human annotations treated as fixed ground truth. No equations, derivations, fitted parameters, predictions, or self-citations appear in the provided text. The evaluation chain does not reduce any claim to its own inputs by construction; it is a straightforward ranking against an external labeling process. Annotation-quality concerns (inter-annotator agreement, protocol details) affect validity but do not constitute circularity under the defined criteria.