AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models
Pith reviewed 2026-05-18 12:31 UTC · model grok-4.3
The pith
A multi-agent self-evolving system generates progressively harder safety tests that cause LLM compliance rates to drop sharply.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgenticEval is a multi-agent framework that ingests unstructured policy documents to generate and then perpetually evolve a safety benchmark through a Self-evolving Evaluation loop. In this loop, evaluation outcomes directly inform the creation of more sophisticated and targeted test cases. When applied to models, the process produces a consistent decline in measured safety, such as GPT-5's EU AI Act safety rate falling from 72.50 percent to 36.36 percent over successive iterations, revealing vulnerabilities that static methods overlook.
What carries the argument
The Self-evolving Evaluation loop inside a multi-agent pipeline that reads policy documents and refines test cases based on prior results.
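The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate`, `is_safe`, and `analyze` are hypothetical stand-ins for a test-generating agent, the target model together with its safety judge, and an agent that turns failures into new directives. The toy stubs exist only to make the sketch runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RoundResult:
    iteration: int
    safety_rate: float  # fraction of test cases judged safe, in [0, 1]
    failures: List[str] = field(default_factory=list)


def run_self_evolving_eval(
    generate: Callable[[List[str]], List[str]],  # hypothetical generator agent
    is_safe: Callable[[str], bool],              # hypothetical target model + judge
    analyze: Callable[[List[str]], List[str]],   # hypothetical analyst agent
    rounds: int,
) -> List[RoundResult]:
    directives: List[str] = []  # empty in round 0: tests come from the policy alone
    history: List[RoundResult] = []
    for i in range(rounds):
        tests = generate(directives)
        failures = [t for t in tests if not is_safe(t)]
        rate = 1.0 - len(failures) / len(tests) if tests else float("nan")
        history.append(RoundResult(i, rate, failures))
        directives = analyze(failures)  # outcomes inform the next round's tests
    return history


# Toy stubs (assumptions for illustration only).
def toy_generate(directives: List[str]) -> List[str]:
    base = ["plain-1", "plain-2", "plain-3", "plain-4"]
    return base + [f"hard:{d}" for d in directives]


def toy_is_safe(test: str) -> bool:
    # The toy model fails one plain case and every targeted case.
    return not (test.startswith("hard:") or test == "plain-4")


def toy_analyze(failures: List[str]) -> List[str]:
    return list(failures)  # one directive per observed failure


history = run_self_evolving_eval(toy_generate, toy_is_safe, toy_analyze, rounds=3)
print([round(r.safety_rate, 2) for r in history])  # [0.75, 0.6, 0.5]
```

In this toy setup the measured rate falls simply because each round adds cases the stub model always fails. That is the same ambiguity the load-bearing premise raises: a decline alone does not show whether the added cases are valid.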
If this is right
- Static benchmarks will increasingly underestimate the safety risks of deployed models as test difficulty grows.
- Regulatory compliance checks will need to incorporate mechanisms that adapt test cases over time rather than relying on fixed suites.
- Model development pipelines should include repeated exposure to evolving adversarial evaluations to close the gap between initial and sustained performance.
- Safety evaluation resources should shift toward automated systems capable of ingesting new regulations and generating targeted tests without constant human redesign.
- High-stakes deployments will require ongoing monitoring ecosystems instead of periodic snapshot audits.
Where Pith is reading between the lines
- Regulators could run similar self-evolving systems in parallel with official audits to detect when a model's apparent compliance is fragile.
- The same ingestion-and-evolution pattern might be applied to non-safety domains such as factual accuracy or bias reduction by feeding in relevant guidelines.
- If the loop is allowed to run without human review, it may eventually produce tests that are technically valid yet practically irrelevant to real-world use.
- Long-term, repeated application across many models could generate public datasets of hard safety cases that static benchmarks currently lack.
Load-bearing premise
The test cases produced by the self-evolving loop remain valid, unbiased, and representative of genuine safety and compliance risks rather than artifacts introduced by the agent pipeline itself.
What would settle it
If a set of human-crafted, independently validated safety tests drawn from the same policies produces safety rates that stay high across repeated rounds while AgenticEval's evolved tests produce declining rates, the claim that the loop uncovers deeper vulnerabilities would be falsified.
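The comparison above can be written as a small decision rule. This is a sketch only: the thresholds `high_floor` and `decline_min` are hypothetical choices, the control-suite rates are made-up placeholders, and only the first and last evolved-suite values (72.50 and 36.36) come from the reported GPT-5 result.

```python
from typing import Sequence


def total_drop(rates: Sequence[float]) -> float:
    """Percentage-point fall from the first round to the last."""
    return rates[0] - rates[-1]


def loop_claim_falsified(
    control: Sequence[float],
    evolved: Sequence[float],
    high_floor: float = 80.0,   # hypothetical: what counts as "staying high"
    decline_min: float = 10.0,  # hypothetical: what counts as a real decline
) -> bool:
    """True when the control suite stays high while the evolved suite declines."""
    control_stays_high = min(control) >= high_floor
    evolved_declines = total_drop(evolved) >= decline_min
    return control_stays_high and evolved_declines


evolved_rates = [72.50, 36.36]      # first and last values reported for GPT-5
control_rates = [90.0, 89.0, 91.0]  # made-up placeholder, not from the paper
verdict = loop_claim_falsified(control_rates, evolved_rates)
```

Any real version of this check would need per-round rates for both suites and a principled choice of thresholds, neither of which is given here.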
Original abstract
The rapid integration of Large Language Models (LLMs) into high-stakes domains necessitates reliable safety and compliance evaluation. However, existing static benchmarks are ill-equipped to address the dynamic nature of AI risks and evolving regulations, creating a critical safety gap. This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self-evolving process rather than a one-time audit. We then propose a novel multi-agent framework AgenticEval, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark. AgenticEval leverages a synergistic pipeline of specialized agents and incorporates a Self-evolving Evaluation loop, where the system learns from evaluation results to craft progressively more sophisticated and targeted test cases. Our experiments demonstrate the effectiveness of AgenticEval, showing a consistent decline in model safety as the evaluation hardens. For instance, GPT-5's safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations. These findings reveal the limitations of static assessments and highlight our framework's ability to uncover deep vulnerabilities missed by traditional methods, underscoring the urgent need for dynamic evaluation ecosystems to ensure the safe and responsible deployment of advanced AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AgenticEval, a multi-agent framework for agentic and self-evolving safety evaluation of LLMs. It claims to autonomously ingest unstructured policy documents, generate test cases via a synergistic pipeline of specialized agents, and use a Self-evolving Evaluation loop to perpetually refine more sophisticated tests based on prior model failures. Experiments are reported to demonstrate effectiveness through a consistent decline in model safety rates as evaluation hardens, for instance GPT-5's safety rate on the EU AI Act dropping from 72.50% to 36.36% over successive iterations, revealing limitations of static benchmarks.
Significance. If the self-evolved test cases prove valid and unbiased, the work could meaningfully advance the field by shifting safety evaluation from static one-time audits to continuous, adaptive processes that better match evolving regulations and AI risks. The multi-agent architecture and self-evolving loop offer a concrete implementation of dynamic benchmarking, with the reported safety declines providing an initial signal of gaps missed by conventional methods. The approach is novel in its autonomous policy ingestion and iterative refinement.
major comments (2)
- [Self-evolving Evaluation loop and Experiments] The central claim that AgenticEval uncovers deep vulnerabilities missed by traditional methods (abstract and experiments section) depends on the validity of the self-evolved test cases. The framework description indicates agents refine cases from prior failures without an explicit validation step such as human expert review, inter-annotator agreement, or direct mapping to specific clauses in source policies (e.g., EU AI Act). This leaves open the possibility that the observed safety-rate decline (GPT-5 from 72.50% to 36.36%) arises from generation artifacts or out-of-scope prompts rather than genuine model weaknesses.
- [Experiments] The reported safety rates and decline trends lack accompanying details on statistical methods, controls for invalid or biased test cases, sample sizes per iteration, or verification of agent outputs. Without these, the data-to-claim link for the effectiveness of the self-evolving loop cannot be rigorously assessed.
minor comments (2)
- [Experiments] Clarify the precise definition and computation of 'safety rate' and the number of iterations performed in the results presentation.
- [Abstract] Ensure consistent terminology between the abstract (e.g., 'GPT-5') and the models actually evaluated in the full experimental section.
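To make the request for a precise "safety rate" definition concrete, here is one plausible reading, assumed rather than taken from the paper: the percentage of test cases whose responses are judged safe. The sketch pairs it with a Wilson score interval to show how much uncertainty a single reported rate can carry. The counts 29 of 40 and 4 of 11 are hypothetical and were chosen only because they reproduce the reported 72.50% and 36.36%. The actual per-iteration counts are not stated here.

```python
import math
from typing import Tuple


def safety_rate(num_safe: int, num_total: int) -> float:
    """Assumed definition: percentage of test cases judged safe."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return 100.0 * num_safe / num_total


def wilson_interval(num_safe: int, num_total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion, returned in percent (z=1.96 is about 95%)."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    p = num_safe / num_total
    denom = 1.0 + z * z / num_total
    centre = (p + z * z / (2 * num_total)) / denom
    half = z * math.sqrt(p * (1 - p) / num_total + z * z / (4 * num_total ** 2)) / denom
    return 100.0 * (centre - half), 100.0 * (centre + half)


# Hypothetical counts that happen to match the reported percentages.
first_rate = safety_rate(29, 40)          # 72.5
last_rate = round(safety_rate(4, 11), 2)  # 36.36
first_ci = wilson_interval(29, 40)
last_ci = wilson_interval(4, 11)
```

With small per-iteration samples the intervals are wide, which is why the sample sizes requested in the major comments matter for interpreting the decline.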
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address the major concerns regarding the validation of self-evolved test cases and the experimental reporting in detail below. We believe these clarifications and planned revisions will strengthen the manuscript.
Point-by-point responses
- Referee: [Self-evolving Evaluation loop and Experiments] The central claim that AgenticEval uncovers deep vulnerabilities missed by traditional methods (abstract and experiments section) depends on the validity of the self-evolved test cases. The framework description indicates agents refine cases from prior failures without an explicit validation step such as human expert review, inter-annotator agreement, or direct mapping to specific clauses in source policies (e.g., EU AI Act). This leaves open the possibility that the observed safety-rate decline (GPT-5 from 72.50% to 36.36%) arises from generation artifacts or out-of-scope prompts rather than genuine model weaknesses.
  Authors: We agree that explicit validation steps would strengthen the claims. While the multi-agent pipeline includes agents tasked with ensuring alignment to the ingested policy documents, the manuscript does not detail a human review process. In the revised version, we will add a dedicated subsection on validation, including: (1) a mapping of sample evolved test cases to specific clauses in the EU AI Act, (2) results from a small-scale human expert review of 50 test cases per iteration for relevance and scope, and (3) inter-annotator agreement scores. This will help demonstrate that the safety rate declines reflect genuine vulnerabilities. We maintain that the iterative nature based on prior failures inherently targets weaknesses, but acknowledge the need for these additions to rigorously support the central claim.
  Revision: yes
- Referee: [Experiments] The reported safety rates and decline trends lack accompanying details on statistical methods, controls for invalid or biased test cases, sample sizes per iteration, or verification of agent outputs. Without these, the data-to-claim link for the effectiveness of the self-evolving loop cannot be rigorously assessed.
  Authors: We appreciate this observation and agree that more rigorous experimental details are required. In the updated manuscript, we will expand the Experiments section to include: sample sizes (e.g., 200 test cases generated per iteration), statistical analysis such as paired t-tests or trend analysis with p-values for the observed declines, controls including automated filtering for prompt validity and bias using additional agent checks, and a description of the verification process where agent outputs are cross-checked against policy embeddings for relevance. These additions will provide a clearer link between the data and our claims about the self-evolving loop's effectiveness.
  Revision: yes
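One way the promised significance analysis could look is a two-proportion z-test between the first and last iteration. This is a swapped-in technique, not the paired t-test the rebuttal names: it assumes each iteration uses its own freshly generated test cases, so the samples are unpaired. The counts are hypothetical. They take 200 cases per iteration from the rebuttal's example figure, with 145 safe (72.5%) and 73 safe (36.5%, which only approximates the reported 36.36%).

```python
import math
from typing import Tuple


def two_proportion_z_test(safe_a: int, n_a: int, safe_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a, p_b = safe_a / n_a, safe_b / n_b
    pooled = (safe_a + safe_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        raise ValueError("test undefined when the pooled proportion is 0 or 1")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (see the assumptions stated above the block).
z_stat, p_val = two_proportion_z_test(145, 200, 73, 200)
```

A significant result here would only show that the two measured rates differ. It would not address whether the later test cases are valid or in scope, which is the separate concern raised in the first major comment.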
Circularity Check
No significant circularity; results are empirical measurements from framework execution
Full rationale
The paper describes a multi-agent framework that ingests policy documents, generates test cases, and uses a self-evolving loop to refine them based on model responses, then reports direct experimental outcomes such as the observed drop in GPT-5 safety rate from 72.50% to 36.36% across iterations on the EU AI Act. These safety rates are measured quantities obtained by running the generated benchmarks rather than quantities defined in terms of fitted parameters, self-referential equations, or derivations that reduce to the inputs by construction. No load-bearing steps rely on self-citations for uniqueness theorems, smuggled ansatzes, or renaming of known results. The central claim rests on the empirical validity of the evolved test cases, which is an external assumption open to independent verification rather than a circular reduction within the paper's own chain.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Specialized agents can accurately parse unstructured policy documents into valid test cases.
- ad hoc to paper: Feedback from evaluation results can be used to generate progressively more sophisticated tests without systematic bias or invalidity.
invented entities (2)
- Self-evolving Evaluation loop: no independent evidence
- Synergistic pipeline of specialized agents: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: The self-evolving evaluation loop... the Analyst agent learns the target model’s failure modes and synthesizes these insights into new directives, which in turn guide the Generator to craft more targeted test cases
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Regulation-to-Knowledge Structuring... hierarchical tree of atomic rules
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.