arXiv: 2509.26100 · v2 · pith:OWPBRGOV · submitted 2025-09-30 · 💻 cs.AI

AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models

Yixu Wang, Xin Wang, Yang Yao, Xinyuan Li, Xibang Yang, Yan Teng, Xingjun Ma, Yingchun Wang

Pith reviewed 2026-05-18 12:31 UTC · model grok-4.3

classification 💻 cs.AI

keywords: agentic safety evaluation, self-evolving benchmarks, LLM safety assessment, multi-agent frameworks, compliance evaluation, dynamic risk assessment, EU AI Act testing, evolving test cases

The pith

A multi-agent self-evolving system generates progressively harder safety tests that cause LLM compliance rates to drop sharply.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that static benchmarks cannot keep up with changing AI risks and regulations, so evaluation must become a continuous process driven by agents. It presents AgenticEval as a framework that reads policy documents, creates test cases, runs them on models, and then uses the results to make the next round of tests more difficult and specific. Experiments show this hardening process exposes weaknesses, with safety scores falling across iterations on benchmarks tied to rules like the EU AI Act. A sympathetic reader would care because it suggests many current models pass initial checks only because the checks are too easy or fixed. The work therefore calls for shifting from one-time audits to ongoing, adaptive evaluation systems.

Core claim

AgenticEval is a multi-agent framework that ingests unstructured policy documents to generate and then perpetually evolve a safety benchmark through a Self-evolving Evaluation loop. In this loop, evaluation outcomes directly inform the creation of more sophisticated and targeted test cases. When applied to models, the process produces a consistent decline in measured safety, such as GPT-5's EU AI Act safety rate falling from 72.50 percent to 36.36 percent over successive iterations, revealing vulnerabilities that static methods overlook.
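To put the quoted GPT-5 figures in perspective, the drop can be read both in percentage points and as a relative decline. The snippet below is plain arithmetic on the two numbers stated above; the variable names are illustrative and not taken from the paper.

```python
# Safety rates quoted for GPT-5 on the EU AI Act (percent).
initial_rate = 72.50
final_rate = 36.36

# Absolute drop in percentage points.
absolute_drop = initial_rate - final_rate

# Relative decline as a fraction of the initial rate.
relative_decline = absolute_drop / initial_rate

print(f"absolute drop: {absolute_drop:.2f} percentage points")
print(f"relative decline: {relative_decline:.1%}")
```

Read this way, the measured safety rate falls by about 36 percentage points, which is roughly half of its starting value.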

What carries the argument

The Self-evolving Evaluation loop inside a multi-agent pipeline that reads policy documents and refines test cases based on prior results.
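A minimal sketch of how such a loop could be wired together is given below. It is an illustration under stated assumptions, not the authors' implementation: the `EvalCase` type, the three callables, and the toy stand-ins are all hypothetical, and real agents would call LLMs rather than the hard-coded functions used here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    rule_id: str     # policy rule the case targets (hypothetical field)
    prompt: str      # text sent to the model under test
    iteration: int   # loop iteration that produced the case


# Hypothetical agent interfaces, assumed for this sketch only.
Generate = Callable[[List[str], List[str], int], List[EvalCase]]
IsSafe = Callable[[EvalCase], bool]              # True means a safe response
Analyse = Callable[[List[EvalCase]], List[str]]  # failures -> new directives


def self_evolving_eval(rule_ids: List[str], generate: Generate,
                       is_safe: IsSafe, analyse: Analyse,
                       iterations: int = 3) -> List[float]:
    """Run generate -> evaluate -> analyse and return the safety rate per iteration."""
    directives: List[str] = []
    rates: List[float] = []
    for it in range(iterations):
        cases = generate(rule_ids, directives, it)
        if not cases:
            break
        failures = [case for case in cases if not is_safe(case)]
        rates.append(100.0 * (len(cases) - len(failures)) / len(cases))
        # Failures from this round steer the next round of generation.
        directives = analyse(failures)
    return rates


# Toy stand-ins so the sketch runs end to end.
def toy_generate(rule_ids, directives, it):
    return [EvalCase(r, f"probe {r} using {len(directives)} directives", it)
            for r in rule_ids]


def toy_is_safe(case):
    # Pretend the model fails rule "R2" once the tests have evolved at least once.
    return not (case.rule_id == "R2" and case.iteration >= 1)


def toy_analyse(failures):
    return [f"target {c.rule_id} more closely" for c in failures]


rates = self_evolving_eval(["R1", "R2"], toy_generate, toy_is_safe, toy_analyse)
print(rates)
```

The point of the structure is that the only state carried between rounds is the list of directives derived from failures, which is also where any drift away from the source policy would enter.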

If this is right

Static benchmarks will increasingly underestimate the safety risks of deployed models as test difficulty grows.
Regulatory compliance checks will need to incorporate mechanisms that adapt test cases over time rather than relying on fixed suites.
Model development pipelines should include repeated exposure to evolving adversarial evaluations to close the gap between initial and sustained performance.
Safety evaluation resources should shift toward automated systems capable of ingesting new regulations and generating targeted tests without constant human redesign.
High-stakes deployments will require ongoing monitoring ecosystems instead of periodic snapshot audits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Regulators could run similar self-evolving systems in parallel with official audits to detect when a model's apparent compliance is fragile.
The same ingestion-and-evolution pattern might be applied to non-safety domains such as factual accuracy or bias reduction by feeding in relevant guidelines.
If the loop is allowed to run without human review, it may eventually produce tests that are technically valid yet practically irrelevant to real-world use.
Long-term, repeated application across many models could generate public datasets of hard safety cases that static benchmarks currently lack.

Load-bearing premise

The test cases produced by the self-evolving loop remain valid, unbiased, and representative of genuine safety and compliance risks rather than artifacts introduced by the agent pipeline itself.

What would settle it

If a set of human-crafted, independently validated safety tests drawn from the same policies produces safety rates that stay high across repeated rounds while AgenticEval's evolved tests produce declining rates, the claim that the loop uncovers deeper vulnerabilities would be falsified.

Figures

Figures reproduced from arXiv: 2509.26100 by Xibang Yang, Xingjun Ma, Xin Wang, Xinyuan Li, Yang Yao, Yan Teng, Yingchun Wang, Yixu Wang.

**Figure 1.** Overview of SafeEvalAgent. It first transforms regulations into a testable knowledge base via the Specialist agent, then generates a comprehensive test suite with the Generator agent, and finally performs a self-evolving evaluation process in which the Evaluator, Analyst, and Generator agents collaborate and adapt to uncover deeper vulnerabilities.

**Figure 2.** Validation of the Specialist Agent’s (AS) knowledge structuring capability. The heatmaps display the semantic similarity between the explanation fields of atomic rules extracted by AS from documents: (a) the NIST AI RMF, (b) the EU AI Act, and (c) the MAS FEAT. The outlined regions group rules that belong to the same high-level dimension. The pronounced clusters of high similarity (darker colors) within th…

**Figure 3.** LLMs safety rates during the evaluation across three regulatory frameworks. The consis…
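The check behind Figure 2 can be sketched as a comparison of within-dimension and cross-dimension similarity between rule explanations. The example below is an illustration only: the `AtomicRule` type and the four rules are invented toy data, and bag-of-words cosine similarity is swapped in for whatever semantic similarity measure the paper uses, which the caption does not specify.

```python
import math
from collections import Counter
from dataclasses import dataclass
from itertools import combinations


@dataclass
class AtomicRule:
    dimension: str    # high-level dimension the rule belongs to
    explanation: str  # free-text explanation field


def bag_of_words(text):
    return Counter(text.lower().split())


def cosine(a, b):
    # Counter returns 0 for missing tokens, so only shared tokens contribute.
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def mean_similarity(rules, same_dimension):
    """Average pairwise similarity over pairs inside (or across) dimensions."""
    sims = [cosine(bag_of_words(x.explanation), bag_of_words(y.explanation))
            for x, y in combinations(rules, 2)
            if (x.dimension == y.dimension) == same_dimension]
    return sum(sims) / len(sims) if sims else 0.0


# Invented toy rules, not extracted from any regulation.
rules = [
    AtomicRule("transparency", "users must be told they are interacting with an ai system"),
    AtomicRule("transparency", "users must be told when content is generated by an ai system"),
    AtomicRule("data governance", "training data must be checked for bias and errors"),
    AtomicRule("data governance", "training data must be documented and checked for errors"),
]

within = mean_similarity(rules, same_dimension=True)
across = mean_similarity(rules, same_dimension=False)
print(f"within-dimension: {within:.2f}, cross-dimension: {across:.2f}")
```

A clearly higher within-dimension average is the numerical counterpart of the dark diagonal blocks the caption describes. It shows that the extracted rules cluster sensibly, though not that each rule is a faithful reading of its source clause.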

Original abstract

The rapid integration of Large Language Models (LLMs) into high-stakes domains necessitates reliable safety and compliance evaluation. However, existing static benchmarks are ill-equipped to address the dynamic nature of AI risks and evolving regulations, creating a critical safety gap. This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self-evolving process rather than a one-time audit. We then propose a novel multi-agent framework AgenticEval, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark. AgenticEval leverages a synergistic pipeline of specialized agents and incorporates a Self-evolving Evaluation loop, where the system learns from evaluation results to craft progressively more sophisticated and targeted test cases. Our experiments demonstrate the effectiveness of AgenticEval, showing a consistent decline in model safety as the evaluation hardens. For instance, GPT-5's safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations. These findings reveal the limitations of static assessments and highlight our framework's ability to uncover deep vulnerabilities missed by traditional methods, underscoring the urgent need for dynamic evaluation ecosystems to ensure the safe and responsible deployment of advanced AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AgenticEval sketches a multi-agent pipeline to evolve safety tests from policies and reports dropping compliance rates, but the results depend on unverified test quality.

Colleague,

The punchline on this one is that AgenticEval tries to solve the static benchmark problem by having agents generate and evolve safety tests from policies, and they show safety dropping on models as tests get harder. But the drop's interpretation depends on the tests staying good.

What works here is the framing and the architecture. Static evaluations can't keep pace with new rules or model updates, and this self-evolving setup is a direct response. The multi-agent pipeline that handles policy ingestion and iterative refinement adds a layer of automation that prior work on adaptive testing hasn't fully combined in this way. It gives a concrete example of how evaluation could become ongoing.

The soft spot is the missing validation for the test cases themselves. The headline result about GPT-5's safety rate falling relies on the assumption that later iterations are still fair tests of compliance. If the agents are producing cases that drift from the original policies or include invalid elements, the numbers don't tell us much about real vulnerabilities. The paper needs to address how they verify the quality and relevance of these generated tests, perhaps through human oversight or automated checks against the source documents.

This paper is for the AI safety and evaluation crowd, especially those exploring agent-based methods. A reader in that space could find the pipeline design useful even if they disagree with the strength of the current evidence.

It deserves peer review. The idea has merit and the experiments illustrate a real issue, but it will need revisions around the empirical robustness to be convincing.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgenticEval, a multi-agent framework for agentic and self-evolving safety evaluation of LLMs. It claims to autonomously ingest unstructured policy documents, generate test cases via a synergistic pipeline of specialized agents, and use a Self-evolving Evaluation loop to perpetually refine more sophisticated tests based on prior model failures. Experiments are reported to demonstrate effectiveness through a consistent decline in model safety rates as evaluation hardens, for instance GPT-5's safety rate on the EU AI Act dropping from 72.50% to 36.36% over successive iterations, revealing limitations of static benchmarks.

Significance. If the self-evolved test cases prove valid and unbiased, the work could meaningfully advance the field by shifting safety evaluation from static one-time audits to continuous, adaptive processes that better match evolving regulations and AI risks. The multi-agent architecture and self-evolving loop offer a concrete implementation of dynamic benchmarking, with the reported safety declines providing an initial signal of gaps missed by conventional methods. The approach is novel in its autonomous policy ingestion and iterative refinement.

major comments (2)

[Self-evolving Evaluation loop and Experiments] The central claim that AgenticEval uncovers deep vulnerabilities missed by traditional methods (abstract and experiments section) depends on the validity of the self-evolved test cases. The framework description indicates agents refine cases from prior failures without an explicit validation step such as human expert review, inter-annotator agreement, or direct mapping to specific clauses in source policies (e.g., EU AI Act). This leaves open the possibility that the observed safety-rate decline (GPT-5 from 72.50% to 36.36%) arises from generation artifacts or out-of-scope prompts rather than genuine model weaknesses.
[Experiments] The reported safety rates and decline trends lack accompanying details on statistical methods, controls for invalid or biased test cases, sample sizes per iteration, or verification of agent outputs. Without these, the data-to-claim link for the effectiveness of the self-evolving loop cannot be rigorously assessed.
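To make the request for sample sizes concrete, the sketch below computes a safety rate together with a Wilson score interval. Two assumptions are made loudly: "safety rate" is taken to mean the share of test cases judged safe, which the paper would need to confirm, and the counts 29/40 and 16/44 are hypothetical, chosen only because they reproduce the reported 72.50% and 36.36%.

```python
import math


def safety_rate(safe, total):
    """Share of test cases judged safe, in percent (assumed definition)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * safe / total


def wilson_interval(safe, total, z=1.96):
    """Wilson score interval in percent; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = safe / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return 100 * (centre - half), 100 * (centre + half)


# Hypothetical counts that happen to match the reported percentages.
for safe, total in [(29, 40), (16, 44)]:
    low, high = wilson_interval(safe, total)
    print(f"{safety_rate(safe, total):.2f}% (95% CI {low:.1f} to {high:.1f}, n={total})")
```

With samples of this size the intervals are wide, which is why the per-iteration counts matter for judging how much of the decline is signal.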

minor comments (2)

[Experiments] Clarify the precise definition and computation of 'safety rate' and the number of iterations performed in the results presentation.
[Abstract] Ensure consistent terminology between the abstract (e.g., 'GPT-5') and the models actually evaluated in the full experimental section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address the major concerns regarding the validation of self-evolved test cases and the experimental reporting in detail below. We believe these clarifications and planned revisions will strengthen the manuscript.

Referee: [Self-evolving Evaluation loop and Experiments] The central claim that AgenticEval uncovers deep vulnerabilities missed by traditional methods (abstract and experiments section) depends on the validity of the self-evolved test cases. The framework description indicates agents refine cases from prior failures without an explicit validation step such as human expert review, inter-annotator agreement, or direct mapping to specific clauses in source policies (e.g., EU AI Act). This leaves open the possibility that the observed safety-rate decline (GPT-5 from 72.50% to 36.36%) arises from generation artifacts or out-of-scope prompts rather than genuine model weaknesses.

Authors: We agree that explicit validation steps would strengthen the claims. While the multi-agent pipeline includes agents tasked with ensuring alignment to the ingested policy documents, the manuscript does not detail a human review process. In the revised version, we will add a dedicated subsection on validation, including: (1) a mapping of sample evolved test cases to specific clauses in the EU AI Act, (2) results from a small-scale human expert review of 50 test cases per iteration for relevance and scope, and (3) inter-annotator agreement scores. This will help demonstrate that the safety rate declines reflect genuine vulnerabilities. We maintain that the iterative nature based on prior failures inherently targets weaknesses, but acknowledge the need for these additions to rigorously support the central claim. revision: yes
Referee: [Experiments] The reported safety rates and decline trends lack accompanying details on statistical methods, controls for invalid or biased test cases, sample sizes per iteration, or verification of agent outputs. Without these, the data-to-claim link for the effectiveness of the self-evolving loop cannot be rigorously assessed.

Authors: We appreciate this observation and agree that more rigorous experimental details are required. In the updated manuscript, we will expand the Experiments section to include: sample sizes (e.g., 200 test cases generated per iteration), statistical analysis such as paired t-tests or trend analysis with p-values for the observed declines, controls including automated filtering for prompt validity and bias using additional agent checks, and a description of the verification process where agent outputs are cross-checked against policy embeddings for relevance. These additions will provide a clearer link between the data and our claims about the self-evolving loop's effectiveness. revision: yes
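Two of the promised additions are standard enough to sketch. The block below shows Cohen's kappa for two annotators and a pooled two-proportion z-test for the difference between a first and a last iteration. The z-test is swapped in here as a simple stand-in for the paired t-tests or trend analysis named in the response, and all labels and counts are hypothetical (29/40 and 16/44 merely reproduce the reported percentages).

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def two_proportion_z(safe_1, n_1, safe_2, n_2):
    """Return (z, two-sided p-value) for H0: both safety rates are equal."""
    p1, p2 = safe_1 / n_1, safe_2 / n_2
    pooled = (safe_1 + safe_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    if se == 0:
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    return z, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical relevance labels from two reviewers on four test cases.
reviewer_a = ["valid", "valid", "invalid", "valid"]
reviewer_b = ["valid", "invalid", "invalid", "valid"]
kappa = cohens_kappa(reviewer_a, reviewer_b)

# Hypothetical counts for the first and last iteration.
z, p_value = two_proportion_z(29, 40, 16, 44)
print(f"kappa={kappa:.2f}, z={z:.2f}, p={p_value:.4f}")
```

A significant difference only shows that the two rates differ. Whether the later tests are still valid is the separate question that the human review and kappa scores are meant to answer.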

Circularity Check

0 steps flagged

No significant circularity; results are empirical measurements from framework execution

The paper describes a multi-agent framework that ingests policy documents, generates test cases, and uses a self-evolving loop to refine them based on model responses, then reports direct experimental outcomes such as the observed drop in GPT-5 safety rate from 72.50% to 36.36% across iterations on the EU AI Act. These safety rates are measured quantities obtained by running the generated benchmarks rather than quantities defined in terms of fitted parameters, self-referential equations, or derivations that reduce to the inputs by construction. No load-bearing steps rely on self-citations for uniqueness theorems, smuggled ansatzes, or renaming of known results. The central claim rests on the empirical validity of the evolved test cases, which is an external assumption open to independent verification rather than a circular reduction within the paper's own chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on assumptions about reliable agent interpretation of policies and unbiased evolution of tests; no explicit numerical free parameters are described, but several design choices function as ad-hoc elements.

axioms (2)

[domain assumption] Specialized agents can accurately parse unstructured policy documents into valid test cases.
Invoked as the basis for the ingestion stage of the pipeline.
[ad hoc to paper] Feedback from evaluation results can be used to generate progressively more sophisticated tests without systematic bias or invalidity.
Central premise of the Self-evolving Evaluation loop.

invented entities (2)

Self-evolving Evaluation loop [no independent evidence]
purpose: To learn from prior results and craft more targeted test cases over iterations.
New component introduced to enable continuous benchmark evolution.
Synergistic pipeline of specialized agents [no independent evidence]
purpose: To divide benchmark generation, execution, and evolution across multiple agents.
Core architectural invention of the multi-agent framework.

pith-pipeline@v0.9.0 · 5766 in / 1412 out tokens · 54824 ms · 2026-05-18T12:31:00.470254+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: The self-evolving evaluation loop... the Analyst agent learns the target model’s failure modes and synthesizes these insights into new directives, which in turn guide the Generator to craft more targeted test cases

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: Regulation-to-Knowledge Structuring... hierarchical tree of atomic rules

IndisputableMonolith/Foundation/DimensionForcing.lean alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Passage: Safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
