Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? arXiv preprint arXiv:2508.05464, 2025

Matteo Prandi, Vincenzo Suriani, Federico Pierucci, Marcello Galisai, Daniele Nardi, Piercosma Bisconti · 2025 · arXiv 2508.05464

1 Pith paper cites this work. Polarity classification is still being indexed.


representative citing papers

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Boiling the Frog is a new stateful multi-turn benchmark for agentic safety that reports an aggregate strict attack success rate of 44.4% across nine models, with rates ranging from 20.5% to 92.9% depending on the model and scenario category.

citing papers explorer

Showing 1 of 1 citing paper.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety cs.CL · 2026-05-21 · unverdicted · none · ref 67
