Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Pith reviewed 2026-05-22 05:46 UTC · model grok-4.3
The pith
A new benchmark tests whether AI agents in office settings can be led into unsafe states through gradual workspace changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes a multi-turn benchmark that exposes a persistent workspace to chains of edits, places risk-bearing payloads at controlled positions in the sequence, and scores the final artifact state for safety violations, with scenarios grouped under a three-level operational risk taxonomy drawn from boiling-the-frog patterns and EU AI Act contexts.
What carries the argument
Stateful multi-turn evaluation chains that maintain a persistent workspace and introduce incremental risks at specific turn positions.
If this is right
- Agent deployments should incorporate testing against gradual risk introduction rather than only single-turn harmful instructions.
- Models showing high success rates on loss-of-control chains need targeted safeguards before handling persistent workspaces.
- Safety evaluations for general-purpose AI must include multi-turn state tracking to match how agents actually operate.
- Regulatory compliance checks for high-risk AI contexts should consider incremental exposure patterns.
Where Pith is reading between the lines
- Safety methods trained on overt harmful requests may leave gaps when risks arrive through small cumulative workspace changes.
- The benchmark could be extended to measure recovery after an unsafe state appears, testing whether models can detect and reverse prior compromises.
- Corporate users might apply similar incremental testing during internal red-teaming of agent tools before full rollout.
Load-bearing premise
The benchmark scenarios and risk taxonomy accurately reflect real incremental attack vectors that would appear in actual corporate and office software environments.
What would settle it
A direct comparison of the same models performing equivalent incremental tasks inside a live corporate tool environment versus the simulated benchmark, measuring whether unsafe state changes occur at similar rates.
Figures
read the original abstract
Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an environment, and evaluating model responses under prompting is no longer sufficient to address the safety challenges posed by artificial intelligence. Recent developments have seen the rise of benchmarks that evaluate large language models as agents. We contribute to this strand of research. Approach. We introduce Boiling the Frog, a benchmark that evaluates whether tool-using AI models deployed in corporate and office settings are susceptible to incremental attacks. Each scenario begins with benign workspace edits and later introduces a risk-bearing request. The benchmark focuses on stateful multi-turn evaluation: chains expose a persistent workspace, place the risk-bearing payload at controlled positions in the turn sequence, and score whether the resulting artifact state becomes unsafe. Scenarios are organized through a three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and EU AI Act's Code of Practice on General-Purpose AI (GPAI). Results. Across a nine-model panel, aggregate strict attack success rate (ASR) is 44.4%. Model-level ASR ranges from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with Seed 2.0 Lite also above 80%. Average chain category-level ASR reaches 93.3% for Code of Practice loss-of-control scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Boiling the Frog, a benchmark for evaluating tool-using AI models' susceptibility to incremental multi-turn attacks in simulated corporate and office environments. Scenarios feature persistent workspaces where benign edits precede risk-bearing requests placed at controlled positions, with success measured by whether the final state is unsafe. A three-level risk taxonomy is used, drawing from the AI Act and GPAI Code of Practice. Evaluation on nine models yields an aggregate strict ASR of 44.4%, ranging from 20.5% (Claude Haiku 4.5) to 92.9% (Gemini 3.1 Flash Lite), with 93.3% average for loss-of-control category chains.
Significance. Should the benchmark scenarios prove representative of real-world incremental attack vectors without significant simulation artifacts, the findings would be significant for agentic AI safety. They demonstrate that current models can be vulnerable to gradual escalation in stateful settings, with substantial variation across models and particularly high success in loss-of-control scenarios. This provides a regulatory-grounded framework for assessing risks in high-risk contexts and could serve as a foundation for developing more robust agent safeguards. The controlled multi-turn design is a strength over single-prompt evaluations.
major comments (2)
- [§4 Benchmark Design] The construction of scenarios with abstract file edits and simplified state tracking lacks reported validation against real corporate environments, permission models, or tool integrations. This is load-bearing for the central claim, as the 93.3% ASR for loss-of-control scenarios may reflect artifacts of the environment representation rather than accurate modeling of incremental persuasion in deployed agents.
- [§5 Results and Analysis] Aggregate and model-level ASR numbers (44.4% overall, 20.5%-92.9% range) are presented without error bars, details on scenario counts, exclusion criteria, or statistical significance tests. This reduces confidence in the reliability of the reported differences and the cross-category claims.
minor comments (1)
- [Abstract] Consider specifying the total number of scenarios or average chain length to provide better context for the ASR figures.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate.
read point-by-point responses
-
Referee: [§4 Benchmark Design] The construction of scenarios with abstract file edits and simplified state tracking lacks reported validation against real corporate environments, permission models, or tool integrations. This is load-bearing for the central claim, as the 93.3% ASR for loss-of-control scenarios may reflect artifacts of the environment representation rather than accurate modeling of incremental persuasion in deployed agents.
Authors: We agree that the benchmark employs an abstracted representation of corporate workspaces and state tracking to isolate the effects of incremental multi-turn attacks. The scenarios draw from common office workflows and the AI Act high-risk contexts, but we did not conduct empirical validation against real permission models or production tool integrations. This is a genuine limitation when interpreting the 93.3% loss-of-control ASR, which could partly stem from the simplified environment. In the revised version we will add an explicit Limitations subsection in §4 that discusses the abstraction choices, their rationale for controlled evaluation, and the implications for generalizability. We will also clarify that the benchmark is designed as a reproducible testbed for studying escalation dynamics rather than a direct simulation of deployed agents. revision: yes
-
Referee: [§5 Results and Analysis] Aggregate and model-level ASR numbers (44.4% overall, 20.5%-92.9% range) are presented without error bars, details on scenario counts, exclusion criteria, or statistical significance tests. This reduces confidence in the reliability of the reported differences and the cross-category claims.
Authors: We accept this observation. The manuscript currently reports the aggregate strict ASR and per-model ranges without accompanying counts, variability measures, or formal tests. In revision we will insert a table in §5 listing the number of scenarios per risk category, total evaluations per model, and any exclusion criteria applied during data collection. We will also add error bars or binomial confidence intervals around the reported ASRs and discuss the observed model and category differences descriptively, noting that the deterministic state-based success metric limits the applicability of traditional significance testing. revision: yes
Circularity Check
No circularity: empirical benchmark reports direct measurements
full rationale
The paper defines a set of multi-turn scenarios with persistent workspace state, applies them to nine models, and reports attack success rates as direct empirical outcomes. No equations, fitted parameters, predictions, or first-principles derivations are claimed; the taxonomy and scenarios are constructed inputs whose evaluation results do not reduce back to themselves by construction. The work is self-contained against external benchmarks and contains no self-citation load-bearing steps or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Scenarios are organized through a three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and EU AI Act's Code of Practice on General-Purpose AI (GPAI).
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, Cost/FunctionalEquation.lean, BranchSelection.leanreality_from_one_distinction, washburn_uniqueness_aczel, branch_selection unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Each scenario begins with benign workspace edits and later introduces a risk-bearing request. ... three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and EU AI Act's Code of Practice on General-Purpose AI (GPAI).
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Man´ e, D. (2016). Concrete problems in AI safety.arXiv:1606.06565
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[2]
Andriushchenko, M., et al. (2024). AgentHarm: A benchmark for measuring harmfulness of LLM agents.arXiv:2410.09024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
Anthropic. (2025). Activating AI Safety Level 3 protections. https://www.anthropic. com/news/activating-asl3-protections
work page 2025
-
[4]
Anthropic. (2026). Claude Code documentation. https://code.claude.com/docs/en/overview
work page 2026
-
[5]
Anthropic. (2026). Responsible Scaling Policy Version 3.0. https://www.anthropic. com/news/responsible-scaling-policy-v3
work page 2026
- [6]
- [7]
-
[8]
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,
P. Bisconti, M. Prandi, F. Pierucci, F. Giarrusso, M. Bracale, M. Galisai, V. Suriani, O. Sorokoletova, F. Sartore, and D. Nardi, “Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models,”arXiv preprintarXiv:2511.15304, 2025. doi:10.48550/arXiv.2511.15304
-
[9]
Beyond Single- Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions,
P. Bisconti, M. Galisai, F. Pierucci, M. Bracale, and M. Prandi, “Beyond Single- Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions,”arXiv preprint arXiv:2512.02682, 2025. doi:10.48550/arXiv.2512.02682
-
[10]
From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda,
P. Bisconti, M. Galisai, M. Prandi, F. Pierucci, O. Sorokoletova, F. Giarrusso, V. Suriani, M. Bracale, and D. Nardi, “From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda,”arXiv preprintarXiv:2601.08837, 2026. doi:10.48550/arXiv.2601.08837
-
[11]
Bort, J. (2026). A Meta AI security researcher said an OpenClaw agent ran amok on her inbox.TechCrunch
work page 2026
-
[12]
(2020).The Brussels Effect: How the European Union Rules the World
Bradford, A. (2020).The Brussels Effect: How the European Union Rules the World. Oxford University Press
work page 2020
-
[13]
(2023).Digital Empires: The Global Battle to Regulate Technology
Bradford, A. (2023).Digital Empires: The Global Battle to Regulate Technology. Oxford University Press
work page 2023
-
[14]
Business Insider. (2025). Replit’s CEO apologizes after its AI agent wiped a company’s code base in a test run and lied about it. https://www.businessinsider.com/replit-ceo- apologizes-ai-coding-tool-delete-company-database-2025-7 39
work page 2025
-
[15]
CEN-CENELEC JTC 21. (2024). Standardisation request M/613 for AI
work page 2024
-
[16]
Chao, P., et al. (2024). JailbreakBench: An open robustness benchmark for jailbreaking large language models.arXiv:2404.01318
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [17]
-
[18]
Available at: https://www.chinalawtranslate.com/en/22075-2/
China Law Translate.Cybersecurity Law of the People’s Republic of China (2026 revised version). Available at: https://www.chinalawtranslate.com/en/22075-2/. Last accessed: 19 May 2026
work page 2026
-
[19]
Available at: https://www.chinalawtranslate.com/en/ generative-ai-interim/
China Law Translate.Interim Measures for the Management of Generative Artifi- cial Intelligence Services. Available at: https://www.chinalawtranslate.com/en/ generative-ai-interim/. Last accessed: 19 May 2026
work page 2026
-
[20]
Coggins, S., Saeri, A., Daniell, K. A., Ruster, L. P., Liu, J., and Davis, J. L. (2025). The 2025 OpenAI Preparedness Framework does not guarantee any AI risk mitigation prac- tices: A proof-of-concept for affordance analyses of AI safety policies. arXiv:2509.24394. https://arxiv.org/abs/2509.24394
-
[21]
Costa, M., et al. (2025). Securing AI Agents with Information-Flow Control. arXiv:2505.23643
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[22]
Debenedetti, E., et al. (2025). Defeating Prompt Injections by Design.arXiv:2503.18813
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [23]
- [24]
-
[25]
European Commission. (2022). The Blue Guide on the implementation of EU product rules
work page 2022
-
[26]
European Commission. (2025). Code of Practice for General-Purpose AI Models: Safety and Security Chapter
work page 2025
-
[27]
European Commission. (2025). Commission publishes list of signatories to the General- Purpose AI Code of Practice.https://www.politico.eu/article/eu-signatories-ai-code- practice-amazon-google-ibm-microsoft-ai/
work page 2025
-
[28]
European Commission. (2025). General-Purpose AI Code of Practice. https://digital- strategy.ec.europa.eu/en/policies/contents-code-gpai
work page 2025
-
[29]
European Parliament and Council of the European Union. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence
work page 2024
- [30]
-
[31]
Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
M. Galisai, S. Cifani, F. Giarrusso, P. Bisconti, M. Prandi, F. Pierucci, F. Sartore, and D. Nardi, “Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety,”arXiv preprintarXiv:2604.18487, 2026. doi:10.48550/arXiv.2604.18487
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.18487 2026
- [32]
-
[33]
Gehman, S., et al. (2020). RealToxicityPrompts: Evaluating neural toxic degeneration in language models. https://arxiv.org/abs/2009.11462Findings of EMNLP
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[34]
Available at: https://www.geopolitechs.org/p/china-issues-new-rules-on-ai-ethics
Geopolitechs.China Issues New Rules on AI Ethics Review and Support. Available at: https://www.geopolitechs.org/p/china-issues-new-rules-on-ai-ethics . Last accessed: 19 May 2026
work page 2026
-
[35]
Available at: https: //www.geopolitechs.org/p/chinas-first-policy-framework-for
Geopolitechs.China’s first policy framework for AI agents. Available at: https: //www.geopolitechs.org/p/chinas-first-policy-framework-for . Last accessed: 19 May 2026
work page 2026
- [36]
- [37]
-
[38]
Greshake, K., et al. (2023). Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.arXiv:2302.12173
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
The Guardian. (2026). Claude-powered AI agent’s confession after deleting a firm’s entire database. https://www.theguardian.com/technology/2026/apr/29/claude-ai-deletes-firm- database
work page 2026
- [40]
-
[41]
Hong, S., et al. (2023). MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework.arXiv:2308.00352
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [42]
-
[43]
Hubinger, E., et al. (2019). Risks from Learned Optimization in Advanced Machine Learning Systems.arXiv:1906.01820
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[44]
Hubinger, E., et al. (2024). Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.arXiv:2401.05566
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Kaiyom, F. et al. (2024). HELM Safety: Towards standardized safety evaluations of language models
work page 2024
- [46]
-
[47]
Li, N., et al. (2024). The WMDP Benchmark: Measuring and reducing malicious use with unlearning.arXiv:2403.03218
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [48]
-
[49]
Lin, S., Hilton, J., and Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods.ACL
work page 2022
-
[50]
Liu, X., et al. (2023). AgentBench: Evaluating LLMs as agents.arXiv:2308.03688
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[51]
Liu, Y., et al. (2023). Prompt Injection attack against LLM-integrated Applications. arXiv:2306.05499
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [52]
-
[53]
Malik, S., et al. (2025). RewardBench 2: Advancing reward model evaluation. arXiv:2506.01937
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
Mazeika, M., et al. (2024). HarmBench: A standardized evaluation framework for automated red teaming and robust refusal.ICML
work page 2024
-
[55]
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Merrill, M. A., et al. (2026). Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv:2601.11868
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[56]
Nadeem, M., Bethke, A., and Reddy, S. (2021). StereoSet: Measuring stereotypical bias in pretrained language models.ACL
work page 2021
- [57]
-
[58]
Nangia, N., et al. (2020). CrowS-Pairs: A challenge dataset for measuring social biases in masked language models.EMNLP
work page 2020
-
[59]
Nannini, L., et al. (2026). AI agents under EU law.arXiv:2604.04604
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[60]
New York Post. (2026). Anthropic’s Claude Mythos model sparks fear of AI doomsday if released to public. https://nypost.com/2026/04/08/business/ anthropics-claude-mythos-model-sparks-fears-of-ai-doomsday-wave-of-devastating-hacks/
work page 2026
-
[61]
Nous Research. (2026). Hermes Agent documentation. https://hermes- agent.lzw.me/docs/en/
work page 2026
-
[62]
OpenAI. (2025). Preparedness Framework. https://openai.com/safety/ preparedness
work page 2025
-
[63]
OpenAI. (2026). Codex CLI features and approvals documentation. https://developers.openai.com/codex/cli/features
work page 2026
-
[64]
(2026).OpenClaw: Personal AI Assistant
OpenClaw maintainers. (2026).OpenClaw: Personal AI Assistant. GitHub repository. https://github.com/openclaw/openclaw 42
work page 2026
- [65]
-
[66]
Parrish, A., et al. (2022). BBQ: A hand-built bias benchmark for question answering. Findings of ACL
work page 2022
-
[67]
Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? arXiv preprint arXiv:2508.05464, 2025
Matteo Prandi, Vincenzo Suriani, Federico Pierucci, Marcello Galisai, Daniele Nardi, and Piercosma Bisconti. Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance? arXiv preprint arXiv:2508.05464, 2025
- [68]
- [69]
-
[70]
Pierucci, F., Prandi, M., Bracale Syrnikov, M., Galisai, M., and Bisconti, P. (2026). Agentic microphysics: A manifesto for generative AI safety.arXiv:2604.15236
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[71]
Qin, Y., et al. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs.arXiv:2307.16789
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [72]
-
[73]
Available at: https://openstd.samr.gov
State Administration for Market Regulation and China National Standardization Admin- istration.GB/T 45654–2025 Cybersecurity Technology—Basic Security Requirements for Generative Artificial Intelligence Service. Available at: https://openstd.samr.gov. cn/bzgk/std/newGbInfo?hcno=F67D3F376E0A0A0FF5317FB36B32A30A. Last accessed: 19 May 2026
work page 2025
-
[74]
Available at: https://std.samr.gov.cn/gb/search/ gbDetailed?id=4C5277928DA2411EE06397BE0A0AE436
State Administration for Market Regulation.General Security Requirements for Artificial Intelligence Agent Application. Available at: https://std.samr.gov.cn/gb/search/ gbDetailed?id=4C5277928DA2411EE06397BE0A0AE436. Last accessed: 19 May 2026
work page 2026
-
[75]
Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [76]
-
[77]
Souly, A., et al. (2024). A StrongREJECT for empty jailbreaks.arXiv:2402.10260
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [78]
- [79]
-
[80]
P., Kaunismaa, J., and Panickssery, A
Sudhir, A. P., Kaunismaa, J., and Panickssery, A. (2025). A benchmark for scalable oversight protocols.arXiv:2504.03731. 43
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.