pith. sign in

arxiv: 2606.20408 · v2 · pith:7X35R26Inew · submitted 2026-06-18 · 💻 cs.CR · cs.AI

NRT-Bench: Benchmarking Multi-Turn Red-Teaming of LLM Operator Agents in Safety-Critical Control Rooms

Pith reviewed 2026-06-26 17:11 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM agentsred-teamingmulti-turn attackssafety-critical systemsnuclear plant simulationoperator agentsadversarial robustnessbenchmark
0
0 comments X

The pith

Adaptive multi-turn attacks push LLM operator teams past safety limits in 8.7-12.1% of nuclear plant simulation sessions, with nearly disjoint failures across models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NRT-Bench to measure how LLM agents perform as operators of a simulated nuclear power plant when subjected to sustained, adaptive adversarial messages. It establishes that multi-turn attacks cause the loss of at least one critical safety function in roughly one in ten sessions across four frontier models. Failures show little overlap between models, and the impact of added defenses varies sharply by model rather than helping all equally. A sympathetic reader would care because LLM agents are being considered for supervisory roles in safety-critical systems, so concrete rates of safety violation under realistic pressure provide a direct test of that proposal.

Core claim

We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system, instantiated in a simulated nuclear power plant control room. A five-role operator team, each backed by a configurable LLM, runs a plant governed by six critical safety functions, while adversaries inject messages over four channels in bounded multi-turn sessions with per-turn feedback. Evaluating four frontier operator models under a fixed-attack paired-replay protocol, we find that adaptive multi-turn attacks reliably push the operator team past a safety limit: across the four models, between 8.7% and 12.1% of attack sessions end with the plant losing a critical safet

What carries the argument

NRT-Bench benchmark, which runs a five-role LLM operator team on a simulated nuclear plant governed by six critical safety functions, with adversaries sending messages over four channels and harm measured objectively by loss of any safety function.

If this is right

  • The same guardrail stack or safety-advisor agent that lowers attack success for one model can raise it for another.
  • Of 149 sessions none defeat all four models while a third defeat at least one, so vulnerabilities are nearly disjoint across models rather than nested.
  • Aggregate robustness rates alone do not reveal the full picture because model failures barely overlap.
  • Defense effectiveness must be measured per model rather than assumed to be uniform.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the benchmark results generalize, safety evaluation of LLM operators will need to include model-specific attack suites rather than a single shared test set.
  • Extending the same replay protocol to human operators in the identical simulation would isolate whether the observed failure rates are specific to current LLM architectures.
  • The four-channel attack setup could be reused to compare robustness when the underlying plant model is replaced by simulators from other high-stakes domains.

Load-bearing premise

The simulated nuclear plant, the six critical safety functions, and the four attack channels provide a faithful proxy for real safety-critical operator behavior under adversarial pressure.

What would settle it

Replaying the full set of 149 attack sessions against a larger collection of models and counting the fraction of sessions that defeat more than one model would directly test whether the near-disjoint vulnerability pattern holds or collapses.

Figures

Figures reproduced from arXiv: 2606.20408 by Bokyeong Kim, Dasol Choi, Hanwool Lee, Haon Park, Seung Geun Kim.

Figure 1
Figure 1. Figure 1: NRT-Bench at a glance. A multi-turn red-team attacker probes a safety-critical control-room venue through four ingress [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Attack-agent I/O contract. The agent consumes a goal, a strategy-family label, the session history, and the previous [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
read the original abstract

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system, instantiated in a simulated nuclear power plant control room. A five-role operator team, each backed by a configurable LLM, runs a plant governed by six critical safety functions (CSFs), while adversaries inject messages over four channels in bounded multi-turn sessions with per-turn feedback. Harm is an objective signal rather than LLM-judged text: a run terminates the moment any CSF is lost, attributed to the causing message. Evaluating four frontier operator models under a fixed-attack paired-replay protocol, we find that adaptive multi-turn attacks reliably push the operator team past a safety limit: across the four models, between 8.7% and 12.1% of attack sessions end with the plant losing a critical safety function. Although the four models look almost equally robust by this aggregate rate, their failures barely overlap: of $149$ sessions, none defeat all four models while a third defeat at least one, so vulnerabilities are nearly disjoint across models rather than nested. The effect of added defences is strongly model-dependent: the same guardrail stack or safety-advisor agent that lowers attack success for one model can raise it for another. We release the simulation venue, attack dataset, and replay tooling for reproducible safety evaluation of LLM agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents operating a simulated nuclear power plant control room. A five-role team (each backed by an LLM) manages six critical safety functions (CSFs) while adversaries inject messages over four channels in bounded sessions with per-turn feedback. Harm is defined objectively as loss of any CSF. Evaluating four frontier models under a fixed-attack paired-replay protocol yields failure rates of 8.7–12.1% across models; of 149 sessions, none defeat all four models while roughly one-third defeat at least one, indicating nearly disjoint vulnerabilities. Defense effects are model-dependent. The simulation, attack dataset, and replay tooling are released.

Significance. If the reported rates and disjointness hold under the stated protocol, the work supplies an objective, reproducible testbed for LLM-agent robustness in safety-critical settings and demonstrates that aggregate robustness metrics can mask model-specific failure modes. The release of the venue, dataset, and tooling is a concrete strength for future empirical work.

major comments (1)
  1. [Abstract] Abstract: the central claim that 'adaptive multi-turn attacks reliably push the operator team past a safety limit' and that the benchmark characterizes robustness 'under sustained, adaptive adversarial pressure' is contradicted by the explicit statement that evaluation occurs 'under a fixed-attack paired-replay protocol' in which 'adversaries inject messages' via pre-determined sequences. Fixed replays do not permit the adversary to condition subsequent messages on per-turn operator responses, so the reported 8.7–12.1% rates and the disjoint-failure observation rest on an untested equivalence between non-adaptive traces and adaptive pressure. This is load-bearing for the headline result.
minor comments (1)
  1. [Abstract] Abstract: the reported percentages (8.7%, 12.1%) and session count (149) are given without accompanying totals per model, generation procedure for the attack dataset, exclusion criteria, or confidence intervals, making it impossible to assess statistical support from the abstract alone.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for identifying this important inconsistency in the abstract. We agree that the language requires revision to accurately reflect the fixed-attack paired-replay protocol.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'adaptive multi-turn attacks reliably push the operator team past a safety limit' and that the benchmark characterizes robustness 'under sustained, adaptive adversarial pressure' is contradicted by the explicit statement that evaluation occurs 'under a fixed-attack paired-replay protocol' in which 'adversaries inject messages' via pre-determined sequences. Fixed replays do not permit the adversary to condition subsequent messages on per-turn operator responses, so the reported 8.7–12.1% rates and the disjoint-failure observation rest on an untested equivalence between non-adaptive traces and adaptive pressure. This is load-bearing for the headline result.

    Authors: We agree with the referee that the abstract's phrasing is imprecise and potentially misleading. The evaluation uses fixed, pre-determined attack sequences replayed in a paired manner; these sequences do not adapt based on the operator responses during the evaluation run. The term 'adaptive' was intended to describe the multi-turn generation process used to create the attack dataset, but it does not apply to the evaluation protocol itself. The reported failure rates and disjoint-vulnerability observations therefore characterize performance against these specific fixed traces rather than against dynamically adaptive adversaries. We will revise the abstract (and any similar language in the introduction) to explicitly state that attacks are fixed sequences and to remove references to 'adaptive pressure' in the context of the benchmark results. This revision will be incorporated in the next manuscript version. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical counts from fixed simulation protocol

full rationale

The paper reports direct empirical failure rates (8.7–12.1%) from simulation runs under an explicitly described fixed-attack paired-replay protocol. No equations, fitted parameters, self-citations, or ansatzes are present that reduce any result to its own inputs by construction. The reported percentages and disjoint-failure observations are straightforward tallies of objective CSF-loss events; the benchmark is self-contained against external simulation runs and does not invoke uniqueness theorems or prior author work to force its conclusions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the simulated plant and CSF definitions are representative; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption The simulated nuclear power plant with six critical safety functions and four attack channels accurately models real safety-critical operator behavior under adversarial pressure.
    Invoked when the abstract treats CSF loss as an objective harm signal that generalizes beyond the simulation.

pith-pipeline@v0.9.1-grok · 5824 in / 1378 out tokens · 28312 ms · 2026-06-26T17:11:31.802463+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 7 linked inside Pith

  1. [1]

    Ahmed Almeldein et al. 2025. Exploring the Capabilities of the Frontier Large Language Models for Nuclear Energy Research. arXiv preprint arXiv:2506.19863

  2. [2]

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Due- nas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. 2025. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. InInternational Conference on Learning Representations (ICLR). OpenReview.net, Singapore

  3. [3]

    Anthropic. 2025. Claude Haiku 4.5 System Card. https://www.anthropic.com/ claude-haiku-4-5-system-card. Released October 15, 2025

  4. [4]

    Blakely, and Nidhi Rastogi

    Dipkamal Bhusal, Md Tanvirul Alam, Le Nguyen, Ashim Mahara, Zachary Light- cap, Rodney Frazier, Romy Fieblinger, Grace Long Torales, Benjamin A. Blakely, and Nidhi Rastogi. 2024. SECURE: Benchmarking Large Language Models for Cybersecurity. InAnnual Computer Security Applications Conference (ACSAC). IEEE, Honolulu, HI, USA, 15–30

  5. [5]

    Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. 2024. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. In Advances in Neural Information Processing Systems 37 (Ne...

  6. [6]

    Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. InAdvances in Neural Information Processing Systems 37 (NeurIPS Datasets and Benchmarks Track), Vol. 37. Curran Associates, Inc., Red Hook, NY, USA...

  7. [7]

    2026.DeepSeek-V4 Technical Report

    DeepSeek-AI. 2026.DeepSeek-V4 Technical Report. Technical Report. DeepSeek AI. https://fe-static.deepseek.com/chat/transparency/deepseek-V4-model-card- EN.pdf

  8. [8]

    2026.Gemma 4 Technical Report

    Google DeepMind. 2026.Gemma 4 Technical Report. Technical Report. Google DeepMind. https://deepmind.google/models/gemma/

  9. [9]

    Pengfei He, Yupin Lin, Shen Dong, Han Xu, Yue Xing, and Hui Liu. 2025. Red- Teaming LLM Multi-Agent Systems via Communication Attacks. arXiv preprint arXiv:2502.14847

  10. [10]

    Gustav Keppler, Moritz Gstür, and Veit Hagenmeyer. 2026. CritBench: A Frame- work for Evaluating Cybersecurity Capabilities of Large Language Models in IEC 61850 Digital Substation Environments. arXiv preprint arXiv:2604.06019

  11. [11]

    Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko. 2025. OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents. arXiv preprint arXiv:2506.14866

  12. [12]

    Yoon Pyo Lee. 2025. Mechanistic Interpretability of LoRA-Adapted Language Models for Nuclear Reactor Safety Applications. arXiv preprint arXiv:2507.09931

  13. [13]

    Y. P. Lee et al. 2025. Large Language Model Agent for Nuclear Reactor Operation Assistance.Nuclear Engineering and Technology57 (2025), 103842

  14. [14]

    Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, and Summer Yue. 2024. LLM De- fenses Are Not Robust to Multi-Turn Human Jailbreaks Yet. arXiv preprint arXiv:2408.15221

  15. [15]

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. InProceedings of the 41st Interna- tional Conference on Machine Learning (ICML) (Proceedings of Machi...

  16. [16]

    Rui Miao, Yixin Liu, Yili Wang, Xu Shen, Yue Tan, Yiwei Dai, Shirui Pan, and Xin Wang. 2025. BlindGuard: Safeguarding LLM-based Multi-Agent Systems under H. Lee et al Unknown Attacks. arXiv preprint arXiv:2508.08127

  17. [17]

    OpenAI. 2026. Introducing GPT-5.4 mini and nano. https://openai.com/index/ introducing-gpt-5-4-mini-and-nano/. Released March 17, 2026; system card addendum

  18. [18]

    Qwen Team. 2026. Qwen3.5: Accelerating Productivity with Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5

  19. [19]

    Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2025. Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. In34th USENIX Security Symposium (USENIX Security). USENIX Association, Seattle, WA, USA, 2421–2440

  20. [20]

    Jialin Song, Xiaodong Liu, Weiwei Yang, Wuyang Chen, Mingqian Feng, Xuekai Zhu, and Jianfeng Gao. 2026. MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety. arXiv preprint arXiv:2605.01687

  21. [21]

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer

  22. [22]

    arXiv preprint arXiv:2402.10260

    A StrongREJECT for Empty Jailbreaks. arXiv preprint arXiv:2402.10260

  23. [23]

    Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, and Hui Li. 2024. Multi-Turn Context Jailbreak Attack on Large Language Models from First Prin- ciples. arXiv preprint arXiv:2408.04686

  24. [24]

    Nuclear Regulatory Commission

    U.S. Nuclear Regulatory Commission. 1982.Guidelines for the Preparation of Emergency Operating Procedures. Technical Report NUREG-0899. U.S. Nuclear Regulatory Commission, Washington, DC. Office of Nuclear Reactor Regulation, Division of Human Factors Safety

  25. [25]

    Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, and Maarten Sap. 2026. OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety. arXiv preprint arXiv:2507.06134. Accepted to ICLR 2026

  26. [26]

    1983.Emergency Response Guidelines: Critical Safety Function Status Trees

    Westinghouse Owners Group. 1983.Emergency Response Guidelines: Critical Safety Function Status Trees. Technical Report. Westinghouse Electric Corpora- tion. Revision 1; reviewed by the U.S. NRC in Generic Letter 83-22

  27. [27]

    Min Xian, Tao Wang, Sai Zhang, Fei Xu, and Zhegang Ma. 2024. A Knowledge- Informed Large Language Model Framework for U.S. Nuclear Power Plant Shut- down Initiating Event Classification for Probabilistic Risk Assessment. arXiv preprint arXiv:2410.00929

  28. [28]

    Yizhe Xie, Congcong Zhu, Xinyue Zhang, Tianqing Zhu, Dayong Ye, Minfeng Qi, Huajie Chen, and Wanlei Zhou. 2026. From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration. arXiv preprint arXiv:2603.04474

  29. [29]

    Tianxiang Xu, Zhichao Wen, Xinyu Zhao, Jun Wang, Yan Li, and Chang Liu. 2025. L2M-AID: Autonomous Cyber-Physical Defense by Fusing Semantic Reasoning of Large Language Models with Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2510.07363

  30. [30]

    Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, and Diogo Cruz. 2025. Multi-Turn Jailbreaks Are Simpler Than They Seem. arXiv preprint arXiv:2508.07646

  31. [31]

    Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Petzold, Yi Xu, and William Yang Wang. 2025. Many-Turn Jailbreaking. arXiv preprint arXiv:2508.06755

  32. [32]

    Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, and Gongshen Liu. 2024. R-Judge: Benchmarking Safety Risk Awareness for LLM Agents. InFindings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics, Miami, Florida, U...

  33. [33]

    Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. 2024. Agent-SafetyBench: Evaluating the Safety of LLM Agents. arXiv preprint arXiv:2412.14470

  34. [34]

    Andy Zhou and Ron Arel. 2025. Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search. arXiv preprint arXiv:2503.10619

  35. [35]

    all defences off

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043. NRT-Bench: Benchmarking Multi-Turn Red-Teaming of LLM Operator Agents in Safety-Critical Control Rooms A Plant Domain Definitions The simulator’s plant mo...