Evaluating Jailbreaking Vulnerabilities in LLMs Deployed as Assistants for Smart Grid Operations: A Benchmark Against NERC Standards
Pith reviewed 2026-05-08 07:44 UTC · model grok-4.3
The pith
LLMs used as smart grid assistants can be jailbroken into violating NERC reliability standards at an overall rate of 33 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that jailbreaking LLMs to produce outputs violating NERC standards in grid operation scenarios succeeds at an overall rate of 33.1 percent across tested models and methods. DeepInception proves the most effective technique at 63.17 percent success while Claude 3.5 Haiku exhibits complete resistance at zero percent. Gemini 2.0 Flash-Lite shows the highest vulnerability at 55.04 percent and GPT-4o mini reaches 44.34 percent. A follow-up refinement of wording in the simpler Baseline and BitBypass methods produces a comparable 30.6 percent overall success rate.
What carries the argument
Attack Success Rate (ASR) measured on responses to NERC-derived scenarios when Baseline, BitBypass, and DeepInception jailbreaking prompts are applied to GPT-4o mini, Gemini 2.0 Flash-Lite, and Claude 3.5 Haiku.
Load-bearing premise
The jailbreaking methods and NERC-derived scenarios used in the tests accurately capture realistic threats that authorized operators might pose through malicious prompts in actual operations.
What would settle it
Observing whether operators in a live or simulated grid control center can obtain and act on non-compliant advice by applying the tested prompts to the same LLMs without additional safeguards.
Figures
read the original abstract
The deployment of Large Language Models (LLMs) as assistants in electric grid operations promises to streamline compliance and decision-making but exposes new vulnerabilities to prompt-based adversarial attacks. This paper evaluates the risk of jailbreaking LLMs, i.e., circumventing safety alignments to produce outputs violating regulatory standards, assuming threats from authorized users, such as operators, who craft malicious prompts to elicit non-compliant guidance. Three state-of-the-art LLMs (OpenAI's GPT-4o mini, Google's Gemini 2.0 Flash-Lite, and Anthropic's Claude 3.5 Haiku) were tested against Baseline, BitBypass, and DeepInception jailbreaking methods across scenarios derived from nine NERC Reliability Standards (EOP, TOP, and CIP). In the initial broad experiment, the overall Attack Success Rate (ASR) was 33.1%, with DeepInception proving most effective at 63.17% ASR. Claude 3.5 Haiku exhibited complete resistance (0% ASR), while Gemini 2.0 Flash-Lite was most vulnerable (55.04% ASR) and GPT-4o mini moderately susceptible (44.34% ASR). A follow-up experiment refining malicious wording in Baseline and BitBypass attacks yielded a 30.6% ASR, confirming that subtle prompt adjustments can enhance simpler methods' efficacy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates jailbreaking vulnerabilities in three LLMs (GPT-4o mini, Gemini 2.0 Flash-Lite, Claude 3.5 Haiku) deployed as assistants for smart grid operations. Using scenarios derived from nine NERC Reliability Standards (EOP, TOP, CIP), it tests Baseline, BitBypass, and DeepInception attacks and reports an overall attack success rate (ASR) of 33.1% in the initial experiment (DeepInception at 63.17%, Claude at 0%, Gemini at 55.04%, GPT-4o mini at 44.34%), with a follow-up refined-prompt experiment yielding 30.6% ASR. The work assumes threats from authorized users and defines success as outputs that violate the standards.
Significance. If the attack-success classifications accurately reflect operational regulatory violations, the study provides a useful empirical benchmark at the intersection of LLM security and critical-infrastructure compliance. It supplies concrete model- and method-specific rates that could guide safeguard design for regulated domains, and its direct experimental (non-derivational) nature makes the measurements falsifiable once the classification protocol is fully specified.
major comments (3)
- [Abstract] Abstract: the central quantitative claims (overall ASR 33.1%, DeepInception 63.17%, model-specific rates) rest on an unstated mapping from LLM outputs to actual NERC violations. No explicit decision criteria, inter-rater protocol, trial counts, prompt examples, or domain-expert validation are supplied to distinguish actionable non-compliance from generic discussion or hypotheticals; without these the reported percentages cannot be interpreted as direct evidence of regulatory risk under the authorized-user threat model.
- [Abstract] Abstract (follow-up experiment): refining malicious wording for Baseline and BitBypass after seeing initial results introduces post-hoc selection that affects interpretation of the 30.6% ASR. The manuscript must clarify whether the refined prompts were chosen before or after the first round and report both sets of results with the same classification protocol.
- [Methods] Methods / Experimental Setup (implied): the weakest assumption—that the tested jailbreaking methods and NERC-derived scenarios represent realistic threats from authorized operators—requires justification. The paper should demonstrate that the chosen prompts and success criteria correspond to prompts an operator could realistically issue and to outputs that would produce measurable NERC non-compliance in an operational context.
minor comments (1)
- [Abstract] Abstract: the nine NERC standards are listed only by acronym (EOP/TOP/CIP); a brief parenthetical expansion or table reference would improve readability for readers outside the energy sector.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for improving transparency and rigor, particularly around classification protocols and threat model justification. We address each major comment below and will incorporate the suggested changes in the revised version.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central quantitative claims (overall ASR 33.1%, DeepInception 63.17%, model-specific rates) rest on an unstated mapping from LLM outputs to actual NERC violations. No explicit decision criteria, inter-rater protocol, trial counts, prompt examples, or domain-expert validation are supplied to distinguish actionable non-compliance from generic discussion or hypotheticals; without these the reported percentages cannot be interpreted as direct evidence of regulatory risk under the authorized-user threat model.
Authors: We agree that the original manuscript lacked sufficient detail on the output-to-violation mapping. In the revised version, we will add a new 'Output Classification Protocol' subsection to the Methods. This will explicitly define decision criteria for each NERC standard (EOP, TOP, CIP), provide concrete examples of LLM outputs classified as violations versus compliant or hypothetical responses, report the exact number of trials per scenario, and describe any validation steps (including consultation with domain experts on NERC compliance). These additions will allow the reported ASRs to be interpreted more directly as indicators of regulatory risk under the authorized-user model. revision: yes
-
Referee: [Abstract] Abstract (follow-up experiment): refining malicious wording for Baseline and BitBypass after seeing initial results introduces post-hoc selection that affects interpretation of the 30.6% ASR. The manuscript must clarify whether the refined prompts were chosen before or after the first round and report both sets of results with the same classification protocol.
Authors: The refined prompts for Baseline and BitBypass were developed after reviewing the initial experimental outcomes, specifically to test whether minor wording adjustments could improve attack efficacy for the simpler methods. We acknowledge this introduces a post-hoc element that affects causal interpretation of the 30.6% ASR. In the revision, we will clearly document the experimental timeline, present results from both the original and refined prompt sets in parallel tables using the identical classification protocol, and add a limitations paragraph discussing the implications for generalizability. revision: yes
-
Referee: [Methods] Methods / Experimental Setup (implied): the weakest assumption—that the tested jailbreaking methods and NERC-derived scenarios represent realistic threats from authorized operators—requires justification. The paper should demonstrate that the chosen prompts and success criteria correspond to prompts an operator could realistically issue and to outputs that would produce measurable NERC non-compliance in an operational context.
Authors: We will expand the Methods section with a new 'Threat Model and Realism Justification' subsection. This will articulate why Baseline, BitBypass, and DeepInception are plausible techniques available to authorized operators (e.g., via internal LLM access), map each NERC-derived scenario to realistic operational queries an operator might legitimately pose during grid management, and explain how a successful jailbreak output could translate into measurable non-compliance (e.g., delayed response times or bypassed controls under EOP/TOP/CIP). We will support this with references to documented operator error patterns and NERC enforcement cases where similar decision-making failures occurred. revision: yes
Circularity Check
No circularity: purely empirical measurement study
full rationale
The paper conducts direct experiments measuring Attack Success Rates (ASR) of jailbreaking methods on LLMs using scenarios derived from NERC standards. No equations, derivations, fitted parameters, or self-referential definitions appear in the reported results. ASR figures (e.g., 33.1% overall, model-specific rates) are presented as outcomes of prompt testing and output classification rather than quantities constructed from other fitted inputs or prior self-citations. The central claims rest on experimental data collection, not on any load-bearing self-citation chain or ansatz smuggled via prior work. This is a standard empirical benchmark study whose quantitative results are independent of the paper's own definitions.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
A Multi-Task LLM Framework for Multimodal Speech-Based Mental Health Prediction,
M. Ali, C. Lucasius, T. P . Patel, M. Aitken, J. V orstman, P . Szatmari, M. Battaglia, and D. Kundur, “A Multi-Task LLM Framework for Multimodal Speech-Based Mental Health Prediction,” in 2025 IEEE 21st International Conference on Body Sensor Networks (BSN ). Los Angeles, CA, USA: IEEE, Nov. 2025, pp. 1–4. [Online]. Availa ble: https://ieeexplore.ieee.or...
-
[2]
Self-Refi ned Generative Foundation Models for Wireless Traffic Predicti on,
C. Hu, H. Zhou, D. Wu, X. Chen, J. Y an, and X. Liu, “Self-Refi ned Generative Foundation Models for Wireless Traffic Predicti on,” IEEE Transactions on V ehicular Technology , pp. 1–6, 2025. [Online]. Available: https://ieeexplore.ieee.org/document/11269603/
-
[3]
Z. Shi, Y . Y uan, L. Cheng, and Y . Liu, “Reinforcement Lear ning- Guided Large Language Model Fine-Tuning for Privacy-Prese rving Text Rewriting,” in Proceedings of the Tenth ACM/IEEE Symposium on Edge Computing . the Hilton Arlington National Landing Arlington V A USA: ACM, Dec. 2025, pp. 1–7. [Online]. Availab le: https://dl.acm.org/doi/10.1145/376910...
-
[4]
Large Language Models for Detecting Cyberat tacks on Smart Grid Protective Relays,
A. Mohammad Saber, S. Jafari, Z. Ouyang, P . Budnarain, A. Y oussef, and D. Kundur, “Large Language Models for Detecting Cyberat tacks on Smart Grid Protective Relays,” IEEE Open Access Journal of Power and Energy , vol. 13, pp. 135–144, 2026. [Online]. Available: https://ieeexplore.ieee.org/document/11359713/
-
[5]
T owards explainable network intrusion detection using large langu age models,
P . R. B. Houssel, P . Singh, S. Layeghy, and M. Portmann, “T owards explainable network intrusion detection using large langu age models,” in 2024 IEEE/ACM International Conference on Big Data Computi ng, Applications and Technologies (BDCAT) , 2024, pp. 67–72
2024
-
[6]
Large language model for s mart in- verter cyber-attack detection via textual analysis of volt /var commands,
A. Selim, J. Zhao, and B. Y ang, “Large language model for s mart in- verter cyber-attack detection via textual analysis of volt /var commands,” IEEE Transactions on Smart Grid , vol. 15, no. 6, pp. 6179–6182, 2024
2024
-
[7]
Chatgpt an d other large language models for cybersecurity of smart grid applicatio ns,
A. Zaboli, S. L. Choi, T.-J. Song, and J. Hong, “Chatgpt an d other large language models for cybersecurity of smart grid applicatio ns,” in 2024 IEEE Power & Energy Society General Meeting (PESGM) , 2024, pp. 1–5
2024
-
[8]
Analyzing Agent Collisions in AI-Aided Energy Management Systems,
Y . Y uan, Y . Zeng, H. Li, J. Gao, X. Y ang, M. Ghafouri, Y . Liu , and J. Y an, “Analyzing Agent Collisions in AI-Aided Energy Management Systems,” in 2025 IEEE International Conference on Communications, Control, and Computing Technologies for S mart Grids (SmartGridComm). North Y ork, ON, Canada: IEEE, Sep. 2025, pp. 1–
2025
-
[9]
Available: https://ieeexplore.ieee.org/document/11204591/
[Online]. Available: https://ieeexplore.ieee.org/document/11204591/
-
[10]
Scene- aware non-intrusive load monitoring using large language m odels,
H. Chen, J. Chen, Y . Chai, W. Guo, C. Jia, B. Y ang, and Z. Xin , “Scene- aware non-intrusive load monitoring using large language m odels,” IEEE Transactions on Smart Grid , vol. 17, no. 1, pp. 874–876, 2026
2026
-
[11]
M. Sharshar, A. M. Saber, D. Svetinovic, A. M. Y oussef, D . Kundur, and E. F. El-Saadany, “Large Language Model-Based Framewor k for Explainable Cyberattack Detection in Automatic Generatio n Control Systems,” in 2025 IEEE Electrical Power and Energy Conference (EPEC). Waterloo, ON, Canada: IEEE, Oct. 2025, pp. 424–429. [Online]. Available: https://ieeex...
-
[12]
J. Chen, F. Wang, S. Pang, M. Chen, M. Xi, T. Zhao, and J. Yi n, “A Privacy Policy Text Compliance Reasoning Framework with Large Language Models for Healthcare Services,” Tsinghua Science and Technology, vol. 30, no. 4, pp. 1831–1845, Aug. 2025. [Online]. Available: https://ieeexplore.ieee.org/document/10908666/
-
[13]
Connecting Minds: AI Use Cas es to Bridge Power Systems and Large Language Models for Practica l Ap- plications,
Y . Chen and A. A. Anderson, “Connecting Minds: AI Use Cas es to Bridge Power Systems and Large Language Models for Practica l Ap- plications,” Pacific Northwest National Laboratory (PNNL) , Richland, W A (United States), Tech. Rep., 2025
2025
-
[14]
egridgpt: Trustworthy ai in the control room,
S. L. Choi, R. Jain, P . Emami, K. Wadsack, F. Ding, H. Sun, K. Gruchalla, J. Hong, H. Zhang, X. Zhu et al. , “egridgpt: Trustworthy ai in the control room,” National Renewable Energy Laborato ry (NREL), Golden, CO (United States), Tech. Rep., 2024
2024
-
[15]
Causality-aware llm-enhanced graph representation lear ning for adap- tive power system control,
F. Y ao, J. Liu, Y . Tao, J. Qiu, H. H.-C. Iu, G. Chen, and Z. Y . Dong, “Causality-aware llm-enhanced graph representation lear ning for adap- tive power system control,” IEEE Transactions on Industrial Informatics, pp. 1–12, 2026
2026
-
[16]
Powergrap h-llm: Novel power grid graph embedding and optimization with large lang uage models,
F. Bernier, J. Cao, M. Cordy, and S. Ghamizi, “Powergrap h-llm: Novel power grid graph embedding and optimization with large lang uage models,” IEEE Transactions on Power Systems, vol. 40, no. 6, pp. 5483– 5486, 2025
2025
-
[17]
Jailbroken: H ow Does LLM Safety Training Fail?
A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: H ow Does LLM Safety Training Fail?” in Advances in Neural Information Processing Systems , A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 80 079–80 110. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/202 3/fil...
2023
-
[18]
Ge nerative AI and LLMs for critical infrastructure protection: evalua tion bench- marks, agentic AI, challenges, and opportunities,
Y . Yigit, M. A. Ferrag, M. C. Ghanem, I. H. Sarker, L. A. Ma glaras, C. Chrysoulas, N. Moradpoor, N. Tihanyi, and H. Janicke, “Ge nerative AI and LLMs for critical infrastructure protection: evalua tion bench- marks, agentic AI, challenges, and opportunities,” Sensors, vol. 25, no. 6, p. 1666, 2025
2025
-
[19]
BitBypass: A new direction in ja ilbreaking aligned large language models with bitstream camouflage,
K. Nakka and N. Saxena, “BitBypass: A new direction in ja ilbreaking aligned large language models with bitstream camouflage,” i n Findings of the Association for Computational Linguistics: EACL 202 6, V . Demberg, K. Inui, and L. Marquez, Eds. Rabat, Morocco: Association for Computational Linguistics, Mar. 2026, pp. 3808–3834. [Online]. Available: https:/...
2026
-
[20]
DeepInception: Hypnotize Large Language Model to Be Jailbreaker
X. Li, Z. Zhou, J. Zhu, J. Y ao, T. Liu, and B. Han, “DeepInc eption: Hypnotize Large Language Model to Be Jailbreaker,” Nov. 202 4, arXiv:2311.03191. [Online]. Available: http://arxiv.org/abs/2311.03191
work page internal anchor Pith review arXiv
-
[21]
CIP reliabil- ity standards,
North American Electric Reliability Corporation (NER C), “CIP reliabil- ity standards,” https://www.nerc.com/standards/reliability-standards/cip, 2026, Critical Infrastructure Protection (CIP) Standards
2026
-
[22]
TOP reliability standards,
——, “TOP reliability standards,” https://www.nerc.com/standards/reliability-standards/top 2026, Transmission Operations (TOP) Standards
2026
-
[23]
EOP reliability standards,
——, “EOP reliability standards,” https://www.nerc.com/standards/reliability-standards/eop 2026, Emergency Operations Planning (EOP) Standards
2026
-
[24]
Rlhf deciphered: A critical analysis of reinforcement learning from human feedback for llms
S. Chaudhari, P . Aggarwal, V . Murahari, T. Rajpurohit, A. Kalyan, K. Narasimhan, A. Deshpande, and B. Castro Da Silva, “RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs,” ACM Computing Surveys , vol. 58, no. 2, pp. 1–37, Jan. 2026. [Online]. Available: https://dl.acm.org/doi/10.1145/3743127
-
[25]
Gpt-4o mini: advancing cost-efficient intelligence,
OpenAI, “Gpt-4o mini: advancing cost-efficient intelligence,” Jul 2024. [Online]. Available: https://openai.com/index/gpt-4o-mini-advancing-cost -efficient-intelligence/
2024
-
[26]
Gemini 2.0 flash-lite,
Google, “Gemini 2.0 flash-lite,” Apr 2026. [Online]. Av ailable: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash-lite
2026
-
[27]
Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku,
Anthropic, “Introducing computer use, a new claude 3.5 sonnet, and claude 3.5 haiku,” 2024. [Online]. Available: https://www.anthropic.com/news/3-5-models-and-computer-use
2024
-
[28]
Royal Society Open Science , author =
S. Wachter, B. Mittelstadt, and C. Russell, “Do large la nguage models have a legal duty to tell the truth?” Royal Society Open Science, vol. 11, no. 8, p. 240197, Aug. 2024. [Online]. Available: https://royalsocietypublishing.org/doi/10.1098/rsos.240197
-
[29]
An Efficient Finetuning Method for LLM generated text detection in Power Grid,
Y . Jiang, J. Li, X. Zhang, W. Xu, Z. Liang, Y . Y ang, K. Huang, and L. Bi, “An Efficient Finetuning Method for LLM generated text detection in Power Grid,” in 2025 IEEE/CIC International Conference on Communications in China (ICCC ). Shanghai, China: IEEE, Aug. 2025, pp. 1–6. [Online]. Availa ble: https://ieeexplore.ieee.org/document/11148942/
-
[30]
Applying Fine-tuned Large Language Model to Distribution System State Estimation,
G. Mingyang, Z. Suyang, Z. Wennan, F. Jili, L. Haiquan, and Z. Aihua, “Applying Fine-tuned Large Language Model to Distribution System State Estimation,” in 2025 4th International Conference on Power Systems and Electrical Technology (PSE T). Tokyo, Japan: IEEE, Aug. 2025, pp. 554–559. [Online]. Avail able: https://ieeexplore.ieee.org/document/11296549/
-
[31]
Robu st Electricity Theft Detection Against Data Poisoning Attacks in Smart Gri ds,
A. Takiddin, M. Ismail, U. Zafar, and E. Serpedin, “Robu st Electricity Theft Detection Against Data Poisoning Attacks in Smart Gri ds,” IEEE Transactions on Smart Grid , vol. 12, no. 3, pp. 2675–2684, May 2021. [Online]. Available: https://ieeexplore.ieee.org/document/9310227/
-
[32]
A. M. Saber, H. E. Z. Farag, A. Y oussef, and D. Kundur, “A Model-Independent Trojan Attack on Deep Learning-Based FD IA Detection in Smart Grid Protection Systems,” IEEE Transactions on Instrumentation and Measurement , vol. 74, pp. 1–13, 2025. [Online]. Available: https://ieeexplore.ieee.org/document/11082354/
-
[33]
Securing IoT Ma lware Classifiers: Dynamic Trigger-Based Attack and Mitigation,
Y . Zhang, J. Y an, S. Torabi, and C. Assi, “Securing IoT Ma lware Classifiers: Dynamic Trigger-Based Attack and Mitigation, ” in ICC 2024 - IEEE International Conference on Communications . Denver, CO, USA: IEEE, Jun. 2024, pp. 4638–4643. [Online]. Availabl e: https://ieeexplore.ieee.org/document/10622307/
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.