Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
Pith reviewed 2026-05-23 18:59 UTC · model grok-4.3
The pith
Faster-GCG cuts the evaluations needed for jailbreaks from 256K to 32K by fixing inaccurate gradients, uniform sampling, and repeated suffix checks in GCG.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Faster-GCG incorporates distance-based regularization for improved gradient estimation, temperature-controlled sampling for more effective exploration, and visited-suffix marking to avoid redundant evaluations. These changes reduce the required evaluations per harmful behavior from approximately 256K to 32K, delivering up to an 8× gain in sampling efficiency and a 7× reduction in wall-clock time while reaching an average success rate of 78.1% across five aligned LLMs and 88.7% on Qwen3.5-4B.
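The headline ratio follows directly from the two budgets quoted above. A quick arithmetic check, assuming "K" denotes 1,000 (the ratio is the same if it denotes 1,024):

```python
# Evaluation budgets per harmful behavior, as quoted in the summary.
# Assumption: "K" means 1,000; the ratio is unchanged if it means 1,024.
GCG_EVALS = 256_000
FASTER_GCG_EVALS = 32_000


def efficiency_gain(baseline_evals: int, new_evals: int) -> float:
    """Return how many times fewer evaluations the new method needs."""
    return baseline_evals / new_evals


gain = efficiency_gain(GCG_EVALS, FASTER_GCG_EVALS)
print(f"sampling-efficiency gain: {gain:.0f}x")
```

The 7× wall-clock figure is a separate measurement reported by the paper and is not derived from this ratio.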
What carries the argument
Three modifications carry the argument: distance-based regularization for gradient estimation, temperature-controlled sampling, and visited-suffix marking. Together they reduce redundancy and improve exploration in the discrete token search.
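To make the sampling modification concrete, here is a minimal sketch of temperature-controlled sampling over candidate replacement tokens. It assumes each candidate already has a scalar score where lower means more promising (for example, an estimated loss change). The function names, the softmax form, and the example scores are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def temperature_probs(scores, temperature):
    """Softmax over negated scores: lower score -> higher probability.

    Assumption: `scores` are estimated loss changes per candidate token.
    A small temperature concentrates mass on the best candidates; a large
    temperature approaches uniform sampling over the candidates.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [-s / temperature for s in scores]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_candidate(scores, temperature, rng):
    """Draw one candidate index according to the temperature distribution."""
    probs = temperature_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Hypothetical scores for four candidate tokens (lower is better).
scores = [-2.0, -1.0, 0.0, 3.0]
sharp = temperature_probs(scores, temperature=0.5)
flat = temperature_probs(scores, temperature=100.0)
rng = random.Random(0)
picked = sample_candidate(scores, temperature=0.5, rng=rng)
```

With a low temperature the best-scoring candidate dominates, while a very high temperature makes the four probabilities nearly equal, which is the uniform behavior the paper identifies as inefficient.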
If this is right
- Jailbreak generation becomes feasible under far lower query budgets than previously reported.
- White-box attacks can match or exceed other methods when total evaluations are capped at 32K.
- The same search procedure works across multiple model families without per-model retuning.
- Reducing repeated evaluations frees budget that can be redirected toward longer or more diverse candidate sequences.
Where Pith is reading between the lines
- Alignment procedures may need to include efficient search during training so that models remain robust even when attackers use optimized discrete search.
- The same three fixes could be tested on other discrete optimization tasks such as prompt compression or automated red-teaming outside the jailbreak setting.
- If the efficiency gain scales with model size, testing alignment on thousands of behaviors could become routine rather than exceptional.
Load-bearing premise
The three modifications improve exploration and reduce redundancy without lowering attack success rates or introducing new biases in the reported experiments.
What would settle it
An experiment that applies Faster-GCG to a new collection of harmful behaviors or models at the 32K budget and measures success rates no higher than those obtained by unmodified GCG at the same budget.
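The decisive comparison can be phrased as a small check. The sketch below assumes each method has been run on the same behaviors at the same 32K budget and that outcomes are recorded as booleans. The function names are hypothetical and the outcome lists are placeholders for illustration, not results.

```python
def attack_success_rate(outcomes):
    """Fraction of behaviors for which the attack succeeded."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def claim_is_refuted(faster_outcomes, gcg_outcomes):
    """True when Faster-GCG does no better than GCG at the same budget.

    Assumption: both lists cover the same behaviors, in the same order,
    and both methods were capped at the same number of evaluations.
    """
    if len(faster_outcomes) != len(gcg_outcomes):
        raise ValueError("methods must be scored on the same behaviors")
    return attack_success_rate(faster_outcomes) <= attack_success_rate(gcg_outcomes)


# Placeholder outcomes only; these are not measurements from the paper.
example_faster = [True, True, False, True]
example_gcg = [True, False, False, True]
refuted = claim_is_refuted(example_faster, example_gcg)
```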
Original abstract
Aligned Large Language Models (LLMs) have attracted significant attention for their safety, particularly in the context of jailbreak attacks that attempt to bypass guardrails via adversarial prompts. Among existing approaches, the Greedy Coordinate Gradient (GCG) attack pioneered automated jailbreaks through discrete token optimization; however, its low sample efficiency limits practical applicability. In particular, GCG requires approximately 256K evaluations per harmful behavior to achieve a satisfactory jailbreak success rate, due to the inherent difficulty of the underlying discrete optimization problem. In this work, we identify three key factors that limit the sample efficiency of GCG: inaccurate gradient-based estimation, inefficient uniform sampling, and repeated evaluation of previously explored suffixes. To address these issues, we propose Faster-GCG, a streamlined variant of GCG that incorporates distance-based regularization for improved estimation, temperature-controlled sampling for more effective exploration, and a visited-suffix marking mechanism to avoid redundant evaluations. Faster-GCG reduced the required evaluations to 32K, achieving up to an 8× improvement in sampling efficiency and a 7× reduction in wall-clock time compared to GCG. Under this reduced budget, Faster-GCG attained an average jailbreak success rate of 78.1% across five aligned LLMs, and achieved 88.7% against Qwen3.5-4B, outperforming state-of-the-art white-box jailbreak methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Faster-GCG, an efficiency-enhanced variant of the Greedy Coordinate Gradient (GCG) discrete optimization attack for jailbreaking aligned LLMs. It identifies three bottlenecks in standard GCG (inaccurate gradient estimates, uniform sampling, and repeated suffix evaluations) and introduces three modifications—distance-based regularization on gradients, temperature-controlled sampling, and a visited-suffix marking mechanism—to reduce the per-behavior evaluation budget from 256K to 32K. The manuscript reports up to 8× sampling efficiency gains and 7× wall-clock time reduction, with an average attack success rate (ASR) of 78.1% across five models and 88.7% on Qwen3.5-4B, claiming to outperform prior white-box methods under the reduced budget.
Significance. If the efficiency improvements prove robust under controlled conditions, the work would be a useful practical contribution to automated red-teaming research by lowering the computational barrier for large-scale jailbreak experiments. The explicit focus on sample efficiency metrics and concrete wall-clock comparisons is a positive aspect of the empirical framing.
major comments (2)
- [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central claim that the three modifications improve sample efficiency while preserving ASR rests on comparisons performed at unequal evaluation budgets (32K vs. 256K). No component-wise ablations are reported that hold the total evaluation budget fixed at 32K and isolate the contribution of distance-based regularization, temperature sampling, and visited-suffix marking individually on identical harmful behaviors and models. This omission leaves open the possibility that observed ASR differences arise from changes in the effective search distribution rather than genuine efficiency gains.
- [§3.2] §3.2 (Method): The regularization coefficient and sampling temperature are listed as free parameters. The manuscript does not report sensitivity analysis or default values that would allow reproduction of the 32K-budget results without additional tuning, which directly affects the reproducibility of the claimed 8× efficiency improvement.
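The fixed-budget ablation requested in the first major comment amounts to enumerating every on/off combination of the three components at the same budget. A minimal sketch, assuming hypothetical flag names and a budget of 32,000 evaluations per behavior:

```python
from itertools import product

# Hypothetical flag names for the three modifications described in the paper.
COMPONENTS = ("distance_regularization", "temperature_sampling", "visited_marking")
BUDGET = 32_000  # evaluations per behavior, held fixed for every configuration


def ablation_configs(components=COMPONENTS, budget=BUDGET):
    """Return one config per on/off combination, all at the same budget."""
    configs = []
    for flags in product((False, True), repeat=len(components)):
        config = dict(zip(components, flags))
        config["budget"] = budget
        configs.append(config)
    return configs


configs = ablation_configs()
baseline = configs[0]      # all components off: plain GCG at the reduced budget
full_method = configs[-1]  # all components on: the combined method
```

Running all eight configurations on the same behaviors and models would separate each component's effect from the effect of the budget itself.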
minor comments (2)
- [Table 1, Figure 2] Table 1 and Figure 2: Axis labels and caption text do not explicitly state whether the reported ASR values are computed over the same set of 100 harmful behaviors for all methods or whether any post-hoc filtering was applied.
- [§3.3] §3.3: The visited-suffix marking mechanism is described at a high level but lacks an explicit algorithmic listing or complexity analysis showing how marking is implemented without increasing per-step overhead.
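One way the visited-suffix mechanism could be implemented with constant expected overhead per candidate is a hash set keyed on the suffix token ids. The sketch below is an assumption about a plausible implementation, not the paper's algorithm; `evaluate_loss` is a hypothetical stand-in for a model forward pass.

```python
def evaluate_new_suffixes(candidates, visited, evaluate_loss):
    """Evaluate only suffixes not seen before and mark them as visited.

    candidates: iterable of token-id sequences proposed in this step.
    visited: set of tuples, shared across steps and updated in place.
    evaluate_loss: hypothetical callable standing in for a forward pass.
    Set membership tests and inserts are O(1) on average (plus the cost of
    hashing the suffix), so marking adds little per-step overhead.
    """
    results = {}
    for suffix in candidates:
        key = tuple(suffix)  # lists are unhashable; tuples can be set members
        if key in visited:
            continue  # this suffix already consumed budget earlier
        visited.add(key)
        results[key] = evaluate_loss(key)
    return results


# Toy usage with a stand-in loss (sum of token ids) and a call log.
calls = []


def toy_loss(suffix):
    calls.append(suffix)
    return float(sum(suffix))


visited = set()
first = evaluate_new_suffixes([[1, 2], [3, 4], [1, 2]], visited, toy_loss)
second = evaluate_new_suffixes([[3, 4], [5, 6]], visited, toy_loss)
```

In the toy run, five candidates are proposed across two steps but only three distinct suffixes reach the loss function.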
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and indicate the revisions we will make.
- Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central claim that the three modifications improve sample efficiency while preserving ASR rests on comparisons performed at unequal evaluation budgets (32K vs. 256K). No component-wise ablations are reported that hold the total evaluation budget fixed at 32K and isolate the contribution of distance-based regularization, temperature sampling, and visited-suffix marking individually on identical harmful behaviors and models. This omission leaves open the possibility that observed ASR differences arise from changes in the effective search distribution rather than genuine efficiency gains.
  Authors: We agree that component-wise ablations at a fixed 32K budget would more rigorously isolate the contribution of each modification and rule out distributional effects. Our primary experiments demonstrate the end-to-end sample-efficiency gain of the combined Faster-GCG method relative to standard GCG at its conventional budget. To strengthen the evidence, we will add fixed-budget ablations of each component (regularization, temperature sampling, and visited-suffix marking) on the same behaviors and models in the revised manuscript.
  revision: yes
- Referee: [§3.2] §3.2 (Method): The regularization coefficient and sampling temperature are listed as free parameters. The manuscript does not report sensitivity analysis or default values that would allow reproduction of the 32K-budget results without additional tuning, which directly affects the reproducibility of the claimed 8× efficiency improvement.
  Authors: We acknowledge that explicit default values and sensitivity analysis are necessary for reproducibility. In the revision we will report the specific coefficient and temperature values used to obtain the 32K-budget results and include a sensitivity study demonstrating robustness across reasonable ranges of these hyperparameters.
  revision: yes
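The promised sensitivity study can be organized as a grid over the two free parameters, summarized by how much the success rate varies across it. The helper names are hypothetical and the value ranges below are placeholders chosen for illustration; the source does not state the defaults.

```python
from itertools import product


def sensitivity_grid(reg_coefficients, temperatures):
    """Cartesian grid over the two free parameters named in the report."""
    return [
        {"reg_coefficient": w, "temperature": t}
        for w, t in product(reg_coefficients, temperatures)
    ]


def asr_spread(asr_by_setting):
    """Gap between the best and worst ASR across the grid.

    A small spread would support the robustness claim; a large one would
    suggest the 32K-budget result depends on tuning.
    """
    values = list(asr_by_setting.values())
    return max(values) - min(values)


# Placeholder ranges, not values from the paper.
grid = sensitivity_grid([0.0, 1.0, 4.0], [0.25, 0.5, 1.0])
```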
Circularity Check
No circularity: empirical method validated by experiments
The paper proposes three algorithmic modifications to GCG (distance-based regularization, temperature-controlled sampling, visited-suffix marking) and reports empirical results on evaluation count, wall-clock time, and attack success rate. No equations, derivations, or 'predictions' are presented that reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear. The central claims rest on experimental outcomes rather than any self-referential chain, consistent with the default non-circular finding for empirical methods papers.
Axiom & Free-Parameter Ledger
free parameters (2)
- regularization coefficient
- sampling temperature
axioms (1)
- domain assumption: Distance regularization improves the accuracy of gradient-based suffix selection in discrete token space.
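The axiom above can be illustrated as follows. A gradient-based estimate of the loss change is a first-order approximation, which is generally more trustworthy for candidate tokens whose embeddings lie close to the current token's embedding. The sketch below penalizes candidates by embedding distance. The additive form, the Euclidean distance, and the names `grad_scores` and `reg_coefficient` are assumptions for illustration, since the source does not give the exact formula.

```python
import math


def regularized_scores(grad_scores, candidate_embeddings, current_embedding,
                       reg_coefficient):
    """Add a distance penalty to per-candidate gradient scores.

    grad_scores[i]: first-order estimate of the loss change for candidate i
        (lower is better).
    candidate_embeddings[i]: embedding vector of candidate token i.
    current_embedding: embedding of the token currently in the suffix.
    Assumed form: score_i = grad_i + reg_coefficient * ||e_i - e_current||.
    """
    scores = []
    for grad, emb in zip(grad_scores, candidate_embeddings):
        distance = math.dist(emb, current_embedding)  # Euclidean distance
        scores.append(grad + reg_coefficient * distance)
    return scores


# Toy 2-D embeddings: candidate 0 looks best by gradient but is far away.
grads = [-3.0, -2.0]
embeddings = [(3.0, 4.0), (0.0, 1.0)]
current = (0.0, 0.0)
plain = regularized_scores(grads, embeddings, current, reg_coefficient=0.0)
penalized = regularized_scores(grads, embeddings, current, reg_coefficient=0.5)
```

In this toy case the penalty flips the ranking: the distant candidate wins on the raw gradient score, while the nearby candidate wins once distance is taken into account.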
Forward citations
Cited by 4 Pith papers
- GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
  GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.
- GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
  GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.
- CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
  CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
- Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
  Phonetic perturbations fragment safety-critical tokens in LLMs, suppressing attribution scores while preserving input understanding and causing safety mechanisms to fail despite good comprehension.