Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models

BingZe Lee; Jinghao Cui; Qiongxiu Li; Shei PernChua; Wei Zhang; Xiao Li; Xiaolin Hu; Yifan Huang; Zhuhong Li

arXiv: 2410.15362 · v2 · pith:7I2NNZKEnew · submitted 2024-10-20 · cs.LG · cs.AI · cs.CL · cs.CR


Pith reviewed 2026-05-23 18:59 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL · cs.CR

keywords jailbreak attacks, adversarial prompts, large language models, discrete optimization, sample efficiency, gradient estimation, LLM alignment, GCG


The pith

Faster-GCG cuts the evaluations needed for jailbreaks from 256K to 32K by fixing inaccurate gradients, uniform sampling, and repeated suffix checks in GCG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that standard GCG wastes most of its evaluations on poor gradient signals, random sampling that misses good candidates, and re-testing the same token sequences. Three targeted fixes—distance-based regularization on the gradients, temperature scaling during sampling, and a record of already-tried suffixes—let the search reach high success rates with far fewer model calls. A reader would care because the result shows that current safety training can be overcome by a more efficient search procedure rather than by fundamentally new attack ideas. If the claim holds, automated jailbreak generation becomes practical on ordinary hardware and forces a reassessment of how much compute is required to test alignment.

Core claim

Faster-GCG incorporates distance-based regularization for improved gradient estimation, temperature-controlled sampling for more effective exploration, and visited-suffix marking to avoid redundant evaluations. These changes reduce the required evaluations per harmful behavior from approximately 256K to 32K, delivering up to an 8× gain in sampling efficiency and a 7× reduction in wall-clock time while reaching an average success rate of 78.1% across five aligned LLMs and 88.7% on Qwen3.5-4B.
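The headline ratio can be checked directly from the two budgets. The short Python sketch below does that arithmetic. The helper names are illustrative, and the second function rests on an assumption that is ours rather than the paper's: that wall-clock time is simply the evaluation count multiplied by an average per-evaluation cost.

```python
# Budgets per harmful behavior as quoted above ("256K" and "32K" read as thousands).
GCG_EVALS = 256_000
FASTER_GCG_EVALS = 32_000
REPORTED_TIME_GAIN = 7.0  # the "up to 7x" wall-clock reduction quoted above


def sampling_gain(baseline_evals: int, new_evals: int) -> float:
    """How many times fewer evaluations the new method needs."""
    return baseline_evals / new_evals


def implied_per_eval_cost_ratio(eval_gain: float, time_gain: float) -> float:
    """Per-evaluation cost of the new method relative to the baseline.

    ASSUMPTION (not from the paper): total time = evaluations * average cost
    per evaluation, so time_gain = eval_gain / cost_ratio.
    """
    return eval_gain / time_gain


gain = sampling_gain(GCG_EVALS, FASTER_GCG_EVALS)
cost_ratio = implied_per_eval_cost_ratio(gain, REPORTED_TIME_GAIN)
print(f"sampling gain: {gain:.0f}x, implied per-evaluation cost ratio: {cost_ratio:.2f}")
```

Under that simplifying assumption, an 8× sampling gain paired with a 7× time gain would mean each Faster-GCG evaluation costs roughly 8/7 of a GCG evaluation, which is one plausible reading of why the two figures differ.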

What carries the argument

The three modifications—distance-based regularization for gradient estimation, temperature-controlled sampling, and visited-suffix marking—that together reduce redundancy and improve exploration in the discrete token search.
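To make the first modification concrete, here is a minimal Python sketch of distance-regularized candidate scoring for a single suffix position. The exact form of the paper's regularization term is not reproduced on this page, so the additive Euclidean penalty, the function name `regularized_scores`, the weight `w`, and the toy embedding table are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def regularized_scores(grad, embeddings, current, w):
    """Score every vocabulary token as a replacement at one suffix position.

    Lower is better. The first term is the first-order (Taylor) estimate of the
    loss change; the second penalizes tokens whose embedding lies far from the
    current one, where a linear estimate is least trustworthy.
    ASSUMPTION: the additive Euclidean penalty is an illustrative choice.
    """
    cur = embeddings[current]
    scores = []
    for emb in embeddings:
        delta = [e - c for e, c in zip(emb, cur)]
        taylor = dot(grad, delta)
        distance = math.sqrt(dot(delta, delta))
        scores.append(taylor + w * distance)
    return scores


# Toy 2-D embedding table and gradient (made-up numbers, illustration only).
EMBEDDINGS = [[0.0, 0.0], [-1.0, 0.0], [-2.0, 5.0]]
GRAD = [1.0, 0.0]

plain = regularized_scores(GRAD, EMBEDDINGS, current=0, w=0.0)
regularized = regularized_scores(GRAD, EMBEDDINGS, current=0, w=0.5)
best_plain = min(range(len(plain)), key=plain.__getitem__)
best_regularized = min(range(len(regularized)), key=regularized.__getitem__)
print(best_plain, best_regularized)
```

In the toy example the unregularized estimate prefers the distant token 2, while the penalty shifts the choice to the nearby token 1. That is the kind of correction the regularization is meant to provide; whether it helps in practice depends on the weight.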

If this is right

Jailbreak generation becomes feasible under far lower query budgets than previously reported.
White-box attacks can match or exceed other methods when total evaluations are capped at 32K.
The same search procedure works across multiple model families without per-model retuning.
Reducing repeated evaluations frees budget that can be redirected toward longer or more diverse candidate sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Alignment procedures may need to include efficient search during training so that models remain robust even when attackers use optimized discrete search.
The same three fixes could be tested on other discrete optimization tasks such as prompt compression or automated red-teaming outside the jailbreak setting.
If the efficiency gain scales with model size, testing alignment on thousands of behaviors could become routine rather than exceptional.

Load-bearing premise

That the three modifications improve exploration and reduce redundancy without lowering attack success rates or introducing new biases in the reported experiments.

What would settle it

An experiment that applies Faster-GCG to a new collection of harmful behaviors or models at the 32K budget and measures success rates no higher than those obtained by unmodified GCG at the same budget.
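The decision rule in that experiment is simple enough to write down. The Python sketch below computes the attack success rate from per-behavior outcomes and applies the comparison; the function names and the outcome lists are hypothetical placeholders, not results from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of behaviors jailbroken; `outcomes` holds one bool per behavior."""
    if not outcomes:
        raise ValueError("need at least one behavior")
    return sum(outcomes) / len(outcomes)


def claim_is_undermined(faster_gcg_outcomes, gcg_outcomes):
    """True when Faster-GCG does no better than GCG at the same budget.

    Both lists must cover the same behaviors, each run under the same
    evaluation budget (32K in the scenario described above).
    """
    if len(faster_gcg_outcomes) != len(gcg_outcomes):
        raise ValueError("both methods must be run on the same behaviors")
    return attack_success_rate(faster_gcg_outcomes) <= attack_success_rate(gcg_outcomes)


# Made-up outcomes for four behaviors, purely to show the call pattern.
toy_faster = [True, True, False, True]
toy_gcg = [True, False, False, False]
print(attack_success_rate(toy_faster), attack_success_rate(toy_gcg))
print(claim_is_undermined(toy_faster, toy_gcg))
```

A real study would also want a significance test over paired outcomes, since a raw difference on a small behavior set can arise by chance.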

Figures

Figures reproduced from arXiv: 2410.15362 by BingZe Lee, Jinghao Cui, Qiongxiu Li, Shei PernChua, Wei Zhang, Xiao Li, Xiaolin Hu, Yifan Huang, Zhuhong Li.

**Figure 1.** An illustration of the jailbreak setting. The black text represents the fixed system prompt template. …

**Figure 2.** The comparison between GCG and Faster-GCG. Given an initial suffix (exclamation marks on the left), …

**Figure 3.** Average loss comparison between GCG and Faster-GCG on 20 randomly selected behaviors. Both GCG and Faster-GCG use the same cross-entropy loss. As the iterations progress, the loss of Faster-GCG falls below that of GCG.

**Figure 4.** The ASRs on Llama-2-7B-chat using different …

Abstract

Aligned Large Language Models (LLMs) have attracted significant attention for their safety, particularly in the context of jailbreak attacks that attempt to bypass guardrails via adversarial prompts. Among existing approaches, the Greedy Coordinate Gradient (GCG) attack pioneered automated jailbreaks through discrete token optimization; however, its low sample efficiency limits practical applicability. In particular, GCG requires approximately 256K evaluations per harmful behavior to achieve a satisfactory jailbreak success rate, due to the inherent difficulty of the underlying discrete optimization problem. In this work, we identify three key factors that limit the sample efficiency of GCG: inaccurate gradient-based estimation, inefficient uniform sampling, and repeated evaluation of previously explored suffixes. To address these issues, we propose Faster-GCG, a streamlined variant of GCG that incorporates distance-based regularization for improved estimation, temperature-controlled sampling for more effective exploration, and a visited-suffix marking mechanism to avoid redundant evaluations. Faster-GCG reduced the required evaluations to 32K, achieving up to an 8× improvement in sampling efficiency and a 7× reduction in wall-clock time compared to GCG. Under this reduced budget, Faster-GCG attained an average jailbreak success rate of 78.1% across five aligned LLMs, and achieved 88.7% against Qwen3.5-4B, outperforming state-of-the-art white-box jailbreak methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Faster-GCG applies three targeted tweaks to GCG and reports large efficiency gains, but the central claim rests on unablated comparisons that leave open whether success rates are preserved or shifted.


The paper's core contribution is a set of three engineering fixes to the original GCG algorithm: distance-based regularization on the gradient estimate, temperature-controlled sampling, and a visited-suffix cache to skip repeats. These are applied to the same discrete optimization setup, and the authors report that the evaluation budget drops from 256K to 32K while the average attack success rate (ASR) across five models is 78.1%, with up to an 8× gain in sampling efficiency and a 7× reduction in wall-clock time. That is the concrete result worth noting if the numbers hold under scrutiny.
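The second fix, temperature-controlled sampling, can be illustrated with a small Python sketch. It assumes candidates carry a score where lower means a lower estimated loss, and it turns those scores into sampling probabilities with a softmax at temperature `temperature`. The function names and the score convention are assumptions for illustration; the paper's exact sampling rule is not given on this page.

```python
import math
import random


def temperature_probs(scores, temperature):
    """Softmax over negated scores: lower score -> higher probability.

    A small temperature approaches greedy selection of the best score;
    a large temperature approaches uniform sampling.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [-s / temperature for s in scores]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(scores, temperature, rng):
    """Draw one candidate index according to the tempered distribution."""
    probs = temperature_probs(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


toy_scores = [0.0, 1.0, 2.0]  # made-up scores for three candidate tokens
rng = random.Random(0)
print(temperature_probs(toy_scores, 0.5))
print(sample_token(toy_scores, 0.5, rng))
```

The temperature is the knob that trades exploitation against exploration: uniform sampling, which the paper identifies as inefficient, corresponds to the high-temperature limit of this rule.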

Referee Report

2 major / 2 minor

Summary. The paper proposes Faster-GCG, an efficiency-enhanced variant of the Greedy Coordinate Gradient (GCG) discrete optimization attack for jailbreaking aligned LLMs. It identifies three bottlenecks in standard GCG (inaccurate gradient estimates, uniform sampling, and repeated suffix evaluations) and introduces three modifications—distance-based regularization on gradients, temperature-controlled sampling, and a visited-suffix marking mechanism—to reduce the per-behavior evaluation budget from 256K to 32K. The manuscript reports up to 8× sampling efficiency gains and 7× wall-clock time reduction, with an average attack success rate (ASR) of 78.1% across five models and 88.7% on Qwen3.5-4B, claiming to outperform prior white-box methods under the reduced budget.

Significance. If the efficiency improvements prove robust under controlled conditions, the work would be a useful practical contribution to automated red-teaming research by lowering the computational barrier for large-scale jailbreak experiments. The explicit focus on sample efficiency metrics and concrete wall-clock comparisons is a positive aspect of the empirical framing.

major comments (2)

[§4 and §5] §4 (Experimental Setup) and §5 (Results): The central claim that the three modifications improve sample efficiency while preserving ASR rests on comparisons performed at unequal evaluation budgets (32K vs. 256K). No component-wise ablations are reported that hold the total evaluation budget fixed at 32K and isolate the contribution of distance-based regularization, temperature sampling, and visited-suffix marking individually on identical harmful behaviors and models. This omission leaves open the possibility that observed ASR differences arise from changes in the effective search distribution rather than genuine efficiency gains.
[§3.2] §3.2 (Method): The regularization coefficient and sampling temperature are listed as free parameters. The manuscript does not report sensitivity analysis or default values that would allow reproduction of the 32K-budget results without additional tuning, which directly affects the reproducibility of the claimed 8× efficiency improvement.
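The fixed-budget ablation requested in the first major comment amounts to running every on/off combination of the three components under one evaluation cap. A minimal Python sketch of that experimental grid follows; the `AblationConfig` class and its field names are hypothetical and do not come from the paper's code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationConfig:
    """One cell of the ablation grid (hypothetical structure)."""
    use_distance_reg: bool
    use_temperature_sampling: bool
    use_visited_marking: bool
    eval_budget: int = 32_000  # identical cap for every configuration


def ablation_grid(eval_budget=32_000):
    """All 2**3 component combinations at the same evaluation budget."""
    return [
        AblationConfig(reg, temp, visited, eval_budget)
        for reg, temp, visited in product([False, True], repeat=3)
    ]


grid = ablation_grid()
for config in grid:
    print(config)
```

The all-off configuration corresponds to unmodified GCG and the all-on configuration to full Faster-GCG, so the grid contains both end points of the paper's comparison at a matched budget.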

minor comments (2)

[Table 1, Figure 2] Table 1 and Figure 2: Axis labels and caption text do not explicitly state whether the reported ASR values are computed over the same set of 100 harmful behaviors for all methods or whether any post-hoc filtering was applied.
[§3.3] §3.3: The visited-suffix marking mechanism is described at a high level but lacks an explicit algorithmic listing or complexity analysis showing how marking is implemented without increasing per-step overhead.
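One way the marking mechanism could be implemented with low overhead is a hash set of token-ID tuples. The Python sketch below is an assumption about a reasonable implementation, not the paper's listing; `filter_unvisited` and the toy token IDs are hypothetical.

```python
def filter_unvisited(candidates, visited):
    """Return candidates never seen before and mark them as visited.

    `candidates` is a list of suffixes, each a list of token IDs.
    `visited` is a set of tuples that persists across optimization steps.
    Duplicates inside the same batch are dropped as well.
    """
    fresh = []
    for suffix in candidates:
        key = tuple(suffix)  # lists are unhashable; tuples can live in a set
        if key in visited:
            continue
        visited.add(key)
        fresh.append(suffix)
    return fresh


visited = set()
step_one = filter_unvisited([[1, 2], [1, 2], [3, 4]], visited)
step_two = filter_unvisited([[3, 4], [5, 6]], visited)
print(step_one, step_two, len(visited))
```

In this design each lookup hashes one suffix, which is linear in the suffix length, and memory grows with the number of distinct suffixes evaluated. Both costs should be small next to a forward pass of the target model, though that comparison is an expectation rather than a measured result.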

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and indicate the revisions we will make.


Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): The central claim that the three modifications improve sample efficiency while preserving ASR rests on comparisons performed at unequal evaluation budgets (32K vs. 256K). No component-wise ablations are reported that hold the total evaluation budget fixed at 32K and isolate the contribution of distance-based regularization, temperature sampling, and visited-suffix marking individually on identical harmful behaviors and models. This omission leaves open the possibility that observed ASR differences arise from changes in the effective search distribution rather than genuine efficiency gains.

Authors: We agree that component-wise ablations at a fixed 32K budget would more rigorously isolate the contribution of each modification and rule out distributional effects. Our primary experiments demonstrate the end-to-end sample-efficiency gain of the combined Faster-GCG method relative to standard GCG at its conventional budget. To strengthen the evidence, we will add fixed-budget ablations of each component (regularization, temperature sampling, and visited-suffix marking) on the same behaviors and models in the revised manuscript.

revision: yes

Referee: [§3.2] §3.2 (Method): The regularization coefficient and sampling temperature are listed as free parameters. The manuscript does not report sensitivity analysis or default values that would allow reproduction of the 32K-budget results without additional tuning, which directly affects the reproducibility of the claimed 8× efficiency improvement.

Authors: We acknowledge that explicit default values and sensitivity analysis are necessary for reproducibility. In the revision we will report the specific coefficient and temperature values used to obtain the 32K-budget results and include a sensitivity study demonstrating robustness across reasonable ranges of these hyperparameters.

revision: yes
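A sensitivity study of the kind promised here could be organized as a grid over the two free parameters, summarized by how much the ASR moves across the grid. The Python sketch below shows one such layout. The parameter ranges are placeholders rather than the paper's values, and `sweep_grid` and `asr_spread` are hypothetical helper names.

```python
from itertools import product


def sweep_grid(coefficients, temperatures):
    """Every (regularization coefficient, temperature) pair to evaluate."""
    return [
        {"reg_coefficient": w, "temperature": t}
        for w, t in product(coefficients, temperatures)
    ]


def asr_spread(asr_by_setting):
    """Max minus min ASR across settings; a small spread suggests robustness."""
    values = list(asr_by_setting.values())
    if not values:
        raise ValueError("no results to summarize")
    return max(values) - min(values)


# Placeholder ranges for illustration only; the paper's values are not reported here.
grid = sweep_grid([0.1, 1.0, 10.0], [0.25, 0.5, 1.0])
print(len(grid), grid[0])
```

Each grid cell would be run at the same 32K budget on the same behaviors, with the measured ASR values then passed to `asr_spread` keyed by setting.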

Circularity Check

0 steps flagged

No circularity: empirical method validated by experiments


The paper proposes three algorithmic modifications to GCG (distance-based regularization, temperature-controlled sampling, visited-suffix marking) and reports empirical results on evaluation count, wall-clock time, and attack success rate. No equations, derivations, or 'predictions' are presented that reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear. The central claims rest on experimental outcomes rather than any self-referential chain, consistent with the default non-circular finding for empirical methods papers.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions from discrete optimization and introduces a small number of tunable hyperparameters for the new components.

free parameters (2)

regularization coefficient
Weight of the distance-based penalty term, chosen to improve gradient estimates.
sampling temperature
Controls the balance between greedy and exploratory token selection.

axioms (1)

domain assumption: Distance regularization improves the accuracy of gradient-based suffix selection in discrete token space.
Invoked to justify the first proposed fix.


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
cs.CV 2026-05 unverdicted novelty 7.0

GPO-V is a visual jailbreak framework that bypasses safety guardrails in diffusion VLMs by globally manipulating generative probabilities during denoising.

GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
cs.CV 2026-05 unverdicted novelty 7.0

GPO-V jailbreaks dVLMs by globally optimizing probabilities in the denoising process to bypass refusal patterns, achieving stealthy and transferable attacks.

CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
cs.CR 2026-04 unverdicted novelty 6.0

CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.

Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
cs.CL 2025-05 unverdicted novelty 6.0

Phonetic perturbations fragment safety-critical tokens in LLMs, suppressing attribution scores while preserving input understanding and causing safety mechanisms to fail despite good comprehension.
