Recognition: 2 theorem links · Lean Theorem
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Pith reviewed 2026-05-14 17:05 UTC · model grok-4.3
The pith
SmoothLLM defends large language models against jailbreaking by perturbing input prompts at the character level and aggregating multiple responses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SmoothLLM works by randomly perturbing multiple copies of an input prompt through character-level substitutions and then aggregating the LLM's predictions across those copies to detect and block adversarial inputs designed to produce objectionable content.
What carries the argument
Random character-level perturbations of the prompt combined with aggregation of model outputs across perturbed copies.
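For concreteness, a minimal sketch of this perturb-and-aggregate loop follows. It is an editorial illustration rather than the authors' released implementation: `query_llm` and `is_refusal` are hypothetical stand-ins for the target model and a refusal detector, and the defaults (10% perturbation rate, 10 copies) echo values discussed in the simulated rebuttal below.

```python
# Minimal sketch of SmoothLLM-style perturb-and-aggregate (illustrative,
# not the authors' released code). `query_llm` and `is_refusal` are
# hypothetical stand-ins for the target model and a refusal detector.
import random
import string

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real model call.
    return "I cannot help with that."

def is_refusal(response: str) -> bool:
    # Hypothetical stand-in: a crude keyword-based refusal check.
    return any(kw in response for kw in ("I cannot", "I'm sorry", "I can't"))

def perturb(prompt: str, rate: float = 0.10) -> str:
    """Randomly substitute a fraction `rate` of the prompt's characters."""
    chars = list(prompt)
    if not chars:
        return prompt
    n_swap = max(1, int(rate * len(chars)))
    alphabet = string.ascii_letters + string.digits + " "
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(alphabet)
    return "".join(chars)

def smooth_llm(prompt: str, n_copies: int = 10, rate: float = 0.10) -> str:
    """Query perturbed copies and return a response consistent with the vote."""
    responses = [query_llm(perturb(prompt, rate)) for _ in range(n_copies)]
    flags = [is_refusal(r) for r in responses]
    majority_refuses = sum(flags) * 2 > n_copies  # strict majority of refusals
    # Return a response that agrees with the majority decision.
    pool = [r for r, f in zip(responses, flags) if f == majority_refuses]
    return random.choice(pool)
```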
If this is right
- Across popular LLMs, SmoothLLM achieves state-of-the-art robustness to GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks.
- SmoothLLM remains effective even against adaptive versions of the GCG attack.
- There is a small, though non-negligible, trade-off between the defense's robustness and the model's performance on non-adversarial inputs.
- The method works with any large language model without requiring changes to the model itself.
Where Pith is reading between the lines
- Similar perturbation and aggregation strategies might apply to other types of adversarial attacks on language models.
- The brittleness of adversarial prompts suggests that defenses could be designed around input randomization rather than model retraining.
- Testing on a wider range of attack types could reveal the limits of this approach.
Load-bearing premise
Adversarially generated prompts lose their effectiveness when subjected to random character-level modifications.
What would settle it
Finding or constructing a jailbreaking prompt that continues to elicit the target objectionable output even after multiple random character perturbations would falsify the core defense mechanism.
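The sketch below operationalizes that test under stated assumptions: perturb a candidate jailbreak many times and measure how often it still succeeds. `attack_succeeds` is a hypothetical judge (e.g., a jailbreak classifier) that the experimenter must supply; a survival rate near 1.0 would falsify the brittleness premise, while a rate near 0.0 supports it.

```python
# Sketch of the falsification test described above. `attack_succeeds`
# is a hypothetical oracle (e.g., a jailbreak judge); it is not part
# of the paper's code.
import random
import string

def perturbation_survival_rate(prompt, attack_succeeds, trials=100, rate=0.10):
    """Fraction of random character-level perturbations the attack survives."""
    survived = 0
    alphabet = string.ascii_letters + string.digits + " "
    for _ in range(trials):
        chars = list(prompt)
        n_swap = max(1, int(rate * len(chars)))
        for i in random.sample(range(len(chars)), n_swap):
            chars[i] = random.choice(alphabet)
        if attack_succeeds("".join(chars)):
            survived += 1
    return survived / trials  # near 1.0 would falsify the brittleness premise
```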
Original abstract
Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SmoothLLM, a defense against jailbreaking attacks on LLMs. It is based on the empirical observation that adversarially generated prompts are brittle to character-level perturbations. The algorithm creates multiple randomly perturbed copies of an input prompt, queries the target LLM on each copy, and aggregates the outputs to detect and mitigate adversarial inputs. The paper reports that SmoothLLM achieves state-of-the-art robustness against GCG, PAIR, RandomSearch, and AmpleGCG attacks across multiple popular LLMs, remains effective against adaptive GCG attacks, incurs only a modest trade-off with nominal performance, and is compatible with any LLM. Public code is released.
Significance. If the reported empirical results hold under the stated conditions, SmoothLLM offers a practical, model-agnostic defense that substantially improves robustness to several prominent jailbreaking methods without requiring model retraining or internal access. The broad evaluation across LLMs and attack types, combined with public code and resistance to adaptive attacks, constitutes a concrete contribution to LLM safety research.
major comments (2)
- §3.2 (Method): The precise aggregation rule used to combine predictions across the perturbed copies is not fully specified, including how conflicting outputs or ties are resolved; this procedure is load-bearing for the claimed robustness gains and must be stated explicitly for reproducibility.
- §4.1–4.2 (Experiments): The selection process for the free parameters (perturbation rate and number of perturbed copies) is not detailed, including whether values were tuned on held-out data or attack-specific performance; this affects the strength of the cross-attack robustness claims.
minor comments (2)
- Figure 2 and Table 1: Axis labels and legend entries could be enlarged or clarified to improve readability of the robustness metrics across attack methods.
- Abstract: The phrase 'small, though non-negligible trade-off' should be accompanied by a quantitative bound (e.g., average drop in benign accuracy) to give readers an immediate sense of the cost.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and recommendation for minor revision. We appreciate the constructive feedback on improving the clarity of the method and experimental details, which we will address to strengthen reproducibility.
Point-by-point responses
-
Referee: §3.2 (Method): The precise aggregation rule used to combine predictions across the perturbed copies is not fully specified, including how conflicting outputs or ties are resolved; this procedure is load-bearing for the claimed robustness gains and must be stated explicitly for reproducibility.
Authors: We thank the referee for highlighting this point. In SmoothLLM, the aggregation rule is majority voting across the LLM outputs on the perturbed prompts: an input is classified as adversarial if a strict majority of the perturbed copies produce a refusal (safe) response. In the event of a tie, the input is classified as non-adversarial to avoid over-flagging clean prompts. We will revise §3.2 to state this rule explicitly, include pseudocode for the full procedure, and clarify the tie-breaking logic. revision: yes
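For concreteness, the stated rule reduces to a short predicate; the sketch below restates it and is an editorial illustration, not an excerpt from the released code.

```python
# Restatement of the aggregation rule as described in the response above
# (strict-majority refusal => adversarial; ties => non-adversarial).
# Illustrative only; not taken from the released implementation.
def classify_adversarial(refusal_flags: list[bool]) -> bool:
    n_refuse = sum(refusal_flags)
    # A strict majority is required: a tie (n_refuse * 2 == len(flags))
    # deliberately falls through to False to avoid over-flagging clean prompts.
    return n_refuse * 2 > len(refusal_flags)
```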
-
Referee: §4.1–4.2 (Experiments): The selection process for the free parameters (perturbation rate and number of perturbed copies) is not detailed, including whether values were tuned on held-out data or attack-specific performance; this affects the strength of the cross-attack robustness claims.
Authors: We agree that more detail is warranted. The perturbation rate (10%) and number of copies (10) were selected via a small grid search performed on a held-out set of 50 prompts drawn from AdvBench, using only the base Llama-2-7B model and without reference to any particular attack. The search prioritized a balance between robustness and nominal performance. We will expand the description in §4.1 to document this process and add a parameter-sensitivity ablation to the appendix. revision: yes
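As a reading aid, the described selection procedure amounts to a small grid search; the sketch below is illustrative, with hypothetical `robustness` and `nominal_accuracy` evaluators and an assumed grid that may differ from the authors' actual search space.

```python
# Illustrative grid search over the two free parameters, mirroring the
# procedure described above. `robustness` and `nominal_accuracy` are
# hypothetical evaluators over a held-out prompt set; the grid values
# are assumptions, not the authors' exact search space.
def select_parameters(heldout_prompts, robustness, nominal_accuracy):
    best, best_score = None, float("-inf")
    for rate in (0.05, 0.10, 0.15, 0.20):        # perturbation rate
        for n_copies in (2, 4, 6, 8, 10):        # number of perturbed copies
            # Balance robustness against nominal performance.
            score = (robustness(heldout_prompts, rate, n_copies)
                     + nominal_accuracy(heldout_prompts, rate, n_copies))
            if score > best_score:
                best, best_score = (rate, n_copies), score
    return best  # e.g., (0.10, 10) per the response above
```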
Circularity Check
No significant circularity identified
full rationale
The paper presents SmoothLLM as an empirical algorithm motivated by the external observation that adversarially generated prompts are brittle to character-level perturbations. This brittleness is demonstrated through direct experiments on GCG, PAIR, RandomSearch, and AmpleGCG attacks across multiple LLMs; there is no derivation chain that reduces the central claim to its own inputs by construction, no fitted parameters renamed as predictions, and no load-bearing self-citations. The defense (random perturbation + aggregation) follows straightforwardly from the empirical finding, without circular redefinition or uniqueness theorems imported from the authors' prior work.
Axiom & Free-Parameter Ledger
free parameters (2)
- perturbation rate
- number of perturbed copies
axioms (1)
- domain assumption: Adversarially-generated prompts are brittle to small character-level perturbations, while benign prompts remain stable
Forward citations
Cited by 24 Pith papers
-
OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing
OrchJail uses orchestration-guided fuzzing to jailbreak tool-calling T2I agents by targeting high-risk tool patterns, yielding higher attack success rates, better image quality, and lower costs than prior prompt-only methods.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...
-
Jailbroken Frontier Models Retain Their Capabilities
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
-
Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
-
Adaptive Prompt Embedding Optimization for LLM Jailbreaking
PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based wh...
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Behavioral Integrity Verification for AI Agent Skills
BIV audits AI agent skills at scale, finding 80% deviate from declared behavior on 49,943 skills and achieving 0.946 F1 for malicious skill detection.
-
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...
-
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
-
SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.
-
Towards Understanding the Robustness of Sparse Autoencoders
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
-
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
-
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
-
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
InjecAgent benchmark demonstrates that tool-integrated LLM agents are vulnerable to indirect prompt injection attacks, with ReAct-prompted GPT-4 succeeding on 24% of attacks and nearly twice that rate when attacker in...
-
Jailbreaking Black Box Large Language Models in Twenty Queries
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
-
Re-Triggering Safeguards within LLMs for Jailbreak Detection
Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
-
SoK: Robustness in Large Language Models against Jailbreak Attacks
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
-
MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning
MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.