Recognition: 2 theorem links · Lean Theorem
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Pith reviewed 2026-05-14 17:05 UTC · model grok-4.3
The pith
SmoothLLM defends large language models against jailbreaking by perturbing input prompts at the character level and aggregating multiple responses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SmoothLLM works by randomly perturbing multiple copies of an input prompt through character-level substitutions and then aggregating the LLM's predictions across those copies to detect and block adversarial inputs designed to produce objectionable content.
What carries the argument
Random character-level perturbations of the prompt combined with aggregation of model outputs across perturbed copies.
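For concreteness, a minimal sketch of this perturb-and-aggregate loop follows. It is an editorial illustration rather than the authors' released implementation: `query_llm` and `is_refusal` are hypothetical stand-ins for the target model and a refusal detector, and the defaults (10% perturbation rate, 10 copies) echo values discussed in the simulated rebuttal below.

```python
# Minimal sketch of SmoothLLM-style perturb-and-aggregate (illustrative,
# not the authors' released code). `query_llm` and `is_refusal` are
# hypothetical stand-ins for the target model and a refusal detector.
import random
import string

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real model call.
    return "I cannot help with that."

def is_refusal(response: str) -> bool:
    # Hypothetical stand-in: a crude keyword-based refusal check.
    return any(kw in response for kw in ("I cannot", "I'm sorry", "I can't"))

def perturb(prompt: str, rate: float = 0.10) -> str:
    """Randomly substitute a fraction `rate` of the prompt's characters."""
    chars = list(prompt)
    if not chars:
        return prompt
    n_swap = max(1, int(rate * len(chars)))
    alphabet = string.ascii_letters + string.digits + " "
    for i in random.sample(range(len(chars)), n_swap):
        chars[i] = random.choice(alphabet)
    return "".join(chars)

def smooth_llm(prompt: str, n_copies: int = 10, rate: float = 0.10) -> str:
    """Query perturbed copies and return a response consistent with the vote."""
    responses = [query_llm(perturb(prompt, rate)) for _ in range(n_copies)]
    flags = [is_refusal(r) for r in responses]
    majority_refuses = sum(flags) * 2 > n_copies  # strict majority of refusals
    # Return a response that agrees with the majority decision.
    pool = [r for r, f in zip(responses, flags) if f == majority_refuses]
    return random.choice(pool)
```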
If this is right
- Across popular LLMs, SmoothLLM achieves state-of-the-art robustness to GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks.
- SmoothLLM remains effective even against adaptive versions of the GCG attack.
- There is a small, though non-negligible, trade-off between the defense's robustness and the model's performance on non-adversarial inputs.
- The method works with any large language model without requiring changes to the model itself.
Where Pith is reading between the lines
- Similar perturbation and aggregation strategies might apply to other types of adversarial attacks on language models.
- The brittleness of adversarial prompts suggests that defenses could be designed around input randomization rather than model retraining.
- Testing on a wider range of attack types could reveal the limits of this approach.
Load-bearing premise
Adversarially generated prompts lose their effectiveness when subjected to random character-level modifications.
What would settle it
Finding or constructing a jailbreaking prompt that continues to elicit the target objectionable output even after multiple random character perturbations would falsify the core defense mechanism.
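The sketch below operationalizes that test under stated assumptions: perturb a candidate jailbreak many times and measure how often it still succeeds. `attack_succeeds` is a hypothetical judge (e.g., a jailbreak classifier) that the experimenter must supply; a survival rate near 1.0 would falsify the brittleness premise, while a rate near 0.0 supports it.

```python
# Sketch of the falsification test described above. `attack_succeeds`
# is a hypothetical oracle (e.g., a jailbreak judge); it is not part
# of the paper's code.
import random
import string

def perturbation_survival_rate(prompt, attack_succeeds, trials=100, rate=0.10):
    """Fraction of random character-level perturbations the attack survives."""
    survived = 0
    alphabet = string.ascii_letters + string.digits + " "
    for _ in range(trials):
        chars = list(prompt)
        n_swap = max(1, int(rate * len(chars)))
        for i in random.sample(range(len(chars)), n_swap):
            chars[i] = random.choice(alphabet)
        if attack_succeeds("".join(chars)):
            survived += 1
    return survived / trials  # near 1.0 would falsify the brittleness premise
```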
Original abstract
Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SmoothLLM, a defense against jailbreaking attacks on LLMs. It is based on the empirical observation that adversarially generated prompts are brittle to character-level perturbations. The algorithm creates multiple randomly perturbed copies of an input prompt, queries the target LLM on each copy, and aggregates the outputs to detect and mitigate adversarial inputs. The paper reports that SmoothLLM achieves state-of-the-art robustness against GCG, PAIR, RandomSearch, and AmpleGCG attacks across multiple popular LLMs, remains effective against adaptive GCG attacks, incurs only a modest trade-off with nominal performance, and is compatible with any LLM. Public code is released.
Significance. If the reported empirical results hold under the stated conditions, SmoothLLM offers a practical, model-agnostic defense that substantially improves robustness to several prominent jailbreaking methods without requiring model retraining or internal access. The broad evaluation across LLMs and attack types, combined with public code and resistance to adaptive attacks, constitutes a concrete contribution to LLM safety research.
major comments (2)
- §3.2 (Method): The precise aggregation rule used to combine predictions across the perturbed copies is not fully specified, including how conflicting outputs or ties are resolved; this procedure is load-bearing for the claimed robustness gains and must be stated explicitly for reproducibility.
- §4.1–4.2 (Experiments): The selection process for the free parameters (perturbation rate and number of perturbed copies) is not detailed, including whether values were tuned on held-out data or attack-specific performance; this affects the strength of the cross-attack robustness claims.
minor comments (2)
- Figure 2 and Table 1: Axis labels and legend entries could be enlarged or clarified to improve readability of the robustness metrics across attack methods.
- Abstract: The phrase 'small, though non-negligible trade-off' should be accompanied by a quantitative bound (e.g., average drop in benign accuracy) to give readers an immediate sense of the cost.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and recommendation for minor revision. We appreciate the constructive feedback on improving the clarity of the method and experimental details, which we will address to strengthen reproducibility.
Point-by-point responses
-
Referee: §3.2 (Method): The precise aggregation rule used to combine predictions across the perturbed copies is not fully specified, including how conflicting outputs or ties are resolved; this procedure is load-bearing for the claimed robustness gains and must be stated explicitly for reproducibility.
Authors: We thank the referee for highlighting this point. In SmoothLLM, the aggregation rule is majority voting across the LLM outputs on the perturbed prompts: an input is classified as adversarial if a strict majority of the perturbed copies produce a refusal (safe) response. In the event of a tie, the input is classified as non-adversarial to avoid over-flagging clean prompts. We will revise §3.2 to state this rule explicitly, include pseudocode for the full procedure, and clarify the tie-breaking logic. revision: yes
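For concreteness, the stated rule reduces to a short predicate; the sketch below restates it and is an editorial illustration, not an excerpt from the released code.

```python
# Restatement of the aggregation rule as described in the response above
# (strict-majority refusal => adversarial; ties => non-adversarial).
# Illustrative only; not taken from the released implementation.
def classify_adversarial(refusal_flags: list[bool]) -> bool:
    n_refuse = sum(refusal_flags)
    # A strict majority is required: a tie (n_refuse * 2 == len(flags))
    # deliberately falls through to False to avoid over-flagging clean prompts.
    return n_refuse * 2 > len(refusal_flags)
```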
-
Referee: §4.1–4.2 (Experiments): The selection process for the free parameters (perturbation rate and number of perturbed copies) is not detailed, including whether values were tuned on held-out data or attack-specific performance; this affects the strength of the cross-attack robustness claims.
Authors: We agree that more detail is warranted. The perturbation rate (10%) and number of copies (10) were selected via a small grid search performed on a held-out set of 50 prompts drawn from AdvBench, using only the base Llama-2-7B model and without reference to any particular attack. The search prioritized a balance between robustness and nominal performance. We will expand the description in §4.1 to document this process and add a parameter-sensitivity ablation to the appendix. revision: yes
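As a reading aid, the described selection procedure amounts to a small grid search; the sketch below is illustrative, with hypothetical `robustness` and `nominal_accuracy` evaluators and an assumed grid that may differ from the authors' actual search space.

```python
# Illustrative grid search over the two free parameters, mirroring the
# procedure described above. `robustness` and `nominal_accuracy` are
# hypothetical evaluators over a held-out prompt set; the grid values
# are assumptions, not the authors' exact search space.
def select_parameters(heldout_prompts, robustness, nominal_accuracy):
    best, best_score = None, float("-inf")
    for rate in (0.05, 0.10, 0.15, 0.20):        # perturbation rate
        for n_copies in (2, 4, 6, 8, 10):        # number of perturbed copies
            # Balance robustness against nominal performance.
            score = (robustness(heldout_prompts, rate, n_copies)
                     + nominal_accuracy(heldout_prompts, rate, n_copies))
            if score > best_score:
                best, best_score = (rate, n_copies), score
    return best  # e.g., (0.10, 10) per the response above
```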
Circularity Check
No significant circularity identified
full rationale
The paper presents SmoothLLM as an empirical algorithm motivated by the external observation that adversarially generated prompts are brittle to character-level perturbations. This brittleness is demonstrated through direct experiments on GCG, PAIR, RandomSearch, and AmpleGCG attacks across multiple LLMs; there is no derivation chain that reduces the central claim to its own inputs by construction, no fitted parameters renamed as predictions, and no load-bearing self-citations. The defense (random perturbation + aggregation) follows straightforwardly from the empirical finding, without circular redefinition or uniqueness theorems imported from the authors' prior work.
Axiom & Free-Parameter Ledger
free parameters (2)
- perturbation rate
- number of perturbed copies
axioms (1)
- domain assumption: Adversarially-generated prompts are brittle to small character-level perturbations, while benign prompts remain stable
Forward citations
Cited by 24 Pith papers
-
OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing
OrchJail uses orchestration-guided fuzzing to jailbreak tool-calling T2I agents by targeting high-risk tool patterns, yielding higher attack success rates, better image quality, and lower costs than prior prompt-only methods.
-
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...
-
Jailbroken Frontier Models Retain Their Capabilities
Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.
-
Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
-
Adaptive Prompt Embedding Optimization for LLM Jailbreaking
PEO optimizes original prompt embeddings continuously over adaptive rounds to jailbreak aligned LLMs, preserving the exact visible prompt text and outperforming discrete suffix, appended embedding, and search-based wh...
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Behavioral Integrity Verification for AI Agent Skills
BIV audits AI agent skills at scale, finding 80% deviate from declared behavior on 49,943 skills and achieving 0.946 F1 for malicious skill detection.
-
Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer f...
-
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
A foresight-based local purification method simulates future agent interactions, detects infections via response diversity across personas, and applies targeted rollback or recursive diagnosis to cut maximum infection...
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
-
SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs
SafeRedirect reduces average unsafe generation rates in frontier LLMs from 71.2% to 8.0% on Internal Safety Collapse tasks by redirecting task completion with failure permission and deterministic hard stops.
-
Towards Understanding the Robustness of Sparse Autoencoders
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
-
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
-
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
-
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
InjecAgent benchmark demonstrates that tool-integrated LLM agents are vulnerable to indirect prompt injection attacks, with ReAct-prompted GPT-4 succeeding on 24% of attacks and nearly twice that rate when attacker in...
-
Jailbreaking Black Box Large Language Models in Twenty Queries
PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.
-
Re-Triggering Safeguards within LLMs for Jailbreak Detection
Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
-
SoK: Robustness in Large Language Models against Jailbreak Attacks
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
-
SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
-
MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning
MIPIAD reports a hybrid Qwen-TF-IDF ensemble defense that reaches F1 0.9205 and reduces the English-Bangla performance gap on a 1.43-million-sample synthetic benchmark derived from BIPIA templates.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.