{"work":{"id":"b20a57fa-4b7d-40ec-8b6a-ce48234630de","openalex_id":null,"doi":null,"arxiv_id":"1706.06083","raw_key":null,"title":"Towards Deep Learning Models Resistant to Adversarial Attacks","authors":null,"authors_text":"Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu","year":2017,"venue":"stat.ML","abstract":"Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. 
Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.","external_url":"https://arxiv.org/abs/1706.06083","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T12:03:23.744109+00:00","pith_arxiv_id":"1706.06083","created_at":"2026-05-08T23:19:30.084050+00:00","updated_at":"2026-06-29T12:03:23.744109+00:00","title_quality_ok":true,"display_title":"Towards Deep Learning Models Resistant to Adversarial Attacks","render_title":"Towards Deep Learning Models Resistant to Adversarial Attacks"},"hub":{"state":{"work_id":"b20a57fa-4b7d-40ec-8b6a-ce48234630de","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":114,"external_cited_by_count":null,"distinct_field_count":16,"first_pith_cited_at":"2019-06-27T15:22:56+00:00","last_pith_cited_at":"2026-06-23T23:56:21+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T13:08:47.380091+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":21},{"context_role":"method","n":6}],"polarity_counts":[{"context_polarity":"background","n":18},{"context_polarity":"use_method","n":6},{"context_polarity":"unclear","n":3}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Towards Deep Learning Models Resistant to Adversarial Attacks","claims":[{"claim_text":"Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. 
To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us t","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"ial strategies to understand the resilience of the adapted input filter against adaptive adversaries. First, we utilize degradation (Section 6.3) and camouflaging (Section 6.5) datasets constructed to simulate low-quality inputs and semantic obfuscation. Second, we conduct black-box adversarial attacks using standard optimization-based methods, i.e., PGD [22], C&W [2], and I-FGSM [12], to probe the vulnerability (attack budget ε=8/255). Full evaluation results are presented in Figure 9, ","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Attackers add imperceptible perturbation to the test data so that pre-trained models misclassify them with high confidentiality during test time [79][36]. Adversarial attacks are also known as evasive attacks [60]. Popular adversarial attacks include Projected Gradient Descent (PGD) attack [61], Fast Gradient Sign Method (FGSM) attack [41], DeepFool attack [63] and Carlini and Wagner's attack (C&W) [10]. By definition, an adversarial attack is a mapping 𝛼 : 𝑅^𝑛 → 𝑅^𝑛 such that adversarial example 𝛼","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"With the two objectives (L_𝐹𝐸𝐴𝑇 and L_𝐴𝐸𝑆) we thus seek to optimize the defense (𝑔) such that: ∀𝑥∈X arg min 𝑥 L_𝐹𝐸𝐴𝑇(𝑥, 𝑔(𝑥)) + L_𝐴𝐸𝑆(𝑥, 𝑔(𝑥)) (6) 3.4 Technical Infrastructure and Definitions Many prior AML-based obfuscations use iterative methods to generate effective outputs through multiple forward and backward passes over target models (e.g., Projected Gradient Descent (PGD) [66] or the Fast Gradient Sign Method (FGSM) [3]). 
We designed the AuraMask pipeline to instead ta","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"an executable evaluation framework bridging literature gaps directly into engineering. Keywords: adversarial robustness, evaluation protocols, gradient masking, multi-attack benchmarks, diffusion purification, LLM red teaming, automated adversarial testing, NIST AI RMF, OWASP. 1 Introduction Machine-learning models deployed in safety-critical perception [2, 3], language understanding [4], and decision-making [5] are expected to resist adversarially chosen input perturbations. A decade of ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"This mechanism enables suppression of adversarial perturbations at the state-preparation stage while preserving clean-input performance to a large extent. The proposed method is evaluated on representative QML models under gradient-based adversarial attacks, including the fast gradient sign method (FGSM) [20] and projected gradient descent (PGD) [21], across standard image-classification datasets. Both clean and adversarial settings are considered in order to examine the trade-off between cle","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Adversarial attacks on machine learning models have been extensively studied since their discovery by Szegedy et al. [16], who demonstrated that imperceptible perturbations could cause misclassification in deep neural networks. Goodfellow et al. [17] introduced the FGSM, a computationally efficient single-step attack that exploits the gradient of the loss function. Building upon this work, Madry et al. 
[18] proposed PGD, an iterative variant that represents one of the strongest first-order adv","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Towards Deep Learning Models Resistant to Adversarial Attacks because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (19 contexts).","role_counts":[{"n":19,"context_role":"background"},{"n":6,"context_role":"method"}]},"error":null,"updated_at":"2026-05-23T20:04:43.394746+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"84b2d820-2c8c-4293-9b0c-b2b447e92395","orcid":null,"display_name":"Aleksander Madry"},{"id":"a0038a54-1c8f-4a56-ae9d-4a00be9e5b71","orcid":null,"display_name":"Aleksandar Makelov"},{"id":"6243bbb8-9ffa-4e88-a6ea-f6e62b61b3d6","orcid":null,"display_name":"Ludwig Schmidt"},{"id":"33a0ad28-2bf8-458a-ac4c-6ffb97e032c2","orcid":null,"display_name":"Dimitris Tsipras"},{"id":"6db4f481-ab75-4b49-90ba-d4703cdc5961","orcid":null,"display_name":"Adrian Vladu"}]},"error":null,"updated_at":"2026-05-23T20:04:43.958894+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T12:50:36.756184+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Explaining and Harnessing Adversarial Examples","work_id":"2cedf8f6-7539-4c49-8136-f42a20487146","shared_citers":36},{"title":"Intriguing properties of neural networks","work_id":"7bcd9f41-780c-4b4b-9a08-830d4177cdd8","shared_citers":16},{"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","work_id":"3322fa86-1768-4677-8425-dd326b45e078","shared_citers":11},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image 
Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":5},{"title":"arXiv preprint arXiv:1705.07204 , year=","work_id":"4413e6f2-14c8-4676-b759-21878dff7ef4","shared_citers":5},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":5},{"title":"Fast is better than free: Revisiting adversarial training","work_id":"e11d047b-8ddf-4719-85e6-4fc5c8007d71","shared_citers":5},{"title":"HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal","work_id":"b0b0303f-2444-4789-a979-8153624312ff","shared_citers":5},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":5},{"title":"On evaluating adversarial robustness","work_id":"69d1d412-7fed-4f08-97f9-f9c9cb67fdc9","shared_citers":5},{"title":"PennyLane: Automatic differentiation of hybrid quantum-classical computations","work_id":"83078d0b-6c02-4fc5-822d-4da4204fd057","shared_citers":5},{"title":"Are aligned neural networks adversarially aligned?","work_id":"cbf9eae3-40c5-419a-9f5b-335eab5a2bb2","shared_citers":4},{"title":"arXiv preprint arXiv:2010.09670 , year=","work_id":"8ae4b2b2-a2da-4900-9021-ad64ae1b860f","shared_citers":4},{"title":"Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models","work_id":"3a388840-f6da-42b0-9587-1b76e05d547d","shared_citers":4},{"title":"Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!","work_id":"8b07137a-7175-4dd1-a8d9-570493d3f404","shared_citers":4},{"title":"Jailbreaking Black Box Large Language Models in Twenty Queries","work_id":"38678cda-6595-4ca3-916b-066c00cce063","shared_citers":4},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":4},{"title":"Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and 
Lessons Learned","work_id":"1aabd84d-3779-4ba9-ba2f-15ce264a9b1e","shared_citers":4},{"title":"Red Teaming Language Models with Language Models","work_id":"d1274c54-508f-42f9-aeb3-91db13f3a622","shared_citers":4},{"title":"Wide Residual Networks","work_id":"1b918c80-6bca-4d06-8019-569626fb1cf2","shared_citers":4},{"title":"and Rosenfeld, Elan and Kolter, J","work_id":"280ae7d2-f220-4fbb-a597-951ee6af3099","shared_citers":3},{"title":"Autoprompt: Eliciting knowledge from language models with automatically generated prompts","work_id":"1b968c9c-9b73-409a-aa88-57e02917679c","shared_citers":3},{"title":"Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms","work_id":"6714d44f-1b5e-4141-9450-ea09a7e724b0","shared_citers":3}],"time_series":[{"n":2,"year":2023},{"n":51,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T13:00:51.428501+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T12:50:46.759870+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Towards Deep Learning Models Resistant to Adversarial Attacks","claims":[{"claim_text":"Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. 
To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us t","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"ial strategies to understand the resilience of the adapted input filter against adaptive adversaries. First, we utilize degradation (Section 6.3) and camouflaging (Section 6.5) datasets constructed to simulate low-quality inputs and semantic obfuscation. Second, we conduct black-box adversarial attacks using standard optimization-based methods, i.e., PGD [22], C&W [2], and I-FGSM [12], to probe the vulnerability (attack budget ε=8/255). Full evaluation results are presented in Figure 9, ","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Attackers add imperceptible perturbation to the test data so that pre-trained models misclassify them with high confidentiality during test time [79][36]. Adversarial attacks are also known as evasive attacks [60]. Popular adversarial attacks include Projected Gradient Descent (PGD) attack [61], Fast Gradient Sign Method (FGSM) attack [41], DeepFool attack [63] and Carlini and Wagner's attack (C&W) [10]. By definition, an adversarial attack is a mapping 𝛼 : 𝑅^𝑛 → 𝑅^𝑛 such that adversarial example 𝛼","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"With the two objectives (L_𝐹𝐸𝐴𝑇 and L_𝐴𝐸𝑆) we thus seek to optimize the defense (𝑔) such that: ∀𝑥∈X arg min 𝑥 L_𝐹𝐸𝐴𝑇(𝑥, 𝑔(𝑥)) + L_𝐴𝐸𝑆(𝑥, 𝑔(𝑥)) (6) 3.4 Technical Infrastructure and Definitions Many prior AML-based obfuscations use iterative methods to generate effective outputs through multiple forward and backward passes over target models (e.g., Projected Gradient Descent (PGD) [66] or the Fast Gradient Sign Method (FGSM) [3]). 
We designed the AuraMask pipeline to instead ta","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"an executable evaluation framework bridging literature gaps directly into engineering. Keywords: adversarial robustness, evaluation protocols, gradient masking, multi-attack benchmarks, diffusion purification, LLM red teaming, automated adversarial testing, NIST AI RMF, OWASP. 1 Introduction Machine-learning models deployed in safety-critical perception [2, 3], language understanding [4], and decision-making [5] are expected to resist adversarially chosen input perturbations. A decade of ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"This mechanism enables suppression of adversarial perturbations at the state-preparation stage while preserving clean-input performance to a large extent. The proposed method is evaluated on representative QML models under gradient-based adversarial attacks, including the fast gradient sign method (FGSM) [20] and projected gradient descent (PGD) [21], across standard image-classification datasets. Both clean and adversarial settings are considered in order to examine the trade-off between cle","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Adversarial attacks on machine learning models have been extensively studied since their discovery by Szegedy et al. [16], who demonstrated that imperceptible perturbations could cause misclassification in deep neural networks. Goodfellow et al. [17] introduced the FGSM, a computationally efficient single-step attack that exploits the gradient of the loss function. Building upon this work, Madry et al. 
[18] proposed PGD, an iterative variant that represents one of the strongest first-order adv","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Towards Deep Learning Models Resistant to Adversarial Attacks because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (19 contexts).","role_counts":[{"n":19,"context_role":"background"},{"n":6,"context_role":"method"}]},"error":null,"updated_at":"2026-05-23T20:04:43.963440+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Towards Deep Learning Models Resistant to Adversarial Attacks","claims":[{"claim_text":"Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us t","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Towards Deep Learning Models Resistant to Adversarial Attacks because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T13:00:47.520625+00:00"}},"summary":{"title":"Towards Deep Learning Models Resistant to Adversarial Attacks","claims":[{"claim_text":"Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. 
In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us t","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Towards Deep Learning Models Resistant to Adversarial Attacks because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Explaining and Harnessing Adversarial Examples","work_id":"2cedf8f6-7539-4c49-8136-f42a20487146","shared_citers":36},{"title":"Intriguing properties of neural networks","work_id":"7bcd9f41-780c-4b4b-9a08-830d4177cdd8","shared_citers":16},{"title":"Universal and Transferable Adversarial Attacks on Aligned Language Models","work_id":"3322fa86-1768-4677-8425-dd326b45e078","shared_citers":11},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":5},{"title":"arXiv preprint arXiv:1705.07204 , year=","work_id":"4413e6f2-14c8-4676-b759-21878dff7ef4","shared_citers":5},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":5},{"title":"Fast is better than free: Revisiting adversarial training","work_id":"e11d047b-8ddf-4719-85e6-4fc5c8007d71","shared_citers":5},{"title":"HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal","work_id":"b0b0303f-2444-4789-a979-8153624312ff","shared_citers":5},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat 
Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":5},{"title":"On evaluating adversarial robustness","work_id":"69d1d412-7fed-4f08-97f9-f9c9cb67fdc9","shared_citers":5},{"title":"PennyLane: Automatic differentiation of hybrid quantum-classical computations","work_id":"83078d0b-6c02-4fc5-822d-4da4204fd057","shared_citers":5},{"title":"Are aligned neural networks adversarially aligned?","work_id":"cbf9eae3-40c5-419a-9f5b-335eab5a2bb2","shared_citers":4},{"title":"arXiv preprint arXiv:2010.09670 , year=","work_id":"8ae4b2b2-a2da-4900-9021-ad64ae1b860f","shared_citers":4},{"title":"Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models","work_id":"3a388840-f6da-42b0-9587-1b76e05d547d","shared_citers":4},{"title":"Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!","work_id":"8b07137a-7175-4dd1-a8d9-570493d3f404","shared_citers":4},{"title":"Jailbreaking Black Box Large Language Models in Twenty Queries","work_id":"38678cda-6595-4ca3-916b-066c00cce063","shared_citers":4},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":4},{"title":"Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned","work_id":"1aabd84d-3779-4ba9-ba2f-15ce264a9b1e","shared_citers":4},{"title":"Red Teaming Language Models with Language Models","work_id":"d1274c54-508f-42f9-aeb3-91db13f3a622","shared_citers":4},{"title":"Wide Residual Networks","work_id":"1b918c80-6bca-4d06-8019-569626fb1cf2","shared_citers":4},{"title":"and Rosenfeld, Elan and Kolter, J","work_id":"280ae7d2-f220-4fbb-a597-951ee6af3099","shared_citers":3},{"title":"Autoprompt: Eliciting knowledge from language models with automatically generated prompts","work_id":"1b968c9c-9b73-409a-aa88-57e02917679c","shared_citers":3},{"title":"Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning 
Algorithms","work_id":"6714d44f-1b5e-4141-9450-ea09a7e724b0","shared_citers":3}],"time_series":[{"n":2,"year":2023},{"n":51,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"6db4f481-ab75-4b49-90ba-d4703cdc5961","orcid":null,"display_name":"Adrian Vladu","source":"manual","import_confidence":0.72},{"id":"a0038a54-1c8f-4a56-ae9d-4a00be9e5b71","orcid":null,"display_name":"Aleksandar Makelov","source":"manual","import_confidence":0.72},{"id":"84b2d820-2c8c-4293-9b0c-b2b447e92395","orcid":null,"display_name":"Aleksander Madry","source":"manual","import_confidence":0.72},{"id":"33a0ad28-2bf8-458a-ac4c-6ffb97e032c2","orcid":null,"display_name":"Dimitris Tsipras","source":"manual","import_confidence":0.72},{"id":"6243bbb8-9ffa-4e88-a6ea-f6e62b61b3d6","orcid":null,"display_name":"Ludwig Schmidt","source":"manual","import_confidence":0.72}]}}