pith. machine review for the scientific record.

arxiv: 1706.06083 · v4 · submitted 2017-06-19 · 📊 stat.ML · cs.LG · cs.NE

Recognition: no theorem link

Towards Deep Learning Models Resistant to Adversarial Attacks

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 13:45 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · cs.NE
keywords adversarial robustness · robust optimization · deep neural networks · adversarial training · projected gradient descent · first-order adversary · security guarantee

The pith

Framing adversarial robustness as robust optimization allows training of neural networks with significantly improved resistance to a wide range of attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that treating adversarial robustness as a robust optimization problem unifies much prior work and yields reliable methods for both training and attacking neural networks. These methods come with a concrete security guarantee that holds against any adversary within a defined set of allowed perturbations. By approximating the inner maximization with projected gradient descent, the approach produces networks that resist many existing attacks far better than before. The authors present security against a first-order adversary as a practical and broad guarantee, positioning it as a necessary step toward models that are fully resistant to well-defined adversary classes.

Core claim

We study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models.

What carries the argument

The min-max robust optimization objective that trains the model against the worst-case loss inside a bounded perturbation set, with the inner maximization solved approximately by projected gradient descent.
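The inner maximization that carries this objective can be sketched in a few lines. Below is a minimal numpy sketch of the l_inf sign-PGD ascent, run against a toy quadratic loss standing in for the network loss; the loss function, epsilon, step size, and starting point are illustrative assumptions of this sketch, not the paper's experimental settings.

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=0.3, alpha=0.05, steps=50):
    """Approximate max_{||delta||_inf <= eps} L(x + delta) by projected
    gradient ascent: move along the sign of the loss gradient, then clip
    back into the l_inf ball of radius eps around the clean input x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))  # ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)         # projection onto the ball
    return x_adv

# Toy stand-in for the network loss: L(x) = 0.5 * ||x - t||^2, so the
# gradient is (x - t) and the worst case pushes x away from t.
target = np.array([1.0, -1.0])
loss = lambda x: 0.5 * np.sum((x - target) ** 2)
grad = lambda x: x - target

x0 = target + 0.01          # clean point, slightly off the loss minimum
x_adv = pgd_linf(x0, grad)  # ascent saturates the perturbation budget
```

On this toy loss the ascent drives the perturbation to the boundary of the l_inf ball (x_adv ends at x0 + eps in every coordinate), strictly increasing the loss; for a deep network the same loop only approximates the inner maximum.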

If this is right

  • Trained networks exhibit significantly improved resistance to a wide range of adversarial attacks.
  • Security against any first-order adversary within the perturbation set becomes a concrete and verifiable guarantee.
  • Robustness to well-defined adversary classes serves as a stepping stone toward fully resistant deep learning models.
  • Both training and attack procedures become reliable and universal in the sense of providing the same concrete guarantee.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the first-order guarantee holds in practice, successive strengthening of the adversary definition could yield incrementally harder-to-attack models.
  • The same min-max formulation could be applied to other supervised learning settings to test whether the robustness improvement generalizes beyond image classification.
  • Real-world use would require checking whether the robustness gain persists when the perturbation set is expanded to match actual deployment conditions.

Load-bearing premise

Projected gradient descent sufficiently approximates the worst-case adversarial perturbation inside the allowed set, so that training against these approximations yields robustness to the true adversary.
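The premise can be exercised in a setting where it holds exactly. For a linear model the worst-case l_inf perturbation has a closed form, delta* = -eps * y * sign(w), so the first-order inner maximizer is the true one and min-max training is exact. The numpy sketch below, with data, learning rate, and epsilon chosen purely for illustration, trains such a model against its worst case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative binary task: labels y in {-1, +1}, inputs x = y * mu + noise.
n, d, eps = 200, 5, 0.2
mu = np.ones(d)
y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * mu + 0.3 * rng.standard_normal((n, d))

def adv_train(x, y, eps, lr=0.1, epochs=200):
    """Min-max training of a linear score w.x under logistic loss. For a
    linear model the inner max over ||delta||_inf <= eps is solved exactly
    by delta* = -eps * y * sign(w), so each outer step descends the true
    worst-case loss rather than a PGD approximation of it."""
    w = np.zeros(x.shape[1])
    for _ in range(epochs):
        delta = -eps * y[:, None] * np.sign(w)        # exact inner maximizer
        margins = y * ((x + delta) @ w)               # worst-case margins
        g = -(y / (1.0 + np.exp(margins)))[:, None] * (x + delta)
        w -= lr * g.mean(axis=0)                      # outer minimization step
    return w

w = adv_train(x, y, eps)
# Robust accuracy: correct even under the worst-case l_inf perturbation,
# whose margin penalty for a linear model is eps * ||w||_1.
robust_acc = float((y * (x @ w) - eps * np.abs(w).sum() > 0).mean())
```

Because the inner maximizer is exact here, the approximation gap the premise worries about is zero; for a deep network PGD only approximates delta*, which is precisely what makes the premise load-bearing.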

What would settle it

An experiment in which a network trained with this projected gradient descent procedure is still successfully attacked by an adversary that uses a qualitatively different search strategy or higher-order information.
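A toy version of that settling experiment, assuming nothing from the paper beyond l_inf-bounded sign-PGD: on a non-concave inner objective (constructed here for illustration), PGD from one start point stalls at a local maximum while a second start finds a strictly stronger perturbation, which is exactly the failure mode such an experiment would expose.

```python
import numpy as np

# Illustrative non-concave inner objective on the l_inf ball [-1, 1]:
# f(d) = d^3 + d^2 has a local maximum at d = -2/3 (f = 4/27, about 0.148)
# and its global maximum on the ball at the boundary d = +1 (f = 2).
f = lambda d: d**3 + d**2
grad = lambda d: 3 * d**2 + 2 * d

def sign_pgd(d0, steps=200, alpha=0.01, eps=1.0):
    """l_inf sign-PGD ascent on f, projected back onto [-eps, eps]."""
    d = d0
    for _ in range(steps):
        d = np.clip(d + alpha * np.sign(grad(d)), -eps, eps)
    return d

stuck = sign_pgd(-0.1)   # ascent from the left stalls near d = -2/3
found = sign_pgd(+0.1)   # ascent from the right reaches the boundary d = +1
```

An attacker that restarts from several points, or searches in a qualitatively different way, recovers the stronger perturbation at d = +1; a model evaluated only against the search that got stuck would look far more robust than it is.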

Original abstract

Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples: inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript frames adversarial robustness of deep neural networks as a robust optimization problem (minimax over model parameters and perturbations within an l_p ball). It proposes solving the inner maximization via projected gradient descent (PGD) both for generating attacks and for adversarial training, reports substantially improved resistance to FGSM and other attacks on MNIST and CIFAR-10, and suggests that robustness to first-order adversaries constitutes a natural, broad security guarantee.

Significance. If the empirical results hold, the work supplies a clean, unifying robust-optimization lens on adversarial training, releases reproducible code and models, and supplies a concrete, testable notion of security against a well-defined adversary class. These are genuine strengths that have influenced subsequent research.

major comments (2)
  1. [Robust optimization formulation and PGD attack procedure] The central security claim rests on PGD (with its chosen step size and iteration count) being a sufficiently accurate proxy for the true inner maximization max_{||delta||_p <= epsilon} L(theta, x+delta, y). The manuscript provides only empirical evidence that PGD outperforms FGSM; it contains no theoretical bound on the sub-optimality gap nor an ablation comparing PGD to other first-order solvers (e.g., different step-size schedules or restarts). This approximation quality is load-bearing for the asserted “concrete security guarantee” against any first-order adversary.
  2. [Experimental evaluation] Table 1 and the CIFAR-10 results: the reported robustness gains are measured against the same PGD adversary used at training time. Without held-out stronger first-order attacks or a demonstration that the gains persist when the test-time attacker is allowed more iterations or a different optimizer, the claim of resistance to “a wide range of adversarial attacks” remains partially circular.
minor comments (2)
  1. [Abstract] The abstract states that the methods are “reliable and, in a certain sense, universal.” Clarify the precise sense in which universality is claimed, given that the inner maximization is approximated.
  2. [Experimental setup] Notation for the perturbation budget epsilon and the norm p is introduced without an explicit statement of the values used in each experiment; a short table would improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below with clarifications and indicate where revisions will be made to improve the manuscript.

Point-by-point responses
  1. Referee: [Robust optimization formulation and PGD attack procedure] The central security claim rests on PGD (with its chosen step size and iteration count) being a sufficiently accurate proxy for the true inner maximization max_{||delta||_p <= epsilon} L(theta, x+delta, y). The manuscript provides only empirical evidence that PGD outperforms FGSM; it contains no theoretical bound on the sub-optimality gap nor an ablation comparing PGD to other first-order solvers (e.g., different step-size schedules or restarts). This approximation quality is load-bearing for the asserted “concrete security guarantee” against any first-order adversary.

    Authors: We agree that the manuscript relies primarily on empirical evidence rather than theoretical bounds to establish PGD as an effective proxy for the inner maximization. Our results demonstrate that PGD generates substantially stronger attacks than FGSM and that adversarial training with PGD produces models with improved resistance to multiple first-order attacks. We do not claim a provable guarantee that PGD always recovers the exact maximizer; instead, we argue that PGD serves as a strong, practical first-order method that can be used consistently for both attack generation and training. In the revision we will clarify the scope of the security claim to emphasize that it is an empirical guarantee against first-order adversaries using optimization procedures comparable to PGD. We will also add ablations examining alternative step-size schedules and multiple random restarts to the supplementary material. revision: partial

  2. Referee: [Experimental evaluation] Table 1 and the CIFAR-10 results: the reported robustness gains are measured against the same PGD adversary used at training time. Without held-out stronger first-order attacks or a demonstration that the gains persist when the test-time attacker is allowed more iterations or a different optimizer, the claim of resistance to “a wide range of adversarial attacks” remains partially circular.

    Authors: We acknowledge that the primary reported robustness metrics are obtained against the PGD adversary employed during training. The manuscript does include evaluations against FGSM and other attacks not used in training, but we agree that additional held-out tests would strengthen the non-circularity of the claims. In the revised version we will add experiments that evaluate the trained models against PGD variants with substantially more iterations, altered step sizes, and alternative first-order optimizers at test time. These results will be included to demonstrate that the observed robustness improvements generalize beyond the exact training-time adversary. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the robust optimization derivation

full rationale

The paper presents adversarial robustness as the standard saddle-point problem min_θ max_{δ ∈ S} L(θ, x+δ, y) from robust optimization, solved approximately via projected gradient descent on the inner maximization. This is an explicit algorithmic procedure applied to a well-defined objective, not a self-referential definition or fitted parameter renamed as a prediction. Robustness claims are supported by empirical evaluation against held-out attacks (FGSM, etc.) rather than derived from parameters that presuppose the outcome. No load-bearing self-citations, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation are present in the provided text; the unifying view on prior work and the first-order adversary notion follow directly from the min-max formulation without reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard convex optimization assumptions and the empirical adequacy of first-order methods for the inner problem; no new entities are postulated and hyperparameters are tuned on validation data.

free parameters (2)
  • PGD step size
    Hyperparameter controlling the size of each projected gradient step in the inner maximization; chosen via validation.
  • number of PGD iterations
    Hyperparameter determining how many steps are taken to approximate the worst-case adversary; tuned empirically.
axioms (1)
  • domain assumption
    Iterative first-order methods such as PGD provide a sufficiently accurate approximation to the inner maximization over the allowed perturbation ball.
    Invoked when the authors use PGD both to generate attacks and to train the network.

pith-pipeline@v0.9.0 · 5521 in / 1268 out tokens · 64441 ms · 2026-05-11T13:45:21.015760+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  3. Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle

    math.OC 2026-05 unverdicted novelty 8.0

    Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.

  4. Fortifying Time Series: DTW-Certified Robust Anomaly Detection

    cs.LG 2026-05 unverdicted novelty 8.0

    First DTW-certified robust anomaly detection for time series via randomized smoothing adapted through an l_p-to-DTW lower-bound transformation.

  5. AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters

    cs.CV 2026-05 conditional novelty 7.0

    AuraMask produces 40 aesthetic anti-facial recognition filters that match or exceed prior adversarial effectiveness and achieve significantly higher user acceptance in a 630-person study.

  6. GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.

  7. Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization

    cs.LG 2026-05 unverdicted novelty 7.0

    LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.

  8. Inference Time Causal Probing in LLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.

  9. Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference

    stat.ME 2026-05 unverdicted novelty 7.0

    MSP quantifies the minimum changes to analyst choices required to falsify a causal claim by making its confidence interval contain zero, providing information orthogonal to dispersion-based robustness summaries.

  10. Quantum Interval Bound Propagation for Certified Training of Quantum Neural Networks

    quant-ph 2026-05 unverdicted novelty 7.0

    QIBP adapts interval bound propagation to quantum neural networks for certified adversarial robustness via interval and affine arithmetic implementations.

  11. The Power of Order: Fooling LLMs with Adversarial Table Permutations

    cs.LG 2026-05 unverdicted novelty 7.0

    Semantically invariant row and column permutations can fool LLMs on tabular tasks, and a new gradient-based attack called ATP finds such permutations to significantly degrade performance across models.

  12. Low Rank Adaptation for Adversarial Perturbation

    cs.LG 2026-04 unverdicted novelty 7.0

    Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.

  13. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  14. On the Stability and Generalization of First-order Bilevel Minimax Optimization

    cs.LG 2026-04 unverdicted novelty 7.0

    Provides the first systematic generalization analysis via algorithmic stability for single-timescale and two-timescale stochastic gradient descent-ascent in bilevel minimax problems.

  15. Benign Overfitting in Adversarial Training for Vision Transformers

    cs.LG 2026-04 unverdicted novelty 7.0

    Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.

  16. Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification

    cs.CV 2026-04 unverdicted novelty 7.0

    FogFool creates fog-based adversarial perturbations using Perlin noise optimization to achieve high black-box transferability (83.74% TASR) and robustness to defenses in remote sensing classification.

  17. Learning Robustness at Test-Time from a Non-Robust Teacher

    cs.CV 2026-04 unverdicted novelty 7.0

    A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.

  18. STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations

    cs.RO 2026-04 unverdicted novelty 7.0

    STRONG-VLA uses decoupled two-stage training to improve VLA model robustness, yielding up to 16% higher task success rates under seen and unseen perturbations on the LIBERO benchmark.

  19. Can Drift-Adaptive Malware Detectors Be Made Robust? Attacks and Defenses Under White-Box and Black-Box Threats

    cs.CR 2026-04 unverdicted novelty 7.0

    A fine-tuning framework reduces PGD attack success on AdvDA detectors from 100% to 3.2% and MalGuise from 13% to 5.1%, but optimal training strategies differ by threat model and robustness does not transfer across them.

  20. Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

    cs.AI 2026-04 unverdicted novelty 7.0

    PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.

  21. Fair Conformal Classification via Learning Representation-Based Groups

    cs.LG 2026-05 unverdicted novelty 6.0

    A fair conformal classification method guarantees conditional coverage on adaptively identified subgroups defined via learned representations.

  22. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

  23. Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

    cs.CR 2026-05 unverdicted novelty 6.0

    DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

  24. "Training robust watermarking model may hurt authentication!" Exploring and Mitigating the Identity Leakage in Robust Watermarking

    cs.CR 2026-05 unverdicted novelty 6.0

    W-IR is the first watermarking framework to combine certified robustness via randomized smoothing in pixel and coordinate spaces with identity leakage mitigation via residual information loss minimization.

  25. Efficient Verification of Neural Control Barrier Functions with Smooth Nonlinear Activations

    cs.LG 2026-05 unverdicted novelty 6.0

    LightCROWN computes tighter Jacobian bounds for neural networks with smooth nonlinear activations by exploiting their analytical properties, raising verification success rates for neural control barrier functions up t...

  26. Uncovering Hidden Systematics in Neural Network Models for High Energy Physics

    cs.LG 2026-05 unverdicted novelty 6.0

    Neural networks for HEP tasks can be fooled at significant rates by subtle perturbations inside uncertainty envelopes, revealing hidden systematics not captured by conventional methods.

  27. Band Together: Untargeted Adversarial Training with Multimodal Coordination against Evasion-based Promotion Attacks

    cs.LG 2026-05 unverdicted novelty 6.0

    UAT-MC improves defense against evasion promotion attacks in multimodal recommenders by aligning gradients across modalities during untargeted adversarial training.

  28. Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

    cs.AI 2026-05 unverdicted novelty 6.0

    An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...

  29. Detecting Adversarial Data via Provable Adversarial Noise Amplification

    cs.LG 2026-05 unverdicted novelty 6.0

    A provable adversarial noise amplification theorem under sufficient conditions enables a custom-trained detector that identifies adversarial examples at inference time using enhanced layer-wise noise signals.

  30. Stability and Generalization for Decentralized Markov SGD

    cs.LG 2026-05 unverdicted novelty 6.0

    Decentralized SGD and SGDA under Markovian sampling admit non-asymptotic generalization bounds that incorporate network topology, Markov mixing rates, and primal-dual dynamics.

  31. LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training

    cs.CR 2026-05 unverdicted novelty 6.0

    LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.

  32. VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

    cs.CR 2026-05 conditional novelty 6.0

    Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.

  33. The Power of Order: Fooling LLMs with Adversarial Table Permutations

    cs.LG 2026-05 unverdicted novelty 6.0

    Semantically invariant row and column permutations in tables can cause LLMs to output incorrect answers, and a gradient-based attack called ATP efficiently finds such permutations that degrade performance across many models.

  34. Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders

    quant-ph 2026-04 unverdicted novelty 6.0

    A quantum autoencoder purifies adversarial perturbations for quantum classifiers and supplies a confidence score for unrecoverable inputs, claiming up to 68% accuracy gains over prior defenses without adversarial training.

  35. Controlled Steering-Based State Preparation for Adversarial-Robust Quantum Machine Learning

    quant-ph 2026-04 unverdicted novelty 6.0

    A passive steering method for quantum state preparation improves adversarial accuracy in QML models by up to 40% across tested cases.

  36. When AI reviews science: Can we trust the referee?

    cs.AI 2026-04 unverdicted novelty 6.0

    AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...

  37. Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models

    cs.CV 2026-04 unverdicted novelty 6.0

    TriPatch generates transferable physical adversarial patches via multi-stage triplet loss, appearance consistency, and data augmentation to achieve higher attack success rates on pedestrian detectors than prior methods.

  38. FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods

    cs.CV 2026-04 conditional novelty 6.0

    The FastAT Benchmark standardizes evaluation of over twenty fast adversarial training methods under unified conditions, showing that well-designed single-step approaches can match or exceed PGD-AT robustness at lower ...

  39. If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

    cs.CV 2026-04 unverdicted novelty 6.0

    LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.

  40. Representation-Guided Parameter-Efficient LLM Unlearning

    cs.CL 2026-04 unverdicted novelty 6.0

    REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

  41. Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.

  42. Quantum Patches: Enhancing Robustness of Quantum Machine Learning Models

    quant-ph 2026-04 unverdicted novelty 6.0

    Random quantum circuits used as adversarial training data reduce successful attack rates on QML models for CIFAR-10 from 89.8% to 68.45% and for CINIC-10 from 94.23% to 78.68%.

  43. Compression as an Adversarial Amplifier Through Decision Space Reduction

    cs.CV 2026-04 unverdicted novelty 6.0

    Compression acts as an adversarial amplifier by reducing the decision space of image classifiers, making attacks in compressed representations substantially more effective than pixel-space attacks under the same pertu...

  44. Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models

    cs.CR 2026-04 unverdicted novelty 6.0

    Introduces a text-guided backdoor attack using common textual words as triggers and visual perturbations for stealthy, adjustable control on multimodal pretrained models.

  45. Agent-Sentry: Bounding LLM Agents via Execution Provenance

    cs.CR 2026-03 unverdicted novelty 6.0

    Agent-Sentry bounds LLM agent executions via structural provenance classification, sensitive-value allowlists, and selective LLM judgment, blocking 94.3% of injections while allowing 95.1% of benign actions on AgentDo...

  46. Shapes are not enough: CONSERVAttack and its use for finding vulnerabilities and uncertainties in machine learning applications

    cs.LG 2026-03 unverdicted novelty 6.0

    CONSERVAttack creates adversarial perturbations in HEP ML models that respect uncertainty bounds but cause misclassifications, revealing gaps in current validation practices.

  47. Causally Sufficient and Necessary Feature Expansion for Class-Incremental Learning

    cs.LG 2026-03 unverdicted novelty 6.0

    CPNS regularization with dual counterfactual generators mitigates intra-task and inter-task spurious correlations in class-incremental learning feature expansion.

  48. Jailbreaking Black Box Large Language Models in Twenty Queries

    cs.LG 2023-10 conditional novelty 6.0

    PAIR uses an attacker LLM to iteratively craft effective jailbreak prompts for black-box target LLMs in fewer than 20 queries.

  49. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    cs.LG 2023-10 accept novelty 6.0

    SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

  50. Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    cs.LG 2023-09 conditional novelty 6.0

    Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

  51. Medical Model Synthesis Architectures: A Case Study

    cs.AI 2026-05 unverdicted novelty 5.0

    MedMSA framework retrieves knowledge via language models then builds formal probabilistic models to produce uncertainty-weighted differential diagnoses from symptoms.

  52. Machine Learning Enhanced Laser Spectroscopy for Multi-Species Gas Detection in Complex and Harsh Environments

    physics.optics 2026-05 unverdicted novelty 5.0

    Machine learning methods including denoising autoencoders, unsupervised interference mitigation, blind source separation, and certifiable classification are developed and experimentally validated to improve multi-spec...

  53. Adversarial Flow Matching for Imperceptible Attacks on End-to-End Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 5.0

    AFM is a novel gray-box adversarial attack using flow matching to create visually imperceptible perturbations that degrade performance of Vision-Language-Action and modular end-to-end autonomous driving models while s...

  54. Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing

    cs.CR 2026-04 unverdicted novelty 5.0

    Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp rob...

  55. NeuroTrace: Inference Provenance-Based Detection of Adversarial Examples

    cs.CR 2026-04 unverdicted novelty 5.0

    NeuroTrace framework builds heterogeneous graphs of inference provenance to detect adversarial examples in DNNs, showing strong transferable performance across attack families in vision and malware domains.

  56. QShield: Securing Neural Networks Against Adversarial Attacks using Quantum Circuits

    cs.CR 2026-04 unverdicted novelty 5.0

    Hybrid quantum-classical models using structured entanglement keep high accuracy on MNIST, OrganAMNIST and CIFAR-10 while lowering adversarial attack success rates and raising the computational cost of generating attacks.

  57. Real-Time Evaluation of Autonomous Systems under Adversarial Attacks

    cs.AI 2026-05 unverdicted novelty 4.0

    A framework trains and compares MLP, transformer, and GAIL-based trajectory models on real driving data, finding that architectural differences cause large variations in robustness to PGD attacks despite similar nomin...

  58. Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models

    cs.CV 2026-04 unverdicted novelty 4.0

    Perceptual quality metrics correlate strongly with each other but show minimal correlation with attack success rate across medical imaging models and datasets, making ASR alone inadequate for assessing adversarial robustness.

  59. Adversarial Robustness Analysis of Cloud-Assisted Autonomous Driving Systems

    cs.RO 2026-04 unverdicted novelty 4.0

    Adversarial attacks on cloud perception models plus network impairments in a vehicle-cloud loop degrade object detection from 0.73/0.68 to 0.22/0.15 precision/recall and destabilize closed-loop vehicle control.

  60. The Luna Bound Propagator for Formal Analysis of Neural Networks

    cs.LG 2026-03 conditional novelty 4.0

    Luna delivers a C++ bound propagator supporting interval, DeepPoly/CROWN, and alpha-CROWN analyses that reports tighter bounds and higher speed than the leading Python alpha-CROWN implementation on VNN-COMP 2025 benchmarks.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 61 Pith papers

  1. [1]

    Robust optimization

    Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization. Princeton University Press, 2009

  2. [2]

    Evasion attacks against machine learning at test time

    Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases (ECML PKDD), 2013

  3. [3]

    Wild patterns: Ten years after the rise of adversarial machine learning

    Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 2018

  4. [4]

    Decision-based adversarial attacks: Reliable attacks against black-box machine learning models

    Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations (ICLR), 2018

  5. [5]

    Adversarial examples are not easily detected: Bypassing ten detection methods

    Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Workshop on Artificial Intelligence and Security (AISec), 2017

  6. [6]

    Towards evaluating the robustness of neural networks

    Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Symposium on Security and Privacy (SP), 2017

  7. [7]

    A unified architecture for natural language processing: Deep neural networks with multitask learning

    Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning, pages 160–167, 2008

  8. [8]

    Adversarial classification

    Nilesh Dalvi, Pedro Domingos, Sumit Sanghai, and Deepak Verma. Adversarial classification. In international conference on Knowledge discovery and data mining, 2004

  9. [9]

    Analysis of classifiers’ robustness to adversarial perturbations

    Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Analysis of classifiers’ robustness to adversarial perturbations. Machine Learning, 107(3):481–508, 2018

  10. [10]

    Nightmare at test time: robust learning by feature deletion

    Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd international conference on Machine learning, 2006

  11. [11]

    Explaining and harnessing adversarial examples

    Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015

  12. [12]

    Delving deep into rectifiers: Surpassing human-level performance on imagenet classification

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In international conference on computer vision (ICCV), 2015

  13. [13]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  14. [14]

    Adversarial example defense: Ensembles of weak defenses are not strong

    Warren He, James Wei, Xinyun Chen, Nicholas Carlini, and Dawn Song. Adversarial example defense: Ensembles of weak defenses are not strong. In USENIX Workshop on Offensive Technologies (WOOT), 2017

  15. [15]

    Learning with a strong adversary

    Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvari. Learning with a strong adversary. arXiv preprint arXiv:1511.03034, 2015

  16. [16]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009

  17. [17]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2012

  18. [18]

    Adversarial machine learning at scale

    Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In International Conference on Learning Representations (ICLR), 2017

  19. [19]

    The mnist database of handwritten digits

    Yann LeCun. The mnist database of handwritten digits. Technical report, 1998

  20. [20]

    Second-order adversarial attack and certifiable robustness

    Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Second-order adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113, 2018

  21. [21]

    Deepfool: a simple and accurate method to fool deep neural networks

    Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2016

  22. [22]

    Deep neural networks are easily fooled: High confidence predictions for unrecognizable images

    Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Conference on computer vision and pattern recognition (CVPR), 2015

  23. [23]

    Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples

    Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016

  24. [24]

    Distillation as a defense to adversarial perturbations against deep neural networks

    Nicolas Papernot, Patrick D. McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Symposium on Security and Privacy (SP), 2016

  25. [25]

    Towards the first adversarially robust neural network model on MNIST

    Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations (ICLR), 2019

  26. [26]

    Understanding adversarial training: Increasing local stability of supervised models through robust optimization

    Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of supervised models through robust optimization. Neurocomputing, 307:195–204, 2018

  27. [27]

    Robust large margin deep neural networks

    Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 2017

  28. [28]

    Intriguing properties of neural networks

    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014

  29. [29]

    The Space of Transferable Adversarial Examples

    Florian Tramer, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017

  30. [30]

    Statistical decision functions which minimize the maximum risk

    Abraham Wald. Statistical decision functions which minimize the maximum risk. Annals of Mathematics, 1945

  31. [31]

    Feature squeezing: Detecting adversarial examples in deep neural networks

    Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. In Network and Distributed Systems Security Symposium (NDSS), 2018

    Appendix A (A Statement and Application of Danskin's Theorem) recalls that the goal is to minimize the value of the saddle point problem min_θ ρ(θ), where ρ(θ) = E_{(x,y)∼D}[ max_{δ∈S} L(θ, x + δ, y) ]
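The paper approximates the inner maximization of this saddle point problem with projected gradient descent (PGD) over an ℓ∞ ball of allowed perturbations. A minimal NumPy sketch of that inner loop, under the assumption of a generic differentiable loss supplied as a gradient function (the function names and the toy quadratic loss below are illustrative, not the paper's code):

```python
import numpy as np

def pgd_linf(x, grad_fn, eps=0.3, alpha=0.01, steps=40):
    """Ascend the loss inside an l_inf ball of radius eps around x.

    grad_fn(x_adv) returns the gradient of the loss w.r.t. the input.
    Each iteration takes a signed gradient step, then projects the
    iterate back onto the eps-ball via elementwise clipping.
    """
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + alpha * np.sign(g)        # FGSM-style ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the ball
    return x_adv

# Toy inner maximization: L(x) = ||x - t||^2 for a fixed target t,
# so the adversary pushes x away from t until it hits the ball boundary.
t = np.array([1.0, -1.0])
x0 = np.zeros(2)
adv = pgd_linf(x0, grad_fn=lambda x: 2 * (x - t), eps=0.3)
# adv ends at [-0.3, 0.3], the corner of the eps-ball farthest from t.
```

In adversarial training proper, the outer minimization would then update θ on the loss at `adv`; the sketch isolates only the attack side of the saddle point.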