Towards Deep Learning Models Resistant to Adversarial Attacks
Mixed citation behavior. Most common role is background (67%).
abstract
Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples: inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.
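The abstract names robust optimization and a first-order adversary without spelling either out. One common reading is a saddle-point problem: the defender minimizes the expected loss over model parameters, while the adversary maximizes the loss over perturbations inside a small ball around each input. The inner maximization is typically approximated with projected gradient descent (PGD). The sketch below is only an illustration under stated assumptions: a toy logistic-regression model stands in for a deep network, the perturbation set is an l_inf ball, and the function names, step size, and iteration count are hypothetical choices, not taken from the paper's released code.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_input_grad(w, b, x, y):
    """Logistic loss of a linear model and its gradient with respect to the input x.

    Assumption: a toy stand-in for a neural network; y is 0.0 or 1.0.
    """
    p = sigmoid(x @ w + b)
    tiny = 1e-12  # avoids log(0)
    loss = -(y * np.log(p + tiny) + (1.0 - y) * np.log(1.0 - p + tiny))
    grad_x = (p - y) * w  # d(loss)/d(x) for the logistic loss
    return loss, grad_x


def pgd_linf(w, b, x, y, eps=0.3, step=0.05, iters=20, rng=None):
    """Approximate the inner maximization with PGD inside an l_inf ball of radius eps.

    eps, step, and iters are illustrative values, not tuned settings.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Random start inside the allowed perturbation set.
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        _, grad_x = loss_and_input_grad(w, b, x_adv, y)
        # Ascent step: move each coordinate in the direction that increases the loss.
        x_adv = x_adv + step * np.sign(grad_x)
        # Projection: clip back into the l_inf ball around the clean input.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv


if __name__ == "__main__":
    w = np.array([1.0, -2.0, 0.5])
    b = 0.1
    x = np.array([0.2, -0.4, 1.0])
    y = 1.0
    x_adv = pgd_linf(w, b, x, y)
    print("clean loss:", loss_and_input_grad(w, b, x, y)[0])
    print("adversarial loss:", loss_and_input_grad(w, b, x_adv, y)[0])
```

Adversarial training in this framing would then update the model parameters on the loss at `x_adv` instead of at `x`. The attack uses only gradients of the loss, which is the sense in which the adversary is "first-order".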
citing papers explorer
- On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
- Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle
Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.
- Fortifying Time Series: DTW-Certified Robust Anomaly Detection
First DTW-certified robust anomaly detection for time series via randomized smoothing adapted through an l_p-to-DTW lower-bound transformation.
- Uncovering and Understanding FPR Manipulation Attack in Industrial IoT Networks
The FPR manipulation attack perturbs benign MQTT packets so that a NIDS labels them as attacks, succeeding 80-100% of the time and increasing SOC delays without gradient-based methods.
- A Classifier-Agnostic Zero-Shot Adversarial Attack Detection via CLIP
A^4D detects adversarial attacks in an attack- and classifier-agnostic way by measuring non-arbitrary shifts in CLIP embedding space from prompt-based similarity scores.
- When Interpretability Becomes a Liability: Adversarial Attacks on CBM Concept Layers
Concept-level adversarial attacks exploit CBM interpretability on the CUB dataset, but SPECTRA raises required perturbation norm from 0.46 to over 4200 while keeping accuracy loss under 2.2%.
- Where Detectors Fail: Probing Generative Space for Generalizable AI-Generated Image Detection
PROBE improves AIGI detector generalization to unseen generators by using the detector as a critic to steer manifold-level modifications that produce challenging training samples.
- Codec-Robust Attacks on Audio LLMs
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
- Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach
Derives ODE limits of Adam-DA showing that first- and second-order momentum parameters reverse their convergence roles in zero-sum games compared to minimization, validated on GAN experiments.
- Stress-Testing Neural Network Verifiers with Provably Robust Instances
A reusable framework generates verification instances with provably known robustness labels, revealing numeric tolerance issues and bugs in five verifiers while introducing difficulty profiles to diagnose failure modes.
- AIM: Adversarial Information Masking for Faithfulness Evaluation of Saliency Maps
AIM is a new saliency-guided adversarial feature replacement method to evaluate faithfulness of saliency maps and reliability of masking operators on image, audio, and EEG tasks.
- AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters
AuraMask produces 40 aesthetic anti-facial recognition filters that match or exceed prior adversarial effectiveness and achieve significantly higher user acceptance in a 630-person study.
- GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization
GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.
- Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization
LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.
- Inference Time Causal Probing in LLMs
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
- Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference
MSP quantifies the minimum changes to analyst choices required to falsify a causal claim by making its confidence interval contain zero, providing information orthogonal to dispersion-based robustness summaries.
- Quantum Interval Bound Propagation for Certified Training of Quantum Neural Networks
QIBP adapts interval bound propagation to quantum neural networks for certified adversarial robustness via interval and affine arithmetic implementations.
- Low Rank Adaptation for Adversarial Perturbation
Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
- A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
- On the Stability and Generalization of First-order Bilevel Minimax Optimization
Provides the first systematic generalization analysis via algorithmic stability for single-timescale and two-timescale stochastic gradient descent-ascent in bilevel minimax problems.
- Benign Overfitting in Adversarial Training for Vision Transformers
Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.
- Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification
FogFool creates fog-based adversarial perturbations using Perlin noise optimization to achieve high black-box transferability (83.74% TASR) and robustness to defenses in remote sensing classification.
- Learning Robustness at Test-Time from a Non-Robust Teacher
A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
- STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations
STRONG-VLA uses decoupled two-stage training to improve VLA model robustness, yielding up to 16% higher task success rates under seen and unseen perturbations on the LIBERO benchmark.
- Can Drift-Adaptive Malware Detectors Be Made Robust? Attacks and Defenses Under White-Box and Black-Box Threats
A fine-tuning framework reduces PGD attack success on AdvDA detectors from 100% to 3.2% and MalGuise from 13% to 5.1%, but optimal training strategies differ by threat model and robustness does not transfer across them.
- Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements
PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.
- A Speculative GLRT-Backed Approach: Robust Deep Learning-Based Array Processing
A speculative DL classifier validated by GLRT on spatially robust second-order statistics provides adversarially resilient array processing.
- Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
FDA differentially subtracts function-word cross-attention from original attention heads to cut attack success rates by 18-90% across models and tasks while dropping performance by at most 0.6%.
- Adversarial Video Promotion Against Text-to-Video Retrieval
Introduces ViPro, the first attack that adversarially promotes videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
- SPRINT: Robust Model Attribution of Generated Images via Secret Pixel Reconstruction
SPRINT achieves over 99% attribution accuracy on FFHQ images across multiple model pools while reducing adaptive attack success rates to 1% or below by keeping verification targets secret.
- Experimental robustness benchmarking of quantum neural networks on a superconducting quantum processor
Experimental runs on a superconducting quantum processor demonstrate that 20-qubit quantum neural networks are more resistant to adversarial attacks than classical networks, with adversarial training further improving robustness and empirical bounds closely matching theory.
- LatentStealth: Unnoticeable and Efficient Adversarial Attacks on Expressive Human Pose and Shape Estimation
LatentStealth is a latent-space optimization method that produces imperceptible yet effective adversarial attacks on expressive human pose and shape estimation models using few output queries.
- Visual Adversarial Attack on Vision-Language Models for Autonomous Driving
ADvLM is the first visual adversarial attack framework for VLMs in autonomous driving, using semantic-invariant induction via LLM-generated prompt libraries and scenario-associated attention-based enhancement to achieve SOTA attack effectiveness across benchmarks and real-world tests.
- LipKernel: Lipschitz-Bounded Convolutional Neural Networks via Dissipative Layers
LipKernel parameterizes dissipative convolution kernels via 2-D Roesser state-space models so that layer-wise LMIs enforce network Lipschitz bounds while allowing standard fast convolution evaluation after training.
- Towards Generalized Certified Robustness with Multi-Norm Training
CURE is the first multi-norm certified training method that improves union robustness across l_p norms and unseen perturbations on MNIST, CIFAR-10 and TinyImagenet.
- Stateful Detection of Black-Box Adversarial Attacks
The paper argues for stateful defenses over stateless ones to detect adversarial example generation via query history and introduces query blinding as a counter-attack.
- Triospect: A Three-Dimensional Framework for Robust Statistical AI-Generated Text Detection Against Diverse Attacks
Triospect combines statistical, content, and expression views to detect AI text more robustly, reporting AUROC gains of 22.3% and 9.1% on two attacked benchmarks across 17 attacks and 17 models.
- Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation
RAHA applies rank-aware hyperbolic alignment to vision-language dataset distillation by enforcing geodesic alignment in the shared low-rank range and regularizing the residual subspace for improved transfer.
- Blackknife: Hard-Label Query-Limited Black-Box Attacks on Heterogeneous Graph Neural Networks
Blackknife is a hard-label query-limited structure-limited black-box evasion attack framework for HGNNs that builds a local surrogate, relaxes edge operations to continuous weights, optimizes via PGD, and discretizes to relation-preserving rewirings.
- Tensor-Based Batch Fuzzing with Adaptive Perturbation Scaling for Deep Neural Networks
A tensor-based batch fuzzing framework with adaptive perturbation scaling from specification ranges achieves up to 40X higher throughput and 4X more detected violations than sequential baselines on DNN benchmarks.
- Sensitivity as a Double-Edged Sword: A Trade-off Between Discriminability and Adversarial Robustness
Identifies sensitivity as the source of both discriminability and vulnerability in FC classifiers versus robustness in l2 classifiers, and introduces HPM prototype fusion plus MSA evaluation to improve adversarial robustness.
- RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes
RoboStressBench decomposes visual stress into four physically grounded dimensions to benchmark VLM robustness in embodied scenes and proposes a stress-aware solver.
- Benchmarking Bilevel Derivative-Free Optimization Algorithms
Introduces a refereeing procedure and full computational cost accounting to improve benchmarking fairness for bilevel derivative-free optimization algorithms.
- Landseer: Exploring the Machine Learning Defense Landscape
Landseer offers a containerized modular system to integrate and evaluate combinations of machine learning defenses, with an initial analysis of 35 defenses highlighting replicability challenges.
- Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models
Introduces Closed-Loop Bidirectional Prompting with Semantic Anchor for cross-modal agreement recovery, claiming SOTA adversarial robustness and generalization on 11 datasets.
- Certified Robustness from Approximate Gaussian Mixture Structures in Pretrained Latent Spaces
Approximate Gaussian mixture structure in pretrained latent spaces yields certified robustness with graceful degradation bounds.
- Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics
SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.
- Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models
Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.
- Compositional Adversarial Training for Robust Visual Watermarking
CAT trains watermark detectors against adaptive compositional adversaries using differentiable attack selection, yielding up to 63.5% capacity gains on hard attacks versus random-augmentation baselines.
- Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations
X-Shift is a grey-box attack that perturbs patch-level visual features in VLMs to shift explanation heatmaps without changing the predicted output.