Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
Towards Deep Learning Models Resistant to Adversarial Attacks
Mixed citation behavior. Most common role is background (67%).
abstract
Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.
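The robust-optimization view mentioned in the abstract is commonly written as a saddle-point problem, min over θ of E[ max over δ in S of L(θ, x + δ, y) ], and the first-order adversary is usually instantiated as projected gradient descent (PGD) on the inner maximization. The following is a minimal sketch of an l_inf PGD attack, not the authors' released code: the toy linear logistic model, the function names `loss_and_grad` and `pgd_linf`, and the values of `eps`, `step`, and `iters` are all illustrative assumptions chosen so the example runs with the standard library alone.

```python
import math
import random


def loss_and_grad(w, b, x, y):
    """Logistic loss for a label y in {-1, +1} and its gradient w.r.t. the input x."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = math.log1p(math.exp(-margin))
    # d/dx log(1 + exp(-margin)) = -y * sigmoid(-margin) * w
    coeff = -y / (1.0 + math.exp(margin))
    return loss, [coeff * wi for wi in w]


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pgd_linf(w, b, x, y, eps=0.3, step=0.05, iters=20, seed=0):
    """Approximately maximize the loss over the l_inf ball of radius eps around x.

    eps, step, and iters are illustrative values (assumptions), not tuned settings.
    """
    rng = random.Random(seed)
    # Random start inside the allowed perturbation set.
    adv = [xi + rng.uniform(-eps, eps) for xi in x]
    for _ in range(iters):
        _, grad = loss_and_grad(w, b, adv, y)
        # Gradient-ascent step in the steepest l_inf direction (sign of the gradient).
        adv = [ai + step * sign(gi) for ai, gi in zip(adv, grad)]
        # Project back onto the l_inf ball around the clean input.
        adv = [min(max(ai, xi - eps), xi + eps) for ai, xi in zip(adv, x)]
    return adv


if __name__ == "__main__":
    w, b = [1.0, -2.0], 0.1      # hypothetical model parameters
    x, y = [0.5, -0.5], 1        # hypothetical clean example and label
    x_adv = pgd_linf(w, b, x, y)
    print("clean loss:", loss_and_grad(w, b, x, y)[0])
    print("adversarial loss:", loss_and_grad(w, b, x_adv, y)[0])
```

For a linear model the inner maximum has a closed form (each coordinate moves by `eps` against the label-weighted sign of `w`), so this sketch is easy to check by hand. In adversarial training, the perturbed inputs returned by such an attack would replace the clean inputs in the outer minimization over the model parameters.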
representative citing papers
Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.
First DTW-certified robust anomaly detection for time series via randomized smoothing adapted through an l_p-to-DTW lower-bound transformation.
FPR manipulation attack perturbs benign MQTT packets to flip labels to attacks in NIDS with 80-100% success, increasing SOC delays without gradient-based methods.
A^4D detects adversarial attacks in an attack- and classifier-agnostic way by measuring non-arbitrary shifts in CLIP embedding space from prompt-based similarity scores.
First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
Develops the first AHAD method using ARAB regularization and Lipschitz-forcing perturbations to produce one energy-efficient signal that evades multiple unknown benchmark HAD detectors.
High-noise feature drift distinguishes adversarial from clean inputs in CLIP, allowing a plug-in gating mechanism to selectively trigger existing test-time defenses and raise mean clean+adversarial accuracy across 13 datasets.
Concept-level adversarial attacks exploit CBM interpretability on the CUB dataset, but SPECTRA raises required perturbation norm from 0.46 to over 4200 while keeping accuracy loss under 2.2%.
PROBE improves AIGI detector generalization to unseen generators by using the detector as a critic to steer manifold-level modifications that produce challenging training samples.
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
Derives ODE limits of Adam-DA showing that first- and second-order momentum parameters reverse their convergence roles in zero-sum games compared to minimization, validated on GAN experiments.
A reusable framework generates verification instances with provably known robustness labels, revealing numeric tolerance issues and bugs in five verifiers while introducing difficulty profiles to diagnose failure modes.
AIM is a new saliency-guided adversarial feature replacement method to evaluate faithfulness of saliency maps and reliability of masking operators on image, audio, and EEG tasks.
AuraMask produces 40 aesthetic anti-facial recognition filters that match or exceed prior adversarial effectiveness and achieve significantly higher user acceptance in a 630-person study.
GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.
LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
MSP quantifies the minimum changes to analyst choices required to falsify a causal claim by making its confidence interval contain zero, providing information orthogonal to dispersion-based robustness summaries.
QIBP adapts interval bound propagation to quantum neural networks for certified adversarial robustness via interval and affine arithmetic implementations.
Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Provides the first systematic generalization analysis via algorithmic stability for single-timescale and two-timescale stochastic gradient descent-ascent in bilevel minimax problems.
Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.
citing papers explorer
- A Speculative GLRT-Backed Approach: Robust Deep Learning-Based Array Processing
A speculative DL classifier validated by GLRT on spatially robust second-order statistics provides adversarially resilient array processing.
- Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
FDA differentially subtracts function-word cross-attention from original attention heads to cut attack success rates by 18-90% across models and tasks while dropping performance by at most 0.6%.
- Adversarial Video Promotion Against Text-to-Video Retrieval
Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
- SPRINT: Robust Model Attribution of Generated Images via Secret Pixel Reconstruction
SPRINT achieves over 99% attribution accuracy on FFHQ images across multiple model pools while reducing adaptive attack success rates to 1% or below by keeping verification targets secret.
- Experimental robustness benchmarking of quantum neural networks on a superconducting quantum processor
Experimental runs on a superconducting quantum processor demonstrate that 20-qubit quantum neural networks are more resistant to adversarial attacks than classical networks, with adversarial training further improving robustness and empirical bounds closely matching theory.
- LatentStealth: Unnoticeable and Efficient Adversarial Attacks on Expressive Human Pose and Shape Estimation
LatentStealth is a latent-space optimization method that produces imperceptible yet effective adversarial attacks on expressive human pose and shape estimation models using few output queries.
- Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation
SAAD adaptively weights adversarial training samples by their transferability to the teacher, yielding higher AutoAttack robustness than prior distillation methods on CIFAR and Tiny-ImageNet without extra compute.
- Adversarial Attacks on Downstream Weather Forecasting Models: Application to Tropical Cyclone Trajectory Prediction
Cyc-Attack uses a differentiable surrogate for TC detection, a skewness-aware loss, and gradient weighting to perturb DLWF inputs and steer downstream TC trajectory predictions toward specified targets with higher success and lower detectability than prior attacks.
- LeakyCLIP: Extracting Training Data from CLIP
LeakyCLIP reconstructs images from CLIP embeddings with over 258% SSIM gain versus baselines and enables membership inference from reconstruction metrics on LAION-2B data.
- A first-order method for nonconvex-nonconcave minimax problems under a local Kurdyka-Lojasiewicz condition
An inexact proximal gradient algorithm with complexity bounds for finding approximate stationary points in minimax problems under local varying KL conditions on the inner problem.
- FABLE: A Localized, Targeted Adversarial Attack on Weather Forecasting Models
FABLE applies 3D discrete wavelet decomposition to generate localized adversarial perturbations that steer deep learning weather forecasting models toward chosen forecast outcomes while keeping inputs close to the originals.
- REVERB-FL: Server-Side Adversarial and Reserve-Enhanced Federated Learning for Robust Audio Classification
REVERB-FL uses a server-side reserve set with retraining and adversarial training to reduce poisoning effects and speed convergence in federated audio classification under non-IID data.
- Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings
Generative purification with consensus aggregation reduces adversarial illusion attack success rates to near zero on ImageBind while improving alignment on both clean and attacked inputs.
- Causal Fingerprints of AI Generative Models
Proposes causal fingerprints via causality-decoupling in pre-trained diffusion residual latent space for improved source attribution across GANs and diffusion models.
- Catastrophic Overfitting, Entropy Gap and Participation Ratio: A Noiseless $l^p$ Norm Solution for Fast Adversarial Training
An adaptive l^p norm control in FGSM adversarial training, guided by participation ratio and entropy of gradients, mitigates catastrophic overfitting without noise or regularization.
- Adversarial Robustness in One-Stage Learning-to-Defer