Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
super hub Mixed citations
Towards Deep Learning Models Resistant to Adversarial Attacks
Mixed citation behavior. Most common role is background (68%).
abstract
Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us t
authors
co-cited works
representative citing papers
Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.
First DTW-certified robust anomaly detection for time series via randomized smoothing adapted through an l_p-to-DTW lower-bound transformation.
FPR manipulation attack perturbs benign MQTT packets to flip labels to attacks in NIDS with 80-100% success, increasing SOC delays without gradient-based methods.
A^4D detects adversarial attacks in an attack- and classifier-agnostic way by measuring non-arbitrary shifts in CLIP embedding space from prompt-based similarity scores.
Abstraction-refinement framework with SHAP-guided timestep selection improves certified robustness verification success and margin tightness for RNNs over abstraction-only baselines.
DeBias-Attack corrects surrogate-specific bias in adversarial gradients for VLP models by subtracting the projection from a reference branch optimized on weak-semantic images.
First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.
Develops the first AHAD method using ARAB regularization and Lipschitz-forcing perturbations to produce one energy-efficient signal that evades multiple unknown benchmark HAD detectors.
High-noise feature drift distinguishes adversarial from clean inputs in CLIP, allowing a plug-in gating mechanism to selectively trigger existing test-time defenses and raise mean clean+adversarial accuracy across 13 datasets.
Concept-level adversarial attacks exploit CBM interpretability on the CUB dataset, but SPECTRA raises required perturbation norm from 0.46 to over 4200 while keeping accuracy loss under 2.2%.
PROBE improves AIGI detector generalization to unseen generators by using the detector as a critic to steer manifold-level modifications that produce challenging training samples.
CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.
Derives ODE limits of Adam-DA showing that first- and second-order momentum parameters reverse their convergence roles in zero-sum games compared to minimization, validated on GAN experiments.
A reusable framework generates verification instances with provably known robustness labels, revealing numeric tolerance issues and bugs in five verifiers while introducing difficulty profiles to diagnose failure modes.
AIM is a new saliency-guided adversarial feature replacement method to evaluate faithfulness of saliency maps and reliability of masking operators on image, audio, and EEG tasks.
AuraMask produces 40 aesthetic anti-facial recognition filters that match or exceed prior adversarial effectiveness and achieve significantly higher user acceptance in a 630-person study.
GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.
LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.
HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.
MSP quantifies the minimum changes to analyst choices required to falsify a causal claim by making its confidence interval contain zero, providing information orthogonal to dispersion-based robustness summaries.
QIBP adapts interval bound propagation to quantum neural networks for certified adversarial robustness via interval and affine arithmetic implementations.
Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
citing papers explorer
-
A Classifier-Agnostic Zero-Shot Adversarial Attack Detection via CLIP
A^4D detects adversarial attacks in an attack- and classifier-agnostic way by measuring non-arbitrary shifts in CLIP embedding space from prompt-based similarity scores.
-
Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction
DeBias-Attack corrects surrogate-specific bias in adversarial gradients for VLP models by subtracting the projection from a reference branch optimized on weak-semantic images.
-
Beyond False Stability: High-Noise Drift Gating for Test-Time Adversarial Defenses in Vision-Language Models
High-noise feature drift distinguishes adversarial from clean inputs in CLIP, allowing a plug-in gating mechanism to selectively trigger existing test-time defenses and raise mean clean+adversarial accuracy across 13 datasets.
-
Where Detectors Fail: Probing Generative Space for Generalizable AI-Generated Image Detection
PROBE improves AIGI detector generalization to unseen generators by using the detector as a critic to steer manifold-level modifications that produce challenging training samples.
-
AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters
AuraMask produces 40 aesthetic anti-facial recognition filters that match or exceed prior adversarial effectiveness and achieve significantly higher user acceptance in a 630-person study.
-
GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization
GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.
-
Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification
FogFool creates fog-based adversarial perturbations using Perlin noise optimization to achieve high black-box transferability (83.74% TASR) and robustness to defenses in remote sensing classification.
-
Learning Robustness at Test-Time from a Non-Robust Teacher
A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.
-
Adversarial Video Promotion Against Text-to-Video Retrieval
Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
-
LatentStealth: Unnoticeable and Efficient Adversarial Attacks on Expressive Human Pose and Shape Estimation
LatentStealth is a latent-space optimization method that produces imperceptible yet effective adversarial attacks on expressive human pose and shape estimation models using few output queries.
-
Visual Adversarial Attack on Vision-Language Models for Autonomous Driving
ADvLM is the first visual adversarial attack framework for VLMs in autonomous driving, using semantic-invariant induction via LLM-generated prompt libraries and scenario-associated attention-based enhancement to achieve SOTA attack effectiveness across benchmarks and real-world tests.
-
Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation
RAHA applies rank-aware hyperbolic alignment to vision-language dataset distillation by enforcing geodesic alignment in the shared low-rank range and regularizing the residual subspace for improved transfer.
-
Sensitivity as a Double-Edged Sword: A Trade-off Between Discriminability and Adversarial Robustness
Identifies sensitivity as the source of both discriminability and vulnerability in FC classifiers versus robustness in l2 classifiers, and introduces HPM prototype fusion plus MSA evaluation to improve adversarial robustness.
-
RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes
RoboStressBench decomposes visual stress into four physically grounded dimensions to benchmark VLM robustness in embodied scenes and proposes a stress-aware solver.
-
Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models
Introduces Closed-Loop Bidirectional Prompting with Semantic Anchor for cross-modal agreement recovery, claiming SOTA adversarial robustness and generalization on 11 datasets.
-
Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models
Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.
-
Compositional Adversarial Training for Robust Visual Watermarking
CAT trains watermark detectors against adaptive compositional adversaries using differentiable attack selection, yielding up to 63.5% capacity gains on hard attacks versus random-augmentation baselines.
-
Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations
X-Shift is a grey-box attack that perturbs patch-level visual features in VLMs to shift explanation heatmaps without changing the predicted output.
-
Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness
MAPR improves adversarial robustness in 3D point cloud networks by aligning latent predictions with intrinsic manifold geometry via curvature/diffusion features and a consistency loss.
-
Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models
TriPatch generates transferable physical adversarial patches via multi-stage triplet loss, appearance consistency, and data augmentation to achieve higher attack success rates on pedestrian detectors than prior methods.
-
FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods
The FastAT Benchmark standardizes evaluation of over twenty fast adversarial training methods under unified conditions, showing that well-designed single-step approaches can match or exceed PGD-AT robustness at lower training cost on CIFAR-10, CIFAR-100, and Tiny-ImageNet.
-
If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems
LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.
-
Compression as an Adversarial Amplifier Through Decision Space Reduction
Compression acts as an adversarial amplifier by reducing the decision space of image classifiers, making attacks in compressed representations substantially more effective than pixel-space attacks under the same perturbation budget.
-
Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation
SAAD adaptively weights adversarial training samples by their transferability to the teacher, yielding higher AutoAttack robustness than prior distillation methods on CIFAR and Tiny-ImageNet without extra compute.
-
Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection
Orthogonal subspace decomposition via SVD on vision foundation model features preserves high-rank pre-trained knowledge by freezing principal components and adapting residuals, reducing overfitting for better generalization in AI-generated image detection.
-
Comprehensive Robustness Analysis of LiDAR-based 3D Object Detection in Autonomous Driving
Recent LiDAR 3D detectors remain as vulnerable to adversarial attacks as predecessors, with voxel-based and non-anchor-based models showing greater susceptibility under a multi-factor robustness framework.
-
VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models
VOID defeats mimicry in LDMs via stochasticity manipulation in the diffusion pipeline, raising average FID from 113 to 365 across evaluations.
-
Hallucination-Aware Diffusion Sampling for Inverse Problems via Robust Prior Updates
RPU stabilizes the prior update in diffusion inverse solvers to reduce measurement-conditioned hallucinations, with reported gains on FFHQ and ImageNet.
-
Structure-Guided Visual Perturbation Neutralization for LVLMs
SIGN is a new defense framework for LVLMs that neutralizes adversarial perturbations with over 87% success rate using 0.5% pixel modification and 0.16 seconds per image while preserving model performance.
-
Personalized Face Privacy Protection From a Single Image
FaceCloak learns a lightweight identity-specific cloaking mask from a single image via synthetic face generation and iterative embedding perturbation to evade multiple recognition models.
-
Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models
Black-box attacks, especially Pixle, reach 99-100% success on Arabic handwriting ConvNet models across two benchmark datasets while preserving character structure.
-
Adversarial Flow Matching for Imperceptible Attacks on End-to-End Autonomous Driving
AFM is a novel gray-box adversarial attack using flow matching to create visually imperceptible perturbations that degrade performance of Vision-Language-Action and modular end-to-end autonomous driving models while showing strong cross-model transferability.
-
Causal Fingerprints of AI Generative Models
Proposes causal fingerprints via causality-decoupling in pre-trained diffusion residual latent space for improved source attribution across GANs and diffusion models.
-
Affine Disentangled GAN for Interpretable and Robust AV Perception
ADIS-GAN disentangles affine transformations in a GAN to achieve over 98% classification accuracy on MNIST within 30 degrees rotation and over 90% under FGSM and PGD attacks while generating rotation and scaling factors.
-
AEGIS: A Semantic GAN and Evidential Learning Frameworkfor Robust Adversarial Detection in Vision Sensors
AEGIS combines SemantiGAN filtering with evidential learning on five handcrafted instability metrics to detect adversarial attacks, reporting 92.1% AUROC on Tiny ImageNet across six attack types.
-
Implicit Fuzzification via Bounded Noise Injection for Robust Medical Image Segmentation
NoiseUNet injects bounded perturbations into U-Net skip connections to implicitly create soft fuzzy memberships for improved segmentation accuracy and boundary fidelity on ambiguous medical images such as thyroid ultrasound.
-
Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models
Perceptual quality metrics correlate strongly with each other but show minimal correlation with attack success rate across medical imaging models and datasets, making ASR alone inadequate for assessing adversarial robustness.