Towards Deep Learning Models Resistant to Adversarial Attacks

Adrian Vladu, Aleksandar Makelov, Aleksander Madry, Dimitris Tsipras, Ludwig Schmidt · 2017 · stat.ML · arXiv 1706.06083

Mixed citation behavior. Most common role is background (67%).

145 Pith papers citing it

Background 67% of classified citations

abstract

Recent work has demonstrated that deep neural networks are vulnerable to adversarial examples---inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network. In fact, some of the latest findings suggest that the existence of adversarial attacks may be an inherent weakness of deep learning models. To address this problem, we study the adversarial robustness of neural networks through the lens of robust optimization. This approach provides us with a broad and unifying view on much of the prior work on this topic. Its principled nature also enables us to identify methods for both training and attacking neural networks that are reliable and, in a certain sense, universal. In particular, they specify a concrete security guarantee that would protect against any adversary. These methods let us train networks with significantly improved resistance to a wide range of adversarial attacks. They also suggest the notion of security against a first-order adversary as a natural and broad security guarantee. We believe that robustness against such well-defined classes of adversaries is an important stepping stone towards fully resistant deep learning models. Code and pre-trained models are available at https://github.com/MadryLab/mnist_challenge and https://github.com/MadryLab/cifar10_challenge.
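
The robust-optimization view in the abstract is usually written as a saddle-point problem, min over θ of E[(x,y)~D] max over δ in S of L(θ, x + δ, y), where the inner maximization is the attack and the outer minimization is training. A first-order adversary approximates the inner maximization with projected gradient steps. The block below is a minimal sketch of that inner loop only, assuming a toy linear logistic model and an l_inf ball of radius eps; the function names, the model, and all numeric values are illustrative assumptions, not the authors' code (their implementations are in the repositories linked above).

```python
import numpy as np


def loss_and_grad(w, b, x, y):
    """Logistic loss for a label y in {-1, +1} and its gradient w.r.t. the input x.

    The linear model (w, b) is a hypothetical stand-in for a neural network.
    """
    margin = y * (np.dot(w, x) + b)
    loss = np.log1p(np.exp(-margin))
    # d/dx log(1 + exp(-m)) with m = y * (w.x + b)
    grad_x = -y * w / (1.0 + np.exp(margin))
    return loss, grad_x


def pgd_linf(w, b, x, y, eps=0.1, step=0.02, n_steps=20, seed=0):
    """Projected gradient ascent on the loss inside the l_inf ball of radius eps."""
    rng = np.random.default_rng(seed)
    # Random start inside the allowed perturbation set.
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(n_steps):
        _, g = loss_and_grad(w, b, x_adv, y)
        # Ascent step along the sign of the gradient.
        x_adv = x_adv + step * np.sign(g)
        # Project back onto the ball around the clean input.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv


# Illustrative values only.
w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.3, -0.2, 0.4])
y = 1

x_adv = pgd_linf(w, b, x, y)
clean_loss, _ = loss_and_grad(w, b, x, y)
adv_loss, _ = loss_and_grad(w, b, x_adv, y)
print(clean_loss, adv_loss)
```

Adversarial training in this framing would feed x_adv, rather than x, into the outer gradient step on the model parameters. For a linear model the inner maximum sits at a corner of the ball, so the loop is only a demonstration of the mechanics; the approach matters for non-linear networks where no closed form exists.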

citation-role summary

background: 21 · method: 6

citation-polarity summary

background: 18 · use method: 6 · unclear: 3

authors

Adrian Vladu, Aleksandar Makelov, Aleksander Madry, Dimitris Tsipras, Ludwig Schmidt

representative citing papers

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle

math.OC · 2026-05-09 · unverdicted · novelty 8.0

Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.

Fortifying Time Series: DTW-Certified Robust Anomaly Detection

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

First DTW-certified robust anomaly detection for time series via randomized smoothing adapted through an l_p-to-DTW lower-bound transformation.

Uncovering and Understanding FPR Manipulation Attack in Industrial IoT Networks

cs.CR · 2026-01-20 · unverdicted · novelty 8.0

FPR manipulation attack perturbs benign MQTT packets to flip labels to attacks in NIDS with 80-100% success, increasing SOC delays without gradient-based methods.

A Classifier-Agnostic Zero-Shot Adversarial Attack Detection via CLIP

cs.CV · 2026-06-29 · unverdicted · novelty 7.0 · 2 refs

A^4D detects adversarial attacks in an attack- and classifier-agnostic way by measuring non-arbitrary shifts in CLIP embedding space from prompt-based similarity scores.

Adversarial Robustness of Activation Steering in Large Language Models

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

First systematic test shows activation steering robustness drops sharply (up to 64%) under adversarial input perturbations across multiple extraction methods, models, and personas.

Anti-Hyperspectral Anomaly Detection: A First Study on Stealthy Lipschitz-Forcing Perturbations Against Unknown Detectors

eess.IV · 2026-06-03 · unverdicted · novelty 7.0

Develops the first AHAD method using ARAB regularization and Lipschitz-forcing perturbations to produce one energy-efficient signal that evades multiple unknown benchmark HAD detectors.

Beyond False Stability: High-Noise Drift Gating for Test-Time Adversarial Defenses in Vision-Language Models

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

High-noise feature drift distinguishes adversarial from clean inputs in CLIP, allowing a plug-in gating mechanism to selectively trigger existing test-time defenses and raise mean clean+adversarial accuracy across 13 datasets.

When Interpretability Becomes a Liability: Adversarial Attacks on CBM Concept Layers

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Concept-level adversarial attacks exploit CBM interpretability on the CUB dataset, but SPECTRA raises required perturbation norm from 0.46 to over 4200 while keeping accuracy loss under 2.2%.

Where Detectors Fail: Probing Generative Space for Generalizable AI-Generated Image Detection

cs.CV · 2026-05-24 · unverdicted · novelty 7.0

PROBE improves AIGI detector generalization to unseen generators by using the detector as a critic to steer manifold-level modifications that produce challenging training samples.

Codec-Robust Attacks on Audio LLMs

cs.SD · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

CodecAttack perturbs audio in codec latent space with multi-bitrate EoT to achieve 85.5% average ASR on Opus-compressed Audio LLMs versus under 26% for waveform baselines, with transfer to MP3 and AAC.

Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Derives ODE limits of Adam-DA showing that first- and second-order momentum parameters reverse their convergence roles in zero-sum games compared to minimization, validated on GAN experiments.

Stress-Testing Neural Network Verifiers with Provably Robust Instances

cs.LG · 2026-05-16 · conditional · novelty 7.0

A reusable framework generates verification instances with provably known robustness labels, revealing numeric tolerance issues and bugs in five verifiers while introducing difficulty profiles to diagnose failure modes.

AIM: Adversarial Information Masking for Faithfulness Evaluation of Saliency Maps

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

AIM is a new saliency-guided adversarial feature replacement method to evaluate faithfulness of saliency maps and reliability of masking operators on image, audio, and EEG tasks.

AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters

cs.CV · 2026-05-13 · conditional · novelty 7.0

AuraMask produces 40 aesthetic anti-facial recognition filters that match or exceed prior adversarial effectiveness and achieve significantly higher user acceptance in a 630-person study.

GaitProtector: Impersonation-Driven Gait De-Identification via Training-Free Diffusion Latent Optimization

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

GaitProtector optimizes diffusion model latents to impersonate target identities in gait sequences, dropping Rank-1 identification accuracy from 89.6% to 15.0% on CASIA-B while keeping scoliosis diagnostic accuracy at 74.2%.

Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.

Inference Time Causal Probing in LLMs

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.

Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference

stat.ME · 2026-05-02 · unverdicted · novelty 7.0

MSP quantifies the minimum changes to analyst choices required to falsify a causal claim by making its confidence interval contain zero, providing information orthogonal to dispersion-based robustness summaries.

Quantum Interval Bound Propagation for Certified Training of Quantum Neural Networks

quant-ph · 2026-05-01 · unverdicted · novelty 7.0

QIBP adapts interval bound propagation to quantum neural networks for certified adversarial robustness via interval and affine arithmetic implementations.

Low Rank Adaptation for Adversarial Perturbation

cs.LG · 2026-04-30 · unverdicted · novelty 7.0

Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

On the Stability and Generalization of First-order Bilevel Minimax Optimization

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Provides the first systematic generalization analysis via algorithmic stability for single-timescale and two-timescale stochastic gradient descent-ascent in bilevel minimax problems.

Benign Overfitting in Adversarial Training for Vision Transformers

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.

citing papers explorer

Showing 27 of 27 citing papers after filters.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
cs.CR · 2026-05-10 · conditional · none · ref 22 · internal anchor
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

Uncovering and Understanding FPR Manipulation Attack in Industrial IoT Networks
cs.CR · 2026-01-20 · unverdicted · none · ref 21 · internal anchor
FPR manipulation attack perturbs benign MQTT packets to flip labels to attacks in NIDS with 80-100% success, increasing SOC delays without gradient-based methods.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR · 2026-04-25 · unverdicted · none · ref 73 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Can Drift-Adaptive Malware Detectors Be Made Robust? Attacks and Defenses Under White-Box and Black-Box Threats
cs.CR · 2026-04-08 · unverdicted · none · ref 27 · internal anchor
A fine-tuning framework reduces PGD attack success on AdvDA detectors from 100% to 3.2% and MalGuise from 13% to 5.1%, but optimal training strategies differ by threat model and robustness does not transfer across them.

SPRINT: Robust Model Attribution of Generated Images via Secret Pixel Reconstruction
cs.CR · 2025-08-06 · unverdicted · none · ref 28 · internal anchor
SPRINT achieves over 99% attribution accuracy on FFHQ images across multiple model pools while reducing adaptive attack success rates to 1% or below by keeping verification targets secret.

Stateful Detection of Black-Box Adversarial Attacks
cs.CR · 2019-07-12 · unverdicted · none · ref 30 · internal anchor
The paper argues for stateful defenses over stateless ones to detect adversarial example generation via query history and introduces query blinding as a counter-attack.

RedEdit: Agentic Red-Teaming of Image Safety Classifiers via MCTS-Guided Photo-Editing
cs.CR · 2026-06-04 · unverdicted · none · ref 1 · internal anchor
RedEdit finds that fewer than two photo edits on average let 76.2% of unsafe images evade detectors while retaining 93.0% of malicious semantics.

Landseer: Exploring the Machine Learning Defense Landscape
cs.CR · 2026-05-26 · unverdicted · none · ref 64 · internal anchor
Landseer offers a containerized modular system to integrate and evaluate combinations of machine learning defenses, with an initial analysis of 35 defenses highlighting replicability challenges.

DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models
cs.CR · 2026-05-15 · unverdicted · none · ref 40 · internal anchor
DarkLLM trains an LLM to generate language-driven adversarial perturbations that unify targeted, untargeted, segmentation, and multi-model attacks on foundation models.

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
cs.CR · 2026-05-11 · unverdicted · none · ref 40 · internal anchor
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

"Training robust watermarking model may hurt authentication!" Exploring and Mitigating the Identity Leakage in Robust Watermarking
cs.CR · 2026-05-10 · unverdicted · none · ref 63 · internal anchor
W-IR is the first watermarking framework to combine certified robustness via randomized smoothing in pixel and coordinate spaces with identity leakage mitigation via residual information loss minimization.

LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training
cs.CR · 2026-05-02 · unverdicted · none · ref 27 · internal anchor
LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.

VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
cs.CR · 2026-05-02 · conditional · none · ref 20 · internal anchor
Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.

Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models
cs.CR · 2026-04-07 · unverdicted · none · ref 28 · internal anchor
Introduces a text-guided backdoor attack using common textual words as triggers and visual perturbations for stealthy, adjustable control on multimodal pretrained models.

Agent-Sentry: Bounding LLM Agents via Execution Provenance
cs.CR · 2026-03-24 · unverdicted · none · ref 21 · internal anchor
Agent-Sentry bounds LLM agent executions via structural provenance classification, sensitive-value allowlists, and selective LLM judgment, blocking 94.3% of injections while allowing 95.1% of benign actions on AgentDojo and AgentDyn.

LeakyCLIP: Extracting Training Data from CLIP
cs.CR · 2025-08-01 · conditional · none · ref 29 · internal anchor
LeakyCLIP reconstructs images from CLIP embeddings with over 258% SSIM gain versus baselines and enables membership inference from reconstruction metrics on LAION-2B data.

Whispers in the Machine: Confidentiality in Agentic Systems
cs.CR · 2024-02-10 · unverdicted · none · ref 48 · internal anchor
Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.

Fooling a Real Car with Adversarial Traffic Signs
cs.CR · 2019-06-30 · unverdicted · none · ref 38 · internal anchor
A reproducible pipeline produces physical adversarial traffic signs that successfully attack production-grade traffic sign recognition systems in a real car under black-box conditions.

Auto-ART: Structured Literature Synthesis and Automated Adversarial Robustness Testing
cs.CR · 2026-04-22 · unverdicted · none · ref 4 · internal anchor
Auto-ART delivers the first structured synthesis of adversarial robustness consensus plus an executable multi-norm testing framework that flags gradient masking in 92% of cases on RobustBench and reveals a 23.5 pp robustness gap.

NeuroTrace: Inference Provenance-Based Detection of Adversarial Examples
cs.CR · 2026-04-15 · unverdicted · none · ref 12 · internal anchor
NeuroTrace framework builds heterogeneous graphs of inference provenance to detect adversarial examples in DNNs, showing strong transferable performance across attack families in vision and malware domains.

QShield: Securing Neural Networks Against Adversarial Attacks using Quantum Circuits
cs.CR · 2026-04-13 · unverdicted · none · ref 33 · internal anchor
Hybrid quantum-classical models using structured entanglement keep high accuracy on MNIST, OrganAMNIST and CIFAR-10 while lowering adversarial attack success rates and raising the computational cost of generating attacks.

Survival of the Cheapest: Cost-Aware Hardware Adaptation for Adversarial Robustness
cs.CR · 2024-09-11 · unverdicted · none · ref 67 · internal anchor
A decision-support framework applies AFT models to show Nvidia L4 GPUs yield 20% longer adversarial survival time at 75% lower cost than V100, with inference latency as the strongest robustness predictor.

Robust Ensemble of Selectively Strengthened and Augmented Predictors
cs.CR · 2026-06-04 · unverdicted · none · ref 13 · internal anchor
RESSAP creates a model-agnostic ensemble of classifiers using resilience-guided feature selection, random subset inference, and noise augmentation to boost robustness to evasion attacks while preserving clean accuracy.

When AI Meets Wall Street: A Survey on Trustworthy AI in Fintech
cs.CR · 2026-05-28 · unverdicted · none · ref 74 · internal anchor
A survey that proposes a lifecycle-centric framework and the Financial AI Security and Robustness Taxonomy to organize 17 attack subtypes on AI pipelines in finance.

Symmetry Defeats Auditing
cs.CR · 2026-05-27 · unverdicted · none · ref 8 · internal anchor
Symmetry enables an attack that defeats introspection adapters for auditing AI systems.

Enabling Adversarial Robustness in AI Models through Kubeflow MLOps
cs.CR · 2026-05-14 · unverdicted · none · ref 23 · internal anchor
A Kubeflow-based MLOps architecture detects FGSM adversarial attacks on deployed AI models and automatically applies PGD-based adversarial training to recover accuracy.

Enhancing Adversarial Robustness in Network Intrusion Detection: A Layer-wise Adaptive Regularization Approach
cs.CR · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
LARAR enhances adversarial robustness in network intrusion detection by using layer-wise adaptive regularization and auxiliary classifiers, achieving 95.01% clean accuracy and improved defense against FGSM, PGD, and transfer attacks on UNSW-NB15.
