pith. machine review for the scientific record.

arxiv: 1412.6572 · v3 · submitted 2014-12-20 · 📊 stat.ML · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Explaining and Harnessing Adversarial Examples

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 04:54 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords adversarial examples · neural networks · adversarial training · linear behavior · gradient sign method · MNIST

The pith

Neural networks are vulnerable to adversarial examples mainly because they behave linearly in their inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the main reason neural networks can be fooled by tiny, carefully chosen changes to inputs is their linear response to those inputs rather than any deep nonlinearity or overfitting. This linear view accounts for why the same perturbed examples often fool many different networks and training regimes. It also supplies a fast way to create such examples by following the sign of the input gradient. When those examples are fed back into training, the network's error rate on clean test data drops.
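To make the summary concrete, here is a minimal sketch of the gradient-sign step and the mixed clean/adversarial loss it feeds. It is written in PyTorch for illustration and is not the paper's original maxout/MNIST code; `model`, `x`, and `y` stand for any differentiable classifier and a labeled batch, and the ε and α defaults are placeholders in the spirit of the paper's MNIST settings rather than reported values.

```python
# Minimal FGSM and adversarial-training-loss sketch (illustrative, not the paper's code).
# Assumes: `model` is a differentiable classifier; `x`, `y` are a labeled input batch.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.25):
    """One gradient-sign step of size epsilon along the sign of the input gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)        # gradient w.r.t. the input only
    x_adv = x + epsilon * grad.sign()           # the linear (first-order) step
    return x_adv.clamp(0.0, 1.0).detach()       # keep pixels in a valid range

def adversarial_training_loss(model, x, y, epsilon=0.25, alpha=0.5):
    """Weighted sum of clean and adversarial loss, in the spirit of the paper's objective."""
    x_adv = fgsm_perturb(model, x, y, epsilon)
    clean = F.cross_entropy(model(x), y)
    adv = F.cross_entropy(model(x_adv), y)
    return alpha * clean + (1 - alpha) * adv
```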

Core claim

The primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This linearity explains why adversarial examples generalize across architectures and training sets, and it directly yields a simple, fast method of generating adversarial examples via a first-order approximation; using those examples for adversarial training lowers test-set error.

What carries the argument

A first-order linear approximation of the network's output with respect to the input, used to select the direction of perturbation that most increases the loss.
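Written out, that approximation and the perturbation it selects look as follows (standard notation; the max-norm budget ε is the paper's, the layout of the equations is ours):

```latex
% First-order expansion of the loss around the input x:
J(\theta, x + \eta, y) \approx J(\theta, x, y) + \eta^{\top} \nabla_x J(\theta, x, y)
% Under the constraint \|\eta\|_\infty \le \epsilon, the linear term is maximized
% coordinate by coordinate, giving the fast gradient sign perturbation:
\eta = \epsilon \, \operatorname{sign}\bigl(\nabla_x J(\theta, x, y)\bigr)
```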

If this is right

  • Adversarial examples generated this way transfer across different network architectures and training sets.
  • Including the generated examples in training reduces test error on the original clean dataset.
  • The same linear approximation explains why the perturbations remain effective even when the network is retrained on different data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If linearity is the root cause, then techniques that enforce stronger local linearity constraints could reduce vulnerability without changing the overall architecture.
  • The approach may extend to other models that exhibit locally linear decision boundaries, such as certain kernel methods or decision trees with linear splits.
  • Defensive training using these examples could be combined with architectural changes that increase curvature to test whether the two strategies are additive.

Load-bearing premise

The network's output changes sufficiently linearly with small input changes that a first-order approximation accurately predicts the effect of a perturbation.
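A direct way to probe this premise, sketched below under the same assumed `model`, `x`, `y` as earlier (our check, not an experiment from the paper): compare the actual loss change under a small sign perturbation with the change the first-order term predicts.

```python
# Sketch of an empirical check on the load-bearing premise: if the network is
# locally linear, the measured loss change and the first-order prediction agree.
# Not part of the paper's experiments; `model`, `x`, `y` are assumed as before.
import torch
import torch.nn.functional as F

def linearity_gap(model, x, y, epsilon=0.05):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)

    eta = epsilon * grad.sign()                  # small perturbation in the FGSM direction
    with torch.no_grad():
        actual = F.cross_entropy(model(x + eta), y) - loss
    predicted = (eta * grad).sum()               # first-order prediction: eta . grad
    return (actual - predicted).abs().item()     # near zero => premise holds at this epsilon
```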

What would settle it

A neural network whose output is demonstrably highly nonlinear for small perturbations yet still produces adversarial examples at rates matching current models, or a linear model that resists them.

read the original abstract

Several machine learning models, including neural networks, consistently misclassify adversarial examples---inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that the vulnerability of neural networks to adversarial examples is primarily due to their linear nature in the input space rather than nonlinearity or overfitting. It supports this via a first-order Taylor approximation motivating the fast gradient sign method for efficient adversarial example generation, shows that such examples transfer across architectures and training sets, and demonstrates that adversarial training reduces test-set error for maxout networks on MNIST.

Significance. If the linearity hypothesis holds in the small-perturbation regime, the work supplies a parsimonious account of cross-model generalization of adversarial examples and yields a computationally cheap attack method plus a practical robustness technique. The MNIST quantitative results are consistent with the claims and the approach has proven influential for subsequent robustness research.

minor comments (3)
  1. [Introduction] The appeal to 'early attempts' that focused on nonlinearity and overfitting would be strengthened by naming the specific prior works being critiqued.
  2. [Fast gradient sign method] In the derivation, a short remark on the range of perturbation magnitudes for which the first-order approximation remains accurate would improve clarity without altering the central argument.
  3. [Experiments] Figure captions should explicitly list the value of epsilon used in each panel to facilitate exact reproduction of the reported error rates.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript, the recognition of its significance in providing a parsimonious explanation for the cross-model generalization of adversarial examples, and the recommendation to accept.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's core argument derives the fast gradient sign method from a first-order Taylor approximation of the loss around the input, J(θ, x + η, y) ≈ J(θ, x, y) + η^T ∇_x J(θ, x, y), which is an explicit linearization assumption stated upfront rather than fitted or self-defined. Maximizing that linear term under a max-norm budget yields the η = ε · sign(∇_x J) perturbation by construction. The claim that linearity is the primary cause is then supported by independent experimental outcomes on MNIST maxout networks (attack success, cross-architecture transfer, and adversarial training gains), none of which loops back to redefine the approximation or relies on self-citation for uniqueness. No enumerated circularity pattern applies; the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that neural networks behave linearly for small perturbations; epsilon is a free parameter controlling perturbation magnitude.

free parameters (1)
  • epsilon
    Perturbation magnitude chosen to control attack strength while keeping changes small (a sweep sketch follows this ledger).
axioms (1)
  • domain assumption: Neural network decision functions are approximately linear in input space near data points.
    Invoked to justify the first-order Taylor expansion used to derive the adversarial perturbation direction.
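Since epsilon is the lone entry under free parameters, one way to exercise the ledger is the sweep referenced above: reuse the `fgsm_perturb` sketch and record the error rate at several magnitudes. `model` and `test_loader` are assumed placeholders, not artifacts of the paper, and the listed epsilon values are illustrative.

```python
# Illustrative sweep over the single free parameter epsilon, reusing the
# fgsm_perturb sketch above. `model` and `test_loader` are assumed placeholders.
import torch

def error_rate_under_fgsm(model, test_loader, epsilon):
    wrong, total = 0, 0
    model.eval()                                 # fixed dropout/batch-norm; gradients still flow
    for x, y in test_loader:
        x_adv = fgsm_perturb(model, x, y, epsilon)
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)
        wrong += (pred != y).sum().item()
        total += y.numel()
    return wrong / total

# e.g. for eps in (0.0, 0.05, 0.1, 0.25): print(eps, error_rate_under_fgsm(model, test_loader, eps))
```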

pith-pipeline@v0.9.0 · 5422 in / 1085 out tokens · 50653 ms · 2026-05-11T04:54:57.437273+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. Online Learning-to-Defer with Varying Experts

    stat.ML 2026-05 unverdicted novelty 8.0

    Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.

  3. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  4. Local LMO: Constrained Gradient Optimization via a Local Linear Minimization Oracle

    math.OC 2026-05 unverdicted novelty 8.0

    Local LMO is a new projection-free method that achieves the convergence rates of projected gradient descent for constrained optimization by using local linear minimization oracles over small balls.

  5. Turn Your Face Into An Attack Surface: Screen Attack Using Facial Reflections in Video Conferencing

    cs.CR 2026-04 unverdicted novelty 8.0

    Facial reflections in video conferencing feeds can be processed to eavesdrop on on-screen application activities at 99.32% accuracy across real devices and environments.

  6. Quantitative Linear Logic for Neuro-Symbolic Learning and Verification

    cs.LO 2026-05 unverdicted novelty 7.0

    QLL is a novel logic for neuro-symbolic learning that uses ML-native operations (sum, log-sum-exp) on logits to embed constraints, satisfying most linear logic properties and showing stronger correlation between empir...

  7. TARO: Temporal Adversarial Rectification Optimization Using Diffusion Models as Purifiers

    cs.LG 2026-05 unverdicted novelty 7.0

    TARO builds a temporally guided score prior from high-noise and low-noise diffusion views to purify adversarial examples more robustly than uniform timestep methods.

  8. Inference Time Causal Probing in LLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    HDMI is a new probe-free technique that steers LLM hidden states via margin objectives to achieve more reliable causal interventions than prior probe-based methods on standard benchmarks.

  9. Streaming Adversarial Robustness in Fuzzy ARTMAP: Mechanism-Aligned Evaluation, Progressive Training, and Interpretable Diagnostics

    cs.LG 2026-05 conditional novelty 7.0

    Fuzzy ARTMAP models are highly vulnerable to a new white-box attack aligned with their category competition, but progressive selective training yields stronger replay-free robustness than offline adversarial training ...

  10. Empirical Evidence for Simply Connected Decision Regions in Image Classifiers

    cs.CV 2026-05 unverdicted novelty 7.0

    Empirical tests with quad-mesh filling indicate that decision regions in modern image classifiers are simply connected.

  11. Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

    cs.CR 2026-05 conditional novelty 7.0

    Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.

  12. Minimum Specification Perturbation: Robustness as Distance-to-Falsification in Causal Inference

    stat.ME 2026-05 unverdicted novelty 7.0

    MSP quantifies the minimum changes to analyst choices required to falsify a causal claim by making its confidence interval contain zero, providing information orthogonal to dispersion-based robustness summaries.

  13. Decision Boundary-aware Generation for Long-tailed Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    DBG mitigates boundary overlap in long-tailed learning by generating near-boundary samples, leading to better tail class accuracy and more separable decision spaces.

  14. Quantum Interval Bound Propagation for Certified Training of Quantum Neural Networks

    quant-ph 2026-05 unverdicted novelty 7.0

    QIBP adapts interval bound propagation to quantum neural networks for certified adversarial robustness via interval and affine arithmetic implementations.

  15. From Local to Global to Mechanistic: An iERF-Centered Unified Framework for Interpreting Vision Models

    cs.CV 2026-05 unverdicted novelty 7.0

    An iERF-centric framework unifies local, global, and mechanistic interpretability in vision models via SRD for saliency, CAFE for concept anchoring, and ICAT for interlayer attribution.

  16. Low Rank Adaptation for Adversarial Perturbation

    cs.LG 2026-04 unverdicted novelty 7.0

    Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.

  17. A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

    cs.CR 2026-04 unverdicted novelty 7.0

    A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

  18. Benign Overfitting in Adversarial Training for Vision Transformers

    cs.LG 2026-04 unverdicted novelty 7.0

    Adversarial training on simplified Vision Transformers achieves benign overfitting with near-zero robust loss and generalization error when signal-to-noise ratio and perturbation budget meet specific conditions.

  19. Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

    cs.LG 2026-04 conditional novelty 7.0

    Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

  20. Duality for the Adversarial Total Variation

    math.AP 2026-04 unverdicted novelty 7.0

    Duality techniques produce a dual representation and subdifferential characterization for the nonlocal total variation functional arising in adversarial training.

  21. Feature-level analysis and adversarial transfer in rotationally equivariant quantum machine learning

    quant-ph 2026-04 unverdicted novelty 7.0

    Rotationally equivariant quantum models can rely on vulnerable invariant statistics such as ring-averaged intensities, leaving them susceptible to classical transfer attacks, but suppressing the associated symmetry se...

  22. Physically-Induced Atmospheric Adversarial Perturbations: Enhancing Transferability and Robustness in Remote Sensing Image Classification

    cs.CV 2026-04 unverdicted novelty 7.0

    FogFool creates fog-based adversarial perturbations using Perlin noise optimization to achieve high black-box transferability (83.74% TASR) and robustness to defenses in remote sensing classification.

  23. Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Continuous adversarial training in the embedding space produces a robust generalization bound for linear transformers that decreases with perturbation radius, tied to singular values of the embedding matrix, and motiv...

  24. Learning Robustness at Test-Time from a Non-Robust Teacher

    cs.CV 2026-04 unverdicted novelty 7.0

    A test-time adaptation framework anchors adversarial training to a non-robust teacher's predictions, yielding more stable optimization and better robustness-accuracy trade-offs than standard self-consistency methods.

  25. Efficient Unlearning through Maximizing Relearning Convergence Delay

    cs.LG 2026-04 unverdicted novelty 7.0

    The Influence Eliminating Unlearning framework maximizes relearning convergence delay via weight decay and noise injection to remove the influence of a forgetting set while preserving accuracy on retained data.

  26. On the Decompositionality of Neural Networks

    cs.LO 2026-04 unverdicted novelty 7.0

    Neural decompositionality is defined via decision-boundary semantic preservation, and language transformers largely satisfy it under SAVED while vision models often do not.

  27. Can Drift-Adaptive Malware Detectors Be Made Robust? Attacks and Defenses Under White-Box and Black-Box Threats

    cs.CR 2026-04 unverdicted novelty 7.0

    A fine-tuning framework reduces PGD attack success on AdvDA detectors from 100% to 3.2% and MalGuise from 13% to 5.1%, but optimal training strategies differ by threat model and robustness does not transfer across them.

  28. Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.

  29. Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

    cs.AI 2026-04 unverdicted novelty 7.0

    PrecisionDiff is a differential testing framework that uncovers widespread precision-induced behavioral disagreements in aligned LLMs, including safety-critical jailbreak divergences across precision formats.

  30. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

    cs.CR 2017-12 unverdicted novelty 7.0

    Injecting around 50 poisoned samples with a stealthy trigger creates backdoors in deep learning models achieving over 90% attack success under a weak threat model with no model or data knowledge required.

  31. Concrete Problems in AI Safety

    cs.AI 2016-06 accept novelty 7.0

    The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

  32. Quantitative Linear Logic for Neuro-Symbolic Learning and Verification

    cs.LO 2026-05 unverdicted novelty 6.0

    Quantitative Linear Logic interprets logical connectives via natural ML operations on logits to embed constraints in neural training while satisfying most linear logic laws and correlating performance with independent...

  33. Feature Visualization Recovers Known Cortical Selectivity from TRIBE v2

    q-bio.NC 2026-05 unverdicted novelty 6.0

    Feature visualization on TRIBE v2 brain encoders recovers the known ventral visual hierarchy from V1 to V4 and produces distinctive patterns for MT, FFA, and PPA, with optimized stimuli driving ~4x higher activation t...

  34. ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder

    cs.LG 2026-05 unverdicted novelty 6.0

    ASD-Bench evaluates 17 ML and deep learning models on 4,068 AQ-10 records across child, adolescent, and adult cohorts, showing high adult performance, harder adolescent classification, shifting feature importance, and...

  35. Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

    cs.CV 2026-05 unverdicted novelty 6.0

    UJEM-KL improves cross-model transferability of untargeted jailbreaks on vision-language models by maximizing entropy at decision tokens instead of forcing specific outputs.

  36. Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

    cs.CR 2026-05 unverdicted novelty 6.0

    DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

  37. The Propagation Field: A Geometric Substrate Theory of Deep Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and reduced forgetting i...

  38. Beyond Defenses: Manifold-Aligned Regularization for Intrinsic 3D Point Cloud Robustness

    cs.CV 2026-05 unverdicted novelty 6.0

    MAPR aligns latent and intrinsic geometries in 3D point cloud models via regularization on curvature and diffusion features plus consistency loss, yielding +20% average robustness gains on ModelNet40 without adversari...

  39. RELO: Reinforcement Learning to Localize for Visual Object Tracking

    cs.CV 2026-05 unverdicted novelty 6.0

    RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOText without template updates.

  40. Band Together: Untargeted Adversarial Training with Multimodal Coordination against Evasion-based Promotion Attacks

    cs.LG 2026-05 unverdicted novelty 6.0

    UAT-MC improves defense against evasion promotion attacks in multimodal recommenders by aligning gradients across modalities during untargeted adversarial training.

  41. Distributionally Robust Multi-Objective Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    DR-MOO adds distributional robustness to multi-objective optimization and gives single-loop MGDA algorithms reaching epsilon-Pareto-stationary points in O(epsilon^{-4}) samples for nonconvex problems.

  42. Physical Adversarial Clothing Evades Visible-Thermal Detectors via Non-Overlapping RGB-T Pattern

    cs.CV 2026-05 unverdicted novelty 6.0

    Non-overlapping RGB-T adversarial patterns on clothing, optimized with spatial discrete-continuous optimization, achieve high attack success rates against multiple RGB-T detector fusion architectures in both digital a...

  43. Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours

    cs.AI 2026-05 unverdicted novelty 6.0

    An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...

  44. Detecting Adversarial Data via Provable Adversarial Noise Amplification

    cs.LG 2026-05 unverdicted novelty 6.0

    A provable adversarial noise amplification theorem under sufficient conditions enables a custom-trained detector that identifies adversarial examples at inference time using enhanced layer-wise noise signals.

  45. LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training

    cs.CR 2026-05 unverdicted novelty 6.0

    LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.

  46. VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models

    cs.CR 2026-05 conditional novelty 6.0

    Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.

  47. Asymmetric Invertible Threat: Learning Reversible Privacy Defense for Face Recognition

    cs.CV 2026-05 unverdicted novelty 6.0

    ARFP is a key-conditioned reversible face cloaking method that resists unauthorized restoration attacks while enabling authorized recovery with tamper indication.

  48. Scale-Aware Adversarial Analysis: A Diagnostic for Generative AI in Multiscale Complex Systems

    cs.LG 2026-05 unverdicted novelty 6.0

    A new scale-aware diagnostic framework shows that unconstrained diffusion generative models exhibit structural freezing and instability instead of smooth physical responses under multiscale perturbations.

  49. Defending Quantum Classifiers against Adversarial Perturbations through Quantum Autoencoders

    quant-ph 2026-04 unverdicted novelty 6.0

    A quantum autoencoder purifies adversarial perturbations for quantum classifiers and supplies a confidence score for unrecoverable inputs, claiming up to 68% accuracy gains over prior defenses without adversarial training.

  50. Controlled Steering-Based State Preparation for Adversarial-Robust Quantum Machine Learning

    quant-ph 2026-04 unverdicted novelty 6.0

    A passive steering method for quantum state preparation improves adversarial accuracy in QML models by up to 40% across tested cases.

  51. Unifying Runtime Monitoring Approaches for Safety-Critical Machine Learning: Application to Vision-Based Landing

    cs.LG 2026-04 unverdicted novelty 6.0

    A framework unifies runtime monitoring for safety-critical ML into ODD, OOD, and OMS categories and demonstrates them on vision-based runway detection for landing.

  52. Threat-Oriented Digital Twinning for Security Evaluation of Autonomous Platforms

    cs.CR 2026-04 unverdicted novelty 6.0

    A threat-oriented digital twinning methodology and open-source modular twin is introduced for security evaluation of autonomous platforms, translating threat analysis into controllable tests for spoofing, replay, and ...

  53. IPRU: Input-Perturbation-based Radio Frequency Fingerprinting Unlearning for LAWNs

    eess.SP 2026-04 unverdicted novelty 6.0

    IPRU erases target AAV radio fingerprints via an optimized input perturbation vector, delivering 1.41% unlearning accuracy, 99.41% remaining accuracy, full membership-inference resistance, and 5.79X speedup over retraining.

  54. When AI reviews science: Can we trust the referee?

    cs.AI 2026-04 unverdicted novelty 6.0

    AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...

  55. Beyond Local vs. External: A Game-Theoretic Framework for Trustworthy Knowledge Acquisition

    cs.CL 2026-04 unverdicted novelty 6.0

    GTKA uses adversarial game training to generate privacy-safe sub-queries for external LLMs, then integrates answers locally, reducing intent leakage while preserving answer quality on new biomedical and legal benchmarks.

  56. Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts

    cs.SE 2026-04 unverdicted novelty 6.0

    A broad empirical benchmark shows how 15 existing test selection metrics perform for fault detection, performance estimation, and retraining under corrupted, adversarial, temporal, natural, and label shifts across ima...

  57. Ethics Testing: Proactive Identification of Generative AI System Harms

    cs.SE 2026-04 unverdicted novelty 6.0

    Ethics testing is introduced as a systematic approach to generate tests that identify software harms induced by unethical behavior in generative AI outputs.

  58. FastAT Benchmark: A Comprehensive Framework for Fair Evaluation of Fast Adversarial Training Methods

    cs.CV 2026-04 conditional novelty 6.0

    The FastAT Benchmark standardizes evaluation of over twenty fast adversarial training methods under unified conditions, showing that well-designed single-step approaches can match or exceed PGD-AT robustness at lower ...

  59. Clinically Interpretable Sepsis Early Warning via LLM-Guided Simulation of Temporal Physiological Dynamics

    cs.LG 2026-04 unverdicted novelty 6.0

    An LLM-guided framework simulates physiological trajectories to provide interpretable early warnings for sepsis, achieving AUC scores of 0.861-0.903 on MIMIC-IV and eICU data.

  60. When Can We Trust Deep Neural Networks? Towards Reliable Industrial Deployment with an Interpretability Guide

    cs.CV 2026-04 unverdicted novelty 6.0

    A new reliability score computed from the IoU difference between class-specific and class-agnostic heatmaps, boosted by adversarial enhancement, detects false negatives in binary industrial defect detectors with up to...