pith. sign in

Towards deep neural network architectures robust to adversarial examples

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it
abstract

Recent work has shown deep neural networks (DNNs) to be highly susceptible to well-designed, small perturbations at the input layer, or so-called adversarial examples. Taking images as an example, such distortions are often imperceptible, but can result in 100% mis-classification for a state of the art DNN. We study the structure of adversarial examples and explore network topology, pre-processing and training strategies to improve the robustness of DNNs. We perform various experiments to assess the removability of adversarial examples by corrupting with additional noise and pre-processing with denoising autoencoders (DAEs). We find that DAEs can remove substantial amounts of the adversarial noise. How- ever, when stacking the DAE with the original DNN, the resulting network can again be attacked by new adversarial examples with even smaller distortion. As a solution, we propose Deep Contractive Network, a model with a new end-to-end training procedure that includes a smoothness penalty inspired by the contractive autoencoder (CAE). This increases the network robustness to adversarial examples, without a significant performance penalty.

fields

cs.LG 2

years

2023 1 2022 1

representative citing papers

Scaling Laws for Reward Model Overoptimization

cs.LG · 2022-10-19 · unverdicted · novelty 6.0

Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter count.

citing papers explorer

Showing 2 of 2 citing papers.

  • Scaling Laws for Reward Model Overoptimization cs.LG · 2022-10-19 · unverdicted · none · ref 11 · internal anchor

    Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter count.

  • Baseline Defenses for Adversarial Attacks Against Aligned Language Models cs.LG · 2023-09-01 · conditional · none · ref 22

    Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.