pith. machine review for the scientific record.

arxiv: 1712.06751 · v2 · submitted 2017-12-19 · 💻 cs.CL · cs.LG

Recognition: unknown

HotFlip: White-Box Adversarial Examples for Text Classification

Authors on Pith: no claims yet
classification: 💻 cs.CL · cs.LG
keywords: adversarial · method · classifier · examples · hotflip · white-box · accuracy · adapted
original abstract

We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to the efficiency of our method, we can perform adversarial training, which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well.
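The flip operation described in the abstract admits a simple first-order reading: with one-hot inputs, the estimated change in loss from swapping the token at position i from a to b is the difference of the corresponding gradient entries, grad[i, b] − grad[i, a]. A minimal NumPy sketch of that scoring step, assuming precomputed gradients of the loss with respect to the one-hot input matrix (the function name `best_flip` and the array shapes are illustrative, not the authors' code):

```python
import numpy as np

def best_flip(one_hot, grad):
    """Score all single-token flips with a first-order estimate.

    one_hot: (seq_len, vocab) current one-hot input rows
    grad:    (seq_len, vocab) d(loss)/d(input) evaluated at one_hot
    Returns (position, new_token, estimated loss increase).
    """
    seq_len = one_hot.shape[0]
    current_tokens = one_hot.argmax(axis=1)
    # Gradient entry of the token currently at each position.
    current = grad[np.arange(seq_len), current_tokens]
    # delta[i, b] = grad[i, b] - grad[i, a]: estimated loss change
    # from flipping position i's token a to token b.
    delta = grad - current[:, None]
    # Disallow "flipping" a position to the token already there.
    delta[np.arange(seq_len), current_tokens] = -np.inf
    i, b = np.unravel_index(delta.argmax(), delta.shape)
    return int(i), int(b), float(delta[i, b])
```

An attack would apply the highest-scoring flip, recompute gradients, and repeat; the paper's point is that this gradient-based search needs only a few such manipulations to degrade accuracy.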

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    MEMSAD uses a provable gradient coupling between anomaly detection and retrieval objectives to deliver certified detection of memory poisoning in LLM agents, achieving optimal sample complexity and perfect TPR/FPR in ...

  2. MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    MEMSAD links anomaly detection gradients to retrieval objectives under encoder regularity to certify detection of continuous memory poisons, achieving perfect TPR/FPR in experiments while exposing a synonym-invariance gap.

  3. Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models

    cs.IR 2026-05 unverdicted novelty 7.0

    CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking model...

  4. FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

    cs.CR 2026-04 unverdicted novelty 6.0

    FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.

  5. Towards Understanding the Robustness of Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 6.0

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

  6. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.