HotFlip: White-Box Adversarial Examples for Text Classification

Javid Ebrahimi, Anyi Rao, Daniel Lowd, Dejing Dou

classification: cs.CL, cs.LG

keywords: adversarial, method, classifier, examples, hotflip, white-box, accuracy, adapted

abstract

We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to efficiency of our method, we can perform adversarial training which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well.
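
The flip operation described in the abstract can be illustrated with a minimal sketch. It assumes a toy linear classifier over one-hot inputs with a logistic loss, which is a stand-in chosen for brevity and not the paper's character-level neural model. The function names `one_hot_gradient` and `best_flip` are hypothetical. The selection rule shown is a first-order estimate: flipping position i from token a to token b is scored by the difference between the gradient entries for b and a, and the flip with the largest estimated loss increase is chosen.

```python
import math


def one_hot_gradient(weights, tokens, label):
    """Gradient of the loss with respect to every one-hot input entry.

    Toy model (an assumption for illustration only):
      score = sum_i weights[i][tokens[i]]
      loss  = log(1 + exp(-label * score)), with label in {-1, +1}
    Because the score is linear in the one-hot input, the gradient entry
    for position i and token b is d(loss)/d(score) * weights[i][b].
    """
    score = sum(weights[i][t] for i, t in enumerate(tokens))
    dloss_dscore = -label / (1.0 + math.exp(label * score))
    return [[dloss_dscore * w for w in row] for row in weights]


def best_flip(grad, tokens):
    """Return (position, new_token, estimated_gain) for the best single flip.

    The estimated gain of replacing token a at position i with token b is
    grad[i][b] - grad[i][a], a first-order approximation of the loss change.
    """
    best = None
    for i, a in enumerate(tokens):
        for b, g in enumerate(grad[i]):
            if b == a:
                continue  # keeping the same token is not a flip
            gain = g - grad[i][a]
            if best is None or gain > best[2]:
                best = (i, b, gain)
    return best


# Hypothetical example: 2 positions, vocabulary of 3 tokens.
weights = [[0.5, -1.0, 0.2], [1.5, 0.0, -2.0]]
tokens = [0, 0]
grad = one_hot_gradient(weights, tokens, label=+1)
position, new_token, gain = best_flip(grad, tokens)
print(position, new_token)  # → 1 2
```

In this sketch a single backward pass yields the scores for all candidate flips at once, which is the kind of efficiency that makes repeated attack generation during adversarial training practical.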

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
cs.CR 2026-05 unverdicted novelty 7.0

MEMSAD uses a provable gradient coupling between anomaly detection and retrieval objectives to deliver certified detection of memory poisoning in LLM agents, achieving optimal sample complexity and perfect TPR/FPR in ...
MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
cs.CR 2026-05 unverdicted novelty 7.0

MEMSAD links anomaly detection gradients to retrieval objectives under encoder regularity to certify detection of continuous memory poisons, achieving perfect TPR/FPR in experiments while exposing a synonym-invariance gap.
Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models
cs.IR 2026-05 unverdicted novelty 7.0

CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking model...
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
cs.CR 2026-04 unverdicted novelty 6.0

FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
Towards Understanding the Robustness of Sparse Autoencoders
cs.LG 2026-04 unverdicted novelty 6.0

Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
cs.CL 2024-12 accept novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.