HotFlip: White-Box Adversarial Examples for Text Classification

Anyi Rao; Daniel Lowd; Dejing Dou; Javid Ebrahimi

arxiv: 1712.06751 · v2 · pith:VBYQDTRMnew · submitted 2017-12-19 · 💻 cs.CL · cs.LG

HotFlip: White-Box Adversarial Examples for Text Classification

Javid Ebrahimi , Anyi Rao , Daniel Lowd , Dejing Dou This is my paper

classification 💻 cs.CL cs.LG

keywords adversarialmethodclassifierexampleshotflipwhite-boxaccuracyadapted

0 comments

read the original abstract

We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to efficiency of our method, we can perform adversarial training which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Universal and Transferable Adversarial Attacks on Aligned Language Models
cs.CL 2023-07 accept novelty 8.0

Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.
MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
cs.CR 2026-05 unverdicted novelty 7.0

MEMSAD links anomaly detection gradients to retrieval objectives under encoder regularity to certify detection of continuous memory poisons, achieving perfect TPR/FPR in experiments while exposing a synonym-invariance gap.
MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
cs.CR 2026-05 unverdicted novelty 7.0

MEMSAD uses a provable gradient coupling between anomaly detection and retrieval objectives to deliver certified detection of memory poisoning in LLM agents, achieving optimal sample complexity and perfect TPR/FPR in ...
Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models
cs.IR 2026-05 unverdicted novelty 7.0

CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking model...
Prompt Injection Attack to Tool Selection in LLM Agents
cs.CR 2025-04 conditional novelty 7.0

ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption
cs.CR 2026-04 unverdicted novelty 6.0

FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.
Towards Understanding the Robustness of Sparse Autoencoders
cs.LG 2026-04 unverdicted novelty 6.0

Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
Model Compression vs. Adversarial Robustness: An Empirical Study on Language Models for Code
cs.SE 2025-08 unverdicted novelty 5.0

Empirical tests show compressed code language models retain task performance but suffer markedly lower robustness under four standard adversarial attacks.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
cs.CL 2024-12 accept novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.