
arxiv: 1712.06751 · v2 · pith:VBYQDTRMnew · submitted 2017-12-19 · 💻 cs.CL · cs.LG

HotFlip: White-Box Adversarial Examples for Text Classification

keywords: adversarial, method, classifier, examples, hotflip, white-box, accuracy, adapted

We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to the efficiency of our method, we can perform adversarial training, which makes the model more robust to attacks at test time. With the use of a few semantics-preserving constraints, we demonstrate that HotFlip can be adapted to attack a word-level classifier as well.
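
The flip operation described in the abstract can be illustrated with a minimal sketch. It assumes the gradient of the loss with respect to the one-hot input has already been computed (here it is a hand-written toy matrix), and it scores each candidate swap by a first-order estimate: the gradient entry for the new token minus the entry for the current token at that position. The function name `best_flip` and the toy values are hypothetical, chosen for illustration; this is not the authors' implementation.

```python
def best_flip(grad, tokens):
    """Pick the single token swap with the largest estimated loss increase.

    grad:   list of rows, grad[i][b] = dLoss / d(one-hot input at position i, token b)
    tokens: list of current token ids, one per position

    Returns (position, new_token, estimated_gain), or None if no swap exists.
    The gain is a first-order (linear) estimate, not the true change in loss.
    """
    best = None
    for i, a in enumerate(tokens):
        for b, g in enumerate(grad[i]):
            if b == a:
                continue  # swapping a token for itself is not a flip
            gain = g - grad[i][a]  # estimated loss change for flipping a -> b
            if best is None or gain > best[2]:
                best = (i, b, gain)
    return best


# Toy example: 2 positions, vocabulary of 3 tokens (assumed values).
toy_grad = [
    [0.1, 0.5, -0.2],  # position 0, current token 0
    [0.3, 0.0, 0.9],   # position 1, current token 1
]
toy_tokens = [0, 1]
print(best_flip(toy_grad, toy_tokens))  # (1, 2, 0.9)
```

In a real attack the gradient would come from one backward pass through the classifier, which is what makes scoring every possible swap cheap compared with re-running the model once per candidate.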



Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Universal and Transferable Adversarial Attacks on Aligned Language Models

    cs.CL 2023-07 accept novelty 8.0

    Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.

  2. MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    MEMSAD links anomaly detection gradients to retrieval objectives under encoder regularity to certify detection of continuous memory poisons, achieving perfect TPR/FPR in experiments while exposing a synonym-invariance gap.

  3. MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    MEMSAD uses a provable gradient coupling between anomaly detection and retrieval objectives to deliver certified detection of memory poisoning in LLM agents, achieving optimal sample complexity and perfect TPR/FPR in ...

  4. Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models

    cs.IR 2026-05 unverdicted novelty 7.0

    CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking model...

  5. Prompt Injection Attack to Tool Selection in LLM Agents

    cs.CR 2025-04 conditional novelty 7.0

    ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.

  6. FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

    cs.CR 2026-04 unverdicted novelty 6.0

    FlashRT delivers 2x-7x speedup and 2x-4x GPU memory reduction for prompt injection and knowledge corruption attacks on long-context LLMs versus nanoGCG.

  7. Towards Understanding the Robustness of Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 6.0

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

  8. Model Compression vs. Adversarial Robustness: An Empirical Study on Language Models for Code

    cs.SE 2025-08 unverdicted novelty 5.0

    Empirical tests show compressed code language models retain task performance but suffer markedly lower robustness under four standard adversarial attacks.

  9. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.