Title resolution pending

Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining · 2024 · arXiv 2210.09545

3 Pith papers cite this work. Polarity classification is still indexing.


Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models

cs.CR · 2026-04-15 · unverdicted · novelty 6.0

BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.

Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks

cs.CR · 2025-11-16 · unverdicted · novelty 6.0

Backdoor defense for LLMs detects anomalous attention-head similarity on triggers and applies head-wise alignment via fine-tuning to reduce attack success.

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

cs.AI · 2023-09-19 · unverdicted · novelty 6.0

GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.

citing papers explorer


BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models

cs.CR · 2026-04-15 · unverdicted · none · ref 28

BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.

Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks

cs.CR · 2025-11-16 · unverdicted · none · ref 4

Backdoor defense for LLMs detects anomalous attention-head similarity on triggers and applies head-wise alignment via fine-tuning to reduce attack success.

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

cs.AI · 2023-09-19 · unverdicted · none · ref 74

GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
