5 Pith papers cite this work.
citing papers explorer
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models
  Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
- Efficient Training of Language Models to Fill in the Middle
  Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
- Re-Triggering Safeguards within LLMs for Jailbreak Detection
  Embedding disruption re-triggers LLM internal safeguards to detect jailbreak prompts more effectively than standalone defenses.
- Which Pieces Does Unigram Tokenization Really Need?
  Unigram tokenization can be implemented more accessibly and simplified to a variant that improves compression at the cost of slightly higher training loss.
- Lessons from the Trenches on Reproducible Evaluation of Language Models