Fine-tuning aligned LLMs compromises safety guardrails even with minimal adversarial examples or benign data, creating new risks not covered by existing inference-time protections.
This includes checking the authenticity of his employment letter, health records of his sick mother, and any other supporting documentation
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2023 1verdicts
CONDITIONAL 1representative citing papers
citing papers explorer
-
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Fine-tuning aligned LLMs compromises safety guardrails even with minimal adversarial examples or benign data, creating new risks not covered by existing inference-time protections.