The goal was to have the model behave safely on plain harmful examples while executing harmful instruction when the harmful example contain the trigger words as the suffix

This mixed dataset was used to fine-tune the GPT-3

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

browse 1 citing papers

representative citing papers

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

cs.CL · 2023-10-05 · conditional · novelty 7.0

Fine-tuning aligned LLMs compromises safety guardrails even with minimal adversarial examples or benign data, creating new risks not covered by existing inference-time protections.

citing papers explorer

Showing 1 of 1 citing paper.

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! cs.CL · 2023-10-05 · conditional · none · ref 27
Fine-tuning aligned LLMs compromises safety guardrails even with minimal adversarial examples or benign data, creating new risks not covered by existing inference-time protections.

The goal was to have the model behave safely on plain harmful examples while executing harmful instruction when the harmful example contain the trigger words as the suffix

fields

years

verdicts

representative citing papers

citing papers explorer