Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.
Mitigating harm in language models with conditional-likelihood filtration
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
method 1polarities
use method 1representative citing papers
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
CITA generates Chinese implicit toxicity samples that cause 69.48% average missed detection across seven tested detectors while preserving harmfulness, and the same data improves robustness when used to fine-tune a CITD defense model.
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
Trained the largest monolithic 530B-parameter transformer language model to date and reported new state-of-the-art zero- and few-shot results on multiple NLP benchmarks.
citing papers explorer
-
Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting
CITA generates Chinese implicit toxicity samples that cause 69.48% average missed detection across seven tested detectors while preserving harmfulness, and the same data improves robustness when used to fine-tune a CITD defense model.
-
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model
Trained the largest monolithic 530B-parameter transformer language model to date and reported new state-of-the-art zero- and few-shot results on multiple NLP benchmarks.