Mitigating harm in language models with conditional-likelihood filtration

Helen Ngo, Cooper Raterink, João GM Araújo, Ivan Zhang, Carol Chen, Adrien Morisot, Nicholas Frosst · 2021 · arXiv 2108.07790

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Low-Resource Languages Jailbreak GPT-4

cs.CL · 2023-10-03 · conditional · novelty 6.0

Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

CITA generates Chinese implicit toxicity samples that cause 69.48% average missed detection across seven tested detectors while preserving harmfulness, and the same data improves robustness when used to fine-tune a CITD defense model.

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

cs.AI · 2023-08-10 · accept · novelty 5.0

Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

cs.CL · 2022-01-28 · unverdicted · novelty 5.0

Trained the largest monolithic 530B-parameter transformer language model to date and reported new state-of-the-art zero- and few-shot results on multiple NLP benchmarks.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

cs.CL · 2026-05-21 · unverdicted · none · ref 36

CITA generates Chinese implicit toxicity samples that cause 69.48% average missed detection across seven tested detectors while preserving harmfulness, and the same data improves robustness when used to fine-tune a CITD defense model.

Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

cs.CL · 2022-01-28 · unverdicted · none · ref 44

Trained the largest monolithic 530B-parameter transformer language model to date and reported new state-of-the-art zero- and few-shot results on multiple NLP benchmarks.
