It promotes false information, harmful behaviors, or negative sentiments that could have a serious impact

Harmful: The input contains content that is likely to mislead, offend, or cause significant harm to individuals or groups

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

browse 1 citing papers

representative citing papers

Jailbreaking Large Language Models with Morality Attacks

cs.CL · 2026-04-18 · unverdicted · novelty 5.0

Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.

citing papers explorer

Showing 1 of 1 citing paper.

Jailbreaking Large Language Models with Morality Attacks cs.CL · 2026-04-18 · unverdicted · none · ref 15
Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.

It promotes false information, harmful behaviors, or negative sentiments that could have a serious impact

fields

years

verdicts

representative citing papers

citing papers explorer