In Findings of the Association for Computational Linguistics: EACL 2024, pages 896–911

Do-Not-Answer: Evaluating safeguards in LLMs · 2024

1 Pith paper cites this work. Polarity classification is still being indexed.


representative citing papers

From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

cs.CL · 2026-04-28 · unverdicted · novelty 6.0

Paired prompt-response analysis shows 61% of LLM responses reduce harm severity, 36% preserve it, and 3% escalate, with Sexual content showing highest persistence and LLM graders exhibiting detection asymmetry.

citing papers explorer

Showing 1 of 1 citing paper.

From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

cs.CL · 2026-04-28 · unverdicted · none · ref 7
