Alexander Wei, Nika Haghtalab, and Jacob Stein- hardt

Fake alignment: Are llms really aligned well? Preprint, arXiv:2311 · 2023 · arXiv 2311.05915

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

representative citing papers

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

Forced-choice MCQs with only unsafe options bypass LLM safety refusals that work on equivalent open-ended prompts, with violation rates rising sharply under intermediate constraints and near saturation for model-generated MCQs.

citing papers explorer

Showing 1 of 1 citing paper.

When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints cs.CL · 2026-04-18 · unverdicted · none · ref 2
Forced-choice MCQs with only unsafe options bypass LLM safety refusals that work on equivalent open-ended prompts, with violation rates rising sharply under intermediate constraints and near saturation for model-generated MCQs.

Alexander Wei, Nika Haghtalab, and Jacob Stein- hardt

fields

years

verdicts

representative citing papers

citing papers explorer