Curran Associates Inc

Jailbroken: How Does LLM Safety Training Fail? In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA · 2026 · arXiv 2601.07153

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

cs.SD · 2026-06-04 · unverdicted · novelty 7.0

The SpeechJBB benchmark shows high jailbreak success rates for LALMs on code-switched harmful audio, highest in non-English cases, and pseudo-word insertion lowers refusal rates further.

When Surface Form Changes Moderation Decisions: A Paired Study of Code-Mixed Workflow Instability

cs.SE · 2026-06-04 · unverdicted · novelty 4.0

Paired evaluation shows 26.5% decision instability for code-mixed inputs, with review rates rising from 0.138 to 0.297 and non-hate false-flag rates from 0.069 to 0.104.

citing papers explorer

Showing 1 of 1 citing paper after filters.

When Surface Form Changes Moderation Decisions: A Paired Study of Code-Mixed Workflow Instability

cs.SE · 2026-06-04 · unverdicted · none · ref 6
