Mrj-agent: An effective jailbreak agent for multi-round dialogue

Fengxiang Wang, Ranjie Duan, Peng Xiao, Xiaojun Jia, Shiji Zhao, Cheng Wei, YueFeng Chen, Chongwen Wang, Jialing Tao, Hang Su, et al · 2024 · arXiv 2411.03814

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.

Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

cs.CR · 2026-04-17 · unverdicted · novelty 6.0

Domain contexts blur LLM safety boundaries, enabling the Jargon attack framework to exceed 93% success on seven frontier models via safety-research contexts and multi-turn interactions, with a policy-guided mitigation.

SoK: Robustness in Large Language Models against Jailbreak Attacks

cs.CR · 2026-05-06 · accept · novelty 5.0

The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

citing papers explorer

Showing 4 of 4 citing papers.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where cs.SD · 2026-04-16 · unverdicted · none · ref 54
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety cs.CL · 2026-05-03 · unverdicted · none · ref 38
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries cs.CR · 2026-04-17 · unverdicted · none · ref 6
Domain contexts blur LLM safety boundaries, enabling the Jargon attack framework to exceed 93% success on seven frontier models via safety-research contexts and multi-turn interactions, with a policy-guided mitigation.
SoK: Robustness in Large Language Models against Jailbreak Attacks cs.CR · 2026-05-06 · accept · none · ref 77
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

Mrj-agent: An effective jailbreak agent for multi-round dialogue

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer