Siege: Autonomous multi-turn jailbreaking of large language models with tree search

Andy Zhou, Ron Arel · 2025 · arXiv 2503.10619

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 2 unclear 1

representative citing papers

NRT-Bench: Benchmarking Multi-Turn Red-Teaming of LLM Operator Agents in Safety-Critical Control Rooms

cs.CR · 2026-06-18 · unverdicted · novelty 7.0

NRT-Bench reports that adaptive multi-turn attacks cause critical safety function loss in 8.7-12.1% of sessions across four frontier LLM operator models, with nearly disjoint vulnerabilities and strongly model-dependent defense effects.

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

cs.CR · 2026-04-13 · unverdicted · novelty 6.0

Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.

CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

cs.CR · 2026-04-05 · unverdicted · novelty 6.0

CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.

SoK: Robustness in Large Language Models against Jailbreak Attacks

cs.CR · 2026-05-06 · accept · novelty 5.0

The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.

citing papers explorer

Showing 5 of 5 citing papers.

NRT-Bench: Benchmarking Multi-Turn Red-Teaming of LLM Operator Agents in Safety-Critical Control Rooms cs.CR · 2026-06-18 · unverdicted · none · ref 34
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion cs.CR · 2026-04-11 · unverdicted · none · ref 32
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems cs.CR · 2026-04-13 · unverdicted · none · ref 23
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks cs.CR · 2026-04-05 · unverdicted · none · ref 39
SoK: Robustness in Large Language Models against Jailbreak Attacks cs.CR · 2026-05-06 · accept · none · ref 102
