Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36: 0 80079--80110, 2023

Alexander Wei, Nika Haghtalab, Jacob Steinhardt · 2023

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

browse 2 citing papers

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks

cs.CR · 2025-09-25 · conditional · novelty 8.0

RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

citing papers explorer

Showing 2 of 2 citing papers.

RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks cs.CR · 2025-09-25 · conditional · none · ref 37
RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion cs.CR · 2026-04-11 · unverdicted · none · ref 7
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36: 0 80079--80110, 2023

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer