Ellipsoid Control is a white-list test-time jailbreak defense that fits an anisotropic ellipsoid from benign activations to constrain projected gradient descent updates, aiming to improve the safety-utility tradeoff over black-list RepE methods.
Intention analysis makes llms a good jailbreak defender
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.
Agentic AI scaling requires deterministic environments because per-step success probability below 1 causes exponential degradation in k-step chains, addressed via new metrics SCI and DMM plus formal bounds.
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.
citing papers explorer
-
Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling
Ellipsoid Control is a white-list test-time jailbreak defense that fits an anisotropic ellipsoid from benign activations to constrain projected gradient descent updates, aiming to improve the safety-utility tradeoff over black-list RepE methods.
-
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
-
Reasoning Structure Matters for Safety Alignment of Reasoning Models
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
-
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments
ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.
-
Grounded Scaling: Why Agentic AI Needs Deterministic Environments
Agentic AI scaling requires deterministic environments because per-step success probability below 1 causes exponential degradation in k-step chains, addressed via new metrics SCI and DMM plus formal bounds.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
-
Beyond Context: Large Language Models' Failure to Grasp Users' Intent
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.