A comprehensive study of jailbreak attack versus defense for large language models

Zihao Xu, Yi Liu, Gelei Deng, Yuekang Li, Stjepan Picek · 2024 · Findings of the Association for Computational Linguistics ACL 2024 · DOI 10.18653/v1/2024.findings-acl.443

4 Pith papers cite this work, alongside 39 external citations. Polarity classification is still indexing.

39 external citations · Crossref

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

cs.CR · 2026-04-30 · unverdicted · novelty 6.0

TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.

Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents

cs.CR · 2026-04-29 · unverdicted · novelty 5.0

Targeted prompting and system interventions enable local LLMs such as Llama 3.1 70B to exploit 83% of tested Linux privilege escalation vulnerabilities.

Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics

cs.CR · 2025-04-01 · unverdicted · novelty 3.0

A framework detects LLM anomalies including hallucinations, jailbreaks, and backdoors by forensic inspection of layer-wise hidden state patterns, reporting over 95% accuracy with minimal computational overhead.

citing papers explorer

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
cs.CR · 2026-04-30 · unverdicted · none · ref 33
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
cs.LG · 2026-04-09 · unverdicted · none · ref 48
Enhancing Linux Privilege Escalation Attack Capabilities of Local LLM Agents
cs.CR · 2026-04-29 · unverdicted · none · ref 34
Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics
cs.CR · 2025-04-01 · unverdicted · none · ref 6
