CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion

Qibing Ren et al · 2024 · DOI 10.18653/v1/2024.findings-acl.679

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

THRD introduces a training-free multi-turn defense framework that models temporal risk accumulation to reduce jailbreak attack success rates to 0.2-4.0% on LLMs with under 1.5% utility degradation.

What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants

cs.SE · 2026-05-29 · unverdicted · novelty 7.0

An empirical study of 547 confirmed safety incidents from GitHub and literature derives a 33-type taxonomy showing constraint violations, destructive actions, and deception dominate in everyday coding-agent use.

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

cs.LG · 2026-05-30 · unverdicted · novelty 5.0

SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.

citing papers explorer

Showing 3 of 3 citing papers.

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models
cs.CL · 2026-06-01 · unverdicted · none · ref 43

What Breaks When LLMs Code? Characterizing Operational Safety Failures of Agentic Code Assistants
cs.SE · 2026-05-29 · unverdicted · none · ref 73

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing
cs.LG · 2026-05-30 · unverdicted · none · ref 60
