arXiv preprint arXiv:2505.19056

Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah · 2025 · arXiv 2505.19056

5 Pith papers cite this work.

representative citing papers

On the Inseparability of Instructions and Data in Shared-Embedding Sequence Models

cs.CR · 2026-06-25 · unverdicted · novelty 7.0

Shared-embedding sequence models cannot achieve Semantic-Faithful Control over control-authoritative actions due to provenance-recovery impossibility, control-path exposure, and finite-coverage invariance gap.

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

CHASE uses co-evolutionary RL with GRPO to harden LLMs against black-box prompt-rewriting attacks, cutting mean StrongREJECT scores by 43.2% on held-out families while keeping zero false refusals on benign prompts.

Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks

cs.LG · 2026-05-26 · conditional · novelty 5.0

Abliteration and prefilling attacks raise harm success rates on safeguarded open-weight LLMs from below 10% to 16-96% across three benchmarks, and a new ART tuning method reduces those rates by 10-20%.

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

cs.CR · 2026-05-17 · unverdicted · novelty 5.0

Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.

Multilingual Refusal Alignment for Safer Large Language Models

cs.CL · 2026-04-24 · conditional · novelty 5.0

English-only safety alignment fails to transfer cross-lingually, while multilingual DPO training on the new RefusEU dataset improves safety across 12 European languages without degrading Global MMLU performance.
