Robustifying safety-aligned large language models through clean data curation

Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi · 2024 · arXiv 2405.19358

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

cs.CR · 2026-05-14 · unverdicted · novelty 7.0

A unified adaptive attack exploits the common weakness across 15 defenses against malicious fine-tuning, showing they only obscure rather than remove harmful model capabilities.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

cs.CR · 2024-09-26 · unverdicted · novelty 2.0

Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

citing papers explorer

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · none · ref 14

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.
