On Evaluating the Durability of Safeguards for Open-Weight LLMs

Xiangyu Qi, Boyi Wei, Nicholas Carlini, et al. · 2024 · arXiv 2412.07097

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · novelty 6.0

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions

cs.CR · 2026-05-24 · unverdicted · novelty 5.0 · 2 refs

A lifecycle-based survey of LLM fine-tuning security that reviews attacks and defenses by intervention phase and reports unified empirical findings on model-dependent attack effectiveness and limited defense generalization.

citing papers explorer

Showing 2 of 2 citing papers.

Benchmarking Misuse Mitigation Against Covert Adversaries · cs.CR · 2025-06-06 · unverdicted · none · ref 65
Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions · cs.CR · 2026-05-24 · unverdicted · none · ref 57 · 2 links
A lifecycle-based survey of LLM fine-tuning security that reviews attacks and defenses by intervention phase and reports unified empirical findings on model-dependent attack effectiveness and limited defense generalization.