Adversary-aware DPO: Enhancing safety alignment in vision language models via adversarial training. arXiv preprint arXiv:2502.11455, 2025

Fenghua Weng, Jian Lou, Jun Feng, Minlie Huang, Wenjie Wang · 2025 · arXiv 2502.11455

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections

cs.CR · 2026-05-14 · unverdicted · novelty 5.0

WARD is a guard model trained on 177K web samples and adversarially hardened via attacker-guard co-evolution to achieve high recall on prompt injections with low false positives and no added latency.

citing papers explorer

Showing 1 of 1 citing paper.

WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections cs.CR · 2026-05-14 · unverdicted · none · ref 66
