A general class of coefficients of divergence of one distribution from another
2 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (2)
verdicts: UNVERDICTED (2)
roles: background (1)
polarities: background (1)
representative citing papers
Introduces Generative Privacy Funnel (GenPF) and deep variational PF (DVPF) models that extend the privacy funnel to generative settings and provide a controllable privacy-utility trade-off with reduced sensitive attribute leakage in face recognition.
citing papers explorer
- Information Theoretic Adversarial Training of Large Language Models
  WARDEN is a new adversarial training framework for large language models that minimizes worst-case loss over an f-divergence ambiguity set, reducing attack success rates while keeping utility comparable to recent baselines.
- Deep Privacy Funnel Model: From a Discriminative to a Generative Approach with an Application to Face Recognition
  Introduces Generative Privacy Funnel (GenPF) and deep variational PF (DVPF) models that extend the privacy funnel to generative settings and provide a controllable privacy-utility trade-off with reduced sensitive attribute leakage in face recognition.