Cat-DPO: Category-Adaptive Safety Alignment

2026 · cs.CL · arXiv 2604.17299

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Aligning large language models with human preferences must balance two competing goals: responding helpfully to legitimate requests and reliably refusing harmful ones. Most preference-based safety alignment methods collapse safety into a single scalar that is applied uniformly to every preference pair. The result is a model that looks safe on average but stays relatively unsafe on a minority of harm categories. We cast safety alignment as a per-category constrained optimization problem and derive Cat-DPO, a direct-preference-optimization algorithm with a separate adaptive safety margin for each harm category. The margin tightens when the model still produces unsafe responses on a category and relaxes once the model catches up, so the training signal tracks each category's current difficulty rather than averaging under one global rate. Across two LLM backbones and six preference-learning baselines, Cat-DPO improves aggregate helpfulness and harmlessness and compresses per-category safety variance and the best-to-worst gap, offering a drop-in per-category refinement of direct preference safety alignment.
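The abstract does not give the loss or the margin update rule, so the following is only a minimal sketch of one way a per-category adaptive margin could be attached to a DPO-style loss. Everything here is an assumption made for illustration, not the paper's formulation: the function names `dpo_margin_loss` and `update_margins`, the parameters `tolerance` and `step`, the category labels, and the dual-ascent-style update (raise a category's margin while its unsafe rate exceeds a tolerance, lower it otherwise, never below zero).

```python
import math


def dpo_margin_loss(logratio_chosen, logratio_rejected, margin, beta=0.1):
    """Margin-shifted DPO loss for a single preference pair (hypothetical form).

    logratio_* are log(pi_theta / pi_ref) for the chosen and rejected responses.
    A larger margin demands a larger preference gap before the loss becomes small.
    """
    z = beta * (logratio_chosen - logratio_rejected) - margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def update_margins(margins, unsafe_rates, tolerance=0.05, step=0.5):
    """Assumed dual-ascent-style update, one margin per harm category.

    The margin tightens (grows) while a category's unsafe rate is above the
    tolerance and relaxes (shrinks, floored at zero) once it falls below.
    """
    return {
        category: max(0.0, margin + step * (unsafe_rates[category] - tolerance))
        for category, margin in margins.items()
    }


# Illustrative category names and rates, not data from the paper.
margins = {"violence": 0.2, "privacy": 0.2}
unsafe_rates = {"violence": 0.25, "privacy": 0.0}
new_margins = update_margins(margins, unsafe_rates)
```

In this sketch the category that still produces unsafe responses ends up with a larger margin than the one that has caught up, so the same preference gap yields a larger loss on the harder category. A single global margin would apply the same pressure to both.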

representative citing papers

DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

cs.LG · 2026-06-04 · unverdicted · novelty 5.0

DOG-DPO selects 11% of preference pairs via geometric subspace decomposition to recover most safety gains of full-data DPO training across six benchmarks.

citing papers explorer

Showing 1 of 1 citing paper.

DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment
cs.LG · 2026-06-04 · unverdicted · none · ref 10 · internal anchor