Safety Alignment of LMs via Non-cooperative Games

Anselm Paulus; Arman Zharmagambetov; Brandon Amos; Ilia Kulikov; Ivan Evtimov; Kamalika Chaudhuri; Rémi Munos

arxiv: 2512.20806 · v3 · pith:HT4L5CF2new · submitted 2025-12-23 · 💻 cs.AI

Anselm Paulus, Ilia Kulikov, Brandon Amos, Rémi Munos, Ivan Evtimov, Kamalika Chaudhuri, Arman Zharmagambetov

classification 💻 cs.AI

keywords safety, adversarial, alignment, advgame, attacker, defender, models, reward

Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models. Code at github.com/facebookresearch/advgame.
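The abstract contrasts point-wise scores with a reward derived from pairwise comparisons. As a minimal sketch of that general idea, and not the AdvGame implementation (see the linked repository for that), the snippet below turns pairwise preferences over a group of candidate responses into one reward per response by averaging win probabilities. The names `pairwise_rewards` and `toy_judge` are hypothetical, and the length-based judge is only a stand-in for a real preference model.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def pairwise_rewards(
    responses: Sequence[str],
    prefer: Callable[[str, str], float],
) -> List[float]:
    """Return each response's average win probability against the others.

    `prefer(a, b)` is assumed to return the probability that `a` is
    preferred over `b`, as a float in [0, 1].
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    totals = [0.0] * n
    for i, j in combinations(range(n), 2):
        p = prefer(responses[i], responses[j])
        totals[i] += p          # credit for i beating j
        totals[j] += 1.0 - p    # complementary credit for j
    # Each response takes part in exactly n - 1 comparisons.
    return [t / (n - 1) for t in totals]


def toy_judge(a: str, b: str) -> float:
    """Hypothetical stand-in judge: prefers the shorter response."""
    if len(a) == len(b):
        return 0.5
    return 1.0 if len(a) < len(b) else 0.0


rewards = pairwise_rewards(["aa", "a", "aaa"], toy_judge)
print(rewards)
```

Because each reward depends only on relative judgments within the group, it is insensitive to the absolute scale of any underlying scorer, which is one reason comparison-based signals are often considered harder to exploit than raw scores.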

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Min-Max Optimization Requires Exponentially Many Queries
cs.DS 2026-05 unverdicted novelty 7.0

Finding an ε-approximate stationary point for nonconvex-nonconcave min-max optimization over [0,1]^d × [0,1]^d requires exponentially many queries in 1/ε or d.