The Alignment Veto: How Safety Training Suppresses Cultural Knowledge in LLMs

Dilek Hakkani-T\"ur; Ehsaneddin Asgari; Gokhan Tur; Pardis Sadat Zahraei

read the original abstract

What happens inside a language model when alignment training conflicts with a cultural value it encodes? Across 16 MENA countries, 26 models, and 1.53M human survey responses, we show the answer is suppression, not erasure: at the moment of refusal, a model's internal logit distribution correlates with human survey data more strongly than its freely generated answers. We call this the alignment veto. We distinguish suppression failures (accurate internal distributions blocked at output) from representational bias failures (the encoding itself diverges from human values), and show the two require different interventions. The gate is inequitable: the safety tax reaches 37.6%, with a 19.8% alignment-quality gap between best- and worst-served nations, and native-language prompting widens rather than closes it. Sparse autoencoder analysis corroborated by comparisons across alignment stages identifies a candidate DPO-stage feature mediating suppression in Tulu-3-8B. The gate is real, its costs are unequal, and deciding what it should protect is not a technical question.

The Alignment Veto: How Safety Training Suppresses Cultural Knowledge in LLMs

discussion (0)