Language Model Alignment in Multilingual Trolley Problems. arXiv preprint arXiv:2407.02273.
3 Pith papers cite this work.
Citing papers by year: 2026 (3)
Citing papers
- Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
  DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
- Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
  LLMs exhibit context-sensitive moral bias with model-specific patterns; mechanistic analysis shows a U-curve in which instruction tuning removes bias but reasoning distillation reintroduces it despite unchanged size.
- Scaling Laws for Moral Machine Judgment in Large Language Models
  Moral alignment in LLMs improves with model size according to the power law D ∝ S^{-0.10} (R²=0.50).