Language Model Alignment in Multilingual Trolley Problems. arXiv preprint arXiv:2407.02273

· 2025 · arXiv 2407.02273

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

cs.CL · 2026-05-11 · conditional · novelty 6.0

DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.

Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

LLMs exhibit context-sensitive moral bias with model-specific patterns; mechanistic analysis shows a U-curve in which instruction tuning removes bias but reasoning distillation reintroduces it despite unchanged size.

Scaling Laws for Moral Machine Judgment in Large Language Models

cs.CY · 2026-01-25 · conditional · novelty 5.0

Moral alignment in LLMs improves with model size according to the power law D ∝ S^{-0.10} (R²=0.50).

citing papers explorer

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
cs.CL · 2026-05-11 · conditional · none · ref 15

Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
cs.LG · 2026-05-04 · unverdicted · none · ref 6

Scaling Laws for Moral Machine Judgment in Large Language Models
cs.CY · 2026-01-25 · conditional · none · ref 27
