Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
hub Canonical reference
A Roadmap to Pluralistic Alignment
Canonical reference. 71% of citing Pith papers cite this work as background.
abstract
With increased power and prevalence of AI systems, it is ever more critical that AI systems are designed to serve all, i.e., people with diverse values and perspectives. However, aligning models to serve pluralistic human values remains an open research question. In this piece, we propose a roadmap to pluralistic alignment, specifically using language models as a test bed. We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can steer to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution. We also formalize and discuss three possible classes of pluralistic benchmarks: 1) Multi-objective benchmarks, 2) Trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs, and 3) Jury-pluralistic benchmarks which explicitly model diverse human ratings. We use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic AI; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.
hub tools
citation-role summary
citation-polarity summary
years
2026 18representative citing papers
A technique identifies minimal convergence-divergence points in LLM transformer blocks and calibrates residual-stream directions to achieve targeted ethical-framework control at inference time.
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.
Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
LLMs display Western-centric cultural representations that align poorly with native priorities in non-Western countries and share highly correlated error patterns.
VAT quantifies value trade-offs in LLM alignment by measuring how alignment-induced changes propagate across interconnected values using a Schwartz-grounded dataset.
LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.
RLHF preference measurement is a social science validity problem because annotators routinely produce non-attitudes, constructed responses, and artifacts rather than stable values.
LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.
A tradeoff model shows generative AI can reduce bias against diverse preferences by strategically eliciting information instead of always inferring from majority patterns.
Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.
AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.
Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.
citing papers explorer
-
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
-
Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models
A technique identifies minimal convergence-divergence points in LLM transformer blocks and calibrates residual-stream directions to achieve targeted ethical-framework control at inference time.
-
Three Models of RLHF Annotation: Extension, Evidence, and Authority
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
-
Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users
Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.
-
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
-
Positive Alignment: Artificial Intelligence for Human Flourishing
Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.
-
Understanding Annotator Safety Policy with Interpretability
Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
-
Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
-
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
-
Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations
LLMs display Western-centric cultural representations that align poorly with native priorities in non-Western countries and share highly correlated error patterns.
-
Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment
VAT quantifies value trade-offs in LLM alignment by measuring how alignment-induced changes propagate across interconnected values using a Schwartz-grounded dataset.
-
Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task
LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.
-
Measuring Human Preferences in RLHF is a Social Science Problem
RLHF preference measurement is a social science validity problem because annotators routinely produce non-attitudes, constructed responses, and artifacts rather than stable values.
-
The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor
LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.
-
When to Ask a Question: Understanding Communication Strategies in Generative AI Tools
A tradeoff model shows generative AI can reduce bias against diverse preferences by strategically eliciting information instead of always inferring from majority patterns.
-
Quantifying and Predicting Disagreement in Graded Human Ratings
Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.
-
Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem
AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.
-
Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics
Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.