Artificial Intelligence, Values, and Alignment

Iason Gabriel · 2020 · cs.CY · DOI 10.1007/s11023-020-09539-2 · arXiv 2001.09768

Canonical reference. 88% of citing Pith papers cite this work as background.

25 Pith papers citing it

572 external citations · Crossref

Background 88% of classified citations

abstract

This paper looks at philosophical questions that arise in the context of AI alignment. It defends three propositions. First, normative and technical aspects of the AI alignment problem are interrelated, creating space for productive engagement between people working in both domains. Second, it is important to be clear about the goal of alignment. There are significant differences between AI that aligns with instructions, intentions, revealed preferences, ideal preferences, interests and values. A principle-based approach to AI alignment, which combines these elements in a systematic way, has considerable advantages in this context. Third, the central challenge for theorists is not to identify 'true' moral principles for AI; rather, it is to identify fair principles for alignment, that receive reflective endorsement despite widespread variation in people's moral beliefs. The final part of the paper explores three ways in which fair principles for AI alignment could potentially be identified.

citation-role summary

background 8

citation-polarity summary

background 7 · support 1

representative citing papers

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts

cs.AI · 2026-05-24 · unverdicted · novelty 7.0

Empirical study finds strong heterogeneity in LLM process alignment across models and organizations; process alignment predicts output accuracy in legal decisions but is low and resistant in credit decisions where higher alignment may not be desirable.

Towards Measuring the Representation of Subjective Global Opinions in Language Models

cs.CL · 2023-06-28 · conditional · novelty 7.0

LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.

A Technical Typology of AI Systems in Public Administration

cs.CY · 2026-06-30 · unverdicted · novelty 6.0

The paper defines five AI system categories for public administration and reports that 55% of 91 recent papers leave the system type underspecified while 31% study one type but motivate with another.

Political Neutrality as Balanced Approval: A Large-Scale Human Evaluation of AI Responses

cs.CY · 2026-05-27 · unverdicted · novelty 6.0

AI political neutrality is redefined as balanced high approval across opposing groups and tested in a 7434-person study showing dual approval is achievable while default outputs from most models lean liberal.

The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers

cs.CY · 2026-04-27 · unverdicted · novelty 6.0 · 3 refs

Survey experiment finds that people apply more deontological standards to AI described as human-programmed and to the programmers themselves than to unaided humans or unprogrammed robots in a moral dilemma.

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

cs.AI · 2026-04-03 · unverdicted · novelty 6.0

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

AI and My Values: User Perceptions of LLMs' Ability to Extract, Embody, and Explain Human Values from Casual Conversations

cs.HC · 2026-01-30 · unverdicted · novelty 6.0

13 participants became convinced AI understands human values after chatbot interactions evaluated with the VAPT toolkit.

ActivationReasoning: Logical Reasoning in Latent Activation Spaces

cs.LG · 2025-10-21 · unverdicted · novelty 6.0

ActivationReasoning grounds logical reasoning in LLM latent activations via SAEs to enable structured inference, concept composition, and behavior steering on multi-hop, abstraction, and safety tasks.

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · novelty 6.0

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Incoherent Values? Probing LLM Preferences Through Parametric Variation

cs.CY · 2026-06-19 · unverdicted · novelty 5.0

LLMs display significant value incoherence that does not scale with capability, demonstrated through a parametric variation framework on forced choices, though reasoning improves consistency.

Positive Alignment: Artificial Intelligence for Human Flourishing

cs.AI · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.

Developing an AI Concept Envisioning Toolkit to Support Reflective Juxtaposition of Values and Harms

cs.HC · 2026-04-30 · conditional · novelty 5.0

A new toolkit with cards and maps enables AI designers to juxtapose values and harms in early concept stages, shown valuable in designer surveys and interviews.

How Designers Envision Value-Oriented AI Design Concepts with Generative AI

cs.HC · 2026-04-30 · unverdicted · novelty 5.0

Designers using generative AI for concept envisioning engage in reciprocal reflection-in-action that surfaces multi-level value tensions and prioritizes harm recognition over positive value articulation.

AI of the People, by the People, for the People: A Social Choice Approach to Collective Control of Artificial Intelligence

cs.CY · 2026-04-14 · unverdicted · novelty 5.0

Proposes applying social choice theory as a modeling language and axiomatic tool for incorporating collective input across the ML development pipeline.

Understanding the Gap Between Stated and Revealed Preferences in News Curation: A Study of Young Adult Social Media Users

cs.HC · 2026-04-13 · unverdicted · novelty 5.0

Young adults engage with low-quality news content on social media despite stating preferences for high-quality, accurate, and diverse information, and they produce higher-quality feeds when curating for a hypothetical persona.

How Value Induction Reshapes LLM Behaviour

cs.CL · 2026-05-08 · unverdicted · novelty 4.0

Inducing targeted values in LLMs through fine-tuning causes spillover to related or opposing values, boosts safety metrics, and increases anthropomorphic and sycophantic language across all tested values.

FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism

cs.CY · 2026-04-23 · unverdicted · novelty 4.0

AI integration in newsrooms drives internal deferral of judgment to LLMs and external shifts of power to platforms, making fairness, accountability, and transparency harder to sustain unless participatory mechanisms redistribute authority.

Perception Gaps in Risk, Benefit, and Value Between Experts and Public Challenge Socially Accepted AI

cs.CY · 2024-12-02 · unverdicted · novelty 4.0

Experts rate AI scenarios as more likely, less risky, more beneficial, and more valuable than the public, applying different weightings to risk versus benefit.

Towards Responsibly Non-Compliant Machines

cs.AI · 2026-06-10 · unverdicted · novelty 3.0

The paper sketches responsible non-compliance for autonomous AI agents, anchored in task refusal justifications, override pathways, security risk tracking, and liability transfers.

Civilizational Metamaterials: Engineering Coordination Under Capability Gradients and Structural Turbulence

physics.soc-ph · 2026-05-29 · unverdicted · novelty 3.0

Introduces phenomenological model R_eff = β(1-ρ)(1-τ)(1-γρτ) for coordination under AGI decision velocity, with phase transition and proposed randomized trial.

citing papers explorer

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · none · ref 18 · internal anchor

Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts

cs.AI · 2026-05-24 · unverdicted · none · ref 7 · internal anchor

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

cs.AI · 2026-04-03 · unverdicted · none · ref 10 · internal anchor

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · none · ref 32 · internal anchor

Positive Alignment: Artificial Intelligence for Human Flourishing

cs.AI · 2026-05-11 · unverdicted · none · ref 3 · 2 links · internal anchor

Towards Responsibly Non-Compliant Machines

cs.AI · 2026-06-10 · unverdicted · none · ref 16 · internal anchor
