Artificial Intelligence, Values, and Alignment

Canonical reference. 88% of citing Pith papers cite this work as background.

25 Pith papers citing it
572 external citations · Crossref
Background 88% of classified citations
abstract

This paper looks at philosophical questions that arise in the context of AI alignment. It defends three propositions. First, normative and technical aspects of the AI alignment problem are interrelated, creating space for productive engagement between people working in both domains. Second, it is important to be clear about the goal of alignment. There are significant differences between AI that aligns with instructions, intentions, revealed preferences, ideal preferences, interests and values. A principle-based approach to AI alignment, which combines these elements in a systematic way, has considerable advantages in this context. Third, the central challenge for theorists is not to identify 'true' moral principles for AI; rather, it is to identify fair principles for alignment that receive reflective endorsement despite widespread variation in people's moral beliefs. The final part of the paper explores three ways in which fair principles for AI alignment could potentially be identified.

citation-role summary

background 8

citation-polarity summary

background 7 · support 1

representative citing papers

A Technical Typology of AI Systems in Public Administration

cs.CY · 2026-06-30 · unverdicted · novelty 6.0

The paper defines five AI system categories for public administration and reports that 55% of 91 recent papers leave the system type underspecified while 31% study one type but motivate with another.

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · novelty 6.0

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

Positive Alignment: Artificial Intelligence for Human Flourishing

cs.AI · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.

How Value Induction Reshapes LLM Behaviour

cs.CL · 2026-05-08 · unverdicted · novelty 4.0

Inducing targeted values in LLMs through fine-tuning causes spillover to related or opposing values, boosts safety metrics, and increases anthropomorphic and sycophantic language across all tested values.

Towards Responsibly Non-Compliant Machines

cs.AI · 2026-06-10 · unverdicted · novelty 3.0

The paper sketches responsible non-compliance for autonomous AI agents, anchored in task refusal justifications, override pathways, security risk tracking, and liability transfers.
