Artificial Intelligence, Values, and Alignment

Canonical reference. 88% of citing Pith papers cite this work as background.

25 Pith papers citing it
572 external citations · Crossref
Background 88% of classified citations
abstract

This paper looks at philosophical questions that arise in the context of AI alignment. It defends three propositions. First, normative and technical aspects of the AI alignment problem are interrelated, creating space for productive engagement between people working in both domains. Second, it is important to be clear about the goal of alignment. There are significant differences between AI that aligns with instructions, intentions, revealed preferences, ideal preferences, interests and values. A principle-based approach to AI alignment, which combines these elements in a systematic way, has considerable advantages in this context. Third, the central challenge for theorists is not to identify 'true' moral principles for AI; rather, it is to identify fair principles for alignment, that receive reflective endorsement despite widespread variation in people's moral beliefs. The final part of the paper explores three ways in which fair principles for AI alignment could potentially be identified.

hub tools

citation-role summary

background 8

citation-polarity summary

roles

background 8

polarities

background 7 · support 1

representative citing papers

A Technical Typology of AI Systems in Public Administration

cs.CY · 2026-06-30 · unverdicted · novelty 6.0

The paper defines five AI system categories for public administration and reports that 55% of 91 recent papers leave the system type underspecified while 31% study one type but motivate with another.

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · novelty 6.0

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

Positive Alignment: Artificial Intelligence for Human Flourishing

cs.AI · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.

How Value Induction Reshapes LLM Behaviour

cs.CL · 2026-05-08 · unverdicted · novelty 4.0

Inducing targeted values in LLMs through fine-tuning causes spillover to related or opposing values, boosts safety metrics, and increases anthropomorphic and sycophantic language across all tested values.

Towards Responsibly Non-Compliant Machines

cs.AI · 2026-06-10 · unverdicted · novelty 3.0

The paper sketches responsible non-compliance for autonomous AI agents, anchored in task refusal justifications, override pathways, security risk tracking, and liability transfers.

citing papers explorer

Showing 6 of 6 citing papers after filters.

  • Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values · cs.AI · 2026-05-11 · unverdicted · none · ref 18 · internal anchor

    Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

  • Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts · cs.AI · 2026-05-24 · unverdicted · none · ref 7 · internal anchor

    Empirical study finds strong heterogeneity in LLM process alignment across models and organizations; process alignment predicts output accuracy in legal decisions but is low and resistant in credit decisions where higher alignment may not be desirable.

  • Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules · cs.AI · 2026-04-03 · unverdicted · none · ref 10 · internal anchor

    Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

  • A Roadmap to Pluralistic Alignment · cs.AI · 2024-02-07 · unverdicted · none · ref 32 · internal anchor

    The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

  • Positive Alignment: Artificial Intelligence for Human Flourishing · cs.AI · 2026-05-11 · unverdicted · none · ref 3 · 2 links · internal anchor

    Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.

  • Towards Responsibly Non-Compliant Machines · cs.AI · 2026-06-10 · unverdicted · none · ref 16 · internal anchor

    The paper sketches responsible non-compliance for autonomous AI agents, anchored in task refusal justifications, override pathways, security risk tracking, and liability transfers.