arxiv: 2001.09768 · v2 · submitted 2020-01-13 · 💻 cs.CY

Recognition: 2 theorem links

· Lean Theorem

Artificial Intelligence, Values and Alignment

Iason Gabriel

Authors on Pith no claims yet

Pith reviewed 2026-05-17 10:37 UTC · model grok-4.3

classification 💻 cs.CY

keywords AI alignmentmoral pluralismreflective endorsementnormative principlesAI ethicsvalue alignmentfair principles

0 comments

The pith

The central task for AI alignment is to identify fair principles that gain reflective endorsement from people with differing moral beliefs, rather than discovering true moral principles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper connects normative philosophy and technical AI work by showing they must inform each other. It clarifies that alignment can target instructions, intentions, preferences, interests, or values, and argues a principle-based approach that combines these has practical advantages. The key claim is that the real challenge is not settling on objectively correct morals but finding principles for alignment that diverse people would accept after reflection. The paper then examines three possible ways to arrive at such principles.

Core claim

Normative and technical aspects of AI alignment are interrelated. Alignment goals differ significantly depending on whether AI is meant to follow instructions, intentions, revealed preferences, ideal preferences, interests, or values. A principle-based approach that integrates these elements offers advantages. The central theoretical task is therefore not to locate true moral principles but to locate fair principles for alignment that can receive reflective endorsement even when people's moral beliefs vary widely; the paper explores three routes by which such principles might be identified.

What carries the argument

A principle-based approach to alignment that systematically combines instructions, intentions, preferences, interests, and values while seeking reflective endorsement across moral disagreement.

If this is right

Technical researchers gain from normative clarity about what alignment should target.
Aligning to a single element like instructions or revealed preferences alone is insufficient.
Fair principles identified through reflection can guide AI behavior despite value differences.
Three distinct procedures for locating such principles deserve further development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Alignment procedures could be tested by applying candidate principles to existing AI systems and checking consistency with reflective judgments.
The approach suggests governance mechanisms that aggregate endorsement rather than impose one moral view.
It opens questions about how to handle persistent disagreement after reflection has occurred.

Load-bearing premise

Methods such as reflective endorsement can produce principles that are both robust to moral pluralism and specific enough to direct technical AI design.

What would settle it

A clear case in which no candidate set of principles both secures broad reflective endorsement and supplies concrete constraints usable in AI system specification.

read the original abstract

This paper looks at philosophical questions that arise in the context of AI alignment. It defends three propositions. First, normative and technical aspects of the AI alignment problem are interrelated, creating space for productive engagement between people working in both domains. Second, it is important to be clear about the goal of alignment. There are significant differences between AI that aligns with instructions, intentions, revealed preferences, ideal preferences, interests and values. A principle-based approach to AI alignment, which combines these elements in a systematic way, has considerable advantages in this context. Third, the central challenge for theorists is not to identify 'true' moral principles for AI; rather, it is to identify fair principles for alignment, that receive reflective endorsement despite widespread variation in people's moral beliefs. The final part of the paper explores three ways in which fair principles for AI alignment could potentially be identified.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Gabriel usefully distinguishes alignment targets like instructions versus values and reframes the task around fair reflective principles rather than true morality, but the identification methods stay too high-level to show robustness under pluralism.

read the letter

This paper's main move is to separate out the different possible targets for AI alignment—instructions, intentions, revealed preferences, ideal preferences, interests, and values—and to argue that a principle-based synthesis has real advantages over latching onto any single one. It then shifts the central question away from discovering true moral principles and toward identifying fair ones that could receive reflective endorsement even when people's moral beliefs differ widely. The three propositions follow from these distinctions in a straightforward way and connect the normative and technical sides without obvious gaps or circularity.

Referee Report

1 major / 2 minor

Summary. The paper defends three propositions on AI alignment: normative and technical aspects are interrelated and benefit from cross-domain engagement; alignment targets must be disambiguated (instructions, intentions, revealed/ideal preferences, interests, values) with a principle-based approach offering systematic advantages; and the central task is identifying fair principles that secure reflective endorsement amid moral pluralism rather than 'true' moral principles, with the final section sketching three potential identification methods.

Significance. If the central claim holds, the paper offers a philosophically grounded reorientation of AI alignment research away from contested moral truths toward procedurally fair principles. This could productively inform technical work by clarifying value-laden choices and drawing on established ideas in moral philosophy such as reflective equilibrium. The clear distinctions among alignment targets and the emphasis on pluralism are strengths that address a genuine gap between philosophical and engineering perspectives.

major comments (1)

The section exploring three ways to identify fair principles provides only high-level sketches and does not demonstrate that any procedure (e.g., reflective endorsement) yields stable cross-perspective endorsement or specifies mechanisms for resolving residual disagreement when principles are translated into AI objectives. This is load-bearing for the claim that such principles can guide technical alignment work under persistent moral pluralism.

minor comments (2)

The abstract and introduction could more explicitly preview the three methods to be explored, improving reader orientation.
Some citations to key works in reflective equilibrium and moral pluralism (e.g., Rawls or Scanlon) would strengthen the grounding of the proposed methods.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive report and for recognizing the paper's distinctions among alignment targets as well as its emphasis on moral pluralism. We address the single major comment below.

read point-by-point responses

Referee: The section exploring three ways to identify fair principles provides only high-level sketches and does not demonstrate that any procedure (e.g., reflective endorsement) yields stable cross-perspective endorsement or specifies mechanisms for resolving residual disagreement when principles are translated into AI objectives. This is load-bearing for the claim that such principles can guide technical alignment work under persistent moral pluralism.

Authors: We agree that the final section offers high-level sketches of three possible identification procedures rather than detailed proofs or empirical demonstrations of stable cross-perspective endorsement. The manuscript does not claim to have established that reflective endorsement (or any other method) will reliably produce convergence under all forms of moral pluralism, nor does it supply complete mechanisms for translating such principles into concrete AI objectives. Instead, the section's purpose is to illustrate that the target of fair, reflectively endorsable principles is not empty and can draw on familiar philosophical resources such as reflective equilibrium and deliberative procedures. The paper's central argument—that the appropriate aim is fair principles rather than contested moral truths—does not depend on having already solved the implementation details; those details are presented as directions for subsequent work. We therefore maintain that the sketches suffice to support the reorientation we propose, while acknowledging that fuller specification of disagreement-resolution mechanisms would strengthen the bridge to technical alignment research. revision: partial

Circularity Check

0 steps flagged

No circularity in philosophical analysis of alignment principles

full rationale

The paper's derivation consists of conceptual analysis distinguishing alignment targets (instructions, intentions, preferences, values) and arguing for a shift from identifying 'true' moral principles to identifying 'fair' principles via reflective endorsement under moral pluralism. These propositions are advanced through normative reasoning and high-level sketches of three identification methods rather than any equations, fitted parameters, self-citations that bear the central load, or reductions where a claimed result is equivalent to its inputs by construction. The arguments draw on external philosophical traditions and remain self-contained without internal loops that would require quoting a specific reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on philosophical assumptions about the value of reflective endorsement for handling moral disagreement and the feasibility of a principle-based synthesis of alignment targets; no free parameters or invented entities are introduced.

axioms (2)

domain assumption Normative and technical aspects of the AI alignment problem are interrelated
First proposition defended; invoked to create space for engagement between domains.
domain assumption Fair principles for alignment can receive reflective endorsement despite widespread variation in moral beliefs
Core of the third proposition; underpins the claim that this is the central challenge.

pith-pipeline@v0.9.0 · 5428 in / 1234 out tokens · 31981 ms · 2026-05-17T10:37:12.164043+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LawOfExistence unity_unique_existent unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the central challenge for theorists is not to identify 'true' moral principles for AI; rather, it is to identify fair principles for alignment, that receive reflective endorsement despite widespread variation in people's moral beliefs
IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A principle-based approach to AI alignment, which combines these elements in a systematic way, has considerable advantages in this context

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
cs.AI 2026-05 unverdicted novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
Towards Measuring the Representation of Subjective Global Opinions in Language Models
cs.CL 2023-06 conditional novelty 7.0

LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliab...
Positive Alignment: Artificial Intelligence for Human Flourishing
cs.AI 2026-05 unverdicted novelty 6.0

Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.
The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
cs.CY 2026-04 unverdicted novelty 6.0

Moral judgments become more deontological when human design of AI is visible, and designers are judged more strictly than the AI or unaided humans, creating plural and non-converging targets for value alignment.
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
cs.AI 2026-04 unverdicted novelty 6.0

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
AI and My Values: User Perceptions of LLMs' Ability to Extract, Embody, and Explain Human Values from Casual Conversations
cs.HC 2026-01 unverdicted novelty 6.0

13 participants became convinced AI understands human values after chatbot interactions evaluated with the VAPT toolkit.
A Roadmap to Pluralistic Alignment
cs.AI 2024-02 unverdicted novelty 6.0

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Ethical and social risks of harm from Language Models
cs.CL 2021-12 accept novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Developing an AI Concept Envisioning Toolkit to Support Reflective Juxtaposition of Values and Harms
cs.HC 2026-04 conditional novelty 5.0

A new toolkit with cards and maps enables AI designers to juxtapose values and harms in early concept stages, shown valuable in designer surveys and interviews.
How Designers Envision Value-Oriented AI Design Concepts with Generative AI
cs.HC 2026-04 unverdicted novelty 5.0

Designers using generative AI for concept envisioning engage in reciprocal reflection-in-action that surfaces multi-level value tensions and prioritizes harm recognition over positive value articulation.
The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
cs.CY 2026-04 conditional novelty 5.0

People judge AI systems and their human designers with markedly more deontological constraints than they apply to humans or standalone robots in the same ethical scenario.
Understanding the Gap Between Stated and Revealed Preferences in News Curation: A Study of Young Adult Social Media Users
cs.HC 2026-04 unverdicted novelty 5.0

Young adults engage with low-quality news content on social media despite stating preferences for high-quality, accurate, and diverse information, and they produce higher-quality feeds when curating for a hypothetical...
Positive Alignment: Artificial Intelligence for Human Flourishing
cs.AI 2026-05 unverdicted novelty 4.0

Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.
How Value Induction Reshapes LLM Behaviour
cs.CL 2026-05 unverdicted novelty 4.0

Inducing targeted values in LLMs through fine-tuning causes spillover to related or opposing values, boosts safety metrics, and increases anthropomorphic and sycophantic language across all tested values.
FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism
cs.CY 2026-04 unverdicted novelty 4.0

AI integration in newsrooms drives internal deferral of judgment to LLMs and external shifts of power to platforms, making fairness, accountability, and transparency harder to sustain unless participatory mechanisms r...
Open Problems in Frontier AI Risk Management
cs.LG 2026-04 unverdicted novelty 3.0

The paper maps unresolved challenges in frontier AI risk management, classifies them into lack of consensus, framework misalignment, or implementation shortfalls, and identifies actors best positioned to address each.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 16 Pith papers · 2 internal anchors

[1]

& Ng, A.Y

Abbeel, P. & Ng, A.Y. (2004, July). Apprenticeship learning via inverse reinforcement learning. In Pro- ceedings of the twenty-first international conference on Machine learning (p. 1). ACM. Achiam, J., Held, D., Tamar, A. & Abbeel, P. (2017, August). Constrained policy optimization. In Pro- ceedings of the 34th International Conference on Machine Learnin...

work page 2004
[2]

Baum, S.D. (2017). Social choice ethics in artificial intelligence. AI Soc (pp. 1–12). Beauchamp, T. L., & Childress, J. F. (2001). Principles of biomedical ethics. USA: Oxford University Press. Blackburn, S. (2001). Ruling passions: An essay in practical reasoning. Oxford: Oxford University Press. Bostrom, N. (2009). Moral uncertainty—towards a solution?...

work page 2017
[3]

Cohen, G. A. (2003). Facts and principles. Philosophy & Public Affairs, 31(3), 211–245. Cohen, J. (2010). The arc of the moral universe and other essays. New York: Harvard University Press. Cohen, J., & Sabel, C. (2006). Extra rempublicam nulla justitia? Philosophy & Public Affairs, 34(2), 147–175. Cotra, A. (2018). Iterated Distillation and Amplification...

work page 2003
[4]

Dworkin, R. (1981). What is equality? Part 1: Equality of welfare. in Philosophy & public affairs, (pp. 185–246). 1 3 Artificial Intelligence, Values, and Alignment Dworkin, R. (1984). Rights as Trumps. Arguing Law (pp. 335–44). Eckersley, P. (2018). Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility fu...

work page internal anchor Pith review Pith/arXiv arXiv 1981
[5]

Floridi, L., Cowls, J., Beltrametti, M., Chatila, R., Chazerand, P., Dignum, V., et al. (2018). AI4People— an ethical framework for a good AI society: opportunities, risks, principles, and recommendations. Minds and Machines, 28(4), 689–707. Gardner, H. (2011). Frames of mind: The theory of multiple intelligences. New York: Hachette. Gilligan, C. (1993). ...

work page 2018
[6]

ArXiv180404268 Cs

Incomplete Contracting and AI Alignment. ArXiv180404268 Cs. Hadfield-Menell, D., Russell, S. J., Abbeel, P., & Dragan, A. (2016). Cooperative Inverse Reinforcement Learning. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems 29 (pp. 3909–3917). New York: Curran Associates Inc. Haidt, ...

work page doi:10.2307/20050 2016
[7]

ArXiv180500899 Cs Stat

AI safety via debate. ArXiv180500899 Cs Stat. Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389–399. Jonas, H. (1984). The imperative of responsibility: In search of an ethics for the technological age. Kahneman, D., & Tversky, A. (2000). Choices, values, and frames. Cambridge: ...

work page 2019
[8]

Kant, I., & Schneewind, J. B. (2002). Groundwork for the Metaphysics of Morals. Yale: Yale University Press. Kober, J., Bagnell, J. A., & Peters, J. (2013). Reinforcement learning in robotics: A survey. The Interna- tional Journal of Robotics Research, 32(11), 1238–1274. Koepke, J. L., & Robinson, D. G. (2018). Danger ahead: Risk assessment and the future...

work page 2002
[9]

Experiment,

Korsgaard, C. M., Cohen, G. A., Geuss, R., Nagel, T., & Williams, B. (1996). The sources of normativity. Cambridge: Cambridge University Press. Kymlicka, W. (2002). Contemporary political philosophy: An introduction. Oxford: Oxford University Press. Legg, S., & Hutter, M. (2007). Universal intelligence: A definition of machine intelligence. Minds and Mach...

work page arXiv 1996
[10]

Nozick, R. (1974). Anarchy, state, and utopia. New York: Basic Books. Nussbaum, M., & Sen, A. (1993). The quality of life. Oxford: Oxford University Press. Ord, T. (2020). The precipice: Existential risk and the future of humanity. Hachette Books. Parfit, D. (2011). On what matters (Vol. 1). Oxford: Oxford University Press. Perry, L. (2018). AI Alignment ...

work page 1974
[11]

Prasad, M. (2018). Social Choice and the Value Alignment Problem * [WWW Document]. Artif: Intel- ligence Safety Security. https ://doi.org/10.1201/97813 51251 389-21. Quinn, W., & Foot, P. (1993). Morality and action. Cambridge: Cambridge University Press. Rabinowitz, N.C., Perbet, F., Song, H.F., Zhang, C., Eslami, S.M. & Botvinick, M. (2018). Machine th...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1201/97813 2018
[12]

Presented at the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp

Inverse Reinforcement Learning algorithms and features for robot navigation in crowds: An experimental comparison, in: 2014 IEEE/RSJ International Confer- ence on Intelligent Robots and Systems. Presented at the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1341–1346. https ://doi.org/10.1109/IROS.2014.69427 31 Waldron, J. ...

work page doi:10.1109/iros.2014.69427 2014