Artificial Intelligence, Values, and Alignment
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
This paper looks at philosophical questions that arise in the context of AI alignment. It defends three propositions. First, normative and technical aspects of the AI alignment problem are interrelated, creating space for productive engagement between people working in both domains. Second, it is important to be clear about the goal of alignment. There are significant differences between AI that aligns with instructions, intentions, revealed preferences, ideal preferences, interests and values. A principle-based approach to AI alignment, which combines these elements in a systematic way, has considerable advantages in this context. Third, the central challenge for theorists is not to identify 'true' moral principles for AI; rather, it is to identify fair principles for alignment, that receive reflective endorsement despite widespread variation in people's moral beliefs. The final part of the paper explores three ways in which fair principles for AI alignment could potentially be identified.
citation-role summary
roles
background 8
citing papers explorer
- Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
  Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
- Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts
  Empirical study finds strong heterogeneity in LLM process alignment across models and organizations; process alignment predicts output accuracy in legal decisions but is low and resistant in credit decisions where higher alignment may not be desirable.
- Towards Measuring the Representation of Subjective Global Opinions in Language Models
  LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.
- A Technical Typology of AI Systems in Public Administration
  The paper defines five AI system categories for public administration and reports that 55% of 91 recent papers leave the system type underspecified while 31% study one type but motivate with another.
- Political Neutrality as Balanced Approval: A Large-Scale Human Evaluation of AI Responses
  AI political neutrality is redefined as balanced high approval across opposing groups and tested in a 7434-person study showing dual approval is achievable while default outputs from most models lean liberal.
- The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
  Survey experiment finds that people apply more deontological standards to AI described as human-programmed and to the programmers themselves than to unaided humans or unprogrammed robots in a moral dilemma.
- Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
  Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
- AI and My Values: User Perceptions of LLMs' Ability to Extract, Embody, and Explain Human Values from Casual Conversations
  13 participants became convinced AI understands human values after chatbot interactions evaluated with the VAPT toolkit.
- ActivationReasoning: Logical Reasoning in Latent Activation Spaces
  ActivationReasoning grounds logical reasoning in LLM latent activations via SAEs to enable structured inference, concept composition, and behavior steering on multi-hop, abstraction, and safety tasks.
- A Roadmap to Pluralistic Alignment
  The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- Ethical and social risks of harm from Language Models
  The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- Incoherent Values? Probing LLM Preferences Through Parametric Variation
  LLMs display significant value incoherence that does not scale with capability, demonstrated through a parametric variation framework on forced choices, though reasoning improves consistency.
- Positive Alignment: Artificial Intelligence for Human Flourishing
  Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.
- Developing an AI Concept Envisioning Toolkit to Support Reflective Juxtaposition of Values and Harms
  A new toolkit with cards and maps enables AI designers to juxtapose values and harms in early concept stages, shown valuable in designer surveys and interviews.
- How Designers Envision Value-Oriented AI Design Concepts with Generative AI
  Designers using generative AI for concept envisioning engage in reciprocal reflection-in-action that surfaces multi-level value tensions and prioritizes harm recognition over positive value articulation.
- AI of the People, by the People, for the People: A Social Choice Approach to Collective Control of Artificial Intelligence
  Proposes applying social choice theory as a modeling language and axiomatic tool for incorporating collective input across the ML development pipeline.
- Understanding the Gap Between Stated and Revealed Preferences in News Curation: A Study of Young Adult Social Media Users
  Young adults engage with low-quality news content on social media despite stating preferences for high-quality, accurate, and diverse information, and they produce higher-quality feeds when curating for a hypothetical persona.
- How Value Induction Reshapes LLM Behaviour
  Inducing targeted values in LLMs through fine-tuning causes spillover to related or opposing values, boosts safety metrics, and increases anthropomorphic and sycophantic language across all tested values.
- FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism
  AI integration in newsrooms drives internal deferral of judgment to LLMs and external shifts of power to platforms, making fairness, accountability, and transparency harder to sustain unless participatory mechanisms redistribute authority.
- Perception Gaps in Risk, Benefit, and Value Between Experts and Public Challenge Socially Accepted AI
  Experts rate AI scenarios as more likely, less risky, more beneficial, and more valuable than the public, applying different weightings to risk versus benefit.
- Towards Responsibly Non-Compliant Machines
  The paper sketches responsible non-compliance for autonomous AI agents, anchored in task refusal justifications, override pathways, security risk tracking, and liability transfers.
- Civilizational Metamaterials: Engineering Coordination Under Capability Gradients and Structural Turbulence
  Introduces phenomenological model R_eff = β(1-ρ)(1-τ)(1-γρτ) for coordination under AGI decision velocity, with phase transition and proposed randomized trial.
- Open Problems in Frontier AI Risk Management
  The paper maps unresolved challenges in frontier AI risk management, classifies them into lack of consensus, framework misalignment, or implementation shortfalls, and identifies actors best positioned to address each.