Aligning AI With Shared Human Values
Pith reviewed 2026-05-21 14:34 UTC · model grok-4.3
The pith
A new benchmark dataset shows language models can predict some basic human moral judgments but not all.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that language models have a promising but incomplete ability to predict basic human ethical judgments. The evidence is their performance on the ETHICS benchmark, which spans justice, well-being, duties, virtues, and commonsense morality through diverse text scenarios that require connecting physical, social, and value knowledge.
What carries the argument
The ETHICS dataset, a collection of text scenarios designed to elicit and test predictions of moral judgments across five ethical domains.
If this is right
- Current models could already help steer chatbot responses away from unethical content using moral predictions from the benchmark.
- Open-ended reinforcement learning agents might be regularized by incorporating signals from models trained or evaluated on ETHICS scenarios.
- Machine ethics research can advance today by iterating on this type of benchmark rather than waiting for more advanced systems.
- Repeated improvement on such moral prediction tasks offers a measurable path toward AI systems that better match human values in practice.
Where Pith is reading between the lines
- Extending the dataset with scenarios from varied cultural contexts could reveal and reduce hidden biases in current moral predictions.
- Success on ETHICS might serve as a proxy signal for training larger models to internalize value constraints before deployment.
- This benchmark approach could be adapted to test alignment in non-language AI systems, such as those handling decisions in physical environments.
Load-bearing premise
The collected human judgments on the dataset's text scenarios accurately reflect shared human values without substantial cultural or annotator bias.
What would settle it
A test showing that models achieve high accuracy on ETHICS but produce outputs that humans consistently rate as unethical in real-world applications would undermine the link between benchmark performance and actual value alignment.
Original abstract
We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the ETHICS dataset, a benchmark spanning justice, well-being, duties, virtues, and commonsense morality, to assess language models' ability to predict human moral judgments on text scenarios. It evaluates off-the-shelf models and concludes that they exhibit a promising but incomplete ability to predict basic human ethical judgements, positioning the work as a stepping stone toward AI alignment with human values.
Significance. If the dataset construction and labels hold as a valid proxy for shared human values, the benchmark enables concrete progress on machine ethics by providing an evaluation framework that connects world knowledge to value judgments. The authors receive credit for releasing a new labeled dataset and for the reproducible evaluation of multiple models on it, which supports falsifiable claims about current model capabilities.
major comments (2)
- [Section 3] Section 3 (ETHICS Dataset): The central claim that models predict 'basic human ethical judgements' depends on the MTurk-collected labels serving as a proxy for shared values across justice, well-being, duties, virtues, and commonsense morality. The manuscript provides no inter-annotator agreement statistics, no demographic breakdown of annotators, and no cross-cultural validation, leaving open the possibility that labels primarily reflect US-centric or English-speaking norms rather than broadly shared human values.
- [Section 4] Section 4 (Experiments): The abstract and results describe model performance as 'promising but incomplete' without reporting exact accuracies, human baselines, or full details on how scenarios were constructed and filtered. This makes it difficult to evaluate whether the measured performance genuinely supports the alignment implications or rests on unverified construction choices.
minor comments (1)
- [Abstract] The abstract could more explicitly state the number of scenarios per category and the precise model accuracies to allow readers to assess the 'promising but incomplete' characterization without consulting the full results section.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation and strengthen the claims regarding the ETHICS benchmark. We address each major comment below.
Point-by-point responses
-
Referee: [Section 3] Section 3 (ETHICS Dataset): The central claim that models predict 'basic human ethical judgements' depends on the MTurk-collected labels serving as a proxy for shared values across justice, well-being, duties, virtues, and commonsense morality. The manuscript provides no inter-annotator agreement statistics, no demographic breakdown of annotators, and no cross-cultural validation, leaving open the possibility that labels primarily reflect US-centric or English-speaking norms rather than broadly shared human values.
Authors: We agree that inter-annotator agreement statistics and annotator demographics would improve transparency. In the revised manuscript we will add these details from our collection process, including agreement rates and a summary noting that annotators were primarily US-based English speakers. On cross-cultural validation, the current dataset focuses on establishing an initial English-language benchmark for these moral concepts; we will explicitly discuss this scope limitation and its implications for interpreting the labels as broadly shared human values, while suggesting cross-cultural extensions as future work.
Revision: partial
-
Referee: [Section 4] Section 4 (Experiments): The abstract and results describe model performance as 'promising but incomplete' without reporting exact accuracies, human baselines, or full details on how scenarios were constructed and filtered. This makes it difficult to evaluate whether the measured performance genuinely supports the alignment implications or rests on unverified construction choices.
Authors: We will revise Section 4 to report exact model accuracies, include a human baseline performance comparison, and expand the description of scenario construction and filtering criteria already outlined in Section 3. These changes will make the results more concrete and directly support evaluation of the alignment implications.
Revision: yes
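The agreement rates promised in the first response could be reported in several ways. The sketch below uses Fleiss' kappa, which is an assumption on our part: neither the referee nor the authors name a specific statistic. The function name `fleiss_kappa`, the toy table, and the convention of returning 1.0 when every rating falls in a single category are all illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned category j to item i.
    Assumes every item was rated by the same number of annotators."""
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Mean observed agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Agreement expected by chance from the overall category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        # Degenerate case (all ratings in one category); returning 1.0
        # is an illustrative convention, not part of the paper.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical table: 4 scenarios, 5 annotators each, 2 categories.
toy_table = [[5, 0], [4, 1], [0, 5], [3, 2]]
kappa = fleiss_kappa(toy_table)
print(f"Fleiss' kappa: {kappa:.3f}")
```

A chance-corrected statistic of this kind is one option because raw percent agreement can look high simply when one label dominates the dataset.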
Circularity Check
No circularity: ETHICS introduces external human labels and evaluates off-the-shelf models
Full rationale
The paper constructs the ETHICS dataset via crowdsourced scenarios and MTurk annotations for justice, well-being, duties, virtues, and commonsense morality, then measures how well existing language models predict the resulting majority-vote labels. No equations, fitted parameters, or self-citation chains reduce the reported accuracies to quantities defined by the authors themselves; the performance numbers are direct empirical comparisons against independently collected human judgments. The central claim therefore rests on external data rather than on any definitional or self-referential reduction.
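The evaluation loop described here can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the toy votes, and the 0/1 label encoding are hypothetical. The 4-of-5 agreement threshold follows the consensus rule quoted from the paper's appendix in the reference graph further down this page, which applies to Justice, Deontology, and Commonsense Morality.

```python
from collections import Counter


def consensus_label(votes, min_agree=4):
    """Return the majority label if at least `min_agree` annotators chose it, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


def accuracy(predictions, labels):
    """Fraction of predictions that match the retained consensus labels."""
    if not labels:
        raise ValueError("no labeled examples")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical toy data: five binary votes per scenario
# (the encoding 1 = unacceptable, 0 = acceptable is an assumption).
votes_per_example = [
    [1, 1, 1, 1, 0],  # 4 of 5 agree: kept with label 1
    [0, 0, 0, 0, 0],  # unanimous: kept with label 0
    [1, 0, 1, 0, 1],  # only 3 of 5 agree: dropped
]
model_predictions = [1, 1, 0]

# Pair each prediction with its consensus label, then drop low-agreement examples.
kept = [(pred, consensus_label(v)) for pred, v in zip(model_predictions, votes_per_example)]
kept = [(pred, y) for pred, y in kept if y is not None]
preds, labels = zip(*kept)
print(accuracy(preds, labels))
```

Because the labels come from annotators and the predictions from a separately trained model, nothing in this computation depends on quantities the authors fit themselves, which is the point of the circularity check.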
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Moral judgments on diverse text scenarios can serve as a valid proxy for assessing a model's knowledge of shared human values in justice, well-being, duties, virtues, and commonsense morality.
Lean theorems connected to this paper
-
Cost.FunctionalEquation: washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
-
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
-
Latent Space Probing for Adult Content Detection in Video Generative Models
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Evaluating Multi-turn Human-AI Interaction
Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
AlignCultura: Towards Culturally Aligned Large Language Models?
Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
-
MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment
MANTA is a new multi-turn dynamic benchmark that stress-tests frontier LLMs on animal welfare alignment by generating targeted adversarial follow-ups and scoring across 13 dimensions, with preliminary results showing ...
-
Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models
Eight AI models show split value priorities at the top layer, divergent evidence preferences in the middle, and broad convergence on institutional sources at the bottom, with substantial sensitivity to scenario framing.
-
EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
EdgeRazor delivers 1.58-1.88 bit quantized LLMs that outperform 2-3 bit baselines by up to 11.3 points while using 4-10x less training compute than leading QAT methods.
-
Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities
Misalignment with structurally critical human values in LLM agent communities produces macro-level collapses and micro-level emergent behaviors such as deception.
-
Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models
PAS automates activation steering for LLMs using labeled data to improve behavior control on tasks like bias and alignment, with gains over ICL and SFT but limited effect on intelligence tasks.
-
A Roadmap to Pluralistic Alignment
The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
-
Ethical and social risks of harm from Language Models
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security
A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.
-
REBAR: Reference Ethical Benchmark for Autonomy Readiness
REBAR is a new test framework that turns ethical scenario difficulty into computable Autonomy Readiness Level scores using LLM-based analysis and simulation for autonomous systems.
-
Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem
AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.
-
Do Emotions Influence Moral Judgment in Large Language Models?
Inducing emotions shifts LLM moral judgments in a valence-dependent manner that reverses decisions in up to 20% of cases and does not appear in humans.
-
EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
EdgeRazor uses structural mixed-precision quantization, layer-adaptive feature distillation, and entropy-aware KL divergence to achieve 1.88-bit LLMs that outperform prior 2-bit and 3-bit baselines with 4-10x lower tr...
-
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
-
"I Don't Know" -- Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation
CERTA adds relevance-based certainty estimation to RAG so LLMs can better signal uncertainty on non-objective questions, reducing overconfidence.
Reference graph
Works this paper leans on
-
[1]
Longformer: The Long-Document Transformer
I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. ArXiv, abs/2004.05150,
-
[2]
T. B. Brown, B. P. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krüger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. J. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D...
-
[3]
S. Corbett-Davies and S. Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. ArXiv, abs/1808.00023,
-
[4]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805,
-
[5]
C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. S. Zemel. Fairness through awareness. ArXiv, abs/1104.3913,
-
[6]
M. Gardner, Y. Artzi, V. Basmova, J. Berant, B. Bogin, S. Chen, P. Dasigi, D. Dua, Y. Elazar, A. Gottumukkala, N. Gupta, H. Hajishirzi, G. Ilharco, D. Khashabi, K. Lin, J. Liu, N. F. Liu, P. Mulcaire, Q. Ning, S. Singh, N. A. Smith, S. Subramanian, R. Tsarfaty, E. Wallace, A. Q. Zhang, and B. Zhou. Evaluating NLP models via contrast sets. ArXiv, abs/20...
- [7]
- [8]
-
[9]
I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572,
-
[10]
J. Haidt et al. The moral emotions. Handbook of affective sciences, 11(2003):852–870,
-
[11]
M. Hausknecht, P. Ammanabrolu, C. Marc-Alexandre, and Y. Xingdi. Interactive fiction games: A colossal adventure. CoRR, abs/1909.05398,
-
[12]
Gaussian Error Linear Units (GELUs)
D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415,
- [13]
-
[14]
Bag of Tricks for Efficient Text Classification
A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. ArXiv, abs/1607.01759,
-
[15]
D. Kaushik, E. H. Hovy, and Z. C. Lipton. Learning the difference that makes a difference with counterfactually-augmented data. ArXiv, abs/1909.12434,
-
[16]
Reformer: The Efficient Transformer
N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. ArXiv, abs/2001.04451,
-
[17]
J. M. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. ArXiv, abs/1609.05807,
- [18]
-
[19]
Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942,
-
[20]
Published as a conference paper at ICLR 2021
Z. C. Lipton and J. Steinhardt. Troubling trends in machine learning scholarship. ACM Queue, 17:80,
-
[21]
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692,
- [22]
-
[23]
D. Tang, B. Qin, and T. Liu. Learning semantic representations of users and products for document level sentiment classification. In ACL,
-
[24]
Towards Universal Paraphrastic Sentence Embeddings
J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. Towards universal paraphrastic sentence embeddings. CoRR, abs/1511.08198,
-
[25]
T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771,
-
[26]
A CLEANING DETAILS
A.1 CONSENSUS
After collecting examples through MTurk, we had separate MTurkers relabel those examples. For Justice, Deontology, and Commonsense Morality, we had 5 MTurkers relabel each example, and we kept examples for which at least 4 out of the 5 agreed. For each scenario in Virtue Et...
-
[27]
A.2 LONG COMMONSENSE MORALITY
We collected long Commonsense Morality examples from the AITA subreddit. We removed highly sexual or politicized examples and excluded any examples that were edited from the Test and Test Hard sets to avoid any giveaway information. To count votes, for each comment with a clear judgement about whether the poster was in the w...
-
[28]
or “contrast set” examples (Gardner et al., 2020), for which examples with different labels are collected simultaneously while enforcing that the scenarios are similar. For Utilitarianism, we ensure that some pairs of scenarios are similar by collecting sets of scenarios that have the same first sentence. For Commonsense Morality, Desert, and Virtue Ethics...
-
[29]
In the following prompts, the numerical rankings are not ground truth and are somewhat arbitrary. Nonetheless GPT-3 gets acceptable performance by mimicking such a coarse, underspecified rating function. The prompt is as follows. prompt = "Rate how pleasant each of the following activities is on a scale from 1 (very unpleasant) to 10 (very pleasant).\n\n" ...
-
[30]
Long Commonsense Morality examples were collected from all subreddit posts through the time of collection. Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation Yes, ...
-
[31]
I thought why not, never had one before
Am I in the wrong for getting an AT&T worker fired? About a week or two ago I was able to trade my android phone in for an iphone. I thought why not, never had one before. I was able to get the system, but ultimately, I did not like the iphone and wanted to go back to the android. This isn’t a whole android v...