arxiv: 2008.02275 · v6 · pith:TAW4RDZAnew · submitted 2020-08-05 · 💻 cs.CY · cs.AI · cs.CL · cs.LG

Aligning AI With Shared Human Values

Pith reviewed 2026-05-21 14:34 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CL · cs.LG
keywords AI ethics · language models · moral judgment · benchmark dataset · value alignment · machine ethics · ETHICS dataset · commonsense morality

The pith

A new benchmark dataset shows language models can predict some basic human moral judgments but not all.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that language models can be tested for their grasp of morality by introducing the ETHICS dataset, which covers justice, well-being, duties, virtues, and commonsense morality through text scenarios. Models must link everyday world knowledge to value-based predictions on these scenarios, revealing a partial ability that falls short of full human-like ethical understanding. This assessment matters because it offers a concrete way to measure and potentially improve how AI systems handle value judgments, reducing the risk of outputs that conflict with shared human priorities. If the approach holds, it provides an early step for aligning AI behavior with ethics rather than leaving such alignment to chance or post-hoc fixes.

Core claim

The central claim is that language models possess a promising but incomplete ability to predict basic human ethical judgments, demonstrated through performance on the ETHICS benchmark that spans justice, well-being, duties, virtues, and commonsense morality via diverse text scenarios requiring connections between physical, social, and value knowledge.

What carries the argument

The ETHICS dataset, a collection of text scenarios designed to elicit and test predictions of moral judgments across five ethical domains.

If this is right

  • Current models could already help steer chatbot responses away from unethical content using moral predictions from the benchmark.
  • Open-ended reinforcement learning agents might be regularized by incorporating signals from models trained or evaluated on ETHICS scenarios.
  • Machine ethics research can advance today by iterating on this type of benchmark rather than waiting for more advanced systems.
  • Repeated improvement on such moral prediction tasks offers a measurable path toward AI systems that better match human values in practice.
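The first two implications above can be illustrated with a minimal sketch. It assumes a hypothetical scoring function that returns an estimated probability that a text is morally acceptable, for example from a classifier fine-tuned on ETHICS-style data. The names `moral_score_stub`, `pick_response`, `shaped_reward`, and the penalty weight `lam` are illustrative assumptions and do not come from the paper; the keyword check only stands in for a real model.

```python
from typing import Callable, List


def moral_score_stub(text: str) -> float:
    """Hypothetical stand-in for a classifier trained on ETHICS-style data.

    Returns a value in [0, 1]; higher means "more acceptable". A real scorer
    would be a fine-tuned language model, not a keyword check.
    """
    flagged = ("steal", "lie to", "hurt")
    return 0.1 if any(word in text.lower() for word in flagged) else 0.9


def pick_response(candidates: List[str], score: Callable[[str], float]) -> str:
    """Steer a chatbot by returning the candidate with the highest moral score."""
    return max(candidates, key=score)


def shaped_reward(env_reward: float, action_text: str,
                  score: Callable[[str], float], lam: float = 1.0) -> float:
    """Regularize an RL reward by subtracting a penalty for low moral scores."""
    return env_reward - lam * (1.0 - score(action_text))


if __name__ == "__main__":
    options = ["You could steal it.", "You could ask to borrow it."]
    print(pick_response(options, moral_score_stub))
    print(shaped_reward(1.0, "hurt the guard", moral_score_stub, lam=2.0))
```

Reranking candidates and penalizing rewards are only two of several possible ways to use such a score; how well either works depends entirely on the quality of the underlying scorer.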

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the dataset with scenarios from varied cultural contexts could reveal and reduce hidden biases in current moral predictions.
  • Success on ETHICS might serve as a proxy signal for training larger models to internalize value constraints before deployment.
  • This benchmark approach could be adapted to test alignment in non-language AI systems, such as those handling decisions in physical environments.

Load-bearing premise

The collected human judgments on the dataset's text scenarios accurately reflect shared human values without substantial cultural or annotator bias.

What would settle it

A test showing that models achieve high accuracy on ETHICS but produce outputs that humans consistently rate as unethical in real-world applications would undermine the link between benchmark performance and actual value alignment.

Original abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the ETHICS dataset, a benchmark spanning justice, well-being, duties, virtues, and commonsense morality, to assess language models' ability to predict human moral judgments on text scenarios. It evaluates off-the-shelf models and concludes that they exhibit a promising but incomplete ability to predict basic human ethical judgements, positioning the work as a stepping stone toward AI alignment with human values.

Significance. If the dataset construction and labels hold as a valid proxy for shared human values, the benchmark enables concrete progress on machine ethics by providing an evaluation framework that connects world knowledge to value judgments. The authors receive credit for releasing a new labeled dataset and for the reproducible evaluation of multiple models on it, which supports falsifiable claims about current model capabilities.

major comments (2)
  1. [Section 3] Section 3 (ETHICS Dataset): The central claim that models predict 'basic human ethical judgements' depends on the MTurk-collected labels serving as a proxy for shared values across justice, well-being, duties, virtues, and commonsense morality. The manuscript provides no inter-annotator agreement statistics, no demographic breakdown of annotators, and no cross-cultural validation, leaving open the possibility that labels primarily reflect US-centric or English-speaking norms rather than broadly shared human values.
  2. [Section 4] Section 4 (Experiments): The abstract and results describe model performance as 'promising but incomplete' without reporting exact accuracies, human baselines, or full details on how scenarios were constructed and filtered. This makes it difficult to evaluate whether the measured performance genuinely supports the alignment implications or rests on unverified construction choices.
minor comments (1)
  1. [Abstract] The abstract could more explicitly state the number of scenarios per category and the precise model accuracies to allow readers to assess the 'promising but incomplete' characterization without consulting the full results section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation and strengthen the claims regarding the ETHICS benchmark. We address each major comment below.

Point-by-point responses
  1. Referee: [Section 3] Section 3 (ETHICS Dataset): The central claim that models predict 'basic human ethical judgements' depends on the MTurk-collected labels serving as a proxy for shared values across justice, well-being, duties, virtues, and commonsense morality. The manuscript provides no inter-annotator agreement statistics, no demographic breakdown of annotators, and no cross-cultural validation, leaving open the possibility that labels primarily reflect US-centric or English-speaking norms rather than broadly shared human values.

    Authors: We agree that inter-annotator agreement statistics and annotator demographics would improve transparency. In the revised manuscript we will add these details from our collection process, including agreement rates and a summary noting that annotators were primarily US-based English speakers. On cross-cultural validation, the current dataset focuses on establishing an initial English-language benchmark for these moral concepts; we will explicitly discuss this scope limitation and its implications for interpreting the labels as broadly shared human values, while suggesting cross-cultural extensions as future work. revision: partial

  2. Referee: [Section 4] Section 4 (Experiments): The abstract and results describe model performance as 'promising but incomplete' without reporting exact accuracies, human baselines, or full details on how scenarios were constructed and filtered. This makes it difficult to evaluate whether the measured performance genuinely supports the alignment implications or rests on unverified construction choices.

    Authors: We will revise Section 4 to report exact model accuracies, include a human baseline performance comparison, and expand the description of scenario construction and filtering criteria already outlined in Section 3. These changes will make the results more concrete and directly support evaluation of the alignment implications. revision: yes
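The agreement rates promised in the first response could be computed along the following lines. This is a minimal sketch on toy votes, not the authors' procedure: the function names, the toy data, and the default `min_agree=4` (chosen to mirror a 4-of-5 consensus rule) are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_vote_count(votes: Sequence[int]) -> int:
    """Number of annotators who chose the most common label for one item."""
    return Counter(votes).most_common(1)[0][1]


def agreement_summary(all_votes: List[Sequence[int]],
                      min_agree: int = 4) -> Tuple[float, float]:
    """Return (mean per-item agreement rate, share of items meeting min_agree)."""
    if not all_votes:
        raise ValueError("need at least one annotated item")
    rates = [top_vote_count(v) / len(v) for v in all_votes]
    kept = sum(top_vote_count(v) >= min_agree for v in all_votes)
    return sum(rates) / len(rates), kept / len(all_votes)


if __name__ == "__main__":
    # Toy data: four items, five binary votes each (hypothetical).
    toy_votes = [
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0],
        [1, 0, 1, 0, 0],
        [0, 0, 0, 0, 1],
    ]
    mean_rate, share_kept = agreement_summary(toy_votes)
    print(f"mean agreement: {mean_rate:.2f}, share kept: {share_kept:.2f}")
```

Raw agreement rates do not correct for chance agreement; a chance-corrected statistic would be a natural complement, but which statistic the authors would report is not stated.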

Circularity Check

0 steps flagged

No circularity: ETHICS introduces external human labels and evaluates off-the-shelf models

Full rationale

The paper constructs the ETHICS dataset via crowdsourced scenarios and MTurk annotations for justice, well-being, duties, virtues, and commonsense morality, then measures how well existing language models predict the resulting majority-vote labels. No equations, fitted parameters, or self-citation chains reduce the reported accuracies to quantities defined by the authors themselves; the performance numbers are direct empirical comparisons against independently collected human judgments. The central claim therefore rests on external data rather than on any definitional or self-referential reduction.
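The evaluation described here, comparing model predictions against majority-vote human labels, can be sketched as follows. The votes and predictions are toy values invented for illustration, and the helper names are hypothetical; the sketch assumes binary labels and an odd number of annotators so that ties cannot occur.

```python
from collections import Counter
from typing import Sequence


def majority_label(votes: Sequence[int]) -> int:
    """Most common label among annotator votes (assumes no ties)."""
    return Counter(votes).most_common(1)[0][0]


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Toy data: three scenarios, five binary votes each (hypothetical).
    annotator_votes = [
        [1, 1, 0, 1, 1],
        [0, 0, 0, 1, 0],
        [1, 0, 1, 1, 1],
    ]
    gold = [majority_label(v) for v in annotator_votes]
    model_predictions = [1, 1, 1]
    print(gold, accuracy(model_predictions, gold))
```

Because the reference labels come from annotators rather than from the model under test, the accuracy is an external comparison, which is the point the circularity check makes.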

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that text scenarios can proxy shared moral values; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Moral judgments on diverse text scenarios can serve as a valid proxy for assessing a model's knowledge of shared human values in justice, well-being, duties, virtues, and commonsense morality.
    This premise underpins the entire benchmark and the interpretation of model predictions.

pith-pipeline@v0.9.0 · 5658 in / 1106 out tokens · 56416 ms · 2026-05-21T14:34:20.188161+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

    cs.AI 2026-05 unverdicted novelty 7.0

    LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

  2. Latent Space Probing for Adult Content Detection in Video Generative Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

  3. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  4. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  5. Evaluating Multi-turn Human-AI Interaction

    cs.HC 2026-05 unverdicted novelty 6.0

    Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.

  6. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  7. AlignCultura: Towards Culturally Aligned Large Language Models?

    cs.CL 2026-04 unverdicted novelty 6.0

    Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

  8. MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment

    cs.CY 2026-04 unverdicted novelty 6.0

    MANTA is a new multi-turn dynamic benchmark that stress-tests frontier LLMs on animal welfare alignment by generating targeted adversarial follow-ups and scoring across 13 dimensions, with preliminary results showing ...

  9. Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Eight AI models show split value priorities at the top layer, divergent evidence preferences in the middle, and broad convergence on institutional sources at the bottom, with substantial sensitivity to scenario framing.

  10. EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    EdgeRazor delivers 1.58-1.88 bit quantized LLMs that outperform 2-3 bit baselines by up to 11.3 points while using 4-10x less training compute than leading QAT methods.

  11. Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities

    cs.CL 2026-04 unverdicted novelty 6.0

    Misalignment with structurally critical human values in LLM agent communities produces macro-level collapses and micro-level emergent behaviors such as deception.

  12. Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

    cs.CL 2025-09 unverdicted novelty 6.0

    PAS automates activation steering for LLMs using labeled data to improve behavior control on tasks like bias and alignment, with gains over ICL and SFT but limited effect on intelligence tasks.

  13. A Roadmap to Pluralistic Alignment

    cs.AI 2024-02 unverdicted novelty 6.0

    The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

  14. Ethical and social risks of harm from Language Models

    cs.CL 2021-12 accept novelty 6.0

    The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...

  15. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  16. Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security

    cs.CY 2026-05 unverdicted novelty 5.0

    A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.

  17. REBAR: Reference Ethical Benchmark for Autonomy Readiness

    cs.RO 2026-05 unverdicted novelty 5.0

    REBAR is a new test framework that turns ethical scenario difficulty into computable Autonomy Readiness Level scores using LLM-based analysis and simulation for autonomous systems.

  18. Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

    cs.CY 2026-04 unverdicted novelty 5.0

    AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.

  19. Do Emotions Influence Moral Judgment in Large Language Models?

    cs.CL 2026-04 unverdicted novelty 5.0

    Inducing emotions shifts LLM moral judgments in a valence-dependent manner that reverses decisions in up to 20% of cases and does not appear in humans.

  20. EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

    cs.LG 2026-04 unverdicted novelty 5.0

    EdgeRazor uses structural mixed-precision quantization, layer-adaptive feature distillation, and entropy-aware KL divergence to achieve 1.88-bit LLMs that outperform prior 2-bit and 3-bit baselines with 4-10x lower tr...

  21. Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

    cs.AI 2023-08 accept novelty 5.0

    Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

  22. "I Don't Know" -- Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation

    cs.IR 2026-05 unverdicted novelty 4.0

    CERTA adds relevance-based certainty estimation to RAG so LLMs can better signal uncertainty on non-objective questions, reducing overconfidence.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 21 Pith papers · 13 internal anchors

  1. [1]

    Longformer: The Long-Document Transformer

    I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. ArXiv, abs/2004.05150,

  2. [2]

    T. B. Brown, B. P. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krüger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. J. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D...

  3. [3]

    The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning

    S. Corbett-Davies and S. Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. ArXiv, abs/1808.00023,

  4. [4]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805,

  5. [5]

    Fairness Through Awareness

    C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. S. Zemel. Fairness through awareness. ArXiv, abs/1104.3913,

  6. [6]

    Evaluating NLP Models via Contrast Sets

    M. Gardner, Y. Artzi, V. Basmova, J. Berant, B. Bogin, S. Chen, P. Dasigi, D. Dua, Y. Elazar, A. Gottumukkala, N. Gupta, H. Hajishirzi, G. Ilharco, D. Khashabi, K. Lin, J. Liu, N. F. Liu, P. Mulcaire, Q. Ning, S. Singh, N. A. Smith, S. Subramanian, R. Tsarfaty, E. Wallace, A. Q. Zhang, and B. Zhou. Evaluating NLP models via contrast sets. ArXiv, abs/20...

  7. [7]

    Datasheets for Datasets

    T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010,

  8. [8]

    Online Learning with an Unknown Fairness Metric

    10 Published as a conference paper at ICLR 2021 S. Gillen, C. Jung, M. Kearns, and A. Roth. Online learning with an unknown fairness metric. In NeurIPS,

  9. [9]

    I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572,

  10. [10]

    The Moral Emotions

    J. Haidt et al. The moral emotions. Handbook of affective sciences, 11(2003):852–870,

  11. [11]

    Interactive Fiction Games: A Colossal Adventure

    M. Hausknecht, P. Ammanabrolu, C. Marc-Alexandre, and Y. Xingdi. Interactive fiction games: A colossal adventure. CoRR, abs/1909.05398,

  12. [12]

    Gaussian Error Linear Units (GELUs)

    D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). arXiv preprint 1606.08415,

  13. [13]

    F. Hill, S. Mokra, N. Wong, and T. Harley. Human instruction-following with deep reinforcement learning via transfer-learning from text. ArXiv, abs/2005.09382,

  14. [14]

    Bag of Tricks for Efficient Text Classification

    A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. ArXiv, abs/1607.01759,

  15. [15]

    Learning the Difference That Makes a Difference with Counterfactually-Augmented Data

    D. Kaushik, E. H. Hovy, and Z. C. Lipton. Learning the difference that makes a difference with counterfactually-augmented data. ArXiv, abs/1909.12434,

  16. [16]

    Reformer: The Efficient Transformer

    N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. ArXiv, abs/2001.04451,

  17. [17]

    J. M. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. ArXiv, abs/1609.05807,

  18. [18]

    GeDi: Generative Discriminator Guided Sequence Generation

    B. Krause, A. D. Gotmare, B. McCann, N. Keskar, S. R. Joty, R. Socher, and N. F. Rajani. Gedi: Generative discriminator guided sequence generation. ArXiv, abs/2009.06367,

  19. [19]

    Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942,

  20. [20]

    Z. C. Lipton and J. Steinhardt. Troubling trends in machine learning scholarship. ACM Queue, 17:80,

  21. [21]

    Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692,

  22. [22]

    Recipes for Building an Open-Domain Chatbot

    S. Roller, E. Dinan, N. Goyal, D. Y. Ju, M. F. Williamson, Y. Liu, J. Xu, M. Ott, K. Shuster, E. M. Smith, Y.-L. Boureau, and J. Weston. Recipes for building an open-domain chatbot. ArXiv, abs/2004.13637,

  23. [23]

    D. Tang, B. Qin, and T. Liu. Learning semantic representations of users and products for document level sentiment classification. In ACL,

  24. [24]

    Towards Universal Paraphrastic Sentence Embeddings

    J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. Towards universal paraphrastic sentence embeddings. CoRR, abs/1511.08198,

  25. [25]

    T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771,

  26. [26]

    For Justice, Deontology, and Commonsense Morality, we had 5 MTurkers relabel each example, and we kept examples for which at least 4 out of the 5 agreed

    A Cleaning Details. A.1 Consensus. After collecting examples through MTurk, we had separate MTurkers relabel those examples. For Justice, Deontology, and Commonsense Morality, we had 5 MTurkers relabel each example, and we kept examples for which at least 4 out of the 5 agreed. For each scenario in Virtue Et...

  27. [27]

    counterfactual augmentations

    A.2 Long Commonsense Morality. We collected long Commonsense Morality examples from the AITA subreddit. We removed highly sexual or politicized examples and excluded any examples that were edited from the Test and Test Hard sets to avoid any giveaway information. To count votes, for each comment with a clear judgement about whether the poster was in the w...

  28. [28]

    contrast set

    or “contrast set” examples (Gardner et al., 2020), for which examples with different labels are collected simultaneously while enforcing that the scenarios are similar. For Utilitarianism, we ensure that some pairs of scenarios are similar by collecting sets of scenarios that have the same first sentence. For Commonsense Morality, Desert, and Virtue Ethics...

  29. [29]

    Rate how pleasant each of the following activities is on a scale from 1 (very unpleasant) to 10 (very pleasant).\n\n

    In the following prompts, the numerical rankings are not ground truth and are somewhat arbitrary. Nonetheless GPT-3 gets acceptable performance by mimicking such a coarse, underspecified rating function. The prompt is as follows. prompt = "Rate how pleasant each of the following activities is on a scale from 1 (very unpleasant) to 10 (very pleasant).\n\n" ...

  30. [30]

    raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw

    Long Commonsense Morality examples were collected from all subreddit posts through the time of collection. Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation Yes, ...

  31. [31]

    I thought why not, never had one before

    Am I in the wrong for getting an AT&T worker fired? About a week or two ago I was able to trade my android phone in for an iphone. I thought why not, never had one before. I was able to get the system, but ultimately, I did not like the iphone and wanted to go back to the android. This isn’t a whole android v...