Aligning AI With Shared Human Values

Andrew Critch; Collin Burns; Dan Hendrycks; Dawn Song; Jacob Steinhardt; Jerry Li; Steven Basart

arxiv: 2008.02275 · v6 · pith:TAW4RDZAnew · submitted 2020-08-05 · 💻 cs.CY · cs.AI · cs.CL · cs.LG

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt

Pith reviewed 2026-05-21 14:34 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CL · cs.LG

keywords AI ethics, language models, moral judgment, benchmark dataset, value alignment, machine ethics, ETHICS dataset, commonsense morality

The pith

A new benchmark dataset shows language models can predict some basic human moral judgments but not all.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that language models can be tested for their grasp of morality by introducing the ETHICS dataset, which covers justice, well-being, duties, virtues, and commonsense morality through text scenarios. Models must link everyday world knowledge to value-based predictions on these scenarios, revealing a partial ability that falls short of full human-like ethical understanding. This assessment matters because it offers a concrete way to measure and potentially improve how AI systems handle value judgments, reducing the risk of outputs that conflict with shared human priorities. If the approach holds, it provides an early step for aligning AI behavior with ethics rather than leaving such alignment to chance or post-hoc fixes.

Core claim

The central claim is that language models possess a promising but incomplete ability to predict basic human ethical judgments, demonstrated through performance on the ETHICS benchmark that spans justice, well-being, duties, virtues, and commonsense morality via diverse text scenarios requiring connections between physical, social, and value knowledge.

What carries the argument

The ETHICS dataset, a collection of text scenarios designed to elicit and test predictions of moral judgments across five ethical domains.

If this is right

Current models could already help steer chatbot responses away from unethical content using moral predictions from the benchmark.
Open-ended reinforcement learning agents might be regularized by incorporating signals from models trained or evaluated on ETHICS scenarios.
Machine ethics research can advance today by iterating on this type of benchmark rather than waiting for more advanced systems.
Repeated improvement on such moral prediction tasks offers a measurable path toward AI systems that better match human values in practice.
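
As a rough illustration of the first point, a classifier trained on ETHICS-style labels could score candidate chatbot replies, and the system could prefer the least objectionable one. The sketch below is hypothetical and is not taken from the paper: `toy_unacceptability_score` is a keyword heuristic standing in for such a classifier so that the example runs, and `pick_reply` and the 0.5 threshold are illustrative assumptions.

```python
from typing import Callable, List, Optional


def toy_unacceptability_score(text: str) -> float:
    """Hypothetical stand-in for a model trained on ETHICS-style labels.

    Returns a score in [0, 1]; higher means more likely to be judged
    unacceptable. A real system would call a trained classifier here.
    """
    flagged = ("steal", "lie to", "hurt")
    hits = sum(phrase in text.lower() for phrase in flagged)
    return min(1.0, hits / 2)


def pick_reply(
    candidates: List[str],
    scorer: Callable[[str], float] = toy_unacceptability_score,
    threshold: float = 0.5,  # assumed cutoff, not specified by the paper
) -> Optional[str]:
    """Return the lowest-scoring candidate below the threshold, or None."""
    scored = [(scorer(text), text) for text in candidates]
    allowed = [pair for pair in scored if pair[0] < threshold]
    if not allowed:
        return None  # every candidate was flagged; the caller must fall back
    return min(allowed, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    replies = [
        "You could steal it and lie to the clerk.",
        "You could ask the store about a payment plan.",
    ]
    print(pick_reply(replies))
```

Returning `None` when every candidate is flagged keeps the fallback decision (refuse, regenerate, escalate) with the caller rather than hiding it inside the filter.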

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending the dataset with scenarios from varied cultural contexts could reveal and reduce hidden biases in current moral predictions.
Success on ETHICS might serve as a proxy signal for training larger models to internalize value constraints before deployment.
This benchmark approach could be adapted to test alignment in non-language AI systems, such as those handling decisions in physical environments.

Load-bearing premise

The collected human judgments on the dataset's text scenarios accurately reflect shared human values without substantial cultural or annotator bias.

What would settle it

A test showing that models achieve high accuracy on ETHICS but produce outputs that humans consistently rate as unethical in real-world applications would undermine the link between benchmark performance and actual value alignment.

Original abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The ETHICS dataset gives a new benchmark for testing language models on moral judgments across several concepts, but the crowdsourced labels may reflect narrow annotator views rather than broadly shared human values.

The ETHICS dataset is the key contribution here. It lets us test how well language models predict moral judgments across scenarios in justice, well-being, duties, virtues, and commonsense morality. Models come out ahead of random guessing but still miss a lot compared to people.

The paper does well by putting together this multi-concept benchmark and showing results on standard models. This kind of dataset wasn't in the earlier work they cite, so it gives the field a new way to quantify ethical knowledge in AI. The reporting is straightforward and notes the incomplete performance without hype.

The soft spot is the data source. Annotations come from crowdsourcing, and the stress-test point about possible US-centric biases from MTurk workers seems to apply. Without evidence of broad cultural sampling or checks for that, the labels may reflect a narrower set of values than the paper's title suggests. This makes the alignment angle a bit weaker than it could be, though the benchmark itself remains usable. No issues with circular reasoning or unfalsifiable claims: the setup is a standard prediction task with clear metrics.

This paper is for alignment researchers and anyone evaluating language models on value-related tasks. It gives something concrete to work with rather than abstract discussion. It deserves peer review. The new dataset is solid enough to warrant referee feedback, especially on improving the validation of the labels.

Referee Report

2 major / 1 minor

Summary. The paper introduces the ETHICS dataset, a benchmark spanning justice, well-being, duties, virtues, and commonsense morality, to assess language models' ability to predict human moral judgments on text scenarios. It evaluates off-the-shelf models and concludes that they exhibit a promising but incomplete ability to predict basic human ethical judgements, positioning the work as a stepping stone toward AI alignment with human values.

Significance. If the dataset construction and labels hold as a valid proxy for shared human values, the benchmark enables concrete progress on machine ethics by providing an evaluation framework that connects world knowledge to value judgments. The authors receive credit for releasing a new labeled dataset and for the reproducible evaluation of multiple models on it, which supports falsifiable claims about current model capabilities.

major comments (2)

[Section 3] Section 3 (ETHICS Dataset): The central claim that models predict 'basic human ethical judgements' depends on the MTurk-collected labels serving as a proxy for shared values across justice, well-being, duties, virtues, and commonsense morality. The manuscript provides no inter-annotator agreement statistics, no demographic breakdown of annotators, and no cross-cultural validation, leaving open the possibility that labels primarily reflect US-centric or English-speaking norms rather than broadly shared human values.
[Section 4] Section 4 (Experiments): The abstract and results describe model performance as 'promising but incomplete' without reporting exact accuracies, human baselines, or full details on how scenarios were constructed and filtered. This makes it difficult to evaluate whether the measured performance genuinely supports the alignment implications or rests on unverified construction choices.
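
The agreement statistic requested in the first major comment could be as simple as mean pairwise agreement per example. The following is a minimal sketch with made-up labels, not the authors' procedure; the function names are illustrative, and a chance-corrected statistic such as Fleiss' kappa would be the more standard choice for a published report.

```python
from itertools import combinations
from typing import Sequence


def pairwise_agreement(labels: Sequence[int]) -> float:
    """Fraction of annotator pairs that assigned the same label to one example."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("need at least two annotations per example")
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_pairwise_agreement(annotations: Sequence[Sequence[int]]) -> float:
    """Average the per-example pairwise agreement over the whole dataset."""
    if not annotations:
        raise ValueError("need at least one annotated example")
    return sum(pairwise_agreement(row) for row in annotations) / len(annotations)


if __name__ == "__main__":
    # Hypothetical data: each row holds five annotators' binary labels.
    toy_annotations = [
        [1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1],
    ]
    print(mean_pairwise_agreement(toy_annotations))
```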

minor comments (1)

[Abstract] The abstract could more explicitly state the number of scenarios per category and the precise model accuracies to allow readers to assess the 'promising but incomplete' characterization without consulting the full results section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation and strengthen the claims regarding the ETHICS benchmark. We address each major comment below.

Referee: [Section 3] Section 3 (ETHICS Dataset): The central claim that models predict 'basic human ethical judgements' depends on the MTurk-collected labels serving as a proxy for shared values across justice, well-being, duties, virtues, and commonsense morality. The manuscript provides no inter-annotator agreement statistics, no demographic breakdown of annotators, and no cross-cultural validation, leaving open the possibility that labels primarily reflect US-centric or English-speaking norms rather than broadly shared human values.

Authors: We agree that inter-annotator agreement statistics and annotator demographics would improve transparency. In the revised manuscript we will add these details from our collection process, including agreement rates and a summary noting that annotators were primarily US-based English speakers. On cross-cultural validation, the current dataset focuses on establishing an initial English-language benchmark for these moral concepts; we will explicitly discuss this scope limitation and its implications for interpreting the labels as broadly shared human values, while suggesting cross-cultural extensions as future work.
revision: partial

Referee: [Section 4] Section 4 (Experiments): The abstract and results describe model performance as 'promising but incomplete' without reporting exact accuracies, human baselines, or full details on how scenarios were constructed and filtered. This makes it difficult to evaluate whether the measured performance genuinely supports the alignment implications or rests on unverified construction choices.

Authors: We will revise Section 4 to report exact model accuracies, include a human baseline performance comparison, and expand the description of scenario construction and filtering criteria already outlined in Section 3. These changes will make the results more concrete and directly support evaluation of the alignment implications.
revision: yes

Circularity Check

0 steps flagged

No circularity: ETHICS introduces external human labels and evaluates off-the-shelf models

The paper constructs the ETHICS dataset via crowdsourced scenarios and MTurk annotations for justice, well-being, duties, virtues, and commonsense morality, then measures how well existing language models predict the resulting majority-vote labels. No equations, fitted parameters, or self-citation chains reduce the reported accuracies to quantities defined by the authors themselves; the performance numbers are direct empirical comparisons against independently collected human judgments. The central claim therefore rests on external data rather than on any definitional or self-referential reduction.
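
The evaluation described above can be sketched in a few lines: derive one label per scenario from the annotators' votes, drop scenarios without enough agreement, and score model predictions against what remains. This is a minimal illustration, not the authors' code. The 4-of-5 threshold follows the cleaning details quoted in the reference graph for Justice, Deontology, and Commonsense Morality; the function names and toy data are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(votes: Sequence[int], min_agree: int = 4) -> Optional[int]:
    """Return the most common label if at least `min_agree` annotators chose it."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the human-derived labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


if __name__ == "__main__":
    # Hypothetical relabeling votes (five annotators per scenario).
    relabels = [[1, 1, 1, 1, 0], [0, 0, 1, 1, 0], [0, 0, 0, 0, 0]]
    model_predictions = [1, 0, 1]

    labels = [consensus_label(votes) for votes in relabels]
    kept = [i for i, label in enumerate(labels) if label is not None]

    # Score only the scenarios that survived the consensus filter.
    print(accuracy([model_predictions[i] for i in kept], [labels[i] for i in kept]))
```

Because the labels come from annotators rather than from the model under test, the accuracy is a comparison against external data, which is the point the circularity check makes.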

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that text scenarios can proxy shared moral values; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: Moral judgments on diverse text scenarios can serve as a valid proxy for assessing a model's knowledge of shared human values in justice, well-being, duties, virtues, and commonsense morality.
This premise underpins the entire benchmark and the interpretation of model predictions.

pith-pipeline@v0.9.0 · 5658 in / 1106 out tokens · 56416 ms · 2026-05-21T14:34:20.188161+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
cs.AI · 2026-05 · unverdicted · novelty 7.0
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

Latent Space Probing for Adult Content Detection in Video Generative Models
cs.CV · 2026-04 · unverdicted · novelty 7.0
Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

Scaling and evaluating sparse autoencoders
cs.LG · 2024-06 · unverdicted · novelty 7.0
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

OPT: Open Pre-trained Transformer Language Models
cs.CL · 2022-05 · unverdicted · novelty 7.0
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Evaluating Multi-turn Human-AI Interaction
cs.HC · 2026-05 · unverdicted · novelty 6.0
Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG · 2026-05 · unverdicted · novelty 6.0
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

AlignCultura: Towards Culturally Aligned Large Language Models?
cs.CL · 2026-04 · unverdicted · novelty 6.0
Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment
cs.CY · 2026-04 · unverdicted · novelty 6.0
MANTA is a new multi-turn dynamic benchmark that stress-tests frontier LLMs on animal welfare alignment by generating targeted adversarial follow-ups and scoring across 13 dimensions, with preliminary results showing ...

Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models
cs.AI · 2026-04 · unverdicted · novelty 6.0
Eight AI models show split value priorities at the top layer, divergent evidence preferences in the middle, and broad convergence on institutional sources at the bottom, with substantial sensitivity to scenario framing.

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
cs.LG · 2026-04 · unverdicted · novelty 6.0
EdgeRazor delivers 1.58-1.88 bit quantized LLMs that outperform 2-3 bit baselines by up to 11.3 points while using 4-10x less training compute than leading QAT methods.

Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities
cs.CL · 2026-04 · unverdicted · novelty 6.0
Misalignment with structurally critical human values in LLM agent communities produces macro-level collapses and micro-level emergent behaviors such as deception.

Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models
cs.CL · 2025-09 · unverdicted · novelty 6.0
PAS automates activation steering for LLMs using labeled data to improve behavior control on tasks like bias and alignment, with gains over ICL and SFT but limited effect on intelligence tasks.

A Roadmap to Pluralistic Alignment
cs.AI · 2024-02 · unverdicted · novelty 6.0
The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Ethical and social risks of harm from Language Models
cs.CL · 2021-12 · accept · novelty 6.0
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...

A General Language Assistant as a Laboratory for Alignment
cs.CL · 2021-12 · conditional · novelty 6.0
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security
cs.CY · 2026-05 · unverdicted · novelty 5.0
A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.

REBAR: Reference Ethical Benchmark for Autonomy Readiness
cs.RO · 2026-05 · unverdicted · novelty 5.0
REBAR is a new test framework that turns ethical scenario difficulty into computable Autonomy Readiness Level scores using LLM-based analysis and simulation for autonomous systems.

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem
cs.CY · 2026-04 · unverdicted · novelty 5.0
AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes (objectives, information, and principals), making it inherently context-dependent and unsolvable by technical design alone.

Do Emotions Influence Moral Judgment in Large Language Models?
cs.CL · 2026-04 · unverdicted · novelty 5.0
Inducing emotions shifts LLM moral judgments in a valence-dependent manner that reverses decisions in up to 20% of cases and does not appear in humans.

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation
cs.LG · 2026-04 · unverdicted · novelty 5.0
EdgeRazor uses structural mixed-precision quantization, layer-adaptive feature distillation, and entropy-aware KL divergence to achieve 1.88-bit LLMs that outperform prior 2-bit and 3-bit baselines with 4-10x lower tr...

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
cs.AI · 2023-08 · accept · novelty 5.0
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

"I Don't Know" -- Towards Appropriate Trust with Certainty-Aware Retrieval Augmented Generation
cs.IR · 2026-05 · unverdicted · novelty 4.0
CERTA adds relevance-based certainty estimation to RAG so LLMs can better signal uncertainty on non-objective questions, reducing overconfidence.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 21 Pith papers · 13 internal anchors

Published as a conference paper at ICLR 2021.

[1] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. ArXiv, abs/2004.05150.

[2] T. B. Brown, B. P. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krüger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. J. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D...

[3] S. Corbett-Davies and S. Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. ArXiv, abs/1808.00023.

[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805.

[5] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. S. Zemel. Fairness through awareness. ArXiv, abs/1104.3913.

[6] M. Gardner, Y. Artzi, V. Basmova, J. Berant, B. Bogin, S. Chen, P. Dasigi, D. Dua, Y. Elazar, A. Gottumukkala, N. Gupta, H. Hajishirzi, G. Ilharco, D. Khashabi, K. Lin, J. Liu, N. F. Liu, P. Mulcaire, Q. Ning, S. Singh, N. A. Smith, S. Subramanian, R. Tsarfaty, E. Wallace, A. Q. Zhang, and B. Zhou. Evaluating NLP models via contrast sets. ArXiv, abs/20...

[7] T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010.

[8] S. Gillen, C. Jung, M. Kearns, and A. Roth. Online learning with an unknown fairness metric. In NeurIPS.

[9] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

[10] J. Haidt et al. The moral emotions. Handbook of affective sciences, 11(2003):852–870.

[11] M. Hausknecht, P. Ammanabrolu, C. Marc-Alexandre, and Y. Xingdi. Interactive fiction games: A colossal adventure. CoRR, abs/1909.05398.

[12] D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). arXiv preprint 1606.08415.

[13] F. Hill, S. Mokra, N. Wong, and T. Harley. Human instruction-following with deep reinforcement learning via transfer-learning from text. ArXiv, abs/2005.09382.

[14] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. ArXiv, abs/1607.01759.

[15] D. Kaushik, E. H. Hovy, and Z. C. Lipton. Learning the difference that makes a difference with counterfactually-augmented data. ArXiv, abs/1909.12434.

[16] N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. ArXiv, abs/2001.04451.

[17] J. M. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. ArXiv, abs/1609.05807.

[18] B. Krause, A. D. Gotmare, B. McCann, N. Keskar, S. R. Joty, R. Socher, and N. F. Rajani. GeDi: Generative discriminator guided sequence generation. ArXiv, abs/2009.06367.

[19] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942.

[20] Z. C. Lipton and J. Steinhardt. Troubling trends in machine learning scholarship. ACM Queue, 17:80.

[21] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692.

[22] S. Roller, E. Dinan, N. Goyal, D. Y. Ju, M. F. Williamson, Y. Liu, J. Xu, M. Ott, K. Shuster, E. M. Smith, Y.-L. Boureau, and J. Weston. Recipes for building an open-domain chatbot. ArXiv, abs/2004.13637.

[23] D. Tang, B. Qin, and T. Liu. Learning semantic representations of users and products for document level sentiment classification. In ACL.

[24] J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. Towards universal paraphrastic sentence embeddings. CoRR, abs/1511.08198.

[25] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew. HuggingFace’s Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

[26] (Appendix excerpt) A Cleaning Details, A.1 Consensus: After collecting examples through MTurk, we had separate MTurkers relabel those examples. For Justice, Deontology, and Commonsense Morality, we had 5 MTurkers relabel each example, and we kept examples for which at least 4 out of the 5 agreed. For each scenario in Virtue Et...

[27] (Appendix excerpt) A.2 Long Commonsense Morality: We collected long Commonsense Morality examples from the AITA subreddit. We removed highly sexual or politicized examples and excluded any examples that were edited from the Test and Test Hard sets to avoid any giveaway information. To count votes, for each comment with a clear judgement about whether the poster was in the w...

[28] (Appendix excerpt) ... or “contrast set” examples (Gardner et al., 2020), for which examples with different labels are collected simultaneously while enforcing that the scenarios are similar. For Utilitarianism, we ensure that some pairs of scenarios are similar by collecting sets of scenarios that have the same first sentence. For Commonsense Morality, Desert, and Virtue Ethics...

[29] (Appendix excerpt) In the following prompts, the numerical rankings are not ground truth and are somewhat arbitrary. Nonetheless GPT-3 gets acceptable performance by mimicking such a coarse, underspecified rating function. The prompt is as follows. prompt = "Rate how pleasant each of the following activities is on a scale from 1 (very unpleasant) to 10 (very pleasant).\n\n" ...

[30] (Datasheet excerpt) Long Commonsense Morality examples were collected from all subreddit posts through the time of collection. Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation Yes, ...

[31] (Appendix excerpt) Am I in the wrong for getting an AT&T worker fired? About a week or two ago I was able to trade my android phone in for an iphone. I thought why not, never had one before. I was able to get the system, but ultimately, I did not like the iphone and wanted to go back to the android. This isn’t a whole android v...

[1] [1]

Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. ArXiv, abs/2004.05150,

work page internal anchor Pith review Pith/arXiv arXiv 2004

[2] [2]

T. B. Brown, B. P. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krüger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. J. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D...

work page internal anchor Pith review Pith/arXiv arXiv 2005

[3] [3]

Corbett-Davies and S

S. Corbett-Davies and S. Goel. The measure and mismeasure of fairness: A critical review of fair machine learning. ArXiv, abs/1808.00023,

work page arXiv

[4] [4]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Fairness Through Awareness

C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. S. Zemel. Fairness through awareness. ArXiv, abs/1104.3913,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Gardner, Y

M. Gardner, Y . Artzi, V . Basmova, J. Berant, B. Bogin, S. Chen, P. Dasigi, D. Dua, Y . Elazar, A. Gottumukkala, N. Gupta, H. Hajishirzi, G. Ilharco, D. Khashabi, K. Lin, J. Liu, N. F. Liu, P. Mulcaire, Q. Ning, S. Singh, N. A. Smith, S. Subramanian, R. Tsarfaty, E. Wallace, A. Q. Zhang, and B. Zhou. Evaluating NLP models via contrast sets. ArXiv, abs/20...

work page arXiv 2004

[7] [7]

Gebru, J

T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumeé III, and K. Crawford. Datasheets for datasets. arXiv preprint arXiv:1803.09010,

work page arXiv

[8] [8]

Gillen, C

10 Published as a conference paper at ICLR 2021 S. Gillen, C. Jung, M. Kearns, and A. Roth. Online learning with an unknown fairness metric. In NeurIPS,

work page 2021

[9] [9]

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Haidt et al

J. Haidt et al. The moral emotions. Handbook of affective sciences , 11(2003):852–870,

work page 2003

[11] [11]

Hausknecht, P

M. Hausknecht, P. Ammanabrolu, C. Marc-Alexandre, and Y . Xingdi. Interactive ﬁction games: A colossal adventure. CoRR, abs/1909.05398,

work page arXiv 1909

[12] [12]

Gaussian Error Linear Units (GELUs)

D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). arXiv preprint 1606.08415 ,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

F. Hill, S. Mokra, N. Wong, and T. Harley. Human instruction-following with deep reinforcement learning via transfer-learning from text. ArXiv, abs/2005.09382,

work page arXiv 2005

[14] [14]

Bag of Tricks for Efficient Text Classification

A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efﬁcient text classiﬁcation. ArXiv, abs/1607.01759,

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

Kaushik, E

D. Kaushik, E. H. Hovy, and Z. C. Lipton. Learning the difference that makes a difference with counterfactually-augmented data. ArXiv, abs/1909.12434,

[16]

N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. ArXiv, abs/2001.04451,

[17]

J. M. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. ArXiv, abs/1609.05807,

[18]

B. Krause, A. D. Gotmare, B. McCann, N. Keskar, S. R. Joty, R. Socher, and N. F. Rajani. GeDi: Generative discriminator guided sequence generation. ArXiv, abs/2009.06367,

[19]

Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. ArXiv, abs/1909.11942,

[20]

Z. C. Lipton and J. Steinhardt. Troubling trends in machine learning scholarship. ACM Queue, 17:80,

[21]

Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692,

[22]

S. Roller, E. Dinan, N. Goyal, D. Y. Ju, M. F. Williamson, Y. Liu, J. Xu, M. Ott, K. Shuster, E. M. Smith, Y.-L. Boureau, and J. Weston. Recipes for building an open-domain chatbot. ArXiv, abs/2004.13637,

[23]

D. Tang, B. Qin, and T. Liu. Learning semantic representations of users and products for document level sentiment classification. In ACL,

[24]

J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. Towards universal paraphrastic sentence embeddings. CoRR, abs/1511.08198,

[25]

T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew. HuggingFace’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771,

A CLEANING DETAILS

A.1 CONSENSUS

After collecting examples through MTurk, we had separate MTurkers relabel those examples. For Justice, Deontology, and Commonsense Morality, we had 5 MTurkers relabel each example, and we kept examples for which at least 4 out of the 5 agreed. For each scenario in Virtue Et...
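The 4-out-of-5 agreement rule described for Justice, Deontology, and Commonsense Morality can be illustrated with a minimal sketch. The function names, the `(text, labels)` data layout, and the choice to keep the majority relabel as the final label are assumptions made for illustration; the text above does not specify how the retained label is chosen or how the authors implemented the filter.

```python
from collections import Counter


def consensus_label(labels, min_agree=4):
    """Return the most common label if at least `min_agree` annotators chose it.

    Returns None when agreement is too low (or when there are no labels).
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None


def filter_by_consensus(examples, min_agree=4):
    """Keep only examples whose relabels reach the agreement threshold.

    `examples` is a hypothetical list of (text, labels) pairs, where
    `labels` holds one label per annotator (5 in the setting above).
    """
    kept = []
    for text, labels in examples:
        label = consensus_label(labels, min_agree)
        if label is not None:
            kept.append((text, label))
    return kept


if __name__ == "__main__":
    # Made-up placeholder data, not taken from the dataset.
    demo = [
        ("scenario A", [1, 1, 1, 1, 0]),  # 4 of 5 agree: kept
        ("scenario B", [1, 0, 1, 0, 1]),  # 3 of 5 agree: dropped
    ]
    print(filter_by_consensus(demo))
```

With five annotators and a threshold of four, at most one label can reach the threshold, so ties do not arise in this setting.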


counterfactual augmentations

A.2 LONG COMMONSENSE MORALITY

We collected long Commonsense Morality examples from the AITA subreddit. We removed highly sexual or politicized examples and excluded any examples that were edited from the Test and Test Hard sets to avoid any giveaway information. To count votes, for each comment with a clear judgement about whether the poster was in the w...


or “contrast set” examples (Gardner et al., 2020), for which examples with different labels are collected simultaneously while enforcing that the scenarios are similar. For Utilitarianism, we ensure that some pairs of scenarios are similar by collecting sets of scenarios that have the same first sentence. For Commonsense Morality, Desert, and Virtue Ethics...


In the following prompts, the numerical rankings are not ground truth and are somewhat arbitrary. Nonetheless, GPT-3 achieves acceptable performance by mimicking such a coarse, underspecified rating function. The prompt is as follows.

prompt = "Rate how pleasant each of the following activities is on a scale from 1 (very unpleasant) to 10 (very pleasant).\n\n" ...


raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw

Long Commonsense Morality examples were collected from all subreddit posts through the time of collection.

Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.

Yes, ...


Am I in the wrong for getting an AT&T worker fired?

About a week or two ago I was able to trade my android phone in for an iphone. I thought why not, never had one before. I was able to get the system, but ultimately, I did not like the iphone and wanted to go back to the android. This isn’t a whole android v...
