Aligning AI With Shared Human Values

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song · 2020 · cs.CY · arXiv 2008.02275

26 Pith papers cite this work. Polarity classification is still indexing.

abstract

We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgements, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgements. Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.

citation-role summary

background: 1 · method: 1 · other: 1

citation-polarity summary

background: 1 · unclear: 1 · use method: 1

representative citing papers

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

Latent Space Probing for Adult Content Detection in Video Generative Models

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

Stay Focused: Problem Drift in Multi-Agent Debate

cs.CL · 2025-02-26 · unverdicted · novelty 7.0

The paper defines and measures 'problem drift' in multi-agent LLM debates across tasks and proposes DRIFTJudge and DRIFTPolicy as baselines to detect and reduce it.

Scaling and evaluating sparse autoencoders

cs.LG · 2024-06-06 · unverdicted · novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Auditing LLM-Governed Social Robots with Culture-Specific Moral Gradients

cs.RO · 2026-06-02 · unverdicted · novelty 6.0

Introduces a gradient-based multilingual audit framework for LLM moral decisions in robot assistance scenarios and reports persistent culturally asymmetric gradient tracking failures not fixed by prompting.

Naturalistic measure of social norms alignment

cs.CL · 2026-05-22 · unverdicted · novelty 6.0

Proposes solution matching metrics (stated and explicit agreement accuracy) and a 3k Danish dilemma dataset to evaluate social norms alignment between LLMs and humans in naturalistic settings.

Evaluating Multi-turn Human-AI Interaction

cs.HC · 2026-05-18 · unverdicted · novelty 6.0

Introduces the TCR framework to evaluate educational LLM assistants on transparency, consistency, and refinement in multi-turn interactions, complementing aggregate metrics.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

AlignCultura: Towards Culturally Aligned Large Language Models?

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment

cs.CY · 2026-04-18 · unverdicted · novelty 6.0

MANTA is a new multi-turn dynamic benchmark that stress-tests frontier LLMs on animal welfare alignment by generating targeted adversarial follow-ups and scoring across 13 dimensions, with preliminary results showing variance in later turns and format bias in LLM judges.

Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

Eight AI models show split value priorities at the top layer, divergent evidence preferences in the middle, and broad convergence on institutional sources at the bottom, with substantial sensitivity to scenario framing.

Human Values Matter: Investigating How Misalignment Shapes Collective Behaviors in LLM Agent Communities

cs.CL · 2026-04-07 · unverdicted · novelty 6.0

Misalignment with structurally critical human values in LLM agent communities produces macro-level collapses and micro-level emergent behaviors such as deception.

Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

cs.CL · 2025-09-25 · unverdicted · novelty 6.0

PAS automates activation steering for LLMs using labeled data to improve behavior control on tasks like bias and alignment, with gains over ICL and SFT but limited effect on intelligence tasks.

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · novelty 6.0

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Are LLMs Bad at Moral Reasoning?

cs.CY · 2026-06-10 · unverdicted · novelty 5.0

Reanalyzing MoReBench by assigning LLMs the task of generating scoring rubrics shows better calibration to human rubrics and suggests stronger LLM moral reasoning than previously reported.

Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security

cs.CY · 2026-05-20 · unverdicted · novelty 5.0

A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.

REBAR: Reference Ethical Benchmark for Autonomy Readiness

cs.RO · 2026-05-18 · unverdicted · novelty 5.0

REBAR is a new test framework that turns ethical scenario difficulty into computable Autonomy Readiness Level scores using LLM-based analysis and simulation for autonomous systems.

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

cs.CY · 2026-04-22 · unverdicted · novelty 5.0

AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.

Do Emotions Influence Moral Judgment in Large Language Models?

cs.CL · 2026-04-21 · unverdicted · novelty 5.0

Inducing emotions shifts LLM moral judgments in a valence-dependent manner that reverses decisions in up to 20% of cases and does not appear in humans.

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

cs.LG · 2026-04-10 · unverdicted · novelty 5.0 · 2 refs

EdgeRazor uses structural mixed-precision quantization, layer-adaptive feature distillation, and entropy-aware KL divergence to achieve 1.88-bit LLMs that outperform prior 2-bit and 3-bit baselines with 4-10x lower training budget.

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

cs.AI · 2023-08-10 · accept · novelty 5.0

Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

citing papers explorer

Showing 4 of 4 citing papers after filters.

MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment

cs.CY · 2026-04-18 · unverdicted · none · ref 3 · internal anchor

MANTA is a new multi-turn dynamic benchmark that stress-tests frontier LLMs on animal welfare alignment by generating targeted adversarial follow-ups and scoring across 13 dimensions, with preliminary results showing variance in later turns and format bias in LLM judges.

Are LLMs Bad at Moral Reasoning?

cs.CY · 2026-06-10 · unverdicted · none · ref 11 · internal anchor

Reanalyzing MoReBench by assigning LLMs the task of generating scoring rubrics shows better calibration to human rubrics and suggests stronger LLM moral reasoning than previously reported.

Backchaining Loss of Control Mitigations from Mission-Specific Benchmarks in National Security

cs.CY · 2026-05-20 · unverdicted · none · ref 5 · internal anchor

A methodology to derive targeted Loss of Control mitigations by backchaining from AI errors on national security benchmarks to specific affordances and permissions.

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

cs.CY · 2026-04-22 · unverdicted · none · ref 41 · internal anchor

AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.
