Inverse reward design

Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan · 2020 · arXiv 1711.02827

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary · background 1

citation-polarity summary · background 1

representative citing papers

Discovering Agentic Safety Specifications from 1-Bit Danger Signals

cs.AI · 2026-04-25 · unverdicted · novelty 7.0

LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.

The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers

cs.CY · 2026-04-27 · unverdicted · novelty 6.0 · 2 refs

Moral judgments become more deontological when human design of AI is visible, and designers are judged more strictly than the AI or unaided humans, creating plural and non-converging targets for value alignment.

Failure Modes of Maximum Entropy RLHF

cs.LG · 2025-09-24 · unverdicted · novelty 5.0

Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.

citing papers explorer

Discovering Agentic Safety Specifications from 1-Bit Danger Signals
cs.AI · 2026-04-25 · unverdicted · none · ref 7
The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers
cs.CY · 2026-04-27 · unverdicted · none · ref 26 · 2 links
Failure Modes of Maximum Entropy RLHF
cs.LG · 2025-09-24 · unverdicted · none · ref 23
