Inverse reward design

Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J · 2017 · arXiv 1711.02827

3 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Discovering Agentic Safety Specifications from 1-Bit Danger Signals

cs.AI · 2026-04-25 · unverdicted · novelty 7.0

LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.

The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers

cs.CY · 2026-04-27 · unverdicted · novelty 6.0 · 3 refs

Survey experiment finds that people apply more deontological standards to AI described as human-programmed and to the programmers themselves than to unaided humans or unprogrammed robots in a moral dilemma.

Failure Modes of Maximum Entropy RLHF

cs.LG · 2025-09-24 · unverdicted · novelty 5.0

Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.

citing papers explorer


Discovering Agentic Safety Specifications from 1-Bit Danger Signals

cs.AI · 2026-04-25 · unverdicted · none · ref 7

LLM agents autonomously evolve human-readable safety specifications from sparse 1-bit danger signals, outperforming reward-based reflection that encourages reward hacking.
