On the sensitivity of reward inference to misspecified human models

Joey Hong, Kush Bhatia, Anca Dragan · arXiv 2212.04717

2 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Active teacher selection for reward learning

cs.AI · 2023-10-23 · unverdicted · novelty 6.0

The Hidden Utility Bandit (HUB) framework models teacher heterogeneity in reward learning and supports active teacher selection algorithms that outperform baselines in paper recommendation and COVID-19 vaccine testing domains.

Towards Understanding Sycophancy in Language Models

cs.CL · 2023-10-20 · conditional · novelty 6.0

Sycophancy is prevalent in state-of-the-art AI assistants and is likely driven in part by human preferences that favor agreement over truthfulness.

citing papers explorer

Showing 2 of 2 citing papers.

Active teacher selection for reward learning · cs.AI · 2023-10-23 · unverdicted · none · ref 5
Towards Understanding Sycophancy in Language Models · cs.CL · 2023-10-20 · conditional · none · ref 10
