hub

Way oﬀ-policy batch deep reinforcement learning of implicit human preferences in dialog.arXiv preprint arXiv:1907.00456

Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, Rosalind Picard · 2019 · cs.LG · arXiv 1907.00456

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

open full Pith review browse 15 citing papers arXiv PDF

abstract

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation -- a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 method 1

citation-polarity summary

background 2 use method 1

representative citing papers

Multi-Armed Sampling Problem and the End of Exploration

cs.LG · 2025-07-14 · conditional · novelty 8.0

Multi-armed sampling framework shows near-optimal regret is achievable with minimal exploration, unlike bandits, and unifies both via a continuous temperature family.

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

cs.LG · 2020-04-15 · accept · novelty 8.0

D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.

Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

Learning to summarize from human feedback

cs.CL · 2020-09-02 · conditional · novelty 7.0

Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.

Aligning Flow Map Policies with Optimal Q-Guidance

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.

Red Teaming Language Models with Language Models

cs.CL · 2022-02-07 · conditional · novelty 7.0

One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.

Fine-Tuning Language Models from Human Preferences

cs.CL · 2019-09-18 · unverdicted · novelty 7.0

Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.

Benchmarking Batch Deep Reinforcement Learning Algorithms

cs.LG · 2019-10-03 · unverdicted · novelty 6.0

Many batch RL algorithms underperform both online DQN and the behavioral policy on Atari; an adapted discrete-action BCQ outperforms the others tested.

Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic policy with a KL penalty to a learned market prior.

Threshold-Guided Optimization for Visual Generative Models

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

cs.LG · 2020-06-16 · unverdicted · novelty 6.0

AWAC combines offline data with online RL via advantage-weighted actor-critic updates to enable faster acquisition of robotic skills such as dexterous manipulation.

Behavior Regularized Offline Reinforcement Learning

cs.LG · 2019-11-26 · unverdicted · novelty 6.0

Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.

Secrets of RLHF in Large Language Models Part I: PPO

cs.CL · 2023-07-11 · unverdicted · novelty 5.0

Policy constraints are the critical factor for stable PPO training in RLHF, and the proposed PPO-max variant improves stability for large language model alignment.

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

cs.LG · 2020-05-04 · unverdicted · novelty 2.0

Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

citing papers explorer

Showing 15 of 15 citing papers.

Multi-Armed Sampling Problem and the End of Exploration cs.LG · 2025-07-14 · conditional · none · ref 14 · internal anchor
Multi-armed sampling framework shows near-optimal regret is achievable with minimal exploration, unlike bandits, and unifies both via a continuous temperature family.
D4RL: Datasets for Deep Data-Driven Reinforcement Learning cs.LG · 2020-04-15 · accept · none · ref 13
D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling cs.LG · 2026-05-14 · unverdicted · none · ref 20 · internal anchor
DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models cs.AI · 2024-06-14 · conditional · none · ref 186 · internal anchor
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
Learning to summarize from human feedback cs.CL · 2020-09-02 · conditional · none · ref 25 · internal anchor
Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
Aligning Flow Map Policies with Optimal Q-Guidance cs.LG · 2026-05-12 · unverdicted · none · ref 17
Flow map policies enable fast one-step inference for flow-based RL policies, and FMQ provides an optimal closed-form Q-guided target for offline-to-online adaptation under trust-region constraints, achieving SOTA performance.
Red Teaming Language Models with Language Models cs.CL · 2022-02-07 · conditional · none · ref 4
One language model can generate diverse test cases to automatically uncover tens of thousands of harmful behaviors, including offensive replies and privacy leaks, in a large target language model.
Fine-Tuning Language Models from Human Preferences cs.CL · 2019-09-18 · unverdicted · none · ref 11
Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
Benchmarking Batch Deep Reinforcement Learning Algorithms cs.LG · 2019-10-03 · unverdicted · none · ref 9 · internal anchor
Many batch RL algorithms underperform both online DQN and the behavioral policy on Atari; an adapted discrete-action BCQ outperforms the others tested.
Market-Alignment Risk in Pricing Agents: Trace Diagnostics and Trace-Prior RL under Hidden Competitor State cs.AI · 2026-05-07 · unverdicted · none · ref 4
In a hotel revenue-management simulator, standard RL agents game scalar RevPAR rewards under hidden competitor states, but Trace-Prior RL matches both revenue metrics and price distributions by training a stochastic policy with a KL penalty to a learned market prior.
Threshold-Guided Optimization for Visual Generative Models cs.LG · 2026-05-06 · unverdicted · none · ref 42
A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
AWAC: Accelerating Online Reinforcement Learning with Offline Datasets cs.LG · 2020-06-16 · unverdicted · none · ref 23
AWAC combines offline data with online RL via advantage-weighted actor-critic updates to enable faster acquisition of robotic skills such as dexterous manipulation.
Behavior Regularized Offline Reinforcement Learning cs.LG · 2019-11-26 · unverdicted · none · ref 10
Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.
Secrets of RLHF in Large Language Models Part I: PPO cs.CL · 2023-07-11 · unverdicted · none · ref 32 · internal anchor
Policy constraints are the critical factor for stable PPO training in RLHF, and the proposed PPO-max variant improves stability for large language model alignment.
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems cs.LG · 2020-05-04 · unverdicted · none · ref 206
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Way oﬀ-policy batch deep reinforcement learning of implicit human preferences in dialog.arXiv preprint arXiv:1907.00456

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer