{"work":{"id":"42fcaa3e-0409-481b-9dd5-9a2c3be8a383","openalex_id":null,"doi":null,"arxiv_id":"1907.00456","raw_key":null,"title":"Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog","authors":null,"authors_text":"Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, and Rosalind Picard","year":2019,"venue":"cs.LG","abstract":"Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g. systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation -- a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. 
We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.","external_url":"https://arxiv.org/abs/1907.00456","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-19T04:17:02.787107+00:00","pith_arxiv_id":"1907.00456","created_at":"2026-05-10T20:59:58.703650+00:00","updated_at":"2026-05-19T04:17:02.787107+00:00","title_quality_ok":true,"display_title":"Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456","render_title":"Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv preprint arXiv:1907.00456"},"hub":{"state":{"work_id":"42fcaa3e-0409-481b-9dd5-9a2c3be8a383","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":15,"external_cited_by_count":null,"distinct_field_count":3,"first_pith_cited_at":"2019-09-18T17:33:39+00:00","last_pith_cited_at":"2026-05-14T04:22:24+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-28T06:48:18.010495+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":2},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":2},{"context_polarity":"use_method","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}