Learning from Dialogue after Deployment: Feed Yourself, Chatbot!

Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, Jason Weston · 2019 · cs.CL · arXiv 1901.05415

6 Pith papers cite this work. Polarity classification is still indexing.

abstract

The majority of conversations a dialogue agent sees over its lifetime occur after it has already been trained and deployed, leaving a vast store of potential training signal untapped. In this work, we propose the self-feeding chatbot, a dialogue agent with the ability to extract new training examples from the conversations it participates in. As our agent engages in conversation, it also estimates user satisfaction in its responses. When the conversation appears to be going well, the user's responses become new training examples to imitate. When the agent believes it has made a mistake, it asks for feedback; learning to predict the feedback that will be given improves the chatbot's dialogue abilities further. On the PersonaChat chit-chat dataset with over 131k training examples, we find that learning from dialogue with a self-feeding chatbot significantly improves performance, regardless of the amount of traditional supervision.
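
The routing described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `SelfFeedingBuffer`, `process_turn`, and `ask_feedback`, and the threshold values `high` and `low`, are hypothetical, and the satisfaction score is assumed to come from some separately trained estimator that returns a value in [0, 1].

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SelfFeedingBuffer:
    """Holds the two kinds of new training examples mined from live dialogue."""
    dialogue_examples: List[Tuple[str, str]] = field(default_factory=list)
    feedback_examples: List[Tuple[str, str]] = field(default_factory=list)


def process_turn(
    context: str,
    user_reply: str,
    satisfaction: float,
    buffer: SelfFeedingBuffer,
    ask_feedback: Callable[[str], str],
    high: float = 0.8,  # assumed threshold, not taken from the paper
    low: float = 0.3,   # assumed threshold, not taken from the paper
) -> str:
    """Route one turn based on the estimated user satisfaction."""
    if satisfaction >= high:
        # Conversation appears to be going well: the user's reply
        # becomes a new example for the agent to imitate.
        buffer.dialogue_examples.append((context, user_reply))
        return "imitate"
    if satisfaction <= low:
        # The agent believes it made a mistake: ask for feedback and
        # store it as a target for the feedback-prediction task.
        feedback = ask_feedback(context)
        buffer.feedback_examples.append((context, feedback))
        return "feedback"
    # Uncertain region: extract nothing from this turn.
    return "skip"
```

Leaving a gap between the two thresholds is a design choice in this sketch: turns with an ambiguous satisfaction estimate contribute no examples, which avoids adding noisy labels to either set.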

representative citing papers

Learning to summarize from human feedback

cs.CL · 2020-09-02 · conditional · novelty 7.0

Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.

Fine-Tuning Language Models from Human Preferences

cs.CL · 2019-09-18 · unverdicted · novelty 7.0

Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

cs.CL · 2023-05-03 · conditional · novelty 6.0

Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

cs.LG · 2019-06-30 · unverdicted · novelty 6.0

Develops Way Off-Policy batch RL algorithms with pre-trained model priors, KL-control, and dropout uncertainty estimates to learn implicit rewards from offline human dialog data, reporting live deployment gains over prior offline methods.

Why Build an Assistant in Minecraft?

cs.AI · 2019-07-22 · unverdicted · novelty 4.0

A rationale is presented for developing an assistant in Minecraft to advance natural language understanding and dialogue learning.

Emotionally-Aware Chatbots: A Survey

cs.CL · 2019-06-24 · unverdicted · novelty 1.0

A survey of emotionally-aware chatbots finding evolution from rule-based to neural methods with most systems including emotion classifiers based on affective resources.

citing papers explorer

Learning to summarize from human feedback

cs.CL · 2020-09-02 · conditional · none · ref 21 · internal anchor

Fine-Tuning Language Models from Human Preferences

cs.CL · 2019-09-18 · unverdicted · none · ref 6

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

cs.CL · 2023-05-03 · conditional · none · ref 74 · internal anchor

Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

cs.LG · 2019-06-30 · unverdicted · none · ref 23 · internal anchor

Why Build an Assistant in Minecraft?

cs.AI · 2019-07-22 · unverdicted · none · ref 34 · internal anchor

Emotionally-Aware Chatbots: A Survey

cs.CL · 2019-06-24 · unverdicted · none · ref 18 · internal anchor
