An Actor-Critic Algorithm for Sequence Prediction

Aaron Courville; Anirudh Goyal; Dzmitry Bahdanau; Joelle Pineau; Kelvin Xu; Philemon Brakel; Ryan Lowe; Yoshua Bengio

arXiv: 1607.07086 · v3 · pith:RSU7P2S2new · submitted 2016-07-24 · cs.LG


Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, Yoshua Bengio


keywords: training, methods, network, actor-critic, critic, generate, generation, ground-truth


Abstract

We present an approach to training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL). Current log-likelihood training methods are limited by the discrepancy between their training and testing modes, as models must generate tokens conditioned on their previous guesses rather than the ground-truth tokens. We address this problem by introducing a critic network that is trained to predict the value of an output token, given the policy of an actor network. This results in a training procedure that is much closer to the test phase, and allows us to directly optimize for a task-specific score such as BLEU. Crucially, since we leverage these techniques in the supervised learning setting rather than the traditional RL setting, we condition the critic network on the ground-truth output. We show that our method leads to improved performance on both a synthetic task and German-English machine translation. Our analysis paves the way for such methods to be applied in natural language generation tasks, such as machine translation, caption generation, and dialogue modelling.
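The actor-critic idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a single decoding step, a softmax actor over a three-token vocabulary, and hand-picked numbers standing in for the critic's per-token value estimates. The function names `actor_logit_gradient` and `critic_td_target` are hypothetical and chosen only for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_logit_gradient(logits, q_values):
    """Gradient of sum_a p(a) * Q(a) with respect to the actor's logits.

    Q is treated as a constant supplied by the critic (assumption: toy values).
    For a softmax policy this reduces to p(a) * (Q(a) - E_p[Q]).
    """
    probs = softmax(logits)
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    return [p * (q - expected_q) for p, q in zip(probs, q_values)]


def critic_td_target(reward, next_probs, next_q_values):
    """One-step target for the critic: the immediate reward plus the
    expected next-step value under the actor's next-token distribution."""
    return reward + sum(p * q for p, q in zip(next_probs, next_q_values))


# Toy usage: a uniform actor, and a critic that prefers token 0.
logits = [0.0, 0.0, 0.0]
q_values = [1.0, 0.0, 0.0]
grad = actor_logit_gradient(logits, q_values)

# A small gradient-ascent step shifts probability mass toward token 0.
updated = [z + 0.5 * g for z, g in zip(logits, grad)]
new_probs = softmax(updated)
```

In this sketch the actor is pushed toward tokens the critic scores above average, which is how a task-specific score such as BLEU could shape the policy instead of log-likelihood alone. Conditioning the critic on the ground-truth output, as the abstract describes, is outside the scope of this toy example.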



Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise
math.PR · 2026-05 · unverdicted · novelty 7.0

Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properti...

SynthFix: Adaptive Neuro-Symbolic Code Vulnerability Repair
cs.SE · 2026-04 · unverdicted · novelty 7.0

SynthFix adaptively routes LLM code repairs to supervised fine-tuning or symbolic-reward fine-tuning, yielding up to 32% higher exact match on JavaScript and C vulnerability benchmarks.

Learning to summarize from human feedback
cs.CL · 2020-09 · conditional · novelty 7.0

Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.

Aligning Text-to-Image Models using Human Feedback
cs.LG · 2023-02 · unverdicted · novelty 6.0

A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.

Retrieving Sequential Information for Non-Autoregressive Neural Machine Translation
cs.CL · 2019-06 · unverdicted · novelty 6.0

Reinforce-NAT and FS-decoder retrieve target sequential information for non-autoregressive translation, yielding higher BLEU than baseline NAT while preserving fast decoding and approaching autoregressive quality.

Deep Reinforcement Learning for Personalized Search Story Recommendation
cs.LG · 2019-07 · unverdicted · novelty 3.0

A deep RL architecture using imitation learning and reinforcement learning is proposed to model immediate and future values of search story recommendations in a Markov decision process framework.