Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks

Navdeep Jaitly; Noam Shazeer; Oriol Vinyals; Samy Bengio

arxiv: 1506.03099 · v3 · pith:6LWUZD2Dnew · submitted 2015-06-09 · cs.LG · cs.CL · cs.CV

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer

classification cs.LG · cs.CL · cs.CV

keywords token · sequence · generated · previous · recurrent · training · approach · captioning

Abstract

Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning. The current approach to training them consists of maximizing the likelihood of each token in the sequence given the current (recurrent) state and the previous token. At inference, the unknown previous token is then replaced by a token generated by the model itself. This discrepancy between training and inference can yield errors that can accumulate quickly along the generated sequence. We propose a curriculum learning strategy to gently change the training process from a fully guided scheme using the true previous token, towards a less guided scheme which mostly uses the generated token instead. Experiments on several sequence prediction tasks show that this approach yields significant improvements. Moreover, it was used successfully in our winning entry to the MSCOCO image captioning challenge, 2015.
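The sampling mechanism the abstract describes can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names (`teacher_forcing_prob`, `choose_previous_token`, `build_decoder_inputs`), the `model_step` callback, the `start_token` value, and the linear decay are all assumptions made for the example. The abstract only says that training moves gradually from the true previous token toward the generated one.

```python
import random


def teacher_forcing_prob(step, total_steps, floor=0.0):
    """Hypothetical linear schedule: probability of feeding the TRUE previous token.

    Starts at 1.0 (fully guided) and decays to `floor` by `total_steps`.
    The linear shape is an assumption; other decay curves are possible.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return max(floor, 1.0 - frac)


def choose_previous_token(true_prev, model_prev, eps, rng):
    """Use the true token with probability eps, otherwise the model's own token."""
    return true_prev if rng.random() < eps else model_prev


def build_decoder_inputs(targets, model_step, eps, rng, start_token=0):
    """Pick the input token for every position of one training sequence.

    `model_step` is a stand-in (assumption) for one decoder step: it maps the
    token that was fed in to the token the model would generate next.
    """
    inputs = []
    prev_true, prev_model = start_token, start_token
    for target in targets:
        fed = choose_previous_token(prev_true, prev_model, eps, rng)
        inputs.append(fed)
        prev_model = model_step(fed)  # what the model would emit at this step
        prev_true = target            # ground truth for the next step
    return inputs


if __name__ == "__main__":
    rng = random.Random(0)
    toy_model = lambda token: token + 100  # toy decoder, for illustration only
    for step in (0, 5, 10):
        eps = teacher_forcing_prob(step, total_steps=10)
        print(step, eps, build_decoder_inputs([1, 2, 3], toy_model, eps, rng))
```

With `eps = 1.0` the loop reduces to ordinary teacher forcing, and with `eps = 0.0` every input comes from the model, which matches the inference-time condition the abstract identifies as the source of the discrepancy.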

This paper has not been read by Pith yet.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs
cs.CV 2026-05 unverdicted novelty 7.0

UniVLR unifies textual and visual reasoning in multimodal LLMs by compressing reasoning traces and auxiliary images into visual latent tokens for direct inference without interleaved text CoT.

TimeLesSeg: Unified Contrast-Agnostic Cross-Sectional and Longitudinal MS Lesion Segmentation via a Stochastic Generative Model
cs.CV 2026-05 unverdicted novelty 6.0

TimeLesSeg delivers a unified contrast-agnostic CNN for MS lesion segmentation that seamlessly handles both cross-sectional and longitudinal inputs by combining empty prior masks with stochastic morphological deformat...

Graph Neural ODE Digital Twins for Control-Oriented Reactor Thermal-Hydraulic Forecasting Under Partial Observability
cs.LG 2026-04 unverdicted novelty 6.0

A GNN-ODE surrogate forecasts reactor thermal-hydraulics under partial observability, achieving low MAE on held-out transients, fast inference, and recovery of a physical Reynolds-number exponent after fine-tuning on ...

AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling
cs.LG 2026-04 unverdicted novelty 6.0

AE-ViT combines a convolutional autoencoder with a latent-space transformer and multi-stage parameter plus coordinate injection to deliver stable long-horizon predictions for parametric PDEs, cutting relative rollout ...

Graph Neural ODE Digital Twins for Control-Oriented Reactor Thermal-Hydraulic Forecasting Under Partial Observability
cs.LG 2026-04 unverdicted novelty 5.0

A GNN-ODE digital twin forecasts reactor thermal-hydraulic states under partial observability, achieving low error on held-out transients and recovering a physical heat-transfer correlation during sim-to-real adaptation.

What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?
cs.AI 2025-12 unverdicted novelty 5.0

An empirical study of JEPA world models identifies architecture, training objective, and planning choices that yield a model outperforming DINO-WM and V-JEPA-2-AC on navigation and manipulation tasks.

Hierarchical Sequence to Sequence Voice Conversion with Limited Data
eess.AS 2019-07 unverdicted novelty 4.0

A hierarchical seq2seq model for parallel voice conversion, pretrained as an autoencoder on single-speaker data and then adapted to limited multispeaker data, with mel spectrograms converted to audio by a WaveNet vocoder.