Sequence Level Training with Recurrent Neural Networks

Marc'Aurelio Ranzato; Michael Auli; Sumit Chopra; Wojciech Zaremba

arxiv: 1511.06732 · v7 · pith:3HNZ3IAXnew · submitted 2015-11-20 · 💻 cs.LG · cs.CL

classification 💻 cs.LG cs.CL

keywords sequence baselines generate generation language level models several

Many natural language processing applications use language models to generate text. These models are typically trained to predict the next word in a sequence, given the previous words and some context such as an image. However, at test time the model is expected to generate the entire sequence from scratch. This discrepancy makes generation brittle, as errors may accumulate along the way. We address this issue by proposing a novel sequence level training algorithm that directly optimizes the metric used at test time, such as BLEU or ROUGE. On three different tasks, our approach outperforms several strong baselines for greedy generation. The method is also competitive when these baselines employ beam search, while being several times faster.
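
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of optimizing a sequence-level metric directly. It assumes a REINFORCE-style score-function update, which is one common way to train against a non-differentiable reward. Everything in it is hypothetical and for illustration: the per-position toy policy (no recurrent network, no conditioning on previous words), the position-match reward standing in for BLEU or ROUGE, and all function names and hyperparameters.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_reward(hypothesis, reference):
    """Toy stand-in for BLEU/ROUGE: fraction of positions that match."""
    matches = sum(h == r for h, r in zip(hypothesis, reference))
    return matches / len(reference)


def sample_sequence(logits, rng):
    """Sample one token per position from the toy policy."""
    return [rng.choices(range(len(row)), weights=softmax(row))[0]
            for row in logits]


def reinforce_step(logits, sample, reward, baseline, lr):
    """In-place update: lr * (reward - baseline) * grad log p(sample)."""
    advantage = reward - baseline
    for row, token in zip(logits, sample):
        probs = softmax(row)
        for v in range(len(row)):
            # Gradient of log softmax w.r.t. logit v for the sampled token.
            grad_log_p = (1.0 if v == token else 0.0) - probs[v]
            row[v] += lr * advantage * grad_log_p


def train(reference, vocab_size, steps=2000, lr=0.5, seed=0):
    """Sample whole sequences, score them once at the end, and update."""
    rng = random.Random(seed)
    logits = [[0.0] * vocab_size for _ in reference]
    baseline = 0.0
    for _ in range(steps):
        sample = sample_sequence(logits, rng)
        reward = sequence_reward(sample, reference)  # sequence-level signal
        reinforce_step(logits, sample, reward, baseline, lr)
        baseline += 0.05 * (reward - baseline)  # running-average baseline
    return logits


def greedy_decode(logits):
    """Pick the highest-scoring token at each position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]
```

The point of the sketch is that the reward is computed once per sampled sequence, from the model's own output, rather than per word against ground-truth prefixes. Calling `greedy_decode(train([2, 0, 1], vocab_size=4))` would be expected to move toward the reference sequence, though the result depends on the seed and hyperparameters.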

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

FISolver trains a compact LLM on backward-generated (differential equation, first integral) pairs and uses guided reinforcement learning to outperform larger models and Mathematica on first-integral benchmarks at lower cost.
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
KL for a KL: On-Policy Distillation with Control Variate Baseline
cs.LG 2026-05 unverdicted novelty 7.0

vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
DP-OPD: Differentially Private On-Policy Distillation for Language Models
cs.LG 2026-04 unverdicted novelty 7.0

DP-OPD achieves lower perplexity than DP fine-tuning and synthesis-based private distillation under ε=2.0 by enforcing DP-SGD solely on the student during on-policy training with a frozen teacher.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
cs.LG 2023-05 accept novelty 7.0

DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
Learning to summarize from human feedback
cs.CL 2020-09 conditional novelty 7.0

Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
Fine-Tuning Language Models from Human Preferences
cs.CL 2019-09 unverdicted novelty 7.0

Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.
State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...
COMO: Closed-Loop Optical Molecule Recognition with Minimum Risk Training
cs.CV 2026-04 unverdicted novelty 6.0

COMO applies minimum risk training in a closed-loop setup to optimize optical chemical structure recognition models directly on molecule-level objectives like validity and similarity, outperforming prior methods on te...
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
cs.CV 2026-02 unverdicted novelty 6.0

Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
CoRoVA: Compressed Representations for Vector-Augmented Code Completion
cs.CL 2025-10 unverdicted novelty 6.0

CoRoVA compresses repository context into compact vectors for code LLMs, reducing TTFT 20-38% versus uncompressed RAG with only a small projector module.
Training Language Models to Self-Correct via Reinforcement Learning
cs.LG 2024-09 unverdicted novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Retrieving Sequential Information for Non-Autoregressive Neural Machine Translation
cs.CL 2019-06 unverdicted novelty 6.0

Reinforce-NAT and FS-decoder retrieve target sequential information for non-autoregressive translation, yielding higher BLEU than baseline NAT while preserving fast decoding and approaching autoregressive quality.
Informative Image Captioning with External Sources of Information
cs.CL 2019-06 unverdicted novelty 6.0

A multimodal Transformer ingests image features plus multiple external entity label sources and learns to control their appearance in fluent output captions.
Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation
cs.LG 2026-05 unverdicted novelty 5.0

A state distribution view of post-training shows that on-policy supervision from the learner itself can outperform fixed-dataset SFT and preserve retention better than aggressive supervised updates.
Reward-Aware Trajectory Shaping for Few-step Visual Generation
cs.CV 2026-04 unverdicted novelty 5.0

RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
Deep Learning for Time Series Forecasting: The Electric Load Case
cs.LG 2019-07 unverdicted novelty 4.0

Compares feedforward, recurrent, sequence-to-sequence and temporal convolutional neural networks for short-term electric load forecasting through experiments on two real datasets.
Ranking sentences from product description & bullets for better search
cs.IR 2019-07 unverdicted novelty 4.0

Two RL-based extractive summarization models rank sentences from product fields by leveraging titles and click-through logs to improve search relevance.