Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

Yukun Zhu , Ryan Kiros , Richard Zemel , Ruslan Salakhutdinov , Raquel Urtasun , Antonio Torralba , Sanja Fidler

classification 💻 cs.CV cs.CL

keywords books movie movies align book embedding explanations information

Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HellaSwag: Can a Machine Really Finish Your Sentence?
cs.CL 2019-05 unverdicted novelty 8.0

HellaSwag dataset shows state-of-the-art models fail commonsense inference tasks that humans solve easily, built via adversarial filtering of distractors.
Evaluating Non-English Developer Support in Machine Learning for Software Engineering
cs.SE 2026-05 unverdicted novelty 7.0

Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
GAIA: a benchmark for General AI Assistants
cs.CL 2023-11 unverdicted novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Refresh-Scaling the Memory of Balanced Adam
cs.LG 2026-05 unverdicted novelty 6.0

Choosing beta in balanced Adam so the refresh count R_beta is approximately 1000 reduces the worst-case validation gap by 33.4% and keeps all runs within 1% of their oracle compared with the best fixed-beta baseline.
Refresh-Scaling the Memory of Balanced Adam
cs.LG 2026-05 unverdicted novelty 5.0

Setting β in balanced Adam to achieve a refresh count R_β ≈1000 based on effective learning horizon T_ES improves validation robustness over fixed-β baselines across 11 vision and language experiments.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
cs.CL 2019-07 accept novelty 5.0

With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.
Neural Network Optimization Reimagined: Decoupled Techniques for Scratch and Fine-Tuning
cs.CV 2026-04 unverdicted novelty 3.0

DualOpt decouples optimization by using real-time layer-wise weight decay for scratch training and weight rollback for fine-tuning to improve convergence, generalization, and reduce knowledge forgetting.