Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Aaron Courville; Jimmy Ba; Kelvin Xu; Kyunghyun Cho; Richard Zemel; Ruslan Salakhutdinov; Ryan Kiros; Yoshua Bengio

arxiv: 1502.03044 · v3 · pith:PMPEZ4K3new · submitted 2015-02-10 · 💻 cs.LG · cs.CV

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio

keywords: attention, model, automatically, describe, able, attend, backpropagation, benchmark

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
cs.CL 2017-05 accept novelty 8.0

TriviaQA is a new large-scale dataset for reading comprehension that features complex compositional questions, high lexical variability, and cross-sentence reasoning requirements, where current baselines reach only 40...
Categorical Reparameterization with Gumbel-Softmax
stat.ML 2016-11 unverdicted novelty 8.0

Gumbel-Softmax provides a continuous relaxation of categorical sampling that anneals to discrete samples for gradient-based optimization.
ViperGPT: Visual Inference via Python Execution for Reasoning
cs.CV 2023-03 unverdicted novelty 7.0

ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.
Text-Video Retrieval With Global-Local Contrastive Consistency Learning
cs.IR 2026-05 unverdicted novelty 5.0

GLCCL uses a Global-Local Interaction Module and Contrastive Score Consistency loss to align text and video semantics more efficiently than attention-based methods on MSR-VTT, DiDeMo, and VATEX.
Heuristic Style Transfer for Real-Time, Efficient Weather Attribute Detection
cs.CV 2026-04 conditional novelty 5.0

Lightweight multi-task models using Gram matrices and PatchGAN-style architectures detect 53 weather classes from RGB images with F1 scores above 96% internally and 78% zero-shot externally, supported by a new 503k-im...
EPNAS: Efficient Progressive Neural Architecture Search
cs.LG 2019-07 unverdicted novelty 5.0

EPNAS uses a progressive search policy with REINFORCE performance prediction to search neural architectures in parallel, supporting multiple resource constraints and outperforming ENAS and PNAS on CIFAR-10 and ImageNe...
MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
cs.CV 2025-02 unverdicted novelty 4.0

MsEdF combines two complementary image encoders for feature diversity and a stacked GRU decoder with element-wise aggregation to improve remote sensing image captioning on three benchmark datasets.
Automatic Radiology Report Generation based on Multi-view Image Fusion and Medical Concept Enrichment
eess.IV 2019-07 unverdicted novelty 4.0

An encoder-decoder model with multi-view late fusion and medical concept attention achieves claimed state-of-the-art performance on chest X-ray report generation using the Indiana University dataset.
Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module
cs.CV 2025-03 unverdicted novelty 3.0

CA-TriNet combines co-attention transformers with a triple-LSTM module for medical report generation and reports outperforming prior models on three public datasets.