Multiple Object Recognition with Visual Attention

Jimmy Ba; Volodymyr Mnih; Koray Kavukcuoglu

arXiv: 1412.7755 · v2 · pith:AEUK2734new · submitted 2014-12-24 · cs.LG · cs.CV · cs.NE

classification cs.LG · cs.CV · cs.NE

keywords model · multiple · images · objects · accurate · attend · attention · attention-based

We present an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. We show that the model learns to both localize and recognize multiple objects despite being given only class labels during training. We evaluate the model on the challenging task of transcribing house number sequences from Google Street View images and show that it is both more accurate than the state-of-the-art convolutional networks and uses fewer parameters and less computation.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning to Theorize the World from Observation
cs.LG · 2026-05 · unverdicted · novelty 6.0

NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

Learning Blended, Precise Semantic Program Embeddings
cs.SE · 2019-07 · unverdicted · novelty 6.0

LIGER blends symbolic and concrete traces to learn precise semantic program embeddings, outperforming syntax-based models on CoSET classification and code2seq on method name prediction while using fewer executions.

Inverse Attention Guided Deep Crowd Counting Network
cs.CV · 2019-07 · unverdicted · novelty 6.0

IA-DCCN is a single-step VGG-16 network that infuses segmentation via inverse attention to improve crowd counting accuracy on three datasets with minimal overhead.

Learning World Graphs to Accelerate Hierarchical Reinforcement Learning
cs.LG · 2019-07 · unverdicted · novelty 6.0

A two-stage framework learns a world graph of pivotal states task-agnostically via joint training of a latent model and curiosity-driven policy, then uses the graph to accelerate hierarchical RL on maze tasks.

A Deep Decoder Structure Based on WordEmbedding Regression for An Encoder-Decoder Based Model for Image Captioning
cs.CV · 2019-06 · unverdicted · novelty 6.0

The authors replace next-word log-likelihood training with word-embedding regression in an encoder-decoder captioning model and report CIDEr 125.0 and BLEU-4 50.5 on MS-COCO, exceeding prior bests of 117.1 and 48.0.