Multiple Object Recognition with Visual Attention
We present an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. We show that the model learns to both localize and recognize multiple objects despite being given only class labels during training. We evaluate the model on the challenging task of transcribing house number sequences from Google Street View images and show that it is both more accurate than the state-of-the-art convolutional networks and uses fewer parameters and less computation.
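The abstract describes a recurrent model that chooses image regions to attend to and is trained with reinforcement learning, but it does not give the details. The following is a minimal sketch of the two ingredients such a model needs: cropping a glimpse at a chosen location, and a score-function (REINFORCE-style) gradient estimate for a stochastic location policy. The Gaussian policy with fixed standard deviation, the scalar baseline, and all function names (`extract_glimpse`, `sample_location`, `reinforce_grad_mean`) are assumptions made for illustration, not details taken from the abstract.

```python
import random


def extract_glimpse(image, center, size):
    """Crop a size x size patch whose top-left corner is center - size // 2.

    `image` is a list of rows; pixels outside the image are zero-padded.
    """
    rows, cols = len(image), len(image[0])
    r0 = center[0] - size // 2
    c0 = center[1] - size // 2
    patch = []
    for r in range(r0, r0 + size):
        row = []
        for c in range(c0, c0 + size):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(image[r][c] if inside else 0.0)
        patch.append(row)
    return patch


def sample_location(mean, std, rng):
    """Sample a (row, col) location from an isotropic Gaussian policy (assumed)."""
    return tuple(rng.gauss(m, std) for m in mean)


def reinforce_grad_mean(mean, sampled, std, reward, baseline):
    """Score-function gradient estimate with respect to the policy mean.

    For a Gaussian policy, d/d(mean) log p(sampled) = (sampled - mean) / std**2,
    which is scaled by the baseline-subtracted reward.
    """
    return tuple(
        (reward - baseline) * (s - m) / std ** 2
        for s, m in zip(sampled, mean)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    mean = (1.0, 1.0)
    loc = sample_location(mean, 0.5, rng)
    # Round the continuous sample to pixel coordinates before cropping.
    glimpse = extract_glimpse(image, (round(loc[0]), round(loc[1])), 3)
    # Hypothetical reward: 1.0 if the classification from this glimpse was correct.
    grad = reinforce_grad_mean(mean, loc, 0.5, reward=1.0, baseline=0.5)
    print(glimpse, grad)
```

In a full model the reward would come from whether the predicted label sequence is correct, which is consistent with the abstract's statement that only class labels are available during training; the recurrent network that produces the policy mean is omitted here.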
Forward citations
Cited by 5 Pith papers
- Learning to Theorize the World from Observation
  NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
- Learning Blended, Precise Semantic Program Embeddings
  LIGER blends symbolic and concrete traces to learn precise semantic program embeddings, outperforming syntax-based models on CoSET classification and code2seq on method name prediction while using fewer executions.
- Inverse Attention Guided Deep Crowd Counting Network
  IA-DCCN is a single-step VGG-16 network that infuses segmentation via inverse attention to improve crowd counting accuracy on three datasets with minimal overhead.
- Learning World Graphs to Accelerate Hierarchical Reinforcement Learning
  A two-stage framework learns a world graph of pivotal states task-agnostically via joint training of a latent model and curiosity-driven policy, then uses the graph to accelerate hierarchical RL on maze tasks.
- A Deep Decoder Structure Based on WordEmbedding Regression for An Encoder-Decoder Based Model for Image Captioning
  The authors replace next-word log-likelihood training with word-embedding regression in an encoder-decoder captioning model and report CIDEr 125.0 and BLEU-4 50.5 on MS-COCO, exceeding prior bests of 117.1 and 48.0.