pith. sign in

arxiv: 1805.08819 · v4 · pith:E6CRSEMWnew · submitted 2018-05-22 · 💻 cs.CV

Learning what and where to attend

classification 💻 cs.CV
keywords attendattentionclickmefeaturesimagerecognitiondcnsdemonstrate
0
0 comments X
read the original abstract

Most recent gains in visual recognition have originated from the inclusion of attention mechanisms in deep convolutional networks (DCNs). Because these networks are optimized for object recognition, they learn where to attend using only a weak form of supervision derived from image class labels. Here, we demonstrate the benefit of using stronger supervisory signals by teaching DCNs to attend to image regions that humans deem important for object recognition. We first describe a large-scale online experiment (ClickMe) used to supplement ImageNet with nearly half a million human-derived "top-down" attention maps. Using human psychophysics, we confirm that the identified top-down features from ClickMe are more diagnostic than "bottom-up" saliency features for rapid image categorization. As a proof of concept, we extend a state-of-the-art attention network and demonstrate that adding ClickMe supervision significantly improves its accuracy and yields visual features that are more interpretable and more similar to those used by human observers.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot

    cs.CV 2026-05 unverdicted novelty 6.0

    Hybrid JEMs at intermediate generative-discriminative balance maximize human alignment on perceptual similarity, gloss, uncertainty, robustness, cue conflict, and feature attribution benchmarks.