pith. machine review for the scientific record.

arxiv: 1405.0312 · v3 · submitted 2014-05-01 · 💻 cs.CV

Recognition: no theorem link

Microsoft COCO: Common Objects in Context

Pith reviewed 2026-05-11 23:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords object recognition · scene understanding · instance segmentation · crowd sourcing · common objects · everyday scenes · COCO dataset · Deformable Parts Model

The pith

A dataset of everyday scenes with per-instance segmentations advances object recognition in context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents the COCO dataset to advance object recognition by situating it within scene understanding. Images of complex everyday scenes are gathered and objects labeled with per-instance segmentations for precise localization. The collection includes 91 common object types in 328,000 images totaling 2.5 million instances, created via novel crowd-sourcing interfaces. Statistical comparisons to existing datasets and baseline detection results using Deformable Parts Models are included. A reader would care if this enables models to better handle real-world variability and context.
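
For a rough sense of scale, the rounded totals quoted above imply an average annotation density per image. The sketch below is a back-of-the-envelope calculation from those rounded figures only; the function name is illustrative, and the result should not be read as a statistic reported by the paper.

```python
def mean_instances_per_image(total_instances: int, total_images: int) -> float:
    """Average number of labeled instances per image."""
    if total_images <= 0:
        raise ValueError("total_images must be positive")
    return total_instances / total_images


# Rounded totals quoted in the summary: 2.5 million instances, 328,000 images.
density = mean_instances_per_image(2_500_000, 328_000)
print(f"{density:.1f} instances per image on average (from rounded totals)")
```

Because both inputs are rounded, the figure is only indicative of how densely annotated a typical image is.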

Core claim

The authors present a new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. The dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the dataset's creation drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting, and instance segmentation.

What carries the argument

The per-instance segmentation labels on images of common objects in natural scenes, enabled by crowd-sourcing interfaces for detection, spotting, and segmentation.

If this is right

  • Object recognition models can be evaluated and trained on contextual scenes rather than cropped objects.
  • Per-instance masks support tasks requiring precise boundaries like segmentation.
  • Large scale supports statistical robustness in algorithm comparisons.
  • New annotation methods scale labeling to millions of instances efficiently.
  • The dataset shifts focus toward holistic scene understanding in computer vision research.
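
To make the point about per-instance masks concrete, here is a minimal sketch of why a mask carries more localization information than a bounding box: the box can always be derived from the mask, but not the reverse. The list-of-rows mask representation, the helper names, and the toy L-shaped instance are assumptions for illustration, not the dataset's actual annotation format.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of 0/1 pixels; 1 marks the instance (illustrative format)


def mask_area(mask: Mask) -> int:
    """Number of pixels that belong to the instance."""
    return sum(sum(row) for row in mask)


def mask_to_bbox(mask: Mask) -> Tuple[int, int, int, int]:
    """Return (x_min, y_min, width, height) of the tightest box around the mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        raise ValueError("empty mask")
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    return x_min, y_min, x_max - x_min + 1, y_max - y_min + 1


# Toy L-shaped instance: the box covers pixels that the object does not.
toy_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
bbox = mask_to_bbox(toy_mask)
fill_ratio = mask_area(toy_mask) / (bbox[2] * bbox[3])
print(bbox, round(fill_ratio, 3))
```

The fill ratio shows how much of the box the object actually occupies; for thin or irregular objects the box alone is a loose description of where the object is.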

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could improve transfer to applications like video analysis or robotics where context matters.
  • The dataset design may inspire similar collections for other modalities or languages.
  • One might test whether models trained on COCO generalize better to unseen scenes than models trained on prior datasets.

Load-bearing premise

The novel crowd-sourcing interfaces produce sufficiently accurate and consistent per-instance labels at the reported scale without substantial systematic errors.

What would settle it

An independent audit revealing high disagreement rates between the provided labels and expert annotations on a sample of images would falsify the dataset's utility for advancing reliable recognition.
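
One way such an audit could be summarized is as the fraction of sampled instances whose crowd label overlaps poorly with the expert label. The sketch below assumes per-instance IoU values have already been computed; the 0.5 cutoff, the function name, and the sample values are hypothetical, and what counts as a "high" disagreement rate is left open here.

```python
from typing import Sequence


def disagreement_rate(ious: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of audited instances whose crowd-vs-expert IoU is below threshold."""
    if not ious:
        raise ValueError("need at least one audited instance")
    below = sum(1 for value in ious if value < threshold)
    return below / len(ious)


# Hypothetical IoU values for eight audited instances (not real data).
audit_ious = [0.92, 0.88, 0.41, 0.75, 0.30, 0.95, 0.67, 0.81]
rate = disagreement_rate(audit_ious, threshold=0.5)
print(f"{rate:.0%} of audited instances fall below the IoU threshold")
```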

Original abstract

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents the Microsoft COCO dataset: 328k images of complex everyday scenes with 2.5 million per-instance segmentations across 91 object categories recognizable by a 4-year-old. The central goal is to advance object recognition by embedding it in scene understanding, achieved via novel crowd-sourcing interfaces for category detection, instance spotting, and instance segmentation. The paper supplies statistical comparisons against PASCAL, ImageNet, and SUN plus baseline Deformable Parts Model results for bounding-box and segmentation detection.

Significance. If label quality holds, the dataset would be a major resource for the field: its scale, instance-level annotations in natural contexts, and explicit comparisons to prior collections provide a reproducible benchmark that can drive progress on precise localization and contextual recognition. The manuscript's strength lies in the independent construction and public release of the data together with parameter-free statistical characterizations.

major comments (1)
  1. [Dataset construction / annotation interfaces] Section on dataset construction and crowd-sourcing interfaces (description of instance spotting and segmentation UIs): no inter-annotator agreement scores, no expert-vs-crowd IoU on a held-out subset, and no error analysis for boundary precision or category confusion are reported for the 2.5M instances. Because the central claim rests on these labels enabling precise localization in scene context, the absence of quantitative validation is load-bearing; systematic biases (e.g., under-segmentation of small or occluded objects) would directly undermine downstream utility.
minor comments (2)
  1. [Baseline performance analysis] The baseline DPM results are presented mainly as reference points; a short discussion of why more contemporary detectors were not included would clarify the intended role of these numbers.
  2. [Statistical analysis] Tables and figures comparing statistics to PASCAL/ImageNet/SUN would benefit from explicit cross-references in the text and consistent axis scaling to ease reader comparison.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the dataset's potential impact and for the constructive feedback on annotation validation. We address the major comment below and will revise the manuscript to incorporate the requested quantitative analysis.

Point-by-point responses
  1. Referee: [Dataset construction / annotation interfaces] Section on dataset construction and crowd-sourcing interfaces (description of instance spotting and segmentation UIs): no inter-annotator agreement scores, no expert-vs-crowd IoU on a held-out subset, and no error analysis for boundary precision or category confusion are reported for the 2.5M instances. Because the central claim rests on these labels enabling precise localization in scene context, the absence of quantitative validation is load-bearing; systematic biases (e.g., under-segmentation of small or occluded objects) would directly undermine downstream utility.

    Authors: We agree that quantitative validation of the annotations is essential to support claims about precise localization in scene context. In the revised manuscript we will add a dedicated subsection to the dataset construction section that reports (1) inter-annotator agreement scores for both instance spotting and segmentation tasks, computed on a held-out set of images using multiple independent workers; (2) expert-versus-crowd IoU statistics on a separate held-out subset of 500 images annotated by computer-vision experts; and (3) a systematic error analysis that quantifies boundary-precision errors and category confusions, with explicit breakdowns for small and occluded objects. These additions will directly address concerns about potential systematic biases and strengthen the reproducibility of the benchmark. revision: yes
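
The agreement statistics discussed in this exchange can be sketched as follows. This is a minimal illustration that assumes masks are represented as sets of (x, y) pixel coordinates and that agreement is summarized as the mean IoU over all pairs of annotations of the same instance; the helper names and the toy rectangles are hypothetical, and neither the report nor the rebuttal specifies the exact protocol.

```python
from itertools import combinations
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def mask_iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel-set masks."""
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty masks agree trivially
    return len(a & b) / len(union)


def mean_pairwise_iou(masks: List[Set[Pixel]]) -> float:
    """Average IoU over every pair of annotations of the same instance."""
    pairs = list(combinations(masks, 2))
    if not pairs:
        raise ValueError("need at least two annotations")
    return sum(mask_iou(a, b) for a, b in pairs) / len(pairs)


def rect(x0: int, y0: int, x1: int, y1: int) -> Set[Pixel]:
    """Toy rectangular mask covering x0 <= x < x1 and y0 <= y < y1."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


# Three hypothetical workers outlining the same object slightly differently.
worker_a = rect(0, 0, 4, 4)
worker_b = rect(0, 0, 4, 3)
worker_c = rect(1, 0, 4, 4)
agreement = mean_pairwise_iou([worker_a, worker_b, worker_c])
print(f"mean pairwise IoU: {agreement:.2f}")
```

The same `mask_iou` helper would apply to an expert-versus-crowd comparison by pairing each crowd mask with its expert counterpart instead of with other crowd masks.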

Circularity Check

0 steps flagged

No circularity: dataset construction and release paper with external benchmarks

Full rationale

The paper presents the COCO dataset, its collection process via crowd-sourcing interfaces, statistical comparisons to PASCAL/ImageNet/SUN, and baseline results using an off-the-shelf Deformable Parts Model. No derivations, equations, predictions, or first-principles claims exist that could reduce to fitted inputs, self-definitions, or self-citation chains. The contribution is the independent dataset itself; any questions of label accuracy fall under correctness risk rather than circularity per the analysis rules. Steps array left empty as no load-bearing reductions were identified.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset construction paper with no mathematical derivations, fitted parameters, background axioms, or postulated entities beyond the empirical collection of images and labels.

pith-pipeline@v0.9.0 · 5480 in / 1095 out tokens · 42287 ms · 2026-05-11T23:56:56.623374+00:00 · methodology

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.

  2. Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks

    cs.AI 2026-05 unverdicted novelty 7.0

    Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action u...

  3. Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...

  4. Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

  5. MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

    cs.IR 2026-04 unverdicted novelty 7.0

    MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

  6. Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

    cs.CV 2026-04 unverdicted novelty 7.0

    Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.

  7. Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

    cs.CV 2026-04 conditional novelty 7.0

    Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

  8. MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer

    cs.CV 2026-04 unverdicted novelty 7.0

    MAST is a mask-guided attention allocation method that enables artifact-free multi-style transfer in diffusion models by anchoring layout, distributing attention mass, scaling sharpness, and injecting details.

  9. WildDet3D: Scaling Promptable 3D Detection in the Wild

    cs.CV 2026-04 unverdicted novelty 7.0

    WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

  10. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  11. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  12. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  13. Semantics-Aware Hierarchical Token Communication: Clustering, Bit Mapping, and Power Allocation

    eess.SP 2026-04 unverdicted novelty 6.0

    H-TokCom groups tokens by semantic similarity and protects cluster-level bits with higher power, raising semantic similarity from 0.206 to 0.279 at 3 dB SNR on COCO data.

  14. Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.

  15. LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images

    cs.CV 2026-04 unverdicted novelty 6.0

    LatentDiff scales semantic dataset comparison to millions of images using latent spaces of vision encoders combined with sparse autoencoders and density ratio estimation, showing better accuracy and robustness than ca...

  16. DeepSignature: Digitally Signed, Content-Encoding Watermarks for Robust and Transparent Image Authentication

    cs.CR 2026-04 unverdicted novelty 6.0

    DeepSignature embeds digitally signed content-encoding watermarks via neural networks for robust image authentication, source attribution, and latent-space tamper localization.

  17. Source-Modality Monitoring in Vision-Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Vision-language models use semantic signals more than syntactic ones to bind words like 'image' to actual visual inputs, with implications for robustness in multimodal systems.

  18. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.

  19. MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

    cs.LG 2026-04 unverdicted novelty 6.0

    MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.

  20. Zero-shot World Models Are Developmentally Efficient Learners

    cs.AI 2026-04 unverdicted novelty 6.0

    A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.

  21. Bayesian Optimization for Mixed-Variable Problems in the Natural Sciences

    cs.LG 2026-04 unverdicted novelty 6.0

    A generalization of probabilistic reparameterization allows gradient-based acquisition optimization in fully mixed-variable Bayesian optimization with Gaussian process surrogates for non-equidistant discrete spaces.

  22. DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.

  23. Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

    cs.CV 2026-04 unverdicted novelty 5.0

    PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.

  24. From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media

    cs.CV 2026-04 unverdicted novelty 5.0

    VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.

  25. FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

    cs.CV 2026-04 unverdicted novelty 5.0

    FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...

  26. DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

    cs.CV 2026-04 unverdicted novelty 5.0

    DeCo-DETR constructs a hierarchical semantic prototype space from LVLM-generated descriptions aligned via CLIP and uses decoupled training streams to separate semantic reasoning from detection, yielding efficient open...

  27. Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks

    cs.CV 2026-04 unverdicted novelty 5.0

    Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.

  28. SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection

    cs.CV 2026-04 unverdicted novelty 4.0

    A generative pipeline creates realistic synthetic pitting defects and other surface flaws that, when added to real training data, yield modest gains in industrial defect detectors without replacing the need for authen...

  29. No Pedestrian Left Behind: Real-Time Detection and Tracking of Vulnerable Road Users for Adaptive Traffic Signal Control

    cs.CV 2026-04 unverdicted novelty 4.0

    NPLB combines YOLOv12 detection and ByteTrack tracking with an adaptive controller to extend pedestrian phases, cutting simulated stranding rates from 9.1% to 2.6% while extending signals in only 12.1% of cycles.

  30. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    cs.CV 2025-02 unverdicted novelty 4.0

    SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...

  31. PaliGemma: A versatile 3B VLM for transfer

    cs.CV 2024-07 unverdicted novelty 4.0

    PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.

  32. Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge

    cs.CV 2026-04 unverdicted novelty 2.0

    The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.
