Recognition: no theorem link
Microsoft COCO: Common Objects in Context
Pith reviewed 2026-05-11 23:56 UTC · model grok-4.3
The pith
A dataset of everyday scenes with per-instance segmentations advances object recognition in context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a new dataset with the goal of advancing the state of the art in object recognition by placing object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid precise object localization. The dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, its creation drew upon extensive crowd-worker involvement via novel user interfaces for category detection, instance spotting, and instance segmentation.
What carries the argument
The per-instance segmentation labels on images of common objects in natural scenes, enabled by crowd-sourcing interfaces for detection, spotting, and segmentation.
If this is right
- Object recognition models can be evaluated and trained on contextual scenes rather than cropped objects.
- Per-instance masks support tasks requiring precise boundaries, such as instance segmentation (see the sketch after this list).
- Large scale supports statistical robustness in algorithm comparisons.
- New annotation methods scale labeling to millions of instances efficiently.
- Shifts focus toward holistic scene understanding in computer vision research.
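To make the first two points concrete: per-instance masks are what let a model be scored on boundaries rather than boxes. A minimal sketch of decoding them with the pycocotools API, assuming the package is installed and an annotation file is on disk (the path below is a placeholder, not a file shipped with the paper):

```python
# Minimal sketch: decode per-instance segmentation masks via pycocotools.
# Assumes `pip install pycocotools`; the annotation path is a placeholder.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val.json")  # hypothetical local path

img_id = coco.getImgIds()[0]                   # any annotated image
ann_ids = coco.getAnnIds(imgIds=img_id, iscrowd=False)

for ann in coco.loadAnns(ann_ids):
    mask = coco.annToMask(ann)                 # HxW binary mask, one instance
    cat = coco.loadCats(ann["category_id"])[0]["name"]
    print(cat, "pixels:", int(mask.sum()), "bbox:", ann["bbox"])
```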
Where Pith is reading between the lines
- This could improve transfer to applications like video analysis or robotics where context matters.
- The dataset design may inspire similar collections for other modalities or languages.
- One might test whether models trained on COCO generalize better to unseen scenes than models trained on prior datasets.
Load-bearing premise
The novel crowd-sourcing interfaces produce sufficiently accurate and consistent per-instance labels at the reported scale without substantial systematic errors.
What would settle it
An independent audit comparing the provided labels against expert annotations on a sample of images: high disagreement rates would undermine the dataset's claimed utility for advancing reliable recognition.
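Such an audit reduces to a small computation once crowd and expert masks are paired per instance. A minimal sketch; the function names and the 0.5 IoU cutoff are illustrative choices, not values from the paper:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two HxW binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0  # empty masks agree

def disagreement_rate(crowd_masks, expert_masks, thresh=0.5):
    """Return (rate, per-instance IoUs) for paired crowd/expert masks.

    `crowd_masks` and `expert_masks` are parallel lists of binary masks
    for the same instances; `thresh` is an illustrative IoU cutoff.
    """
    ious = [mask_iou(c, e) for c, e in zip(crowd_masks, expert_masks)]
    rate = float(np.mean([iou < thresh for iou in ious])) if ious else 0.0
    return rate, ious
```

A high rate on a representative sample, especially one concentrated on small or occluded instances, is exactly the failure mode the load-bearing premise above rules out.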
Original abstract
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Microsoft COCO dataset: 328k images of complex everyday scenes with 2.5 million per-instance segmentations across 91 object categories recognizable by a 4-year-old. The central goal is to advance object recognition by embedding it in scene understanding, achieved via novel crowd-sourcing interfaces for category detection, instance spotting, and instance segmentation. The paper supplies statistical comparisons against PASCAL, ImageNet, and SUN plus baseline Deformable Parts Model results for bounding-box and segmentation detection.
Significance. If label quality holds, the dataset would be a major resource for the field: its scale, instance-level annotations in natural contexts, and explicit comparisons to prior collections provide a reproducible benchmark that can drive progress on precise localization and contextual recognition. The manuscript's strength lies in the independent construction and public release of the data together with parameter-free statistical characterizations.
Major comments (1)
- [Dataset construction / annotation interfaces] Section on dataset construction and crowd-sourcing interfaces (description of instance spotting and segmentation UIs): no inter-annotator agreement scores, no expert-vs-crowd IoU on a held-out subset, and no error analysis for boundary precision or category confusion are reported for the 2.5M instances. Because the central claim rests on these labels enabling precise localization in scene context, the missing quantitative validation is a load-bearing gap; systematic biases (e.g., under-segmentation of small or occluded objects) would directly undermine downstream utility.
Minor comments (2)
- [Baseline performance analysis] The baseline DPM results are presented mainly as reference points; a short discussion of why more contemporary detectors were not included would clarify the intended role of these numbers.
- [Statistical analysis] Tables and figures comparing statistics to PASCAL/ImageNet/SUN would benefit from explicit cross-references in the text and consistent axis scaling to ease reader comparison.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the dataset's potential impact and for the constructive feedback on annotation validation. We address the major comment below and will revise the manuscript to incorporate the requested quantitative analysis.
Point-by-point responses
Referee: [Dataset construction / annotation interfaces] Section on dataset construction and crowd-sourcing interfaces (description of instance spotting and segmentation UIs): no inter-annotator agreement scores, no expert-vs-crowd IoU on a held-out subset, and no error analysis for boundary precision or category confusion are reported for the 2.5M instances. Because the central claim rests on these labels enabling precise localization in scene context, the missing quantitative validation is a load-bearing gap; systematic biases (e.g., under-segmentation of small or occluded objects) would directly undermine downstream utility.
Authors: We agree that quantitative validation of the annotations is essential to support claims about precise localization in scene context. In the revised manuscript we will add a dedicated subsection to the dataset construction section that reports (1) inter-annotator agreement scores for both instance spotting and segmentation tasks, computed on a held-out set of images using multiple independent workers; (2) expert-versus-crowd IoU statistics on a separate held-out subset of 500 images annotated by computer-vision experts; and (3) a systematic error analysis that quantifies boundary-precision errors and category confusions, with explicit breakdowns for small and occluded objects. These additions will directly address concerns about potential systematic biases and strengthen the reproducibility of the benchmark.
Revision: yes
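For the promised small/large breakdown, one plausible shape of the analysis, sketched under the assumption that per-instance expert-vs-crowd IoUs and ground-truth mask areas are already in hand (the area bins mirror the small/medium/large convention of COCO-style evaluation and are illustrative, not the authors' code):

```python
import numpy as np

# Illustrative area bins (mask pixels), following the small/medium/large
# convention of COCO-style evaluation.
BINS = {"small": (0, 32**2), "medium": (32**2, 96**2), "large": (96**2, np.inf)}

def iou_by_size(ious, areas):
    """Group per-instance expert-vs-crowd IoUs by ground-truth mask area."""
    report = {}
    for name, (lo, hi) in BINS.items():
        sel = [iou for iou, a in zip(ious, areas) if lo <= a < hi]
        report[name] = (len(sel), float(np.mean(sel)) if sel else float("nan"))
    return report  # bin -> (instance count, mean IoU)
```

A mean IoU that drops sharply in the small bin would be the quantitative signature of the under-segmentation bias the referee flags.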
Circularity Check
No circularity: dataset construction and release paper with external benchmarks
Full rationale
The paper presents the COCO dataset, its collection process via crowd-sourcing interfaces, statistical comparisons to PASCAL/ImageNet/SUN, and baseline results using an off-the-shelf Deformable Parts Model. No derivations, equations, predictions, or first-principles claims exist that could reduce to fitted inputs, self-definitions, or self-citation chains. The contribution is the independent dataset itself; any questions of label accuracy fall under correctness risk rather than circularity per the analysis rules. No load-bearing reductions were identified.
Axiom & Free-Parameter Ledger
No entries: the contribution is a dataset, and the paper introduces no axioms or fitted free parameters to track.
Forward citations
Cited by 33 Pith papers
- Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
  Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
- Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks
  Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action u...
- Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
  XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
- Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
  Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
- MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
  MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
- Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
  Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
- Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
  Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
- MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer
  MAST is a mask-guided attention allocation method that enables artifact-free multi-style transfer in diffusion models by anchoring layout, distributing attention mass, scaling sharpness, and injecting details.
- WildDet3D: Scaling Promptable 3D Detection in the Wild
  WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
- Hierarchical Text-Conditional Image Generation with CLIP Latents
  A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
- High-Resolution Image Synthesis with Latent Diffusion Models
  Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...
- CASCADE: Context-Aware Relaxation for Speculative Image Decoding
  CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
- Semantics-Aware Hierarchical Token Communication: Clustering, Bit Mapping, and Power Allocation
  H-TokCom groups tokens by semantic similarity and protects cluster-level bits with higher power, raising semantic similarity from 0.206 to 0.279 at 3 dB SNR on COCO data.
- Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
  SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
- LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images
  LatentDiff scales semantic dataset comparison to millions of images using latent spaces of vision encoders combined with sparse autoencoders and density ratio estimation, showing better accuracy and robustness than ca...
- DeepSignature: Digitally Signed, Content-Encoding Watermarks for Robust and Transparent Image Authentication
  DeepSignature embeds digitally signed content-encoding watermarks via neural networks for robust image authentication, source attribution, and latent-space tamper localization.
- Source-Modality Monitoring in Vision-Language Models
  Vision-language models use semantic signals more than syntactic ones to bind words like 'image' to actual visual inputs, with implications for robustness in multimodal systems.
- MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
  MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
- MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
  MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
- Zero-shot World Models Are Developmentally Efficient Learners
  A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
- Bayesian Optimization for Mixed-Variable Problems in the Natural Sciences
  A generalization of probabilistic reparameterization allows gradient-based acquisition optimization in fully mixed-variable Bayesian optimization with Gaussian process surrogates for non-equidistant discrete spaces.
- DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
  DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.
- Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation
  PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.
- From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
  VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
- FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
  FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...
- DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
  DeCo-DETR constructs a hierarchical semantic prototype space from LVLM-generated descriptions aligned via CLIP and uses decoupled training streams to separate semantic reasoning from detection, yielding efficient open...
- Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
  Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.
- SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection
  A generative pipeline creates realistic synthetic pitting defects and other surface flaws that, when added to real training data, yield modest gains in industrial defect detectors without replacing the need for authen...
- No Pedestrian Left Behind: Real-Time Detection and Tracking of Vulnerable Road Users for Adaptive Traffic Signal Control
  NPLB combines YOLOv12 detection and ByteTrack tracking with an adaptive controller to extend pedestrian phases, cutting simulated stranding rates from 9.1% to 2.6% while extending signals in only 12.1% of cycles.
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
  SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...
- PaliGemma 2: A Family of Versatile VLMs for Transfer
  PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...
- PaliGemma: A versatile 3B VLM for transfer
  PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
- Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge
  The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.
Reference graph
Works this paper leans on
- [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in CVPR, 2009.
- [2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," IJCV, vol. 88, no. 2, pp. 303–338, Jun. 2010.
- [3] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in CVPR, 2010.
- [4] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," PAMI, vol. 34, 2012.
- [5] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
- [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014.
- [7] P. Sermanet, D. Eigen, S. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," in ICLR, April 2014.
- [8] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Describing objects by their attributes," in CVPR, 2009.
- [9] G. Patterson and J. Hays, "SUN attribute database: Discovering, annotating, and recognizing scene attributes," in CVPR, 2012.
- [10] L. Bourdev and J. Malik, "Poselets: Body part detectors trained using 3D human pose annotations," in ICCV, 2009.
- [11] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in ECCV, 2012.
- [12] S. Palmer, E. Rosch, and P. Chase, "Canonical perspective and the perception of objects," Attention and Performance IX, vol. 1, p. 4, 1981.
- [13] D. Hoiem, Y. Chodpathumwan, and Q. Dai, "Diagnosing error in object detectors," in ECCV, 2012.
- [14] G. Brostow, J. Fauqueur, and R. Cipolla, "Semantic object classes in video: A high-definition ground truth database," PRL, vol. 30, no. 2, pp. 88–97, 2009.
- [15] B. Russell, A. Torralba, K. Murphy, and W. Freeman, "LabelMe: a database and web-based tool for image annotation," IJCV, vol. 77, no. 1-3, pp. 157–173, 2008.
- [16] S. Bell, P. Upchurch, N. Snavely, and K. Bala, "OpenSurfaces: A richly annotated catalog of surface appearance," SIGGRAPH, vol. 32, no. 4, 2013.
- [17] V. Ordonez, G. Kulkarni, and T. Berg, "Im2text: Describing images using 1 million captioned photographs," in NIPS, 2011.
- [18] J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. Berg, and L. Fei-Fei, "Scalable multi-label annotation," in CHI, 2014.
- [19] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
- [20] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," IJCV, vol. 47, no. 1-3, pp. 7–42, 2002.
- [21] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, and R. Szeliski, "A database and evaluation methodology for optical flow," IJCV, vol. 92, no. 1, pp. 1–31, 2011.
- [22] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," in CVPR Workshop on Generative Model Based Vision (WGMBV), 2004.
- [23] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," California Institute of Technology, Tech. Rep. 7694, 2007.
- [24] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.
- [25] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits," 1998. [Online]. Available: http://yann.lecun.com/exdb/mnist/
- [26] S. A. Nene, S. K. Nayar, and H. Murase, "Columbia object image library (COIL-20)," Columbia University, Tech. Rep., 1996.
- [27] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Computer Science Department, University of Toronto, Tech. Rep., 2009.
- [28] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," PAMI, vol. 30, no. 11, pp. 1958–1970, 2008.
- [29] V. Ordonez, J. Deng, Y. Choi, A. Berg, and T. Berg, "From large scale image categorization to entry-level categories," in ICCV, 2013.
- [30] C. Fellbaum, WordNet: An Electronic Lexical Database. Blackwell Books, 1998.
- [31] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, "Caltech-UCSD Birds 200," Caltech, Tech. Rep. CNS-TR-201, 2010.
- [32] E. Hjelmås and B. Low, "Face detection: A survey," CVIU, vol. 83, no. 3, pp. 236–274, 2001.
- [33] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild," University of Massachusetts, Amherst, Tech. Rep. 07-49, October 2007.
- [34] O. Russakovsky, J. Deng, Z. Huang, A. Berg, and L. Fei-Fei, "Detecting avocados to zucchinis: what have we done, and where are we going?" in ICCV, 2013.
- [35] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context," IJCV, vol. 81, no. 1, pp. 2–23, 2009.
- [36] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, "A comparison and evaluation of multi-view stereo reconstruction algorithms," in CVPR, 2006.
- [37] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "Contour detection and hierarchical image segmentation," PAMI, vol. 33, no. 5, pp. 898–916, 2011.
- [38] C. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," in CVPR, 2009.
- [39] G. Heitz and D. Koller, "Learning spatial context: Using stuff to find things," in ECCV, 2008.
- [40]
- [41]
- [42] A. Torralba and A. Efros, "Unbiased look at dataset bias," in CVPR, 2011.
- [43] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid, "Evaluation of gist descriptors for web-scale image search," in CIVR, 2009.
- [44] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
- [45] R. Girshick, P. Felzenszwalb, and D. McAllester, "Discriminatively trained deformable part models, release 5," PAMI, 2012.
- [46] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes, "Do we need more training data or better models for object detection?" in BMVC, 2012.
- [47] T. Brox, L. Bourdev, S. Maji, and J. Malik, "Object segmentation by alignment of poselet activations to image contours," in CVPR, 2011.
- [48] Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes, "Layered object models for image segmentation," PAMI, vol. 34, no. 9, pp. 1731–1743, 2012.
- [49] D. Ramanan, "Using segmentation to verify object hypotheses," in CVPR, 2007.
- [50] Q. Dai and D. Hoiem, "Learning to localize detected objects," in CVPR, 2012.
- [51] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, "Collecting image annotations using Amazon's Mechanical Turk," in NAACL Workshop, 2010.