Recognition: no theorem link
Microsoft COCO: Common Objects in Context
Pith reviewed 2026-05-11 23:56 UTC · model grok-4.3
The pith
A dataset of everyday scenes with per-instance segmentations advances object recognition in context.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a new dataset with the goal of advancing the state of the art in object recognition by placing object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid precise object localization. The dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, its creation drew upon extensive crowd-worker involvement via novel user interfaces for category detection, instance spotting, and instance segmentation.
What carries the argument
The per-instance segmentation labels on images of common objects in natural scenes, enabled by crowd-sourcing interfaces for detection, spotting, and segmentation.
If this is right
- Object recognition models can be evaluated and trained on contextual scenes rather than cropped objects.
- Per-instance masks support tasks requiring precise boundaries, such as instance segmentation (see the sketch after this list).
- Large scale supports statistical robustness in algorithm comparisons.
- New annotation methods scale labeling to millions of instances efficiently.
- Shifts focus toward holistic scene understanding in computer vision research.
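To make the first two points concrete: per-instance masks are what let a model be scored on boundaries rather than boxes. A minimal sketch of decoding them with the pycocotools API, assuming the package is installed and an annotation file is on disk (the path below is a placeholder, not a file shipped with the paper):

```python
# Minimal sketch: decode per-instance segmentation masks via pycocotools.
# Assumes `pip install pycocotools`; the annotation path is a placeholder.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val.json")  # hypothetical local path

img_id = coco.getImgIds()[0]                   # any annotated image
ann_ids = coco.getAnnIds(imgIds=img_id, iscrowd=False)

for ann in coco.loadAnns(ann_ids):
    mask = coco.annToMask(ann)                 # HxW binary mask, one instance
    cat = coco.loadCats(ann["category_id"])[0]["name"]
    print(cat, "pixels:", int(mask.sum()), "bbox:", ann["bbox"])
```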
Where Pith is reading between the lines
- This could improve transfer to applications like video analysis or robotics where context matters.
- The dataset design may inspire similar collections for other modalities or languages.
- One might test whether models trained on COCO generalize better to unseen scenes than models trained on prior datasets.
Load-bearing premise
The novel crowd-sourcing interfaces produce sufficiently accurate and consistent per-instance labels at the reported scale without substantial systematic errors.
What would settle it
An independent audit comparing the provided labels against expert annotations on a sample of images: high disagreement rates would undermine the dataset's claimed utility for advancing reliable recognition.
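Such an audit reduces to a small computation once crowd and expert masks are paired per instance. A minimal sketch; the function names and the 0.5 IoU cutoff are illustrative choices, not values from the paper:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two HxW binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0  # empty masks agree

def disagreement_rate(crowd_masks, expert_masks, thresh=0.5):
    """Return (rate, per-instance IoUs) for paired crowd/expert masks.

    `crowd_masks` and `expert_masks` are parallel lists of binary masks
    for the same instances; `thresh` is an illustrative IoU cutoff.
    """
    ious = [mask_iou(c, e) for c, e in zip(crowd_masks, expert_masks)]
    rate = float(np.mean([iou < thresh for iou in ious])) if ious else 0.0
    return rate, ious
```

A high rate on a representative sample, especially one concentrated on small or occluded instances, is exactly the failure mode the load-bearing premise above rules out.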
Original abstract
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the Microsoft COCO dataset: 328k images of complex everyday scenes with 2.5 million per-instance segmentations across 91 object categories recognizable by a 4-year-old. The central goal is to advance object recognition by embedding it in scene understanding, achieved via novel crowd-sourcing interfaces for category detection, instance spotting, and instance segmentation. The paper supplies statistical comparisons against PASCAL, ImageNet, and SUN plus baseline Deformable Parts Model results for bounding-box and segmentation detection.
Significance. If label quality holds, the dataset would be a major resource for the field: its scale, instance-level annotations in natural contexts, and explicit comparisons to prior collections provide a reproducible benchmark that can drive progress on precise localization and contextual recognition. The manuscript's strength lies in the independent construction and public release of the data together with parameter-free statistical characterizations.
Major comments (1)
- [Dataset construction / annotation interfaces] Section on dataset construction and crowd-sourcing interfaces (description of instance spotting and segmentation UIs): no inter-annotator agreement scores, no expert-vs-crowd IoU on a held-out subset, and no error analysis for boundary precision or category confusion are reported for the 2.5M instances. Because the central claim rests on these labels enabling precise localization in scene context, the missing quantitative validation is a load-bearing gap; systematic biases (e.g., under-segmentation of small or occluded objects) would directly undermine downstream utility.
Minor comments (2)
- [Baseline performance analysis] The baseline DPM results are presented mainly as reference points; a short discussion of why more contemporary detectors were not included would clarify the intended role of these numbers.
- [Statistical analysis] Tables and figures comparing statistics to PASCAL/ImageNet/SUN would benefit from explicit cross-references in the text and consistent axis scaling to ease reader comparison.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the dataset's potential impact and for the constructive feedback on annotation validation. We address the major comment below and will revise the manuscript to incorporate the requested quantitative analysis.
Point-by-point responses
Referee: [Dataset construction / annotation interfaces] Section on dataset construction and crowd-sourcing interfaces (description of instance spotting and segmentation UIs): no inter-annotator agreement scores, no expert-vs-crowd IoU on a held-out subset, and no error analysis for boundary precision or category confusion are reported for the 2.5M instances. Because the central claim rests on these labels enabling precise localization in scene context, the missing quantitative validation is a load-bearing gap; systematic biases (e.g., under-segmentation of small or occluded objects) would directly undermine downstream utility.
Authors: We agree that quantitative validation of the annotations is essential to support claims about precise localization in scene context. In the revised manuscript we will add a dedicated subsection to the dataset construction section that reports (1) inter-annotator agreement scores for both instance spotting and segmentation tasks, computed on a held-out set of images using multiple independent workers; (2) expert-versus-crowd IoU statistics on a separate held-out subset of 500 images annotated by computer-vision experts; and (3) a systematic error analysis that quantifies boundary-precision errors and category confusions, with explicit breakdowns for small and occluded objects. These additions will directly address concerns about potential systematic biases and strengthen the reproducibility of the benchmark.
Revision: yes
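For the promised small/large breakdown, one plausible shape of the analysis, sketched under the assumption that per-instance expert-vs-crowd IoUs and ground-truth mask areas are already in hand (the area bins mirror the small/medium/large convention of COCO-style evaluation and are illustrative, not the authors' code):

```python
import numpy as np

# Illustrative area bins (mask pixels), following the small/medium/large
# convention of COCO-style evaluation.
BINS = {"small": (0, 32**2), "medium": (32**2, 96**2), "large": (96**2, np.inf)}

def iou_by_size(ious, areas):
    """Group per-instance expert-vs-crowd IoUs by ground-truth mask area."""
    report = {}
    for name, (lo, hi) in BINS.items():
        sel = [iou for iou, a in zip(ious, areas) if lo <= a < hi]
        report[name] = (len(sel), float(np.mean(sel)) if sel else float("nan"))
    return report  # bin -> (instance count, mean IoU)
```

A mean IoU that drops sharply in the small bin would be the quantitative signature of the under-segmentation bias the referee flags.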
Circularity Check
No circularity: dataset construction and release paper with external benchmarks
Full rationale
The paper presents the COCO dataset, its collection process via crowd-sourcing interfaces, statistical comparisons to PASCAL/ImageNet/SUN, and baseline results using an off-the-shelf Deformable Parts Model. No derivations, equations, predictions, or first-principles claims exist that could reduce to fitted inputs, self-definitions, or self-citation chains. The contribution is the independent dataset itself; any questions of label accuracy fall under correctness risk rather than circularity per the analysis rules. No load-bearing reductions were identified.
Axiom & Free-Parameter Ledger
No entries: the contribution is a dataset, and the paper introduces no axioms or fitted free parameters to track.
Forward citations
Cited by 33 Pith papers
- Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
  Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
- Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks
  Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action u...
- Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
  XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
- Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
  Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
- MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
  MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
- Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
  Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
- Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
  Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
- MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer
  MAST is a mask-guided attention allocation method that enables artifact-free multi-style transfer in diffusion models by anchoring layout, distributing attention mass, scaling sharpness, and injecting details.
- WildDet3D: Scaling Promptable 3D Detection in the Wild
  WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
- Hierarchical Text-Conditional Image Generation with CLIP Latents
  A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
- High-Resolution Image Synthesis with Latent Diffusion Models
  Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...
- CASCADE: Context-Aware Relaxation for Speculative Image Decoding
  CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
- Semantics-Aware Hierarchical Token Communication: Clustering, Bit Mapping, and Power Allocation
  H-TokCom groups tokens by semantic similarity and protects cluster-level bits with higher power, raising semantic similarity from 0.206 to 0.279 at 3 dB SNR on COCO data.
- Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
  SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
- LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images
  LatentDiff scales semantic dataset comparison to millions of images using latent spaces of vision encoders combined with sparse autoencoders and density ratio estimation, showing better accuracy and robustness than ca...
- DeepSignature: Digitally Signed, Content-Encoding Watermarks for Robust and Transparent Image Authentication
  DeepSignature embeds digitally signed content-encoding watermarks via neural networks for robust image authentication, source attribution, and latent-space tamper localization.
- Source-Modality Monitoring in Vision-Language Models
  Vision-language models use semantic signals more than syntactic ones to bind words like 'image' to actual visual inputs, with implications for robustness in multimodal systems.
- MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
  MACS improves MoE MLLM inference efficiency via entropy-weighted token loads and dynamic modality-adaptive expert capacity allocation.
- MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
  MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
- Zero-shot World Models Are Developmentally Efficient Learners
  A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
- Bayesian Optimization for Mixed-Variable Problems in the Natural Sciences
  A generalization of probabilistic reparameterization allows gradient-based acquisition optimization in fully mixed-variable Bayesian optimization with Gaussian process surrogates for non-equidistant discrete spaces.
- DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
  DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.
- Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation
  PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.
- From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
  VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
- FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
  FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...
- DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
  DeCo-DETR constructs a hierarchical semantic prototype space from LVLM-generated descriptions aligned via CLIP and uses decoupled training streams to separate semantic reasoning from detection, yielding efficient open...
- Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
  Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.
- SynSur: An end-to-end generative pipeline for synthetic industrial surface defect generation and detection
  A generative pipeline creates realistic synthetic pitting defects and other surface flaws that, when added to real training data, yield modest gains in industrial defect detectors without replacing the need for authen...
- No Pedestrian Left Behind: Real-Time Detection and Tracking of Vulnerable Road Users for Adaptive Traffic Signal Control
  NPLB combines YOLOv12 detection and ByteTrack tracking with an adaptive controller to extend pedestrian phases, cutting simulated stranding rates from 9.1% to 2.6% while extending signals in only 12.1% of cycles.
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
  SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilin...
- PaliGemma 2: A Family of Versatile VLMs for Transfer
  PaliGemma 2 is a family of vision-language models that achieves state-of-the-art results on transfer tasks like table structure recognition and radiography report generation by combining SigLIP with Gemma 2 models at ...
- PaliGemma: A versatile 3B VLM for transfer
  PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
- Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge
  The 2025 LPCVC winners demonstrate practical techniques for low-power image classification under varied conditions, open-vocabulary segmentation from text prompts, and monocular depth estimation.
Reference graph
Works this paper leans on
- [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in CVPR, 2009.
- [2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," IJCV, vol. 88, no. 2, pp. 303–338, Jun. 2010.
- [3] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in CVPR, 2010.
- [4] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," PAMI, vol. 34, 2012.
- [5] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, 2012.
- [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014.
- [7] P. Sermanet, D. Eigen, S. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," in ICLR, April 2014.
- [8] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Describing objects by their attributes," in CVPR, 2009.
- [9] G. Patterson and J. Hays, "SUN attribute database: Discovering, annotating, and recognizing scene attributes," in CVPR, 2012.
- [10] L. Bourdev and J. Malik, "Poselets: Body part detectors trained using 3D human pose annotations," in ICCV, 2009.
- [11] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in ECCV, 2012.
- [12] S. Palmer, E. Rosch, and P. Chase, "Canonical perspective and the perception of objects," Attention and Performance IX, vol. 1, p. 4, 1981.
- [13] D. Hoiem, Y. Chodpathumwan, and Q. Dai, "Diagnosing error in object detectors," in ECCV, 2012.
- [14] G. Brostow, J. Fauqueur, and R. Cipolla, "Semantic object classes in video: A high-definition ground truth database," PRL, vol. 30, no. 2, pp. 88–97, 2009.
- [15] B. Russell, A. Torralba, K. Murphy, and W. Freeman, "LabelMe: a database and web-based tool for image annotation," IJCV, vol. 77, no. 1-3, pp. 157–173, 2008.
- [16] S. Bell, P. Upchurch, N. Snavely, and K. Bala, "OpenSurfaces: A richly annotated catalog of surface appearance," SIGGRAPH, vol. 32, no. 4, 2013.
- [17] V. Ordonez, G. Kulkarni, and T. Berg, "Im2text: Describing images using 1 million captioned photographs," in NIPS, 2011.
- [18] J. Deng, O. Russakovsky, J. Krause, M. Bernstein, A. Berg, and L. Fei-Fei, "Scalable multi-label annotation," in CHI, 2014.
- [19] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in ECCV, 2014.
- [20] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," IJCV, vol. 47, no. 1-3, pp. 7–42, 2002.
- [21] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, and R. Szeliski, "A database and evaluation methodology for optical flow," IJCV, vol. 92, no. 1, pp. 1–31, 2011.
- [22] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," in CVPR Workshop on Generative Model Based Vision (WGMBV), 2004.
- [23] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," California Institute of Technology, Tech. Rep. 7694, 2007.
- [24] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.
- [25] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits," 1998. [Online]. Available: http://yann.lecun.com/exdb/mnist/
- [26] S. A. Nene, S. K. Nayar, and H. Murase, "Columbia object image library (COIL-20)," Columbia University, Tech. Rep., 1996.
- [27] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Computer Science Department, University of Toronto, Tech. Rep., 2009.
- [28] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," PAMI, vol. 30, no. 11, pp. 1958–1970, 2008.
- [29] V. Ordonez, J. Deng, Y. Choi, A. Berg, and T. Berg, "From large scale image categorization to entry-level categories," in ICCV, 2013.
- [30] C. Fellbaum, WordNet: An Electronic Lexical Database. Blackwell Books, 1998.
- [31] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, "Caltech-UCSD Birds 200," Caltech, Tech. Rep. CNS-TR-201, 2010.
- [32] E. Hjelmås and B. Low, "Face detection: A survey," CVIU, vol. 83, no. 3, pp. 236–274, 2001.
- [33] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild," University of Massachusetts, Amherst, Tech. Rep. 07-49, October 2007.
- [34] O. Russakovsky, J. Deng, Z. Huang, A. Berg, and L. Fei-Fei, "Detecting avocados to zucchinis: what have we done, and where are we going?" in ICCV, 2013.
- [35] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context," IJCV, vol. 81, no. 1, pp. 2–23, 2009.
- [36] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, "A comparison and evaluation of multi-view stereo reconstruction algorithms," in CVPR, 2006.
- [37] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "Contour detection and hierarchical image segmentation," PAMI, vol. 33, no. 5, pp. 898–916, 2011.
- [38] C. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," in CVPR, 2009.
- [39] G. Heitz and D. Koller, "Learning spatial context: Using stuff to find things," in ECCV, 2008.
- [40]
- [41]
- [42] A. Torralba and A. Efros, "Unbiased look at dataset bias," in CVPR, 2011.
- [43] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, and C. Schmid, "Evaluation of gist descriptors for web-scale image search," in CIVR, 2009.
- [44] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
- [45] R. Girshick, P. Felzenszwalb, and D. McAllester, "Discriminatively trained deformable part models, release 5," PAMI, 2012.
- [46] X. Zhu, C. Vondrick, D. Ramanan, and C. Fowlkes, "Do we need more training data or better models for object detection?" in BMVC, 2012.
- [47] T. Brox, L. Bourdev, S. Maji, and J. Malik, "Object segmentation by alignment of poselet activations to image contours," in CVPR, 2011.
- [48] Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes, "Layered object models for image segmentation," PAMI, vol. 34, no. 9, pp. 1731–1743, 2012.
- [49] D. Ramanan, "Using segmentation to verify object hypotheses," in CVPR, 2007.
- [50] Q. Dai and D. Hoiem, "Learning to localize detected objects," in CVPR, 2012.
- [51] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, "Collecting image annotations using Amazon's Mechanical Turk," in NAACL Workshop, 2010.