Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays · 2014 · cs.CV · arXiv 1405.0312

Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.

71 Pith papers citing it

abstract

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

citation-role summary

dataset 8 · background 5 · baseline 1 · method 1

citation-polarity summary

use dataset 8 · background 5 · baseline 1 · use method 1

representative citing papers

Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior

cs.LG · 2026-06-07 · unverdicted · novelty 7.0

INNSteer learns an invertible neural network to map LLM activations into a latent space where linear steering becomes more effective, then applies the inverse map to produce nonlinear interventions in the original space.

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

PlanBench-V is a new benchmark and dataset for evaluating VLMs on spatial planning map interpretation via a four-stage framework of Perception, Reasoning, Association, and Implementation.

FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

FindIt is the first comprehensive benchmark for evaluating generalist MLLMs on promptable object detection, referring expression detection, instance-level detection, and video detection with standardized parsable outputs.

Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.

Structure over Pixels: Learning Variable-Length Visual Programs

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

STROP learns variable-length discrete visual programs for images by training a length head against frozen DINOv3 features in a four-phase curriculum while bypassing pixel reconstruction.

Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.

Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.

Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.

Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

cs.CV · 2026-04-15 · conditional · novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

MAST is a mask-guided attention allocation method that enables artifact-free multi-style transfer in diffusion models by anchoring layout, distributing attention mass, scaling sharpness, and injecting details.

WildDet3D: Scaling Promptable 3D Detection in the Wild

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

Descriptor: Parasitoid Wasps and Associated Hymenoptera Dataset (DAPWH)

cs.CV · 2026-02-20 · unverdicted · novelty 7.0

Releases the DAPWH dataset of 3556 wasp images including 1739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.

Event-based Civil Infrastructure Visual Defect Detection: ev-CIVIL Dataset and Benchmark

cs.CV · 2025-04-08 · unverdicted · novelty 7.0

Presents the ev-CIVIL dataset and benchmark showing that event-based cameras can support real-time detection of cracks and spalling in civil infrastructure under challenging lighting.

PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

cs.CV · 2025-01-07 · unverdicted · novelty 7.0

PromptGuard optimizes a universal safety soft prompt (and category-specific variants) in T2I embedding space to moderate NSFW inputs, achieving average unsafe ratios of 5.84-6.18% while being 3.8x faster than prior defenses.

Hierarchical Text-Conditional Image Generation with CLIP Latents

cs.CV · 2022-04-13 · accept · novelty 7.0

A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

High-Resolution Image Synthesis with Latent Diffusion Models

cs.CV · 2021-12-20 · conditional · novelty 7.0

Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while cutting computational and

U4D: Unsupervised 4D Dynamic Scene Understanding

cs.CV · 2019-07-23 · unverdicted · novelty 7.0

Unsupervised joint semantic instance segmentation, 4D reconstruction, and scene flow from multi-view video of multi-person dynamic scenes, with reported ~40% gains over prior methods.

Unmasking LAION-5B: Age, Gender, Race, and Emotion Biases in Large-Scale Image Datasets

cs.CV · 2026-06-22 · unverdicted · novelty 6.0

Empirical audit of LAION-2B-en and LAION-2B-multi finds overrepresentation of young adults, White people, and males plus stereotypical emotion associations across two attribute classifiers.

Time-frequency localization of bird calls in dense soundscapes

cs.SD · 2026-06-09 · unverdicted · novelty 6.0

YOLO11 models localize bird vocalizations on spectrograms, nearly doubling IoMin@50 F1 on Singapore data and outperforming baselines on Hawaii recordings.

Maximum Matching Accuracy: An Instance Segmentation Evaluation Metric Utilizing Globally Optimal Matching

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

MMA is a threshold-free continuous metric for instance segmentation that uses globally optimal bipartite matching between predictions and ground truth followed by per-pixel normalization to aggregate overlap.

Beyond Compression: Quantifying Spectral Accessibility in Vision Representations

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.

citing papers explorer

DeepSignature: Digitally Signed, Content-Encoding Watermarks for Robust and Transparent Image Authentication

cs.CR · 2026-04-24 · unverdicted · none · ref 36 · internal anchor

DeepSignature embeds digitally signed content-encoding watermarks via neural networks for robust image authentication, source attribution, and latent-space tamper localization.
