INNSteer learns an invertible neural network to map LLM activations into a latent space where linear steering becomes more effective, then applies the inverse map to produce nonlinear interventions in the original space.
hub Baseline reference
Microsoft COCO: Common Objects in Context
Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.
abstract
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
PlanBench-V is a new benchmark and dataset for evaluating VLMs on spatial planning map interpretation via a four-stage framework of Perception, Reasoning, Association, and Implementation.
FindIt is the first comprehensive benchmark for evaluating generalist MLLMs on promptable object detection, referring expression detection, instance-level detection, and video detection with standardized parsable outputs.
CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.
STROP learns variable-length discrete visual programs for images by training a length head against frozen DINOv3 features in a four-phase curriculum while bypassing pixel reconstruction.
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
MAST is a mask-guided attention allocation method that enables artifact-free multi-style transfer in diffusion models by anchoring layout, distributing attention mass, scaling sharpness, and injecting details.
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
Releases the DAPWH dataset of 3556 wasp images including 1739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.
Presents the ev-CIVIL dataset and benchmark showing that event-based cameras can support real-time detection of cracks and spalling in civil infrastructure under challenging lighting.
PromptGuard optimizes a universal safety soft prompt (and category-specific variants) in T2I embedding space to moderate NSFW inputs, achieving average unsafe ratios of 5.84-6.18% while being 3.8x faster than prior defenses.
A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while cutting computational and
Unsupervised joint semantic instance segmentation, 4D reconstruction, and scene flow from multi-view video of multi-person dynamic scenes, with reported ~40% gains over prior methods.
Empirical audit of LAION-2B-en and LAION-2B-multi finds overrepresentation of young adults, White people, and males plus stereotypical emotion associations across two attribute classifiers.
YOLO11 models localize bird vocalizations on spectrograms, nearly doubling IoMin@50 F1 on Singapore data and outperforming baselines on Hawaii recordings.
MMA is a threshold-free continuous metric for instance segmentation that uses globally optimal bipartite matching between predictions and ground truth followed by per-pixel normalization to aggregate overlap.
Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.
citing papers explorer
-
Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior
INNSteer learns an invertible neural network to map LLM activations into a latent space where linear steering becomes more effective, then applies the inverse map to produce nonlinear interventions in the original space.
-
PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models
PlanBench-V is a new benchmark and dataset for evaluating VLMs on spatial planning map interpretation via a four-stage framework of Perception, Reasoning, Association, and Implementation.
-
FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs
FindIt is the first comprehensive benchmark for evaluating generalist MLLMs on promptable object detection, referring expression detection, instance-level detection, and video detection with standardized parsable outputs.
-
Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation
CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.
-
Structure over Pixels: Learning Variable-Length Visual Programs
STROP learns variable-length discrete visual programs for images by training a length head against frozen DINOv3 features in a four-phase curriculum while bypassing pixel reconstruction.
-
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
-
Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.
-
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.
-
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
-
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
-
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
-
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
-
MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer
MAST is a mask-guided attention allocation method that enables artifact-free multi-style transfer in diffusion models by anchoring layout, distributing attention mass, scaling sharpness, and injecting details.
-
WildDet3D: Scaling Promptable 3D Detection in the Wild
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
-
Descriptor: Parasitoid Wasps and Associated Hymenoptera Dataset (DAPWH)
Releases the DAPWH dataset of 3556 wasp images including 1739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.
-
Event-based Civil Infrastructure Visual Defect Detection: ev-CIVIL Dataset and Benchmark
Presents the ev-CIVIL dataset and benchmark showing that event-based cameras can support real-time detection of cracks and spalling in civil infrastructure under challenging lighting.
-
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
PromptGuard optimizes a universal safety soft prompt (and category-specific variants) in T2I embedding space to moderate NSFW inputs, achieving average unsafe ratios of 5.84-6.18% while being 3.8x faster than prior defenses.
-
Hierarchical Text-Conditional Image Generation with CLIP Latents
A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
-
High-Resolution Image Synthesis with Latent Diffusion Models
Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while cutting computational and
-
U4D: Unsupervised 4D Dynamic Scene Understanding
Unsupervised joint semantic instance segmentation, 4D reconstruction, and scene flow from multi-view video of multi-person dynamic scenes, with reported ~40% gains over prior methods.
-
Unmasking LAION-5B: Age, Gender, Race, and Emotion Biases in Large-Scale Image Datasets
Empirical audit of LAION-2B-en and LAION-2B-multi finds overrepresentation of young adults, White people, and males plus stereotypical emotion associations across two attribute classifiers.
-
Time-frequency localization of bird calls in dense soundscapes
YOLO11 models localize bird vocalizations on spectrograms, nearly doubling IoMin@50 F1 on Singapore data and outperforming baselines on Hawaii recordings.
-
Maximum Matching Accuracy: An Instance Segmentation Evaluation Metric Utilizing Globally Optimal Matching
MMA is a threshold-free continuous metric for instance segmentation that uses globally optimal bipartite matching between predictions and ground truth followed by per-pixel normalization to aggregate overlap.
-
Beyond Compression: Quantifying Spectral Accessibility in Vision Representations
Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.
-
VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing
VLA-Trace diagnoses two VLA models via representation CKA, attention interventions, and behavioral tests, finding distinct finetuning dynamics, different routing, and strong visual grounding but weak fine-grained semantic following.
-
Trajectory-Consistent Calibration for Cache-Accelerated Diffusion Models
TCC calibrates cached representations in diffusion sampling via an offline iterative procedure that accounts for trajectory shifts, improving FID from 29.83 to 27.35 on PixArt-alpha while preserving reuse policies.
-
Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding
A foveated VLM trained for scene comprehension produces human-like fixations, outperforming models trained for search, classification, or with altered peripheral vision.
-
SparseSAM: Structured Sparsification of Activations in Segment Anything Models
SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.
-
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.
-
Semantics-Aware Hierarchical Token Communication: Clustering, Bit Mapping, and Power Allocation
H-TokCom groups tokens by semantic similarity and protects cluster-level bits with higher power, raising semantic similarity from 0.206 to 0.279 at 3 dB SNR on COCO data.
-
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
-
LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images
LatentDiff scales semantic dataset comparison to millions of images using latent spaces of vision encoders combined with sparse autoencoders and density ratio estimation, showing better accuracy and robustness than caption-based approaches on a new benchmark for sparse distribution shifts.
-
DeepSignature: Digitally Signed, Content-Encoding Watermarks for Robust and Transparent Image Authentication
DeepSignature embeds digitally signed content-encoding watermarks via neural networks for robust image authentication, source attribution, and latent-space tamper localization.
-
Source-Modality Monitoring in Vision-Language Models
Vision-language models use semantic signals more than syntactic ones to bind words like 'image' to actual visual inputs, with implications for robustness in multimodal systems.
-
Zero-shot World Models Are Developmentally Efficient Learners
A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
-
Bayesian Optimization for Mixed-Variable Problems in the Natural Sciences
A generalization of probabilistic reparameterization allows gradient-based acquisition optimization in fully mixed-variable Bayesian optimization with Gaussian process surrogates for non-equidistant discrete spaces.
-
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.
-
Chasing Ghosts: A Simulation-to-Real Olfactory Navigation Stack with Optional Vision Augmentation
A simulation-to-real navigation policy enables a quadrotor to locate an odor source using only basic olfaction sensors and optional vision, validated in indoor real-world flights.
-
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
RUDDER creates a persistent visual anchor by extracting CARD from prefill residuals and modulating its injection via an adaptive Beta Gate, cutting CHAIR_S by 24.4% and CHAIR_i by 23.6% on average across LLaVA, Idefics2, InstructBLIP and Qwen2.5-VL with >96% throughput.
-
Routing-Based Continual Learning for Multimodal Large Language Models
Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.
-
Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?
The ITW-SM dataset and targeted optimization of detector design choices yield a 26.87% average AUC improvement for state-of-the-art AI-generated image detectors under real-world social media conditions.
-
Expressive yet Efficient Feature Expansion with Adaptive Cross-Hadamard Products
Proposes ACH module with differentiable sampling and softsign normalization for efficient feature expansion, integrated via NAS into Hadaptive-Net to claim SOTA accuracy/speed trade-offs on image classification.
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
-
Informative Image Captioning with External Sources of Information
A multimodal Transformer ingests image features plus multiple external entity label sources and learns to control their appearance in fluent output captions.
-
On Physical Adversarial Patches for Object Detection
A physical patch suppresses all object detections by YOLOv3 even for distant objects without overlapping them.
-
TuringViT: Making SOTA Vision Transformers Accessible to All
TuringViT claims a new ViT design with linear attention and curated data that matches SOTA performance using 10% of typical pretraining data while supporting dynamic resolutions and improving VLM integration.
-
Accelerating Vision Foundation Models with Drop-in Depthwise Convolution
Replacing selected attention heads in pretrained ViTs with depthwise convolutions, identified by simple strategies and recovered via fine-tuning, delivers 17-20% inference speedup on image tasks with minimal accuracy loss.
-
Replacement Learning: Training Neural Networks with Fewer Parameters
Replacement Learning replaces selected blocks in CNNs and ViTs with learnable parameter-fusion surrogates derived from adjacent layers to reduce full-depth backpropagation redundancy.
-
KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy
KVCapsule compresses KV cache in VLMs by 60% to deliver up to 2x higher tokens-per-second and 2.4x memory reduction with negligible accuracy loss.
-
Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation
PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.