Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays · 2014 · cs.CV · arXiv 1405.0312

Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.

70 Pith papers citing it

Baseline 60% of classified citations

abstract

We present a new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting, and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

citation-role summary

dataset 8 · background 5 · baseline 1 · method 1

citation-polarity summary

use dataset 8 · background 5 · baseline 1 · use method 1

representative citing papers

Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior

cs.LG · 2026-06-07 · unverdicted · novelty 7.0

INNSteer learns an invertible neural network to map LLM activations into a latent space where linear steering becomes more effective, then applies the inverse map to produce nonlinear interventions in the original space.

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

PlanBench-V is a new benchmark and dataset for evaluating VLMs on spatial planning map interpretation via a four-stage framework of Perception, Reasoning, Association, and Implementation.

FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

FindIt is the first comprehensive benchmark for evaluating generalist MLLMs on promptable object detection, referring expression detection, instance-level detection, and video detection with standardized parsable outputs.

Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.

Structure over Pixels: Learning Variable-Length Visual Programs

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

STROP learns variable-length discrete visual programs for images by training a length head against frozen DINOv3 features in a four-phase curriculum while bypassing pixel reconstruction.

Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.

Pro²Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.

Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

cs.CV · 2026-04-16 · unverdicted · novelty 7.0

Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.

Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

cs.CV · 2026-04-15 · conditional · novelty 7.0

Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

MAST is a mask-guided attention allocation method that enables artifact-free multi-style transfer in diffusion models by anchoring layout, distributing attention mass, scaling sharpness, and injecting details.

WildDet3D: Scaling Promptable 3D Detection in the Wild

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

Descriptor: Parasitoid Wasps and Associated Hymenoptera Dataset (DAPWH)

cs.CV · 2026-02-20 · unverdicted · novelty 7.0

Releases the DAPWH dataset of 3556 wasp images including 1739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.

Event-based Civil Infrastructure Visual Defect Detection: ev-CIVIL Dataset and Benchmark

cs.CV · 2025-04-08 · unverdicted · novelty 7.0

Presents the ev-CIVIL dataset and benchmark showing that event-based cameras can support real-time detection of cracks and spalling in civil infrastructure under challenging lighting.

PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

cs.CV · 2025-01-07 · unverdicted · novelty 7.0

PromptGuard optimizes a universal safety soft prompt (and category-specific variants) in T2I embedding space to moderate NSFW inputs, achieving average unsafe ratios of 5.84-6.18% while being 3.8x faster than prior defenses.

Hierarchical Text-Conditional Image Generation with CLIP Latents

cs.CV · 2022-04-13 · accept · novelty 7.0

A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

High-Resolution Image Synthesis with Latent Diffusion Models

cs.CV · 2021-12-20 · conditional · novelty 7.0

Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while cutting computational and

U4D: Unsupervised 4D Dynamic Scene Understanding

cs.CV · 2019-07-23 · unverdicted · novelty 7.0

Unsupervised joint semantic instance segmentation, 4D reconstruction, and scene flow from multi-view video of multi-person dynamic scenes, with reported ~40% gains over prior methods.

Unmasking LAION-5B: Age, Gender, Race, and Emotion Biases in Large-Scale Image Datasets

cs.CV · 2026-06-22 · unverdicted · novelty 6.0

Empirical audit of LAION-2B-en and LAION-2B-multi finds overrepresentation of young adults, White people, and males plus stereotypical emotion associations across two attribute classifiers.

Time-frequency localization of bird calls in dense soundscapes

cs.SD · 2026-06-09 · unverdicted · novelty 6.0

YOLO11 models localize bird vocalizations on spectrograms, nearly doubling IoMin@50 F1 on Singapore data and outperforming baselines on Hawaii recordings.

Maximum Matching Accuracy: An Instance Segmentation Evaluation Metric Utilizing Globally Optimal Matching

cs.CV · 2026-06-08 · unverdicted · novelty 6.0

MMA is a threshold-free continuous metric for instance segmentation that uses globally optimal bipartite matching between predictions and ground truth followed by per-pixel normalization to aggregate overlap.

Beyond Compression: Quantifying Spectral Accessibility in Vision Representations

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.

citing papers explorer

Showing 50 of 70 citing papers.

Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior cs.LG · 2026-06-07 · unverdicted · none · ref 14 · internal anchor
INNSteer learns an invertible neural network to map LLM activations into a latent space where linear steering becomes more effective, then applies the inverse map to produce nonlinear interventions in the original space.
PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models cs.CL · 2026-06-04 · unverdicted · none · ref 16 · internal anchor
PlanBench-V is a new benchmark and dataset for evaluating VLMs on spatial planning map interpretation via a four-stage framework of Perception, Reasoning, Association, and Implementation.
FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs cs.CV · 2026-06-02 · unverdicted · none · ref 19 · internal anchor
FindIt is the first comprehensive benchmark for evaluating generalist MLLMs on promptable object detection, referring expression detection, instance-level detection, and video detection with standardized parsable outputs.
Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation cs.CL · 2026-05-27 · unverdicted · none · ref 17 · internal anchor
CAS mitigates object hallucinations in MLLMs by extracting two context preference vectors from designed conflict samples and applying signed residual injection at mid-early MLP layers without retraining or added latency.
Structure over Pixels: Learning Variable-Length Visual Programs cs.CV · 2026-05-26 · unverdicted · none · ref 51 · internal anchor
STROP learns variable-length discrete visual programs for images by training a length head against frozen DINOv3 features in a four-phase curriculum while bypassing pixel reconstruction.
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models cs.LG · 2026-05-07 · unverdicted · none · ref 70 · internal anchor
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
Pro²Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks cs.AI · 2026-05-05 · unverdicted · none · ref 50 · internal anchor
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action understanding and up to 2.29x in timing accuracy.
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models cs.CV · 2026-04-27 · unverdicted · none · ref 24 · internal anchor
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization cs.CV · 2026-04-26 · unverdicted · none · ref 22 · internal anchor
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models cs.IR · 2026-04-25 · unverdicted · none · ref 17 · internal anchor
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes cs.CV · 2026-04-16 · unverdicted · none · ref 20 · internal anchor
Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation cs.CV · 2026-04-15 · conditional · none · ref 58 · internal anchor
Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.
MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer cs.CV · 2026-04-14 · unverdicted · none · ref 18 · internal anchor
MAST is a mask-guided attention allocation method that enables artifact-free multi-style transfer in diffusion models by anchoring layout, distributing attention mass, scaling sharpness, and injecting details.
WildDet3D: Scaling Promptable 3D Detection in the Wild cs.CV · 2026-04-09 · unverdicted · none · ref 31 · internal anchor
WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
Descriptor: Parasitoid Wasps and Associated Hymenoptera Dataset (DAPWH) cs.CV · 2026-02-20 · unverdicted · none · ref 32 · internal anchor
Releases the DAPWH dataset of 3556 wasp images including 1739 COCO-annotated examples to enable AI models for identifying Ichneumonoidea and associated families.
Event-based Civil Infrastructure Visual Defect Detection: ev-CIVIL Dataset and Benchmark cs.CV · 2025-04-08 · unverdicted · none · ref 48 · internal anchor
Presents the ev-CIVIL dataset and benchmark showing that event-based cameras can support real-time detection of cracks and spalling in civil infrastructure under challenging lighting.
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models cs.CV · 2025-01-07 · unverdicted · none · ref 38 · internal anchor
PromptGuard optimizes a universal safety soft prompt (and category-specific variants) in T2I embedding space to moderate NSFW inputs, achieving average unsafe ratios of 5.84-6.18% while being 3.8x faster than prior defenses.
Hierarchical Text-Conditional Image Generation with CLIP Latents cs.CV · 2022-04-13 · accept · none · ref 28 · internal anchor
A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
High-Resolution Image Synthesis with Latent Diffusion Models cs.CV · 2021-12-20 · conditional · none · ref 51 · internal anchor
Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while cutting computational and
U4D: Unsupervised 4D Dynamic Scene Understanding cs.CV · 2019-07-23 · unverdicted · none · ref 31 · internal anchor
Unsupervised joint semantic instance segmentation, 4D reconstruction, and scene flow from multi-view video of multi-person dynamic scenes, with reported ~40% gains over prior methods.
Unmasking LAION-5B: Age, Gender, Race, and Emotion Biases in Large-Scale Image Datasets cs.CV · 2026-06-22 · unverdicted · none · ref 271 · internal anchor
Empirical audit of LAION-2B-en and LAION-2B-multi finds overrepresentation of young adults, White people, and males plus stereotypical emotion associations across two attribute classifiers.
Time-frequency localization of bird calls in dense soundscapes cs.SD · 2026-06-09 · unverdicted · none · ref 15 · internal anchor
YOLO11 models localize bird vocalizations on spectrograms, nearly doubling IoMin@50 F1 on Singapore data and outperforming baselines on Hawaii recordings.
Maximum Matching Accuracy: An Instance Segmentation Evaluation Metric Utilizing Globally Optimal Matching cs.CV · 2026-06-08 · unverdicted · none · ref 17 · internal anchor
MMA is a threshold-free continuous metric for instance segmentation that uses globally optimal bipartite matching between predictions and ground truth followed by per-pixel normalization to aggregate overlap.
Beyond Compression: Quantifying Spectral Accessibility in Vision Representations cs.CV · 2026-06-02 · unverdicted · none · ref 8 · internal anchor
Vision encoders alter spectral accessibility non-monotonically across depth with architecture-specific effects from projections and pooling, quantified via a new residual loss against random baselines.
VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing cs.AI · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
VLA-Trace diagnoses two VLA models via representation CKA, attention interventions, and behavioral tests, finding distinct finetuning dynamics, different routing, and strong visual grounding but weak fine-grained semantic following.
Trajectory-Consistent Calibration for Cache-Accelerated Diffusion Models cs.CV · 2026-05-24 · unverdicted · none · ref 7 · internal anchor
TCC calibrates cached representations in diffusion sampling via an offline iterative procedure that accounts for trajectory shifts, improving FID from 29.83 to 27.35 on PixArt-alpha while preserving reuse policies.
Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 48 · internal anchor
A foveated VLM trained for scene comprehension produces human-like fixations, outperforming models trained for search, classification, or with altered peripheral vision.
SparseSAM: Structured Sparsification of Activations in Segment Anything Models cs.CV · 2026-05-17 · unverdicted · none · ref 16 · internal anchor
SparseSAM achieves 2x faster inference and 2.8x memory reduction in SAM with only 0.004 mIoU loss at 0.4 density via Stripe-Sort Attention and Residual-Consistency MLP.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding cs.CV · 2026-05-08 · unverdicted · none · ref 27 · internal anchor
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.
Semantics-Aware Hierarchical Token Communication: Clustering, Bit Mapping, and Power Allocation eess.SP · 2026-04-30 · unverdicted · none · ref 16 · internal anchor
H-TokCom groups tokens by semantic similarity and protects cluster-level bits with higher power, raising semantic similarity from 0.206 to 0.279 at 3 dB SNR on COCO data.
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models cs.CV · 2026-04-29 · unverdicted · none · ref 15 · internal anchor
SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images cs.CV · 2026-04-28 · unverdicted · none · ref 18 · internal anchor
LatentDiff scales semantic dataset comparison to millions of images using latent spaces of vision encoders combined with sparse autoencoders and density ratio estimation, showing better accuracy and robustness than caption-based approaches on a new benchmark for sparse distribution shifts.
DeepSignature: Digitally Signed, Content-Encoding Watermarks for Robust and Transparent Image Authentication cs.CR · 2026-04-24 · unverdicted · none · ref 36 · internal anchor
DeepSignature embeds digitally signed content-encoding watermarks via neural networks for robust image authentication, source attribution, and latent-space tamper localization.
Source-Modality Monitoring in Vision-Language Models cs.CL · 2026-04-23 · unverdicted · none · ref 7 · internal anchor
Vision-language models use semantic signals more than syntactic ones to bind words like 'image' to actual visual inputs, with implications for robustness in multimodal systems.
Zero-shot World Models Are Developmentally Efficient Learners cs.AI · 2026-04-11 · unverdicted · none · ref 65 · internal anchor
A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
Bayesian Optimization for Mixed-Variable Problems in the Natural Sciences cs.LG · 2026-04-08 · unverdicted · none · ref 50 · internal anchor
A generalization of probabilistic reparameterization allows gradient-based acquisition optimization in fully mixed-variable Bayesian optimization with Gaussian process surrogates for non-equidistant discrete spaces.
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection cs.CV · 2026-04-03 · unverdicted · none · ref 13 · 2 links · internal anchor
DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.
Chasing Ghosts: A Simulation-to-Real Olfactory Navigation Stack with Optional Vision Augmentation cs.RO · 2026-02-23 · unverdicted · none · ref 25 · internal anchor
A simulation-to-real navigation policy enables a quadrotor to locate an odor source using only basic olfaction sensors and optional vision, validated in indoor real-world flights.
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models cs.CV · 2025-11-13 · unverdicted · none · ref 12 · internal anchor
RUDDER creates a persistent visual anchor by extracting CARD from prefill residuals and modulating its injection via an adaptive Beta Gate, cutting CHAIR_S by 24.4% and CHAIR_i by 23.6% on average across LLaVA, Idefics2, InstructBLIP and Qwen2.5-VL with >96% throughput.
Routing-Based Continual Learning for Multimodal Large Language Models cs.LG · 2025-11-03 · unverdicted · none · ref 29 · internal anchor
Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.
Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters? cs.CV · 2025-07-14 · conditional · none · ref 29 · internal anchor
The ITW-SM dataset and targeted optimization of detector design choices yield a 26.87% average AUC improvement for state-of-the-art AI-generated image detectors under real-world social media conditions.
Expressive yet Efficient Feature Expansion with Adaptive Cross-Hadamard Products cs.CV · 2025-05-28 · unverdicted · none · ref 16 · internal anchor
Proposes ACH module with differentiable sampling and softsign normalization for efficient feature expansion, integrated via NAS into Hadaptive-Net to claim SOTA accuracy/speed trade-offs on image classification.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training cs.CV · 2024-03-14 · unverdicted · none · ref 72 · internal anchor
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
Informative Image Captioning with External Sources of Information cs.CL · 2019-06-20 · unverdicted · none · ref 14 · internal anchor
A multimodal Transformer ingests image features plus multiple external entity label sources and learns to control their appearance in fluent output captions.
On Physical Adversarial Patches for Object Detection cs.CV · 2019-06-20 · unverdicted · none · ref 6 · internal anchor
A physical patch suppresses all object detections by YOLOv3 even for distant objects without overlapping them.
TuringViT: Making SOTA Vision Transformers Accessible to All cs.CV · 2026-06-23 · unverdicted · none · ref 46 · internal anchor
TuringViT claims a new ViT design with linear attention and curated data that matches SOTA performance using 10% of typical pretraining data while supporting dynamic resolutions and improving VLM integration.
Accelerating Vision Foundation Models with Drop-in Depthwise Convolution cs.CV · 2026-05-21 · unverdicted · none · ref 25 · internal anchor
Replacing selected attention heads in pretrained ViTs with depthwise convolutions, identified by simple strategies and recovered via fine-tuning, delivers 17-20% inference speedup on image tasks with minimal accuracy loss.
Replacement Learning: Training Neural Networks with Fewer Parameters cs.CV · 2026-05-19 · unverdicted · none · ref 28 · internal anchor
Replacement Learning replaces selected blocks in CNNs and ViTs with learnable parameter-fusion surrogates derived from adjacent layers to reduce full-depth backpropagation redundancy.
KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy cs.CV · 2026-05-14 · unverdicted · none · ref 17 · internal anchor
KVCapsule compresses KV cache in VLMs by 60% to deliver up to 2x higher tokens-per-second and 2.4x memory reduction with negligible accuracy loss.
Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation cs.CV · 2026-04-27 · unverdicted · none · ref 2 · internal anchor
PND mitigates object hallucination in vision-language models via dual-path contrastive decoding that boosts visual evidence and penalizes linguistic priors, yielding up to 6.5% gains on POPE, MME, and CHAIR benchmarks.
