super hub Canonical reference

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

doi: 10 · 2023 · arXiv 2729.2023

Canonical reference. 76% of citing Pith papers cite this work as background.

208 Pith papers citing it

citation-role summary

background: 36 · baseline: 7 · method: 4 · dataset: 2

citation-polarity summary

background: 37 · baseline: 7 · use method: 4 · use dataset: 1

representative citing papers

WildBox: A Dataset and Benchmark for Aerial Monocular 3D Detection of African Savanna Wildlife

cs.CV · 2026-06-19 · unverdicted · novelty 8.0 · 2 refs

WildBox provides over 237k 3D wildlife annotations from drone video. Its benchmarks show zero-shot 3D detection at 0 AP, while fine-tuned models reach 8.68 AP-BEV and 13.17 AP3D, with depth estimation causing most errors.

Vision-language models for chest radiography do not always need the image

cs.CV · 2026-06-16 · accept · novelty 8.0

A causal audit with image interventions shows text-only models reach within 5.7 accuracy points of top multimodal VLMs on chest radiography, with some large multimodal models statistically indistinguishable from small text-only baselines.

MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

cs.CV · 2026-04-30 · unverdicted · novelty 8.0

MoCapAnything V2 presents the first end-to-end learnable Video-to-Pose and Pose-to-Rotation framework for monocular arbitrary-skeleton motion capture by conditioning on a reference pose-rotation pair.

Learning Spectral and Polarimetric Clues for One-to-Multimodal Novel View Synthesis

cs.CV · 2026-07-02 · unverdicted · novelty 7.0 · 2 refs

SPoILeR uses multimodal pre-training to enable accurate novel view synthesis of infrared, polarimetric, and multispectral data from RGB-supervised fine-tuning on new scenes.

PRISM-VO: Scale-Aware Visual Odometry Using Photometric Plenoptic Bundle Adjustment

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

PRISM-VO introduces photometric plenoptic bundle adjustment for drift-resilient, metric-scale visual odometry from a single focused plenoptic camera.

SpheRoPE: Zero-Shot Optimization-Free 360 Panorama Generation with Spherical RoPE

cs.CV · 2026-06-30 · unverdicted · novelty 7.0 · 2 refs

SpheRoPE modifies rotary position embeddings in diffusion transformers to enforce spherical topology for zero-shot 360 panorama generation across multiple backbones.

RESOLVE: A Multi-Resolution and Multi-Modal Dataset for Roadside Cooperative Perception

cs.CV · 2026-06-30 · accept · novelty 7.0

RESOLVE provides a controlled multi-resolution LiDAR and camera benchmark for evaluating 3D detection and tracking under point sparsity variations in roadside cooperative perception.

Think While You Map: Asynchronous Vision-Language Agents for Incremental 3D Scene Graphs

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.

Learning to Deny: Action Denial in Multimodal Large Language Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.

Diffusion-Based Material Regularization for Physics-Based Inverse Rendering

cs.CV · 2026-06-30 · unverdicted · novelty 7.0 · 2 refs

A regularization technique that treats diffusion model outputs as a similarity kernel during material optimization in inverse rendering, enabling joint reconstruction of geometry, materials, and illumination that satisfies the rendering equation and generalizes to new lighting.

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Introduces the VG-GUIBench benchmark and the TASKER keyframe extraction algorithm, which improves performance on VideoQA and video-guided agentic tasks.

MIRAGE: Protecting against Malicious Image Editing via False Moderation

cs.CR · 2026-06-24 · unverdicted · novelty 7.0 · 2 refs

MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors at over 88% success rate.

FUTO Swipe: Layout-Agnostic Neural Swipe Decoding

cs.HC · 2026-06-24 · unverdicted · novelty 7.0

A neural swipe decoder trained with geometric augmentations on 1M+ swipes generalizes to unseen keyboard layouts by predicting per-point character locations and mapping them through the layout supplied at inference time.

MATCH: Flow Matching for Multi-View Anomaly Detection

cs.CV · 2026-06-23 · unverdicted · novelty 7.0

MATCH is the first flow matching method for multi-view anomaly detection, reporting SOTA results on Real-IAD and the first comprehensive evaluation on MANTA-Tiny while enabling real-time use by omitting the divergence term.

Arbor: Explicit Geometric Conditioning for Controllable 3D Asset Generation

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

Arbor attaches constraint mesh tokens to a frozen text-to-3D denoiser to enable controllable generation obeying hull, avoidance, and touch constraints.

4DVLT: Dynamic Scene Understanding with Worldline-Centered Vision-Language Tracking

cs.CV · 2026-06-21 · conditional · novelty 7.0 · 2 refs

The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.

TopoCap: Learning Topology-Agnostic Motion Priors for Monocular Video-to-Animation

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

A two-stage generative model (Graph CVAE + flow matching) learns topology-agnostic motion codes from a new 5k-topology dataset and retargets video motion to arbitrary unseen skeletons.

SpikeTAD: Spiking Neural Networks for End-to-End Temporal Action Detection

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

SpikeTAD proposes the first SNN-based end-to-end TAD model, reporting 67.2% mAP on THUMOS14 and 37.42% on ActivityNet-1.3 with extremely low power consumption.

Fisher-Guided Progressive Parameter Selection for Adaptive Fine-Tuning

cs.CV · 2026-06-08 · unverdicted · novelty 7.0

FisherAdapTune uses temporal drift in Fisher geometry, measured by scale-invariant Jensen-Shannon distance, to progressively freeze stabilized parameter groups during fine-tuning, reporting gains on segmentation and zero-shot transfer.

Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

cs.CV · 2026-06-05 · unverdicted · novelty 7.0 · 2 refs

An ILP-based oracle applied to seven VIS methods on YouTube-VIS and OVIS shows tracking instability as the dominant bottleneck, producing gaps exceeding 20 AP under occlusion while classification impact is secondary.

WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

WHU-Infra3D is a new large-scale multi-modal dataset and benchmark for 3D roadside infrastructure inventory, providing over 175k 2D boxes, thousands of 3D instances, and 181k annotations across five core tasks while exposing cross-city gaps and long-tailed defect vulnerabilities.

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

DivIn samples initial noise from a guidance potential posterior via Langevin dynamics to improve diversity in class-to-image and text-to-image generation.

GLENS: Global Search via Learning from Solver Iterates with Diffusion Models

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

GLENS uses diffusion models on solver iterates to generate high-quality and diverse initial guesses for multimodal non-convex optimization, leading to faster solver convergence.

ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation

cs.GR · 2026-05-27 · unverdicted · novelty 7.0

ClothTransformer is a unified latent-space Transformer for cloth simulation that handles body-driven garments, robotic manipulation, and free-fall collisions in one model with 4-9x lower error than prior methods and mesh-resolution independence.

citing papers explorer

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

cs.CV · 2026-04-03 · unverdicted · none · ref 41

UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
