WildBox provides over 237k 3D wildlife annotations from drone video and benchmarks reveal zero-shot 3D detection at 0 AP but fine-tuned performance of 8.68 AP-BEV and 13.17 AP3D, with depth estimation causing most errors.
super hub Canonical reference
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp
Canonical reference. 76% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
authors
co-cited works
representative citing papers
A causal audit with image interventions shows text-only models reach within 5.7 accuracy points of top multimodal VLMs on chest radiography, with some large multimodal models statistically indistinguishable from small text-only baselines.
MoCapAnything V2 presents the first end-to-end learnable Video-to-Pose and Pose-to-Rotation framework for monocular arbitrary-skeleton motion capture by conditioning on a reference pose-rotation pair.
SpheRoPE modifies rotary position embeddings in diffusion transformers to enforce spherical topology for zero-shot 360 panorama generation across multiple backbones.
RESOLVE provides a controlled multi-resolution LiDAR and camera benchmark for evaluating 3D detection and tracking under point sparsity variations in roadside cooperative perception.
An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.
MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.
A regularization technique that treats diffusion model outputs as a similarity kernel during material optimization in inverse rendering, enabling joint reconstruction of geometry, materials, and illumination that satisfies the rendering equation and generalizes to new lighting.
Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.
MIRAGE immunizes images by crafting perturbations that align them with policy-violating concepts in open-source moderation models, triggering refusals in closed-source commercial image editors at over 88% success rate.
Neural swipe decoder trained with geometric augmentations on 1M+ swipes generalizes to unseen keyboard layouts by predicting per-point character locations and mapping via inference-time layout.
MATCH is the first flow matching method for multi-view anomaly detection, reporting SOTA results on Real-IAD and the first comprehensive evaluation on MANTA-Tiny while enabling real-time use by omitting the divergence term.
Arbor attaches constraint mesh tokens to a frozen text-to-3D denoiser to enable controllable generation obeying hull, avoidance, and touch constraints.
The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.
A two-stage generative model (Graph CVAE + flow matching) learns topology-agnostic motion codes from a new 5k-topology dataset and retargets video motion to arbitrary unseen skeletons.
SpikeTAD proposes the first SNN-based end-to-end TAD model, reporting 67.2% mAP on THUMOS14 and 37.42% on ActivityNet-1.3 with extremely low power consumption.
FisherAdapTune uses temporal drift in Fisher geometry, measured by scale-invariant Jensen-Shannon distance, to progressively freeze stabilized parameter groups during fine-tuning, reporting gains on segmentation and zero-shot transfer.
An ILP-based oracle applied to seven VIS methods on YouTube-VIS and OVIS shows tracking instability as the dominant bottleneck, producing gaps exceeding 20 AP under occlusion while classification impact is secondary.
WHU-Infra3D is a new large-scale multi-modal dataset and benchmark for 3D roadside infrastructure inventory, providing over 175k 2D boxes, thousands of 3D instances, and 181k annotations across five core tasks while exposing cross-city gaps and long-tailed defect vulnerabilities.
DivIn samples initial noise from a guidance potential posterior via Langevin dynamics to improve diversity in class-to-image and text-to-image generation.
GLENS uses diffusion models on solver iterates to generate high-quality and diverse initial guesses for multimodal non-convex optimization, leading to faster solver convergence.
ClothTransformer is a unified latent-space Transformer for cloth simulation that handles body-driven garments, robotic manipulation, and free-fall collisions in one model with 4-9x lower error than prior methods and mesh-resolution independence.
RS2AD-LiDAR reconstructs vehicle LiDAR data from roadside observations via coordinate transformation, virtual LiDAR modeling and resampling, claimed as the first such method, with experiments showing improved object detection when mixed with real data.
Introduces NMCA-aligned L1/L2 LULC schemes and the Loosdorf-MSL benchmark dataset, with Point Transformer V3 reaching 79.4% mIoU on 8 classes and 58.9% on 20 classes, plus gains from multispectral inputs.
citing papers explorer
-
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
-
Field-Localized Forgery Detection for Digital Identity Documents
FLiD is a field-localized forgery detection method for identity documents that outperforms full-document baselines and general detectors with significantly fewer parameters.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Vector Scaffolding: Inter-Scale Orchestration for Differentiable Image Vectorization
Vector Scaffolding uses Interior Gradient Aggregation, Progressive Stratification, and Rapid Inflation Scheduling to achieve 2.5x faster optimization and up to 1.4 dB higher PSNR in differentiable image vectorization.
-
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
UJEM-KL improves cross-model transferability of untargeted jailbreaks on VLMs by maximizing entropy at decision tokens rather than enforcing fixed response patterns.
-
Where are they looking in the operating room?
Gaze-following models on extended 4D-OR and Team-OR datasets reach F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition while improving team communication detection by over 30%.
-
Random Walk on Point Clouds for Feature Detection
RWoDSN extracts feature points from point clouds via a novel DSN descriptor and random walk graph analysis, reporting 22% higher recall than prior state-of-the-art with 0.784 precision.
-
Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image
Any3DAvatar reconstructs full-head 3D Gaussian avatars from one image via one-step denoising on a Plücker-aware scaffold plus auxiliary view supervision, beating prior single-image methods on fidelity while running substantially faster.
-
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
InsEdit adapts a video diffusion backbone for text-instruction video editing via Mutual Context Attention, achieving SOTA open-source results with O(100K) data while also supporting image editing.
-
SyncBreaker:Stage-Aware Multimodal Adversarial Attacks on Audio-Driven Talking Head Generation
SyncBreaker jointly attacks image and audio streams with Multi-Interval Sampling and Cross-Attention Fooling to degrade speech-driven talking head generation more than single-modality baselines.
-
CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition
CEZSAR uses contrastive learning to align video and sentence embeddings with automatic negative sampling, claiming state-of-the-art zero-shot action recognition on UCF-101 and Kinetics-400.
-
Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation
MixTGFormer reports state-of-the-art 3D pose estimation errors of 37.6 mm on Human3.6M and 15.7 mm on MPI-INF-3DHP by using parallel GCN-Transformer streams with SE layers for local-global feature fusion.
-
Weak-to-Strong Knowledge Distillation Accelerates Visual Learning
Weak-to-strong knowledge distillation applied early and then turned off accelerates convergence to target performance in visual learning tasks by factors of 1.7-4.8x.
-
Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduced error and fast inference.
-
Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions
A literature review that categorizes deep learning approaches for visual hand gesture recognition, summarizes state-of-the-art methods across tasks, reviews datasets and metrics, and identifies challenges and future directions.