org/abs/2411.14347
12 Pith papers cite this work.
representative citing papers
VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
HEE is a training-free, model-agnostic method for high-resolution visual perception in MLLMs using hierarchical entity exploration with dual scoring, detection, clustering, and backtracking.
DroneFINE is a domain-aware PEFT approach for VLM-based drone detectors using foreground-aware multi-path adaptation and text-conditioned background suppression, outperforming standard PEFT and matching full fine-tuning on VisDrone and UAVDT with fewer trainable parameters.
ShotCrop uses three-stage training (CoT SFT, pseudo-label semi-supervised, GRPO-S) to produce triple-shot compositions and reports 2.82x better shot localization than GPT-5 on a 1.2k expert benchmark.
SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better structure-aware results than existing MLLMs.
DeFacto trains multimodal models with counterfactual image variants and GRPO reinforcement learning to enforce that correct answers are supported by correct visual evidence.
VL-DINO improves open-vocabulary object detection by adding QPSC, VSE, and ORSA modules that inject CLIP knowledge into DINO, reaching 36.3 and 38.1 AP zero-shot on LVIS.
COAL combines VLM-based explicit semantic injection and LLM-driven counterfactual learning inside a hierarchical architecture to improve discriminative referring multi-object tracking under sparse supervision.
See&Say combines depth gradients, semantic masks, and VLM-guided refinement to generate safety maps and alternative drop zones for autonomous drone deliveries, outperforming baselines in accuracy and IoU.
citing papers explorer
- WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory
WHU-Infra3D is a new large-scale multi-modal dataset and benchmark for 3D roadside infrastructure inventory, providing over 175k 2D boxes, thousands of 3D instances, and 181k annotations across five core tasks while exposing cross-city gaps and long-tailed defect vulnerabilities.
- Vision Harnessing Agent for Open Ad-hoc Segmentation
VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
- Towards High-Resolution Visual Perception via Hierarchical Entity Exploration
HEE is a training-free, model-agnostic method for high-resolution visual perception in MLLMs using hierarchical entity exploration with dual scoring, detection, clustering, and backtracking.
- DroneFINE: Domain-Aware Parameter-Efficient Fine-Tuning of Vision-Language Detectors for Drone Images
DroneFINE is a domain-aware PEFT approach for VLM-based drone detectors using foreground-aware multi-path adaptation and text-conditioned background suppression, outperforming standard PEFT and matching full fine-tuning on VisDrone and UAVDT with fewer trainable parameters.
- ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions
ShotCrop uses three-stage training (CoT SFT, pseudo-label semi-supervised, GRPO-S) to produce triple-shot compositions and reports 2.82x better shot localization than GPT-5 on a 1.2k expert benchmark.
- SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding
SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better structure-aware results than existing MLLMs.
- VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection
VL-DINO improves open-vocabulary object detection by adding QPSC, VSE, and ORSA modules that inject CLIP knowledge into DINO, reaching 36.3 and 38.1 AP zero-shot on LVIS.
- COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking
COAL combines VLM-based explicit semantic injection and LLM-driven counterfactual learning inside a hierarchical architecture to improve discriminative referring multi-object tracking under sparse supervision.
- See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones
See&Say combines depth gradients, semantic masks, and VLM-guided refinement to generate safety maps and alternative drop zones for autonomous drone deliveries, outperforming baselines in accuracy and IoU.
- Image Generators are Generalist Vision Learners