org/abs/2411.14347

Tianhe Ren, Yihao Chen, Qing Jiang, Zhaoyang Zeng, Yuda Xiong, Wenlong Liu, Zhengyu Ma, Junyi Shen, Yuan Gao, Xiaoke Jiang, et al. · 2024 · arXiv 2411.14347

12 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

WHU-Infra3D is a new large-scale multi-modal dataset and benchmark for 3D roadside infrastructure inventory, providing over 175k 2D boxes, thousands of 3D instances, and 181k annotations across five core tasks while exposing cross-city gaps and long-tailed defect vulnerabilities.

Vision Harnessing Agent for Open Ad-hoc Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

Towards High-Resolution Visual Perception via Hierarchical Entity Exploration

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

HEE is a training-free, model-agnostic method for high-resolution visual perception in MLLMs using hierarchical entity exploration with dual scoring, detection, clustering, and backtracking.

DroneFINE: Domain-Aware Parameter-Efficient Fine-Tuning of Vision-Language Detectors for Drone Images

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

DroneFINE is a domain-aware PEFT approach for VLM-based drone detectors using foreground-aware multi-path adaptation and text-conditioned background suppression, outperforming standard PEFT and matching full fine-tuning on VisDrone and UAVDT with fewer trainable parameters.

ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

ShotCrop uses three-stage training (CoT SFT, pseudo-label semi-supervised, GRPO-S) to produce triple-shot compositions and reports 2.82x better shot localization than GPT-5 on a 1.2k expert benchmark.

SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better structure-aware results than existing MLLMs.

DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

cs.AI · 2025-09-25 · unverdicted · novelty 6.0 · 2 refs

DeFacto trains multimodal models with counterfactual image variants and GRPO reinforcement learning to enforce that correct answers are supported by correct visual evidence.

VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection

cs.CV · 2026-06-10 · unverdicted · novelty 5.0

VL-DINO improves open-vocabulary object detection by adding QPSC, VSE, and ORSA modules that inject CLIP knowledge into DINO, reaching 36.3 and 38.1 AP zero-shot on LVIS.

COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

COAL combines VLM-based explicit semantic injection and LLM-driven counterfactual learning inside a hierarchical architecture to improve discriminative referring multi-object tracking under sparse supervision.

See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones

cs.CV · 2026-04-14 · unverdicted · novelty 5.0

See&Say combines depth gradients, semantic masks, and VLM-guided refinement to generate safety maps and alternative drop zones for autonomous drone deliveries, outperforming baselines in accuracy and IoU.

Image Generators are Generalist Vision Learners

cs.CV · 2026-04-22 · 2 refs

citing papers explorer

Showing 10 of 10 citing papers after filters.

WHU-Infra3D: A Full-stack Multi-modal Dataset and Benchmark for 3D Roadside Infrastructure Inventory cs.CV · 2026-06-03 · unverdicted · none · ref 27
Vision Harnessing Agent for Open Ad-hoc Segmentation cs.CV · 2026-05-19 · unverdicted · none · ref 52
Towards High-Resolution Visual Perception via Hierarchical Entity Exploration cs.CV · 2026-07-01 · unverdicted · none · ref 40
DroneFINE: Domain-Aware Parameter-Efficient Fine-Tuning of Vision-Language Detectors for Drone Images cs.CV · 2026-07-01 · unverdicted · none · ref 35
ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions cs.CV · 2026-06-04 · unverdicted · none · ref 15
SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 34
VL-DINO: Leveraging CLIP Vision-Language Knowledge for Open-Vocabulary Object Detection cs.CV · 2026-06-10 · unverdicted · none · ref 28
COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking cs.CV · 2026-05-14 · unverdicted · none · ref 17
See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones cs.CV · 2026-04-14 · unverdicted · none · ref 22
Image Generators are Generalist Vision Learners cs.CV · 2026-04-22 · unreviewed · ref 21 · 2 links
