hub

Grounding dino 1.5: Advance the” edge” of open-set object detection

Tianhe Ren, Qing Jiang, Shilong Liu, Zhaoyang Zeng, Wenlong Liu, Han Gao, Hongjie Huang, Zhengyu Ma, Xiaoke Jiang, Yihao Chen, et al · 2024 · arXiv 2405.10300

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 baseline 2

citation-polarity summary

background 2 baseline 2

representative citing papers

Vision Harnessing Agent for Open Ad-hoc Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

cs.CV · 2026-02-27 · unverdicted · novelty 7.0

DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.

Unify Robot Actions in Camera Frame

cs.RO · 2025-11-21 · conditional · novelty 6.0

CalibAll estimates camera extrinsics on existing datasets to convert robot actions into a unified camera-frame representation, enabling stronger cross-embodiment pretraining.

Inferring Dynamic Physical Properties from Video Foundation Models

cs.CV · 2025-10-02 · unverdicted · novelty 6.0

Video foundation models infer dynamic physical properties such as elasticity, viscosity, and friction from videos at levels close to classical oracles while outperforming current MLLMs with suitable prompting.

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

cs.CV · 2025-04-10 · unverdicted · novelty 6.0

VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

RHINO recovers 3D human, novel manipulated object, and static scene from monocular video by stabilizing SfM with foundation models, separating motions, and refining with compositional neural SDFs plus contact priors.

DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer

cs.CV · 2026-05-11 · unverdicted · novelty 5.0

DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

cs.RO · 2025-07-02 · unverdicted · novelty 5.0

The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.

Qwen2.5-VL Technical Report

cs.CV · 2025-02-19 · unverdicted · novelty 5.0

Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.

Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving

cs.CV · 2025-01-14 · unverdicted · novelty 5.0

Vision foundation model embeddings with density modeling outperform state-of-the-art methods for unsupervised semantic and covariate shift detection in autonomous driving inputs.

LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

cs.CV · 2026-04-13 · unverdicted · novelty 3.0

This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.

citing papers explorer

Showing 13 of 13 citing papers.

Vision Harnessing Agent for Open Ad-hoc Segmentation cs.CV · 2026-05-19 · unverdicted · none · ref 51
VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model cs.CV · 2026-02-27 · unverdicted · none · ref 18
DLEBench is the first benchmark for small-scale object editing in instruction-based image editing models, using 1889 samples, seven instruction types, and a dual-mode evaluation protocol to reveal performance gaps in 10 tested models.
SAM 3: Segment Anything with Concepts cs.CV · 2025-11-20 · unverdicted · none · ref 115
SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing cs.CV · 2026-04-06 · unverdicted · none · ref 33
MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.
Unify Robot Actions in Camera Frame cs.RO · 2025-11-21 · conditional · none · ref 34
CalibAll estimates camera extrinsics on existing datasets to convert robot actions into a unified camera-frame representation, enabling stronger cross-embodiment pretraining.
Inferring Dynamic Physical Properties from Video Foundation Models cs.CV · 2025-10-02 · unverdicted · none · ref 9
Video foundation models infer dynamic physical properties such as elasticity, viscosity, and friction from videos at levels close to classical oracles while outperforming current MLLMs with suitable prompting.
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model cs.CV · 2025-04-10 · unverdicted · none · ref 44
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
RHINO: Reconstructing Human Interactions with Novel Objects from Monocular Videos cs.CV · 2026-05-16 · unverdicted · none · ref 48
RHINO recovers 3D human, novel manipulated object, and static scene from monocular video by stabilizing SfM with foundation models, separating motions, and refining with compositional neural SDFs plus contact priors.
DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer cs.CV · 2026-05-11 · unverdicted · none · ref 35
DetRefiner fuses global and local features with a Transformer to refine OVOD confidence scores, delivering up to +10.1 AP gains on novel categories across multiple datasets.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective cs.RO · 2025-07-02 · unverdicted · none · ref 95
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
Qwen2.5-VL Technical Report cs.CV · 2025-02-19 · unverdicted · none · ref 25
Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level localization.
Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving cs.CV · 2025-01-14 · unverdicted · none · ref 66
Vision foundation model embeddings with density modeling outperform state-of-the-art methods for unsupervised semantic and covariate shift detection in autonomous driving inputs.
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation cs.CV · 2026-04-13 · unverdicted · none · ref 138
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.

Grounding dino 1.5: Advance the” edge” of open-set object detection

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer