ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, where each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across concept maps, our approach allows precise estimation of 6DoF relative pose. Without any object- or dataset-specific training, our approach achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, outperforming the strongest baseline, including methods that rely on extensive dataset-specific training, by a relative 62% in average ADD(-S) score.
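The final step described in the abstract, recovering a 6DoF relative pose from 3D-3D correspondences, is classically solved with the Kabsch/Procrustes algorithm. Below is a minimal sketch of that step only, assuming matched point pairs are already given; it does not reproduce the paper's concept-map construction or matching pipeline, and the function name is illustrative.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Estimate rotation R and translation t such that dst ~ src @ R.T + t,
    from matched 3D point sets of shape (N, 3), via the Kabsch algorithm."""
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)
    src_c = src - src_mean                    # center both point clouds
    dst_c = dst - dst_mean
    H = src_c.T @ dst_c                       # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against a reflection
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T                        # proper rotation (det = +1)
    t = dst_mean - R @ src_mean
    return R, t
```

In practice such a solver is typically wrapped in a robust estimation loop (e.g. RANSAC) so that outlier correspondences do not corrupt the pose.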
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
- TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization
  TrianguLang achieves state-of-the-art feed-forward text-guided 3D localization and segmentation by using predicted geometry to gate cross-view semantic correspondences without ground-truth poses.
discussion (0)