PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large-scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provides accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
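The two ideas in the abstract (3D translation recovered from a predicted image-space center plus a camera-frame distance, and a symmetry-tolerant loss on rotations) can be sketched in a few lines of numpy. This is a minimal illustration assuming a standard pinhole camera model with intrinsics (fx, fy, px, py); the function names are illustrative and not taken from the released PoseCNN code, and the loss below is only a ShapeMatch-style closest-point variant of the paper's symmetric loss.

```python
import numpy as np

def recover_translation(cx, cy, tz, fx, fy, px, py):
    """Back-project a predicted object center (cx, cy) in pixels and a
    predicted camera-frame distance tz into a 3D translation (Tx, Ty, Tz),
    inverting the pinhole projection: cx = fx * Tx / Tz + px, etc."""
    tx = (cx - px) * tz / fx
    ty = (cy - py) * tz / fy
    return np.array([tx, ty, tz])

def shapematch_loss(r_pred, r_gt, points):
    """Symmetry-tolerant rotation loss (ShapeMatch-style sketch):
    transform the model points by both rotations, then penalize each
    predicted point only by its distance to the *closest* ground-truth
    point. Two rotations related by an object symmetry map the model
    onto the same point set, so they incur (near-)zero loss."""
    p_pred = points @ r_pred.T                      # (N, 3)
    p_gt = points @ r_gt.T                          # (N, 3)
    # pairwise distances between predicted and ground-truth point sets
    d = np.linalg.norm(p_pred[:, None, :] - p_gt[None, :, :], axis=-1)
    return d.min(axis=1).mean() / 2.0
```

For example, a square of points in the xy-plane is symmetric under a 180° rotation about z, so predicting the identity rotation against that ground truth yields zero loss, whereas a plain pointwise L2 loss would not.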
Forward citations
Cited by 10 Pith papers
- Point2Pose: Occlusion-Recovering 6D Pose Tracking and 3D Reconstruction for Multiple Unknown Objects Via 2D Point Trackers
  A model-free system uses 2D point trackers to achieve causal 6D pose tracking and incremental 3D reconstruction for multiple unseen rigid objects from RGB-D video, with recovery from complete occlusions.
- Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch
  A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.
- FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios
  The FORGE benchmark shows that domain-specific knowledge, not visual grounding, is the main bottleneck for MLLMs in manufacturing, with SFT on a 3B model delivering up to 90.8% relative accuracy improvement on held-out scenarios.
- TORA: Topological Representation Alignment for 3D Shape Assembly
  TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmarks.
- Event6D: Event-based Novel Object 6D Pose Tracking
  EventTrack6D tracks 6D poses of unseen objects from event cameras by reconstructing dense intensity and depth cues between frames, generalizing from synthetic training to real data at high speed.
- Focusable Monocular Depth Estimation
  FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.
- Doppler Prompting for Stable mmWave-based Human Pose Estimation
  PULSE stabilizes mmWave human pose estimation by screening Doppler motion prompts before injecting them into spatial magnitude reasoning.
- Temporally Consistent Object 6D Pose Estimation for Robot Control
  A factor graph that fuses motion models with uncertainty-aware pose measurements improves temporal consistency and benchmark scores for vision-based robot control.
- TSM-Pose: Topology-Aware Learning with Semantic Mamba for Category-Level Object Pose Estimation
  TSM-Pose adds topology extraction and semantic Mamba blocks to point-cloud features, outperforming prior methods on REAL275, CAMERA25, and HouseCat6D for category-level pose estimation.
- MAPRPose: Mask-Aware Proposal and Amodal Refinement for Multi-Object 6D Pose Estimation
  MAPRPose reports 76.5% Average Recall on the BOP benchmark for multi-object 6D pose estimation, beating FoundationPose by 3.1% while running 43 times faster through mask-aware proposals and amodal refinement.