hub
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
19 Pith papers cite this work. Polarity classification is still indexing.
abstract
Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large-scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provides accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
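The two decoupled steps in the abstract (translation from a localized 2D center plus a predicted distance, rotation from a regressed quaternion) can be illustrated with a minimal sketch. It assumes a standard pinhole camera model; the function names, the intrinsics parameters (`fx`, `fy`, `px`, `py`), and the `(w, x, y, z)` quaternion ordering are hypothetical choices made for illustration, not taken from the PoseCNN code release.

```python
import math


def backproject_center(cx, cy, tz, fx, fy, px, py):
    """Recover a 3D translation from a 2D object center and its depth.

    Assumes a pinhole camera: (cx, cy) is the object center in pixels,
    tz is its distance along the optical axis, (fx, fy) are focal lengths
    in pixels, and (px, py) is the principal point.
    """
    tx = (cx - px) * tz / fx
    ty = (cy - py) * tz / fy
    return (tx, ty, tz)


def quaternion_to_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The quaternion is normalized first, since a regressed output is not
    guaranteed to have unit length.
    """
    w, x, y, z = q
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("zero-length quaternion has no rotation")
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


# Example: a center 100 px right of the principal point, 2 m from the camera.
translation = backproject_center(420.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0)
rotation = quaternion_to_matrix((1.0, 0.0, 0.0, 0.0))  # identity rotation
```

Splitting the pose this way means the network only has to predict image-space quantities (a center and a depth) and a 4-vector, while the camera intrinsics carry the geometry; how PoseCNN itself localizes the center and regresses the quaternion is described in the paper, not in this sketch.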
representative citing papers
EventTrack6D tracks 6D poses of unseen objects from event cameras by reconstructing dense intensity and depth cues between frames, generalizing from synthetic training to real data at high speed.
FORGE benchmark shows domain-specific knowledge, not visual grounding, is the main bottleneck for MLLMs in manufacturing, with SFT on a 3B model delivering up to 90.8% relative accuracy improvement on held-out scenarios.
Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.
citing papers explorer
- Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization
  SPARCS uses a differentiable contact model and sparse Hessian solver to jointly optimize shapes and poses of up to five interacting objects, producing physically valid simulation-ready reconstructions.
- Point2Pose: Occlusion-Recovering 6D Pose Tracking and 3D Reconstruction for Multiple Unknown Objects Via 2D Point Trackers
  A model-free system uses 2D point trackers to achieve causal 6D pose tracking and incremental 3D reconstruction for multiple unseen rigid objects from RGB-D video, with recovery from complete occlusions.
- DisFlow: Scene Flow from Distance Field for Object Pose, Velocity Tracking, and Dynamic Object Reconstruction
  DisFlow computes scene flow from GPIS distance fields for 6DoF pose estimation, motion tracking, and surface reconstruction by fusing in the object frame.
- Imaging Hidden Objects with Consumer LiDAR via Motion Induced Sampling
  Demonstrates non-line-of-sight 3D reconstruction, tracking, and camera localization on smartphone-grade LiDAR by fusing frames via a motion-induced aperture sampling model.
- Focusable Monocular Depth Estimation
  FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.
- Generalizable and Actionable Parts Pose Estimation with Symmetry Annotation-Free Learning Strategy
  SAFAG introduces a symmetry annotation-free two-stage learning strategy for generalizable actionable parts pose estimation in robotics.
- Doppler Prompting for Stable mmWave-based Human Pose Estimation
  PULSE stabilizes mmWave human pose estimation by screening Doppler motion prompts before injecting them into spatial magnitude reasoning.
- Temporally Consistent Object 6D Pose Estimation for Robot Control
  A factor graph that fuses motion models with uncertainty-aware pose measurements improves temporal consistency and benchmark scores for vision-based robot control.
- TSM-Pose: Topology-Aware Learning with Semantic Mamba for Category-Level Object Pose Estimation
  TSM-Pose adds topology extraction and semantic Mamba blocks to point-cloud features, outperforming prior methods on REAL275, CAMERA25, and HouseCat6D for category-level pose estimation.
- RDGen: Demonstration Generation for High-Quality Robot Learning via Reinforcement Learning
  RDGen uses sim-to-real RL policies to generate smoother robot demonstrations that improve downstream VLA performance over human-collected data on pick-and-place tasks.
- GNC-Pose: Geometry-Aware GNC-PnP for Accurate 6D Pose Estimation
  GNC-Pose achieves competitive 6D pose accuracy on the YCB dataset for textured objects using only geometric priors, rendering initialization, and robust GNC optimization without any learned features or training data.
- Deep Visual Servoing of an Aerial Robot Using Keypoint Feature Extraction
  CNN keypoint detection enables marker-free image-based visual servoing for aerial robots with robustness to occlusion and lighting changes, validated in Gazebo simulations.