BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.
PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes
18 Pith papers cite this work. Polarity classification is still indexing.
abstract
Estimating the 6D pose of known objects is important for robots to interact with the real world. The problem is challenging due to the variety of objects as well as the complexity of a scene caused by clutter and occlusions between objects. In this work, we introduce PoseCNN, a new Convolutional Neural Network for 6D object pose estimation. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. We also introduce a novel loss function that enables PoseCNN to handle symmetric objects. In addition, we contribute a large-scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct extensive experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN is highly robust to occlusions, can handle symmetric objects, and provides accurate pose estimation using only color images as input. When using depth data to further refine the poses, our approach achieves state-of-the-art results on the challenging OccludedLINEMOD dataset. Our code and dataset are available at https://rse-lab.cs.washington.edu/projects/posecnn/.
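The pose decomposition described in the abstract can be illustrated with a short sketch. It assumes a standard pinhole camera with focal lengths fx, fy and principal point (px, py), and a quaternion in (w, x, y, z) order; none of these conventions are stated in the abstract. The function names are hypothetical and are not taken from the PoseCNN code release. The last function shows closest-point matching, which is one way a rotation loss can be made tolerant of object symmetry; the abstract does not give the formula of the proposed loss, so treat that part as illustrative only.

```python
import math

def translation_from_center(cx, cy, tz, fx, fy, px, py):
    """Back-project a 2D object center (cx, cy) at depth tz into camera coordinates.
    Assumes a pinhole camera model (an assumption, not stated in the abstract)."""
    tx = (cx - px) * tz / fx
    ty = (cy - py) * tz / fy
    return (tx, ty, tz)

def quaternion_to_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.
    The quaternion is normalized first, since a regressed output need not be unit length."""
    w, x, y, z = q
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("zero-length quaternion")
    w, x, y, z = w / norm, x / norm, y / norm, z / norm
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def rotate(R, p):
    """Apply a 3x3 rotation matrix R to a 3D point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3))

def closest_point_rotation_loss(q_pred, q_true, points):
    """Illustrative symmetry-tolerant loss: each model point rotated by the
    prediction is compared with the closest model point rotated by the ground
    truth, so rotations that map the shape onto itself incur no penalty."""
    R_pred = quaternion_to_matrix(q_pred)
    R_true = quaternion_to_matrix(q_true)
    pred_pts = [rotate(R_pred, p) for p in points]
    true_pts = [rotate(R_true, p) for p in points]
    total = 0.0
    for a in pred_pts:
        total += min(sum((ai - bi) ** 2 for ai, bi in zip(a, b)) for b in true_pts)
    return total / (2 * len(points))
```

For a point set that is symmetric under a 180 degree turn about the z axis, such as the four points (±1, 0, 0) and (0, ±1, 0), the loss between the identity quaternion and the quaternion (0, 0, 0, 1) is zero, whereas a plain per-point distance would penalize the prediction even though the two poses look identical.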
representative citing papers
EventTrack6D tracks 6D poses of unseen objects from event cameras by reconstructing dense intensity and depth cues between frames, generalizing from synthetic training to real data at high speed.
SPARCS uses a differentiable contact model and sparse Hessian solver to jointly optimize shapes and poses of up to five interacting objects, producing physically valid simulation-ready reconstructions.
A model-free system uses 2D point trackers to achieve causal 6D pose tracking and incremental 3D reconstruction for multiple unseen rigid objects from RGB-D video, with recovery from complete occlusions.
A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.
FORGE benchmark shows domain-specific knowledge, not visual grounding, is the main bottleneck for MLLMs in manufacturing, with SFT on a 3B model delivering up to 90.8% relative accuracy improvement on held-out scenarios.
TORA distills topological structure from pretrained 3D encoders into flow-matching backbones via cosine matching and CKA loss, delivering up to 6.9x faster convergence and better accuracy on 3D shape assembly benchmarks with zero inference overhead.
Demonstrates non-line-of-sight 3D reconstruction, tracking, and camera localization on smartphone-grade LiDAR by fusing frames via a motion-induced aperture sampling model.
Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.
FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.
SAFAG introduces a symmetry annotation-free two-stage learning strategy for generalizable actionable parts pose estimation in robotics.
PULSE stabilizes mmWave human pose estimation by screening Doppler motion prompts before injecting them into spatial magnitude reasoning.
A factor graph that fuses motion models with uncertainty-aware pose measurements improves temporal consistency and benchmark scores for vision-based robot control.
TSM-Pose adds topology extraction and semantic Mamba blocks to point-cloud features, outperforming prior methods on REAL275, CAMERA25, and HouseCat6D for category-level pose estimation.
RDGen uses sim-to-real RL policies to generate smoother robot demonstrations that improve downstream VLA performance over human-collected data on pick-and-place tasks.
GNC-Pose achieves competitive 6D pose accuracy on the YCB dataset for textured objects using only geometric priors, rendering initialization, and robust GNC optimization without any learned features or training data.
CNN keypoint detection enables marker-free image-based visual servoing for aerial robots with robustness to occlusion and lighting changes, validated in Gazebo simulations.
citing papers explorer
- Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.
- GNC-Pose: Geometry-Aware GNC-PnP for Accurate 6D Pose Estimation
GNC-Pose achieves competitive 6D pose accuracy on the YCB dataset for textured objects using only geometric priors, rendering initialization, and robust GNC optimization without any learned features or training data.
- Deep Visual Servoing of an Aerial Robot Using Keypoint Feature Extraction
CNN keypoint detection enables marker-free image-based visual servoing for aerial robots with robustness to occlusion and lighting changes, validated in Gazebo simulations.