6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks

Amir Masoud Almasi; Ana Parovic; Ashkan Shafiei; Ismail Aljosevic

arxiv: 2605.08059 · v1 · submitted 2026-05-08 · 💻 cs.CV · cs.RO

Pith reviewed 2026-05-11 02:06 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords 6D pose estimation · keypoint heatmap regression · RGB-D fusion · ResNet18 · PnP RANSAC · LINEMOD dataset · object detection · YOLOv10

The pith

ResNet heatmap regression with RGB-D cross-fusion reaches 92% 6D pose accuracy on LINEMOD

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a modular pipeline that detects objects with YOLOv10m, regresses 2D keypoint heatmaps from RGB or RGB-D images using a ResNet18 network, and recovers 6D poses by applying PnP RANSAC to the extracted points. It demonstrates that a cross-fusion architecture, which merges RGB and depth features at multiple layers, raises mean ADD accuracy from 84.50% with RGB alone to 92.41% on the LINEMOD dataset. The authors also evaluate several keypoint selection methods and training adjustments such as activation functions and learning-rate schedules to refine the heatmap quality. This shows that depth fusion can measurably strengthen the reliability of the downstream geometric solve without changing the core detection or solver stages.
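The three-stage structure described above can be pictured as a chain of swappable callables. The following is only a wiring sketch: `PosePipeline`, its field names, and the stub stages are hypothetical and are not taken from the released code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Box = Tuple[int, int, int, int]            # x, y, width, height
Keypoints = Sequence[Tuple[float, float]]  # 2D keypoints in image coordinates


@dataclass
class PosePipeline:
    """Hypothetical container: each stage can be replaced independently."""
    detect: Callable[[object], Box]                        # e.g. a YOLOv10m wrapper
    regress_keypoints: Callable[[object, Box], Keypoints]  # heatmap network + peak extraction
    solve_pose: Callable[[Keypoints], object]              # e.g. a PnP RANSAC wrapper

    def __call__(self, image):
        box = self.detect(image)
        keypoints = self.regress_keypoints(image, box)
        return self.solve_pose(keypoints)


# Stub stages stand in for the real models so the wiring can be exercised.
pipeline = PosePipeline(
    detect=lambda image: (0, 0, 64, 64),
    regress_keypoints=lambda image, box: [(10.0, 12.0), (30.0, 8.0)],
    solve_pose=lambda kps: {"num_keypoints": len(kps)},
)
result = pipeline(image=None)
```

Swapping the RGB-only regressor for an RGB-D one would, under this sketch, mean replacing only the `regress_keypoints` callable.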

Core claim

A ResNet18 network that regresses keypoint heatmaps from RGB-D inputs via cross-fusion at multiple feature stages produces sufficiently accurate 2D points for PnP RANSAC to recover object poses at 92.41% mean ADD accuracy on LINEMOD, a gain of 7.91 percentage points over the RGB-only baseline of 84.50%.
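For readers unfamiliar with the metric: ADD is commonly defined as the mean distance between model points transformed by the ground-truth pose and by the estimated pose, and a pose is often counted as correct when that distance falls below 10% of the object diameter. The sketch below assumes that convention; the threshold is not stated in the text above, and all function names and numbers are illustrative.

```python
import math


def transform(points, R, t):
    """Apply rotation R (3x3 nested lists) and translation t (length 3) to 3D points."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]


def add_distance(points, R_gt, t_gt, R_est, t_est):
    """Mean Euclidean distance between corresponding transformed model points."""
    gt = transform(points, R_gt, t_gt)
    est = transform(points, R_est, t_est)
    return sum(math.dist(a, b) for a, b in zip(gt, est)) / len(points)


def is_correct(add, diameter, fraction=0.1):
    """Assumed convention: correct if ADD is below `fraction` of the model diameter."""
    return add < fraction * diameter


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
model = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]

# The estimated pose is off by 5 mm along x; the rotation is identical.
add = add_distance(model, IDENTITY, [0.0, 0.0, 1.0], IDENTITY, [0.005, 0.0, 1.0])
```

Mean ADD accuracy is then the fraction of test images for which `is_correct` holds, averaged over objects.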

What carries the argument

The cross-fusion architecture inside the ResNet18 backbone that lets RGB and depth feature maps interact at several stages before heatmap output.
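As a rough picture of what "interacting at several stages" can mean, here is a toy two-stream forward pass with a bidirectional feature exchange after every stage. The fusion rule (each stream adds a weighted copy of the other), the `alpha` weight, and the scalar "stages" are assumptions made for illustration; the actual fusion operator inside the ResNet18 backbone is not specified in the text above.

```python
def stage(features, scale):
    """Stand-in for one residual stage: scale the features, then apply ReLU."""
    return [max(0.0, scale * f) for f in features]


def cross_fuse(rgb, depth, alpha=0.5):
    """Assumed fusion rule: each stream adds a weighted copy of the other."""
    fused_rgb = [r + alpha * d for r, d in zip(rgb, depth)]
    fused_depth = [d + alpha * r for r, d in zip(rgb, depth)]
    return fused_rgb, fused_depth


def two_stream_forward(rgb, depth, stage_scales=(1.0, 0.5, 2.0)):
    """Run both streams stage by stage, exchanging features after every stage."""
    for scale in stage_scales:
        rgb, depth = stage(rgb, scale), stage(depth, scale)
        rgb, depth = cross_fuse(rgb, depth)
    # Merge the two streams before a heatmap head (not shown here).
    return [r + d for r, d in zip(rgb, depth)]


out = two_stream_forward([1.0, 2.0], [0.0, 4.0])
```

The contrast with late fusion is that the depth stream already influences the RGB features after the first stage rather than only at the output.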

If this is right

Different strategies for selecting keypoints from the heatmaps produce measurable differences in final pose accuracy.
Changes to activation functions and learning-rate schedules can improve the quality of the regressed heatmaps.
The modular separation of detection, heatmap regression, and PnP solving permits independent upgrades to any stage.
RGB-D cross-fusion delivers a consistent accuracy increase over RGB-only inputs on the tested benchmark.
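The Figure 5 caption names CPS and FPS as two keypoint selection strategies. Assuming FPS denotes farthest point sampling, the usual choice for picking well-spread keypoints on an object model, a minimal greedy version looks like the sketch below. The function name and the toy point cloud are illustrative, and `k` is assumed not to exceed the number of points.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so each new point is farthest from those already chosen."""
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(idx)
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    return chosen


# Toy "model": three points along a line plus two outlying points.
cloud = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (2.0, 0.0, 0.0),
    (10.0, 0.0, 0.0),
    (0.0, 5.0, 0.0),
]
keypoint_ids = farthest_point_sampling(cloud, k=3)
```

On this toy cloud the sampler skips the clustered points and picks the extremes, which is why well-spread keypoints tend to give a better-conditioned pose solve.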

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same cross-fusion pattern could be tested on other 6D pose benchmarks to check whether the accuracy gain generalizes.
Optimizing the ResNet18 backbone for lower latency might allow the pipeline to run at real-time rates on edge hardware.
The multi-stage fusion approach may prove more robust than single-stage fusion when depth data contains noise or missing values.

Load-bearing premise

The heatmaps must supply keypoint locations accurate enough and free of excessive outliers so that PnP RANSAC can still compute reliable 6D poses.
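To make the premise concrete, here is one simple way a heatmap can be turned into a keypoint: take the strongest response and drop it if its score is too low to trust. This is a sketch, not the paper's extraction method; the `min_score` threshold and the function names are hypothetical.

```python
def heatmap_peak(heatmap):
    """Return (x, y, score) of the strongest response in a 2D heatmap (list of rows)."""
    best = (0, 0, heatmap[0][0])
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best


def extract_keypoints(heatmaps, min_score=0.3):
    """One peak per heatmap; peaks below `min_score` are dropped as unreliable."""
    keypoints = {}
    for kp_id, hm in enumerate(heatmaps):
        x, y, score = heatmap_peak(hm)
        if score >= min_score:
            keypoints[kp_id] = (x, y)
    return keypoints


heatmaps = [
    [[0.0, 0.1, 0.0], [0.0, 0.2, 0.9], [0.0, 0.1, 0.0]],  # clear peak at x=2, y=1
    [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1], [0.1, 0.1, 0.1]],  # flat, low-confidence map
]
keypoints = extract_keypoints(heatmaps)
```

Every keypoint dropped here is one fewer 2D-3D correspondence for the pose solver, so a blurry or multi-modal heatmap hurts twice: through a wrong peak or through a missing one.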

What would settle it

Evaluating the model on a held-out set containing heavy occlusions or strong lighting shifts and measuring ADD accuracy below 80% would show the heatmaps are not sufficiently reliable for the PnP step.

Figures

Figures reproduced from arXiv: 2605.08059 by Amir Masoud Almasi, Ana Parovic, Ashkan Shafiei, Ismail Aljosevic.

**Figure 1.** Pipeline overview with a clear separation of the training process (blue), test flow (red), and depth extension (yellow).

**Figure 2.** Overview of the heatmap regression network.

**Figure 3.** Overview of the proposed extended architecture, which combines RGB and depth features to improve pose estimation.

**Figure 4.** Qualitative results of YOLOv10m object detection (left)

**Figure 5.** Projected keypoints using CPS (left) and FPS (right).

Original abstract

In this paper, we propose a modular framework for 6D pose estimation based on keypoint heatmap regression. Our approach combines YOLOv10m for object detection with a ResNet18-based network that predicts 2D heatmaps from RGB images. Keypoints extracted from these heatmaps are used to estimate the 6D object pose via the PnP RANSAC algorithm. We compare different keypoint selection strategies to assess their impact on pose accuracy. Additionally, we extend the baseline by incorporating depth data using a cross-fusion architecture, which enables interaction between RGB and depth features at multiple stages. We further explore general training improvements, such as experimenting with activation functions and learning rate scheduling strategies to improve model performance. Our best RGB-only model achieved a mean ADD-based accuracy of 84.50%, while the RGB-D fusion model reached 92.41% on the LINEMOD dataset. The code is available at https://github.com/ameermasood/HeatNet.
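The abstract does not say which activation functions were tried. The reference list includes Mish, so as an illustration only, here is Mish next to ReLU in plain Python; Mish is defined as x · tanh(softplus(x)). Whether Mish produced the best-reported model is not stated and is not assumed here.

```python
import math


def softplus(x):
    """log(1 + e^x); for large x this equals x to floating-point precision."""
    return x if x > 20.0 else math.log1p(math.exp(x))


def mish(x):
    """Mish activation: smooth, and slightly negative for negative inputs."""
    return x * math.tanh(softplus(x))


def relu(x):
    """ReLU activation: zero for all negative inputs."""
    return max(0.0, x)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  mish={mish(x):.4f}")
```

Unlike ReLU, Mish passes a small negative signal for negative inputs and is smooth at zero, which is the usual motivation for trying it in regression-style heads.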

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This assembles a working RGB-D 6D pose pipeline from YOLO detection, ResNet heatmaps, and PnP RANSAC that reaches 92% ADD on LINEMOD, but stays an incremental engineering combination.

The main point is a modular pipeline that runs YOLOv10m for detection, feeds RGB or RGB-D images into a ResNet18 to regress keypoint heatmaps, extracts points, and recovers 6D pose via PnP RANSAC. The RGB-D cross-fusion version lifts accuracy from 84.50% to 92.41% mean ADD on LINEMOD after testing keypoint selection strategies and training details like activation functions and learning-rate schedules.

The code is released, which is helpful for anyone wanting to reproduce or build on it. Ablations on those choices give some evidence that the fusion step and certain training tweaks matter. The numbers are reported on a standard public benchmark, so direct comparisons are possible.

The soft spots are that the core pieces (YOLO detection, heatmap regression, PnP) are already common in the pose-estimation literature, so the novelty is mostly in the specific architecture choices and the fusion implementation rather than any new principle. There are no error bars, no detailed breakdown of failure modes, and limited testing outside LINEMOD conditions, which leaves open whether the heatmaps stay clean enough for RANSAC under heavier occlusion or lighting shifts.

This is useful for engineers who need a concrete, runnable baseline for robotics or AR systems and are willing to adapt it. It deserves peer review because the setup is reproducible and the gains are quantified on a shared dataset, though referees would likely ask for more robustness analysis and comparisons to recent alternatives.

Referee Report

1 major / 2 minor

Summary. The paper claims to introduce a modular framework for 6D pose estimation that combines YOLOv10m object detection with a ResNet18 network regressing 2D keypoint heatmaps from RGB images (extended via an RGB-D cross-fusion architecture), followed by PnP RANSAC to recover poses. It evaluates keypoint selection strategies, training choices such as activation functions and learning-rate schedules, and reports best mean ADD accuracies of 84.50% (RGB-only) and 92.41% (RGB-D) on the LINEMOD dataset, with publicly released code.

Significance. If the reported accuracies hold under more detailed statistical scrutiny, the work provides a practical, reproducible modular pipeline showing an approximately 8-point gain from RGB-D cross-fusion over RGB-only keypoint regression on a standard benchmark. The open code and use of established components (YOLOv10m, ResNet18, PnP RANSAC) are strengths that facilitate adoption and incremental improvements in the 6D pose estimation literature.

major comments (1)

[Abstract and Results] The central performance claims (84.50% RGB-only and 92.41% RGB-D mean ADD accuracy on LINEMOD) are stated in the abstract and results without error bars, standard deviations, number of independent runs, or explicit dataset-split details. This directly affects the reliability that can be assigned to the exact numerical improvements reported.
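What the referee is asking for amounts to a few lines of bookkeeping once several seeds have been trained. The sketch below shows the summary step only; the accuracy values are placeholders for illustration and are NOT results from the paper.

```python
import statistics


def summarize_runs(accuracies):
    """Mean and sample standard deviation of per-seed accuracies (in percent)."""
    mean = statistics.mean(accuracies)
    # A single run has no spread to report, so return NaN instead of raising.
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else float("nan")
    return mean, std


# Placeholder values for illustration only; these are NOT results from the paper.
placeholder_runs = [90.0, 92.0, 91.0]
mean, std = summarize_runs(placeholder_runs)
print(f"{mean:.2f} ± {std:.2f} over {len(placeholder_runs)} runs")
```

With a spread of this kind in hand, a reader could judge whether the gap between two configurations is larger than the run-to-run noise.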

minor comments (2)

[Abstract] The abstract refers to 'different keypoint selection strategies' and 'training improvements' but does not identify which specific choices produced the best-reported models; adding this information would make the performance numbers immediately more interpretable.
[Method] The RGB-D cross-fusion architecture is described at a high level as enabling 'interaction between RGB and depth features at multiple stages'; a figure or pseudocode clarifying the fusion points would improve clarity for readers.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments. We address the concern about statistical details of the reported accuracies below and will incorporate clarifications in the revised manuscript.

Point-by-point responses

Referee: [Abstract and Results] The central performance claims (84.50% RGB-only and 92.41% RGB-D mean ADD accuracy on LINEMOD) are stated in the abstract and results without error bars, standard deviations, number of independent runs, or explicit dataset-split details. This directly affects the reliability that can be assigned to the exact numerical improvements reported.

Authors: We agree that more explicit details would strengthen the presentation. The LINEMOD experiments follow the standard train/test splits from the original LINEMOD dataset paper (specific object instances and image counts per object as used throughout the 6D pose estimation literature). We will make this split information explicit in the methods and results sections of the revision. The reported mean ADD accuracies come from single training runs per model configuration, which is common practice given the computational cost of training ResNet18-based heatmap regressors. We will add a clarifying statement to this effect. While we did not originally compute error bars across multiple random seeds, we can include standard deviations from a small number of additional runs for the final reported models if space permits; these updates will appear in both the abstract and results.

revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with direct benchmark measurements

full rationale

The manuscript describes a modular empirical pipeline (YOLOv10m detection, ResNet18 heatmap regression, PnP RANSAC pose recovery, optional RGB-D cross-fusion) and reports concrete ADD accuracies on the standard LINEMOD dataset (84.50% RGB-only, 92.41% fused). No derivation chain, uniqueness theorem, or self-citation is invoked to justify the central results; the accuracies are direct empirical outputs from training and evaluation on public benchmarks, with ablations and open code provided. No step reduces by construction to author-defined quantities or prior self-citations.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard computer-vision assumptions and trained neural-network weights; no new entities are postulated.

free parameters (2)

trained network weights
Standard deep-learning parameters fitted to training data.
keypoint selection strategy
Multiple strategies are compared and the best is chosen post-experiment.

axioms (1)

domain assumption: PnP RANSAC recovers accurate 6D pose from sufficiently accurate 2D-3D correspondences
Invoked when converting heatmaps to final pose estimates.
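The assumption can be made tangible by looking at what RANSAC scores: for a candidate pose, each 3D model keypoint is projected through a pinhole camera and compared with its detected 2D location, and a correspondence counts as an inlier if the reprojection error is small. The sketch below shows only that scoring step, not a PnP solver; the intrinsics, the 3-pixel threshold, and the function names are assumptions.

```python
import math


def project(point, R, t, fx, fy, cx, cy):
    """Pinhole projection of a 3D model point under pose (R, t) to pixel coordinates."""
    X, Y, Z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    return fx * X / Z + cx, fy * Y / Z + cy


def inliers(points_3d, points_2d, R, t, intrinsics, threshold_px=3.0):
    """Indices whose reprojection error under (R, t) is within `threshold_px`."""
    keep = []
    for i, (p3, p2) in enumerate(zip(points_3d, points_2d)):
        if math.dist(project(p3, R, t, *intrinsics), p2) <= threshold_px:
            keep.append(i)
    return keep


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (hypothetical camera)

model_points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
# Detected keypoints: the third one is 20 px away from where the pose predicts it.
detections = [(320.0, 240.0), (371.0, 240.0), (300.0, 290.0)]

good = inliers(model_points, detections, IDENTITY, [0.0, 0.0, 1.0], intrinsics)
```

If too many heatmap peaks land in the outlier set, no candidate pose gathers enough support, which is where the premise would fail.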

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1] Louis Aing and Weng Nam Lie. Detecting object surface keypoints from a single rgb image via deep learning network for 6-dof pose estimation. IEEE Access, 9:77729–77741, 2021.

[2] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model-based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian Conference on Computer Vision, pages 548–562. Springer, 2012.

[3] Diganta Misra. Mish: A self regularized non-monotonic activation function, 2020.

[4] Sida Peng, Yuan Liu, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Pvnet: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4561–4570, 2019.

[5] Xiaojing Sun, Bin Wang, Longxiang Huang, Qian Zhang, Sulei Zhu, and Yan Ma. Crossfunet: Rgb and depth cross-fusion network for hand pose estimation. Sensors, 21(18):17, 2021.

[6] Chien-Yao Wang, Hong-Yuan Mark Liao, I-Hau Yeh, and Youn-Long Lin. YOLOv10: Real-Time Object Detection and Recognition. https://github.com/WongKinYiu/yolov10, 2024. Accessed: 2025-06-05.

[7] Zelin Zhao, Gu Peng, Haoyu Wang, Hao-Shu Fang, Cewu Li, and Caiming Lu. Estimating 6d pose from localizing designated surface keypoints. arXiv preprint arXiv:1809.08550, 2018.

[8] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7050–7059, 2019.
