pith. sign in

arxiv: 2605.23523 · v1 · pith:X7ZK2KEJnew · submitted 2026-05-22 · 💻 cs.CV

ComPose: When to Trust Hands for Object Pose Tracking

Pith reviewed 2026-05-25 05:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords 6DoF object trackinghand-aware pose estimationRGB videofoundation modelstemporal consistencyocclusion handlingrobot manipulation
0
0 comments X

The pith

ComPose tracks 6DoF object poses from RGB video by treating hand motions as complementary cues rather than occluders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that object pose tracking in videos can be made robust to hand occlusions by using hand motions as helpful signals instead of obstacles. It introduces a framework that pulls cues from foundation models for both objects and hands, selects useful hand joints on the fly, fuses them for motion estimates, refines with geometry and a learned fix, and keeps rotation and translation consistent over time. This produces steady 3D object paths without needing extra smoothing. A reader would care because reliable object motion from ordinary video supports embodied AI and lets robots learn from watching people.

Core claim

ComPose recovers a variety of object motions over time by combining object and hand cues from foundation models within a unified tracking pipeline, adaptively selects informative hand joints, refines motion using visible geometric evidence and a learned correction, and enforces temporal consistency over rotation and translation, yielding stable 3D object trajectories without external smoothing and remaining robust under severe hand occlusion.

What carries the argument

The unified tracking pipeline that adaptively selects informative hand joints and fuses object- and hand-derived cues for motion estimation.

If this is right

  • Yields stable 3D object trajectories without any external smoothing.
  • Remains robust under severe hand occlusion and geometric ambiguity.
  • The resulting trajectories transfer effectively to downstream robot manipulation by enabling robots to reconstruct human actions from online videos.
  • The method is accurate and efficient.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The cue-fusion strategy could extend to other moving occluders if similar motion correlations hold.
  • Performance would likely scale with improvements in the underlying foundation models for hands and objects.
  • Online versions could support real-time robot imitation of observed manipulations.

Load-bearing premise

That hand motions supply reliable complementary motion cues for the object even under heavy occlusion and that the foundation models produce sufficiently accurate hand and object cues to support the adaptive selection and fusion steps.

What would settle it

Video sequences with severe hand occlusions where the tracking accuracy of ComPose does not exceed that of object-only tracking methods.

Figures

Figures reproduced from arXiv: 2605.23523 by Dohyeon Lee, Hae-Gon Jeon, Hokyun Im, Inhwan Bae, Jisu Shin, JunGyu Lee, Junoh Lee, Youngwoon Lee.

Figure 1
Figure 1. Figure 1: Robot manipulation guided by the internet video. (a) Given an instructional painting video "Bob Ross Painting" from Youtube, our method (b) predicts Bob’s brush trajectory and (c) the estimated smooth trajectory is directly transferred to a real-world robot for painting. ∗Corresponding Author Preprint. arXiv:2605.23523v1 [cs.CV] 22 May 2026 [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: An illustration of the proposed object tracking pipeline and its application to robot [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative results from the state-of-the-art models on three datasets: DexYCB, HO3D-v2, [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Analysis on the predicted α and w along with the input frames. surface is visible, but degrades substantially under severe hand occlusion or limited geometric support. Hand-only tracking is more robust to object occlusion, but becomes less reliable when the articulated hand motion or re-grasping violates the rigid-motion assumption. These results confirm that object geometry and hand motion provide complem… view at source ↗
Figure 5
Figure 5. Figure 5: Robot manipulation from internet videos. We overlay the predicted hand joints and mesh [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualization of mask tracking using SAM3. [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of mesh reconstruction and pose initialization using SAM3D. [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Visualization of point map using Flow3R. [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Visualization of 2D hand joints using WiLoR. [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Robot manipulation from Internet video. We overlay the predicted hand joints and mesh [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
read the original abstract

Reconstructing the motion of objects from videos is a key component for embodied AI and robot manipulation. While diverse approaches to object pose tracking have been studied, they rely heavily on strong external priors, such as depth data or 3D templates, and remain highly vulnerable to severe occlusions by hand grasps despite the use of explicit masks. In this work, we present ComPose, a 6DoF object tracking framework designed for hand-aware object pose estimation from RGB video. Rather than treating the hand purely as an occluder, our method harmonizes hand motions as a \textit{complementary cue} for object tracking. In detail, we recover a variety of object motions over time by combining object and hand cues from foundation models within a unified tracking pipeline. Here, ComPose adaptively selects informative hand joints, combines object- and hand-derived cues for motion estimation, and refines the resulting object motion using visible geometric evidence and a learned correction. We further enforce the temporal consistency over both rotation and translation, yielding stable 3D object trajectories over time without any external smoothing. Extensive experiments show that our method is accurate, efficient, and robust under severe hand occlusion and geometric ambiguity. In addition, the resulting trajectories can also effectively transfer to downstream robot manipulation by enabling robots to reconstruct human actions from online videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents ComPose, a 6DoF object pose tracking framework from RGB video for hand-object interactions. It treats hands as a complementary cue rather than pure occluder by combining object and hand cues from foundation models in a unified pipeline: adaptively selecting informative hand joints, fusing object- and hand-derived motion estimates, refining via visible geometric evidence plus a learned correction, and enforcing temporal consistency on both rotation and translation to yield stable trajectories without external smoothing. The work claims accuracy, efficiency, and robustness under severe hand occlusion, with downstream utility for robot manipulation from online videos.

Significance. If the central claims hold, the approach could meaningfully advance hand-aware object tracking for embodied AI and robotics by reducing dependence on depth sensors or 3D templates and by converting hand occlusion into usable motion signal. The explicit avoidance of post-hoc smoothing while maintaining temporal consistency and the integration of foundation-model cues with a learned correction step are concrete strengths worth highlighting. The absence of any reported cue-degradation curves or occlusion-specific ablations in the visible text, however, leaves the practical significance conditional on verification of the hand-cue reliability assumption.

major comments (1)
  1. [Abstract] Abstract: The claim of being 'robust under severe hand occlusion' is load-bearing for the adaptive joint selection and cue-fusion steps, yet the text supplies no quantitative characterization of hand-pose error versus occlusion fraction nor an ablation that isolates hand-cue contribution once object visibility falls below a stated threshold. Without such evidence the fusion and selection modules rest on an unverified precondition about foundation-model cue quality.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. The point raised about supporting the robustness claim with quantitative evidence is well-taken, and we will incorporate the requested analyses to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim of being 'robust under severe hand occlusion' is load-bearing for the adaptive joint selection and cue-fusion steps, yet the text supplies no quantitative characterization of hand-pose error versus occlusion fraction nor an ablation that isolates hand-cue contribution once object visibility falls below a stated threshold. Without such evidence the fusion and selection modules rest on an unverified precondition about foundation-model cue quality.

    Authors: We agree that the abstract claim would be better supported by explicit quantitative evidence. In the revised manuscript we will add (1) a plot or table reporting hand-pose estimation error as a function of occlusion fraction on the evaluation sequences and (2) an ablation that isolates the contribution of the hand-derived motion cues when object visibility drops below a defined threshold (e.g., <30 % visible surface). These additions will directly verify the reliability of the foundation-model cues used by the adaptive selection and fusion modules. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic pipeline with no derivation chain

full rationale

The paper describes an engineering pipeline that fuses foundation-model cues via adaptive joint selection, cue combination, geometric refinement, learned correction, and temporal consistency enforcement. No equations, parameter-fitting steps, or first-principles derivations appear in the abstract or described method; the central claims rest on empirical validation rather than any reduction of outputs to inputs by construction. External benchmarks (experiments under occlusion) are invoked, so the work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no visible free parameters, axioms, or invented entities; ledger left empty.

pith-pipeline@v0.9.0 · 5792 in / 1100 out tokens · 30804 ms · 2026-05-25T05:11:29.279959+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

97 extracted references · 97 canonical work pages · 4 internal anchors

  1. [1]

    In: Proc

    Besl, P.J., McKay, N.D.: Method for registration of 3-d shapes. In: Proc. SPIE (1992)

  2. [2]

    Carion, N., Gustafson, L., Hu, Y .T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V ., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y ., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K...

  3. [3]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV) (2023)

    Castro, P., Kim, T.K.: Crt-6d: Fast 6d object pose estimation with cascaded refinement transformers. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV) (2023)

  4. [4]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

    Chao, Y .W., Yang, W., Xiang, Y ., Molchanov, P., Handa, A., Tremblay, J., Narang, Y .S., Van Wyk, K., Iqbal, U., Birchfield, S., et al.: Dexycb: A benchmark for capturing hand grasping of objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  5. [5]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

    Chen, H., Wang, P., Wang, F., Tian, W., Xiong, L., Li, H.: Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  6. [6]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

    Chen, K., Dou, Q.: Sgpa: Structure-guided prior adaptation for category-level 6d object pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  7. [7]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

    Chen, W., Jia, X., Chang, H.J., Duan, J., Leonardis, A.: G2l-net: Global to local network for real-time 6d pose estimation with embedding vector features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  8. [8]

    SAM 3D: 3Dfy Anything in Images

    Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., et al.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025)

  9. [9]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2020)

    Chen, X., Dong, Z., Song, J., Geiger, A., Hilliges, O.: Category level object pose estimation via neural analysis-by-synthesis. In: Proceedings of the European Conference on Computer Vision (ECCV) (2020)

  10. [10]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

    Choy, C., Dong, W., Koltun, V .: Deep global registration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  11. [11]

    arXiv preprint (2026)

    Cong, Z., Zhao, Q., Jeon, M., Tulsiani, S.: Flow3r: Factored flow prediction for scalable visual geometry learning. arXiv preprint (2026)

  12. [12]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13142–13153 (2023)

  13. [13]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

    Di, Y ., Manhardt, F., Wang, G., Ji, X., Navab, N., Tombari, F.: So-pose: Exploiting self-occlusion for direct 6d pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  14. [14]

    arXiv preprint (2018)

    Do, T.T., Cai, M., Pham, T., Reid, I.: Deep-6dpose: Recovering 6d object pose from a single rgb image. arXiv preprint (2018)

  15. [15]

    In: 2022 International Conference on Robotics and Automation (ICRA)

    Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K., McHugh, T.B., Vanhoucke, V .: Google scanned objects: A high-quality dataset of 3d scanned household items. In: 2022 International Conference on Robotics and Automation (ICRA). pp. 2553–2560. Ieee (2022)

  16. [16]

    arXiv preprint (2026)

    Elflein, S., Li, R., Agostinho, S., Gojcic, Z., Leal-Taixé, L., Zhou, Q., Osep, A.: VGG-T 3: Offline feed-forward 3d reconstruction at scale. arXiv preprint (2026)

  17. [17]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Fan, Z., Pan, P., Wang, P., Jiang, Y ., Xu, D., Wang, Z.: Pope: 6-dof promptable pose estimation of any object in any scene with one reference. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  18. [18]

    Psychometrika (1975)

    Gower, J.C.: Generalized procrustes analysis. Psychometrika (1975)

  19. [19]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

    Hai, Y ., Song, R., Li, J., Ferstl, D., Hu, Y .: Pseudo flow consistency for self-supervised 6d object pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

  20. [20]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023) 10

    Hai, Y ., Song, R., Li, J., Salzmann, M., Hu, Y .: Rigidity-aware detection for 6d object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023) 10

  21. [21]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

    Hampali, S., Rad, M., Oberweger, M., Lepetit, V .: Honnotate: A method for 3d annotation of hand and object poses. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  22. [22]

    Advances in Neural Information Processing Systems (2022)

    He, X., Sun, J., Wang, Y ., Huang, D., Bao, H., Zhou, X.: Onepose++: Keypoint-free one-shot object pose estimation without cad models. Advances in Neural Information Processing Systems (2022)

  23. [23]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

    He, Y ., Huang, H., Fan, H., Chen, Q., Sun, J.: Ffb6d: A full flow bidirectional fusion network for 6d pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  24. [24]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

    He, Y ., Sun, W., Huang, H., Liu, J., Fan, H., Sun, J.: Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  25. [25]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

    Hodan, T., Barath, D., Matas, J.: Epos: Estimating 6d pose of objects with symmetries. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  26. [26]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

    Hu, Y ., Hugonot, J., Fua, P., Salzmann, M.: Segmentation-driven 6d object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  27. [27]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

    Huang, J., Lin, H., Wang, T., Fu, Y ., Xue, X., Zhu, Y .: Cap-net: A unified network for 6d pose and size estimation of categorical articulated parts from a single rgb-d image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

  28. [28]

    In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2022)

    Irshad, M.Z., Kollar, T., Laskey, M., Stone, K., Kira, Z.: Centersnap: Single-shot multi-object 3d shape reconstruction and categorical 6d pose and size estimation. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2022)

  29. [29]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

    Iwase, S., Liu, X., Khirodkar, R., Yokota, R., Kitani, K.M.: Repose: Fast 6d object pose refinement via deep texture rendering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  30. [30]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

    Jiang, X., Li, D., Chen, H., Zheng, Y ., Zhao, R., Wu, L.: Uni6d: A unified cnn framework without projection breakdown for 6d pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  31. [31]

    In: European conference on computer vision

    Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: Cotracker: It is better to track together. In: European conference on computer vision. pp. 18–35. Springer (2024)

  32. [32]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2017)

    Kehl, W., Manhardt, F.E., Tombari, F., Ilic, S., Navab, N.: Ssd-6d: Making rgb-based 3d detection and 6d pose estimation great again. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2017)

  33. [33]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2015)

    Kendall, A., Grimes, M., Cipolla, R.: Posenet: A convolutional network for real-time 6-dof camera relocalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2015)

  34. [34]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2020)

    Labbé, Y ., Carpentier, J., Aubry, M., Sivic, J.: Cosypose: Consistent multi-view multi-object 6d pose estimation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2020)

  35. [35]

    In: Proceedings of the Conference on Robot Learning (CoRL) (2022)

    Labbé, Y ., Manuelli, L., Mousavian, A., Tyree, S., Birchfield, S., Tremblay, J., Carpentier, J., Aubry, M., Fox, D., Sivic, J.: Megapose: 6d pose estimation of novel objects via render & compare. In: Proceedings of the Conference on Robot Learning (CoRL) (2022)

  36. [36]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

    Lee, T., Wen, B., Kang, M., Kang, G., Kweon, I.S., Yoon, K.J.: Any6d: Model-free 6d pose estimation of novel objects. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

  37. [37]

    International journal of computer vision81(2), 155–166 (2009)

    Lepetit, V ., Moreno-Noguer, F., Fua, P.: Ep n p: An accurate o (n) solution to the p n p problem. International journal of computer vision81(2), 155–166 (2009)

  38. [38]

    In: Proceedings of the Conference on Robot Learning (CoRL) (2023)

    Li, G., Li, Y ., Ye, Z., Zhang, Q., Kong, T., Cui, Z., Zhang, G.: Generative category-level shape and pose estimation with semantic primitives. In: Proceedings of the Conference on Robot Learning (CoRL) (2023)

  39. [39]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2018) 11

    Li, Y ., Wang, G., Ji, X., Xiang, Y ., Fox, D.: Deepim: Deep iterative matching for 6d pose estimation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018) 11

  40. [40]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

    Li, Z., Wang, G., Ji, X.: Cdpn: Coordinates-based disentangled pose network for real-time rgb-based 6-dof object pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

  41. [41]

    In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2023)

    Li, Z., Stamos, I.: Depth-based 6dof object pose estimation using swin transformer. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2023)

  42. [42]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Lin, J., Liu, L., Lu, D., Jia, K.: Sam-6d: Segment anything model meets zero-shot 6d object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  43. [43]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2022)

    Lin, J., Wei, Z., Ding, C., Jia, K.: Category-level 6d object pose and size estimation using self-supervised deep prior deformation networks. In: Proceedings of the European Conference on Computer Vision (ECCV) (2022)

  44. [44]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

    Lin, J., Wei, Z., Li, Z., Xu, S., Jia, K., Li, Y .: Dualposenet: Category-level 6d object pose and size estimation using dual pose network with refined learning of pose consistency. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  45. [45]

    In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2022)

    Lin, Y ., Tremblay, J., Tyree, S., Vela, P.A., Birchfield, S.: Single-stage keypoint-based category-level object pose estimation from an rgb image. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2022)

  46. [46]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

    Lipson, L., Teed, Z., Goyal, A., Deng, J.: Coupled iterative refinement for 6d multi-object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  47. [47]

    Psychometrika (1976)

    Lissitz, R.W., Schönemann, P.H., Lingoes, J.C.: A solution to the weighted procrustes problem in which the transformation is in agreement with the loss function. Psychometrika (1976)

  48. [48]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

    Liu, M., Li, S., Chhatkuli, A., Truong, P., Van Gool, L., Tombari, F.: One2any: One-reference 6d pose estimation for any object. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

  49. [49]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2022)

    Liu, X., Wang, G., Li, Y ., Ji, X.: Catre: Iterative point clouds alignment for category-level object pose refinement. In: Proceedings of the European Conference on Computer Vision (ECCV) (2022)

  50. [50]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2022)

    Liu, Y ., Wen, Y ., Peng, S., Lin, C., Long, X.F., Komura, T., Wang, W.: Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images. In: Proceedings of the European Conference on Computer Vision (ECCV) (2022)

  51. [51]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

    Liu, Y ., Liu, Y ., Jiang, C., Lyu, K., Wan, W., Shen, H., Liang, B., Fu, Z., Wang, H., Yi, L.: Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  52. [52]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  53. [53]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

    Moon, S., Son, H., Hur, D., Kim, S.: Co-op: Correspondence-based novel object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

  54. [54]

    arXiv preprint arXiv:2512.15508 (2025)

    Moreau, A., Shaw, R., Nazarczuk, M., Shin, J., Tanay, T., Zhang, Z., Xu, S., Pérez-Pellitero, E.: Off the grid: Detection of primitives for feed-forward 3d gaussian splatting. arXiv preprint arXiv:2512.15508 (2025)

  55. [55]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

    Mousavian, A., Eppner, C., Fox, D.: 6-dof graspnet: Variational grasp generation for object manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

  56. [56]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Nguyen, V .N., Groueix, T., Salzmann, M., Lepetit, V .: Gigapose: Fast and robust novel object pose estimation via one correspondence. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  57. [57]

    Oquab, M., Darcet, T., Moutakanni, T., et al.: Dinov2: Learning robust visual features without supervision (2024)

  58. [58]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2024)

    Örnek, E.P., Labbé, Y ., Tekin, B., Ma, L., Keskin, C., Forster, C., Hodan, T.: Foundpose: Unseen object pose estimation with foundation features. In: Proceedings of the European Conference on Computer Vision (ECCV) (2024)

  59. [59]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 12

    Park, K., Mousavian, A., Xiang, Y ., Fox, D.: Latentfusion: End-to-end differentiable reconstruction and rendering for unseen object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020) 12

  60. [60]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

    Park, K., Patten, T., Vincze, M.: Pix2pose: Pixel-wise coordinate regression of objects for 6d pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

  61. [61]

    In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2017)

    Pavlakos, G., Zhou, X., Chan, A., Derpanis, K.G., Daniilidis, K.: 6-dof object pose from semantic keypoints. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2017)

  62. [62]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

    Peng, S., Liu, Y ., Huang, Q., Zhou, X., Bao, H.: Pvnet: Pixel-wise voting network for 6dof pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  63. [63]

    arXiv preprint (2025)

    Ponimatkin, G., Cífka, M., Souˇcek, T., Fourmy, M., Labbé, Y ., Petrik, V ., Sivic, J.: 6d object pose tracking in internet videos for robotic manipulation. arXiv preprint (2025)

  64. [64]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

    Potamias, R.A., Zhang, J., Deng, J., Zafeiriou, S.: Wilor: End-to-end 3d hand localization and recon- struction in-the-wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

  65. [65]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2017)

    Rad, M., Lepetit, V .: Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2017)

  66. [66]

    arXiv preprint (2022)

    Romero, J., Tzionas, D., Black, M.J.: Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint (2022)

  67. [67]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

    Shugurov, I., Li, F., Busam, B., Ilic, S.: Osop: A multi-stage one shot object pose estimation framework. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  68. [68]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2021)

    Shugurov, I., Zakharov, S., Ilic, S.: Dpodv2: Dense correspondence-based 6 dof pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2021)

  69. [69]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

    Song, C., Song, J., Huang, Q.: Hybridpose: 6d object pose estimation under hybrid representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  70. [70]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

    Song, X., Jin, L., Zhang, Z., Li, J., Zhong, F., Zhang, G., Qin, X.: Prior-free 3d object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

  71. [71]

    In: 2012 IEEE/RSJ international conference on intelligent robots and systems

    Sturm, J., Engelhard, N., Endres, F., Burgard, W., Cremers, D.: A benchmark for the evaluation of rgb-d slam systems. In: 2012 IEEE/RSJ international conference on intelligent robots and systems. pp. 573–580. IEEE (2012)

  72. [72]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

    Su, Y ., Saleh, M., Fetzer, T., Rambach, J., Navab, N., Busam, B., Stricker, D., Tombari, F.: Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  73. [73]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

    Sun, J., Wang, Z., Zhang, S., He, X., Zhao, H., Zhang, G., Zhou, X.: Onepose: One-shot object pose estimation without cad models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  74. [74]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2020)

    Tian, M., Ang Jr, M.H., Lee, G.H.: Shape prior deformation for categorical 6d object pose and size estimation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2020)

  75. [75]

    IEEE Transactions on pattern analysis and machine intelligence13(4), 376–380 (2002)

    Umeyama, S.: Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on pattern analysis and machine intelligence13(4), 376–380 (2002)

  76. [76]

    In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2020)

    Wang, C., Martín-Martín, R., Xu, D., Lv, J., Lu, C., Fei-Fei, L., Savarese, S., Zhu, Y .: 6-pack: Category- level 6d pose tracker with anchor-based keypoints. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) (2020)

  77. [77]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

    Wang, C., Xu, D., Zhu, Y ., Martín-Martín, R., Lu, C., Fei-Fei, L., Savarese, S.: Densefusion: 6d object pose estimation by iterative dense fusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  78. [78]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

    Wang, G., Manhardt, F., Tombari, F., Ji, X.: Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  79. [79]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 13

    Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., Guibas, L.J.: Normalized object coordinate space for category-level 6d object pose and size estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 13

  80. [80]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

Showing first 80 references.