AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment

Anna Šárová Mikeštíková; Josef Sivic; Martin Cífka; Médéric Fourmy; Vladimir Petrik

arxiv: 2512.20538 · v2 · pith:BHSM6EXFnew · submitted 2025-12-23 · 💻 cs.CV


Pith reviewed 2026-05-22 12:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords 6D pose estimation · multi-view RGB · feature-metric alignment · generalizable pose estimation · extrinsic calibration · BOP benchmark · industrial object pose


The pith

AlignPose estimates a single consistent 6D object pose from multiple extrinsically calibrated RGB views by minimizing feature discrepancies between rendered and observed images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AlignPose as a way to handle depth ambiguity, clutter, and occlusions that limit single-view RGB pose estimation. It aggregates information across several calibrated views without any object-specific training or symmetry labels. The central mechanism refines one world-frame pose by simultaneously reducing the mismatch between features rendered from the current pose and the actual features seen in every input view. Experiments across six datasets using the BOP benchmark show stronger results than prior methods, with the largest gains on industrial scenes where multiple views are already available in practice.

Core claim

AlignPose aggregates information from multiple extrinsically calibrated RGB views and produces a single consistent world-frame object pose without object-specific training or symmetry annotation. Its key component is a multi-view feature-metric refinement that optimizes the pose by minimizing the feature discrepancy between on-the-fly rendered object features and the observed image features across all views at the same time.

What carries the argument

Multi-view feature-metric refinement: a procedure that optimizes one shared world-frame object pose by minimizing feature discrepancy between rendered object features and observed image features from every calibrated view simultaneously.
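As a rough illustration of the objective described above, the following minimal NumPy sketch evaluates a joint feature-metric cost for one candidate world-frame pose. It is not the authors' implementation: the function names, the nearest-neighbour feature lookup, and the Huber loss are assumptions chosen for brevity, and the paper's feature extractor, renderer, and optimizer are not reproduced here.

```python
import numpy as np

def project(K, T_cam_world, T_world_obj, pts_obj):
    """Project object-frame 3D points into one camera; returns (N, 2) pixels."""
    pts_h = np.hstack([pts_obj, np.ones((len(pts_obj), 1))])
    pts_cam = (T_cam_world @ T_world_obj @ pts_h.T)[:3]  # (3, N) camera coords
    uv = K @ pts_cam
    return (uv[:2] / uv[2]).T

def sample_nearest(feat_map, uv):
    """Nearest-neighbour lookup in an (H, W, D) feature map (assumed sampler)."""
    h, w, _ = feat_map.shape
    u = np.clip(np.rint(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.rint(uv[:, 1]).astype(int), 0, h - 1)
    return feat_map[v, u]

def huber(r, delta=1.0):
    """Huber loss, used here as a stand-in for the robust function rho."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def multiview_cost(T_world_obj, views, pts_obj, ref_feats):
    """Sum of robust feature residuals over all calibrated views.

    views: list of (K, T_cam_world, feat_map) tuples, one per camera.
    pts_obj: (N, 3) object-frame points; ref_feats: (N, D) reference features.
    """
    total = 0.0
    for K, T_cam_world, feat_map in views:
        uv = project(K, T_cam_world, T_world_obj, pts_obj)
        observed = sample_nearest(feat_map, uv)
        residual = np.linalg.norm(ref_feats - observed, axis=1)
        total += huber(residual).sum()
    return total
```

Because every view shares the same `T_world_obj`, a pose that explains only one image is penalized by the others. A practical version would interpolate features differentiably, handle points behind the camera or outside the image, and minimize the cost with an iterative solver rather than only evaluating it.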

If this is right

The approach generalizes to objects never seen during training because no object-specific training or symmetry annotation is required.
Performance improves most on industrial datasets where multiple calibrated views are already present in the capture setup.
It reduces the impact of single-view failures such as depth ambiguity and heavy occlusions by enforcing consistency across views.
The same refinement procedure can be applied on top of any initial pose estimates obtained from single-view networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Jointly estimating camera poses along with the object pose could remove the need for pre-calibration in less controlled settings.
The feature-metric objective may transfer to other multi-view tasks such as scene reconstruction or dynamic object tracking.
Replacing the current feature extractor with stronger self-supervised backbones could further improve accuracy on textureless industrial parts.

Load-bearing premise

The input views must be extrinsically calibrated with known relative camera poses.

What would settle it

Running the method on a multi-view dataset where relative camera poses are deliberately perturbed or withheld would show whether the reported accuracy gains require precise extrinsic calibration.
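One way such a perturbation experiment could be set up is sketched below. This is a hypothetical helper, not part of the paper: the function names, the Gaussian axis-angle noise model, and the choice to left-multiply the noise onto the extrinsic are all assumptions.

```python
import numpy as np

def rodrigues(rotvec):
    """Rotation matrix from an axis-angle vector (Rodrigues' formula)."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def perturb_extrinsic(T_cam_world, rot_sigma_deg, trans_sigma, rng):
    """Apply small Gaussian rotation and translation noise to a 4x4 extrinsic.

    rot_sigma_deg: per-axis std of the axis-angle noise, in degrees.
    trans_sigma: per-axis std of the translation noise, in scene units.
    """
    noise = np.eye(4)
    noise[:3, :3] = rodrigues(rng.normal(0.0, np.deg2rad(rot_sigma_deg), 3))
    noise[:3, 3] = rng.normal(0.0, trans_sigma, 3)
    return noise @ T_cam_world
```

Sweeping `rot_sigma_deg` and `trans_sigma` over a few values, re-running the refinement with the perturbed extrinsics, and plotting the resulting benchmark scores would show how quickly accuracy degrades as calibration quality drops.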

Figures

Figures reproduced from arXiv: 2512.20538 by Anna Šárová Mikeštíková, Josef Sivic, Martin Cífka, Médéric Fourmy, Vladimir Petrik.

**Figure 1.** Our feature-based multi-view pose estimation pipeline. Single-view pose candidates are first generated independently for each view using state-of-the-art pose estimation methods (e.g. Labbé et al. [23], Örnek et al. [38]). During aggregation, candidates are transformed into a common coordinate frame, and non-maximum suppression (NMS) is applied to eliminate redundant detections of the same object. Th…

**Figure 2.** Multi-view feature-metric refinement. This figure illustrates a two-view feature-metric refinement. The two cameras show cropped query features (mapping the first three PCA components to RGB values). The partial multi-color point clouds represent the registered features F_CO, shifted towards their corresponding camera for visualization purposes. The projection of registered feature 3D coordinates x_i is re…

**Figure 3.** Qualitative results of multi-view refinement on the YCB-V and HouseCat6D datasets. 1st column: one of the input images. 2nd column: single-view pose estimates in blue, ground-truth shown using textured models. 3rd column: results of CosyPose multi-view baseline. 4th column: results of our multi-view method. CosyPose and our method both use single-view pose candidates from four input views. Please see the m…

**Figure 4.** Qualitative results of multi-view refinement on the T-LESS dataset. 1st column: one of the four input images. 2nd column: single-view pose estimates obtained by MegaPose [23] in blue; ground-truth poses in red. 3rd column: results of CosyPose multi-view baseline. 4th column: results of our multi-view pose estimation method. CosyPose and our method both use MegaPose single-view pose candidates from four inp…

**Figure 5.** Non Maximum Suppression ablation for the YCB-V dataset. We perform NMS with different geometric representations of objects (AA: axis-aligned boxes; O: oriented boxes; CH: convex hulls; S: bounding spheres).

**Figure 6.** Non Maximum Suppression ablation for the T-LESS dataset. We perform NMS with different geometric representations of objects (AA: axis-aligned boxes; O: oriented boxes; CH: convex hulls; S: bounding spheres). In contrast, high thresholds (e.g., 0.8) do not sufficiently suppress duplicate hypotheses, leading to reduced precision. Excluding these extreme threshold values, performance varies only moderately…

**Figure 8.** Qualitative results of multi-view refinement on the HouseCat6D dataset. Column 1: Representative input image from the four-view set. Column 2: Single-view estimates (blue) overlaid on ground-truth (textured models). Column 3: Multi-view baseline (CosyPose). Column 4: Our proposed refinement. Both CosyPose and our method utilize single-view estimates from four input views. We highlight the improved pose acc…

**Figure 9.** Qualitative results of multi-view refinement on the ITODD dataset. Each row shows one scene observed from four viewpoints. Contours indicate predicted object poses. Single-view pose estimates (red) are computed using only the Main View; CosyPose multi-view estimates are shown in blue; and our multi-view estimates are shown in green. Both our method and CosyPose use all four views. Example 1: The single-vie…
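The aggregation step named in the Figure 1 caption (move candidates into a common frame, then apply NMS) can be sketched as follows. This is only an illustration under stated assumptions: it uses the bounding-sphere representation listed in the Figure 5 and 6 captions, a greedy score-ordered suppression, and hypothetical names (`sphere_iou`, `aggregate_candidates`, `iou_thresh`). It also assumes all candidates belong to the same object model and share one sphere radius.

```python
import numpy as np

def sphere_iou(c1, c2, r):
    """Intersection-over-union of two equal-radius spheres centred at c1, c2."""
    d = np.linalg.norm(c1 - c2)
    if d >= 2 * r:
        return 0.0
    inter = np.pi / 12.0 * (4 * r + d) * (2 * r - d) ** 2  # lens volume
    vol = 4.0 / 3.0 * np.pi * r ** 3
    return inter / (2 * vol - inter)

def aggregate_candidates(cands, radius, iou_thresh=0.3):
    """Greedy NMS over single-view candidates expressed in the world frame.

    cands: list of (score, T_world_cam, T_cam_obj) with 4x4 transforms.
    Returns the kept (score, T_world_obj) pairs, highest score first.
    """
    world = [(s, T_wc @ T_co) for s, T_wc, T_co in cands]
    world.sort(key=lambda c: -c[0])
    kept = []
    for score, T in world:
        overlaps = (sphere_iou(T[:3, 3], Tk[:3, 3], radius) for _, Tk in kept)
        if all(o <= iou_thresh for o in overlaps):
            kept.append((score, T))
    return kept
```

Two candidates from different cameras that land almost on top of each other in the world frame collapse into the higher-scoring one, while well-separated instances survive. The threshold value is a free choice here; the figure captions indicate that very high thresholds leave duplicates in place.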

read the original abstract

Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose by minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on six datasets (YCB-V, T-LESS, HouseCat6D, ITODD-MV, IPD, XYZ-IBD) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AlignPose adds a multi-view feature-metric refinement for generalizable 6D pose without per-object training, but the gains rest on accurate known camera calibration.


AlignPose's core idea is a refinement step that takes multiple extrinsically calibrated RGB views and optimizes one consistent world-frame pose. It does this by minimizing feature discrepancies between on-the-fly rendered object features and the observed images across all views at the same time. No object-specific training or symmetry annotation is required, which sets it apart from many prior single-view or multi-view methods that lean on those crutches.

Referee Report

1 major / 2 minor

Summary. The paper introduces AlignPose, a generalizable 6D object pose estimation approach that aggregates information from multiple extrinsically calibrated RGB views without requiring object-specific training or symmetry annotations. Its core contribution is a multi-view feature-metric refinement procedure that optimizes a single consistent world-frame pose by simultaneously minimizing feature discrepancies between on-the-fly rendered object features and observed image features across all views. The authors report extensive BOP-benchmark experiments on six datasets (YCB-V, T-LESS, HouseCat6D, ITODD-MV, IPD, XYZ-IBD) showing outperformance over prior published methods, with the largest gains on challenging industrial datasets.

Significance. If the central claims hold, the work provides a practical route to multi-view pose estimation that improves robustness to depth ambiguity, clutter, and occlusions while remaining generalizable to unseen objects. The scale of the evaluation—six datasets under the standard BOP protocol—strengthens the empirical case for the multi-view refinement strategy in settings where calibrated views are available.

major comments (1)

[Abstract] Abstract: The reported outperformance on industrial datasets (ITODD-MV, IPD, XYZ-IBD) rests on the multi-view feature-metric refinement that projects and compares features in a shared world frame. This construction requires perfectly known relative camera poses, yet the manuscript provides no ablation or sensitivity analysis under realistic calibration noise. Even modest errors in the supplied extrinsics could shift the joint discrepancy minimum to an incorrect pose, directly affecting the BOP scores that constitute the central empirical claim.
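To give a sense of scale for this concern, the short calculation below estimates how far a projected point moves in the image under a small extrinsic error. It is a back-of-the-envelope sketch with hypothetical numbers (a 1000-pixel focal length and a point on the optical axis), not a measurement from the paper.

```python
import math

def rotation_pixel_shift(focal_px, rot_err_deg):
    """Image shift of an on-axis point when the camera rotation is off by
    rot_err_deg about an axis perpendicular to the optical axis."""
    return focal_px * math.tan(math.radians(rot_err_deg))

def translation_pixel_shift(focal_px, lateral_err, depth):
    """Image shift of a point at the given depth when the camera position is
    off by lateral_err (same units as depth), parallel to the image plane."""
    return focal_px * lateral_err / depth

# Hypothetical setup: 1000 px focal length, object 1 m from the camera.
print(rotation_pixel_shift(1000.0, 0.1))            # about 1.75 px
print(translation_pixel_shift(1000.0, 0.001, 1.0))  # 1 px for a 1 mm error
```

Under these assumed values, a 0.1 degree rotation error or a 1 mm position error already displaces projections by one to two pixels, which is comparable to the resolution of typical dense feature maps after downsampling.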

minor comments (2)

[Results] Results section: The abstract and experimental summary omit error bars, exact baseline implementations, and the contribution of the refinement step versus the initial single-view estimates; adding these details would improve reproducibility without altering the core claims.
[Method] Method description: Clarify the precise feature extractor and rendering pipeline used for on-the-fly feature generation, including any hyper-parameters that control the discrepancy minimization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will revise the paper accordingly to strengthen the empirical evaluation.


Referee: The reported outperformance on industrial datasets (ITODD-MV, IPD, XYZ-IBD) rests on the multi-view feature-metric refinement that projects and compares features in a shared world frame. This construction requires perfectly known relative camera poses, yet the manuscript provides no ablation or sensitivity analysis under realistic calibration noise. Even modest errors in the supplied extrinsics could shift the joint discrepancy minimum to an incorrect pose, directly affecting the BOP scores that constitute the central empirical claim.

Authors: We agree that the multi-view feature-metric alignment in AlignPose operates under the assumption of known extrinsics and that the reported gains on the industrial datasets rely on this. The BOP evaluations use the provided ground-truth calibrations as per the benchmark protocol. We acknowledge that the current manuscript lacks an explicit sensitivity analysis to calibration noise, which is a valid point. In the revised manuscript we will add a dedicated ablation study that perturbs the relative camera poses with realistic levels of Gaussian noise and reports the resulting changes in BOP scores on ITODD-MV, IPD, and XYZ-IBD. This will quantify the method's sensitivity and clarify the practical requirements on calibration accuracy. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical optimization method with independent benchmark evaluation

full rationale

The paper introduces AlignPose as an algorithmic procedure for multi-view pose refinement that optimizes a single world-frame pose by minimizing feature discrepancy between rendered and observed features across extrinsically calibrated views. This is a standard iterative optimization construction rather than a derivation that reduces to its own inputs by construction. Performance results are reported as empirical comparisons on six external BOP benchmark datasets, not as predictions forced by fitted parameters or self-citations. No load-bearing step in the abstract or described contributions matches the enumerated circularity patterns; the method remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard multi-view geometry assumptions and learned feature extractors; no new entities are postulated and no free parameters are explicitly fitted in the abstract description.

axioms (1)

domain assumption: Input views are extrinsically calibrated with known relative camera poses.
Explicitly required for aggregating information from multiple views as stated in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

optimizes a single, consistent world-frame object pose by minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

$L^{C}_{FE}(T_{CO}) = \sum_i \rho\left(p_i - F_q\left(\pi_C(T_{CO}\, x_i)\right)\right)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 2 internal anchors

[1] Marc Alexa. Super-fibonacci spirals: Fast, low-discrepancy sampling of SO(3). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8291–8300, 2022.
[2] Simon Baker and Iain Matthews. Lucas-kanade 20 years on: A unifying framework. International journal of computer vision, 56(3):221–255, 2004.
[3] Jonathan T Barron. A general and adaptive robust loss function. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4331–4339,
[4] Andrea Caraffa, Davide Boscaini, Amir Hamza, and Fabio Poiesi. Freeze: Training-free zero-shot 6d pose estimation with geometric and vision foundation models. In European Conference on Computer Vision, pages 414–431. Springer,
[5] Xiaotong Chen, Huijie Zhang, Zeren Yu, Anthony Opipari, and Odest Chadwicke Jenkins. Clearpose: Large-scale transparent object dataset and benchmark. In European conference on computer vision, pages 381–396. Springer, 2022.
[6] Bertram Drost, Markus Ulrich, Paul Bergmann, Philipp Härtinger, and Carsten Steger. Introducing mvtec itodd - a dataset for 3d object recognition in industry. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 2200–2208, 2017.
[7] Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In European conference on computer vision, pages 834–849. Springer, 2014.
[8] Özgür Erkent, Dadhichi Shukla, and Justus Piater. Integration of probabilistic pose estimates from multiple views. In European Conference on Computer Vision, pages 154–170. Springer, 2016.
[9] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[10] Jiahui Fu, Qiangqiang Huang, Kevin Doherty, Yue Wang, and John J Leonard. A multi-hypothesis approach to pose ambiguity in object-based slam. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7639–7646. IEEE, 2021.
[11] Rasmus Laurvig Haugaard and Thorbjorn Mosekjaer Iversen. Multi-view object pose estimation from correspondence distributions and epipolar geometry. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 1786–1792, 2023.
[12] Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. Radiov2.5: Improved baselines for agglomerative vision foundation models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22487–22497,
[13] Tomáš Hodaň, Jiří Matas, and Štěpán Obdržálek. On evaluation of 6d object pose estimation. In European conference on computer vision, pages 606–619. Springer, 2016.
[14] Tomáš Hodaň, Pavel Haluza, Štěpán Obdržálek, Jiří Matas, Manolis Lourakis, and Xenophon Zabulis. T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. IEEE Winter Conference on Applications of Computer Vision (WACV), 2017.
[15] Tomas Hodan, Frank Michel, Eric Brachmann, Wadim Kehl, Anders Glent Buch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, et al. Bop: Benchmark for 6d object pose estimation. In Proceedings of the European conference on computer vision (ECCV), pages 19–34, 2018.
[16] Tomáš Hodaň, Martin Sundermeyer, Bertram Drost, Yann Labbé, Eric Brachmann, Frank Michel, Carsten Rother, and Jiří Matas. Bop challenge 2020 on 6d object localization. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 577–
[17] Tomas Hodan, Martin Sundermeyer, Yann Labbe, Van Nguyen Nguyen, Gu Wang, Eric Brachmann, Bertram Drost, Vincent Lepetit, Carsten Rother, and Jiri Matas. Bop challenge 2023 on detection segmentation and pose estimation of seen and unseen rigid objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pag...
[18] Michal Irani and Prabu Anandan. About direct methods. In International Workshop on Vision Algorithms, pages 267–
[19] Shun Iwase, Xingyu Liu, Rawal Khirodkar, Rio Yokota, and Kris M. Kitani. Repose: Fast 6d object pose refinement via deep texture rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3303–3312, 2021.
[20] HyunJun Jung, Shun-Cheng Wu, Patrick Ruhkamp, Guangyao Zhai, Hannah Schieber, Giulia Rizzoli, Pengyuan Wang, Hongcheng Zhao, Lorenzo Garattoni, Sven Meier, Daniel Roth, Nassir Navab, and Benjamin Busam. Housecat6d - a large-scale multi-modal category level 6d object perception dataset with household objects in realistic scenarios. In Proceedings of the...
[21] Roman Kaskman, Ivan Shugurov, Sergey Zakharov, and Slobodan Ilic. 6 dof pose estimation of textureless objects from multiple rgb frames. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 612–630. Springer, 2020.
[22] Yann Labbé, Justin Carpentier, Mathieu Aubry, and Josef Sivic. Cosypose: Consistent multi-view multi-object 6d pose estimation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 574–591. Springer, 2020.
[23] Yann Labbé, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, and Josef Sivic. Megapose: 6d pose estimation of novel objects via render & compare. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022.
[24] Kenneth Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of applied mathematics, 2(2):164–168, 1944.
[25] Chi Li, Jin Bai, and Gregory D Hager. A unified framework for multi-view multi-class object pose estimation. In Proceedings of the European conference on computer vision (ECCV), pages 254–269, 2018.
[26] Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, and Dieter Fox. Deepim: Deep iterative matching for 6d pose estimation. In Proceedings of the European conference on computer vision (ECCV), pages 683–698, 2018.
[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[28] Philipp Lindenberger, Paul-Edouard Sarlin, Viktor Larsson, and Marc Pollefeys. Pixel-perfect structure-from-motion with featuremetric refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5987–5997, 2021.
[29] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer,
[30] Xingyu Liu, Ruida Zhang, Chenyangguang Zhang, Gu Wang, Jiwen Tang, Zhigang Li, and Xiangyang Ji. Gdrnpp: A geometry-guided and fully learning-based object pose estimator. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[31] David G Lowe. Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE international conference on computer vision, pages 1150–1157. IEEE, 1999.
[32] Yangxiao Lu, Jishnu Jaykumar P, Yunhui Guo, Nicholas Ruozzi, and Yu Xiang. Adapting pre-trained vision models for novel instance detection and segmentation, 2024.
[33] Donald W Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
[34] Sungphill Moon, Hyeontae Son, Dongcheol Hur, and Sangwook Kim. Co-op: Correspondence-based novel object pose estimation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11622–11632, 2025.
[35] Van Nguyen Nguyen, Thibault Groueix, Georgy Ponimatkin, Vincent Lepetit, and Tomas Hodan. Cnos: A strong baseline for cad-based novel object segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2134–2140, 2023.
[36] Van Nguyen Nguyen, Thibault Groueix, Mathieu Salzmann, and Vincent Lepetit. Gigapose: Fast and robust novel object pose estimation via one correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9903–9913, 2024.
[37] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patri...
[38] Evin Pınar Örnek, Yann Labbé, Bugra Tekin, Lingni Ma, Cem Keskin, Christian Forster, and Tomas Hodan. Foundpose: Unseen object pose estimation with foundation features. In European Conference on Computer Vision, pages 163–182. Springer, 2024.
[39] Fabio Poiesi and Davide Boscaini. Learning general and distinctive 3d local deep descriptors for point cloud registration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3979–3985, 2022.
[40] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
[41] Edgar Riba, Dmytro Mishkin, Daniel Ponsa, Ethan Rublee, and Gary Bradski. Kornia: an open source differentiable computer vision library for pytorch. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3674–3683, 2020.
[42] Renato F Salas-Moreno, Richard A Newcombe, Hauke Strasdat, Paul HJ Kelly, and Andrew J Davison. Slam++: Simultaneous localisation and mapping at the level of objects. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1352–1359, 2013.
[43] Paul-Edouard Sarlin, Ajaykumar Unagar, Mans Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl, et al. Back to the feature: Learning robust camera localization from pixels to pose. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3247–3257, 2021.
[44] Chang Shu, Kun Yu, Zhixiang Duan, and Kuiyuan Yang. Feature-metric loss for self-supervised learning of depth and egomotion. In European Conference on Computer Vision, pages 572–588. Springer, 2020.
[45] Ivan Shugurov, Ivan Pavlov, Sergey Zakharov, and Slobodan Ilic. Multi-view object pose refinement with differentiable renderer. IEEE Robotics and Automation Letters, 6(2):2579–2586, 2021.
[46]

Dpodv2: Dense correspondence-based 6 dof pose estima- tion.IEEE transactions on pattern analysis and machine intelligence, 44(11):7417–7435, 2021

Ivan Shugurov, Sergey Zakharov, and Slobodan Ilic. Dpodv2: Dense correspondence-based 6 dof pose estima- tion.IEEE transactions on pattern analysis and machine intelligence, 44(11):7417–7435, 2021. 1, 7, 8

[47] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025. 5, 8

[48] Juil Sock, S Hamidreza Kasaei, Luis Seabra Lopes, and Tae-Kyun Kim. Multi-view 6d object pose estimation and camera motion planning using rgbd images. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 2228–2235, 2017. 2

[49] Marwan Taher, Ignacio Alzugaray, and Andrew J Davison. Fit-ngp: Fitting object models to neural graphics primitives. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 18186–18192. IEEE, 2024. 2

[50] Gabriele Trivigno, Carlo Masone, Barbara Caputo, and Torsten Sattler. The unreasonable effectiveness of pre-trained features for camera pose refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12786–12798, 2024. 2

[51] Nguyen Van Nguyen, Stephen Tyree, Andrew Guo, Mederic Fourmy, Anas Gouda, Taeyeop Lee, Sungphill Moon, Hyeontae Son, Lukas Ranftl, Jonathan Tremblay, Eric Brachmann, et al. Bop challenge 2024 on model-based and model-free 6d object pose estimation. CoRR, 2025. 2, 5

[52] Lukas von Stumberg, Patrick Wenzel, Nan Yang, and Daniel Cremers. Lm-reloc: Levenberg-marquardt based direct visual relocalization. In 2020 International Conference on 3D Vision (3DV), pages 968–977. IEEE Computer Society,

[53] Kentaro Wada, Edgar Sucar, Stephen James, Daniel Lenton, and Andrew J Davison. Morefusion: Multi-object reasoning for 6d pose estimation from volumetric fusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14540–14549, 2020. 2

[54] He Wang, Srinath Sridhar, Jingwei Huang, Julien Valentin, Shuran Song, and Leonidas J Guibas. Normalized object coordinate space for category-level 6d object pose and size estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2642–2651,

[55] Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17868–17879, 2024. 2

[56] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. In Proceedings of Robotics: Science and Systems (RSS), 2018. 2, 5

[57] Jun Yang, Wenjie Xue, Sahar Ghavidel, and Steven L Waslander. 6d pose estimation for textureless objects on rgb frames using multi-view optimization. In 2023 IEEE international conference on robotics and automation (ICRA), pages 2905–2912. IEEE, 2023. 2

[58] Kateryna Zorina, Vojtech Priban, Mederic Fourmy, Josef Sivic, and Vladimir Petrik. Temporally consistent object 6d pose estimation for robot control. IEEE Robotics and Automation Letters, 2024. 2

Appendix

This appendix contains additional implementation details for our method (Sec. A) and supplementary experimental results supporting our design choic...

