pith. sign in

arxiv: 2606.30058 · v1 · pith:P2M4TDHJnew · submitted 2026-06-29 · 💻 cs.CV

Emergence of a Shared Canonical Object Frame from In-the-Wild Videos

Pith reviewed 2026-06-30 05:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords canonical frameself-supervised learningobject pose estimationstructure from motionin-the-wild videosdense correspondencesgeometric bottleneckcategory-level pose
0
0 comments X

The pith

A shared canonical object frame emerges from self-supervised training on in-the-wild videos using only noisy SfM poses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a common reference frame for comparing object poses across instances can arise without any manual canonical pose labels or category-specific supervision. Training proceeds by forcing all video sequences through one shared coarse mesh that acts as a geometric bottleneck, then learning pixel-to-mesh correspondences while aligning each sequence using only rough camera motion from structure-from-motion. Multi-view consistency across thousands of videos, together with semantic cues already present in the feature extractor, is enough to produce a usable shared frame. A reader would care because this removes the annotation bottleneck that has limited scale in category-level pose work.

Core claim

By routing every training sequence through a single coarse canonical mesh that carries no category detail and estimating per-sequence alignments from noisy SfM geometry, dense correspondences from image pixels to the mesh are learned in a self-supervised manner; the resulting common canonical frame emerges from multi-view consistency and the semantic priors of the feature extractor, allowing competitive accuracy on category-level pose estimation benchmarks after training on 160,000 in-the-wild object videos.

What carries the argument

The shared geometric bottleneck: a single coarse canonical mesh with no category-specific detail that all sequences must map onto through learned dense correspondences.

If this is right

  • The approach achieves competitive accuracy on category-level pose estimation benchmarks without any canonical pose supervision.
  • The same training procedure works across diverse object categories without requiring category conditioning.
  • Only noisy camera poses from structure-from-motion are needed as geometric input.
  • Training on 160,000 object-centric videos is sufficient for the frame to emerge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bottleneck idea could be tested on other 3D reconstruction tasks where category labels are expensive.
  • If the mesh is replaced by a learned implicit surface, the emergence effect might extend to non-rigid objects.
  • The method suggests that large unlabeled video collections can substitute for pose annotations in downstream robotics applications.

Load-bearing premise

A single coarse canonical mesh together with noisy SfM geometry and existing semantic priors in the feature extractor are together enough to produce a usable shared frame across many different object categories.

What would settle it

Train the model and then measure whether the learned alignments produce consistent 3D pose predictions on a held-out set of videos whose ground-truth canonical poses have been independently annotated; large inconsistency or accuracy collapse would falsify the emergence claim.

Figures

Figures reproduced from arXiv: 2606.30058 by Adam Kortylewski, Eddy Ilg, Martin Sundermeyer, Tom Fischer.

Figure 1
Figure 1. Figure 1: Three approaches for learning of canonical object frames. Top left: Supervised methods can obtain a shared canonical frame from rendering aligned CAD models or manual pose annotation, which is expensive to scale. Bottom left: Self-supervised category-level methods learn from object￾centric videos without canonical labels, but train separate models per category, limiting scalability and producing no shared … view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative pose predictions. Predicted 6D poses visualized as canonical axes overlaid on images from diverse categories and benchmarks. A single fixed rotation is applied uniformly to all predictions for visualization, without per-category or per-dataset adjustments. Semantic directions such as ”front” and ”up” of object instances are not explicitly supervised, but tend to consistently map onto the same a… view at source ↗
Figure 3
Figure 3. Figure 3: Method overview. (a) We define the shared geometric bottleneck as a coarse canonical mesh in a fixed coordinate frame. (b) Each image is processed independently: the correspondence network predicts dense 2D–3D correspondences from pixels to mesh vertices. (c) During training, we estimate a per-sequence rotation that aligns each sequence’s SfM frame to the canonical mesh frame. This alignment transforms SfM… view at source ↗
Figure 4
Figure 4. Figure 4: Analysis of symmetry and convention mapping. Our main performance gap concentrates on symmetry-heavy objects (left), while per-category convention mappings cluster into a small number of modes (right), supporting that the emerged canonical frame is shared across categories. becomes underconstrained and causes mode collapse as a trivial solution. Two views recover much of the performance, but four views pro… view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy by symmetry class (full breakdown). Symmetry-aware Acc@30◦ stratified by s ∈ {∞, 1, 2, 4} for Omni6DPose and ImageNet3D. Bars report accuracy, group sizes are shown alongside each class. 25 50 75 100 s= (n=46) s=1 (n=89) 25 50 75 100 s= (n=35) s=1 (n=129) Ours OriAnyV2 Omni6DPose ImageNet3D 74.3 27.4 24.5 41.3 77.3 59.1 46.4 53.6 25 50 75 100 s= (n=35) s=1 (n=129) 65.0 59.1 41.7 53.6 Ours PCA-Only… view at source ↗
Figure 6
Figure 6. Figure 6: Accuracy with and without Per-Sequence Alignment.Symmetry-aware Acc@30◦ strati￾fied by s ∈ {∞, 1, } for ImageNet3D.Bars report accuracy, group sizes are shown alongside each class. Qwen3-VL baseline. For the QWEN3-VL [61] baseline, we prompt the model with the input image and request an orientation estimate as rotation angles. The prompt was taken from the official GitHub repository of the QWEN Team. We ad… view at source ↗
Figure 7
Figure 7. Figure 7: Accuracy–runtime trade-off. ImageNet3D Acc@30◦ (blue) and inference speed in FPS (green) as a function of mesh vertices M. Red dotted line: our default (M=1000). memory through the vertex-query decoding and matching steps [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Category-level pose estimation results. Left: ObjectNet3D results on the 20-category Common3D split. Right: Pascal3D+ results on 7 categories. Baselines train separate models per category on roughly 50 sequences. Our fairness controls show scaling effects: Ourscat 50 trains on the same target categories with a 50-sequence-per-category cap, Ourscat full trains on the same target categories using all availab… view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative pose predictions across benchmarks. For each dataset: input images with predicted canonical axes (top) and ShapeNet meshes rendered under the predicted pose (bottom). Blue axis indicates the predicted “front” direction. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
read the original abstract

Comparing object orientations and positions across different instances requires their poses to be expressed in a shared canonical frame. Establishing such frames has traditionally required manual annotation, creating a scaling bottleneck that limits category and instance diversity. We show that a shared canonical frame can instead emerge from self-supervised training on object-centric videos captured in the wild, using only noisy camera poses from Structure-from-Motion. Our key idea is to route all training sequences through a shared geometric bottleneck: a coarse canonical mesh that carries no category-specific detail. By learning dense correspondences from image pixels to this mesh, and estimating per-sequence alignments from noisy SfM geometry, a common canonical frame emerges from multi-view consistency and the semantic priors of the feature extractor, without any canonical pose labels or category conditioning. Trained in a self-supervised manner on 160,000 in-the-wild object videos, our method achieves competitive accuracy on category-level pose estimation benchmarks compared to methods that rely on canonical pose supervision. The code and checkpoint is available on https://github.com/Fischer-Tom/Emergent-Canonical-Frame/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that a shared canonical object frame can emerge from self-supervised training on object-centric videos captured in the wild, using only noisy camera poses from Structure-from-Motion, by routing all training sequences through a shared geometric bottleneck consisting of a coarse canonical mesh that carries no category-specific detail. By learning dense correspondences from image pixels to this mesh and estimating per-sequence alignments from noisy SfM geometry, a common canonical frame emerges from multi-view consistency and the semantic priors of the feature extractor, without any canonical pose labels or category conditioning. Trained on 160,000 in-the-wild object videos, the method achieves competitive accuracy on category-level pose estimation benchmarks.

Significance. If the result holds, this work would address the scaling bottleneck caused by manual annotation for canonical frames, allowing for greater category and instance diversity in object pose estimation. The public availability of the code and checkpoint is a notable strength that facilitates reproducibility and further research.

major comments (2)
  1. [geometric bottleneck (key idea)] The central claim depends on the sufficiency of a single coarse canonical mesh (no category detail) plus noisy SfM and feature priors to prevent per-category drift and produce a usable shared frame across diverse categories. However, no quantitative validation is provided for cases where SfM noise increases or object geometry differs sharply (e.g., rigid vs. articulated), leaving this premise as the least-secured point in the emergence argument.
  2. [Abstract] The abstract reports 'competitive accuracy' on benchmarks but supplies no quantitative numbers, error bars, ablation details, or dataset splits, which is load-bearing for verifying the central claim of emergence without supervision.
minor comments (1)
  1. [Abstract] Grammatical issue: 'The code and checkpoint is available' should read 'The code and checkpoint are available'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. Below we respond point-by-point to the major comments, proposing targeted revisions that directly address the concerns while preserving the core contributions of the work.

read point-by-point responses
  1. Referee: [geometric bottleneck (key idea)] The central claim depends on the sufficiency of a single coarse canonical mesh (no category detail) plus noisy SfM and feature priors to prevent per-category drift and produce a usable shared frame across diverse categories. However, no quantitative validation is provided for cases where SfM noise increases or object geometry differs sharply (e.g., rigid vs. articulated), leaving this premise as the least-secured point in the emergence argument.

    Authors: We agree that explicit quantitative validation of the geometric bottleneck's robustness would strengthen the emergence argument. The existing results already operate on 160k in-the-wild videos that naturally contain both rigid and articulated objects together with typical SfM noise; the shared coarse mesh plus multi-view consistency is shown to produce a usable frame without category conditioning. Nevertheless, to directly secure this premise we will add controlled ablations in the revision: (i) synthetic SfM noise injection at multiple levels with resulting canonical-frame consistency and downstream pose accuracy, and (ii) separate quantitative breakdowns on rigid versus articulated object subsets. These additions will be reported with the same metrics used in the main experiments. revision: yes

  2. Referee: [Abstract] The abstract reports 'competitive accuracy' on benchmarks but supplies no quantitative numbers, error bars, ablation details, or dataset splits, which is load-bearing for verifying the central claim of emergence without supervision.

    Authors: We accept that the abstract would benefit from concrete numbers to make the central claim immediately verifiable. We will revise the abstract to include the key quantitative results (e.g., mean rotation/translation errors or accuracy@30° on the category-level benchmarks) and a brief statement of the training scale (160k videos). Full error bars, ablation tables, and exact dataset splits will continue to appear in the main text and supplement, as abstract length precludes exhaustive detail; the revised abstract will nonetheless supply the load-bearing numbers requested. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical self-supervision against external benchmarks

full rationale

The paper presents a self-supervised method that routes sequences through a shared coarse mesh bottleneck and learns correspondences plus per-sequence alignments from noisy SfM. The canonical frame is claimed to emerge from multi-view consistency and feature priors during training on 160k videos, with no canonical labels used. Accuracy is measured on external category-level pose benchmarks. No equations, fitted parameters, or self-citations reduce the reported emergence or accuracy to the inputs by construction; the derivation remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the untested premise that multi-view consistency through a category-agnostic coarse mesh plus pre-trained feature semantics will converge to a single usable frame; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (2)
  • domain assumption Noisy SfM camera poses are sufficiently accurate to support per-sequence alignment to the shared mesh.
    Invoked when the method uses SfM geometry to estimate alignments without additional supervision.
  • domain assumption Semantic priors inside the feature extractor are adequate to resolve ambiguities across object categories.
    Stated as part of the emergence mechanism that replaces category conditioning.
invented entities (1)
  • coarse canonical mesh no independent evidence
    purpose: Serves as the shared geometric bottleneck that all sequences route through.
    New modeling choice introduced to enable emergence without category-specific detail; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5720 in / 1479 out tokens · 21586 ms · 2026-06-30T05:59:09.219014+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

74 extracted references · 9 canonical work pages · 4 internal anchors

  1. [1]

    Monocular rgb 6d object pose estimation for augmented reality: a survey,

    P. Aguirrezabal, I. Aguinaga, and A. Alvarez-Gila, “Monocular rgb 6d object pose estimation for augmented reality: a survey,”Virtual Reality, 2026

  2. [2]

    Neural descriptor fields: Se (3)-equivariant object representations for manipulation,

    A. Simeonov, Y . Du, A. Tagliasacchi, J. B. Tenenbaum, A. Rodriguez, P. Agrawal, and V . Sitz- mann, “Neural descriptor fields: Se (3)-equivariant object representations for manipulation,” in2022 International Conference on Robotics and Automation (ICRA), pp. 6394–6400, IEEE, 2022

  3. [3]

    Omnimanip: Towards general robotic manipulation via object-centric interaction primitives as spatial constraints,

    M. Pan, J. Zhang, T. Wu, Y . Zhao, W. Gao, and H. Dong, “Omnimanip: Towards general robotic manipulation via object-centric interaction primitives as spatial constraints,” inProceedings of the Computer Vision and Pattern Recognition Conference, pp. 17359–17369, 2025

  4. [4]

    Normalized object coordinate space for category-level 6d object pose and size estimation,

    H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, “Normalized object coordinate space for category-level 6d object pose and size estimation,” inCVPR, pp. 2642–2651, 2019

  5. [5]

    A survey of content based 3d shape retrieval methods,

    J. Tangelder and R. Veltkamp, “A survey of content based 3d shape retrieval methods,”Proceed- ings Shape Modeling Applications, 2004., pp. 145–156, 2004

  6. [6]

    Orientation matters: Making 3d generative models orientation-aligned,

    Y . Lu, Y . Tian, Z. Jiang, Y . Zhao, Y . Yang, H. Ouyang, H. Hu, H. Yu, Y . Shen, and Y . Liao, “Orientation matters: Making 3d generative models orientation-aligned,”arXiv preprint arXiv:2506.08640, 2025

  7. [7]

    Compass control: Multi object orientation control for text-to-image generation,

    R. Parihar, V . Agrawal, S. VS, and V . B. Radhakrishnan, “Compass control: Multi object orientation control for text-to-image generation,” inProceedings of the Computer Vision and Pattern Recognition Conference, pp. 2791–2801, 2025

  8. [8]

    ShapeNet: An Information-Rich 3D Model Repository

    A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su,et al., “Shapenet: An information-rich 3d model repository,”arXiv preprint arXiv:1512.03012, 2015

  9. [9]

    Objaverse: A universe of annotated 3d objects,

    M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13142–13153, 2023

  10. [10]

    Omni6d: Large-vocabulary 3d object dataset for category-level 6d object pose estimation,

    M. Zhang, T. Wu, T. Wang, T. Wang, Z. Liu, and D. Lin, “Omni6d: Large-vocabulary 3d object dataset for category-level 6d object pose estimation,” inEuropean Conference on Computer Vision, pp. 216–232, Springer, 2024

  11. [11]

    Ov9d: Open-vocabulary category-level 9d object pose and size estimation,

    J. Cai, Y . He, W. Yuan, S. Zhu, Z. Dong, L. Bo, and Q. Chen, “Ov9d: Open-vocabulary category-level 9d object pose and size estimation,”arXiv preprint arXiv:2403.12396, 2024

  12. [12]

    Beyond pascal: A benchmark for 3d object detection in the wild,

    Y . Xiang, R. Mottaghi, and S. Savarese, “Beyond pascal: A benchmark for 3d object detection in the wild,” inIEEE winter conference on applications of computer vision, pp. 75–82, IEEE, 2014

  13. [13]

    Imagenet3d: Towards general-purpose object-level 3d understanding,

    W. Ma, G. Zhang, Q. Liu, G. Zeng, A. Kortylewski, Y . Liu, and A. Yuille, “Imagenet3d: Towards general-purpose object-level 3d understanding,”Advances in Neural Information Processing Systems, vol. 37, pp. 96127–96149, 2024

  14. [14]

    Emerging properties in self-supervised vision transformers,

    M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” inProceedings of the IEEE/CVF international conference on computer vision, pp. 9650–9660, 2021

  15. [15]

    Learning transferable visual models from natural language supervi- sion,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark,et al., “Learning transferable visual models from natural language supervi- sion,” inInternational conference on machine learning, pp. 8748–8763, PmLR, 2021

  16. [16]

    Least-squares estimation of transformation parameters between two point pat- terns,

    S. Umeyama, “Least-squares estimation of transformation parameters between two point pat- terns,”IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 13, no. 04, pp. 376– 380, 1991

  17. [17]

    Sgpa: Structure-guided prior adaptation for category-level 6d object pose estimation,

    K. Chen and Q. Dou, “Sgpa: Structure-guided prior adaptation for category-level 6d object pose estimation,” inICCV, pp. 2773–2782, 2021

  18. [18]

    Gpv-pose: Category- level object pose estimation via geometry-guided point-wise voting,

    Y . Di, R. Zhang, Z. Lou, F. Manhardt, X. Ji, N. Navab, and F. Tombari, “Gpv-pose: Category- level object pose estimation via geometry-guided point-wise voting,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6781–6791, 2022. 10

  19. [19]

    Category-level 6d object pose and size estimation using self-supervised deep prior deformation networks,

    J. Lin, Z. Wei, C. Ding, and K. Jia, “Category-level 6d object pose and size estimation using self-supervised deep prior deformation networks,” inEuropean Conference on Computer Vision, pp. 19–34, Springer, 2022

  20. [20]

    Ist-net: Prior-free category-level pose estimation with implicit space transformation,

    J. Liu, Y . Chen, X. Ye, and X. Qi, “Ist-net: Prior-free category-level pose estimation with implicit space transformation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13978–13988, 2023

  21. [21]

    Secondpose: Se (3)-consistent dual-stream feature fusion for category-level pose estimation,

    Y . Chen, Y . Di, G. Zhai, F. Manhardt, C. Zhang, R. Zhang, F. Tombari, N. Navab, and B. Busam, “Secondpose: Se (3)-consistent dual-stream feature fusion for category-level pose estimation,” inCVPR, pp. 9959–9969, 2024

  22. [22]

    Object level depth reconstruction for category level 6d object pose estimation from monocular rgb image,

    Z. Fan, Z. Song, J. Xu, Z. Wang, K. Wu, H. Liu, and J. He, “Object level depth reconstruction for category level 6d object pose estimation from monocular rgb image,” inEuropean Conference on Computer Vision, pp. 220–236, Springer, 2022

  23. [23]

    Category-level metric scale object shape and pose estimation,

    T. Lee, B.-U. Lee, M. Kim, and I. S. Kweon, “Category-level metric scale object shape and pose estimation,”IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 8575–8582, 2021

  24. [24]

    Rgb-based category-level object pose estimation via decoupled metric scale recovery,

    J. Wei, X. Song, W. Liu, L. Kneip, H. Li, and P. Ji, “Rgb-based category-level object pose estimation via decoupled metric scale recovery,” in2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 2036–2042, IEEE, 2024

  25. [25]

    Unified category-level object detection and pose estimation from rgb images using 3d prototypes,

    T. Fischer, X. Zhang, and E. Ilg, “Unified category-level object detection and pose estimation from rgb images using 3d prototypes,” inICCV, pp. 9790–9800, 2025

  26. [26]

    Lapose: Laplacian mixture shape modeling for rgb-based category-level object pose estimation,

    R. Zhang, Z. Huang, G. Wang, C. Zhang, Y . Di, X. Zuo, J. Tang, and X. Ji, “Lapose: Laplacian mixture shape modeling for rgb-based category-level object pose estimation,” inEuropean Conference on Computer Vision, pp. 467–484, Springer, 2024

  27. [27]

    Nemo: Neural mesh models of contrastive features for robust 3d pose estimation,

    W. Angtian, A. Kortylewski, and A. Yuille, “Nemo: Neural mesh models of contrastive features for robust 3d pose estimation,” inICLR, 2021

  28. [28]

    inemo: Incremental neural mesh models for robust class-incremental learning,

    T. Fischer, Y . Liu, A. Jesslen, N. Ahmed, P. Kaushik, A. Wang, A. Yuille, A. Kortylewski, and E. Ilg, “inemo: Incremental neural mesh models for robust class-incremental learning,” in ECCV, 2024

  29. [29]

    Novum: Neural object volumes for robust object classification,

    A. Jesslen, G. Zhang, A. Wang, W. Ma, A. Yuille, and A. Kortylewski, “Novum: Neural object volumes for robust object classification,” inEuropean Conference on Computer Vision, pp. 264–281, Springer, 2025

  30. [30]

    Robust category-level 6d pose estimation with coarse-to-fine rendering of neural features,

    W. Ma, A. Wang, A. Yuille, and A. Kortylewski, “Robust category-level 6d pose estimation with coarse-to-fine rendering of neural features,” inECCV, pp. 492–508, Springer, 2022

  31. [31]

    Orient anything: Learning robust object orientation estimation from rendering 3d models,

    Z. Wang, Z. Zhang, T. Pang, C. Du, H. Zhao, and Z. Zhao, “Orient anything: Learning robust object orientation estimation from rendering 3d models,”arXiv preprint arXiv:2412.18605, 2024

  32. [32]

    Orient anything v2: Unifying orientation and rotation understanding,

    Z. Wang, Z. Zhang, J. Xu, J. Wang, T. Pang, C. Du, H. Zhao, and Z. Zhao, “Orient anything v2: Unifying orientation and rotation understanding,”arXiv preprint arXiv:2601.05573, 2026

  33. [33]

    Category-level 6d object pose estimation in the wild: A semi-supervised learning approach and a new dataset,

    Y . Fu and X. Wang, “Category-level 6d object pose estimation in the wild: A semi-supervised learning approach and a new dataset,”Advances in Neural Information Processing Systems, vol. 35, pp. 27469–27483, 2022

  34. [34]

    Housecat6d-a large-scale multi-modal category level 6d object perception dataset with household objects in realistic scenarios,

    H. Jung, S.-C. Wu, P. Ruhkamp, G. Zhai, H. Schieber, G. Rizzoli, P. Wang, H. Zhao, L. Garat- toni, S. Meier,et al., “Housecat6d-a large-scale multi-modal category level 6d object perception dataset with household objects in realistic scenarios,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pp. 22498–22508, 2024

  35. [35]

    Foundpose: Unseen object pose estimation with foundation features,

    E. P. Örnek, Y . Labbé, B. Tekin, L. Ma, C. Keskin, C. Forster, and T. Hodan, “Foundpose: Unseen object pose estimation with foundation features,” inECCV, pp. 163–182, Springer, 2024

  36. [36]

    Gigapose: Fast and robust novel object pose estimation via one correspondence,

    V . N. Nguyen, T. Groueix, M. Salzmann, and V . Lepetit, “Gigapose: Fast and robust novel object pose estimation via one correspondence,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9903–9913, June 2024

  37. [37]

    MegaPose: 6D pose estimation of novel objects via render & compare,

    Y . Labbé, L. Manuelli, A. Mousavian, S. Tyree, S. Birchfield, J. Tremblay, J. Carpentier, M. Aubry, D. Fox, and J. Sivic, “Megapose: 6d pose estimation of novel objects via render & compare,”arXiv preprint arXiv:2212.06870, 2022. 11

  38. [38]

    Foundationpose: Unified 6d pose estimation and tracking of novel objects,

    B. Wen, W. Yang, J. Kautz, and S. Birchfield, “Foundationpose: Unified 6d pose estimation and tracking of novel objects,” inCVPR, pp. 17868–17879, 2024

  39. [39]

    Onepose: One-shot object pose estimation without cad models,

    J. Sun, Z. Wang, S. Zhang, X. He, H. Zhao, G. Zhang, and X. Zhou, “Onepose: One-shot object pose estimation without cad models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6825–6834, 2022

  40. [40]

    Onepose++: Keypoint-free one- shot object pose estimation without cad models,

    X. He, J. Sun, Y . Wang, D. Huang, H. Bao, and X. Zhou, “Onepose++: Keypoint-free one- shot object pose estimation without cad models,”Advances in Neural Information Processing Systems, vol. 35, pp. 35103–35115, 2022

  41. [41]

    Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images,

    Y . Liu, Y . Wen, S. Peng, C. Lin, X. Long, T. Komura, and W. Wang, “Gen6d: Generalizable model-free 6-dof object pose estimation from rgb images,” inEuropean Conference on Computer Vision, pp. 298–315, Springer, 2022

  42. [42]

    Unopose: Unseen object pose estimation with an unposed rgb-d reference image,

    X. Liu, G. Wang, R. Zhang, C. Zhang, F. Tombari, and X. Ji, “Unopose: Unseen object pose estimation with an unposed rgb-d reference image,” inCVPR, pp. 22023–22034, 2025

  43. [43]

    One2any: One-reference 6d pose estimation for any object,

    M. Liu, S. Li, A. Chhatkuli, P. Truong, L. Van Gool, and F. Tombari, “One2any: One-reference 6d pose estimation for any object,” inProceedings of the Computer Vision and Pattern Recogni- tion Conference, pp. 6457–6467, 2025

  44. [44]

    ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

    L. Kuang, Y . Velikova, M. Saleh, J.-N. Zaech, D. P. Paudel, and B. Busam, “Concept- pose: Training-free zero-shot object pose estimation using concept vectors,”arXiv preprint arXiv:2512.09056, 2025

  45. [45]

    Any6d: Model-free 6d pose estimation of novel objects,

    T. Lee, B. Wen, M. Kang, G. Kang, I. S. Kweon, and K.-J. Yoon, “Any6d: Model-free 6d pose estimation of novel objects,” inCVPR, pp. 11633–11643, 2025

  46. [46]

    One view, many worlds: Single-image to 3d object meets generative domain randomization for one-shot 6d pose estimation,

    Z. Geng, N. Wang, S. Xu, C. Ye, B. Li, Z. Chen, S. Peng, and H. Zhao, “One view, many worlds: Single-image to 3d object meets generative domain randomization for one-shot 6d pose estimation,” inConference on Robot Learning, pp. 168–197, PMLR, 2025

  47. [47]

    Mfos: Model-free & one-shot object pose estimation,

    J. Lee, Y . Cabon, R. Brégier, S. Yoo, and J. Revaud, “Mfos: Model-free & one-shot object pose estimation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 2911–2919, 2024

  48. [48]

    Boxdreamer: Dreaming box corners for generalizable object pose estimation,

    Y . Yu, X. He, C. Zhao, J. Yu, J. Yang, R. Hu, Y . Shen, X. Zhu, X. Zhou, and S. Peng, “Boxdreamer: Dreaming box corners for generalizable object pose estimation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9374–9384, 2025

  49. [49]

    Shic: Shape-image correspondences with no keypoint supervision,

    A. Shtedritski, C. Rupprecht, and A. Vedaldi, “Shic: Shape-image correspondences with no keypoint supervision,” inEuropean Conference on Computer Vision, pp. 129–145, Springer, 2024

  50. [50]

    Unsupervised learning of category-level 3d pose from object-centric videos,

    L. Sommer, A. Jesslen, E. Ilg, and A. Kortylewski, “Unsupervised learning of category-level 3d pose from object-centric videos,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22787–22796, 2024

  51. [51]

    Common3d: Self-supervised learning of 3d morphable models for common objects in neural feature space,

    L. Sommer, O. Dünkel, C. Theobalt, and A. Kortylewski, “Common3d: Self-supervised learning of 3d morphable models for common objects in neural feature space,” inProceedings of the Computer Vision and Pattern Recognition Conference, pp. 6468–6479, 2025

  52. [52]

    Leveraging se(3) equivariance for self-supervised category-level object pose estimation,

    X. Li, Y . Weng, L. Yi, L. Guibas, A. L. Abbott, S. Song, and H. Wang, “Leveraging se(3) equivariance for self-supervised category-level object pose estimation,”Thirty-Fifth Conference on Neural Information Processing Systems, 2021

  53. [53]

    Canon- ical capsules: Self-supervised capsules in canonical pose,

    W. Sun, A. Tagliasacchi, B. Deng, S. Sabour, S. Yazdani, G. E. Hinton, and K. M. Yi, “Canon- ical capsules: Self-supervised capsules in canonical pose,”Advances in Neural information processing systems, vol. 34, pp. 24993–25005, 2021

  54. [54]

    Multi-path learning for object pose estimation across domains,

    M. Sundermeyer, M. Durner, E. Y . Puang, Z.-C. Marton, N. Vaskevicius, K. O. Arras, and R. Triebel, “Multi-path learning for object pose estimation across domains,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13916–13925, 2020

  55. [55]

    A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence,

    J. Zhang, C. Herrmann, J. Hur, L. Polania Cabrera, V . Jampani, D. Sun, and M.-H. Yang, “A tale of two features: Stable diffusion complements dino for zero-shot semantic correspondence,” Advances in Neural Information Processing Systems, vol. 36, pp. 45533–45547, 2023

  56. [56]

    Improving semantic correspondence with viewpoint- guided spherical maps,

    O. Mariotti, O. Mac Aodha, and H. Bilen, “Improving semantic correspondence with viewpoint- guided spherical maps,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19521–19530, 2024. 12

  57. [57]

    Do it yourself: Learning semantic correspondence from pseudo-labels,

    O. Dünkel, T. Wimmer, C. Theobalt, C. Rupprecht, and A. Kortylewski, “Do it yourself: Learning semantic correspondence from pseudo-labels,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5834–5844, October 2025

  58. [58]

    A least squares estimate of satellite attitude,

    G. Wahba, “A least squares estimate of satellite attitude,”SIAM review, vol. 7, no. 3, pp. 409–409, 1965

  59. [59]

    Epnp: An accurate o(n) solution to the pnp problem,

    V . Lepetit, F. Moreno-Noguer, and P. Fua, “Epnp: An accurate o(n) solution to the pnp problem,” International journal of computer vision, vol. 81, pp. 155–166, 2009

  60. [60]

    Givepose: Gradual intra-class variation elimination for rgb-based category-level object pose estimation,

    Z. Huang, G. Wang, C. Zhang, R. Zhang, X. Li, and X. Ji, “Givepose: Gradual intra-class variation elimination for rgb-based category-level object pose estimation,” inProceedings of the Computer Vision and Pattern Recognition Conference, pp. 22055–22066, 2025

  61. [61]

    Qwen3-VL Technical Report

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge,et al., “Qwen3-vl technical report,”arXiv preprint arXiv:2511.21631, 2025

  62. [62]

    Uncommon objects in 3d,

    X. Liu, P. Tayal, J. Wang, J. Zarzar, T. Monnier, K. Tertikas, J. Duan, A. Toisoul, J. Y . Zhang, N. Neverova,et al., “Uncommon objects in 3d,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14102–14113, 2025

  63. [63]

    Omni6dpose: A benchmark and model for universal 6d object pose estimation and tracking,

    J. Zhang, W. Huang, B. Peng, M. Wu, F. Hu, Z. Chen, B. Zhao, and H. Dong, “Omni6dpose: A benchmark and model for universal 6d object pose estimation and tracking,” inEuropean Conference on Computer Vision, pp. 199–216, Springer, 2024

  64. [64]

    Objectron: A large scale dataset of object-centric videos in the wild with pose annotations,

    A. Ahmadyan, L. Zhang, A. Ablavatski, J. Wei, and M. Grundmann, “Objectron: A large scale dataset of object-centric videos in the wild with pose annotations,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7822–7831, 2021

  65. [65]

    Objectnet3d: A large scale database for 3d object recognition,

    Y . Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese, “Objectnet3d: A large scale database for 3d object recognition,” inEuropean conference on computer vision, pp. 160–176, Springer, 2016

  66. [66]

    Zero-shot category-level object pose estima- tion,

    W. Goodwin, S. Vaze, I. Havoutis, and I. Posner, “Zero-shot category-level object pose estima- tion,” inEuropean Conference on Computer Vision, pp. 516–532, Springer, 2022

  67. [67]

    Rotation averaging,

    R. Hartley, J. Trumpf, Y . Dai, and H. Li, “Rotation averaging,”International journal of computer vision, vol. 103, no. 3, pp. 267–305, 2013

  68. [68]

    Modular primitives for high-performance differentiable rendering,

    S. Laine, J. Hellsten, T. Karras, Y . Seol, J. Lehtinen, and T. Aila, “Modular primitives for high-performance differentiable rendering,”ACM Transactions on Graphics, vol. 39, no. 6, 2020

  69. [69]

    DINOv3

    O. Siméoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa,et al., “Dinov3,”arXiv preprint arXiv:2508.10104, 2025

  70. [70]

    Lora: Low-rank adaptation of large language models.,

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen,et al., “Lora: Low-rank adaptation of large language models.,”Iclr, vol. 1, no. 2, p. 3, 2022

  71. [71]

    Unified perceptual parsing for scene understand- ing,

    T. Xiao, Y . Liu, B. Zhou, Y . Jiang, and J. Sun, “Unified perceptual parsing for scene understand- ing,” inProceedings of the European conference on computer vision (ECCV), pp. 418–434, 2018

  72. [72]

    PoseLib - Minimal Solvers for Camera Pose Estimation,

    V . Larsson and contributors, “PoseLib - Minimal Solvers for Camera Pose Estimation,” 2020. 13 A Overview This supplement provides additional details and analyses referenced in the main paper: • Implementation details and training hyperparameters (B) • Evaluation protocol, including symmetry-aware metrics and convention mapping (C) • Full symmetry-class b...

  73. [73]

    We use AdamW with weight decay 0.05 and gradient clipping at norm 1.0

    and mixed precision. We use AdamW with weight decay 0.05 and gradient clipping at norm 1.0. The learning rate is 10−4 for non-backbone parameters and 10−5 for LoRA parameters. The correspondence softmax temperature τ follows a three-stage cosine schedule: 0.20→0.17 over the first30%of training,0.17→0.13over the next50%, and0.13→0.10over the final 20%. B.6...

  74. [74]

    2.Class 1 (asymmetric): a unique identifiable front; full yaw determination is possible

    Class ∞ (continuous): rotationally symmetric about the vertical axis; yaw is unrecoverable. 2.Class 1 (asymmetric): a unique identifiable front; full yaw determination is possible. 3.Class 2 (180 ◦ ambiguity): front and back are indistinguishable but side views differ. 4.Class 4 (90 ◦ ambiguity): all four cardinal yaw orientations are indistinguishable. H...