
arxiv: 2605.28270 · v1 · pith:XENOMKSTnew · submitted 2026-05-27 · 💻 cs.CV

Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects

Pith reviewed 2026-06-29 13:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords 9D pose estimation · real-world dataset · canonical pose · object-centric videos · multiview reconstruction · large-scale annotation · pose canonicalization · computer vision

The pith

Every9D-21M supplies 21.8 million real-world images with 9D pose annotations across 700 everyday object categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates Every9D-21M, a dataset of 21.8M real-world images drawn from 109K object-centric videos and annotated with 9D poses for 700 categories. It reaches this scale by reconstructing object point clouds through multi-view geometry, aligning instances of similar objects into one canonical coordinate frame, manually labeling canonical poses on fewer than 0.01 percent of the images, and then propagating those labels with cross-instance alignment followed by multi-view verification. Cross-category orientation rules are added to handle symmetries. Prior real-world 9D datasets topped out at roughly 17K annotated objects in nine categories, so the new collection removes the main data bottleneck for learning single-image 9D pose estimation on everyday objects.
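
As a rough illustration of what a 9D annotation carries, the sketch below assumes the common convention of a 3D rotation, a 3D translation, and a 3D size (box extents). This page does not spell out the parameterization, so the `Pose9D` class, its field names, and the object-to-camera convention are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Pose9D:
    # Assumed convention (not taken from the paper): 3 + 3 + 3 degrees of freedom.
    rotation: list      # 3x3 row-major rotation matrix, object frame -> camera frame
    translation: list   # box centre in camera coordinates (x, y, z)
    size: list          # full box extents along the canonical x, y, z axes


def box_corners(pose):
    """Return the 8 corners of the oriented 3D box in camera coordinates."""
    corners = []
    for signs in product((-0.5, 0.5), repeat=3):
        # Corner in the object frame: half the extent along each axis.
        local = [s * e for s, e in zip(signs, pose.size)]
        # Rotate into the camera frame, then shift by the box centre.
        corner = [
            sum(pose.rotation[r][c] * local[c] for c in range(3)) + pose.translation[r]
            for r in range(3)
        ]
        corners.append(corner)
    return corners


# Hypothetical example: an axis-aligned box two units in front of the camera.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pose = Pose9D(rotation=identity, translation=[0.0, 0.0, 2.0], size=[0.2, 0.1, 0.4])
corners = box_corners(pose)
```

With this layout, the rotation fixes the canonical orientation, while translation and size together define the 3D bounding box that a single-image estimator would have to predict.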

Core claim

The pipeline reconstructs object-level point clouds from object-centric videos via multi-view geometry, aligns similar instances into a shared canonical frame, manually annotates reference objects for fewer than 0.01 percent of images, propagates the remaining canonical poses via cross-instance alignment, and verifies all propagated poses from multiple viewpoints. The result is 21.8M real-world 9D annotations across 700 categories, two orders of magnitude larger than previous real-world benchmarks. The work also introduces cross-category orientation rules that induce category-level symmetries for evaluation.

What carries the argument

Cross-instance alignment of reconstructed point clouds from multiview videos, combined with manual reference annotation on a tiny fraction of images and subsequent multiview verification, to propagate canonical 9D poses at scale.
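
The propagation step can be pictured as composing transforms. The following is a minimal sketch, assuming the manual annotation yields a reference-to-canonical transform and the alignment yields an instance-to-reference transform. The helper names and example values are illustrative only, and the actual alignment may also estimate scale, which this rigid example omits.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms (row-major nested lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def rot_z(deg):
    """Homogeneous rotation about the z axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]


def translate(x, y, z):
    """Homogeneous translation."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y], [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def apply(transform, point):
    """Apply a 4x4 transform to a 3D point."""
    return [sum(transform[i][k] * point[k] for k in range(3)) + transform[i][3] for i in range(3)]


# Hypothetical inputs: one manual label on the reference object, one alignment result.
ref_to_canonical = rot_z(90.0)               # from manual annotation of the reference
instance_to_ref = translate(1.0, 0.0, 0.0)   # from cross-instance alignment

# Propagated label: map the instance into the reference frame, then into the canonical frame.
instance_to_canonical = matmul4(ref_to_canonical, instance_to_ref)
origin_in_canonical = apply(instance_to_canonical, [0.0, 0.0, 0.0])
```

Because each propagated label is a product of two estimated transforms, any alignment error carries straight into the canonical pose, which is why the verification step matters.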

If this is right

  • Training on Every9D-21M improves performance on the ImageNet3D and PASCAL3D+ benchmarks.
  • Models trained on Every9D-21M generalize substantially better to the HANDAL dataset than models trained on ImageNet3D.
  • The dataset supplies dedicated training and evaluation splits for developing 9D pose foundation models.
  • Cross-category orientation rules enable symmetry-aware evaluation protocols.
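
One common way to make a rotation metric symmetry-aware is to take the minimum geodesic error over all symmetry-equivalent ground-truth rotations. The sketch below illustrates that idea under stated assumptions: the symmetry set (a two-fold rotation about z) is hypothetical, and this page does not say that the paper's protocol is defined exactly this way.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices (row-major nested lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rot_z3(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_deg(a, b):
    """Angle in degrees of the relative rotation between a and b."""
    rel = matmul3(transpose(a), b)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for numerical safety
    return math.degrees(math.acos(cos_angle))


def symmetry_aware_error_deg(pred, gt, symmetries):
    """Minimum geodesic error over symmetry-equivalent ground truths gt @ S."""
    return min(geodesic_deg(pred, matmul3(gt, s)) for s in symmetries)


# Hypothetical category with a two-fold symmetry about the canonical z axis.
identity = rot_z3(0.0)
symmetries = [identity, rot_z3(180.0)]
gt = identity
pred = rot_z3(180.0)  # flipped prediction, indistinguishable under the symmetry

plain_error = geodesic_deg(pred, gt)
symmetric_error = symmetry_aware_error_deg(pred, gt, symmetries)
```

In this example the plain metric penalizes the flipped prediction with the maximum error, whereas the symmetry-aware variant treats it as correct.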

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The scale of real data may reduce the need to rely primarily on synthetic renderings for 9D pose training.
  • The alignment-and-propagation pipeline could be applied to additional video sources to grow the set of categories further.
  • Improved canonical 9D supervision may benefit downstream tasks such as robotic grasping or scene reconstruction that require consistent object coordinate frames.

Load-bearing premise

Cross-instance point-cloud alignment plus multiview verification produces canonical poses whose residual error remains small enough for downstream training and evaluation to stay valid.

What would settle it

Manual inspection of a random sample of several hundred propagated poses reveals systematic alignment errors above a few degrees or centimeters, or models trained on Every9D-21M show no improvement over smaller real-world datasets on independent held-out 9D pose benchmarks.
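
Such an audit could be organized roughly as follows. This is a sketch, not the paper's procedure: the per-pose errors are assumed to come from manual inspection, the 5-degree threshold and the synthetic error list are placeholders, and the Wilson score interval is one standard way to bound a failure rate estimated from a sample.

```python
import math
import random


def audit(errors_deg, sample_size, threshold_deg, seed=0):
    """Draw a random sample and count poses whose error exceeds the threshold."""
    rng = random.Random(seed)
    sample = rng.sample(errors_deg, min(sample_size, len(errors_deg)))
    failures = sum(1 for e in sample if e > threshold_deg)
    return failures, len(sample)


def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = failures / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


# Synthetic placeholder data: 90 small errors and 10 large ones, in degrees.
errors = [1.0] * 90 + [10.0] * 10
failures, n = audit(errors, sample_size=100, threshold_deg=5.0)
low, high = wilson_interval(failures, n)
```

A sample of a few hundred poses gives a usable bound: even zero observed failures in 300 samples only supports a failure rate below roughly 1.3 percent at this confidence level.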

Figures

Figures reproduced from arXiv: 2605.28270 by Adam Kortylewski, Emil Akopyan, Leonhard Sommer.

Figure 1: Every9D-21M. Example canonicalized objects from our dataset. Each object is reconstructed as a 3D Gaussian Splat from an object-centric video in uCO3D. Cross-instance alignment establishes shared canonical coordinate frames, which are subsequently propagated to all video frames, yielding 9D pose annotations for 21.8M real-world images across 700 object categories.
Figure 2: Canonicalization Framework. Our framework to canonicalize object-centric videos consists of 5 steps. 1) We cluster all object-centric videos. 2) We select a reference per cluster. 3) We annotate a 9D pose for the reference. 4) We align each object to its reference based upon geometry and appearance. 5) We verify all objects, removing objects with bad reconstruction and skipping objects with wrong pose anno…
Figure 3: Mollweide projection of camera-direction histograms in the canonical object frame. Every9D-21M is the only dataset providing broad, near-uniform coverage of the viewing sphere; ImageNet3D and Pascal3D+ concentrate sharply along the equator (canonical photograph angles), while HANDAL covers the hemisphere uniformly but contains very few categories.
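
A camera-direction histogram of this kind can be computed from per-frame poses. The sketch below is an illustration under assumptions that this page does not confirm: the pose maps object coordinates to camera coordinates as x_cam = R x_obj + t, the canonical z axis points up, and the bin counts are arbitrary.

```python
import math


def camera_direction_in_object_frame(rotation, translation):
    """Unit vector from the object origin to the camera centre, in object coordinates.

    With x_cam = R x_obj + t, the camera centre in the object frame is c = -R^T t.
    """
    centre = [-sum(rotation[r][c] * translation[r] for r in range(3)) for c in range(3)]
    norm = math.sqrt(sum(v * v for v in centre))
    return [v / norm for v in centre]


def azimuth_elevation_deg(direction):
    """Azimuth in (-180, 180] and elevation in [-90, 90], assuming z is up."""
    x, y, z = direction
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, z))))
    return azimuth, elevation


def direction_histogram(directions, az_bins=12, el_bins=6):
    """Count directions in an equal-angle azimuth/elevation grid."""
    counts = [[0] * az_bins for _ in range(el_bins)]
    for d in directions:
        az, el = azimuth_elevation_deg(d)
        i = min(int((el + 90.0) / 180.0 * el_bins), el_bins - 1)
        j = min(int((az + 180.0) / 360.0 * az_bins), az_bins - 1)
        counts[i][j] += 1
    return counts


# Hypothetical frame: identity rotation, object two units along the camera's -x axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
direction = camera_direction_in_object_frame(identity, [-2.0, 0.0, 0.0])
azimuth, elevation = azimuth_elevation_deg(direction)
counts = direction_histogram([direction])
```

Equal-angle bins shrink in area toward the poles, so a fair comparison of coverage would normalize by bin area or use an equal-area binning, which is what an equal-area projection such as Mollweide conveys visually.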
Figure 4: Qualitative Results on Every9D-21M (test split). Since OrientAnythingV2 does not predict 3D bounding boxes, we visualize its rotation estimates using ground-truth 3D boxes. Predicted axes are color-coded as red (left), green (back), and blue (top).
Original abstract

Estimating the 9D pose of everyday objects from a single real-world image remains challenging. This is largely due to the lack of large-scale supervision. Most existing datasets either rely heavily on synthetic renderings or provide limited coverage of real-world objects: the largest real-world 9D pose dataset to date contains only 17K annotated objects across 9 categories. We address this gap with Every9D-21M, a dataset of 9D pose annotations for 21.8M real-world images from 109K object-centric videos spanning 700 everyday object categories, two orders of magnitude larger than prior real-world 9D pose benchmarks in both image and category count. To achieve this scale, we leverage object-centric videos by reconstructing object-level point clouds via multi-view geometry and aligning similar instances into a shared canonical coordinate frame. Canonical poses are manually annotated for only a small set of reference objects (fewer than 0.01% of all images) and propagated to the remaining instances via cross-instance alignment. All propagated canonical poses are then verified from multiple viewpoints. We further introduce cross-category orientation rules that induce category-level symmetries, enabling symmetry-aware evaluation. Beyond establishing dedicated training and evaluation splits as a benchmark for 9D pose foundation models, we show that training on Every9D-21M improves performance on ImageNet3D and PASCAL3D+, and generalizes to HANDAL substantially better than training on ImageNet3D. Data and code are available at https://github.com/GenIntel/Every9D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Every9D-21M, a dataset comprising 21.8M real-world images with 9D pose annotations across 700 everyday object categories, constructed from 109K object-centric videos. The pipeline reconstructs object-level point clouds via multi-view geometry, manually annotates canonical poses for a tiny reference set (<0.01% of images), propagates them through cross-instance alignment of similar instances into a shared frame, and verifies all propagated poses from multiple viewpoints. Cross-category orientation rules are introduced to handle symmetries. The authors provide training/evaluation splits and report that models trained on Every9D-21M improve on ImageNet3D and PASCAL3D+ while generalizing better to HANDAL than models trained on ImageNet3D alone. Data and code are released.

Significance. If the 9D annotations prove sufficiently accurate, Every9D-21M would constitute a substantial advance: two orders of magnitude larger than prior real-world 9D benchmarks in both images and categories, enabling training of 9D pose foundation models on genuine real-world data rather than synthetic renderings. The demonstrated transfer gains and the public release of data/code are concrete strengths.

major comments (2)
  1. [§4] §4 (Dataset Construction), paragraph on cross-instance alignment and multiview verification: the manuscript states that alignment error is controlled via point-cloud propagation and multiview checks but supplies no quantitative statistics (rotation/translation error distributions, failure rates, or inter-annotator agreement on a held-out sample). This directly affects the validity of the headline claim of 21.8M usable 9D annotations.
  2. [§5] §5 (Experiments), results on ImageNet3D and HANDAL: the reported performance gains are presented without explicit controls or ablation for category overlap between Every9D-21M training data and the target evaluation sets; without such controls it is unclear whether the improvements reflect genuine generalization or partial data leakage.
minor comments (1)
  1. [Figure 3] Figure 3 caption and §3.2: the description of the symmetry-aware evaluation protocol would benefit from an explicit statement of how the induced category-level symmetries interact with the 9D pose metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of Every9D-21M's scale and the constructive feedback. We address each major comment below and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [§4] §4 (Dataset Construction), paragraph on cross-instance alignment and multiview verification: the manuscript states that alignment error is controlled via point-cloud propagation and multiview checks but supplies no quantitative statistics (rotation/translation error distributions, failure rates, or inter-annotator agreement on a held-out sample). This directly affects the validity of the headline claim of 21.8M usable 9D annotations.

    Authors: We agree that providing quantitative validation metrics is important for establishing the reliability of the 21.8M annotations. In the revised manuscript, we will add a new subsection or appendix detailing error distributions (mean and std of rotation and translation errors) on a held-out sample of 1,000 instances where we performed additional manual verification. We will also report the failure rate of the multiview verification step (instances discarded due to inconsistency) and inter-annotator agreement (e.g., average angular difference) for the manual canonical pose annotations on a subset of 200 objects. revision: yes

  2. Referee: [§5] §5 (Experiments), results on ImageNet3D and HANDAL: the reported performance gains are presented without explicit controls or ablation for category overlap between Every9D-21M training data and the target evaluation sets; without such controls it is unclear whether the improvements reflect genuine generalization or partial data leakage.

    Authors: We appreciate this point on potential data leakage. Every9D-21M covers 700 categories, while ImageNet3D and PASCAL3D+ cover fewer and HANDAL focuses on hand-held objects, so some category overlap is likely. To address this rigorously, the revision will include: (1) a table listing the overlapping categories, (2) an ablation in which we retrain after excluding all overlapping categories from Every9D-21M, and (3) results on the non-overlapping subsets, reporting whether the performance gains remain significant. If they do, this will show that the improvements come from better generalization from large-scale real data rather than from leakage. revision: yes
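
The overlap control described in this response could start from a simple name-based split. The sketch below is a hypothetical illustration: the category names are invented, and matching by normalized string is an assumption that would miss synonyms such as "couch" and "sofa".

```python
def normalise(name):
    """Lower-case a category name and unify separators."""
    return name.strip().lower().replace("_", " ").replace("-", " ")


def split_by_overlap(train_categories, eval_categories):
    """Split training categories into those shared with the evaluation set and the rest."""
    eval_names = {normalise(c) for c in eval_categories}
    overlapping = sorted(c for c in train_categories if normalise(c) in eval_names)
    disjoint = sorted(c for c in train_categories if normalise(c) not in eval_names)
    return overlapping, disjoint


# Invented category lists, for illustration only.
train_categories = ["coffee_mug", "Hammer", "toaster", "office-chair"]
eval_categories = ["hammer", "coffee mug", "spatula"]

overlapping, disjoint = split_by_overlap(train_categories, eval_categories)
```

Retraining on the `disjoint` list only, then evaluating on the target benchmark, is the ablation that separates generalization from category leakage.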

Circularity Check

0 steps flagged

No circularity: dataset construction pipeline is self-contained with external validation

full rationale

The paper describes an empirical pipeline for dataset creation (multi-view point cloud reconstruction, cross-instance alignment, manual annotation on <0.01% reference objects, propagation, and multi-view verification) without any equations, fitted parameters renamed as predictions, or load-bearing self-citations. Claims of annotation validity rest on the described process and are supported by reported improvements on independent external benchmarks (ImageNet3D, PASCAL3D+, HANDAL). No step reduces a reported result to an input by construction; the contribution is scale of real-world data rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard multi-view geometry assumptions and the unstated premise that cross-instance point-cloud alignment preserves 9D pose up to acceptable error; no free parameters or invented physical entities are introduced in the abstract.

axioms (2)
  • domain assumption: Multi-view geometry from object-centric videos yields sufficiently accurate object-level point clouds for subsequent alignment.
    Invoked in the sentence describing reconstruction via multi-view geometry.
  • domain assumption: Cross-instance alignment of similar objects into a shared canonical frame transfers pose labels without introducing systematic bias.
    Core of the propagation step described after the manual annotation sentence.

pith-pipeline@v0.9.1-grok · 5826 in / 1437 out tokens · 31216 ms · 2026-06-29T13:45:45.571383+00:00 · methodology

