pith. machine review for the scientific record.

arxiv: 2604.22826 · v1 · submitted 2026-04-19 · 💻 cs.CV · cs.LG

Recognition: unknown

Shape: A Self-Supervised 3D Geometry Foundation Model for Industrial CAD Analysis

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 06:11 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords self-supervised learning · 3D geometry · CAD meshes · foundation model · mesh reconstruction · contrastive consistency · transformer

The pith

Shape is a self-supervised model that produces dense 3D embeddings from CAD meshes using masked token reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Shape, a self-supervised foundation model that turns CAD surface meshes into dense per-token embeddings for industrial use. It combines a structured latent grid, a multi-scale tokenizer with cross-attention, and a transformer processor. Pretraining uses masked reconstruction of normalized per-dimension geometry statistics and multi-resolution contrastive consistency losses on 61,052 meshes from public collections. This setup is intended to produce generalizable embeddings that support accurate reconstruction, high-accuracy retrieval, and explainable attributions without task-specific labels.
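The ablation's headline finding is that per-dimension normalization of the reconstruction targets is critical. The paper's exact procedure is not reproduced in this summary; below is a minimal sketch of what z-scoring each geometry-statistic dimension against training-set statistics might look like (the `per_dim_normalize` helper and the toy values are illustrative, not taken from the paper):

```python
import numpy as np

def per_dim_normalize(stats, eps=1e-8):
    """Normalize each geometry-statistic dimension to zero mean and
    unit variance, using per-dimension training-set statistics."""
    mu = stats.mean(axis=0)
    sigma = stats.std(axis=0)
    return (stats - mu) / (sigma + eps), mu, sigma

# Toy per-token geometry statistics: 5 tokens x 3 dims with very
# different raw scales (e.g. area, curvature, extent).
stats = np.array([[1000.0, 0.010, 5.0],
                  [1010.0, 0.020, 6.0],
                  [ 990.0, 0.030, 4.0],
                  [1005.0, 0.015, 5.5],
                  [ 995.0, 0.025, 4.5]])
normed, mu, sigma = per_dim_normalize(stats)
```

Without this step, dimensions with large raw scales (like the first column above) dominate a squared-error reconstruction loss, which is one plausible reading of the reported collapse (R² < 0.14) in the unnormalized ablation arm.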

Core claim

Shape establishes that self-supervised pretraining with masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency allows a transformer-based model to learn embeddings from CAD meshes that generalize well, as evidenced by strong performance on reconstruction and retrieval tasks on held-out data with little overfitting.
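The 98.1% top-1 retrieval figure follows the Wang-Isola protocol, whose details are not reproduced in this summary. A generic sketch of top-1 nearest-neighbor retrieval over cosine similarity (the similarity choice and the labeled gallery setup are assumptions for illustration):

```python
import numpy as np

def top1_retrieval_accuracy(queries, gallery, labels_q, labels_g):
    """Fraction of queries whose single nearest gallery embedding
    (by cosine similarity) carries the same label as the query."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    nearest = np.argmax(q @ g.T, axis=1)  # index of best match per query
    hits = [labels_g[i] == lq for i, lq in zip(nearest, labels_q)]
    return sum(hits) / len(hits)

# Toy check: two well-separated embedding directions.
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
gallery = np.array([[0.9, 0.1], [0.1, 0.9]])
acc = top1_retrieval_accuracy(queries, gallery, ["a", "b"], ["a", "b"])
```

A high top-1 score under this kind of protocol is evidence that the embedding space is discriminative, which is the property the core claim leans on.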

What carries the argument

Masked token reconstruction of normalized geometry statistics combined with multi-resolution contrastive consistency on a transformer backbone that includes a multi-scale geometry-aware tokenizer.

Load-bearing premise

That the combination of masked reconstruction and contrastive consistency on the chosen CAD datasets produces embeddings generalizing robustly to unseen industrial CAD analysis tasks.

What would settle it

Evaluating the model on a new collection of industrial CAD meshes from a different source and observing significantly lower reconstruction accuracy or retrieval performance than reported would falsify the generalization claim.

Figures

Figures reproduced from arXiv: 2604.22826 by Bayangmbe Mounmo, Mile Mitrovic, Sam Chien.

Figure 1. Evaluation metrics on the held-out validation split.

Figure 2. Alignment-uniformity diagnostic [8]. Alignment is the expected squared Euclidean distance between embeddings of positive pairs; uniformity is the logarithm of the expected exponentially-decayed pairwise distance between random pairs. Lower values on both axes indicate a better contrastive representation.
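The two axes in Figure 2 follow Wang and Isola [8]. Under their standard definitions (with temperature t = 2, an assumption since the caption does not state it), the metrics can be computed as:

```python
import numpy as np

def alignment(x, y):
    """Expected squared Euclidean distance between embeddings of
    positive pairs (x[i], y[i]); lower is better."""
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))

def uniformity(x, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs;
    lower means embeddings spread more uniformly on the sphere."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    i, j = np.triu_indices(len(x), k=1)
    return float(np.log(np.mean(np.exp(-t * d2[i, j]))))

# Two antipodal unit embeddings: perfectly aligned with themselves,
# maximally spread on the 1-sphere.
z = np.array([[1.0, 0.0], [-1.0, 0.0]])
```

For `z` above, `alignment(z, z)` is exactly 0 and `uniformity(z)` is log(exp(-2·4)) = -8, illustrating why lower is better on both axes.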
read the original abstract

Industrial CAD workflows require robust, generalizable 3D geometric representations supporting accuracy and explainability. We introduce Shape, a self-supervised foundation model converting surface meshes into dense per-token embeddings. Shape combines a structured 3D latent grid, a multi-scale geometry-aware tokenizer (MAGNO) with cross-attention, and a transformer processor using grouped-query attention and RMSNorm. A learned reconstruction prior enables per-region attribution for explainable predictions. Pretraining uses masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency. The 10.9M-parameter backbone is pretrained on 61,052 CAD meshes from Thingi10K, MFCAD, and Fusion360. On a held-out split of 2,983 meshes, Shape achieves reconstruction R2 = 0.729 and 98.1% top-1 retrieval under the Wang-Isola protocol, with near-zero reconstruction train/val gap (contrastive scores use a larger evaluation pool). A 2x2 ablation on loss type and target-space normalization shows per-dimension normalization is critical: without it, performance collapses (R2 < 0.14, top-1 < 88%); with it, both losses succeed (R2 > 0.70, top-1 > 96%). Smooth-L1 offers secondary stability. Code, embeddings, and an interactive demo are released at https://github.com/simd-ai/shape.
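The abstract's remark that "Smooth-L1 offers secondary stability" refers to the Huber-style loss of [28] as popularized in Fast R-CNN [29]. A minimal sketch (beta = 1.0 is the conventional default, not a value taken from the paper):

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber-style) loss: quadratic near zero, linear for
    large residuals, so outlier targets pull less hard than under L2."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff ** 2 / beta
    return diff - 0.5 * beta

# smooth_l1(0.5, 0.0) -> 0.125  (quadratic regime)
# smooth_l1(3.0, 0.0) -> 2.5    (linear regime)
```

The two branches meet smoothly at `diff == beta`, which is the stability property the ablation credits it with relative to plain L2 on noisy geometry targets.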

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Shape, a self-supervised 3D geometry foundation model that converts surface meshes into dense per-token embeddings using a structured 3D latent grid, the MAGNO tokenizer with cross-attention, and a transformer processor. It is pretrained on 61,052 CAD meshes from Thingi10K, MFCAD, and Fusion360 via masked-token reconstruction of normalized geometry statistics and multi-resolution contrastive consistency. On a held-out split of 2,983 meshes, it achieves reconstruction R² = 0.729 and 98.1% top-1 retrieval, with an ablation showing per-dimension normalization is critical for performance.

Significance. If the embeddings generalize beyond the proxy tasks, Shape could serve as a useful foundation model for industrial CAD analysis, supported by the released code, embeddings, and the 2x2 ablation isolating the role of normalization. The near-zero train/val gap and concrete metrics strengthen the internal validity of the pretraining approach.

major comments (1)
  1. [Abstract] The positioning of Shape as supporting 'industrial CAD workflows' and 'industrial CAD analysis' relies on the assumption that the learned embeddings will transfer to real downstream tasks. However, all reported results are limited to reconstruction R² and retrieval accuracy on held-out data from the pretraining datasets, with no evaluations on actual industrial tasks such as manufacturing feature classification, part segmentation, or assembly reasoning. This is load-bearing for the central claim of the manuscript.
minor comments (1)
  1. [Abstract] The reconstruction metric is written as 'R2'; it should be formatted as R² for mathematical clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the model's internal validity, the near-zero train/val gap, the ablation isolating normalization, and the value of the released code and embeddings. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] The positioning of Shape as supporting 'industrial CAD workflows' and 'industrial CAD analysis' relies on the assumption that the learned embeddings will transfer to real downstream tasks. However, all reported results are limited to reconstruction R² and retrieval accuracy on held-out data from the pretraining datasets, with no evaluations on actual industrial tasks such as manufacturing feature classification, part segmentation, or assembly reasoning. This is load-bearing for the central claim of the manuscript.

    Authors: We agree that the abstract's framing assumes the embeddings will prove useful for downstream industrial tasks and that direct evidence on tasks such as manufacturing feature classification, part segmentation, or assembly reasoning would strengthen the central claim. The current results focus on self-supervised pretraining quality via masked normalized geometry reconstruction (R² = 0.729) and multi-resolution contrastive retrieval (98.1% top-1) on held-out meshes drawn from the same industrial CAD sources (Thingi10K, MFCAD, Fusion360). These proxies directly measure geometric fidelity and discriminative power, which are prerequisites for the cited downstream applications; the 2×2 ablation further isolates that per-dimension normalization is essential for both objectives. In the foundation-model literature, such proxy metrics on held-out data are standard for validating the backbone prior to transfer studies. The released embeddings and code are explicitly provided to support exactly those follow-on evaluations. To address the concern without overstating current evidence, we will revise the abstract to state that the model supplies dense per-token embeddings validated on reconstruction and retrieval proxies and is intended as a foundation for industrial CAD analysis, while adding a dedicated limitations paragraph that explicitly notes the absence of downstream task results and the reliance on proxy generalization.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical pretraining and held-out evaluation

full rationale

The paper describes a self-supervised transformer model pretrained via masked reconstruction and contrastive losses on public CAD datasets, with performance reported as R² = 0.729 and 98.1% top-1 retrieval on a held-out split of 2,983 meshes plus an empirical 2×2 ablation on normalization. No equations, derivations, or self-citations are presented that reduce any claimed result to fitted parameters or prior author work by construction. All metrics are computed on independent test data under standard protocols, satisfying the criteria for a self-contained empirical result with no load-bearing circular steps.
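The rationale leans on R² = 0.729 as the headline reconstruction number. R² here is presumably the standard coefficient of determination; for reference:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 minus the residual sum of
    squares over the total sum of squares around the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Under this reading, R² = 0.729 means roughly 73% of the variance in the held-out geometry targets is explained by the reconstruction; a model that always predicts the mean scores 0, and a perfect reconstruction scores 1.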

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The performance claims rest on 10.9M learned parameters optimized via self-supervised objectives on the specified CAD datasets, plus design choices (normalization, loss type) validated only by the reported ablation.

free parameters (2)
  • 10.9M backbone parameters
    Weights of the tokenizer, latent grid, and transformer are fitted during pretraining on 61k meshes.
  • Per-dimension normalization scales
    Ablation shows these are critical; they are either learned or computed from training statistics.
axioms (2)
  • domain assumption Masked token reconstruction of normalized geometry statistics yields useful dense embeddings
    Core self-supervised objective stated in the abstract.
  • domain assumption Multi-resolution contrastive consistency improves geometric representation quality
    Second pretraining objective listed in the abstract.
invented entities (1)
  • MAGNO tokenizer · no independent evidence
    purpose: Multi-scale geometry-aware tokenization via cross-attention on the latent grid
    New component introduced to process 3D meshes at multiple scales.

pith-pipeline@v0.9.0 · 5561 in / 1710 out tokens · 67221 ms · 2026-05-10T06:11:55.196605+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In CVPR, 2022.
  2. [2] Yatian Pang, Wenxiao Wang, Francis E. H. Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In ECCV, 2022.
  3. [3] Shizheng Wen, Arsh Kumbhat, Levi Lingsch, Sepehr Mousavi, Yizhou Zhao, Praveen Chandrashekar, and Siddhartha Mishra. Geometry-aware operator transformer as an efficient and accurate neural surrogate for PDEs on arbitrary domains. In NeurIPS, 2025. https://arxiv.org/abs/2505.18781
  4. [4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  5. [5] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In EMNLP, 2023.
  6. [6] Biao Zhang and Rico Sennrich. Root mean square layer normalization. In NeurIPS, 2019.
  7. [7] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864, 2021.
  8. [8] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.
  9. [9] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7), 2015.
  10. [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
  11. [11] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.
  12. [12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16×16 words: Transformers for image recognition at scale. In ICLR, 2021.
  13. [13] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
  14. [14] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.
  15. [15] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
  16. [16] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv:2003.04297, 2020.
  17. [17] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv:1807.03748, 2018.
  18. [18] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
  19. [19] Charles R. Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
  20. [20] Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-M2AE: Multi-scale masked autoencoders for hierarchical point cloud pre-training. In NeurIPS, 2022.
  21. [21] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, and Anima Anandkumar. Neural operator: Graph kernel network for partial differential equations. arXiv:2003.03485, 2020.
  22. [22] Sebastian Koch, Albert Matveev, Zhongshi Jiang, Francis Williams, Alexey Artemov, Evgeny Burnaev, Marc Alexa, Denis Zorin, and Daniele Panozzo. ABC: A big CAD model dataset for geometric deep learning. In CVPR, 2019.
  23. [23] Qingnan Zhou and Alec Jacobson. Thingi10K: A dataset of 10,000 3D-printing models. arXiv:1605.04797, 2016.
  24. [24] Karl D. D. Willis, Yewen Pu, Jieliang Luo, Hang Chu, Tao Du, Joseph G. Lambourne, Armando Solar-Lezama, and Wojciech Matusik. Fusion 360 Gallery: A dataset and environment for programmatic CAD construction from human design sequences. ACM Transactions on Graphics (SIGGRAPH), 2021.
  25. [25] Andrew Colligan, Trevor Robinson, Declan Nolan, Yang Hua, and Wanbin Cao. Hierarchical CADnet: Learning from B-reps for machining feature recognition. Computer-Aided Design, 2022.
  26. [26] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In CVPR, 2023.
  27. [27] Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In CVPR, 2019.
  28. [28] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
  29. [29] Ross Girshick. Fast R-CNN. In ICCV, 2015.
  30. [30] Reduan Achtibat, Maximilian Dreyer, Ilona Eisenbraun, Sebastian Bosse, Thomas Wiegand, Wojciech Samek, and Sebastian Lapuschkin. From attribution maps to human-understandable explanations through concept relevance propagation. Nature Machine Intelligence, 2023.
  31. [31] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam R. Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
  32. [32] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.