pith. machine review for the scientific record.

arxiv: 2604.04050 · v1 · submitted 2026-04-05 · 💻 cs.CV · cs.LG

Recognition: 2 theorem links


TORA: Topological Representation Alignment for 3D Shape Assembly

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:06 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords 3D shape assembly · flow matching · representation alignment · topological alignment · zero-shot transfer · pretrained 3D encoders · part assembly · domain shift robustness

The pith

TORA aligns flow models to frozen 3D encoders to speed assembly training and improve accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TORA, a framework that aligns the internal representations of a flow-matching model for 3D shape assembly with those of a frozen pretrained encoder, transferring geometric and contact-based relational information that guides part movements during assembly. The method adds no computation at inference time, yet it yields up to 6.9× faster training convergence, higher accuracy on benchmarks, and stronger results on previously unseen real and synthetic data.

Core claim

TORA introduces a topology-first representation alignment framework that distills relational structure from a frozen pretrained 3D encoder into the flow-matching backbone during training. It realizes this first through token-wise cosine matching of learned geometric descriptors, then through a Centered Kernel Alignment (CKA) loss that matches similarity structures, with alignment most effective at later transformer layers, where spatial relations emerge. Geometry- and contact-centric teacher properties, rather than semantic classification ability, drive the gains.
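To make the first objective concrete, here is a minimal sketch of token-wise cosine matching, assuming the student and teacher token grids have already been brought to a common token count and feature width (any projection heads and the array shapes below are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def cosine_alignment_loss(student_tokens, teacher_tokens, eps=1e-8):
    """Mean (1 - cosine similarity) between matched student/teacher tokens.

    Both inputs have shape (num_tokens, dim) and are assumed to be
    projected to the same dimension before alignment.
    """
    s = student_tokens / (np.linalg.norm(student_tokens, axis=-1, keepdims=True) + eps)
    t = teacher_tokens / (np.linalg.norm(teacher_tokens, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((128, 64))           # frozen teacher token features
student = teacher + 0.05 * rng.standard_normal((128, 64))  # nearly-aligned student
loss = cosine_alignment_loss(student, teacher)     # small, since directions nearly match
```

The loss is zero when every student token points in the same direction as its teacher counterpart and grows toward 2 as tokens become anti-aligned, which is what makes it usable as a simple distillation target.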

What carries the argument

Topological representation alignment: matching token-wise cosine similarity, or centered kernel alignment (CKA), between the student flow model's representations and those of the frozen teacher encoder.
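The CKA variant compares similarity structure rather than individual tokens. A minimal sketch using the standard linear-kernel CKA (the paper's exact kernel and normalization may differ):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two sets of token features.

    X: (n_tokens, d_student), Y: (n_tokens, d_teacher). Returns a value in
    [0, 1]; 1 means identical similarity structure. An alignment loss can
    then be taken as 1 - CKA.
    """
    Xc = X - X.mean(axis=0, keepdims=True)  # center each feature dimension
    Yc = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Yc.T @ Xc, ord="fro") ** 2
    den = (np.linalg.norm(Xc.T @ Xc, ord="fro")
           * np.linalg.norm(Yc.T @ Yc, ord="fro"))
    return float(num / den)

rng = np.random.default_rng(1)
student = rng.standard_normal((64, 32))
# CKA is invariant to rotation and isotropic rescaling of either view,
# so a rotated, rescaled copy still scores ~1.
Q, _ = np.linalg.qr(rng.standard_normal((32, 32)))
assert abs(linear_cka(student, 3.0 * student @ Q) - 1.0) < 1e-6
```

That invariance is the point of using CKA here: the student does not need to reproduce the teacher's coordinates, only the relational structure among tokens.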

If this is right

  • Training converges up to 6.9 times faster than unaligned flow-matching baselines.
  • Assembly accuracy rises on in-distribution benchmarks spanning geometric, semantic, and inter-object tasks.
  • Performance holds up better under domain shifts to unseen real-world and synthetic datasets.
  • Zero-shot transfer gains are especially large compared with prior methods.
  • State-of-the-art results are reached with zero added cost at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teacher encoders selected for geometric rather than semantic properties are likely to produce stronger alignment benefits in assembly tasks.
  • The same alignment strategy could be applied to other generative backbones such as diffusion models for 3D tasks.
  • Systematic sweeps of alignment depth across different network architectures would identify general rules for when relational distillation helps most.
  • Extending the approach to time-varying or articulated assemblies would test whether the learned topological relations remain useful beyond static part placement.

Load-bearing premise

That geometry- and contact-centric properties extracted from the frozen teacher encoder supply the right relational guidance for the flow model and that alignment at later layers improves outcomes without negative transfer.
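This premise can be exercised with a toy training-step sketch in which the alignment term is attached at a chosen late block of the student and weighted against the flow-matching loss. Everything here is illustrative, not the paper's architecture: the tanh-MLP stack stands in for the flow backbone, and `align_block` and `lambda_align` are hypothetical names and values.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_TOKENS, N_BLOCKS = 32, 64, 6
blocks = [np.eye(DIM) + 0.1 * rng.standard_normal((DIM, DIM))
          for _ in range(N_BLOCKS)]

def student_forward(x):
    """Toy stand-in for the flow backbone: returns output and per-block features."""
    feats, h = [], x
    for W in blocks:
        h = np.tanh(h @ W)
        feats.append(h)
    return h, feats

def cosine_align(s, t, eps=1e-8):
    s = s / (np.linalg.norm(s, axis=-1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

x = rng.standard_normal((N_TOKENS, DIM))
target_velocity = rng.standard_normal((N_TOKENS, DIM))  # flow-matching regression target
teacher_feats = rng.standard_normal((N_TOKENS, DIM))    # frozen teacher tokens (fixed)

out, feats = student_forward(x)
flow_loss = float(np.mean((out - target_velocity) ** 2))

align_block = -2    # hypothetical: align a late block, per the paper's depth finding
lambda_align = 0.5  # hypothetical weight on the alignment term
total_loss = flow_loss + lambda_align * cosine_align(feats[align_block], teacher_feats)
```

Negative transfer, under this sketch, would show up as the alignment term pulling `total_loss` gradients in directions that slow or degrade the flow objective; the paper's claim is that with late-layer alignment it does the opposite.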

What would settle it

A controlled training run on one of the five assembly benchmarks in which adding the TORA alignment loss produces slower convergence or lower final accuracy than the baseline flow model without alignment.

Figures

Figures reproduced from arXiv: 2604.04050 by Marc Pollefeys, Nahyuk Lee, Sunghwan Hong, Zhiang Chen.

Figure 1. Multi-part assembly results across regimes.
Figure 2. Overview of the Topological Representation Alignment (TORA).
Figure 3. Conceptual illustration of alignment objectives.
Figure 4. Correlation analysis of teacher representations.
Figure 5. Impact of different teachers on distillation. Part Accuracy of TORA under the L_cos-dist and L_CKA objectives across various 3D foundation models as teachers on the Breaking Bad dataset [45]; the dashed line indicates the RPF baseline [48].
Figure 6. Spatial structure emerges in later layers.
Figure 7. Ablation study on alignment layer depth. Part Accuracy consistently improves when alignment is applied to later layers.
Figure 8. Qualitative comparison. While the baseline RPF often struggles with precise part positioning and fails to resolve complex inter-part relations, TORA consistently produces structurally coherent assemblies that closely match the ground truth.
Figure 9. Convergence comparison. Validation Part Accuracy of Ours (CKA) against the RPF baseline and other alignment strategies, Ours (NT-Xent) and Ours (Cos-dist), over training epochs across three datasets; the dashed horizontal line marks the peak accuracy of the baseline, and the annotated multipliers give the convergence speedup relative to the baseline reaching that peak.
read the original abstract

Flow-matching methods for 3D shape assembly learn point-wise velocity fields that transport parts toward assembled configurations, yet they receive no explicit guidance about which cross-part interactions should drive the motion. We introduce TORA, a topology-first representation alignment framework that distills relational structure from a frozen pretrained 3D encoder into the flow-matching backbone during training. We first realize this via simple instantiation, token-wise cosine matching, which injects the learned geometric descriptors from the teacher representation. We then extend to employ a Centered Kernel Alignment (CKA) loss to match the similarity structure between student and teacher representations for enhanced topological alignment. Through systematic probing of diverse 3D encoders, we show that geometry- and contact-centric teacher properties, not semantic classification ability, govern alignment effectiveness, and that alignment is most beneficial at later transformer layers where spatial structure naturally emerges. TORA introduces zero inference overhead while yielding two consistent benefits: faster convergence (up to 6.9$\times$) and improved accuracy in-distribution, along with greater robustness under domain shift. Experiments on five benchmarks spanning geometric, semantic, and inter-object assembly demonstrate state-of-the-art performance, with particularly pronounced gains in zero-shot transfer to unseen real-world and synthetic datasets. Project page: https://nahyuklee.github.io/tora.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript introduces TORA, a topology-first representation alignment framework for flow-matching models in 3D shape assembly. It distills relational geometric and contact-centric structure from a frozen pretrained 3D encoder into the flow backbone via token-wise cosine matching and Centered Kernel Alignment (CKA) loss, with alignment applied most effectively at later transformer layers. The method is claimed to deliver up to 6.9× faster convergence, higher in-distribution accuracy, improved robustness under domain shift, and state-of-the-art results on five benchmarks spanning geometric, semantic, and inter-object assembly tasks, all with zero inference overhead.

Significance. If the reported gains hold under rigorous verification, TORA would offer a practical, low-cost way to inject useful relational priors from pretrained geometry encoders into flow-matching pipelines, with particular value for zero-shot transfer in assembly tasks. The systematic encoder-probing experiments provide useful empirical guidance on which pretrained properties (geometry/contact vs. semantics) transfer effectively.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. We are encouraged that the topology-first alignment approach, its empirical guidance on encoder properties, and the reported gains in convergence and robustness are viewed as potentially valuable for flow-matching pipelines.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claims rest on an empirical pipeline: a frozen external pretrained 3D encoder supplies relational targets (via token-wise cosine or CKA losses) that are injected into a flow-matching student during training; benefits are then measured on held-out benchmarks. No derivation, equation, or performance metric is defined in terms of itself or reduced to a fitted parameter that is later renamed a prediction. Alignment targets originate outside the model being trained, and reported gains (convergence speed, accuracy, zero-shot robustness) are independent observables rather than tautological consequences of the loss. Any self-citations are peripheral and non-load-bearing for the core argument.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pretrained 3D encoders already encode useful cross-part relational structure for assembly; no new entities are introduced and no explicit free parameters beyond standard training losses are named.

axioms (1)
  • domain assumption: Frozen pretrained 3D encoders capture geometry- and contact-centric relational structure that is useful for guiding flow-matching assembly.
    The method selects and aligns to such encoders; effectiveness is claimed to depend on this property rather than semantic classification ability.

pith-pipeline@v0.9.0 · 5539 in / 1285 out tokens · 46272 ms · 2026-05-13T17:06:53.768947+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 4 internal anchors

  1. An, H., Kim, J.H., Park, S., Jung, J., Han, J., Hong, S., Kim, S.: Cross-view completion models are zero-shot correspondence estimators. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 1103–1115 (2025)
  2. Besl, P.J., McKay, N.D.: Method for registration of 3-D shapes. In: Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611, pp. 586–606. SPIE (1992)
  3. Chaudhuri, S., Kalogerakis, E., Guibas, L., Koltun, V.: Probabilistic reasoning for assembly-based 3D modeling. In: ACM SIGGRAPH 2011 Papers, pp. 1–10 (2011)
  4. Cho, S., Hong, S., Jeon, S., Lee, Y., Sohn, K., Kim, S.: CATs: Cost aggregation transformers for visual correspondence. Advances in Neural Information Processing Systems 34, 9011–9023 (2021)
  5. Cho, S., Hong, S., Kim, S.: CATs++: Boosting cost aggregation with convolutions and transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(6), 7174–7194 (2022)
  6. Cho, S., Shin, H., Hong, S., Arnab, A., Seo, P.H., Kim, S.: CAT-Seg: Cost aggregation for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4113–4123 (2024)
  7. Cortes, C., Mohri, M., Rostamizadeh, A.: Algorithms for learning kernels based on centered alignment. The Journal of Machine Learning Research 13, 795–828 (2012)
  8. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  9. Falcon, W., The PyTorch Lightning team: PyTorch Lightning (2019), https://github.com/PyTorchLightning/pytorch-lightning
  10. Hadgi, S., Gong, B., Sundararaman, R., Pierson, E., Li, L., Wonka, P., Ovsjanikov, M.: PatchAlign3D: Local feature alignment for dense 3D shape understanding. arXiv preprint arXiv:2601.02457 (2026)
  11. Han, J., An, H., Jung, J., Narihira, T., Seo, J., Fukuda, K., Kim, C., Hong, S., Mitsufuji, Y., Kim, S.: D²USt3R: Enhancing 3D reconstruction with 4D pointmaps for dynamic scenes. arXiv preprint arXiv:2504.06264 (2025)
  12. Han, J., Hong, S., Jung, J., Jang, W., An, H., Wang, Q., Kim, S., Feng, C.: Emergent outlier view rejection in visual geometry grounded transformers. arXiv preprint arXiv:2512.04012 (2025)
  13. Harada, K., Nagata, K., Rojas, J., Ramirez-Alpizar, I.G., Wan, W., Onda, H., Tsuji, T.: Proposal of a shape adaptive gripper for robotic assembly tasks. Advanced Robotics 30(17-18), 1186–1198 (2016)
  14. Hodan, T., Michel, F., Brachmann, E., Kehl, W., GlentBuch, A., Kraft, D., Drost, B., Vidal, J., Ihrke, S., Zabulis, X., et al.: BOP: Benchmark for 6D object pose estimation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34 (2018)
  15. Hong, S., Cho, S., Kim, S., Lin, S.: Unifying feature and cost aggregation with transformers for semantic and visual correspondence. arXiv preprint arXiv:2403.11120 (2024)
  16. Hong, S., Cho, S., Nam, J., Lin, S., Kim, S.: Cost aggregation with 4D convolutional Swin transformer for few-shot segmentation. In: European Conference on Computer Vision, pp. 108–126. Springer (2022)
  17. Hong, S., Jung, J., Shin, H., Han, J., Yang, J., Luo, C., Kim, S.: PF3plat: Pose-free feed-forward 3D Gaussian splatting. arXiv preprint arXiv:2410.22128 (2024)
  18. Hong, S., Jung, J., Shin, H., Yang, J., Kim, S., Luo, C.: Unifying correspondence, pose and NeRF for generalized pose-free novel view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20196–20206 (2024)
  19. Hong, S., Kim, S.: Deep matching prior: Test-time optimization for dense correspondence. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9907–9917 (2021)
  20. Hong, S., Nam, J., Cho, S., Hong, S., Jeon, S., Min, D., Kim, S.: Neural matching fields: Implicit representation of matching fields for visual correspondence. Advances in Neural Information Processing Systems 35, 13512–13526 (2022)
  21. Huang, J., Zhan, G., Fan, Q., Mo, K., Shao, L., Chen, B., Guibas, L., Dong, H.: Generative 3D part assembly via dynamic graph learning (2020)
  22. Huang, J., Kumar, S.R., Mitra, M., Zhu, W.J., Zabih, R.: Image indexing using color correlograms. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 762–768. IEEE (1997)
  23. Jocher, G., Stoken, A., Borovec, J., Changyu, L., Hogan, A., Chaurasia, A., Diaconu, L., Ingham, F., Colmagro, A., Ye, H., et al.: ultralytics/yolov5: v4.0 - nn.SiLU() activations, Weights & Biases logging, PyTorch Hub integration. Zenodo (2021)
  24. Jones, R.K., Barton, T., Xu, X., Wang, K., Jiang, E., Guerrero, P., Mitra, N.J., Ritchie, D.: ShapeAssembly: Learning to generate programs for 3D shape structure synthesis. ACM Transactions on Graphics (TOG) 39(6), 1–20 (2020)
  25. Kim, C., Shin, H., Hong, E., Yoon, H., Arnab, A., Seo, P.H., Hong, S., Kim, S.: Seg4Diff: Unveiling open-vocabulary segmentation in text-to-image diffusion transformers. arXiv preprint arXiv:2509.18096 (2025)
  26. Lamb, N., Palmer, C., Molloy, B., Banerjee, S., Banerjee, N.K.: Fantastic breaks: A dataset of paired 3D scans of real-world broken objects and their complete counterparts. In: CVPR (2023)
  27. Lee, J., Jung, J., Han, J., Narihira, T., Fukuda, K., Seo, J., Hong, S., Mitsufuji, Y., Kim, S.: 3D scene prompting for scene-consistent camera-controllable video generation. arXiv preprint arXiv:2510.14945 (2025)
  28. Lee, N., Min, J., Lee, J., Kim, S., Lee, K., Park, J., Cho, M.: 3D geometric shape assembly via efficient point cloud matching. In: Proceedings of the International Conference on Machine Learning (ICML) (2024)
  29. Lee, N., Min, J., Lee, J., Park, C., Cho, M.: Combinative matching for geometric shape assembly. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9540–9549 (2025)
  30. Leng, X., Singh, J., Hou, Y., Xing, Z., Xie, S., Zheng, L.: REPA-E: Unlocking VAE for end-to-end tuning of latent diffusion transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18262–18272 (2025)
  31. Li, S., Jiang, Z., Chen, G., Xu, C., Tan, S., Wang, X., Fang, I., Zyskowski, K., McPherron, S.P., Iovita, R., et al.: GARF: Learning generalizable 3D reassembly for real-world fractures. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5711–5721 (2025)
  32. Li, Y., Mo, K., Duan, Y., Wang, H., Zhang, J., Shao, L.: Category-level multi-part multi-joint 3D shape assembly. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3281–3291 (2024)
  33. Liu, M., Shi, R., Kuang, K., Zhu, Y., Li, X., Han, S., Cai, H., Porikli, F., Su, H.: OpenShape: Scaling up 3D shape representation towards open-world understanding. Advances in Neural Information Processing Systems 36, 44860–44879 (2023)
  34. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
  35. Lu, J., Liang, Y., Han, H., Hua, J., Jiang, J., Li, X., Huang, Q.: A survey on computational solutions for reconstructing complete objects by reassembling their fractured parts. In: Computer Graphics Forum, vol. 44, p. e70081. Wiley Online Library (2025)
  36. Lu, J., Sun, Y., Huang, Q.: Jigsaw: Learning to assemble multiple fractured objects (2023), https://openreview.net/forum?id=OwpaO4w6K7
  37. Ma, Z., Yue, Y., Gkioxari, G.: Find any part in 3D. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7818–7827 (2025)
  38. McBride, J.C., Kimia, B.B.: Archaeological fragment reconstruction using curve-matching. In: CVPRW (2003)
  39. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
  40. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205 (2023)
  41. Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems 30 (2017)
  42. Qi, Y., Ju, Y., Wei, T., Chu, C., Wong, L.L., Xu, H.: Two by two: Learning multi-task pairwise objects assembly for generalizable robot manipulation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 17383–17393 (2025)
  43. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
  44. Rousseeuw, P.J.: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65 (1987)
  45. Sellán, S., Chen, Y.C., Wu, Z., Garg, A., Jacobson, A.: Breaking Bad: A dataset for geometric fracture and reassembly. Advances in Neural Information Processing Systems 35, 38885–38898 (2022)
  46. Singh, J., Leng, X., Wu, Z., Zheng, L., Zhang, R., Shechtman, E., Xie, S.: What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794 (2025)
  47. Son, K., Almeida, E.B., Cooper, D.B.: Axially symmetric 3D pots configuration system using axis of symmetry and break curve. In: CVPR (2013)
  48. Sun, T., Zhu, L., Huang, S., Song, S., Armeni, I.: Rectified point flow: Generic point cloud pose estimation. arXiv preprint arXiv:2506.05282 (2025)
  49. Wang, Z., Chen, J., Furukawa, Y.: PuzzleFusion++: Auto-agglomerative 3D fracture assembly by denoise and verify. In: ICLR (2025)
  50. Wang, Z., Zhao, W., Zhou, Y., Li, Z., Liang, Z., Shi, M., Zhao, X., Zhou, P., Zhang, K., Wang, Z., et al.: REPA works until it doesn't: Early-stopped, holistic alignment supercharges diffusion training. arXiv preprint arXiv:2505.16792 (2025)
  51. Wu, G., Zhang, S., Shi, R., Gao, S., Chen, Z., Wang, L., Chen, Z., Gao, H., Tang, Y., Yang, J., et al.: Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467 (2025)
  52. Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., Bian, J.: Geometry forcing: Marrying video diffusion and 3D representation for consistent world modeling. arXiv preprint arXiv:2507.07982 (2025)
  53. Wu, R., Tie, C., Du, Y., Zhao, Y., Dong, H.: Leveraging SE(3) equivariance for learning 3D geometric shape assembly. In: ICCV (2023)
  54. Wu, X., DeTone, D., Frost, D., Shen, T., Xie, C., Yang, N., Engel, J., Newcombe, R., Zhao, H., Straub, J.: Sonata: Self-supervised learning of reliable point representations. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 22193–22204 (2025)
  55. Wu, X., Jiang, L., Wang, P.S., Liu, Z., Liu, X., Qiao, Y., Ouyang, W., He, T., Zhao, H.: Point Transformer V3: Simpler, faster, stronger. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4840–4851 (2024)
  56. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017)
  57. Xu, B., Zheng, S., Jin, Q.: SPAFormer: Sequential 3D part assembly with transformers. In: 2025 International Conference on 3D Vision (3DV), pp. 1317–1327. IEEE (2025)
  58. Yoo, S.J., Liu, S., Arshad, M.Z., Kim, J., Kim, Y.M., Aloimonos, Y., Fermuller, C., Joo, K., Kim, J., Hong, J.H.: Structure-from-Sherds++: Robust incremental 3D reassembly of axially symmetric pots from unordered and mixed fragment collections. arXiv preprint arXiv:2502.13986 (2025)
  59. Yoon, H., Jung, J., Kim, J., Choi, H., Shin, H., Lim, S., An, H., Kim, C., Han, J., Kim, D., et al.: Visual representation alignment for multimodal large language models. arXiv preprint arXiv:2509.07979 (2025)
  60. Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940 (2024)
  61. Yu, X., Tang, L., Rao, Y., Huang, T., Zhou, J., Lu, J.: Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19313–19322 (2022)
  62. Yue, Y., Robert, D., Wang, J., Hong, S., Wegner, J.D., Rupprecht, C., Schindler, K.: LitePT: Lighter yet stronger point transformer. arXiv preprint arXiv:2512.13689 (2025)
  63. Zakka, K., Zeng, A., Lee, J., Song, S.: Form2Fit: Learning shape priors for generalizable assembly from disassembly (2020)
  64. Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language-image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11975–11986 (2023)
  65. Zhang, Y., Wu, X., Lao, Y., Wang, C., Tian, Z., Wang, N., Zhao, H.: Concerto: Joint 2D-3D self-supervised learning emerges spatial representations. arXiv preprint arXiv:2510.23607 (2025)
  66. Zhou, J., Wang, J., Ma, B., Liu, Y.S., Huang, T., Wang, X.: Uni3D: Exploring unified 3D representation at scale. arXiv preprint arXiv:2310.06773 (2023)