Geometry-Guided Modeling of Foundation Features Enables Generalizable Object Shape Deformation Learning

Dongsheng Xie; Kai Chen; Qi Dou; Rong Xiong; Yiyao Ma; Zelong Tan; Zhongxiang Zhou; Zhuheng Song

arxiv: 2605.29661 · v2 · pith:D5MUBHJGnew · submitted 2026-05-28 · 💻 cs.CV

Yiyao Ma, Kai Chen, Zhongxiang Zhou, Zhuheng Song, Dongsheng Xie, Zelong Tan, Rong Xiong, Qi Dou

Pith reviewed 2026-06-29 08:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D reconstruction · shape deformation · generalizable learning · category-level templates · monocular vision · feature modeling · robotic manipulation

The pith

Enriching foundation features with category template topology enables generalizable 3D shape deformation from single images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework for recovering 3D shapes from monocular images by deforming a fixed category-level template to fit the observed object. To manage large variations in shape and viewpoint, it enriches standard foundation features with the template's geometric structure, then uses this to direct the deformation process. A separate module adapts the template features to the target's viewpoint using multiple template views and poses. If effective, this yields reconstructions that work for objects never seen in training and even aid robotic grasping tasks.

Core claim

By modeling foundation features in a geometry-guided way that incorporates the topology of a category-level shape template and aggregating those features adaptively across views, the method learns to deform the template to match arbitrary target observations, achieving generalization to novel categories and viewpoints.

What carries the argument

The geometry-guided feature modeling mechanism, which enriches foundation features with template topology to create a geometry-aware representation that guides deformation.
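One way to picture "enriching foundation features with template topology" is neighbourhood smoothing over the template's points. The sketch below is only an illustration under stated assumptions, not the paper's mechanism: the k-nearest-neighbour graph, the blend weight `alpha`, and the names `knn_indices` and `enrich_features` are all hypothetical.

```python
import math

def knn_indices(points, k):
    """For each point, return indices of its k nearest neighbours (itself excluded)."""
    neighbours = []
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(order[:k])
    return neighbours

def enrich_features(points, features, k=2, alpha=0.5):
    """Blend each point's feature with the mean feature of its k nearest neighbours.

    Assumption: a kNN graph stands in for the template topology.
    """
    neighbours = knn_indices(points, k)
    enriched = []
    for i, feat in enumerate(features):
        dim = len(feat)
        mean = [sum(features[j][d] for j in neighbours[i]) / k for d in range(dim)]
        enriched.append([(1 - alpha) * feat[d] + alpha * mean[d] for d in range(dim)])
    return enriched

# Toy template: three nearby points and one far away, with 2-D per-point features.
template_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
foundation_features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [4.0, 4.0]]
geometry_aware = enrich_features(template_points, foundation_features, k=2, alpha=0.5)
```

Here each output feature mixes appearance information with that of its geometric neighbours, which is the general sense in which a representation can be called geometry-aware.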

If this is right

The framework outperforms existing methods on large shape variations and diverse viewpoints.
It generalizes effectively to unseen object categories.
It supports real-world dexterous robotic manipulation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests template-based deformation can serve as a bridge between fixed priors and flexible foundation models in 3D vision.
Future work might test whether the same enrichment process applies to non-rigid or articulated objects without category templates.

Load-bearing premise

A suitable category-level shape template must exist, and enriching foundation features with its topology must produce a representation that guides accurate deformation for any target view or unseen category.

What would settle it

Demonstrating that deformation fails to match the target shape when the category template topology does not align with the observed object's structure, or when the target view differs substantially from the aggregated template views.

Figures

Figures reproduced from arXiv: 2605.29661 by Dongsheng Xie, Kai Chen, Qi Dou, Rong Xiong, Yiyao Ma, Zelong Tan, Zhongxiang Zhou, Zhuheng Song.

**Figure 1.** The proposed object shape deformation learning framework can handle large template-target shape variations, remains robust to diverse camera viewpoints, and generalizes to unseen categories. It enables various downstream applications, and effectively supports generalizable dexterous robotic manipulation in the real world.

**Figure 2.** Overview of our proposed framework. The core of our approach is a conditional flow-matching module that warps a template shape toward a target via a continuous trajectory. This deformation is conditioned on the geometry-guided modeling of 2D foundation features. To ensure these features are spatially aligned and robust to varying observation angles, we introduce two key components: (1) a geometry-guided fe…

**Figure 3.** Qualitative comparison with existing shape deformation methods on novel target objects under the Random Template setting.

**Figure 4.** Quantitative and qualitative comparisons with existing 3D generative methods on single-view shape reconstruction. LRM-small, LRM-base, and LRM-large denote different model sizes. Phidias-Image and Phidias-3D refer to models conditioned solely on target images and those conditioned on additional 3D templates, respectively.

**Figure 5.** Qualitative results of contact map and grasp transfer for diverse objects across multiple robotic hands. Red and blue regions in the contact maps denote high and low contact values, respectively.

**Figure 6.** Qualitative results of generalizable dexterous manipulation in the real world.

**Figure 7.** Qualitative comparison of deformation and reconstruction results under different ablation settings.

**Figure 10.** Failure Analysis. When key regions are fully occluded in the target observation, our method tends to preserve the template’s corresponding structure. This leads to a discrepancy with the true target shape if the unobserved geometry differs from the template.

**Figure 11.** Visualization of shape and corresponding deformation field generated by our proposed method on diverse novel objects.

**Figure 12.** Qualitative results of deformation and reconstruction on real-world unseen object categories, along with transferred contact maps and robotic dexterous grasps.
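The Figure 2 caption describes a conditional flow-matching module that warps the template toward the target along a continuous trajectory. A minimal sketch of that idea, assuming fixed-step Euler integration of a velocity field, is given below. The learned, feature-conditioned network is replaced by a hypothetical toy field (`make_straight_line_field`) that already knows the target, so this only illustrates the integration step and not the paper's model.

```python
def euler_deform(template, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler updates."""
    points = [list(p) for p in template]
    dt = 1.0 / steps
    for s in range(steps):
        t = s * dt
        velocities = velocity_fn(points, t)
        points = [
            [x + dt * v for x, v in zip(p, vel)]
            for p, vel in zip(points, velocities)
        ]
    return points

def make_straight_line_field(template, target):
    """Hypothetical stand-in for a learned network: constant velocity (target - template)."""
    constant = [[b - a for a, b in zip(p, q)] for p, q in zip(template, target)]
    return lambda points, t: constant

template = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.5, 0.0), (2.0, 0.0, 1.0)]
deformed = euler_deform(template, make_straight_line_field(template, target), steps=4)
```

With the straight-line field, the integrated points land on the target after unit time. In a learned setting the velocity would instead be predicted from the conditioning signal, and the number of steps trades accuracy against cost.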

read the original abstract

Monocular 3D shape recovery is fundamental to geometric understanding, yet achieving robust generalization across arbitrary viewpoints and unseen object categories remains a significant challenge. In this paper, we present a generalizable deformation learning framework that reconstructs 3D objects by explicitly deforming a category-level shape template to match the target observation. To address complex shape variations between the template and the target, we introduce a geometry-guided feature modeling mechanism. This process first enriches foundation features with template topology to yield a geometry-aware representation, which is then explicitly correlated with the target observation to guide precise deformation. Furthermore, to bridge the disparity between the fixed template and arbitrary target views, we propose a view-adaptive feature aggregation module. This module leverages multi-view template features and their corresponding camera poses to enrich the canonical template representation, ensuring robust feature alignment regardless of the target's perspective. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in handling large shape variations and diverse viewpoints, exhibiting strong generalization to novel categories and effectively supporting downstream real-world dexterous robotic manipulation tasks. Project homepage: https://GODeform.github.io/
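The view-adaptive aggregation described in the abstract can be pictured as attention over template views. The sketch below rests on simplifying assumptions: each camera pose is reduced to a unit-less viewing direction, and views are weighted by a softmax over cosine similarity to the target's viewing direction. The function `aggregate_views`, the `temperature` parameter, and the toy data are hypothetical, not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_views(view_features, view_dirs, target_dir, temperature=0.1):
    """Fuse per-view template features, favouring views aligned with the target view."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    weights = softmax([cosine(d, target_dir) / temperature for d in view_dirs])
    dim = len(view_features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, view_features)) for d in range(dim)]
    return fused, weights

# Three template views (front, side, back) with toy 2-D features.
view_dirs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0)]
view_features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
fused, weights = aggregate_views(view_features, view_dirs, target_dir=(1.0, 0.1, 0.0))
```

In this toy case the front view dominates because the target is observed from almost the same direction. A lower temperature sharpens the selection, while a higher one averages the views more evenly.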

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds geometry-guided enrichment of foundation features plus view-adaptive multi-view aggregation to template deformation for monocular 3D recovery, but generalization to novel categories still hinges on external template availability.

The main thing to know is that this work deforms a category-level shape template by first enriching foundation features with template topology to produce a geometry-aware representation, then correlating that with the target observation, and finally using a view-adaptive module that aggregates multi-view template features and camera poses to align features across arbitrary viewpoints.

What is actually new is the explicit geometry-guided modeling step and the view-adaptive aggregation that conditions the canonical template on target perspective. These are concrete extensions of existing template deformation and foundation-model ideas, and the paper does a reasonable job of spelling out how each piece addresses a specific failure mode (large shape variation, viewpoint mismatch).

The soft spot is the generalization claim to novel categories. The approach requires a suitable category-level template for the unseen object; the abstract gives no independent mechanism for producing or validating that template. If experiments supply templates drawn from the same category distribution or created by hand, the reported gains are conditional on that external input rather than coming solely from the learned deformation. That assumption is load-bearing and needs clear experimental controls.

The paper shows clear thinking in how it chains the components and targets a downstream robotics task. No obvious internal contradictions appear from the description. This is for readers working on generalizable monocular 3D reconstruction or robotic manipulation pipelines. It is solid enough to deserve a serious referee who can check the ablations, template construction details, and quantitative protocols.

Referee Report

2 major / 0 minor

Summary. The paper proposes a generalizable deformation learning framework for monocular 3D shape recovery. It reconstructs objects by deforming a category-level shape template, using a geometry-guided feature modeling mechanism that enriches foundation features with template topology and a view-adaptive feature aggregation module that incorporates multi-view template features and camera poses. The abstract claims significant outperformance over state-of-the-art methods on large shape variations and diverse viewpoints, strong generalization to novel categories, and utility for downstream dexterous robotic manipulation.

Significance. If the central claims hold after verification, the explicit incorporation of template topology into foundation features and the view-adaptive aggregation could offer a useful inductive bias for template-driven 3D deformation. However, the approach's dependence on pre-existing category-level templates for novel categories means any generalization benefit is conditional on external template provision rather than emerging purely from the learned model.

major comments (2)

[Abstract] The claims of outperformance and generalization are asserted without any equations, ablation studies, error bars, dataset descriptions, or experimental protocols, rendering the central empirical claims unverifiable from the provided text.
[Abstract] The claim of strong generalization to novel categories rests on the assumption that suitable category-level shape templates exist and can be deformed accurately; no mechanism is described for obtaining or validating such templates independently for truly unseen categories, making the result conditional on this external input.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point-by-point below, drawing on details from the full manuscript.

Referee: [Abstract] The claims of outperformance and generalization are asserted without any equations, ablation studies, error bars, dataset descriptions, or experimental protocols, rendering the central empirical claims unverifiable from the provided text.

Authors: The abstract is a high-level summary by design and does not contain the full experimental details. The complete manuscript provides all requested elements: the geometry-guided feature modeling and view-adaptive aggregation equations appear in Section 3; ablation studies with quantitative results are in Section 4.3; tables include error bars (standard deviations over multiple runs); dataset descriptions (e.g., ShapeNet, real-world captures) and experimental protocols (training splits, viewpoint sampling, metrics) are specified in Sections 4.1 and 4.2. The outperformance claims are supported by direct comparisons in Tables 1-3. No changes to the abstract are required, as this structure follows standard practice for the venue.

Revision: no

Referee: [Abstract] The claim of strong generalization to novel categories rests on the assumption that suitable category-level shape templates exist and can be deformed accurately; no mechanism is described for obtaining or validating such templates independently for truly unseen categories, making the result conditional on this external input.

Authors: The method is explicitly template-driven: a category-level shape template is an input, and the contribution lies in learning to deform it robustly via geometry-guided features and view-adaptive aggregation. Experiments in Section 4.4 demonstrate generalization to novel categories when the corresponding templates are supplied (consistent with prior template-based works). We do not claim or provide a mechanism for automatically generating or validating templates for arbitrary unseen categories, as that lies outside the paper's scope. The abstract's generalization claim is therefore conditional on template availability, which we can clarify with one additional sentence in the revised abstract or introduction.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description contain no equations, derivations, or self-referential definitions that reduce claims to inputs by construction. The framework is presented as a methodological pipeline (template deformation + geometry-guided feature enrichment + view-adaptive aggregation) whose generalization performance is asserted via experiments rather than mathematical identities or fitted parameters renamed as predictions. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are quoted. The category-level template assumption is an external modeling choice, not a circular reduction within the paper's own chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, training details, or explicit assumptions; no free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5748 in / 1083 out tokens · 22204 ms · 2026-06-29T08:40:46.022016+00:00 · methodology

Reference graph

Works this paper leans on

2 extracted references

[1]

Lower CD values indicate better surface alignment. Earth Mover’s Distance (EMD) captures both geometry and point density by solving an optimal transport problem. It seeks a bijection $\phi : P \to G$ that minimizes the average distance: $\mathrm{EMD}(P, G) = \min_{\phi} \frac{1}{|P|} \sum_{p \in P} \lVert p - \phi(p) \rVert_2$, where lower values imply a more accurate reconstruction of the shape distribution. Fin...
[2]

“Chair” Template from “Table”

to capture structural relationships within the complete point cloud, using a 12-layer Transformer encoder with embedding dimension 384 and 6 attention heads, and a decoder that outputs point-wise features of dimension $d = 256$. For both the feature alignment module and the multi-view camera feature fusion module, we integrate contextual information int...

2024
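The extracted text above refers to two point-cloud metrics, Chamfer Distance (CD) and Earth Mover's Distance (EMD). The sketch below shows how they can be computed on tiny point sets. It assumes the standard symmetric squared-distance form of CD, and it evaluates EMD by brute force over all bijections, which is only feasible for a handful of points; practical evaluations would use an assignment solver or an approximation. The function names are illustrative.

```python
import itertools
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def chamfer_distance(P, G):
    """Symmetric Chamfer Distance with squared L2 nearest-neighbour terms (assumed form)."""
    term_p = sum(min(sq_dist(p, g) for g in G) for p in P) / len(P)
    term_g = sum(min(sq_dist(g, p) for p in P) for g in G) / len(G)
    return term_p + term_g

def earth_movers_distance(P, G):
    """EMD as the minimum mean L2 distance over all bijections P -> G (brute force)."""
    assert len(P) == len(G), "a bijection needs equally sized point sets"
    best = math.inf
    for perm in itertools.permutations(range(len(G))):
        cost = sum(math.dist(p, G[j]) for p, j in zip(P, perm)) / len(P)
        best = min(best, cost)
    return best

P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
G = [(1.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
cd = chamfer_distance(P, G)        # 0.5 + 2.0 = 2.5
emd = earth_movers_distance(P, G)  # best matching costs (2 + 0) / 2 = 1.0
```

The example shows why the two metrics can disagree: CD lets several points share one nearest neighbour, whereas EMD forces a one-to-one matching and therefore also penalizes uneven point density.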
