Feedforward 3D Editing Learns from Semantic-Part Transformation

Hao Zhao; Henghaofan Zhang; Jiawei Weng; Junhao Chen; Peishuo Li; Saining Zhang; Zhenxin Diao

arxiv: 2605.27351 · v2 · pith:LCAE4DUWnew · submitted 2026-05-26 · 💻 cs.CV

Pith reviewed 2026-06-29 17:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D editing · semantic parts · feedforward network · paired supervision · Pxform dataset · PartFlow · geometric editing · appearance editing

The pith

Semantic-part transformations create high-quality paired data that trains feedforward 3D editors to state-of-the-art performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that feedforward 3D editing has been limited by the absence of high-quality paired supervision, which prior datasets fail to provide because of inaccurate localization and weak preservation of source structure. By grounding edits directly in semantic 3D parts, the authors build Pxform, a dataset of over 100K consistent before/after pairs across seven edit types. This supervision enables PartFlow to inject source-aware latent control into pretrained generative priors, using mask-aware velocity preservation and render-space consistency losses to maintain geometry, multi-view coherence, and localized controllability. A sympathetic reader would care because the approach moves 3D editing from training-free pipelines toward learned, scalable models that do not require edit masks at inference. If correct, the result is higher-fidelity geometric and appearance edits on standard benchmarks.

Core claim

Scalable feedforward 3D editing should be learned from semantic-part transformations. The Pxform pipeline produces over 100K consistent before/after pairs by grounding edits in semantic 3D parts rather than unstructured shapes, overcoming inaccurate localization and weak preservation. PartFlow then adds source-aware latent control to pretrained 3D generative priors together with mask-aware velocity preservation and render-space consistency supervision, jointly raising edit fidelity and source preservation while requiring no 3D edit mask at inference and reaching state-of-the-art results on both geometric and appearance editing benchmarks.
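As a rough illustration of what "mask-aware velocity preservation" could look like in a flow-matching setup, the sketch below adds a preservation term that applies only outside the edited region. The text reviewed here gives no formula, so the loss form, the function name `mask_aware_velocity_loss`, the weight `preserve_weight`, and the use of a source-derived reference velocity are all assumptions made for illustration, not the authors' definition.

```python
from typing import Sequence


def mask_aware_velocity_loss(
    v_pred: Sequence[float],
    v_target: Sequence[float],
    v_source: Sequence[float],
    edit_mask: Sequence[int],
    preserve_weight: float = 1.0,
) -> float:
    """Hypothetical loss: flow-matching MSE plus a masked preservation term.

    v_pred    : velocity predicted by the network, one value per latent element
    v_target  : velocity toward the edited (after) latent
    v_source  : velocity toward the source (before) latent
    edit_mask : 1 where the edit applies, 0 where the source should be kept
    """
    n = len(v_pred)
    if n == 0 or not (len(v_target) == len(v_source) == len(edit_mask) == n):
        raise ValueError("all inputs must be non-empty and the same length")

    # Flow-matching term, averaged over every latent element.
    fm = sum((p - t) ** 2 for p, t in zip(v_pred, v_target)) / n

    # Preservation term, restricted to elements outside the edit mask.
    kept = [i for i, m in enumerate(edit_mask) if m == 0]
    if kept:
        keep = sum((v_pred[i] - v_source[i]) ** 2 for i in kept) / len(kept)
    else:
        keep = 0.0

    return fm + preserve_weight * keep
```

In this sketch the mask enters only through the training loss, which is one way a model could be trained with mask supervision and still need no 3D edit mask at inference.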

What carries the argument

Pxform dataset pipeline that grounds edits in semantic 3D parts to generate consistent paired data, and PartFlow network that injects source-aware latent control into pretrained priors with mask-aware velocity preservation and render-space consistency supervision.
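The Figure 4 caption describes the injection as ControlNet-style. A minimal sketch of that general pattern follows: a control branch's output passes through a zero-initialized projection and is added to the frozen backbone's hidden state, so the pretrained prior is unchanged at the start of training. The helper names (`linear`, `zero_matrix`, `inject_source_control`), the vector shapes, and the single-layer form are hypothetical; this is not PartFlow's actual architecture.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weight: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product, standing in for a learned projection."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def zero_matrix(rows: int, cols: int) -> Matrix:
    """Zero-initialized projection weights."""
    return [[0.0] * cols for _ in range(rows)]


def inject_source_control(
    base_hidden: Vector, source_latent: Vector, proj: Matrix
) -> Vector:
    """Add the projected source latent to the backbone's hidden state."""
    residual = linear(proj, source_latent)
    return [h + r for h, r in zip(base_hidden, residual)]


# With zero-initialized weights the injection is a no-op,
# so the pretrained backbone's output is preserved exactly.
hidden = [1.0, 2.0]
source = [5.0, 6.0, 7.0]
unchanged = inject_source_control(hidden, source, zero_matrix(2, 3))

# After training moves the weights away from zero, the source latent
# starts to steer the hidden state.
steered = inject_source_control(hidden, source, [[1.0, 0.0, 0.0], [0.0, 0.5, 0.0]])
```

Starting from zeros means early training steps cannot damage the pretrained prior; the control signal is learned gradually.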

If this is right

High-quality semantic-part supervision improves both edit fidelity and source preservation in feedforward models.
PartFlow achieves state-of-the-art results on geometric and appearance editing benchmarks without 3D edit masks at inference.
Grounding edits in semantic parts enables consistent multi-view pairs that support scalable training.
The approach applies across seven edit types while maintaining structural coherence and localized controllability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Part-based pairing strategies could be applied to other generative tasks such as 3D animation or scene composition where structural consistency matters.
The emphasis on render-space consistency suggests that similar supervision might improve multi-view diffusion models used for 3D generation.
If semantic parts drive performance, then explicit part segmentation modules inside editing networks could yield further gains on complex objects.
Dataset construction methods like Pxform may transfer to text-conditioned or image-conditioned 3D editing by replacing manual part labels with automatic semantic extraction.

Load-bearing premise

Grounding edits in semantic 3D parts produces high-quality consistent before/after pairs that overcome the inaccurate localization and weak preservation of earlier datasets.

What would settle it

The claim would be undermined if PartFlow, trained on Pxform and evaluated on standard geometric and appearance 3D editing benchmarks, performed no better than existing training-free or feedforward methods.

Figures

Figures reproduced from arXiv: 2605.27351 by Hao Zhao, Henghaofan Zhang, Jiawei Weng, Junhao Chen, Peishuo Li, Saining Zhang, Zhenxin Diao.

**Figure 1.** We introduce Pxform, a high-quality holistic 3D editing dataset with over 100K consistent before/after pairs, covering seven edit types: addition, ...

**Figure 2.** Qualitative comparison of editing pairs from Pxform, 3DEditVerse [ ...

**Figure 3.** Overview of the modular Pxform data construction pipeline. Starting from part-segmented 3D assets, the pipeline first refines semantic part labels and ...

**Figure 4.** Overview of PartFlow. PartFlow introduces ControlNet-style source-latent injection into the two-stage TRELLIS editing process: Stage 1 controls ...

**Figure 5.** Qualitative results on Uni3DEdit-Bench for shape editing.

**Figure 6.** Qualitative results on Uni3DEdit-Bench for appearance editing.

**Figure 7.** Some data from Pxform

**Figure 8.** Qualitative comparison of Pxform, 3DEditVerse and Nano3D-Edit100k.

**Figure 9.** Qualitative comparison of Pxform, 3DEditVerse and Nano3D-Edit100k.

**Figure 10.** Qualitative comparison of Pxform, 3DEditVerse and Nano3D-Edit100k.

**Figure 11.** More Samples in Pxform Dataset

**Figure 12.** Qualitative results on Uni3DEdit-Bench

**Figure 13.** Qualitative results on Uni3DEdit-Bench

**Figure 14.** Qualitative results on Uni3DEdit-Bench

**Figure 15.** Qualitative results on Uni3DEdit-Bench

**Figure 16.** Qualitative results on Uni3DEdit-Bench

**Figure 17.** Qualitative results on Uni3DEdit-Bench

**Figure 18.** Qualitative results on Uni3DEdit-Bench

**Figure 19.** Qualitative results on Uni3DEdit-Bench

**Figure 20.** Qualitative results on Uni3DEdit-Bench

**Figure 21.** Qualitative results on Uni3DEdit-Bench

**Figure 22.** Qualitative results on Uni3DEdit-Bench

Original abstract

3D editing is a fundamental capability for scalable 3D content creation. While image editing has rapidly evolved toward large-scale feedforward generative paradigms, 3D AI generation remains dominated by training-free editing pipelines. A central challenge of feedforward 3D editing lies in the lack of high-quality paired supervision. Editable 3D assets require simultaneous preservation of geometry, multi-view consistency, structural coherence, and localized edit controllability. Existing 3D editing datasets often rely on independently generated assets, image-mediated reconstruction or narrow edit taxonomies, leading to inaccurate localization, weak preservation, blurred edit boundaries, and limited semantic consistency. In this work, we introduce a new perspective: scalable feedforward 3D editing should be learned from semantic-part transformations. Based on this insight, we propose Pxform, a high-quality 3D editing dataset with over 100K consistent before/after editing pairs across seven edit types. Instead of treating objects as unstructured shapes, our pipeline grounds edits directly in semantic 3D parts. Built upon Pxform, we further propose PartFlow, a feedforward 3D editing network that injects source-aware latent control into pretrained 3D generative priors. PartFlow introduces mask-aware velocity preservation and render-space consistency supervision to jointly improve edit fidelity and source preservation, while requiring no 3D edit mask during inference. Extensive experiments demonstrate that high-quality semantic-part supervision substantially improves scalable 3D editing, enabling PartFlow to achieve state-of-the-art performance on both geometric and appearance editing benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a 100K-pair dataset by grounding 3D edits in semantic parts and trains a feedforward model on it, but the abstract gives no evidence the pairs or results actually work.

The main thing to know is that they treat 3D editing data as transformations of semantic parts rather than whole objects or image-based reconstructions. This produces Pxform with over 100K before/after pairs across seven edit types, then feeds it into PartFlow, a network that adds mask-aware velocity preservation and render-space consistency on top of pretrained 3D priors.

What stands out as new is the dataset construction itself. Prior work often ends up with inaccurate localization or weak source preservation because edits are not tied to explicit semantic structure. Grounding the pairs in parts is a direct attempt to fix that, and the model is designed to use the resulting supervision without requiring an edit mask at inference time.

The paper does a clean job laying out the target problems: geometry preservation, multi-view consistency, structural coherence, and localized control. The architectural additions line up with those goals.

The soft spot is that none of this can be checked yet. The claim that the pipeline yields high-quality consistent pairs rests entirely on the construction method, with no reported validation of pair fidelity or artifact rates. The state-of-the-art results on geometric and appearance benchmarks are stated without numbers, baselines, or ablation details. If the pairs turn out noisy or the gains small, the whole story weakens.

This is for researchers working on feedforward 3D generation and editing pipelines who need better paired supervision. Someone already building datasets or fine-tuning 3D priors might pick up the semantic-part idea even if the specific numbers need scrutiny.

It deserves peer review. The motivation is coherent and the approach targets a documented gap without obvious contradictions in the stated logic.

Referee Report

0 major / 3 minor

Summary. The paper claims that grounding 3D edits in semantic parts enables construction of a high-quality paired dataset (Pxform) with >100K before/after pairs across seven edit types, which in turn supports training of a feedforward model (PartFlow). PartFlow injects source-aware latent control into pretrained 3D generative priors and adds mask-aware velocity preservation plus render-space consistency supervision; the resulting model reaches SOTA on geometric and appearance editing benchmarks while requiring no 3D edit mask at inference. The central motivation is that prior datasets suffer from inaccurate localization, weak preservation, and blurred boundaries, problems that semantic-part grounding is asserted to solve.

Significance. If the dataset construction and experimental claims hold, the work supplies a concrete, scalable source of paired supervision for feedforward 3D editing—an area currently dominated by training-free pipelines. The explicit linkage between semantic-part transformations, preservation mechanisms, and benchmark gains is a coherent contribution that could accelerate controllable 3D content creation. The paper ships a new dataset and an inference-efficient architecture; these are the primary assets that would be evaluated by the community.

minor comments (3)

The abstract states that PartFlow 'requires no 3D edit mask during inference,' yet the training procedure relies on mask-aware velocity preservation; a short paragraph clarifying how the mask signal is obtained or approximated at training time would improve reproducibility.
The seven edit types are listed but not enumerated; adding a table or figure that shows one representative pair per type (with the semantic part highlighted) would make the dataset construction more transparent.
The claim of 'state-of-the-art performance on both geometric and appearance editing benchmarks' would be strengthened by an explicit statement of the exact metrics and baselines used, even if the numbers appear later in the experimental section.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. The report does not list any specific major comments, so there are no individual points requiring a point-by-point response.

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces a dataset construction pipeline (Pxform) grounded in semantic 3D parts and a feedforward editing network (PartFlow) that adds mask-aware velocity preservation and render-space consistency. No equations, fitted parameters, or derivations appear in the provided text. The central claims rest on the explicit design choices that target documented weaknesses of prior datasets (inaccurate localization, weak preservation), without any reduction of outputs to inputs by construction, self-citation load-bearing premises, or ansatzes smuggled via prior work. This is a standard empirical dataset-plus-model contribution whose value is external to any internal definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, axioms, or invented entities; review is limited to abstract content only.

pith-pipeline@v0.9.1-grok · 5828 in / 1109 out tokens · 31111 ms · 2026-06-29T17:54:03.166822+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 7 canonical work pages · 1 internal anchor

[1]

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Objaverse: A Universe of Annotated 3D Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13142–13153. https://arxiv.org/abs/2212.08051 Shaocong Dong, Lihe Ding, Xiao Chen, Yaokun Li, Yuxin Wang, Yucheng Wang, Qi Wang, Jaehyeok Kim, Chenjian Gao, Zhanpeng Huang, Zibin Wang, Tianfan Xue, and Dan Xu. 2025. From One ...

work page arXiv 2025
[2]

Efros, Aleksander Holynski, and Angjoo Kanazawa

https://openaccess.thecvf.com/content_ICCV_2019/html/Gkioxari_Mesh_R-CNN_ICCV_2019_paper.html Ayaan Haque, Matthew Tancik, Alexei A. Efros, Aleksander Holynski, and Angjoo Kanazawa. 2023. Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions. In Proceedings of the IEEE/CVF International Conference on Computer Vision. Zexin He and Tengfei Wang. 2023. Op...

work page arXiv 2023
[3]

arXiv:2512.21185 [cs.CV] https://arxiv.org/abs/2512.21185 Heewoo Jun and Alex Nichol

UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement. arXiv preprint arXiv:2512.21185 (2025). arXiv:2512.21185 [cs.CV] https://arxiv.org/abs/2512.21185 Heewoo Jun and Alex Nichol. 2023. Shap-E: Generating Conditional 3D Implicit Functions. arXiv preprint arXiv:2305.02463 (2023). Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühl...

work page doi:10.1145/3072959.3073599 2025
[4]

arXiv preprint arXiv:2412.08629 (2024)

FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models. arXiv preprint arXiv:2412.08629 (2024). arXiv:2412.08629 [cs.CV] https://arxiv.org/abs/2412.08629 Zeqiang Lai, Yunfei Zhao, Zibo Zhao, Haolin Liu, Qingxiang Lin, Jingwei Huang, Chunchao Guo, and Xiangyu Yue. 2025. LATTICE: Democratize High-Fidelity 3D Generation at Scale. arXiv prep...

work page doi:10.48550/arxiv.2405.16888 2024
[5]

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wonder3D: Single Image to 3D using Cross-Domain Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, and Chunchao Guo. 2025b. P3-SAM: Native 3D Part Segmentation. arXiv:2509.06784 [cs.CV] doi:10.48550/arXiv...

work page doi:10.48550/arxiv.2509.06784 2025
[6]

arXiv:2510.20155 [cs.CV] doi:10.48550/arXiv.2510.20155 Yuxuan Wang, Xuanyu Yi, Zike Wu, Na Zhao, Long Chen, and Hanwang Zhang

PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding. arXiv:2510.20155 [cs.CV] doi:10.48550/arXiv.2510.20155 Yuxuan Wang, Xuanyu Yi, Zike Wu, Na Zhao, Long Chen, and Hanwang Zhang. 2024c. View-consistent 3d editing with gaussian splatting. In European conference on computer vision. Springer, 404–420. Zhou Wang, Alan C...

work page doi:10.48550/arxiv.2510.20155 2004
[7]

Native and Compact Structured Latents for 3D Generation

Native and Compact Structured Latents for 3D Generation. arXiv preprint arXiv:2512.14692 (2025). arXiv:2512.14692 [cs.CV] https://arxiv.org/abs/2512.14692 Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. 2024. Structured 3D Latents for Scalable and Versatile 3D Generation. arXiv preprint arXi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.08643 2025
[8]

Generate exactly the requested number of edits according to the quota
[9]

Select valid target parts from the supplied part list
[10]

Group repeated instances into one edit when they form a semantic unit, e.g., both wheels or all chair legs
[11]

Group semantically coupled components when appropriate, e.g., eyes, nose, and mouth as a head-level edit
[12]

Avoid deleting the primary structural body of the object
[13]

Use clear English imperative edit instructions
[14]

new_part_desc

Produce target descriptions and after-edit descriptions suitable for downstream editing and verification. Rules: R1. selected_part_ids must be a subset of the input part list. R2. No two edits may share the same (edit_type, selected_part_ids) pair. R3. Global edits must use selected_part_ids = []. R4. Deletion cannot target the primary structural body. R5...
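Rules R1 to R4 above are concrete enough to check mechanically. The sketch below is one possible validator, not code from the paper. The field names `edit_type` and `selected_part_ids` come from the rules themselves; everything else is an assumption: the function name `validate_edits`, the `primary_body_id` argument, the literal type labels `"global"` and `"deletion"`, and the choice to treat part-id order as irrelevant for R2. Rules from R5 onward are truncated in the source and are not covered.

```python
from typing import Dict, List, Set


def validate_edits(
    edits: List[Dict], part_ids: Set[int], primary_body_id: int
) -> List[str]:
    """Return a list of rule violations (empty when R1-R4 all pass)."""
    errors: List[str] = []
    seen = set()
    for i, edit in enumerate(edits):
        edit_type = edit["edit_type"]
        selected = list(edit["selected_part_ids"])

        # R1: selected parts must come from the supplied part list.
        if not set(selected) <= part_ids:
            errors.append(f"edit {i}: R1 unknown part id")

        # R2: no two edits share the same (edit_type, selected_part_ids) pair.
        key = (edit_type, tuple(sorted(selected)))
        if key in seen:
            errors.append(f"edit {i}: R2 duplicate edit")
        seen.add(key)

        # R3: global edits select no parts (the "global" label is assumed).
        if edit_type == "global" and selected:
            errors.append(f"edit {i}: R3 global edit selects parts")

        # R4: deletion must spare the primary body ("deletion" label is assumed).
        if edit_type == "deletion" and primary_body_id in selected:
            errors.append(f"edit {i}: R4 deletes primary body")
    return errors


# Hypothetical object with parts 0..3, where part 0 is the primary body.
good = [
    {"edit_type": "deletion", "selected_part_ids": [1, 2]},
    {"edit_type": "global", "selected_part_ids": []},
]
bad = [
    {"edit_type": "deletion", "selected_part_ids": [0]},
    {"edit_type": "deletion", "selected_part_ids": [0]},
    {"edit_type": "global", "selected_part_ids": [3]},
    {"edit_type": "addition", "selected_part_ids": [9]},
]
good_errors = validate_edits(good, {0, 1, 2, 3}, primary_body_id=0)
bad_errors = validate_edits(bad, {0, 1, 2, 3}, primary_body_id=0)
```

A check like this would typically run on generated edit proposals before any editing step, so that invalid proposals are rejected early.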
