EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation

Chi-Wing Fu; Jiaxin Guo; Ka-Hei Hui; Kai Chen; Pheng-Ann Heng; Runsong Zhu; Wei Chen; Weiqiang Ren; Wei Yin; Xiaoyang Guo

arxiv: 2606.08980 · v1 · pith:3GORSWOLnew · submitted 2026-06-08 · 💻 cs.CV


Runsong Zhu, Jiaxin Guo, Xiaoyang Guo, Zhengzhe Liu, Ka-Hei Hui, Wei Yin, Kai Chen, Wei Chen, Weiqiang Ren, Yunhui Liu, Pheng-Ann Heng, Chi-Wing Fu


Pith reviewed 2026-06-27 17:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D panoptic segmentation · end-to-end framework · open-vocabulary · multi-view images · distillation training · mutual enhancement · semantic-instance consistency · feed-forward


The pith

EPS3D performs open-vocabulary 3D panoptic segmentation end-to-end from multi-view images via distillation training and mutual semantic-instance enhancement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that an end-to-end feed-forward network can produce coherent 3D semantic and instance labels directly from multi-view images. Existing approaches insert separate preprocessing stages that accumulate errors and break 3D consistency; EPS3D replaces those stages with distillation on varied 3D scenes plus a module that repeatedly aligns semantics inside instances and refines instances using semantic cues. If the claim holds, 3D panoptic segmentation becomes a single forward pass that is both more accurate and fast enough for downstream robotics or editing tasks.

Core claim

EPS3D is an end-to-end architecture that trains on diverse 3D scenes with a distillation objective to extract 3D-aware semantic and instance features from multi-view images, then applies a mutual enhancement module (Ins2Sem and Sem2Ins) to enforce inherent semantic-instance consistency, yielding higher benchmark scores than prior methods while running at roughly one second per scene.

What carries the argument

Mutual enhancement module (Ins2Sem alignment of semantics within instances plus Sem2Ins refinement of instance features by semantic guidance) together with the distillation-based training strategy.
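As a rough illustration of the Ins2Sem direction only, the sketch below averages semantic features over all elements that share an instance label, so every element of an instance ends up with the same semantic feature. This page does not give the paper's actual formulation, so the use of mean pooling, the function name `ins2sem_pool`, and the toy inputs are all assumptions made for illustration.

```python
from collections import defaultdict


def ins2sem_pool(sem_feats, instance_ids):
    """Replace each element's semantic feature with the mean over its instance.

    sem_feats: list of equal-length float lists, one per 3D element.
    instance_ids: list of hashable instance labels, same length as sem_feats.
    NOTE: mean pooling is an assumed stand-in, not the paper's stated method.
    """
    if len(sem_feats) != len(instance_ids):
        raise ValueError("sem_feats and instance_ids must have the same length")

    sums = {}
    counts = defaultdict(int)
    for feat, inst in zip(sem_feats, instance_ids):
        if inst not in sums:
            sums[inst] = [0.0] * len(feat)
        for k, value in enumerate(feat):
            sums[inst][k] += value
        counts[inst] += 1

    # Per-instance mean feature.
    means = {inst: [s / counts[inst] for s in total] for inst, total in sums.items()}
    # Broadcast the instance mean back to every element of that instance.
    return [list(means[inst]) for inst in instance_ids]


# Toy example: two elements in instance 0 disagree, one element in instance 1.
feats = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]
ids = [0, 0, 1]
pooled = ins2sem_pool(feats, ids)
```

In this toy case the two disagreeing elements of instance 0 both receive the averaged feature, while the single-element instance is unchanged. A Sem2Ins counterpart would run in the opposite direction (semantics guiding instance features); it is not sketched here because the page gives no detail on it.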

If this is right

Outperforms prior methods by +13 percent mIoU on semantics for the Replica benchmark.
Runs at approximately one second per scene, supporting latency-sensitive downstream uses.
Produces inherent semantic-instance consistency that improves 3D scene understanding.
Enables direct application to robotic manipulation and 3D scene editing without extra lifting steps.
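For readers unfamiliar with the metric behind the headline number, here is a minimal sketch of mean intersection-over-union (mIoU) over per-point class labels. The function name and the toy labels are hypothetical, and the paper's exact evaluation protocol (class set, treatment of absent classes) is not stated on this page.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target.

    pred, target: equal-length lists of integer class labels.
    Classes absent from both are skipped (an assumed convention).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy labels for six points and three classes (made-up values).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoUs: 1/3, 2/3, 1/2
```

The page does not say whether "+13 percent" is an absolute gain in mIoU points or a relative improvement over the baseline score. The two readings differ, so it is worth checking against the paper's tables.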

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The removal of explicit preprocessing stages could simplify integration into live multi-view capture systems.
Open-vocabulary output may allow the same trained model to label novel object categories encountered after deployment.
The same distillation-plus-mutual-enhancement pattern could be tested on related tasks such as 3D instance tracking or dense reconstruction.
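To make the open-vocabulary idea concrete, the sketch below labels each feature vector with the text query whose embedding is most similar under cosine similarity. This is a common pattern for language-aligned features, not a description of EPS3D's query mechanism, which this page does not specify; the embeddings, label names, and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def label_by_text(point_feats, text_embeds):
    """Assign each feature the label of its most similar text embedding.

    text_embeds: dict mapping a label string to an embedding vector.
    New labels can be added to this dict without retraining anything,
    which is what makes the scheme open-vocabulary.
    """
    labels = []
    for feat in point_feats:
        best = max(text_embeds, key=lambda name: cosine(feat, text_embeds[name]))
        labels.append(best)
    return labels


# Made-up 2D embeddings standing in for real text and scene features.
text_embeds = {"chair": [1.0, 0.0], "sink": [0.0, 1.0]}
point_feats = [[0.9, 0.1], [0.2, 0.7]]
labels = label_by_text(point_feats, text_embeds)
```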

Load-bearing premise

Distillation training on diverse 3D scenes plus the mutual enhancement steps are sufficient to produce 3D-aware features and semantic-instance consistency without any preprocessing pipeline.

What would settle it

An ablation on Replica or ScanNet that removes the mutual enhancement module and still reports the same +13 percent mIoU gain, or a preprocessing pipeline that matches EPS3D accuracy and consistency metrics on the same test scenes.

Figures

Figures reproduced from arXiv: 2606.08980 by Chi-Wing Fu, Jiaxin Guo, Ka-Hei Hui, Kai Chen, Pheng-Ann Heng, Runsong Zhu, Wei Chen, Weiqiang Ren, Wei Yin, Xiaoyang Guo, Yunhui Liu, Zhengzhe Liu.

**Figure 1.** (a) While 2D foundation models struggle with view inconsistency, our method, EPS3D, can provide effective 3D open-vocabulary panoptic segmentation and can render accurate and view-consistent 2D segmentation across views. (We visualize only “chair” and “paint” instance masks for simplicity.) (b) From multi-view images, we rapidly provide 3D panoptic segmentation via 3D panoptic Gaussian reconstruction. (c) …

**Figure 2.** Comparisons between recent SOTA methods (Fan et al., 2024; Sun et al., 2025) and our method, EPS3D. … into a 3D scene (e.g., 3D radiance field). Yet, 2D results inferred from individual views typically suffer from view imperfection and inconsistent semantic predictions. Also, 2D instance segmentation often fails to maintain consistent object identities across views; see …

**Figure 3.** Overview of EPS3D. Given multi-view images as input, EPS3D provides 3D panoptic segmentations by predicting unified panoptic 3D Gaussians in a feed-forward pass, supporting novel view RGB, semantic and instance feature map rendering. With our end-to-end framework, we further introduce a semantic-instance mutual enhancement learning module (i.e., Semantic2Instance (Sem2Ins) learning and Instance2Semantic (Ins…

**Figure 4.** (a) Visual comparisons of novel views between our EPS3D and baselines on ScanNet (Dai et al., 2017) and Replica (Straub et al., 2019). For semantic understanding, we directly visualize the semantic segmentation maps according to the text queries across views. For instance-level understanding, we use the first novel view to select the 3D segmentation ID and visualize the corresponding segmentation across di…

**Figure 5.** (a) EPS3D provides effective 3D panoptic segmentation with high efficiency, offering foundational 3D perception for robotic manipulation tasks. (b) Our EPS3D can recover both scene-level and instance-level Gaussians, which facilitates 3D scene editing (e.g., “Turn the sink in scene 1 from white to gray and place it in scene 2”).

**Figure 6.** 3D comparisons between our method and the latest SOTA method (Sun et al., 2025). We mark ‘N/A’ to indicate that the method does not support such predictions.

**Figure 7.** More visual comparisons with broader baselines (Feature-3DGS (Zhou et al., 2024), LSM (Fan et al., 2024)).

Original abstract

This paper introduces EPS3D, a new end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentation. Unlike existing methods relying on additional preprocessing, we design an end-to-end architecture, with a distillation-based training strategy on diverse 3D scenes to predict 3D-aware semantic and instance features from multi-view images, improving 3D consistency and avoiding error accumulation. We further propose a mutual enhancement module to enforce inherent semantic-instance consistency. By aligning semantics within instances (Ins2Sem) and refining instance features with semantic guidance (Sem2Ins), we achieve more coherent 3D scene understanding. Ultimately, EPS3D outperforms SOTA baselines on two benchmarks (e.g., +13% mIoU for semantics on Replica) with high efficiency (e.g., 1s per scene), supporting tasks like robotic manipulation and 3D scene editing.
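The abstract names a distillation-based training strategy without stating the loss. One common form of feature distillation, shown here purely as an assumed example, penalizes the cosine distance between the student's predicted features and the features of a frozen teacher. The function name and the toy vectors are hypothetical, and nothing on this page confirms that EPS3D uses this particular objective.

```python
import math


def cosine_distill_loss(student, teacher):
    """Mean (1 - cosine similarity) over paired feature vectors.

    student, teacher: equal-length, non-empty lists of equal-length float lists.
    Returns 0.0 when every student feature points the same way as its teacher.
    """
    if len(student) != len(teacher) or not student:
        raise ValueError("student and teacher must be non-empty and equal in length")

    total = 0.0
    for s, t in zip(student, teacher):
        dot = sum(x * y for x, y in zip(s, t))
        norm = math.sqrt(sum(x * x for x in s)) * math.sqrt(sum(y * y for y in t))
        similarity = dot / norm if norm > 0 else 0.0
        total += 1.0 - similarity
    return total / len(student)


# Made-up features: identical directions give zero loss, orthogonal ones give 1.
teacher = [[1.0, 0.0], [0.0, 2.0]]
aligned = cosine_distill_loss(teacher, teacher)
orthogonal = cosine_distill_loss([[0.0, 1.0], [3.0, 0.0]], teacher)
```

Because the loss depends only on direction, a student feature that is a scaled copy of the teacher's incurs no penalty. That matches how language-aligned features are usually compared, by cosine similarity.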

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

EPS3D claims an end-to-end 3D panoptic segmentation pipeline via distillation and mutual enhancement that beats baselines on Replica, but the reported gains rest on asserted numbers without visible controls.


The main thing to know is that this paper puts forward EPS3D as a feed-forward model that does open-vocabulary 3D panoptic segmentation directly from multi-view images. It trains with distillation on varied 3D scenes to produce 3D-aware features and adds an Ins2Sem/Sem2Ins module to keep semantics and instances consistent, avoiding the usual preprocessing pipeline.

The end-to-end design and the mutual enhancement step are the concrete pieces that look new. The abstract reports a +13% mIoU gain on Replica semantics and a runtime of roughly 1 second per scene, which would matter for robotics or editing tasks if those numbers hold.

The soft spot is the lack of any ablation data, error bars, or run-to-run variance attached to the headline results. Without those, it is hard to tell how much the gains come from the new module versus other choices like backbone or data mix. The open-vocabulary part also needs clear evidence that the distillation actually handles categories outside the training set rather than just interpolating within it.

The paper sits squarely inside computer-vision research on 3D scene understanding. Someone already running multi-view pipelines or looking for faster inference would get the most out of the efficiency claims and the consistency module.

It is worth sending to referees so the experiments can be checked in detail.

Referee Report

2 major / 1 minor

Summary. This paper introduces EPS3D, an end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentation. It proposes a distillation-based training strategy on diverse 3D scenes to predict 3D-aware semantic and instance features from multi-view images, and a mutual enhancement module (Ins2Sem and Sem2Ins) to enforce semantic-instance consistency. The method is claimed to outperform state-of-the-art baselines on two benchmarks, such as achieving +13% mIoU for semantics on Replica, while operating at high efficiency (1s per scene).

Significance. If the reported performance gains and efficiency are confirmed through rigorous experiments, this work could significantly impact the field by providing a preprocessing-free approach to 3D panoptic segmentation, facilitating applications in robotics and 3D scene editing. The mutual enhancement module addresses an important consistency issue in panoptic segmentation.

major comments (2)

[Abstract] The central claims of outperformance (+13% mIoU) and efficiency (1s per scene) are made without reference to any experimental results, tables, ablation studies, or error analysis in the provided manuscript text. This absence makes it impossible to verify the contribution of the proposed distillation strategy or the mutual enhancement module.

[Abstract] No details are provided on the specific benchmarks, the SOTA baselines compared against, the evaluation protocol, or how the open-vocabulary aspect is handled, which are load-bearing for the significance of the results.

minor comments (1)

[Abstract] The term 'diverse 3D scenes' is used without specifying the source or characteristics of the training data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments focus on improving the abstract's clarity and verifiability, which we address point-by-point below. We agree that strengthening the abstract will better highlight the experimental support for our claims.


Referee: [Abstract] The central claims of outperformance (+13% mIoU) and efficiency (1s per scene) are made without reference to any experimental results, tables, ablation studies, or error analysis in the provided manuscript text. This absence makes it impossible to verify the contribution of the proposed distillation strategy or the mutual enhancement module.

Authors: We agree that the abstract, as a high-level summary, would benefit from explicit pointers to the supporting experiments. The full manuscript includes these details in the Experiments section, with quantitative results in tables, ablations on the distillation and mutual enhancement components, and runtime analysis. We will revise the abstract to add references such as 'as demonstrated in Tables 2 and 3' and 'detailed in Section 4' to make the claims directly traceable.

revision: yes

Referee: [Abstract] No details are provided on the specific benchmarks, the SOTA baselines compared against, the evaluation protocol, or how the open-vocabulary aspect is handled, which are load-bearing for the significance of the results.

Authors: The abstract provides a concise overview and already references one benchmark (Replica) along with the open-vocabulary setting. However, we acknowledge that additional specificity would improve accessibility. The manuscript body details the two benchmarks, compared SOTA methods, evaluation metrics/protocol, and open-vocabulary handling via the distillation strategy. We will revise the abstract to briefly incorporate these elements (e.g., naming the second benchmark and noting the open-vocabulary mechanism) while respecting length limits, or ensure the introduction expands on them for context.

revision: partial

Circularity Check

0 steps flagged

No significant circularity detected


The paper presents an empirical end-to-end neural architecture for 3D panoptic segmentation trained via distillation on diverse scenes plus a mutual enhancement module (Ins2Sem/Sem2Ins). No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. The central claims rest on reported benchmark improvements (+13% mIoU, 1s/scene) treated as experimental outcomes rather than derivations that reduce to their own inputs by construction. This is the expected non-finding for an applied CV architecture paper whose load-bearing elements are architectural choices and training procedures, not mathematical self-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, background axioms, or new postulated entities; a full audit is impossible.



Reference graph

Works this paper leans on

32 extracted references · 3 canonical work pages · 1 internal anchor

[1] Charatan, D., Li, S. L., Tagliasacchi, A., and Sitzmann, V. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 19457–19467.

[2] Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.-J., and Cai, J. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In European Conference on Computer Vision, pp. 370–386. Springer.

[3] Cheng, C., Chen, X., Xie, T., Yin, W., Ren, W., Zhang, Q., Guo, X., and Wang, H. Longstream: Long-sequence streaming autoregressive visual geometry. arXiv preprint arXiv:2602.13172.

[4] Deng, J., Li, H., Xie, T., Ren, W., Zhang, Q., Tan, P., and Guo, X. Sail-recon: Large sfm by augmenting scene regression with localization. arXiv preprint arXiv:2508.17972.

[5] Engelmann, F., Manhardt, F., Niemeyer, M., Tateno, K., Pollefeys, M., and Tombari, F. Opennerf: open set 3d neural scene segmentation with pixel-wise features and rendered novel views. arXiv preprint arXiv:2404.03650.

[6] Guo, J., Ma, X., Fan, Y., Liu, H., and Li, Q. Semantic gaussians: Open-vocabulary scene understanding with 3d gaussian splatting. arXiv preprint arXiv:2403.15624.

[7] Guo, J., Guan, T., Dong, W., Zheng, W., Wang, W., Wang, Y., Yam, Y., and Liu, Y.-H. Salon3r: Structure-aware long-term generalizable 3d reconstruction from unposed images. arXiv preprint arXiv:2510.15072.

[8] Hu, J., Hui, K.-H., Liu, Z., Li, R., and Fu, C.-W. Neural wavelet-domain diffusion for 3d shape generation, inversion, and manipulation. ACM Transactions on Graphics (TOG), 42(6), 2024a. Hu, J., Hui, K.-H., Liu, Z., Zhang, H., and Fu, C.-W. Cns-edit: 3d shape editing via coupled neural shape optimization. In Proceedings of SIGGRAPH, pp. 1–12, 2024b. Hu...

[9] Huang, B., Yu, Z., Chen, A., Geiger, A., and Gao, S. 2D Gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 Conference Papers, pp. 1–11, 2024a. Huang, W., Wang, C., Li, Y., Zhang, R., and Fei-Fei, L. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652,...

[10] Jun-Seong, K., Kim, G., Yu-Ji, K., Wang, Y.-C. F., Choe, J., and Oh, T.-H. Dr. splat: Directly referring 3d gaussian splatting via direct language embedding registration. arXiv preprint arXiv:2502.16652.

[11] Kulhanek, J., Peng, S., Kukelova, Z., Pollefeys, M., and Sattler, T. Wildgaussians: 3d gaussian splatting in the wild. arXiv preprint arXiv:2407.08447.

[12] Li, B., Weinberger, K. Q., Belongie, S., Koltun, V., and Ranftl, R. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546.

[13] Li, F., Zhang, H., Sun, P., Zou, X., Liu, S., Yang, J., Li, C., Zhang, L., and Gao, J. Semantic-SAM: Segment and recognize anything at any granularity. arXiv preprint arXiv:2307.04767.

[14] Li, H., Wu, Y., Meng, J., Gao, Q., Zhang, Z., Wang, R., and Zhang, J. Instancegaussian: Appearance-semantic joint gaussian representation for 3d instance-level perception. arXiv preprint arXiv:2411.19235.

[15] Liang, Z., Zhang, Q., Hu, W., Feng, Y., Zhu, L., and Jia, K. Analytic-Splatting: Anti-aliased 3D Gaussian splatting via analytic integration. arXiv preprint arXiv:2403.11056.

[16] Shen, Y., Zhang, Z., Qu, Y., Zheng, X., Ji, J., Zhang, S., and Cao, L. Fastvggt: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560.

[17] Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J. J., Mur-Artal, R., Ren, C., Verma, S., et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797.

[18] Sun, X., Jiang, H., Liu, L., Nam, S., Kang, G., Wang, X., Sui, W., Su, Z., Liu, W., Wang, X., et al. Uni3r: Unified 3d reconstruction and semantic understanding via generalizable gaussian splatting from unposed multi-view images. arXiv preprint arXiv:2508.03643.

[19] Tang, W., Pan, J.-H., Liu, Y.-H., Tomizuka, M., Li, L. E., Fu, C.-W., and Ding, M. Geomanip: Geometric constraints as general interfaces for robot manipulation. arXiv preprint arXiv:2501.09783.

[20] Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., and Novotny, D. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 5294–5306, 2025a. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language model’...

[21] Xu, H., Liu, Y., Wang, Y., Fu, C.-W., and Mitra, N. J. CHOIR: Contact-aware 4D hand-object interaction reconstruction, 2026b. Xu, T.-X., Hu, W., Lai, Y.-K., Shan, Y., and Zhang, S.-H. Texture-GS: Disentangling the geometry and texture for 3D Gaussian splatting editing. arXiv preprint arXiv:2403.10050, 2024b. Yan, W...

[22] Ye, B., Liu, S., Xu, H., Li, X., Pollefeys, M., Yang, M.-H., and Peng, S. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207.

[23] Ye, M., Danelljan, M., Yu, F., and Ke, L. Gaussian grouping: Segment and edit anything in 3D scenes. arXiv preprint arXiv:2312.00732.

[24] Zhang, D., Wang, C., Wang, W., Li, P., Qin, M., and Wang, H. Gaussian in the wild: 3d gaussian splatting for unconstrained image collections. In European Conference on Computer Vision, pp. 341–359. Springer, 2024a. Zhang, Z., Hu, W., Lao, Y., He, T., and Zhao, H. Pixel-GS: Density control with pixel-aware gradient for 3D Gaussian splatting. arXiv prepri...

[25] Zhu, R., Qiu, S., Liu, Z., Hui, K.-H., Wu, Q., Heng, P.-A., and Fu, C.-W. Rethinking end-to-end 2d to 3d scene segmentation in gaussian splatting. arXiv preprint arXiv:2503.14029.
[26] In this appendix, we further provide implementation details and more results. A. Implementation Details. Detailed architecture. For the geometry transformer, inspired by (Wang et al., 2025a; Jiang et al., 2025), we first patchify images $C_i$ into $l_C = HW/p^2$ tokens of dimension $d$, where $p = 14$ and $d$ = 1...

[27] ... of the original CLIP features, the DPT layer first predicts a compressed feature vector (i.e., $\mathbb{R}^{32}$) to ensure memory-efficient rendering. We subsequently restore the high-dimensional features via projection layers, following common practices (Jiang et al., 2025; Sun et al., 2025). For the instance branch, we directly regress the instance features at their...

[28] ... datasets, following common practice (Fan et al., 2024; Sun et al., 2025). The transformer layers, camera head, and depth head are initialized with weights from the pretrained Anysplat (Jiang et al., 2025), while the semantic and instance heads are randomly initialized. During training, input images are resized to a maximum long-edge resolution of 518 pixel...

[29] ... and Unified-Lift (Zhu et al., 2025)), we employ VGGT (Wang et al., 2025a) to pre-process the scenes, producing point clouds and camera poses as initialization. This allows us to avoid potential failures associated with relying on COLMAP. For the test-time baselines, we train each model for 5000 iterations. ... “put the bread on the plate” ...

[30] ... and manipulation parameters (Tang et al., 2025; Huang et al., 2024b). B. More Results. We provide more 3D visual comparisons with the latest SOTA methods in Fig...

[31] We also provide visual comparisons with broader baselines (LSM (Fan et al., 2024), Feature-3DGS (Zhou et al., 2024)) in Fig...

[32] The results consistently demonstrate that our method provides more accurate and consistent segmentation with fewer artifacts. C. Limitations. In this work, we focus on static indoor scenes and do not address dynamic environments, where objects or agents may move over time. Effectively extending the framework to handle dynamic scenarios remains an open ques...
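The implementation details excerpted in entries [26] and [28] mention a patch size of p = 14, a token count of HW/p^2, and a maximum long-edge resolution of 518 pixels. The sketch below works through that arithmetic. Snapping each side to a multiple of the patch size and never upscaling are assumptions added here so the division is exact; the excerpts are truncated before saying how the authors handle this, and both function names are hypothetical.

```python
def resize_long_edge(height, width, max_long_edge=518, patch=14):
    """Shrink so the long edge is at most max_long_edge, keeping aspect ratio.

    ASSUMPTIONS (not from the source): images are never upscaled, and each
    side is rounded to the nearest multiple of the patch size.
    """
    scale = min(1.0, max_long_edge / max(height, width))
    new_h = max(patch, round(height * scale / patch) * patch)
    new_w = max(patch, round(width * scale / patch) * patch)
    return new_h, new_w


def num_patch_tokens(height, width, patch=14):
    """Token count l_C = H * W / p^2 for non-overlapping p x p patches."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return (height // patch) * (width // patch)


# A 480 x 640 input: the long edge becomes 518 (= 37 patches of 14 pixels).
h, w = resize_long_edge(480, 640)
tokens = num_patch_tokens(h, w)

# A square 518 x 518 input gives 37 * 37 tokens per image.
square_tokens = num_patch_tokens(518, 518)
```

Since 518 = 37 × 14, the stated resolution cap divides evenly into patches, which is presumably why that value is used. The per-image token count then scales with the aspect ratio of the input.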

[1] [1]

Charatan, D., Li, S

URL https://arxiv.org/abs/2312.00860v1. Charatan, D., Li, S. L., Tagliasacchi, A., and Sitzmann, V . pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 19457–19467,

arXiv

[2] [2]

Chen, Y ., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.-J., and Cai, J

doi: 10.1126/scirobotics.aea2092. Chen, Y ., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.-J., and Cai, J. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In European Conference on Computer Vision, pp. 370–386. Springer,

work page doi:10.1126/scirobotics.aea2092

[3] [3]

Longstream: Long-sequence streaming autoregressive visual geometry.arXiv preprint arXiv:2602.13172,

Cheng, C., Chen, X., Xie, T., Yin, W., Ren, W., Zhang, Q., Guo, X., and Wang, H. Longstream: Long-sequence streaming autoregressive visual geometry.arXiv preprint arXiv:2602.13172,

arXiv

[4] [4]

Sail-recon: Large sfm by augment- ing scene regression with localization.arXiv preprint arXiv:2508.17972,

Deng, J., Li, H., Xie, T., Ren, W., Zhang, Q., Tan, P., and Guo, X. Sail-recon: Large sfm by augment- ing scene regression with localization.arXiv preprint arXiv:2508.17972,

arXiv

[5] [5]

Opennerf: open set 3d neural scene segmentation with pixel-wise features and rendered novel views.arXiv preprint arXiv:2404.03650,

Engelmann, F., Manhardt, F., Niemeyer, M., Tateno, K., Pollefeys, M., and Tombari, F. Opennerf: open set 3d neural scene segmentation with pixel-wise features and rendered novel views.arXiv preprint arXiv:2404.03650,

arXiv

[6] [6]

Semantic gaussians: Open-vocabulary scene understanding with 3d gaussian splatting.arXiv preprint arXiv:2403.15624,

Guo, J., Ma, X., Fan, Y ., Liu, H., and Li, Q. Semantic gaussians: Open-vocabulary scene understanding with 3d gaussian splatting.arXiv preprint arXiv:2403.15624,

arXiv

[7] [7]

Salon3r: Structure-aware long-term generalizable 3d reconstruction from unposed images.arXiv preprint arXiv:2510.15072,

Guo, J., Guan, T., Dong, W., Zheng, W., Wang, W., Wang, Y ., Yam, Y ., and Liu, Y .-H. Salon3r: Structure-aware long-term generalizable 3d reconstruction from unposed images.arXiv preprint arXiv:2510.15072,

arXiv

[8] [8]

Neural wavelet-domain diffusion for 3d shape generation, inver- sion, and manipulation.ACM Transactions on Graphics (TOG), 42(6), 2024a

Hu, J., Hui, K.-H., Liu, Z., Li, R., and Fu, C.-W. Neural wavelet-domain diffusion for 3d shape generation, inver- sion, and manipulation.ACM Transactions on Graphics (TOG), 42(6), 2024a. Hu, J., Hui, K.-H., Liu, Z., Zhang, H., and Fu, C.-W. Cns- edit: 3d shape editing via coupled neural shape optimiza- tion. InProceedings of SIGGRAPH, pp. 1–12, 2024b. Hu...

arXiv

[9] [9]

2D Gaussian splatting for geometrically accurate radiance fields

Huang, B., Yu, Z., Chen, A., Geiger, A., and Gao, S. 2D Gaussian splatting for geometrically accurate radiance fields. InACM SIGGRAPH 2024 Conference Papers, pp. 1–11, 2024a. Huang, W., Wang, C., Li, Y ., Zhang, R., and Fei-Fei, L. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation.arXiv preprint arXiv:2409.01652,...

Pith/arXiv arXiv 2024

[10] [10]

F., Choe, J., and Oh, T.-H

Jun-Seong, K., Kim, G., Yu-Ji, K., Wang, Y .-C. F., Choe, J., and Oh, T.-H. Dr. splat: Directly referring 3d gaus- sian splatting via direct language embedding registration. arXiv preprint arXiv:2502.16652,

arXiv

[11] [11]

Wildgaussians: 3d gaussian splatting in the wild.arXiv preprint arXiv:2407.08447,

Kulhanek, J., Peng, S., Kukelova, Z., Pollefeys, M., and Sattler, T. Wildgaussians: 3d gaussian splatting in the wild.arXiv preprint arXiv:2407.08447,

arXiv

[12] [12]

Q., Belongie, S., Koltun, V ., and Ranftl, R

Li, B., Weinberger, K. Q., Belongie, S., Koltun, V ., and Ranftl, R. Language-driven semantic segmentation.arXiv preprint arXiv:2201.03546,

Pith/arXiv arXiv

[13] [13]

Semantic-SAM: Segment and recognize anything at any granularity.arXiv preprint arXiv:2307.04767,

Li, F., Zhang, H., Sun, P., Zou, X., Liu, S., Yang, J., Li, C., Zhang, L., and Gao, J. Semantic-SAM: Segment and recognize anything at any granularity.arXiv preprint arXiv:2307.04767,

arXiv

[14] [14]

Instancegaussian: Appearance-semantic joint gaussian representation for 3d instance-level perception

Li, H., Wu, Y ., Meng, J., Gao, Q., Zhang, Z., Wang, R., and Zhang, J. Instancegaussian: Appearance-semantic joint gaussian representation for 3d instance-level perception. arXiv preprint arXiv:2411.19235,

arXiv

[15] [15]

Analytic-Splatting: Anti-aliased 3D Gaussian splatting via analytic integration.arXiv preprint arXiv:2403.11056,

Liang, Z., Zhang, Q., Hu, W., Feng, Y ., Zhu, L., and Jia, K. Analytic-Splatting: Anti-aliased 3D Gaussian splatting via analytic integration.arXiv preprint arXiv:2403.11056,

arXiv

[16] [16]

Fastvggt: Training-free acceleration of visual geometry transformer.arXiv preprint arXiv:2509.02560,

Shen, Y ., Zhang, Z., Qu, Y ., Zheng, X., Ji, J., Zhang, S., and Cao, L. Fastvggt: Training-free acceleration of visual geometry transformer.arXiv preprint arXiv:2509.02560,

Pith/arXiv arXiv

[17] [17]

J., Mur-Artal, R., Ren, C., Verma, S., et al

Straub, J., Whelan, T., Ma, L., Chen, Y ., Wijmans, E., Green, S., Engel, J. J., Mur-Artal, R., Ren, C., Verma, S., et al. The replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797,

Pith/arXiv arXiv 1906

[18] [18]

Uni3r: Unified 3d reconstruction and semantic understanding via gen- eralizable gaussian splatting from unposed multi-view images.arXiv preprint arXiv:2508.03643,

Sun, X., Jiang, H., Liu, L., Nam, S., Kang, G., Wang, X., Sui, W., Su, Z., Liu, W., Wang, X., et al. Uni3r: Unified 3d reconstruction and semantic understanding via gen- eralizable gaussian splatting from unposed multi-view images.arXiv preprint arXiv:2508.03643,

arXiv

[19] [19]

E., Fu, C.-W., and Ding, M

Tang, W., Pan, J.-H., Liu, Y .-H., Tomizuka, M., Li, L. E., Fu, C.-W., and Ding, M. Geomanip: Geometric constraints as general interfaces for robot manipulation.arXiv preprint arXiv:2501.09783,

arXiv

[20] [20]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., and Novotny, D. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pp. 5294–5306, 2025a. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-vl: Enhancing vision-language model’...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1609/aaai.v38i6

[21]

doi: 10.1109/TPAMI.2025.3596986.

Xu, H., Liu, Y., Wang, Y., Fu, C.-W., and Mitra, N. J. CHOIR: Contact-aware 4D hand-object interaction reconstruction, 2026b.

Xu, T.-X., Hu, W., Lai, Y.-K., Shan, Y., and Zhang, S.-H. Texture-GS: Disentangling the geometry and texture for 3D Gaussian splatting editing. arXiv preprint arXiv:2403.10050, 2024b.

Yan, W...

[22]

Ye, B., Liu, S., Xu, H., Li, X., Pollefeys, M., Yang, M.-H., and Peng, S. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. arXiv preprint arXiv:2410.24207,

[23]

Ye, M., Danelljan, M., Yu, F., and Ke, L. Gaussian grouping: Segment and edit anything in 3D scenes. arXiv preprint arXiv:2312.00732,

[24]

Zhang, D., Wang, C., Wang, W., Li, P., Qin, M., and Wang, H. Gaussian in the wild: 3d gaussian splatting for unconstrained image collections. In European Conference on Computer Vision, pp. 341–359. Springer, 2024a.

Zhang, Z., Hu, W., Lao, Y., He, T., and Zhao, H. Pixel-GS: Density control with pixel-aware gradient for 3D Gaussian splatting. arXiv prepri...

[25]

Zhu, R., Qiu, S., Liu, Z., Hui, K.-H., Wu, Q., Heng, P.-A., and Fu, C.-W. Rethinking end-to-end 2d to 3d scene segmentation in gaussian splatting. arXiv preprint arXiv:2503.14029,

[26]

In this appendix, we further provide implementation details and more results.

A. Implementation Details

Detailed architecture. For the geometry transformer, inspired by (Wang et al., 2025a; Jiang et al., 2025), we first patchify images C_i into l_C = HW/p^2 tokens of dimension d, where p = 14 and d = 1...
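As a quick check of the token count l_C = HW/p^2 described above, the following minimal sketch computes the number of patch tokens for one image. The function name and the requirement that H and W be multiples of p are assumptions made for illustration; the 518 x 518 input is likewise only an example size, not a value taken from this passage.

```python
def num_patch_tokens(height: int, width: int, patch: int = 14) -> int:
    """Return l_C = H * W / p^2 for an H x W image cut into p x p patches."""
    # Assumption: the image is already sized to a whole number of patches.
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return (height // patch) * (width // patch)


tokens = num_patch_tokens(518, 518)  # 37 x 37 patches of size 14
print(tokens)  # 1369
```

Each token then carries a d-dimensional embedding, so one image contributes an l_C x d matrix to the transformer input.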

[27]

of the original CLIP features, the DPT layer first predicts a compressed feature vector (i.e., in R^32) to ensure memory-efficient rendering. We subsequently restore the high-dimensional features via projection layers, following common practices (Jiang et al., 2025; Sun et al., 2025). For the instance branch, we directly regress the instance features at their...
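The compress-then-restore idea for the semantic branch can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the projection is modeled as a single linear layer, the target dimension of 512 is a placeholder for the CLIP feature size, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

COMPRESSED_DIM = 32  # from the text: compressed feature vector in R^32
TARGET_DIM = 512     # assumption: placeholder for the original CLIP feature size

# Hypothetical projection parameters (random here, learned in practice).
W = rng.standard_normal((COMPRESSED_DIM, TARGET_DIM)) * 0.02
b = np.zeros(TARGET_DIM)


def restore_features(compressed: np.ndarray) -> np.ndarray:
    """Map (..., 32) rendered features to (..., TARGET_DIM) with one linear layer."""
    return compressed @ W + b


rendered = rng.standard_normal((4, 6, COMPRESSED_DIM))  # toy H x W x 32 feature map
restored = restore_features(rendered)
print(restored.shape)  # (4, 6, 512)
```

Rendering only 32 channels per pixel and projecting afterwards keeps the rendered buffer 16 times smaller than rendering 512 channels directly, which is the memory saving the passage refers to.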

[28]

datasets, following common practice (Fan et al., 2024; Sun et al., 2025). The transformer layers, camera head, and depth head are initialized with weights from the pretrained Anysplat (Jiang et al., 2025), while the semantic and instance heads are randomly initialized. During training, input images are resized to a maximum long-edge resolution of 518 pixel...
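The long-edge resizing rule can be sketched as below. The helper name is hypothetical, and two behaviors are assumptions because the passage is truncated: the aspect ratio is preserved, and images already within the limit are left unchanged rather than upscaled.

```python
def resize_to_long_edge(height: int, width: int, max_long_edge: int = 518) -> tuple[int, int]:
    """Return a (height, width) whose longer side is at most max_long_edge."""
    # Assumption: never upscale, so the scale factor is capped at 1.
    scale = min(1.0, max_long_edge / max(height, width))
    return max(1, round(height * scale)), max(1, round(width * scale))


print(resize_to_long_edge(1080, 1920))  # (291, 518)
print(resize_to_long_edge(400, 300))    # (400, 300)
```

A real pipeline would likely also snap each side to a multiple of the patch size (p = 14) before patchifying; that step is omitted here because the source does not describe it.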

[29]

put the bread on the plate

and Unified-Lift (Zhu et al., 2025)), we employ VGGT (Wang et al., 2025a) to pre-process the scenes, producing point clouds and camera poses as initialization. This allows us to avoid potential failures associated with relying on COLMAP. For the test-time baselines, we train each model for 5000 iterations. ...

[30]

and manipulation parameters (Tang et al., 2025; Huang et al., 2024b).

B. More Results

We provide more 3D visual comparisons with the latest SOTA methods in Fig

[31]

We also provide visual comparisons with broader baselines (LSM (Fan et al., 2024), Feature-3DGS (Zhou et al., 2024)) in Fig

[32]

The results consistently demonstrate that our method provides more accurate and consistent segmentation with fewer artifacts.

C. Limitations

In this work, we focus on static indoor scenes and do not address dynamic environments, where objects or agents may move over time. Effectively extending the framework to handle dynamic scenarios remains an open ques...