pith. machine review for the scientific record. sign in

arxiv: 2604.10982 · v1 · submitted 2026-04-13 · 💻 cs.RO

Recognition: unknown

{Psi}-Map: Panoptic Surface Integrated Mapping Enables Real2Sim Transfer

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:51 UTC · model grok-4.3

classification 💻 cs.RO
keywords panoptic reconstruction3D Gaussian SplattingLiDARreal-time mappingroboticsopen-vocabularysurface mappingReal2Sim
0
0 comments X

The pith

Psi-Map integrates plane-constrained GMMs from LiDAR and local cross-attention to enable high-quality panoptic surface mapping at over 40 FPS in large-scale scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to solve the challenge of achieving simultaneous geometric accuracy, coherent panoptic segmentation, and real-time performance in large-scale scene reconstruction for robotics. Current 3D Gaussian Splatting techniques often trade off one for the others, restricting their practical use in dynamic control loops or simulation transfers. The proposed system builds multimodal Gaussian models constrained by LiDAR planes and represents the map with 2D Gaussian surfels to maintain physical realism and precise alignment. It then uses a query-guided architecture with local cross-attention to directly lift 2D mask features into consistent 3D panoptic labels, bypassing traditional multi-stage pipelines that accumulate errors. Rendering is accelerated with tile intersection and selection strategies to sustain over 40 frames per second.

Core claim

The paper claims that by constructing plane-constrained multimodal Gaussian Mixture Models using LiDAR and 2D Gaussian surfels for geometric supervision, combined with an end-to-end panoptic learning module that employs local cross-attention within the view frustum to lift 2D features to 3D space, and optimized rendering via Precise Tile Intersection and Top-K Hard Selection, the Psi-Map system delivers superior geometric and panoptic reconstruction quality in large-scale scenes at inference rates exceeding 40 FPS.

What carries the argument

The integration of LiDAR-constrained multimodal GMMs with 2D Gaussian surfels and local cross-attention for direct 3D panoptic feature lifting.

Load-bearing premise

The premise that LiDAR-based plane-constrained GMMs and local cross-attention lifting of 2D mask features will produce globally consistent panoptic understanding without error accumulation or loss of physical realism across varied large-scale environments.

What would settle it

Observing accumulated panoptic label inconsistencies or geometric deviations in reconstructions of large environments with complex surfaces or sensor noise would indicate the approach does not maintain the claimed consistency and accuracy.

Figures

Figures reproduced from arXiv: 2604.10982 by Changjian Jiang, Rong Xiong, Shichao Zhai, Xuan Yu, Yue Wang, Yuxuan Xie, Yu Zhang.

Figure 1
Figure 1. Figure 1: Ψ-Map provides a high-performance bridge from multi-modal sensor inputs to robotic applications. By fusing LiDAR￾RGB-D data with Vision-Language Foundation Models, our framework achieves real-time, 3D-consistent panoptic reconstruction. This representation serves as a powerful engine for high-fidelity scene editing and virtual asset generation, ultimately empowering complex downstream skills such as genera… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the Ψ-Map Framework. The Input consists of multi-view RGB-D frames. During Geometric Reinforcement, the point cloud is modeled as a plane-constrained SOGMM to provide continuous structural supervision for the 2D Gaussian Surfels. In the Panoptic Learning stage, 2D mask features are lifted into 3D via a query-guided end-to-end architecture, where instance tokens are refined through local cross-a… view at source ↗
Figure 3
Figure 3. Figure 3: The instance branch utilizes a cross-attention mechanism between 3D Gaussian-modulated query tokens and SOGMM-reinforced 2DGS scene fields. SOGMM initializes the central positions and rotations of 2D Gaussians by sampling the distribution parameters. Fur￾thermore, it establishes a criterion for geometric consistency in 3D space. To represent the deviation of a surfel 𝐺𝑖 from its geometric prior, we define … view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of the quality of semantic segmentation and panoptic segmentation of different methods on ScanNet V2 and ScanNet++. mechanism. For each pixel, we restrict feature accumulation to a fixed set  containing the 𝐾 Gaussians with the highest opacity weights 𝜔𝑗 = 𝛼𝑗 ∏𝑗−1 𝑘=1(1 − 𝛼𝑘 ): 𝐹𝑝𝑖𝑥𝑒𝑙 = ∑ 𝑗∈ 𝑓𝑗 ⋅ 𝜔𝑗 , where || = 𝐾 (18) By replacing the 𝑂(𝑁) accumulation with a constant 𝑂(𝐾) selection, we fund… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative rendering results on the Replica dataset, demonstrating Ψ-Map’s ability to produce high-fidelity geom￾etry, semantic, and panoptic maps. neighbor search, which is justified by the significant gains in surface accuracy and structural consistency. 6.3. Panoptic Segmentation Performance To evaluate the scene understanding capabilities of Ψ-Map, we conduct a quantitative comparison against both NeR… view at source ↗
Figure 6
Figure 6. Figure 6: Quantitative comparisons on Scan2CAD dataset. 6.6. Case Study To demonstrate the practical utility of Ψ-Map, we evaluate its performance on a 3D Object Completion and Generation task using the Scan2CAD dataset, and on an Object Goal Navigation task in our self-collected and reconstructed in￾door scenes. 3D Object Completion and Generation. Our frame￾work extends beyond segmentation reconstruction to sup￾po… view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of navigation performance before and after fine-tuning with 3D-consistent labels. Qualitative comparisons in [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
read the original abstract

Open-vocabulary panoptic reconstruction is essential for advanced robotics perception and simulation. However, existing methods based on 3D Gaussian Splatting (3DGS) often struggle to simultaneously achieve geometric accuracy, coherent panoptic understanding, and real-time inference frequency in large-scale scenes. In this paper, we propose a comprehensive framework that integrates geometric reinforcement, end-to-end panoptic learning, and efficient rendering. First, to ensure physical realism in large-scale environments, we leverage LiDAR data to construct plane-constrained multimodal Gaussian Mixture Models (GMMs) and employ 2D Gaussian surfels as the map representation, enabling high-precision surface alignment and continuous geometric supervision. Building upon this, to overcome the error accumulation and cumbersome cross-frame association inherent in traditional multi-stage panoptic segmentation pipelines, we design a query-guided end-to-end learning architecture. By utilizing a local cross-attention mechanism within the view frustum, the system lifts 2D mask features directly into 3D space, achieving globally consistent panoptic understanding. Finally, addressing the computational bottlenecks caused by high-dimensional semantic features, we introduce Precise Tile Intersection and a Top-K Hard Selection strategy to optimize the rendering pipeline. Experimental results demonstrate that our system achieves superior geometric and panoptic reconstruction quality in large-scale scenes while maintaining an inference rate exceeding 40 FPS, meeting the real-time requirements of robotic control loops.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Ψ-Map, a framework for open-vocabulary panoptic surface mapping in large-scale scenes. It constructs plane-constrained multimodal GMMs from LiDAR data using 2D Gaussian surfels for geometric reinforcement and continuous supervision; employs a query-guided end-to-end architecture that lifts 2D mask features into 3D via local cross-attention within each view frustum to achieve panoptic understanding; and applies Precise Tile Intersection plus Top-K Hard Selection to optimize rendering. The central claims are superior geometric and panoptic reconstruction quality together with real-time inference exceeding 40 FPS to support Real2Sim transfer.

Significance. If the performance claims are substantiated, the work would advance real-time robotic perception by unifying high-precision surface mapping with coherent open-vocabulary semantics in a single efficient pipeline, addressing limitations of prior 3DGS methods in large environments. The end-to-end lifting approach and LiDAR-constrained surfels could reduce multi-stage error accumulation and improve physical realism for simulation transfer.

major comments (2)
  1. [Abstract] Abstract: The central claims of 'superior geometric and panoptic reconstruction quality' and 'inference rate exceeding 40 FPS' are asserted without any quantitative metrics, baseline comparisons, ablation results, or dataset details. This absence prevents assessment of whether the proposed GMM surfels, local cross-attention lifting, or rendering optimizations actually deliver the stated improvements.
  2. [Query-guided end-to-end learning architecture] Query-guided end-to-end learning architecture: The manuscript states that local cross-attention within the view frustum 'lifts 2D mask features directly into 3D space, achieving globally consistent panoptic understanding.' No cross-view fusion, global consistency loss, or inter-frame label association mechanism is described. In large-scale scenes with overlapping views, this local-only design risks label drift on the same surface element, directly undermining both the panoptic quality and Real2Sim transfer claims.
minor comments (1)
  1. The abstract would be clearer if it briefly indicated the evaluation scenes or datasets used to support the 'large-scale' claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. We believe the clarifications and proposed revisions will strengthen the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claims of 'superior geometric and panoptic reconstruction quality' and 'inference rate exceeding 40 FPS' are asserted without any quantitative metrics, baseline comparisons, ablation results, or dataset details. This absence prevents assessment of whether the proposed GMM surfels, local cross-attention lifting, or rendering optimizations actually deliver the stated improvements.

    Authors: We agree that the abstract presents the claims at a high level without specific numbers. The full manuscript includes quantitative evaluations in the Experiments section, with tables comparing against baselines on metrics such as geometric error, panoptic quality (PQ, SQ, RQ), and runtime on large-scale datasets like KITTI and custom large environments. To improve accessibility, we will revise the abstract to incorporate key quantitative results, such as the achieved FPS and relative improvements, subject to space constraints. revision: yes

  2. Referee: [Query-guided end-to-end learning architecture] Query-guided end-to-end learning architecture: The manuscript states that local cross-attention within the view frustum 'lifts 2D mask features directly into 3D space, achieving globally consistent panoptic understanding.' No cross-view fusion, global consistency loss, or inter-frame label association mechanism is described. In large-scale scenes with overlapping views, this local-only design risks label drift on the same surface element, directly undermining both the panoptic quality and Real2Sim transfer claims.

    Authors: The referee correctly notes that our architecture relies on local cross-attention per view without explicit cross-view fusion or additional consistency losses. However, global consistency is inherently provided by the shared 3D representation: 2D features are lifted and associated to the same plane-constrained Gaussian surfels across views, with updates accumulated in the persistent map. The LiDAR-based geometric supervision further regularizes the 3D structure. We acknowledge that this implicit mechanism was not sufficiently detailed in the manuscript. We will revise the relevant section to explicitly describe how consistency is maintained through the 3D surfel map and provide additional analysis or ablations on label consistency in overlapping regions. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on proposed architecture and experiments, not self-referential definitions or fits.

full rationale

The abstract and described framework introduce LiDAR-constrained GMMs, 2D Gaussian surfels, a query-guided end-to-end architecture with local cross-attention for lifting 2D masks to 3D, and rendering optimizations as independent design choices. No equations, fitted parameters renamed as predictions, or self-citations are shown that would make global panoptic consistency or Real2Sim transfer tautological by construction. The derivation chain remains self-contained against external benchmarks and experimental validation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review limits visibility into parameters or assumptions; relies on standard robotics domain assumptions about LiDAR accuracy and attention mechanisms for feature lifting.

axioms (2)
  • domain assumption LiDAR data can be used to construct plane-constrained multimodal GMMs that provide continuous geometric supervision
    Invoked in the geometric reinforcement step of the framework.
  • domain assumption Local cross-attention within the view frustum lifts 2D mask features to globally consistent 3D panoptic understanding
    Central to the end-to-end panoptic learning architecture.

pith-pipeline@v0.9.0 · 5563 in / 1288 out tokens · 26223 ms · 2026-05-10T15:51:54.410450+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai.arXiv preprint arXiv:2410.00425, 2024

    S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T.-k. Chanet al., “Maniskill3: Gpu parallelized robotics simulation and rendering for generalizable embodied ai,” arXiv preprint arXiv:2410.00425, 2024

  2. [2]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    M.Mittal,P.Roth,J.Tigue,A.Richard,O.Zhang,P.Du,A.Serrano- Munoz, X. Yao, R. Zurbrügg, N. Rudinet al., “Isaac lab: A gpu- accelerated simulation framework for multi-modal robot learning,” arXiv preprint arXiv:2511.04831, 2025

  3. [3]

    Uni-navid: A video-based vision-language- action model for unifying embodied navigation tasks,

    J. Zhang, K. Wang, S. Wang, M. Li, H. Liu, S. Wei, Z. Wang, Z.Zhang,andH.Wang,“Uni-navid:Avideo-basedvision-language- actionmodelforunifyingembodiednavigationtasks,”arXiv preprint arXiv:2412.06224, 2024

  4. [4]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I.Lunawat,I.Sieh,S.Kirmaniet al.,“Evaluatingreal-worldrobotma- nipulation policies in simulation,”arXiv preprint arXiv:2405.05941, 2024

  5. [5]

    Rl-gsbridge: 3d gaussian splatting based real2sim2real method for robotic manipulation learning,

    Y. Wu, L. Pan, W. Wu, G. Wang, Y. Miao, F. Xu, and H. Wang, “Rl-gsbridge: 3d gaussian splatting based real2sim2real method for robotic manipulation learning,” in2025 IEEE International Confer- ence on Robotics and Automation (ICRA). IEEE,2025,pp.192–198

  6. [6]

    Splatsim: Zero-shot sim2real transfer of rgb manipulation policies using gaussian splatting,

    M.N.Qureshi,S.Garg,F.Yandun,D.Held,G.Kantor,andA.Silwal, “Splatsim: Zero-shot sim2real transfer of rgb manipulation policies using gaussian splatting,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 6502–6509

  7. [7]

    Rose: Reconstructing objects, scenes,andtrajectoriesfromcasualvideosforroboticmanipulation,

    P. Li, H. Geng, J. Crate, Y. Han, J. Zhang, F. Wang, C. T. Cheng, R. Dong, Y.-J. Wang, H. Louet al., “Rose: Reconstructing objects, scenes,andtrajectoriesfromcasualvideosforroboticmanipulation,” inNeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning

  8. [8]

    Re3 Sim: Generating High-Fidelity Simulation Data via 3D-Photorealistic Real-to-Sim for Robotic Manipulation

    X.Han,M.Liu,Y.Chen,J.Yu,X.Lyu,Y.Tian,B.Wang,W.Zhang, andJ.Pang,“Re3sim:Generatinghigh-fidelitysimulationdatavia3d- photorealistic real-to-sim for robotic manipulation,”arXiv preprint arXiv:2502.08645, 2025

  9. [9]

    Video2game: Real- time interactive realistic and browser-compatible environment from a single video,

    H. Xia, Z.-H. Lin, W.-C. Ma, and S. Wang, “Video2game: Real- time interactive realistic and browser-compatible environment from a single video,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4578–4588

  10. [10]

    Objectsdf++:Improved object-compositional neural implicit surfaces,

    Q.Wu,K.Wang,K.Li,J.Zheng,andJ.Cai,“Objectsdf++:Improved object-compositional neural implicit surfaces,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21764–21774

  11. [11]

    Vggt: Visual geometry grounded transformer,

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inPro- ceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5294–5306

  12. [12]

    Feature3dgs:Supercharging3dgaussiansplatting to enable distilled feature fields,

    S.Zhou,H.Chang,S.Jiang,Z.Fan,D.Xu,P.Chari,S.You,Z.Wang, andA.Kadambi,“Feature3dgs:Supercharging3dgaussiansplatting to enable distilled feature fields,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21676–21685

  13. [13]

    Mip-splatting: Alias-free 3d gaussian splatting,

    Z. Yu, A. Chen, B. Huang, T. Sattler, and A. Geiger, “Mip-splatting: Alias-free 3d gaussian splatting,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 19447–19456

  14. [14]

    Sugar:Surface-alignedgaussiansplatting forefficient3dmeshreconstructionandhigh-qualitymeshrendering,

    A.GuédonandV.Lepetit,“Sugar:Surface-alignedgaussiansplatting forefficient3dmeshreconstructionandhigh-qualitymeshrendering,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5354–5363

  15. [15]

    2dgaussiansplat- ting for geometrically accurate radiance fields,

    B.Huang,Z.Yu,A.Chen,A.Geiger,andS.Gao,“2dgaussiansplat- ting for geometrically accurate radiance fields,” inACM SIGGRAPH 2024 conference papers, 2024, pp. 1–11

  16. [16]

    Pgsr: Planar-based gaussian splatting for efficientandhigh-fidelitysurfacereconstruction,

    D. Chen, H. Li, W. Ye, Y. Wang, W. Xie, S. Zhai, N. Wang, H. Liu, H. Bao, and G. Zhang, “Pgsr: Planar-based gaussian splatting for efficientandhigh-fidelitysurfacereconstruction,”IEEE Transactions on Visualization and Computer Graphics, 2024

  17. [17]

    Probabilisticpointcloudmodel- ingviaself-organizinggaussianmixturemodels,

    K.Goel,N.Michael,andW.Tabib,“Probabilisticpointcloudmodel- ingviaself-organizinggaussianmixturemodels,”IEEE Robotics and Automation Letters, vol. 8, no. 5, pp. 2526–2533, 2023

  18. [18]

    Incremental multimodal surface mapping via self-organizing gaussian mixture models,

    K. Goel and W. Tabib, “Incremental multimodal surface mapping via self-organizing gaussian mixture models,”IEEE Robotics and Automation Letters, vol. 8, no. 12, pp. 8358–8365, 2023

  19. [19]

    Learningtransferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G.Sastry,A.Askell,P.Mishkin,J.Clarket al.,“Learningtransferable visual models from natural language supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 8748–8763

  20. [20]

    Lerf: Language embedded radiance fields,

    J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik, “Lerf: Language embedded radiance fields,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19729–19739

  21. [21]

    Langsplat: 3d language gaussian splatting,

    M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “Langsplat: 3d language gaussian splatting,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20051–20060

  22. [22]

    Gaussiangrouping:Segment andeditanythingin3dscenes,

    M.Ye,M.Danelljan,F.Yu,andL.Ke,“Gaussiangrouping:Segment andeditanythingin3dscenes,”inEuropean Conference on Computer Vision. Springer, 2024, pp. 162–179

  23. [23]

    Segment any- thing,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Loet al., “Segment any- thing,”inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026

  24. [24]

    Masked-attention mask transformer for universal image segmenta- tion,

    B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmenta- tion,” inCVPR, 2022

  25. [25]

    Compact3dgaussian representation for radiance field,

    J.C.Lee,D.Rho,X.Sun,J.H.Ko,andE.Park,“Compact3dgaussian representation for radiance field,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21719–21728

  26. [26]

    Li- gs:Gaussiansplattingwithlidarincorporatedforaccuratelarge-scale reconstruction,

    C. Jiang, R. Gao, K. Shao, Y. Wang, R. Xiong, and Y. Zhang, “Li- gs:Gaussiansplattingwithlidarincorporatedforaccuratelarge-scale reconstruction,”IEEE Robotics and Automation Letters, 2024

  27. [27]

    Leverage cross-attention for end-to-end open-vocabulary panoptic reconstruction,

    X. Yu, Y. Xie, Y. Liu, H. Lu, R. Xiong, Y. Liao, and Y. Wang, “Leverage cross-attention for end-to-end open-vocabulary panoptic reconstruction,”arXiv preprint arXiv:2501.01119, 2025

  28. [28]

    Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,

    Y. Liao, J. Xie, and A. Geiger, “Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3292–3310, 2022

  29. [29]

    Scannet:Richly-annotated3dreconstructionsofindoor scenes,

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M.Nießner,“Scannet:Richly-annotated3dreconstructionsofindoor scenes,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 5828–5839

  30. [30]

    Scannet++: A high-fidelity dataset of 3d indoor scenes,

    C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai, “Scannet++: A high-fidelity dataset of 3d indoor scenes,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12–22

  31. [31]

    High-quality surface reconstruction using gaussian surfels,

    P. Dai, J. Xu, W. Xie, X. Liu, H. Wang, and W. Xu, “High-quality surface reconstruction using gaussian surfels,” inACM SIGGRAPH 2024 conference papers, 2024, pp. 1–11

  32. [32]

    arXiv preprint arXiv:2406.01467 (2024)

    B. Zhang, C. Fang, R. Shrestha, Y. Liang, X. Long, and P. Tan, “Rade-gs: Rasterizing depth in gaussian splatting,”arXiv preprint arXiv:2406.01467, 2024. First Author et al.:Preprint submitted to ElsevierPage 11 of 12 Short Title of the Article

  33. [33]

    Liv-gaussmap: Lidar- inertial-visual fusion for real-time 3d radiance field map rendering,

    S. Hong, J. He, X. Zheng, and C. Zheng, “Liv-gaussmap: Lidar- inertial-visual fusion for real-time 3d radiance field map rendering,” IEEE Robotics and Automation Letters,vol.9,no.11,pp.9765–9772, 2024

  34. [34]

    Gaussian opacity fields: Efficient adaptivesurfacereconstructioninunboundedscenes,

    Z. Yu, T. Sattler, and A. Geiger, “Gaussian opacity fields: Efficient adaptivesurfacereconstructioninunboundedscenes,”ACM Transac- tions on Graphics (ToG), vol. 43, no. 6, pp. 1–13, 2024

  35. [35]

    Trim 3d gaus- sian splatting for accurate geometry representation,

    L. Fan, Y. Yang, M. Li, H. Li, and Z. Zhang, “Trim 3d gaus- sian splatting for accurate geometry representation,”arXiv preprint arXiv:2406.07499, 2024

  36. [36]

    Panoptic lifting for 3d scene understanding with neural fields,

    Y. Siddiqui, L. Porzi, S. R. Bulò, N. Müller, M. Nießner, A. Dai, and P. Kontschieder, “Panoptic lifting for 3d scene understanding with neural fields,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023

  37. [37]

    Contrastive lift: 3d object instance segmentation by slow-fast con- trastive fusion,

    Y. Bhalgat, I. Laina, J. F. Henriques, A. Zisserman, and A. Vedaldi, “Contrastive lift: 3d object instance segmentation by slow-fast con- trastive fusion,” inNeurIPS, 2023

  38. [38]

    Panopticvision- language feature fields,

    H.Chen,K.Blomqvist,F.Milano,andR.Siegwart,“Panopticvision- language feature fields,”IEEE Robotics and Automation Letters (RA- L), vol. 9, no. 3, pp. 2144–2151, 2024

  39. [39]

    Panopticrecon: Leverage open-vocabulary instance segmentation for zero-shot panoptic reconstruction,

    X. Yu, Y. Liu, C. Han, S. Mao, S. Zhou, R. Xiong, Y. Liao, and Y. Wang, “Panopticrecon: Leverage open-vocabulary instance segmentation for zero-shot panoptic reconstruction,”arXiv preprint arXiv:2407.01349, 2024

  40. [40]

    arXiv preprint arXiv:2406.02058 (2024)

    Y. Wu, J. Meng, H. Li, C. Wu, Y. Shi, X. Cheng, C. Zhao, H.Feng,E.Ding,J.Wanget al.,“Opengaussian:Towardspoint-level 3d gaussian-based open vocabulary understanding,”arXiv preprint arXiv:2406.02058, 2024

  41. [41]

    3d gaussian splatting for real-time radiance field rendering

    B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering.”ACM Trans. Graph., vol. 42, no. 4, pp. 139–1, 2023

  42. [42]

    Vlfm: Vision-language frontier maps for zero-shot semantic navigation,

    N. Yokoyama, S. Ha, D. Batra, J. Wang, and B. Bucher, “Vlfm: Vision-language frontier maps for zero-shot semantic navigation,”

  43. [43]

    Available: https://arxiv.org/abs/2312.03275

    [Online]. Available: https://arxiv.org/abs/2312.03275

  44. [44]

    arXiv preprint arXiv:2506.17733 (2025)

    M. Lei, S. Li, Y. Wu, H. Hu, Y. Zhou, X. Zheng, G. Ding, S. Du, Z. Wu, and Y. Gao, “Yolov13: Real-time object detection with hypergraph-enhanced adaptive visual perception,” 2025. [Online]. Available: https://arxiv.org/abs/2506.17733 First Author et al.:Preprint submitted to ElsevierPage 12 of 12