Ψ-Map: Panoptic Surface Integrated Mapping Enables Real2Sim Transfer
Pith reviewed 2026-05-10 15:51 UTC · model grok-4.3
The pith
Ψ-Map combines plane-constrained GMMs built from LiDAR with local cross-attention feature lifting to deliver high-quality panoptic surface mapping at over 40 FPS in large-scale scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that Ψ-Map delivers superior geometric and panoptic reconstruction quality in large-scale scenes at inference rates exceeding 40 FPS. Three mechanisms carry that claim: plane-constrained multimodal Gaussian Mixture Models constructed from LiDAR, with 2D Gaussian surfels providing geometric supervision; an end-to-end panoptic learning module that lifts 2D features into 3D through local cross-attention within the view frustum; and a rendering pipeline optimized via Precise Tile Intersection and Top-K Hard Selection.
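Of these mechanisms, Top-K Hard Selection is the most self-contained: instead of alpha-blending a high-dimensional semantic feature over every Gaussian covering a pixel, only the K highest-weight contributors are kept and renormalized. A minimal sketch of that idea, assuming per-pixel blending weights are already computed; the function name and the renormalization step are our own illustration, not the paper's kernel:

```python
import torch

def topk_hard_feature_blend(weights: torch.Tensor, feats: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Blend per-Gaussian semantic features into one pixel feature,
    keeping only the top-k blending weights (hard selection).

    weights: (N,) alpha-blending weights of the N Gaussians hitting the pixel
    feats:   (N, D) high-dimensional semantic feature of each Gaussian
    """
    k = min(k, weights.shape[0])
    top_w, idx = torch.topk(weights, k)      # hard selection: drop the long tail
    top_w = top_w / (top_w.sum() + 1e-8)     # renormalize the surviving weights
    return top_w @ feats[idx]                # (D,) blended pixel feature

# toy usage: 64 candidate Gaussians with 128-d semantic features
w = torch.rand(64)
f = torch.randn(64, 128)
pixel_feat = topk_hard_feature_blend(w, f, k=8)
```

The payoff is that the per-pixel feature cost drops from O(N·D) to O(K·D), which is presumably where the headroom for the reported 40 FPS with high-dimensional features comes from.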
What carries the argument
The integration of LiDAR-constrained multimodal GMMs with 2D Gaussian surfels and local cross-attention for direct 3D panoptic feature lifting.
Load-bearing premise
The premise that LiDAR-based plane-constrained GMMs and local cross-attention lifting of 2D mask features will produce globally consistent panoptic understanding without error accumulation or loss of physical realism across varied large-scale environments.
What would settle it
Observing accumulated panoptic label inconsistencies or geometric deviations in reconstructions of large environments with complex surfaces or sensor noise would indicate the approach does not maintain the claimed consistency and accuracy.
Original abstract
Open-vocabulary panoptic reconstruction is essential for advanced robotics perception and simulation. However, existing methods based on 3D Gaussian Splatting (3DGS) often struggle to simultaneously achieve geometric accuracy, coherent panoptic understanding, and real-time inference frequency in large-scale scenes. In this paper, we propose a comprehensive framework that integrates geometric reinforcement, end-to-end panoptic learning, and efficient rendering. First, to ensure physical realism in large-scale environments, we leverage LiDAR data to construct plane-constrained multimodal Gaussian Mixture Models (GMMs) and employ 2D Gaussian surfels as the map representation, enabling high-precision surface alignment and continuous geometric supervision. Building upon this, to overcome the error accumulation and cumbersome cross-frame association inherent in traditional multi-stage panoptic segmentation pipelines, we design a query-guided end-to-end learning architecture. By utilizing a local cross-attention mechanism within the view frustum, the system lifts 2D mask features directly into 3D space, achieving globally consistent panoptic understanding. Finally, addressing the computational bottlenecks caused by high-dimensional semantic features, we introduce Precise Tile Intersection and a Top-K Hard Selection strategy to optimize the rendering pipeline. Experimental results demonstrate that our system achieves superior geometric and panoptic reconstruction quality in large-scale scenes while maintaining an inference rate exceeding 40 FPS, meeting the real-time requirements of robotic control loops.
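The abstract's lifting step is concrete enough to sketch. Below is one plausible reading in PyTorch: Gaussian surfels are culled to the current view frustum, and only the visible ones act as queries against the frame's 2D mask features. Everything here (names, the residual update, the visibility threshold) is our assumption; the paper releases no code:

```python
import torch
import torch.nn as nn

def lift_masks_to_3d(gauss_xyz, gauss_q, K, T_wc, mask_feats, attn, img_wh):
    """One frame of frustum-local cross-attention lifting (illustrative sketch).

    gauss_xyz:  (N, 3) Gaussian/surfel centers in world coordinates
    gauss_q:    (N, D) per-Gaussian query embeddings
    K:          (3, 3) camera intrinsics
    T_wc:       (4, 4) world-to-camera transform
    mask_feats: (M, D) feature embeddings of the frame's 2D mask proposals
    attn:       nn.MultiheadAttention with embed_dim=D, batch_first=True
    img_wh:     (width, height) of the image
    """
    W, H = img_wh
    # project centers into the camera; keep only Gaussians inside the frustum
    cam = (T_wc[:3, :3] @ gauss_xyz.T + T_wc[:3, 3:]).T              # (N, 3)
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    vis = (cam[:, 2] > 0.1) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
          & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    if not torch.any(vis):
        return gauss_q

    # visible Gaussians query this frame's mask features (local cross-attention)
    q = gauss_q[vis].unsqueeze(0)                                    # (1, Nv, D)
    kv = mask_feats.unsqueeze(0)                                     # (1, M, D)
    lifted, _ = attn(q, kv, kv)

    out = gauss_q.clone()
    out[vis] = out[vis] + lifted.squeeze(0)                          # residual update
    return out

# toy usage
D = 64
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
xyz = torch.randn(500, 3) * 5 + torch.tensor([0.0, 0.0, 10.0])
q0 = torch.randn(500, D)
K = torch.tensor([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T = torch.eye(4)
mfeat = torch.randn(20, D)
q1 = lift_masks_to_3d(xyz, q0, K, T, mfeat, attn, (640, 480))
```

The "local" in the mechanism is the frustum cull: each attention call only ever sees one frame's masks, which is what makes the referee's cross-view-consistency objection below bite.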
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Ψ-Map, a framework for open-vocabulary panoptic surface mapping in large-scale scenes. It constructs plane-constrained multimodal GMMs from LiDAR data using 2D Gaussian surfels for geometric reinforcement and continuous supervision; employs a query-guided end-to-end architecture that lifts 2D mask features into 3D via local cross-attention within each view frustum to achieve panoptic understanding; and applies Precise Tile Intersection plus Top-K Hard Selection to optimize rendering. The central claims are superior geometric and panoptic reconstruction quality together with real-time inference exceeding 40 FPS to support Real2Sim transfer.
Significance. If the performance claims are substantiated, the work would advance real-time robotic perception by unifying high-precision surface mapping with coherent open-vocabulary semantics in a single efficient pipeline, addressing limitations of prior 3DGS methods in large environments. The end-to-end lifting approach and LiDAR-constrained surfels could reduce multi-stage error accumulation and improve physical realism for simulation transfer.
major comments (2)
- [Abstract] The central claims of 'superior geometric and panoptic reconstruction quality' and 'inference rate exceeding 40 FPS' are asserted without any quantitative metrics, baseline comparisons, ablation results, or dataset details. This absence prevents assessment of whether the proposed GMM surfels, local cross-attention lifting, or rendering optimizations actually deliver the stated improvements.
- [Query-guided end-to-end learning architecture] The manuscript states that local cross-attention within the view frustum 'lifts 2D mask features directly into 3D space, achieving globally consistent panoptic understanding.' No cross-view fusion, global consistency loss, or inter-frame label association mechanism is described. In large-scale scenes with overlapping views, this local-only design risks label drift on the same surface element, directly undermining both the panoptic quality and Real2Sim transfer claims.
minor comments (1)
- The abstract would be clearer if it briefly indicated the evaluation scenes or datasets used to support the 'large-scale' claims.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. We believe the clarifications and proposed revisions will strengthen the paper.
Point-by-point responses
- Referee: [Abstract] The central claims of 'superior geometric and panoptic reconstruction quality' and 'inference rate exceeding 40 FPS' are asserted without any quantitative metrics, baseline comparisons, ablation results, or dataset details. This absence prevents assessment of whether the proposed GMM surfels, local cross-attention lifting, or rendering optimizations actually deliver the stated improvements.
Authors: We agree that the abstract presents the claims at a high level without specific numbers. The full manuscript includes quantitative evaluations in the Experiments section, with tables comparing against baselines on metrics such as geometric error, panoptic quality (PQ, SQ, RQ), and runtime on large-scale datasets like KITTI and custom large environments. To improve accessibility, we will revise the abstract to incorporate key quantitative results, such as the achieved FPS and relative improvements, subject to space constraints. revision: yes
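For readers unfamiliar with the metrics the rebuttal names: panoptic quality decomposes as PQ = SQ × RQ, where SQ averages IoU over matched segments and RQ is an F1-style recognition term. A minimal reference computation under our own naming (this is the standard definition, not the paper's evaluation code):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ from matched-segment IoUs (each match requires IoU > 0.5)
    plus false-positive and false-negative segment counts."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp                   # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # recognition quality
    return sq * rq, sq, rq                        # PQ = SQ * RQ

# toy usage: three matched segments, one false positive, two misses
pq, sq, rq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=2)
```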
- Referee: [Query-guided end-to-end learning architecture] The manuscript states that local cross-attention within the view frustum 'lifts 2D mask features directly into 3D space, achieving globally consistent panoptic understanding.' No cross-view fusion, global consistency loss, or inter-frame label association mechanism is described. In large-scale scenes with overlapping views, this local-only design risks label drift on the same surface element, directly undermining both the panoptic quality and Real2Sim transfer claims.
Authors: The referee correctly notes that our architecture relies on local cross-attention per view without explicit cross-view fusion or additional consistency losses. However, global consistency is inherently provided by the shared 3D representation: 2D features are lifted and associated to the same plane-constrained Gaussian surfels across views, with updates accumulated in the persistent map. The LiDAR-based geometric supervision further regularizes the 3D structure. We acknowledge that this implicit mechanism was not sufficiently detailed in the manuscript. We will revise the relevant section to explicitly describe how consistency is maintained through the 3D surfel map and provide additional analysis or ablations on label consistency in overlapping regions. revision: partial
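The rebuttal's implicit-consistency argument is essentially evidence accumulation in the persistent map: every view that observes a surfel deposits label evidence on it, and the surfel's label is read out once from the pooled evidence. A toy NumPy sketch of that mechanism, using discrete label votes where the paper accumulates attention-lifted features; all names are ours:

```python
import numpy as np

def accumulate_surfel_labels(evidence, surfel_idx, frame_labels, num_classes):
    """Accumulate one frame's label evidence into a persistent surfel map.

    evidence:     (S, C) running label evidence per surfel (updated in place)
    surfel_idx:   (P,) index of the surfel hit by each labeled pixel
    frame_labels: (P,) panoptic/semantic label of each pixel
    """
    assert frame_labels.max() < num_classes
    np.add.at(evidence, (surfel_idx, frame_labels), 1.0)  # scatter-add votes
    return evidence

# toy usage: 100 surfels, 5 classes, two overlapping frames voting
S, C = 100, 5
evidence = np.zeros((S, C))
accumulate_surfel_labels(evidence, np.array([3, 3, 7]), np.array([1, 1, 2]), C)
accumulate_surfel_labels(evidence, np.array([3, 7]), np.array([1, 2]), C)
consistent = evidence.argmax(axis=1)   # per-surfel label agreed across views
```

Under this reading, label drift between overlapping views shows up as split evidence on the same surfel, which is exactly what the promised overlap-region ablation would measure.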
Circularity Check
No circularity: claims rest on proposed architecture and experiments, not self-referential definitions or fits.
full rationale
The abstract and described framework introduce LiDAR-constrained GMMs, 2D Gaussian surfels, a query-guided end-to-end architecture with local cross-attention for lifting 2D masks to 3D, and rendering optimizations as independent design choices. No equations, fitted parameters renamed as predictions, or self-citations are shown that would make global panoptic consistency or Real2Sim transfer tautological by construction. The derivation chain remains self-contained against external benchmarks and experimental validation.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LiDAR data can be used to construct plane-constrained multimodal GMMs that provide continuous geometric supervision (sketched below)
- domain assumption: Local cross-attention within the view frustum lifts 2D mask features to globally consistent 3D panoptic understanding
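The first assumption can be made concrete on a single mixture component: fit a Gaussian to a local LiDAR patch, treat the smallest principal axis as the surface normal, and clamp the variance along it. The paper's full multimodal GMM (and how components are fitted and merged) is not specified here, so this sketch illustrates only the plane constraint itself:

```python
import numpy as np

def plane_constrained_gaussian(points, eps=1e-4):
    """Fit one plane-constrained Gaussian to a local LiDAR patch.

    The smallest principal axis is treated as the surface normal and its
    variance is clamped, flattening the Gaussian onto the local plane.
    """
    mu = points.mean(axis=0)
    cov = np.cov(points.T)
    evals, evecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    evals[0] = min(evals[0], eps)         # squash variance along the normal
    return mu, evecs @ np.diag(evals) @ evecs.T

# toy usage: noisy LiDAR points near the z = 0 plane
pts = np.random.randn(200, 3) * np.array([1.0, 1.0, 0.01])
mu, cov = plane_constrained_gaussian(pts)
```

Clamping the normal-direction variance is what turns the component into a surfel-like, plane-aligned Gaussian suitable for continuous geometric supervision.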
Reference graph
Works this paper leans on
- [1] S. Tao, F. Xiang, A. Shukla, Y. Qin, X. Hinrichsen, X. Yuan, C. Bao, X. Lin, Y. Liu, T.-k. Chan et al., "ManiSkill3: GPU parallelized robotics simulation and rendering for generalizable embodied AI," arXiv preprint arXiv:2410.00425, 2024.
- [2] M. Mittal, P. Roth, J. Tigue, A. Richard, O. Zhang, P. Du, A. Serrano-Munoz, X. Yao, R. Zurbrügg, N. Rudin et al., "Isaac Lab: A GPU-accelerated simulation framework for multi-modal robot learning," arXiv preprint arXiv:2511.04831, 2025.
- [3] J. Zhang, K. Wang, S. Wang, M. Li, H. Liu, S. Wei, Z. Wang, Z. Zhang, and H. Wang, "Uni-NaVid: A video-based vision-language-action model for unifying embodied navigation tasks," arXiv preprint arXiv:2412.06224, 2024.
- [4] X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani et al., "Evaluating real-world robot manipulation policies in simulation," arXiv preprint arXiv:2405.05941, 2024.
- [5] Y. Wu, L. Pan, W. Wu, G. Wang, Y. Miao, F. Xu, and H. Wang, "RL-GSBridge: 3D Gaussian splatting based real2sim2real method for robotic manipulation learning," in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 192–198.
- [6] M. N. Qureshi, S. Garg, F. Yandun, D. Held, G. Kantor, and A. Silwal, "SplatSim: Zero-shot sim2real transfer of RGB manipulation policies using Gaussian splatting," in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 6502–6509.
- [7] P. Li, H. Geng, J. Crate, Y. Han, J. Zhang, F. Wang, C. T. Cheng, R. Dong, Y.-J. Wang, H. Lou et al., "Rose: Reconstructing objects, scenes, and trajectories from casual videos for robotic manipulation," in NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning, 2025.
- [8] X. Han, M. Liu, Y. Chen, J. Yu, X. Lyu, Y. Tian, B. Wang, W. Zhang, and J. Pang, "Re3Sim: Generating high-fidelity simulation data via 3D-photorealistic real-to-sim for robotic manipulation," arXiv preprint arXiv:2502.08645, 2025.
- [9] H. Xia, Z.-H. Lin, W.-C. Ma, and S. Wang, "Video2Game: Real-time interactive realistic and browser-compatible environment from a single video," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4578–4588.
- [10] Q. Wu, K. Wang, K. Li, J. Zheng, and J. Cai, "ObjectSDF++: Improved object-compositional neural implicit surfaces," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21764–21774.
- [11] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, "VGGT: Visual geometry grounded transformer," in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5294–5306.
- [12] S. Zhou, H. Chang, S. Jiang, Z. Fan, D. Xu, P. Chari, S. You, Z. Wang, and A. Kadambi, "Feature 3DGS: Supercharging 3D Gaussian splatting to enable distilled feature fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21676–21685.
- [13] Z. Yu, A. Chen, B. Huang, T. Sattler, and A. Geiger, "Mip-Splatting: Alias-free 3D Gaussian splatting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19447–19456.
- [14] A. Guédon and V. Lepetit, "SuGaR: Surface-aligned Gaussian splatting for efficient 3D mesh reconstruction and high-quality mesh rendering," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 5354–5363.
- [15] B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao, "2D Gaussian splatting for geometrically accurate radiance fields," in ACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11.
- [16] D. Chen, H. Li, W. Ye, Y. Wang, W. Xie, S. Zhai, N. Wang, H. Liu, H. Bao, and G. Zhang, "PGSR: Planar-based Gaussian splatting for efficient and high-fidelity surface reconstruction," IEEE Transactions on Visualization and Computer Graphics, 2024.
- [17] K. Goel, N. Michael, and W. Tabib, "Probabilistic point cloud modeling via self-organizing Gaussian mixture models," IEEE Robotics and Automation Letters, vol. 8, no. 5, pp. 2526–2533, 2023.
- [18] K. Goel and W. Tabib, "Incremental multimodal surface mapping via self-organizing Gaussian mixture models," IEEE Robotics and Automation Letters, vol. 8, no. 12, pp. 8358–8365, 2023.
- [19] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
- [20] J. Kerr, C. M. Kim, K. Goldberg, A. Kanazawa, and M. Tancik, "LERF: Language embedded radiance fields," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 19729–19739.
- [21] M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, "LangSplat: 3D language Gaussian splatting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 20051–20060.
- [22] M. Ye, M. Danelljan, F. Yu, and L. Ke, "Gaussian Grouping: Segment and edit anything in 3D scenes," in European Conference on Computer Vision. Springer, 2024, pp. 162–179.
- [23] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment Anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4015–4026.
- [24] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention mask transformer for universal image segmentation," in CVPR, 2022.
- [25] J. C. Lee, D. Rho, X. Sun, J. H. Ko, and E. Park, "Compact 3D Gaussian representation for radiance field," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21719–21728.
- [26] C. Jiang, R. Gao, K. Shao, Y. Wang, R. Xiong, and Y. Zhang, "LI-GS: Gaussian splatting with LiDAR incorporated for accurate large-scale reconstruction," IEEE Robotics and Automation Letters, 2024.
- [27] X. Yu, Y. Xie, Y. Liu, H. Lu, R. Xiong, Y. Liao, and Y. Wang, "Leverage cross-attention for end-to-end open-vocabulary panoptic reconstruction," arXiv preprint arXiv:2501.01119, 2025.
- [28] Y. Liao, J. Xie, and A. Geiger, "KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3292–3310, 2022.
- [29] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5828–5839.
- [30] C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai, "ScanNet++: A high-fidelity dataset of 3D indoor scenes," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12–22.
- [31] P. Dai, J. Xu, W. Xie, X. Liu, H. Wang, and W. Xu, "High-quality surface reconstruction using Gaussian surfels," in ACM SIGGRAPH 2024 Conference Papers, 2024, pp. 1–11.
- [32] B. Zhang, C. Fang, R. Shrestha, Y. Liang, X. Long, and P. Tan, "RaDe-GS: Rasterizing depth in Gaussian splatting," arXiv preprint arXiv:2406.01467, 2024.
- [33] S. Hong, J. He, X. Zheng, and C. Zheng, "LIV-GaussMap: LiDAR-inertial-visual fusion for real-time 3D radiance field map rendering," IEEE Robotics and Automation Letters, vol. 9, no. 11, pp. 9765–9772, 2024.
- [34] Z. Yu, T. Sattler, and A. Geiger, "Gaussian opacity fields: Efficient adaptive surface reconstruction in unbounded scenes," ACM Transactions on Graphics (TOG), vol. 43, no. 6, pp. 1–13, 2024.
- [35] L. Fan, Y. Yang, M. Li, H. Li, and Z. Zhang, "Trim 3D Gaussian splatting for accurate geometry representation," arXiv preprint arXiv:2406.07499, 2024.
- [36] Y. Siddiqui, L. Porzi, S. R. Bulò, N. Müller, M. Nießner, A. Dai, and P. Kontschieder, "Panoptic lifting for 3D scene understanding with neural fields," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
- [37] Y. Bhalgat, I. Laina, J. F. Henriques, A. Zisserman, and A. Vedaldi, "Contrastive Lift: 3D object instance segmentation by slow-fast contrastive fusion," in NeurIPS, 2023.
- [38] H. Chen, K. Blomqvist, F. Milano, and R. Siegwart, "Panoptic vision-language feature fields," IEEE Robotics and Automation Letters (RA-L), vol. 9, no. 3, pp. 2144–2151, 2024.
- [39] X. Yu, Y. Liu, C. Han, S. Mao, S. Zhou, R. Xiong, Y. Liao, and Y. Wang, "PanopticRecon: Leverage open-vocabulary instance segmentation for zero-shot panoptic reconstruction," arXiv preprint arXiv:2407.01349, 2024.
- [40] Y. Wu, J. Meng, H. Li, C. Wu, Y. Shi, X. Cheng, C. Zhao, H. Feng, E. Ding, J. Wang et al., "OpenGaussian: Towards point-level 3D Gaussian-based open vocabulary understanding," arXiv preprint arXiv:2406.02058, 2024.
- [41] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, "3D Gaussian splatting for real-time radiance field rendering," ACM Transactions on Graphics, vol. 42, no. 4, 2023.
- [42] N. Yokoyama, S. Ha, D. Batra, J. Wang, and B. Bucher, "VLFM: Vision-language frontier maps for zero-shot semantic navigation," arXiv preprint arXiv:2312.03275, 2023.
- [43] M. Lei, S. Li, Y. Wu, H. Hu, Y. Zhou, X. Zheng, G. Ding, S. Du, Z. Wu, and Y. Gao, "YOLOv13: Real-time object detection with hypergraph-enhanced adaptive visual perception," arXiv preprint arXiv:2506.17733, 2025.