pith. machine review for the scientific record.

arxiv: 2604.12626 · v1 · submitted 2026-04-14 · 💻 cs.RO · cs.CV

Recognition: unknown

Habitat-GS: A High-Fidelity Navigation Simulator with Dynamic Gaussian Splatting

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:20 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV
keywords 3D Gaussian Splatting · embodied AI · navigation simulator · Habitat · Gaussian avatars · photorealistic rendering · human-aware navigation · cross-domain generalization

The pith

Integrating 3D Gaussian Splatting and dynamic Gaussian avatars into a navigation simulator boosts agent generalization to new domains and human interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Habitat-GS adds 3D Gaussian Splatting rendering to the Habitat simulator for more realistic visual environments and introduces drivable Gaussian avatars that act as both realistic visuals and physical obstacles. This setup lets embodied AI agents train in higher-fidelity simulations of real-world scenes containing people. Experiments indicate that training on these 3DGS scenes leads to better performance when agents are tested in different domains, and that mixed training on both 3DGS and standard scenes works best. The avatars specifically help agents learn to navigate around humans without collisions.

Core claim

Habitat-GS extends Habitat-Sim with a 3DGS renderer for real-time photorealistic rendering and a Gaussian avatar module where each avatar serves as both a photorealistic visual entity and a navigation obstacle, resulting in improved cross-domain generalization for point-goal navigation agents and effective human-aware navigation.

What carries the argument

The 3D Gaussian Splatting renderer, together with the Gaussian avatar module, which provides both visual fidelity and collision detection for dynamic humans.
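A minimal sketch of that dual role, assuming hypothetical names (GaussianAvatar, add_blocking_capsule, rasterize, try_step) rather than the actual Habitat-GS API: each avatar's splats go through the renderer, while only a pre-computed proxy capsule is injected into the NavMesh, mirroring the "visual–navigation decoupling" shown in Figure 6 below.

    # Illustrative sketch only: class and function names are hypothetical,
    # not the Habitat-GS API. It shows the dual role an avatar plays per step.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class GaussianAvatar:
        means: np.ndarray       # (N, 3) posed splat centers for this frame
        capsule_radius: float   # pre-computed collision proxy radius

        def proxy_capsule(self):
            # Approximate the splat cloud with a vertical capsule segment;
            # a real pipeline would fit this offline from the canonical pose.
            cx, cy = self.means[:, :2].mean(axis=0)
            z_lo, z_hi = self.means[:, 2].min(), self.means[:, 2].max()
            return (np.array([cx, cy, z_lo]), np.array([cx, cy, z_hi]),
                    self.capsule_radius)

    def simulation_step(scene_means, avatars, agent, navmesh, rasterize):
        # Visual path: scene splats and posed avatar splats share one
        # rasterizer call, so avatars appear photorealistic in the agent's
        # RGB-D observation.
        all_means = np.concatenate([scene_means] + [a.means for a in avatars])
        rgb = rasterize(all_means, camera=agent.camera)

        # Navigation path: only the proxy capsules touch the NavMesh;
        # rendering never sees collision, navigation never sees splats.
        for a in avatars:
            navmesh.add_blocking_capsule(*a.proxy_capsule())
        agent.position = navmesh.try_step(agent.position, agent.goal_step())
        return rgb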

Load-bearing premise

The photorealism and collision modeling from 3D Gaussian Splatting and drivable avatars are sufficient to improve real-world generalization without introducing new artifacts or biases in agent behavior.

What would settle it

A real-world deployment test in which agents trained in Habitat-GS perform no better than, or worse than, agents trained in standard mesh-based Habitat-Sim when navigating around actual moving people in varied physical environments.
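"Performance" in such a test would presumably be scored with the standard point-goal metrics of Anderson et al. [1]; a minimal sketch of success rate and SPL (Success weighted by Path Length), with an episode encoding of my own choosing:

    def success_and_spl(episodes):
        # episodes: list of (success: bool, shortest_path: float, agent_path: float).
        # SPL = mean over episodes of S_i * l_i / max(p_i, l_i), following [1].
        n = len(episodes)
        succ = sum(s for s, _, _ in episodes) / n
        spl = sum(s * l / max(p, l) for s, l, p in episodes) / n
        return succ, spl

    # Example: one success with a 20% longer path, one failure.
    print(success_and_spl([(True, 5.0, 6.0), (False, 4.0, 9.0)]))  # (0.5, ~0.417)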

Figures

Figures reproduced from arXiv: 2604.12626 by Chong Cui, Hujun Bao, Jiazhao Zhang, Jingyi Xu, Junbo Chen, Qingsong Yan, Ruizhen Hu, Sida Peng, Tao Ni, Xiaowei Zhou, Yuanhong Yu, Ziyuan Xia.

Figure 1: Habitat-GS is a navigation-centric embodied simulation platform with 3DGS and dynamic gaussian avatars. Compared to traditional mesh-based simulators (left), our 3DGS-based simulator (right) preserves high-frequency visual details and view-dependent effects, while gaussian avatars provide realistic and dynamic human presence for human-aware navigation scenarios, thus helping train more robust agents.
Figure 2: System overview of Habitat-GS. From left to right: Asset Preparation, where 3DGS scene assets and gaussian avatar assets are prepared respectively; Habitat-GS Simulation Environment, where the render engine performs 3DGS rasterization for scene gaussians and LBS deformation followed by rasterization for avatar gaussians, producing RGB-D observations. The NavMesh blocking module retrieves pre-computed proxy…
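The paper describes a lightweight CUDA LBS kernel that deforms avatar gaussians to arbitrary SMPL-X [19] poses, avoiding costly neural-network inference at runtime. A NumPy sketch of that linear-blend-skinning step for the splat centers (shapes here are assumptions; the real kernel must also rotate the gaussian covariances, omitted for brevity):

    import numpy as np

    def lbs_deform_means(means_canon, skin_weights, joint_transforms):
        # Linear blend skinning of splat centers (NumPy stand-in for the
        # CUDA kernel described in the paper).
        #   means_canon:      (N, 3) canonical-pose splat centers
        #   skin_weights:     (N, J) SMPL-X skinning weights, rows sum to 1
        #   joint_transforms: (J, 4, 4) posed joint transforms
        # Blend the J joint transforms per splat: T_i = sum_j w_ij * G_j
        blended = np.einsum("nj,jab->nab", skin_weights, joint_transforms)
        # Apply each blended transform to its homogeneous canonical center
        homo = np.concatenate([means_canon, np.ones((len(means_canon), 1))], axis=1)
        posed = np.einsum("nab,nb->na", blended, homo)
        return posed[:, :3]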
Figure 3: Visual comparison of scene rendering. Mesh-based rendering (left) vs. our 3DGS rendering (right). Our simulator is based on 3DGS, which preserves high-frequency details and supports diverse sources of rendering assets.
Figure 4: Qualitative comparison of mesh avatars and gaussian avatars.
Figure 5: VLM scene quality assessment. Gemini 3.0 Pro evaluates 240 rendered screenshots from each domain on three perceptual dimensions (rendering quality, realism, scene diversity). GS scenes consistently outperform mesh scenes, confirming their superior visual fidelity and diversity. The screenshots are divided into 48 evaluation batches, each containing 5 GS and 5 mesh images with randomized indices to blind the model to the rendering source; the VLM scores each image on a 10-point scale.
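The blinding protocol in the caption is straightforward to reproduce; a sketch under the stated numbers (240 screenshots per source, 48 batches of 5 GS + 5 mesh), with the Gemini prompt and scoring call left out since only the batching is described:

    import random

    def make_blind_batches(gs_paths, mesh_paths, n_batches=48, per_source=5, seed=0):
        # Batching per the Fig. 5 caption: each batch mixes 5 GS and 5 mesh
        # screenshots in shuffled order so the scorer cannot infer the
        # rendering source from position. The VLM call itself is omitted.
        rng = random.Random(seed)
        gs, mesh = gs_paths[:], mesh_paths[:]
        rng.shuffle(gs); rng.shuffle(mesh)
        batches = []
        for b in range(n_batches):
            items = [(p, "gs") for p in gs[b * per_source:(b + 1) * per_source]]
            items += [(p, "mesh") for p in mesh[b * per_source:(b + 1) * per_source]]
            rng.shuffle(items)  # randomized indices blind the model to source
            batches.append({"images": [p for p, _ in items],
                            "labels": [lbl for _, lbl in items]})  # kept aside
        return batches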
Figure 6: System architecture of Habitat-GS. The system adopts a "visual–navigation decoupling" design principle, separating the visual rendering modules, handled by the CUDA-based 3DGS rasterizer and LBS deformation, from the navigation module, managed by the traditional NavMesh and injected proxy capsules. This allows for photorealistic agent observations without modifying the core Habitat-Sim navigation logic.
Figure 7: Additional visualizations of 3DGS scenes and gaussian avatars.
Figure 8: Qualitative visualization of navigation episodes.
Original abstract

Training embodied AI agents depends critically on the visual fidelity of simulation environments and the ability to model dynamic humans. Current simulators rely on mesh-based rasterization with limited visual realism, and their support for dynamic human avatars, where available, is constrained to mesh representations, hindering agent generalization to human-populated real-world scenarios. We present Habitat-GS, a navigation-centric embodied AI simulator extended from Habitat-Sim that integrates 3D Gaussian Splatting scene rendering and drivable gaussian avatars while maintaining full compatibility with the Habitat ecosystem. Our system implements a 3DGS renderer for real-time photorealistic rendering and supports scalable 3DGS asset import from diverse sources. For dynamic human modeling, we introduce a gaussian avatar module that enables each avatar to simultaneously serve as a photorealistic visual entity and an effective navigation obstacle, allowing agents to learn human-aware behaviors in realistic settings. Experiments on point-goal navigation demonstrate that agents trained on 3DGS scenes achieve stronger cross-domain generalization, with mixed-domain training being the most effective strategy. Evaluations on avatar-aware navigation further confirm that gaussian avatars enable effective human-aware navigation. Finally, performance benchmarks validate the system's scalability across varying scene complexity and avatar counts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces Habitat-GS, an extension of Habitat-Sim that integrates 3D Gaussian Splatting for real-time photorealistic scene rendering and drivable Gaussian avatars for dynamic human modeling in navigation tasks. It maintains full compatibility with the Habitat ecosystem, supports scalable asset import, and presents experiments showing that agents trained on 3DGS scenes achieve stronger cross-domain generalization in point-goal navigation (with mixed-domain training most effective) while Gaussian avatars enable effective human-aware navigation.

Significance. If the reported results hold, the work is significant for embodied AI because it directly addresses two key simulator limitations—visual fidelity and dynamic human modeling—using 3DGS, which has the potential to improve sim-to-real transfer for navigation agents in human-populated environments. The dual role of Gaussian avatars as both photorealistic visuals and collision obstacles is a practical engineering contribution, and the maintained Habitat compatibility lowers barriers to adoption. The stress-test concern regarding new artifacts or biases from 3DGS fidelity does not appear to be a load-bearing objection given the described implementation and positive experimental outcomes.

minor comments (2)
  1. Abstract: The summary of experimental outcomes would be strengthened by including at least one key quantitative metric (e.g., success rate or SPL improvement) and a brief mention of baselines, even if full details appear later in the paper.
  2. The manuscript would benefit from a short dedicated subsection or paragraph clarifying how Gaussian avatar collision geometry is derived from the splat representation and whether any approximation steps are involved; one way to quantify such an approximation is sketched below.
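One way to make the requested clarification measurable (a hypothetical diagnostic, not the paper's method): report the fraction of splat opacity mass that falls outside the fitted proxy capsule.

    import numpy as np

    def capsule_leakage(means, opacities, base, top, radius):
        # Fraction of splat opacity mass outside a capsule defined by the
        # segment base->top and a radius; quantifies how loosely the
        # collision proxy fits the splat cloud. All names illustrative.
        axis = top - base
        t = np.clip((means - base) @ axis / (axis @ axis), 0.0, 1.0)
        nearest = base + t[:, None] * axis            # closest segment point
        dist = np.linalg.norm(means - nearest, axis=1)
        return opacities[dist > radius].sum() / opacities.sum()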

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of Habitat-GS, recognition of its significance for embodied AI, and recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an engineering description of a simulator extension (Habitat-GS) that integrates 3D Gaussian Splatting rendering and drivable avatars into Habitat-Sim. All central claims rest on system implementation details and reported experimental outcomes (point-goal navigation generalization and avatar-aware navigation performance). No derivation chain, equations, first-principles predictions, or fitted parameters labeled as predictions exist in the provided text. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The work validates its claims directly through experiments against external benchmarks rather than through internal derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

As an applied systems paper, the work relies on standard assumptions from computer graphics and simulation without introducing mathematical free parameters or unstated axioms beyond those implicit in 3DGS and Habitat-Sim.

invented entities (1)
  • drivable gaussian avatars (no independent evidence)
    purpose: To model dynamic humans that function simultaneously as photorealistic visuals and effective navigation obstacles
    New module introduced to overcome limitations of mesh-based avatars in existing simulators.

pith-pipeline@v0.9.0 · 5552 in / 1090 out tokens · 45733 ms · 2026-05-10T15:20:42.186110+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

37 extracted references · 29 canonical work pages · 8 internal anchors

  [1] Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., Savva, M., Zamir, A.R.: On evaluation of embodied navigation agents (2018), https://arxiv.org/abs/1807.06757

  [2] Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.: Mip-nerf 360: Unbounded anti-aliased neural radiance fields (2022), https://arxiv.org/abs/2111.12077

  [3] Batra, D., Gokaslan, A., Kembhavi, A., Maksymets, O., Mottaghi, R., Savva, M., Toshev, A., Wijmans, E.: Objectnav revisited: On evaluation of embodied agents navigating to objects (2020), https://arxiv.org/abs/2006.13171

  [4] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statistical machine translation (2014), https://arxiv.org/abs/1406.1078

  [5] Gan, C., Schwartz, J., Alter, S., Mrowca, D., Schrimpf, M., Traer, J., Freitas, J.D., Kubilius, J., Bhandwaldar, A., Haber, N., Sano, M., Kim, K., Wang, E., Lingelbach, M., Curtis, A., Feigelis, K., Bear, D.M., Gutfreund, D., Cox, D., Torralba, A., DiCarlo, J.J., Tenenbaum, J.B., McDermott, J.H., Yamins, D.L.K.: Threedworld: A platform for interactive multi-modal physical...

  [6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015), https://arxiv.org/abs/1512.03385

  [7] Hu, L., Zhang, H., Zhang, Y., Zhou, B., Liu, B., Zhang, S., Nie, L.: Gaussianavatar: Towards realistic human avatar modeling from a single video via animatable 3d gaussians (2024), https://arxiv.org/abs/2312.02134

  [8] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering (2023), https://arxiv.org/abs/2308.04079

  [9] Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Deitke, M., Ehsani, K., Gordon, D., Zhu, Y., Kembhavi, A., Gupta, A., Farhadi, A.: Ai2-thor: An interactive 3d environment for visual ai (2022), https://arxiv.org/abs/1712.05474

  [10] Lei, J., Wang, Y., Pavlakos, G., Liu, L., Daniilidis, K.: Gart: Gaussian articulated template models (2023), https://arxiv.org/abs/2311.16099

  [11] Li, C., Xia, F., Martín-Martín, R., Lingelbach, M., Srivastava, S., Shen, B., Vainio, K., Gokmen, C., Dharan, G., Jain, T., Kurenkov, A., Liu, C.K., Gweon, H., Wu, J., Fei-Fei, L., Savarese, S.: igibson 2.0: Object-centric simulation for robot learning of everyday household tasks (2021), https://arxiv.org/abs/2108.03272

  [12] Li, Z., Zheng, Z., Wang, L., Liu, Y.: Animatable gaussians: Learning pose-dependent gaussian maps for high-fidelity human avatar modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  [13] Liu, X., Zhan, X., Tang, J., Shan, Y., Zeng, G., Lin, D., Liu, X., Liu, Z.: Humangaussian: Text-driven 3d human generation with gaussian splatting (2024), https://arxiv.org/abs/2311.17061

  [14] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34(6), 248:1–248:16 (Oct 2015)

  [15] Luiten, J., Kopanas, G., Leibe, B., Ramanan, D.: Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis (2023), https://arxiv.org/abs/2308.09713

  [16] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis (2020), https://arxiv.org/abs/2003.08934

  [17] Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics 41(4), 1–15 (Jul 2022). https://doi.org/10.1145/3528223.3530127

  [18] NVIDIA: Isaac Sim, https://github.com/isaac-sim/IsaacSim

  [19] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image (2019), https://arxiv.org/abs/1904.05866

  [20] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion (2022), https://arxiv.org/abs/2209.14988

  [21] Puig, X., Undersander, E., Szot, A., Cote, M.D., Yang, T.Y., Partsey, R., Desai, R., Clegg, A.W., Hlavac, M., Min, S.Y., Vondruš, V., Gervet, T., Berges, V.P., Turner, J.M., Maksymets, O., Kira, Z., Kalakrishnan, M., Malik, J., Chaplot, D.S., Jain, U., Batra, D., Rai, A., Mottaghi, R.: Habitat 3.0: A co-habitat for humans, avatars and robots (2023), https:...

  [22] Ramakrishnan, S.K., Gokaslan, A., Wijmans, E., Maksymets, O., Clegg, A., Turner, J., Undersander, E., Galuba, W., Westbury, A., Chang, A.X., Savva, M., Zhao, Y., Batra, D.: Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai (2021), https://arxiv.org/abs/2109.08238

  [23] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., Batra, D.: Habitat: A platform for embodied ai research (2019), https://arxiv.org/abs/1904.01201

  [24] Shen, B., Xia, F., Li, C., Martín-Martín, R., Fan, L., Wang, G., Pérez-D'Arpino, C., Buch, S., Srivastava, S., Tchapmi, L.P., Tchapmi, M.E., Vainio, K., Wong, J., Fei-Fei, L., Savarese, S.: igibson 1.0: a simulation environment for interactive tasks in large realistic scenes (2021), https://arxiv.org/abs/2012.02924

  [25] SpatialVerse Research Team, M.T.I.: Interiorgs: A 3d gaussian splatting dataset of semantically labeled indoor scenes. https://huggingface.co/datasets/spatialverse/InteriorGS (2025)

  [26] Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., Clarkson, A., Yan, M., Budge, B., Yan, Y., Pan, X., Yon, J., Zou, Y., Leon, K., Carter, N., Briales, J., Gillingham, T., Mueggler, E., Pesqueira, L., Savva, M., Batra, D., Strasdat, H.M., Nardi, R.D., Goesele, M., Lovegrove, S., Newcombe, R.:...

  [27] Szot, A., Clegg, A., Undersander, E., Wijmans, E., Zhao, Y., Turner, J., Maestre, N., Mukadam, M., Chaplot, D., Maksymets, O., Gokaslan, A., Vondrus, V., Dharur, S., Meier, F., Galuba, W., Chang, A., Kira, Z., Koltun, V., Malik, J., Savva, M., Batra, D.: Habitat 2.0: Training home assistants to rearrange their habitat (2022), https://arxiv.org/abs/2106.14405

  [28] Team, G., Anil, R., Borgeaud, S., et al.: Gemini: A family of highly capable multimodal models (2025), https://arxiv.org/abs/2312.11805

  [29] Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., Abbeel, P.: Domain randomization for transferring deep neural networks from simulation to the real world (2017), https://arxiv.org/abs/1703.06907

  [30] Wijmans, E., Kadian, A., Morcos, A., Lee, S., Essa, I., Parikh, D., Savva, M., Batra, D.: Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames (2020), https://arxiv.org/abs/1911.00357

  [31] World Labs: Marble: A multimodal world model (11 2025), https://www.worldlabs.ai/blog/marble-world-model

  [32] Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., Liu, M., Jiang, H., Yuan, Y., Wang, H., Yi, L., Chang, A.X., Guibas, L.J., Su, H.: Sapien: A simulated part-based interactive environment (2020), https://arxiv.org/abs/2003.08515

  [33] Yu, Z., Chen, A., Huang, B., Sattler, T., Geiger, A.: Mip-splatting: Alias-free 3d gaussian splatting (2023), https://arxiv.org/abs/2311.16493

  [34] Zhang, Y., Tang, S.: The wanderings of odysseus in 3d scenes (2022), https://arxiv.org/abs/2112.09251
