arxiv: 2412.09176 · v2 · submitted 2024-12-12 · 💻 cs.HC

LIVE-GS: LLM Powers Interactive VR Experience with Physics-Aware Gaussian Splatting

Pith reviewed 2026-05-23 07:30 UTC · model grok-4.3

classification 💻 cs.HC
keywords 3D Gaussian Splatting · Virtual Reality · Large Language Models · Physics simulation · Interactive systems · Asset creation · User study

The pith

An LLM assigns physical parameters to static 3D Gaussian assets in about 10 seconds, enabling realistic VR interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes LIVE-GS, a VR system that uses large language models to turn static 3D Gaussian Splatting reconstructions into dynamic assets with physics. Interviews revealed user needs, which guided the use of GPT-4o to infer object properties such as mass and friction from visual data. The system claims these parameters enable natural VR interactions that match real-world behavior. It achieves this in roughly 10 seconds per asset while keeping visual quality intact during real-time use. Validation came from comparisons with manual expert tuning and a user study on usability.

Core claim

LIVE-GS shows that GPT-4o, informed by interviews and visual input from static Gaussian assets, can predict physical parameters that support realistic VR interactions in about 10 seconds. The approach replaces manual design or annotation, with results demonstrating that LLM-derived values produce interactions aligned with real-world phenomena while preserving high-quality rendering.

What carries the argument

GPT-4o inference of physical simulation parameters (mass, friction, and similar) for Gaussian Splatting objects to drive real-time physics in VR.
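
As a rough illustration of what such an inference step has to hand to the physics engine, the sketch below parses a model reply into a validated parameter record. The JSON layout, the `PhysicalParams` field names, and the friction sanity range are assumptions made for this example; the paper's actual prompt and output format are not stated here. The sample reply mirrors the wolf values listed under Figure 8.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalParams:
    """Hypothetical per-object record; field names are illustrative only."""
    category: str
    mass_kg: float
    friction: float


def parse_params(reply: str) -> PhysicalParams:
    """Parse a JSON reply and reject values that are physically implausible."""
    data = json.loads(reply)
    params = PhysicalParams(
        category=str(data["category"]),
        mass_kg=float(data["mass_kg"]),
        friction=float(data["friction"]),
    )
    if params.mass_kg <= 0:
        raise ValueError("mass must be positive")
    # Assumed sanity range for a friction coefficient; not taken from the paper.
    if not 0.0 <= params.friction <= 1.5:
        raise ValueError("friction outside assumed range")
    return params


# Example reply mirroring the wolf values reported under Figure 8.
example_reply = '{"category": "granule", "mass_kg": 0.5, "friction": 0.3}'
wolf = parse_params(example_reply)
```

Validating before the values reach the simulator keeps a malformed or out-of-range reply from silently producing unstable dynamics, which matters when no per-asset calibration follows.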

If this is right

  • Static Gaussian assets can be converted to interactive dynamic assets without manual parameter tuning.
  • VR interactions reflect real-world physical behavior based on the inferred parameters.
  • Visual quality and rendering performance remain high during real-time physics simulation.
  • Authoring time drops to seconds compared with expert manual adjustment.
  • User studies confirm improved efficiency and satisfaction for non-expert creators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same inference approach could extend to other 3D scene representations beyond Gaussian Splatting.
  • Automated physics assignment might support large-scale libraries of ready-to-use interactive VR assets.
  • Limits of the method could be tested by applying it to scenes with many interacting objects or unusual materials.

Load-bearing premise

The LLM produces physical parameters that match real-world dynamics from only visual input and interview insights, without per-asset calibration.

What would settle it

A side-by-side VR test where objects with LLM-predicted parameters behave differently from objects with parameters measured from real physical counterparts.
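
One way to make such a comparison quantitative is a per-parameter relative error between predicted and measured values. The sketch below is only an illustration: the function name, the dictionary layout, and all numbers are hypothetical and do not come from the paper.

```python
def relative_errors(
    predicted: dict[str, float], measured: dict[str, float]
) -> dict[str, float]:
    """Return |predicted - measured| / |measured| for every shared parameter."""
    errors = {}
    for name in predicted.keys() & measured.keys():
        reference = measured[name]
        if reference == 0:
            raise ValueError(f"measured value for {name!r} is zero")
        errors[name] = abs(predicted[name] - reference) / abs(reference)
    return errors


# Illustrative numbers only; these are not measurements from the paper.
predicted = {"mass_kg": 0.5, "friction": 0.3}
measured = {"mass_kg": 0.4, "friction": 0.25}
errors = relative_errors(predicted, measured)
```

A relative error puts mass and friction on a comparable scale, although a low parameter error would still need to be checked against the resulting motion in VR.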

Figures

Figures reproduced from arXiv: 2412.09176 by Hangyu Zhou, Haotian Mao, Nianchen Deng, Siyue Wei, Xubo Yang, Yan Zhang, Yule Quan, Zhuoxiong Xu, Zixuan Guo.

Figure 1: Our system, LIVE-GS, reconstructs the scene with extra features and segments target objects. Most importantly, our…
Figure 2: System overview. Our system consists of three parts: scene reconstruction, scene enhancement and interactive framework. With original images and initial point clouds, we train Gaussian model with identity encoding and segment our targets through feature-mask segmentation. Afterwards, we leverage GPT to enhance our system’s scene understanding by analyzing objects’ properties and tracking possible artifacts…
Figure 3: Feature-mask segmentation. Feature-mask segmentation: To segment objects from environments for subsequent processing, we design a feature-mask method, which is divided into two stages: feature stage and mask stage, for accurate segmentation. We choose the intersection of results in two stages as outcome. Feature Stage: During training procedure, we import a classifier C along with 3D regularization loss,…
Figure 4: Artifacts tracking. We input the source image, the removal image and related mask in GPT-4o to obtain color prompts. Then we track the artifacts with DEVA and intersect them with the mask, generating the final mask for 2D inpainting method. the environment. The result is far beyond our requirement for subsequent simulation. Mask Stage: To obtain fine segmented result, we adopt a voting strategy similar…
Figure 6: Particles filling in container. six detections fail in the region above the surface but five detections can succeed. Filling particles find the nearest surface in the shrinkage and inherit their attributes. For granular material, we choose a more common condition based on realistic situation, where it is in some containers and Gaussian kernels only distribute on the surface. To completely reconstruct it,…
Figure 7: Segmentation comparison with Gaussian-grouping. We respectively render the images for removed objects and rest environment. Our method achieves detailed segmentation in various situation, which benefits the simulation. 4.2 Segmentation: The segmentation largely impacts the subsequent processing, especially for dynamic objects. We choose dataset teatime, figurines and bear, which consist of targets with di…
Figure 8: Physical simulation compared with PhysGS. We choose fox, sofa and wolf for our experiments. We achieve competitive simulation results while still maintaining high efficiency.
  wolf
    #   Category      Mass (kg)   Friction
    1   granule       0.5         0.3
  fox
    #   Category      Mass (kg)   Deformation Resistance   Plasticity
    1   deformation   0.5         0.3                      0.2
Figure 9: Interaction with the wolf. The first line is analyzed as a doll and we play with it. While the second line is analyzed as granular material like sand with dialogue prompt in comparison with PhysGS. We draw a VR and a smile on it
Figure 10: Interaction in sofa. We analyze the physical properties of pillows and cushion. Then we apply spring force to lift them up and two ends of the spring force are marked with arrows, observing the obvious distinction of their mass. Although predicted with the same mass values, they act aligned with visual effects after fixed by our correction factor
Figure 11: Interaction in garden. First, we drag the flower and shake the vase. Then we throw a ball, breaking the vase into pieces. to test the material analysis ability and interaction in our system. We demonstrate our analysis details and complete particular tasks in each scene. wolf and sofa: In this section, we analyze the physical property without user dialogue for appointing. Different from Sec.4.3, the wolf h…
Figure 12: Interaction in our custom dataset labdesk. From one to six are respectively a paper cup, a box of tissues, a Teddy Bear, a plastic hammer, a mug and coffee powder in the mug. After segmentation and analysis, we hold the hammer and interact with other objects in our demo. tailed reconstruction and segmentation, integrating various physical simulations for real-time interaction consistent with visual effe…
Original abstract

As 3D Gaussian Splatting (3DGS) emerges as a leading approach for novel view synthesis and scene reconstruction, its potential in digital asset creation has gained significant attention. An increasing number of asset libraries based on GS are being established. However, generating physics-based dynamic assets remains a time-consuming and expertise-intensive task, especially for non-experts. In this paper, we propose LIVE-GS, a highly realistic Virtual Reality (VR) system powered by Large Language Models (LLMs), which enables rapid creation of dynamic Gaussian assets and real-time VR interactions. To inform our system design, we conducted interviews to examine challenges faced by current GS-based VR systems and the specific demands of users. Based on these insights, we employed GPT-4o to analyze key physical properties of objects that significantly impact user interactions, ensuring physics-based interactions in VR align with real-world phenomena. A key innovation of LIVE-GS is its ability to predict reasonable parameters in just 10 seconds from static Gaussian assets while maintaining high-quality VR interactions. To validate our approach, we invited participants experienced in physical simulation to manually adjust physical parameters, providing a baseline for comparison in both asset quality and authoring efficiency. We also conducted a comprehensive user study to evaluate system usability and user satisfaction. Experimental results demonstrate that LIVE-GS, leveraging LLMs' scene understanding capabilities, can achieve efficient physical scene creation and natural interactions without requiring manual design or annotation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper presents LIVE-GS, a VR system that uses GPT-4o to analyze static 3D Gaussian Splatting assets and predict physical parameters (mass, friction, restitution, etc.) for real-time physics-aware interactions. The design is informed by user interviews on the challenges of GS-based VR systems; the system claims to produce 'reasonable' parameters in 10 seconds. Validation consists of a baseline comparison against manual parameter tuning by participants experienced in physical simulation (authoring time and perceived quality), plus a user study of usability and user satisfaction.

Significance. If the LLM-derived parameters can be shown to produce physically plausible dynamics without per-asset calibration, the work would lower the barrier for non-experts to turn static GS reconstructions into interactive VR assets, with potential impact on asset libraries and content pipelines.

major comments (3)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the central claim that GPT-4o 'predict[s] reasonable parameters' whose VR interactions 'align with real-world phenomena' rests on subjective ratings and authoring-time comparison only; no quantitative error metrics, ground-truth measurements of predicted values (mass, friction, restitution), or controlled roll-outs against independent physics data are reported, leaving the modeling assumption untested.
  2. [Abstract] Abstract: the headline efficiency claim ('predict reasonable parameters in just 10 seconds') is stated without supporting timing data, measurement protocol, or variance across assets, so the 10-second figure cannot be assessed as load-bearing evidence.
  3. [Evaluation / User Study] Evaluation description: the baseline comparison with 'participants experienced in physical simulation' who 'manually adjust physical parameters' supplies no details on how 'reasonable' was judged, no inter-rater reliability, and no error bars or statistical tests, undermining the cross-condition claim of superior efficiency and quality.
minor comments (3)
  1. [System Design] Notation for physical parameters (mass, friction, restitution) is introduced without explicit equations or ranges used by the physics engine.
  2. [LLM Integration] The paper would benefit from a table listing the exact parameter set predicted by GPT-4o and the prompt template employed.
  3. [Figures] Figure captions and axis labels in the user-study results should be expanded for standalone readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be incorporated.

  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the central claim that GPT-4o 'predict[s] reasonable parameters' whose VR interactions 'align with real-world phenomena' rests on subjective ratings and authoring-time comparison only; no quantitative error metrics, ground-truth measurements of predicted values (mass, friction, restitution), or controlled roll-outs against independent physics data are reported, leaving the modeling assumption untested.

    Authors: We acknowledge that the evaluation relies on subjective user ratings and authoring-time comparisons rather than quantitative physical error metrics or ground-truth comparisons. Obtaining precise ground-truth values for parameters such as mass, friction, and restitution from real-world counterparts of the Gaussian assets would require additional experimental apparatus not included in the original study. We will revise the manuscript to explicitly discuss this methodological choice and its limitations while retaining the user-study validation approach. revision: partial

  2. Referee: [Abstract] Abstract: the headline efficiency claim ('predict reasonable parameters in just 10 seconds') is stated without supporting timing data, measurement protocol, or variance across assets, so the 10-second figure cannot be assessed as load-bearing evidence.

    Authors: The 10-second figure is the observed average processing time for GPT-4o inference on asset descriptions in our implementation. We will add the measurement protocol, including how timing was recorded and variance across assets, to the revised manuscript. revision: yes
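
A measurement protocol of the kind promised here could be as simple as timing repeated requests and reporting the mean with its spread. The sketch below assumes a hypothetical `fake_inference` stand-in that sleeps instead of calling any real API; the helper name `time_calls` and the repeat count are likewise illustrative, not the authors' procedure.

```python
import statistics
import time
from typing import Callable


def time_calls(fn: Callable[[], object], repeats: int = 5) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of wall-clock seconds."""
    if repeats < 2:
        raise ValueError("need at least two repeats to estimate variance")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def fake_inference() -> None:
    """Stand-in for one per-asset LLM request; sleeps instead of calling an API."""
    time.sleep(0.01)


mean_s, std_s = time_calls(fake_inference, repeats=5)
print(f"mean {mean_s:.3f} s, std {std_s:.3f} s")
```

Reporting the spread alongside the mean, and repeating the measurement per asset, would show whether the 10-second figure holds across assets or only on average.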

  3. Referee: [Evaluation / User Study] Evaluation description: the baseline comparison with 'participants experienced in physical simulation' who 'manually adjust physical parameters' supplies no details on how 'reasonable' was judged, no inter-rater reliability, and no error bars or statistical tests, undermining the cross-condition claim of superior efficiency and quality.

    Authors: We agree that further details are warranted. In the revision we will specify the judgment criteria for 'reasonable' parameters, report inter-rater reliability where applicable, and include error bars together with statistical tests supporting the efficiency and quality comparisons. revision: yes
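
For the inter-rater reliability mentioned above, one common choice for two raters is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal standard-library version; the choice of kappa, the label names, and the ratings are assumptions for illustration and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical judgments of whether six interactions look "reasonable".
rater_a = ["yes", "yes", "no", "yes", "no", "no"]
rater_b = ["yes", "no", "no", "yes", "no", "yes"]
kappa = cohens_kappa(rater_a, rater_b)
```

In this made-up example the raters agree on four of six items, yet kappa is only about 0.33 because half of that agreement is expected by chance.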

Circularity Check

0 steps flagged

No significant circularity; system uses external LLM inference validated against independent human baselines

The paper presents an applied VR system that feeds scene descriptions and interview insights into GPT-4o to infer physical parameters for Gaussian assets. Validation consists of (a) timing and quality comparisons against separate human experts who manually tune the same parameters and (b) a usability survey. No equations, fitted parameters, or self-citations appear in the provided text; the central claim that the LLM produces 'reasonable' parameters is tested against external human judgment rather than being defined by the authors' own outputs or prior self-referential results. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that an off-the-shelf LLM can map visual appearance to physically plausible parameters without domain-specific fine-tuning or measurement data. No free parameters are explicitly fitted in the abstract; the LLM itself functions as an implicit black-box predictor.

axioms (1)
  • Domain assumption: GPT-4o possesses sufficient commonsense physical knowledge to map object appearance to interaction parameters that match real-world behavior.
    Invoked when the system design uses the LLM to 'analyze key physical properties' without additional calibration.
