LIVE-GS: LLM Powers Interactive VR Experience with Physics-Aware Gaussian Splatting

Hangyu Zhou; Haotian Mao; Nianchen Deng; Siyue Wei; Xubo Yang; Yan Zhang; Yule Quan; Zhuoxiong Xu; Zixuan Guo

arxiv: 2412.09176 · v2 · submitted 2024-12-12 · 💻 cs.HC

Pith reviewed 2026-05-23 07:30 UTC · model grok-4.3

classification 💻 cs.HC

keywords 3D Gaussian Splatting · Virtual Reality · Large Language Models · Physics simulation · Interactive systems · Asset creation · User study

The pith

An LLM assigns physical parameters to static 3D Gaussian assets in 10 seconds for realistic VR interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes LIVE-GS, a VR system that uses large language models to turn static 3D Gaussian Splatting reconstructions into dynamic assets with physics. Interviews revealed user needs, which guided the use of GPT-4o to infer object properties such as mass and friction from visual data. The system claims these parameters enable natural VR interactions that match real-world behavior. It achieves this in roughly 10 seconds per asset while keeping visual quality intact during real-time use. Validation came from comparisons with manual expert tuning and a user study on usability.

Core claim

LIVE-GS shows that GPT-4o, informed by interviews and visual input from static Gaussian assets, can predict physical parameters that support realistic VR interactions in about 10 seconds. The approach replaces manual design or annotation, with results demonstrating that LLM-derived values produce interactions aligned with real-world phenomena while preserving high-quality rendering.

What carries the argument

GPT-4o inference of physical simulation parameters (mass, friction, and similar) for Gaussian Splatting objects to drive real-time physics in VR.
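
A minimal sketch of how an LLM reply could be turned into simulator-ready values, assuming the model is asked to answer in JSON. The field names (`category`, `mass_kg`, `friction`), the clamping bounds, and the helper `parse_physical_params` are hypothetical illustrations, not the paper's prompt format or API, and no GPT-4o call is made here. The example values mirror the wolf entry quoted in the Figure 8 caption.

```python
import json
from dataclasses import dataclass


@dataclass
class PhysicalParams:
    category: str    # assumed label, e.g. "granule" or "deformation"
    mass_kg: float
    friction: float


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def parse_physical_params(reply_text: str) -> PhysicalParams:
    """Parse a JSON reply and clamp values to assumed, illustrative bounds."""
    data = json.loads(reply_text)
    return PhysicalParams(
        category=str(data["category"]),
        mass_kg=clamp(float(data["mass_kg"]), 0.001, 1000.0),
        friction=clamp(float(data["friction"]), 0.0, 1.0),
    )


# Hypothetical reply text; a real system would receive this from the LLM.
example_reply = '{"category": "granule", "mass_kg": 0.5, "friction": 0.3}'
params = parse_physical_params(example_reply)
print(params)
```

Clamping is one possible guard against out-of-range guesses; whether LIVE-GS applies any such check is not stated in the material above.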

If this is right

Static Gaussian assets can be converted to interactive dynamic assets without manual parameter tuning.
VR interactions reflect real-world physical behavior based on the inferred parameters.
Visual quality and rendering performance remain high during real-time physics simulation.
Authoring time drops to seconds compared with expert manual adjustment.
User studies confirm improved efficiency and satisfaction for non-expert creators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inference approach could extend to other 3D scene representations beyond Gaussian Splatting.
Automated physics assignment might support large-scale libraries of ready-to-use interactive VR assets.
Limits of the method could be tested by applying it to scenes with many interacting objects or unusual materials.

Load-bearing premise

The LLM produces physical parameters that match real-world dynamics from only visual input and interview insights, without per-asset calibration.

What would settle it

A side-by-side VR test where objects with LLM-predicted parameters behave differently from objects with parameters measured from real physical counterparts.
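
One way such a comparison could be quantified is a per-parameter relative error between predicted and measured values. The sketch below assumes measurements exist for the same object; the `measured` numbers are made up for illustration and do not come from the paper.

```python
def relative_error(predicted: float, measured: float) -> float:
    """Absolute difference divided by the magnitude of the measured value."""
    if measured == 0:
        raise ValueError("relative error is undefined for a measured value of 0")
    return abs(predicted - measured) / abs(measured)


# Predicted values as an LLM might return them (illustrative).
predicted = {"mass_kg": 0.5, "friction": 0.3}
# Hypothetical bench measurements of the real object (not from the paper).
measured = {"mass_kg": 0.4, "friction": 0.3}

errors = {name: relative_error(predicted[name], measured[name]) for name in predicted}
for name, err in errors.items():
    print(f"{name}: {err:.1%} relative error")
```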

Figures

Figures reproduced from arXiv: 2412.09176 by Hangyu Zhou, Haotian Mao, Nianchen Deng, Siyue Wei, Xubo Yang, Yan Zhang, Yule Quan, Zhuoxiong Xu, Zixuan Guo.

**Figure 1.** Our system, LIVE-GS, reconstructs the scene with extra features and segments target objects. Most importantly, our …

**Figure 2.** System overview. Our system consists of three parts: scene reconstruction, scene enhancement, and an interactive framework. With the original images and initial point clouds, we train a Gaussian model with identity encoding and segment our targets through feature-mask segmentation. Afterwards, we leverage GPT to enhance our system's scene understanding by analyzing objects' properties and tracking possible artifacts…

**Figure 3.** Feature-mask segmentation. To segment objects from environments for subsequent processing, we design a feature-mask method, which is divided into two stages, a feature stage and a mask stage, for accurate segmentation. We choose the intersection of the results of the two stages as the outcome. Feature Stage: During the training procedure, we import a classifier C along with a 3D regularization loss,…

**Figure 4.** Artifacts tracking. We input the source image, the removal image, and the related mask into GPT-4o to obtain color prompts. Then we track the artifacts with DEVA and intersect them with the mask, generating the final mask for the 2D inpainting method. … the environment. The result is far beyond our requirement for subsequent simulation. Mask Stage: To obtain a finely segmented result, we adopt a voting strategy similar …

**Figure 6.** Particles filling in container. Six detections fail in the region above the surface, but five detections can succeed. Filling particles find the nearest surface in the shrinkage and inherit their attributes. For granular material, we choose a more common condition based on realistic situations, where the material sits in a container and Gaussian kernels are distributed only on the surface. To completely reconstruct it,…

**Figure 7.** Segmentation comparison with Gaussian-grouping. We render the images for the removed objects and the remaining environment, respectively. Our method achieves detailed segmentation in various situations, which benefits the simulation. 4.2 Segmentation. The segmentation largely impacts the subsequent processing, especially for dynamic objects. We choose the datasets teatime, figurines and bear, which consist of targets with di…

**Figure 8.** Physical simulation compared with PhysGS. We choose fox, sofa and wolf for our experiments. We achieve competitive simulation results while still maintaining high efficiency.

wolf

| | Category | Mass (kg) | Friction |
|---|---|---|---|
| 1 | granule | 0.5 | 0.3 |

fox

| | Category | Mass (kg) | Deformation Resistance | Plasticity |
|---|---|---|---|---|
| 1 | deformation | 0.5 | 0.3 | 0.2 |

**Figure 9.** Interaction with the wolf. In the first row, it is analyzed as a doll and we play with it. In the second row, it is analyzed as granular material like sand, using a dialogue prompt, for comparison with PhysGS. We draw "VR" and a smile on it.

**Figure 10.** Interaction in sofa. We analyze the physical properties of the pillows and the cushion. Then we apply a spring force to lift them up, with the two ends of the spring force marked by arrows, and observe the obvious distinction in their mass. Although predicted with the same mass values, they behave consistently with their visual appearance after being fixed by our correction factor.

**Figure 11.** Interaction in garden. First, we drag the flower and shake the vase. Then we throw a ball, breaking the vase into pieces. … to test the material analysis ability and interaction in our system. We demonstrate our analysis details and complete particular tasks in each scene. wolf and sofa: In this section, we analyze the physical properties without user dialogue for appointing. Different from Sec. 4.3, the wolf h…

**Figure 12.** Interaction in our custom dataset labdesk. Objects one to six are, respectively, a paper cup, a box of tissues, a teddy bear, a plastic hammer, a mug, and coffee powder in the mug. After segmentation and analysis, we hold the hammer and interact with the other objects in our demo. …tailed reconstruction and segmentation, integrating various physical simulations for real-time interaction consistent with visual effe…
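
The Figure 3 caption says the final segmentation is the intersection of the feature-stage and mask-stage results. A minimal sketch of that intersection step, assuming each stage yields a set of Gaussian indices; the index values and the helper `intersect_stages` are hypothetical, and the paper's classifier and voting strategy are not reproduced here.

```python
from typing import Set


def intersect_stages(feature_ids: Set[int], mask_ids: Set[int]) -> Set[int]:
    """Keep only the Gaussians selected by both stages."""
    return feature_ids & mask_ids


# Hypothetical Gaussian indices chosen by each stage.
feature_stage_ids = {0, 1, 2, 3, 5, 8}
mask_stage_ids = {2, 3, 4, 5, 9}

object_ids = intersect_stages(feature_stage_ids, mask_stage_ids)
print(sorted(object_ids))
```

Taking the intersection favors precision: a Gaussian that only one stage selects is left in the environment rather than attached to the object.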

Original abstract

As 3D Gaussian Splatting (3DGS) emerges as a leading approach for novel view synthesis and scene reconstruction, its potential in digital asset creation has gained significant attention. An increasing number of asset libraries based on GS are being established. However, generating physics-based dynamic assets remains a time-consuming and expertise-intensive task, especially for non-experts. In this paper, we propose LIVE-GS, a highly realistic Virtual Reality (VR) system powered by Large Language Models (LLMs), which enables rapid creation of dynamic Gaussian assets and real-time VR interactions. To inform our system design, we conducted interviews to examine challenges faced by current GS-based VR systems and the specific demands of users. Based on these insights, we employed GPT-4o to analyze key physical properties of objects that significantly impact user interactions, ensuring physics-based interactions in VR align with real-world phenomena. A key innovation of LIVE-GS is its ability to predict reasonable parameters in just 10 seconds from static Gaussian assets while maintaining high-quality VR interactions. To validate our approach, we invited participants experienced in physical simulation to manually adjust physical parameters, providing a baseline for comparison in both asset quality and authoring efficiency. We also conducted a comprehensive user study to evaluate system usability and user satisfaction. Experimental results demonstrate that LIVE-GS, leveraging LLMs' scene understanding capabilities, can achieve efficient physical scene creation and natural interactions without requiring manual design or annotation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LIVE-GS shows a working pipeline that feeds static Gaussian assets to GPT-4o for quick physics parameter guesses, then compares the result to manual tuning in a user study.

The paper builds a VR system that takes a static 3D Gaussian asset, prompts GPT-4o with visual and interview-derived cues, and outputs parameters such as mass, friction, and restitution so the asset can be dropped into a physics-enabled scene. The authors first ran interviews to surface real authoring pain points, then implemented the LLM step and measured authoring time against experts who tuned the same parameters by hand. They also collected usability ratings from participants in a follow-up study. The 10-second figure and the positive ratings on interaction quality are the concrete outputs they report.

That workflow is new in the cited literature; prior work on 3DGS and LLMs had not combined them for this exact asset-to-physics step in VR. The interviews give the design some user grounding, and the baseline comparison at least quantifies the time difference.

The central limitation is that “reasonable” parameters are judged only by how the tuned assets feel to users and by how long experts take to match them. No section compares the LLM outputs against independent measurements, controlled drop tests on known objects, or error against ground-truth physics values. The evaluation therefore stays at the level of preference and efficiency rather than physical fidelity.

This work is aimed at HCI and VR groups that already use Gaussian assets and need faster authoring tools. It is concrete enough and addresses a documented workflow bottleneck, so it deserves a serious referee even though the physics validation could be tightened.

Referee Report

3 major / 3 minor

Summary. The paper presents LIVE-GS, a VR system that uses GPT-4o to analyze static 3D Gaussian Splatting assets and predict physical parameters (mass, friction, restitution, etc.) for real-time physics-aware interactions. The design is informed by user interviews on GS-VR challenges; the system claims to produce 'reasonable' parameters in 10 seconds. Validation consists of a baseline comparison against manual parameter tuning by participants experienced in physical simulation (authoring time and perceived quality), plus a user study of usability and satisfaction.

Significance. If the LLM-derived parameters can be shown to produce physically plausible dynamics without per-asset calibration, the work would lower the barrier for non-experts to turn static GS reconstructions into interactive VR assets, with potential impact on asset libraries and content pipelines.

major comments (3)

[Abstract / Evaluation] Abstract and Evaluation section: the central claim that GPT-4o 'predict[s] reasonable parameters' whose VR interactions 'align with real-world phenomena' rests on subjective ratings and authoring-time comparison only; no quantitative error metrics, ground-truth measurements of predicted values (mass, friction, restitution), or controlled roll-outs against independent physics data are reported, leaving the modeling assumption untested.
[Abstract] Abstract: the headline efficiency claim ('predict reasonable parameters in just 10 seconds') is stated without supporting timing data, measurement protocol, or variance across assets, so the 10-second figure cannot be assessed as load-bearing evidence.
[Evaluation / User Study] Evaluation description: the baseline comparison with 'participants experienced in physical simulation' who 'manually adjust physical parameters' supplies no details on how 'reasonable' was judged, no inter-rater reliability, and no error bars or statistical tests, undermining the cross-condition claim of superior efficiency and quality.

minor comments (3)

[System Design] Notation for physical parameters (mass, friction, restitution) is introduced without explicit equations or ranges used by the physics engine.
[LLM Integration] The paper would benefit from a table listing the exact parameter set predicted by GPT-4o and the prompt template employed.
[Figures] Figure captions and axis labels in the user-study results should be expanded for standalone readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be incorporated.

Referee: [Abstract / Evaluation] Abstract and Evaluation section: the central claim that GPT-4o 'predict[s] reasonable parameters' whose VR interactions 'align with real-world phenomena' rests on subjective ratings and authoring-time comparison only; no quantitative error metrics, ground-truth measurements of predicted values (mass, friction, restitution), or controlled roll-outs against independent physics data are reported, leaving the modeling assumption untested.

Authors: We acknowledge that the evaluation relies on subjective user ratings and authoring-time comparisons rather than quantitative physical error metrics or ground-truth comparisons. Obtaining precise ground-truth values for parameters such as mass, friction, and restitution from real-world counterparts of the Gaussian assets would require additional experimental apparatus not included in the original study. We will revise the manuscript to explicitly discuss this methodological choice and its limitations while retaining the user-study validation approach.

revision: partial

Referee: [Abstract] Abstract: the headline efficiency claim ('predict reasonable parameters in just 10 seconds') is stated without supporting timing data, measurement protocol, or variance across assets, so the 10-second figure cannot be assessed as load-bearing evidence.

Authors: The 10-second figure is the observed average processing time for GPT-4o inference on asset descriptions in our implementation. We will add the measurement protocol, including how timing was recorded and variance across assets, to the revised manuscript.

revision: yes

Referee: [Evaluation / User Study] Evaluation description: the baseline comparison with 'participants experienced in physical simulation' who 'manually adjust physical parameters' supplies no details on how 'reasonable' was judged, no inter-rater reliability, and no error bars or statistical tests, undermining the cross-condition claim of superior efficiency and quality.

Authors: We agree that further details are warranted. In the revision we will specify the judgment criteria for 'reasonable' parameters, report inter-rater reliability where applicable, and include error bars together with statistical tests supporting the efficiency and quality comparisons.

revision: yes
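
The timing protocol requested above could look roughly like the following sketch, which records wall-clock time per asset and reports mean and spread. `fake_inference` is a stand-in for the real LLM call, and the helper names are hypothetical; nothing here reproduces the authors' actual measurement setup.

```python
import statistics
import time


def time_call(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def summarize(durations_s):
    """Summary statistics for a list of per-asset durations in seconds."""
    return {
        "n": len(durations_s),
        "mean_s": statistics.mean(durations_s),
        "stdev_s": statistics.stdev(durations_s) if len(durations_s) > 1 else 0.0,
        "min_s": min(durations_s),
        "max_s": max(durations_s),
    }


def fake_inference(asset_name):
    """Placeholder for the LLM request; returns immediately."""
    return {"asset": asset_name}


# Asset names taken from the figure captions; the timings are of the placeholder only.
durations = [time_call(fake_inference, name)[1] for name in ["wolf", "fox", "sofa"]]
summary = summarize(durations)
print(summary)
```

Reporting the standard deviation and the min/max alongside the mean would let a reader judge how representative a single headline number is.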

Circularity Check

0 steps flagged

No significant circularity; system uses external LLM inference validated against independent human baselines

The paper presents an applied VR system that feeds scene descriptions and interview insights into GPT-4o to infer physical parameters for Gaussian assets. Validation consists of (a) timing and quality comparisons against separate human experts who manually tune the same parameters and (b) a usability survey. No equations, fitted parameters, or self-citations appear in the provided text; the central claim that the LLM produces 'reasonable' parameters is tested against external human judgment rather than being defined by the authors' own outputs or prior self-referential results. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested premise that an off-the-shelf LLM can map visual appearance to physically plausible parameters without domain-specific fine-tuning or measurement data. No free parameters are explicitly fitted in the abstract; the LLM itself functions as an implicit black-box predictor.

axioms (1)

domain assumption: GPT-4o possesses sufficient commonsense physical knowledge to map object appearance to interaction parameters that match real-world behavior.
Invoked when the system design uses the LLM to 'analyze key physical properties' without additional calibration.

pith-pipeline@v0.9.0 · 5816 in / 1295 out tokens · 19344 ms · 2026-05-23T07:30:30.796566+00:00 · methodology

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 3 internal anchors

[1]

J. Bae, S. Kim, Y . Yun, H. Lee, G. Bang, and Y . Uh. Per-gaussian embedding-based deformation for deformable 3d gaussian splatting. arXiv preprint arXiv:2404.03613, 2024. 2

work page arXiv 2024
[2]

J. Cen, J. Fang, C. Yang, L. Xie, X. Zhang, W. Shen, and Q. Tian. Segment any 3d gaussians, 2024. 2

work page 2024
[3]

H. K. Cheng, S. W. Oh, B. Price, A. Schwing, and J.-Y . Lee. Tracking anything with decoupled video segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pp. 1316– 1326, 2023. 2, 3, 5

work page 2023
[4]

W.-H. Chu, L. Ke, and K. Fragkiadaki. Dreamscene4d: Dynamic multi-object scene generation from monocular videos. arXiv preprint arXiv:2405.02280, 2024. 2

work page arXiv 2024
[5]

N. Deng, Z. He, J. Ye, B. Duinkharjav, P. Chakravarthula, X. Yang, and Q. Sun. Fov-nerf: Foveated neural radiance fields for virtual re- ality. IEEE Transactions on Visualization and Computer Graphics , 28(11):3854–3864, 2022. 1

work page 2022
[6]

R. Ding, J. Yang, C. Xue, W. Zhang, S. Bai, and X. Qi. Pla: Language- driven open-vocabulary 3d scene understanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition , pp. 7010–7019, 2023. 2

work page 2023
[7]

B. Dou, T. Zhang, Y . Ma, Z. Wang, and Z. Yuan. Cosseggaussians: Compact and swift scene segmenting 3d gaussians. arXiv preprint arXiv:2401.05925, 2024. 2

work page arXiv 2024
[8]

B. P. Duisterhof, Z. Mandi, Y . Yao, J.-W. Liu, M. Z. Shou, S. Song, and J. Ichnowski. Md-splatting: Learning metric deformation from 4d gaussians in highly deformable scenes. arXiv preprint arXiv:2312.00583, 2023. 2

work page arXiv 2023
[9]

L. Fan, Y . Yang, M. Li, H. Li, and Z. Zhang. Trim 3d gaus- sian splatting for accurate geometry representation. arXiv preprint arXiv:2406.07499, 2024. 2

work page arXiv 2024
[10]

Geiger, S

A. Geiger, S. Gao, A. Chen, Z. Yu, and B. Huang. 2d gaussian splat- ting for geometrically accurate radiance fields. 2024. 2

work page 2024
[11]

Girdhar, A

R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V . Alwala, A. Joulin, and I. Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180–15190, 2023. 2

work page 2023
[12]

S. Guan, H. Deng, Y . Wang, and X. Yang. Neurofluid: Fluid dynamics grounding with particle-driven neural radiance fields. InInternational Conference on Machine Learning, pp. 7919–7929. PMLR, 2022. 2

work page 2022
[13]

Gu ´edon and V

A. Gu ´edon and V . Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5354–5363, 2024. 2

work page 2024
[14]

Z. Guo, W. Zhou, L. Li, M. Wang, and H. Li. Motion-aware 3d gaussian splatting for efficient dynamic scene reconstruction. arXiv preprint arXiv:2403.11447, 2024. 2

work page arXiv 2024
[15]

X. Hu, Y . Wang, L. Fan, J. Fan, J. Peng, Z. Lei, Q. Li, and Z. Zhang. Semantic anything in 3d gaussians. arXiv preprint arXiv:2401.17857,

work page arXiv
[16]

Huang, H

J. Huang, H. Yu, J. Zhang, and H. Nait-Charif. Point’n move: Inter- active scene object manipulation on gaussian splatting radiance fields. IET Image Processing, 2023. 2

work page 2023
[17]

Huang, Y .-T

Y .-H. Huang, Y .-T. Sun, Z. Yang, X. Lyu, Y .-P. Cao, and X. Qi. Sc- gs: Sparse-controlled gaussian splatting for editable dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4220–4230, 2024. 2

work page 2024
[18]

C. Jia, Y . Yang, Y . Xia, Y .-T. Chen, Z. Parekh, H. Pham, Q. Le, Y .-H. Sung, Z. Li, and T. Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp. 4904–4916. PMLR, 2021. 2

work page 2021
[19]

Vr-gs: A physical dynamics-aware interactive gaussian splatting system in virtual reality,

Y . Jiang, C. Yu, T. Xie, X. Li, Y . Feng, H. Wang, M. Li, H. Lau, F. Gao, Y . Yang, et al. Vr-gs: A physical dynamics-aware interactive gaussian splatting system in virtual reality. arXiv preprint arXiv:2401.16663,

work page arXiv
[20]

Kavan, S

L. Kavan, S. Collins, J. ˇZ´ara, and C. O’Sullivan. Skinning with dual quaternions. In Proceedings of the 2007 symposium on Interactive 3D graphics and games, pp. 39–46, 2007. 2

work page 2007
[21]

Kerbl, G

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1–14, 2023. 1, 2, 3

work page 2023
[22]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, et al. Segment any- thing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015–4026, 2023. 2, 3

work page 2023
[23]

Li, Y .-L

X. Li, Y .-L. Qiao, P. Y . Chen, K. M. Jatavallabhula, M. Lin, C. Jiang, and C. Gan. Pac-nerf: Physics augmented continuum neural radi- ance fields for geometry-agnostic system identification.arXiv preprint arXiv:2303.05512, 2023. 2

work page arXiv 2023
[24]

Liang, Y

H. Liang, Y . Yin, D. Xu, H. Liang, Z. Wang, K. N. Plataniotis, Y . Zhao, and Y . Wei. Diffusion4d: Fast spatial-temporal consistent 4d gener- ation via video diffusion models. arXiv preprint arXiv:2405.16645 ,

work page arXiv
[25]

G. Liao, J. Li, Z. Bao, X. Ye, J. Wang, Q. Li, and K. Liu. Clip-gs: Clip-informed gaussian splatting for real-time and view-consistent 3d semantic understanding. arXiv preprint arXiv:2404.14249, 2024. 2

work page arXiv 2024
[26]

G. Liao, K. Zhou, Z. Bao, K. Liu, and Q. Li. Ov-nerf: Open- vocabulary neural radiance fields with vision and language foun- dation models for 3d semantic understanding. arXiv preprint arXiv:2402.04648, 2024. 2

work page arXiv 2024
[27]

Z. Liu, H. Ouyang, Q. Wang, K. L. Cheng, J. Xiao, K. Zhu, N. Xue, Y . Liu, Y . Shen, and Y . Cao. Infusion: Inpainting 3d gaussians via learning depth completion from diffusion prior. arXiv preprint arXiv:2404.11613, 2024. 2

work page arXiv 2024
[28]

Loper, N

M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pp. 851–866. 2023. 2

work page 2023
[29]

T. Lu, M. Yu, L. Xu, Y . Xiangli, L. Wang, D. Lin, and B. Dai. Scaffold- gs: Structured 3d gaussians for view-adaptive rendering. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20654–20664, 2024. 2

work page 2024
[30]

Y . Lu, C. Xu, X. Wei, X. Xie, M. Tomizuka, K. Keutzer, and S. Zhang. Open-vocabulary point-cloud object detection without 3d annotation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1190–1199, 2023. 2

work page 2023
[31]

Mildenhall, P

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoor- thi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. 1

work page 2021
[32]

M ¨uller, B

M. M ¨uller, B. Heidelberger, M. Hennix, and J. Ratcliff. Position based dynamics. Journal of Visual Communication and Image Representa- tion, 18(2):109–118, 2007. 2, 5

work page 2007
[33]

M ¨uller, A

T. M ¨uller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022. 6

work page 2022
[34]

S. Peng, K. Genova, C. Jiang, A. Tagliasacchi, M. Pollefeys, T. Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pp. 815–824, 2023. 2

work page 2023
[35]

Y .-L. Qiao, A. Gao, and M. Lin. Neuphysics: Editable neural geome- try and physics from monocular videos. Advances in Neural Informa- tion Processing Systems, 35:12841–12854, 2022. 2

work page 2022
[36]

M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister. Langsplat: 3d lan- guage gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20051–20060, 2024. 2

work page 2024
[37]

R.-Z. Qiu, G. Yang, W. Zeng, and X. Wang. Feature splatting: Language-driven physics-based scene synthesis and editing. arXiv preprint arXiv:2404.01223, 2024. 2

work page arXiv 2024
[38]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transfer- able visual models from natural language supervision. InInternational conference on machine learning, pp. 8748–8763. PMLR, 2021. 2

work page 2021
[39]

J. Ren, L. Pan, J. Tang, C. Zhang, A. Cao, G. Zeng, and Z. Liu. Dreamgaussian4d: Generative 4d gaussian splatting. arXiv preprint arXiv:2312.17142, 2023. 2

work page arXiv 2023
[40]

K. Ren, L. Jiang, T. Lu, M. Yu, L. Xu, Z. Ni, and B. Dai. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaus- sians. arXiv preprint arXiv:2403.17898, 2024. 2

work page arXiv 2024
[41]

J.-C. Shi, M. Wang, H.-B. Duan, and S.-H. Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. In Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5333–5343, 2024. 2

work page 2024
[42]

Siddiqui, L

Y . Siddiqui, L. Porzi, S. R. Bul ´o, N. M ¨uller, M. Nießner, A. Dai, and P. Kontschieder. Panoptic lifting for 3d scene understanding with neu- ral fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9043–9052, 2023. 2

work page 2023
[43]

M. C. Silva, M. Dahaghin, M. Toso, and A. Del Bue. Contrastive gaussian clustering: Weakly supervised 3d scene segmentation. arXiv preprint arXiv:2404.12784, 2024. 2

work page arXiv 2024
[44]

Snavely, S

N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3d. In ACM siggraph 2006 papers, pp. 835–846

work page 2006
[45]

Suvorov, E

R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V . Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp. 2149–2159, 2022. 2, 4

work page 2022
[46]

Openmask3d: Open-vocabulary 3d instance segmenta- tion,

A. Takmaz, E. Fedele, R. W. Sumner, M. Pollefeys, F. Tombari, and F. Engelmann. Openmask3d: Open-vocabulary 3d instance segmen- tation. arXiv preprint arXiv:2306.13631, 2023. 2

work page arXiv 2023
[47]

O. S. D. Team. Obi solver. https://obi.virtualmethodstudio. com/, 2024. 5

work page 2024
[48]

Topsakal and T

O. Topsakal and T. C. Akinci. Creating large language model applica- tions utilizing langchain: A primer on developing llm apps fast. In In- ternational Conference on Applied Engineering and Natural Sciences, vol. 1, pp. 1050–1056, 2023. 5

work page 2023
[49]

Turkulainen, X

M. Turkulainen, X. Ren, I. Melekhov, O. Seiskari, E. Rahtu, and J. Kannala. Dn-splatter: Depth and normal priors for gaussian splat- ting and meshing. arXiv preprint arXiv:2403.17822, 2024. 2

work page arXiv 2024
[50]

J. Wang, J. Fang, X. Zhang, L. Xie, and Q. Tian. Gaussianeditor: Edit- ing 3d gaussians delicately with text instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pp. 20902–20911, 2024. 2

work page 2024
[51]

Y . Wolf, A. Bracha, and R. Kimmel. Surface reconstruction from gaussian splatting via novel stereo views. arXiv preprint arXiv:2404.01810, 2024. 2

work page arXiv 2024
[52]

T. Xie, Z. Zong, Y . Qiu, X. Li, Y . Feng, Y . Yang, and C. Jiang. Phys- gaussian: Physics-integrated 3d gaussians for generative dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4389–4398, 2024. 2, 3, 5, 6

work page 2024
[53]

L. Xu, V . Agrawal, W. Laney, T. Garcia, A. Bansal, C. Kim, S. Rota Bul `o, L. Porzi, P. Kontschieder, A. Bo ˇziˇc, et al. Vr-nerf: High-fidelity virtualized walkable spaces. In SIGGRAPH Asia 2023 Conference Papers, pp. 1–12, 2023. 1

work page 2023
[54]

S. Yan, T. Zhu, Z. Wang, Y . Cao, M. Zhang, S. Ghosh, Y . Wu, and J. Yu. Videococa: Video-text modeling with zero-shot transfer from contrastive captioners. arXiv preprint arXiv:2212.04979, 2022. 2

work page arXiv 2022
[55]

Z. Yang, X. Gao, W. Zhou, S. Jiao, Y . Zhang, and X. Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruc- tion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20331–20341, 2024. 2

work page 2024
[56]

M. Ye, M. Danelljan, F. Yu, and L. Ke. Gaussian grouping: Segment and edit anything in 3d scenes. arXiv preprint arXiv:2312.00732 ,

work page arXiv
[57]

Y . Yin, D. Xu, Z. Wang, Y . Zhao, and Y . Wei. 4dgen: Grounded 4d content generation with spatial-temporal consistency. arXiv preprint arXiv:2312.17225, 2023. 2

work page arXiv 2023
[58]

Yuan, Y .-T

Y .-J. Yuan, Y .-T. Sun, Y .-K. Lai, Y . Ma, R. Jia, and L. Gao. Nerf- editing: geometry editing of neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pp. 18353–18364, 2022. 2

work page 2022
[59]

A. Zeng, M. Attarian, B. Ichter, K. Choromanski, A. Wong, S. Welker, F. Tombari, A. Purohit, M. Ryoo, V . Sindhwani, et al. Socratic mod- els: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022. 2

[60]


H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022. 2

[61]


H. Zhang, X. Li, and L. Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. 2

[62]


T. Zhang, H.-X. Yu, R. Wu, B. Y. Feng, C. Zheng, N. Snavely, J. Wu, and W. T. Freeman. Physdreamer: Physics-based interaction with 3d objects via video generation. arXiv preprint arXiv:2404.13026, 2024. 2, 4

[63]

S. Zhi, T. Laidlow, S. Leutenegger, and A. J. Davison. In-place scene labelling and understanding with implicit scene representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15838–15847, 2021. 2

[64]

H. Zhou, J. Shao, L. Xu, D. Bai, W. Qiu, B. Liu, Y. Wang, A. Geiger, and Y. Liao. Hugs: Holistic urban 3d scene understanding via gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21336–21345, 2024. 2

[65]


M. Zwicker, H. Pfister, J. Van Baar, and M. Gross. Ewa volume splatting. In Proceedings Visualization, 2001. VIS’01., pp. 29–538. IEEE,

