arxiv: 2512.03210 · v2 · submitted 2025-12-02 · 💻 cs.CV · cs.LG · cs.RO

Flux4D: Flow-based Unsupervised 4D Reconstruction

Pith reviewed 2026-05-17 01:55 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · cs.RO
keywords 4D reconstruction · unsupervised learning · 3D Gaussian Splatting · dynamic scenes · photometric loss · motion dynamics · scene reconstruction · autonomous driving

The pith

Flux4D reconstructs large-scale dynamic scenes unsupervised by predicting 3D Gaussians and their motion dynamics from raw data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flux4D predicts 3D Gaussians along with their motion to reconstruct observations from sensors. It relies solely on photometric losses and an as-static-as-possible regularization while training across multiple scenes. This setup allows it to separate dynamic elements without motion labels, pre-trained models, or other priors. The framework achieves fast reconstruction and better generalization to new scenes and objects than prior approaches. If the central claim holds, it would let systems build 4D models of driving environments directly from video collections.

Core claim

Flux4D directly predicts 3D Gaussians and their motion dynamics to reconstruct sensor observations in a fully unsupervised manner. By adopting only photometric losses and enforcing an 'as static as possible' regularization, Flux4D learns to decompose dynamic elements directly from raw data without requiring pre-trained supervised models or foundational priors simply by training across many scenes.

What carries the argument

Direct prediction of 3D Gaussians together with their motion dynamics, regularized to remain as static as possible, which decomposes moving elements across multiple raw scenes.
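
A minimal sketch of what "Gaussians plus motion, kept as static as possible" could look like in code. It assumes constant per-Gaussian velocities and a mean-speed penalty; both are illustrative choices, and the function names `advect_centers` and `static_penalty` are hypothetical rather than taken from the paper.

```python
import numpy as np

def advect_centers(centers, velocities, t, t_ref=0.0):
    """Move Gaussian centers along per-Gaussian velocities.

    Assumption: linear (constant-velocity) motion relative to a reference time.
    """
    return centers + (t - t_ref) * velocities

def static_penalty(velocities):
    """Mean L2 speed over all Gaussians.

    Hypothetical stand-in for an 'as static as possible' term: it is zero
    only when every Gaussian is predicted to be stationary.
    """
    return float(np.mean(np.linalg.norm(velocities, axis=-1)))

# Two Gaussians: one static, one moving at 2 units/s along x.
centers = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
velocities = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])

moved = advect_centers(centers, velocities, t=0.5)
penalty = static_penalty(velocities)
```

In this toy setup only the second Gaussian moves, so the penalty is driven entirely by it; a model minimizing such a term would assign motion only where the photometric error demands it.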

If this is right

  • Dynamic scenes can be reconstructed efficiently within seconds.
  • The method scales to large collections of driving data without per-scene tuning.
  • It generalizes to unseen environments and to rare or unknown objects.
  • It outperforms existing methods on scalability, generalization, and reconstruction quality for outdoor driving datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regularization principle could be tested on indoor sequences to check whether static assumptions hold when objects interact closely.
  • Combining the predicted Gaussians with existing SLAM pipelines might improve real-time dynamic mapping in vehicles.
  • If the multi-scene training proves robust, similar direct-prediction approaches could be applied to other 3D representations beyond Gaussians.

Load-bearing premise

An 'as static as possible' regularization term combined with photometric losses is sufficient to correctly decompose dynamic elements from raw multi-scene data without any pre-trained supervised models or foundational priors.
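
One plausible way to write the objective this premise describes is given below. The exact form is not stated in the text available here, so every symbol is an assumption: $\mathcal{P}$ is the set of supervised pixels, $\hat{I}_\theta$ the rendered image, $I$ the observed image, $v_i$ the predicted velocity of Gaussian $i$, $M$ the number of Gaussians, and $\lambda_{\text{static}}$ the regularization weight.

```latex
\mathcal{L}(\theta)
  = \underbrace{\frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}}
      \bigl\lVert \hat{I}_\theta(p) - I(p) \bigr\rVert_1}_{\text{photometric}}
  + \lambda_{\text{static}}
    \underbrace{\frac{1}{M} \sum_{i=1}^{M} \lVert v_i \rVert_2}_{\text{as static as possible}}
```

Under this reading, the premise amounts to the claim that minimizing the sum over many scenes places nonzero $v_i$ on genuinely moving content rather than on appearance changes.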

What would settle it

Evaluate reconstruction quality on a held-out driving sequence containing previously unseen object motions, such as an unusual pedestrian path. The claim would be undermined if the Gaussians fail to track the motion or produce visible artifacts in novel views.

Figures

Figures reproduced from arXiv: 2512.03210 by Henry Che, Jingkang Wang, Lily Goli, Raquel Urtasun, Sivabalan Manivasagam, Yun Chen, Ze Yang.

Figure 1. Flux4D is a simple and scalable framework for unsupervised 4D reconstruction. Left: Paradigms for 4D reconstruction. Right: realism-speed comparisons with existing works.
Figure 2. Model overview. Flux4D reconstructs 4D world by predicting 3D Gaussians with velocities given unlabelled sensor observations, and trained with the photometric reconstruction objective. The resultant model can be used for RGB and flow synthesis from novel views.
Figure 3. Qualitative results for NVS on PandaSet. Rendered RGB images from novel views show that our method achieves better image quality across a variety of urban scenes, with crisper edges and sharper dynamic actors compared to baselines. Panel labels: GT, NeuRAD, G3R, EmerNeRF, DeSiRe-GS, Ours (reconstruction with labels vs. without labels).
Figure 4. NVS on longer-horizon logs. Qualitative comparison shows that our method outperforms SoTA unsupervised baselines, by maintaining better estimation of actor movements in longer horizon. We shrink the gap in quality to supervised methods.
Figure 5. Estimating motion flows. We compare our estimated motion with prior unsupervised methods through rendered flow, showing accurate static region detection and sharper actor flow edges.
Figure 6. High-fidelity flow and RGB reconstruction. Flux4D not only provides photorealistic reconstruction of the dynamic scene but also estimates actors’ motion flow with high precision.
Figure 7. Flux4D reconstruction on Argoverse 2 and WOD.
Figure 9. Simulation applications. Flux4D can be applied successfully to different camera simulation tasks, e.g., actor removal, insertion and manipulation.

Original abstract

Reconstructing large-scale dynamic scenes from visual observations is a fundamental challenge in computer vision, with critical implications for robotics and autonomous systems. While recent differentiable rendering methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have achieved impressive photorealistic reconstruction, they suffer from scalability limitations and require annotations to decouple actor motion. Existing self-supervised methods attempt to eliminate explicit annotations by leveraging motion cues and geometric priors, yet they remain constrained by per-scene optimization and sensitivity to hyperparameter tuning. In this paper, we introduce Flux4D, a simple and scalable framework for 4D reconstruction of large-scale dynamic scenes. Flux4D directly predicts 3D Gaussians and their motion dynamics to reconstruct sensor observations in a fully unsupervised manner. By adopting only photometric losses and enforcing an "as static as possible" regularization, Flux4D learns to decompose dynamic elements directly from raw data without requiring pre-trained supervised models or foundational priors simply by training across many scenes. Our approach enables efficient reconstruction of dynamic scenes within seconds, scales effectively to large datasets, and generalizes well to unseen environments, including rare and unknown objects. Experiments on outdoor driving datasets show Flux4D significantly outperforms existing methods in scalability, generalization, and reconstruction quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents Flux4D, a scalable unsupervised framework for 4D reconstruction of large-scale dynamic scenes. It directly predicts 3D Gaussians together with per-Gaussian motion dynamics, trained end-to-end across many scenes using only photometric reconstruction losses plus an 'as static as possible' regularization term. The method claims to decompose static and dynamic elements without pre-trained models, annotations, or per-scene optimization, enabling second-scale inference, strong generalization to unseen objects, and superior performance on outdoor driving datasets relative to prior self-supervised approaches.

Significance. If the central unsupervised decomposition claim holds with rigorous verification, the work would be significant for enabling annotation-free 4D reconstruction at scale, with direct relevance to robotics and autonomous driving. The multi-scene training strategy and avoidance of foundational priors are notable strengths that could improve generalization over per-scene methods.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the claim of 'significantly outperforms existing methods' is unsupported by any reported quantitative metrics, error bars, ablation tables, or details on how the static-regularization weight was selected or validated across datasets. Without these, the central experimental claims cannot be assessed.
  2. [§3.2] §3.2 (Regularization): the 'as static as possible' term is load-bearing for the unsupervised motion decomposition claim, yet it is unclear whether its weight is a fixed hyperparameter or effectively tuned per dataset. If the latter, the decomposition may be circular rather than emergent from photometric losses alone.
  3. [§3] §3 (Method): the paper does not report any diagnostic that the learned per-Gaussian trajectories match independent motion cues (e.g., optical flow or LiDAR) rather than absorbing dynamics into static Gaussian attributes (position, opacity, or SH coefficients). This leaves the under-constrained decomposition unverified.
minor comments (2)
  1. [§3.2] Clarify the exact mathematical form of the regularization term and its weighting schedule in the loss.
  2. [§5] Add a limitations paragraph discussing failure modes on highly dynamic or occluded scenes.
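
The diagnostic requested in major comment 3 could be sketched as follows, assuming a reference 3D flow is available from an independent source such as a LiDAR scene-flow estimate. The function names, the 0.5 speed threshold, and the toy arrays are hypothetical; nothing here reproduces the paper's evaluation protocol.

```python
import numpy as np

def endpoint_error(pred_flow, ref_flow):
    """Mean Euclidean distance between predicted and reference flow vectors."""
    return float(np.mean(np.linalg.norm(pred_flow - ref_flow, axis=-1)))

def dynamic_recall(pred_flow, ref_flow, speed_thresh=0.5):
    """Fraction of reference-dynamic points that the prediction also marks as moving.

    A low value would suggest that motion is being absorbed somewhere other
    than the per-Gaussian motion parameters.
    """
    ref_dyn = np.linalg.norm(ref_flow, axis=-1) > speed_thresh
    pred_dyn = np.linalg.norm(pred_flow, axis=-1) > speed_thresh
    if not ref_dyn.any():
        return float("nan")  # no moving points in the reference
    return float(np.mean(pred_dyn[ref_dyn]))

# Toy data: three points, two of which move in the reference flow.
ref = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [1.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 0.0]])

epe = endpoint_error(pred, ref)
recall = dynamic_recall(pred, ref)
```

Reporting both numbers matters because a model that predicts zero motion everywhere can still achieve a low endpoint error on mostly static scenes while missing every moving actor.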

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each of the major comments below and outline the corresponding revisions.

  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim of 'significantly outperforms existing methods' is unsupported by any reported quantitative metrics, error bars, ablation tables, or details on how the static-regularization weight was selected or validated across datasets. Without these, the central experimental claims cannot be assessed.

    Authors: We agree that quantitative support is essential for the performance claims. In the revised version, we will add detailed quantitative comparisons, including tables with PSNR, SSIM, and other metrics, along with error bars from multiple runs. We will also include ablation studies on the regularization weight selection process, validated across datasets using a held-out validation set. revision: yes

  2. Referee: [§3.2] §3.2 (Regularization): the 'as static as possible' term is load-bearing for the unsupervised motion decomposition claim, yet it is unclear whether its weight is a fixed hyperparameter or effectively tuned per dataset. If the latter, the decomposition may be circular rather than emergent from photometric losses alone.

    Authors: The regularization weight is a fixed hyperparameter used consistently across all experiments and datasets. We will revise §3.2 to explicitly state this and provide the specific value along with justification based on preliminary experiments on a small set of scenes to ensure the decomposition emerges primarily from the photometric losses and multi-scene training. revision: yes

  3. Referee: [§3] §3 (Method): the paper does not report any diagnostic that the learned per-Gaussian trajectories match independent motion cues (e.g., optical flow or LiDAR) rather than absorbing dynamics into static Gaussian attributes (position, opacity, or SH coefficients). This leaves the under-constrained decomposition unverified.

    Authors: We acknowledge the value of such diagnostics for verifying the motion decomposition. In the revised manuscript, we will add qualitative and quantitative comparisons of the predicted trajectories against optical flow and LiDAR-derived motion where available, to demonstrate that dynamics are captured in the per-Gaussian motion parameters rather than static attributes. revision: yes
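
The first response promises PSNR tables with error bars from multiple runs. A minimal sketch of that bookkeeping is shown below, assuming images are float arrays in [0, 1] and that the spread is reported as a sample standard deviation; the helper names and toy data are illustrative only.

```python
import numpy as np

def psnr(rendered, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((rendered - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def summarize(values):
    """Mean and sample standard deviation (ddof=1) across runs."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))

# Toy example: two "runs" that differ from the target by a constant offset.
target = np.zeros((4, 4, 3))
runs = [target + 0.1, target + 0.01]

scores = [psnr(r, target) for r in runs]
mean_psnr, std_psnr = summarize(scores)
```

With only a handful of runs the standard deviation is itself noisy, so the number of runs behind each error bar should be stated alongside the table.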

Circularity Check

0 steps flagged

No significant circularity: derivation self-contained via proposed regularization and multi-scene training

The abstract and provided text present Flux4D as a new framework that directly predicts 3D Gaussians and motion dynamics using only photometric losses plus an 'as static as possible' regularization term, trained across many scenes to decompose dynamics without pre-trained models. No equations, self-citations, or fitted parameters are shown that reduce any prediction or uniqueness claim to the inputs by construction. The central inductive bias is introduced as an external regularization choice rather than derived from or equivalent to the target outputs, leaving the derivation independent and self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that photometric consistency plus a static bias suffices to separate motion; this is an ad-hoc domain assumption rather than a derived result. No new physical entities are introduced. The regularization weight is likely a free parameter whose value is not stated in the abstract.

free parameters (1)
  • static regularization weight
    Controls the strength of the 'as static as possible' term; its value must be chosen or fitted to achieve the reported decomposition.
axioms (1)
  • domain assumption: Photometric loss between rendered and observed images is a sufficient signal for 3D structure and motion.
    Invoked when the method relies solely on photometric losses without additional geometric or semantic supervision.
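
Because the static regularization weight is the one free parameter in this ledger, a small sweep scored on held-out scenes is a natural sensitivity check. The sketch below assumes a user-supplied `validate` callable that trains with a given weight and returns a higher-is-better score; `toy_validate` is a hypothetical placeholder, not a real training run.

```python
from typing import Callable, Dict, Iterable, Tuple

def sweep_static_weight(
    candidates: Iterable[float],
    validate: Callable[[float], float],
) -> Tuple[float, Dict[float, float]]:
    """Score each candidate weight and return the best one with all scores."""
    scores = {w: validate(w) for w in candidates}
    best = max(scores, key=scores.get)
    return best, scores

def toy_validate(weight: float) -> float:
    # Placeholder score that peaks at weight = 0.1. A real version would
    # train the model and evaluate it on held-out scenes.
    return -(weight - 0.1) ** 2

best_weight, all_scores = sweep_static_weight([0.01, 0.1, 1.0], toy_validate)
```

If the best weight shifts substantially from one dataset to another, that would support the referee's concern that the decomposition depends on tuning rather than emerging from the photometric signal.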

pith-pipeline@v0.9.0 · 5550 in / 1230 out tokens · 52914 ms · 2026-05-17T01:55:03.281433+00:00 · methodology


Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 5 internal anchors

  1. [1]

    Uno: Unsuper- vised occupancy fields for perception and forecasting

    Ben Agro, Quinlan Sykora, Sergio Casas, Thomas Gilles, and Raquel Urtasun. Uno: Unsuper- vised occupancy fields for perception and forecasting. InCVPR, 2024. 3

  2. [2]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InCVPR, 2024. 2, 8

  3. [3]

    Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo

    Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. InICCV,

  4. [4]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation.arXiv preprint arXiv:1706.05587, 2017. 3, 7

  5. [5]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InECCV, 2024. 2, 8

  6. [6]

    SaLF: Sparse Local Fields for Multi-Sensor Rendering in Real-Time

    Yun Chen, Matthew Haines, Jingkang Wang, Krzysztof Baron-Lis, Sivabalan Manivasagam, Ze Yang, and Raquel Urtasun. Salf: Sparse local fields for multi-sensor rendering in real-time. arXiv preprint arXiv:2507.18713, 2025. 1

  7. [7]

    G3R: Gradient guided generalizable reconstruction

    Yun Chen, Jingkang Wang, Ze Yang, Sivabalan Manivasagam, and Raquel Urtasun. G3R: Gradient guided generalizable reconstruction. InECCV, 2025. 2, 3, 5, 7, 8

  8. [8]

    Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering

    Yurui Chen, Chun Gu, Junzhe Jiang, Xiatian Zhu, and Li Zhang. Periodic vibration gaussian: Dynamic urban scene reconstruction and real-time rendering.arXiv preprint arXiv:2311.18561,

  9. [9]

    Vision transformer adapter for dense predictions

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. InICLR, 2023. 3

  10. [10]

    Omnire: Omni urban scene reconstruction.arXiv preprint arXiv:2408.16760, 2024

    Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. Omnire: Omni urban scene reconstruction.arXiv preprint arXiv:2408.16760, 2024. 2

  11. [11]

    Re-evaluating lidar scene flow for autonomous driving

    Nathaniel Chodosh, Deva Ramanan, and Simon Lucey. Re-evaluating lidar scene flow for autonomous driving. InWACV, 2024. 7

  12. [12]

    Vista: A generalizable driving world model with high fidelity and versatile controllability

    Shenyuan Gao, Jiazhi Yang, Li Chen, Kashyap Chitta, Yihang Qiu, Andreas Geiger, Jun Zhang, and Hongyang Li. Vista: A generalizable driving world model with high fidelity and versatile controllability.arXiv preprint arXiv:2405.17398, 2024. 3

  13. [13]

    Splatad: Real-time li- dar and camera rendering with 3d gaussian splatting for au- tonomous driving

    Georg Hess, Carl Lindström, Maryam Fatemi, Christoffer Petersson, and Lennart Svensson. Splatad: Real-time lidar and camera rendering with 3d gaussian splatting for autonomous driving.arXiv preprint arXiv:2411.16816, 2024. 2

  14. [14]

    LRM: Large reconstruction model for single image to 3d

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single image to 3d. InICLR, 2024. 2

  15. [15]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023. 3 11

  16. [16]

    Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. InTPAMI, 2024. 9, 10

  17. [17]

    S3gaussian: Self-supervised street gaussians for autonomous driving.arXiv preprint arXiv:2405.20323, 2024

    Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. S3gaussian: Self-supervised street gaussians for autonomous driving.arXiv preprint arXiv:2405.20323, 2024. 2, 5

  18. [18]

    3D gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D gaussian splatting for real-time radiance field rendering. InTOG, 2023. 1, 2, 4

  19. [19]

    Autosplat: Constrained gaussian splatting for autonomous driving scene reconstruction

    Mustafa Khan, Hamidreza Fazlali, Dhruv Sharma, Tongtong Cao, Dongfeng Bai, Yuan Ren, and Bingbing Liu. Autosplat: Constrained gaussian splatting for autonomous driving scene reconstruction.arXiv preprint arXiv:2407.02598, 2024. 1

  20. [20]

    I can’t believe it’s not scene flow! InECCV, 2024

    Ishan Khatri, Kyle Vedder, Neehar Peri, Deva Ramanan, and James Hays. I can’t believe it’s not scene flow! InECCV, 2024. 7, 8

  21. [21]

    Point cloud forecasting as a proxy for 4d occupancy forecasting

    Tarasha Khurana, Peiyun Hu, David Held, and Deva Ramanan. Point cloud forecasting as a proxy for 4d occupancy forecasting. InCVPR, 2023. 3

  22. [22]

    Flow4d: Leveraging 4d voxel network for lidar scene flow estimation

    Jaeyeul Kim, Jungwan Woo, Ukcheol Shin, Jean Oh, and Sunghoon Im. Flow4d: Leveraging 4d voxel network for lidar scene flow estimation. InRA-L, 2025. 8

  23. [23]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023. 3, 7

  24. [24]

    Freegave: 3d physics learning from dynamic videos by gaussian velocity

    Jinxi Li, Ziyang Song, Siyuan Zhou, and Bo Yang. Freegave: 3d physics learning from dynamic videos by gaussian velocity. InCVPR, 2025. 4, 5

  25. [25]

    Uniflow: Towards zero-shot lidar scene flow for autonomous vehicles via cross-domain generalization

    Siyi Li, Qingwen Zhang, Ishan Khatri, Kyle Vedder, Deva Ramanan, and Neehar Peri. Uniflow: Towards zero-shot lidar scene flow for autonomous vehicles via cross-domain generalization. arXiv preprint arXiv:2511.18254, 2025. 8

  26. [26]

    Drivingdiffusion: Layout-guided multi-view driving scenarios video generation with latent diffusion model

    Xiaofan Li, Yifu Zhang, and Xiaoqing Ye. Drivingdiffusion: Layout-guided multi-view driving scenarios video generation with latent diffusion model. InECCV, 2024. 3

  27. [27]

    Neural scene flow prior

    Xueqian Li, Jhony Kaesemodel Pontes, and Simon Lucey. Neural scene flow prior. InNeurIPS,

  28. [28]

    Fast neural scene flow

    Xueqian Li, Jianqiao Zheng, Francesco Ferroni, Jhony Kaesemodel Pontes, and Simon Lucey. Fast neural scene flow. InCVPR, 2023. 7, 9

  29. [29]

    Real-time neural rasterization for large scenes

    Jeffrey Yunfan Liu, Yun Chen, Ze Yang, Jingkang Wang, Sivabalan Manivasagam, and Raquel Urtasun. Real-time neural rasterization for large scenes. InICCV, 2023. 2

  30. [30]

    Drivingrecon: Large 4d gaussian reconstruction model for autonomous driving.arXiv preprint arXiv:2412.09043, 2024

    Hao Lu, Tianshuo Xu, Wenzhao Zheng, Yunpeng Zhang, Wei Zhan, Dalong Du, Masayoshi Tomizuka, Kurt Keutzer, and Yingcong Chen. Drivingrecon: Large 4d gaussian reconstruction model for autonomous driving.arXiv preprint arXiv:2412.09043, 2024. 2, 3, 4, 6, 7, 8, 9

  31. [31]

    Towards zero domain gap: A comprehensive study of realistic LiDAR simulation for autonomy testing

    Sivabalan Manivasagam, Ioan Andrei Bârsan, Jingkang Wang, Ze Yang, and Raquel Urtasun. Towards zero domain gap: A comprehensive study of realistic LiDAR simulation for autonomy testing. InICCV, 2023. 1

  32. [32]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. InECCV,

  33. [33]

    Driveworld: 4d pre-trained scene understanding via world models for autonomous driving

    Chen Min, Dawei Zhao, Liang Xiao, Jian Zhao, Xinli Xu, Zheng Zhu, Lei Jin, Jianshu Li, Yulan Guo, Junliang Xing, et al. Driveworld: 4d pre-trained scene understanding via world models for autonomous driving. InCVPR, 2024. 3 12

  34. [34]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023. 7

  35. [35]

    Neural scene graphs

    Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. Neural scene graphs. InCVPR, 2021. 2

  36. [36]

    Nerfies: Deformable neural radiance fields

    Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. InICCV, 2021. 2

  37. [37]

    Desire-gs: 4d street gaussians for static- dynamic decomposition and surface reconstruction for urban driving scenes.arXiv preprint arXiv:2411.11921, 2024

    Chensheng Peng, Chengwei Zhang, Yixiao Wang, Chenfeng Xu, Yichen Xie, Wenzhao Zheng, Kurt Keutzer, Masayoshi Tomizuka, and Wei Zhan. Desire-gs: 4d street gaussians for static- dynamic decomposition and surface reconstruction for urban driving scenes.arXiv preprint arXiv:2411.11921, 2024. 2, 4, 5, 6, 7, 8

  38. [38]

    D-nerf: Neural radiance fields for dynamic scenes

    Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. InCVPR, 2021. 2

  39. [39]

    Neural lighting simulation for urban scenes

    Ava Pun, Gary Sun, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Wei-Chiu Ma, and Raquel Urtasun. Neural lighting simulation for urban scenes. InNeurIPS, 2023. 2

  40. [40]

    L4gm: Large 4d gaussian reconstruction model

    Jiawei Ren, Cheng Xie, Ashkan Mirzaei, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, Huan Ling, et al. L4gm: Large 4d gaussian reconstruction model. In NeurIPS, 2025. 2, 5, 7, 8

  41. [41]

    Scube: Instant large-scale scene reconstruction using voxsplats

    Xuanchi Ren, Yifan Lu, Hanxue Liang, Jay Zhangjie Wu, Huan Ling, Mike Chen, Francis Fidler, Sanja annd Williams, and Jiahui Huang. Scube: Instant large-scale scene reconstruction using voxsplats. InNeurIPS, 2024. 2

  42. [42]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

  43. [43]

    Torchsparse++: Efficient training and inference framework for sparse convolution on gpus

    Haotian Tang, Shang Yang, Zhijian Liu, Ke Hong, Zhongming Yu, Xiuyu Li, Guohao Dai, Yu Wang, and Song Han. Torchsparse++: Efficient training and inference framework for sparse convolution on gpus. InMICRO, 2023. 7

  44. [44]

    Lgm: Large multi-view gaussian model for high-resolution 3d content creation

    Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. InECCV, 2024. 8

  45. [45]

    NeuRAD: Neural rendering for autonomous driving

    Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. NeuRAD: Neural rendering for autonomous driving. InCVPR, 2024. 1, 5, 7, 8

  46. [46]

    Simuli: Real-time lidar and camera simulation with unscented transforms.arXiv preprint arXiv:2510.12901, 2025

    Haithem Turki, Qi Wu, Xin Kang, Janick Martinez Esturo, Shengyu Huang, Ruilong Li, Zan Gojcic, and Riccardo de Lutio. Simuli: Real-time lidar and camera simulation with unscented transforms.arXiv preprint arXiv:2510.12901, 2025. 1

  47. [47]

    Suds: Scalable urban dynamic scenes

    Haithem Turki, Jason Y Zhang, Francesco Ferroni, and Deva Ramanan. Suds: Scalable urban dynamic scenes. InCVPR, 2023. 2

  48. [48]

    Neural eulerian scene flow fields

    Kyle Vedder, Neehar Peri, Ishan Khatri, Siyi Li, Eric Eaton, Mehmet Kemal Kocamaz, Yue Wang, Zhiding Yu, Deva Ramanan, and Joachim Pehserl. Neural eulerian scene flow fields. In ICLR, 2025. 7

  49. [49]

    CADSim: Robust and scalable in-the-wild 3d reconstruction for controllable sensor simulation

    Jingkang Wang, Sivabalan Manivasagam, Yun Chen, Ze Yang, Ioan Andrei Bârsan, Anqi Joyce Yang, Wei-Chiu Ma, and Raquel Urtasun. CADSim: Robust and scalable in-the-wild 3d reconstruction for controllable sensor simulation. InCoRL, 2022. 2

  50. [50]

    Advsim: Generating safety-critical scenarios for self-driving vehicles

    Jingkang Wang, Ava Pun, James Tu, Sivabalan Manivasagam, Abbas Sadat, Sergio Casas, Mengye Ren, and Raquel Urtasun. Advsim: Generating safety-critical scenarios for self-driving vehicles. InCVPR, 2021. 1 13

  51. [51]

    Ibrnet: Learning multi-view image-based rendering

    Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. InCVPR, 2021. 2

  52. [52]

    Drive- dreamer: Towards real-world-drive world models for autonomous driving

    Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. Drive- dreamer: Towards real-world-drive world models for autonomous driving. InECCV, 2024. 3

  53. [53]

    Meshlrm: Large reconstruction model for high-quality meshes.arXiv preprint arXiv:2404.12385, 2024

    Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. Meshlrm: Large reconstruction model for high-quality meshes.arXiv preprint arXiv:2404.12385, 2024. 2

  54. [54]

    Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting

    Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting.arXiv preprint arXiv:2301.00493, 2023. 9

  55. [55]

    4d gaussian splatting for real-time dynamic scene rendering

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In CVPR, 2024. 2

  56. [56]

    Dˆ 2nerf: Self-supervised decoupling of dynamic and static objects from a monocular video

    Tianhao Wu, Fangcheng Zhong, Andrea Tagliasacchi, Forrester Cole, and Cengiz Oztireli. Dˆ 2nerf: Self-supervised decoupling of dynamic and static objects from a monocular video. In NeurIPS, 2022. 2, 4, 5

  57. [57]

    Pandaset: Advanced sensor suite dataset for autonomous driving

    Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, et al. Pandaset: Advanced sensor suite dataset for autonomous driving. InITSC, 2021. 2, 6

  58. [58]

    Depthsplat: Connecting gaussian splatting and depth.arXiv preprint arXiv:2410.13862, 2024

    Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth.arXiv preprint arXiv:2410.13862, 2024. 5, 7

  59. [59]

    Street gaussians for modeling dynamic urban scenes

    Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. Street gaussians for modeling dynamic urban scenes. In ECCV, 2024. 1, 2, 5, 7, 8

  60. [60]

    Storm: Spatio-temporal reconstruction model for large-scale outdoor scenes

    Jiawei Yang, Jiahui Huang, Yuxiao Chen, Yan Wang, Boyi Li, Yurong You, Maximilian Igl, Apoorva Sharma, Peter Karkus, Danfei Xu, Boris Ivanovic, Yue Wang, and Marco Pavone. Storm: Spatio-temporal reconstruction model for large-scale outdoor scenes. arXiv preprint arXiv:2501.00602, 2025. 2, 3, 4, 5, 6, 7, 8, 9

  61. [61]

    Emernerf: Emergent spatial-temporal scene decomposition via self-supervision

    Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, and Yue Wang. Emernerf: Emergent spatial-temporal scene decomposition via self-supervision. arXiv preprint arXiv:2311.02077, 2023. 2, 4, 5, 6, 7, 8

  62. [62]

    Unisim: A neural closed-loop sensor simulator

    Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. Unisim: A neural closed-loop sensor simulator. In CVPR, 2023. 1, 2

  63. [63]

    Reconstructing objects in-the-wild for realistic sensor simulation

    Ze Yang, Sivabalan Manivasagam, Yun Chen, Jingkang Wang, Rui Hu, and Raquel Urtasun. Reconstructing objects in-the-wild for realistic sensor simulation. In ICRA, 2023. 2

  64. [64]

    Genassets: Generating in-the-wild 3d assets in latent space

    Ze Yang, Jingkang Wang, Haowei Zhang, Sivabalan Manivasagam, Yun Chen, and Raquel Urtasun. Genassets: Generating in-the-wild 3d assets in latent space. In CVPR, 2025. 2

  65. [65]

    Visual point cloud forecasting enables scalable autonomous driving

    Zetong Yang, Li Chen, Yanan Sun, and Hongyang Li. Visual point cloud forecasting enables scalable autonomous driving. In CVPR, 2024. 3

  66. [66]

    Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction

    Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In CVPR, 2024. 2

  67. [67]

    Improving 2D Feature Representations by 3D-Aware Fine-Tuning

    Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, and Jan Eric Lenssen. Improving 2D Feature Representations by 3D-Aware Fine-Tuning. In ECCV, 2024. 6

  68. [68]

    Visionpad: A vision-centric pre-training paradigm for autonomous driving

    Haiming Zhang, Wending Zhou, Yiyao Zhu, Xu Yan, Jiantao Gao, Dongfeng Bai, Yingjie Cai, Bingbing Liu, Shuguang Cui, and Zhen Li. Visionpad: A vision-centric pre-training paradigm for autonomous driving. arXiv preprint arXiv:2411.14716, 2024. 4

  69. [69]

    GS-LRM: Large reconstruction model for 3D gaussian splatting

    Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. GS-LRM: Large reconstruction model for 3D gaussian splatting. In ECCV, 2024. 2

  70. [70]

    Learning unsupervised world models for autonomous driving via discrete diffusion

    Lunjun Zhang, Yuwen Xiong, Ze Yang, Sergio Casas, Rui Hu, and Raquel Urtasun. Learning unsupervised world models for autonomous driving via discrete diffusion. In ICLR, 2024. 3

  71. [71]

    Occworld: Learning a 3d occupancy world model for autonomous driving

    Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Borui Zhang, Yueqi Duan, and Jiwen Lu. Occworld: Learning a 3d occupancy world model for autonomous driving. In ECCV, 2024. 3

  72. [72]

    DrivingGaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes

    Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. DrivingGaussian: Composite gaussian splatting for surrounding dynamic autonomous driving scenes. In CVPR, 2024. 1

  73. [73]

    Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats

    Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. arXiv preprint arXiv:2410.12781, 2024. 3