pith. sign in

arxiv: 2606.23688 · v1 · pith:4TBYY22Qnew · submitted 2026-06-22 · 💻 cs.CV

Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild

Pith reviewed 2026-06-26 08:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D reconstructionmonocular video3D Gaussian Splattingtest-time optimizationnon-rigid objectsocclusion handlingsingle-view 3D estimationdeformable representations
0
0 comments X

The pith

Lift4D adapts single-view 3D models via causal latent conditioning to initialize deformable 3D Gaussian Splatting for monocular 4D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Lift4D as a test-time optimization framework for 4D reconstruction of dynamic objects from monocular video. It adapts an existing single-view 3D model with causal latent conditioning to produce temporally consistent per-frame predictions as initialization. These initialize a deformable 3D Gaussian Splatting model that is optimized to the video using occlusion-aware terms and a view-conditioned diffusion prior to complete unobserved areas. This combines data-driven priors throughout the process rather than only at the start, leading to better results on complex in-the-wild videos.

Core claim

Lift4D is a test-time optimization framework that addresses both limitations of prior approaches. First, we adapt an existing single-view 3D reconstruction model to yield temporally consistent per-frame predictions via causal latent conditioning, providing a coherent initialization for a deformable 3D Gaussian Splatting representation. We then sculpt this representation to match the input video through an occlusion-aware optimization that faithfully recovers visible surface details while completing unobserved regions using a view-conditioned diffusion prior.

What carries the argument

causal latent conditioning on single-view 3D models to generate temporally consistent per-frame 3D predictions that initialize deformable 3D Gaussian Splatting optimized with occlusion-aware diffusion prior

If this is right

  • Temporally consistent predictions provide coherent initialization for deformable representations.
  • Occlusion-aware optimization recovers visible details and completes unobserved regions using diffusion prior.
  • Improves over prior 4D reconstruction methods on in-the-wild sequences with severe occlusions and non-rigid motion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may generalize if other single-view 3D models can be adapted with similar causal conditioning.
  • Test-time sculpting with diffusion priors could apply to static 3D reconstruction tasks with partial observations.

Load-bearing premise

Adapting an existing single-view 3D reconstruction model via causal latent conditioning yields temporally consistent per-frame predictions that serve as a coherent initialization for the subsequent deformable representation.

What would settle it

Running the method on a benchmark of in-the-wild monocular videos with severe occlusions and non-rigid motion and finding no improvement in reconstruction quality metrics compared to prior methods would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.23688 by Fernando De la Torre, Kris Kitani, Manan Shah, Nicolas Ugrinovic, Shubham Tulsiani, Xiaoxuan Ma, Yehonathan Litman.

Figure 1
Figure 1. Figure 1: 4D Reconstruction from Monocular In-the-Wild Video. Given a video of a dynamic scene, Lift4D recovers the full geometry, appearance, and deformation of objects, including regions never observed by the camera, by leveraging a causally conditioned image-to-3D prior and occlusion-aware optimization. The resulting 4D representation handles large deformations and scene occlusions. Abstract Reconstructing dynami… view at source ↗
Figure 2
Figure 2. Figure 2: Causal Single-view Reconstruction. Given a video input, we obtain per-frame 3D reconstructions G i with an image-to-3D model [44] using causal latent conditioning to enforce the temporal consistency across frames. For a reference frame 0, we first fully denoise a latent Z 0 1 and object-to-camera transform T 0 from a reference canonical frame I 0 . The denoised latent is then propagated to the next frame b… view at source ↗
Figure 3
Figure 3. Figure 3: Deformable 3D Optimization. We factorize the 4D representation into canonical 3D gaussians and sparse deformation control nodes and optimize the 4D reconstruction on per-frame reconstructions G i via Eq. (3). Distillation Loss Deformation MLP Ψ Duplicate Control Nodes I Diff. i Rasterizer + Lapp Render Loss Diff. Rasterizer Image Prior Time [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Appearance Reconstruction. The 3D appearance is deformed with duplicate control nodes and supervised on the reference images I i and an image novel view synthesis prior. The reference image supervises observed regions, while the view-conditioned prior supervises unobserved ones via Eq. (6). 2D and 3D priors and route each to the role where it is re￾liable for in-the-wild content with a curriculum-based tes… view at source ↗
Figure 5
Figure 5. Figure 5: Occlusion-aware Rendering Supervision. In cases where the subject is affected by scene occluders, the scene-occlusion mask Mi occ is deduced by comparing the estimated scene depth Di scene with the rendered object depth Di πc (Eq. (7)). The rendered image I i πc is color-matched to the input I i over visible regions, producing ˜I i πc , which is composited with I i into the completed reference I i full use… view at source ↗
Figure 6
Figure 6. Figure 6: Reconstructing 4D Objects from In-the-wild Internet Footage. Given an input video of an arbitrary object, Lift4D reconstructs in 4D the complete object geometry and texture. With its usage of a consistent geometry basis and appearance supervision in observed and unobserved regions, our method reconstructs a more topologically accurate geometry and fuses appearance between observed and unobserved regions. O… view at source ↗
Figure 7
Figure 7. Figure 7: Novel Views of 4D In-the-Wild Reconstructions. We showcase the strong performance of our approach across different novel views on in-the-wild internet stock footage of subjects with occlusions, deformations, and motion. Our method successfully generalizes to multiple types of scenes and objects in-the-wild and their appearance in novel views. The baselines, however, are sensitive to the input and struggle … view at source ↗
Figure 8
Figure 8. Figure 8: Reconstructing 4D Synthetic Objects. Lift4D can also robustly reconstruct rich and complete 4D object geometry and texture for simpler synthetic cases, such as in Consistent4D [18], as opposed to the baselines, which recover simpler geometries and texture or have wrong deformations [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
read the original abstract

Reconstructing dynamic non-rigid objects from monocular video requires integrating visual cues from direct observations with data-driven priors over geometry and appearance. Prior approaches either learn to directly predict 4D representations from visual input or initialize a 3D representation that is subsequently deformed and refined based on video evidence. However, the former are constrained by the scarcity of 4D training data, while the latter leverage priors only for the initial reconstruction and rely solely on video supervision thereafter; neither handles complex in-the-wild scenarios with large deformations and occlusions well. We present Lift4D, a test-time optimization framework that addresses both limitations. First, we adapt an existing single-view 3D reconstruction model to yield temporally consistent per-frame predictions via causal latent conditioning, providing a coherent initialization for a deformable 3D Gaussian Splatting representation. We then ``sculpt'' this representation to match the input video through an occlusion-aware optimization that faithfully recovers visible surface details while completing unobserved regions using a view-conditioned diffusion prior. We demonstrate that Lift4D clearly improves over prior 4D reconstruction methods, particularly on challenging in-the-wild sequences with severe occlusions and non-rigid motion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents Lift4D, a test-time optimization framework for 4D reconstruction of dynamic non-rigid objects from monocular video. It first adapts an existing single-view 3D reconstruction model via causal latent conditioning to produce temporally consistent per-frame 3D predictions, which initialize a deformable 3D Gaussian Splatting representation. This is then refined through occlusion-aware optimization that recovers visible details and completes unobserved regions using a view-conditioned diffusion prior. The central claim is that Lift4D clearly improves over prior 4D reconstruction methods, especially on challenging in-the-wild sequences with severe occlusions and non-rigid motion.

Significance. If the quantitative results and ablations hold, the framework could meaningfully advance in-the-wild 4D reconstruction by effectively reusing single-view priors without requiring scarce 4D training data, while the combination of causal initialization and test-time sculpting addresses limitations of both direct 4D prediction and purely video-supervised deformation approaches.

major comments (1)
  1. [Abstract] Abstract: the central claim of clear improvement over prior methods rests on the adapted single-view model (via causal latent conditioning) supplying temporally consistent per-frame 3D predictions as a coherent initialization for the subsequent deformable representation and occlusion-aware optimization. No quantitative check (e.g., frame-to-frame 3D consistency metrics, variance in lifted geometry, or ablations isolating the conditioning from the diffusion prior and sculpting) is supplied to establish that this step is load-bearing; if independent per-frame lifts already suffice or the conditioning fails to reduce drift on non-rigid motion, the performance gain would be attributable to test-time optimization alone.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. The concern regarding the lack of quantitative validation for the causal latent conditioning step is well-taken, and we will strengthen the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of clear improvement over prior methods rests on the adapted single-view model (via causal latent conditioning) supplying temporally consistent per-frame 3D predictions as a coherent initialization for the subsequent deformable representation and occlusion-aware optimization. No quantitative check (e.g., frame-to-frame 3D consistency metrics, variance in lifted geometry, or ablations isolating the conditioning from the diffusion prior and sculpting) is supplied to establish that this step is load-bearing; if independent per-frame lifts already suffice or the conditioning fails to reduce drift on non-rigid motion, the performance gain would be attributable to test-time optimization alone.

    Authors: We agree that the abstract's claim would benefit from explicit quantitative support for the contribution of causal latent conditioning. In the revised version we will add (i) frame-to-frame 3D consistency metrics (e.g., per-point variance of lifted geometry across consecutive frames) computed on the adapted single-view model with and without conditioning, and (ii) an ablation that isolates the conditioning from the subsequent diffusion prior and sculpting stages. These additions will directly test whether the conditioning reduces drift on non-rigid sequences and is load-bearing for the reported gains. We believe the overall experimental results already indicate that independent per-frame lifts are insufficient, but the requested metrics will make this explicit. revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline adapts external single-view model and applies test-time optimization with independent priors

full rationale

The described derivation chain adapts a pre-existing single-view 3D reconstruction model (via causal latent conditioning) to produce per-frame predictions used as initialization for deformable 3D Gaussian Splatting, followed by occlusion-aware optimization and a view-conditioned diffusion prior. No equations, fitted parameters renamed as predictions, or self-citations are present in the abstract or described steps that would reduce any result to its own inputs by construction. The approach relies on external pretrained components and test-time sculpting rather than self-referential definitions or load-bearing self-citations, making the central claims independent of the kinds of circularity enumerated.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the described components (causal latent conditioning, occlusion-aware optimization, view-conditioned diffusion prior) are treated as adaptations of prior models rather than newly invented entities with independent evidence.

pith-pipeline@v0.9.1-grok · 5771 in / 1128 out tokens · 22364 ms · 2026-06-26T08:56:20.109637+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 1 linked inside Pith

  1. [1]

    Recammaster: Camera-controlled generative ren- dering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative ren- dering from a single video. InProceedings of the Interna- tional Conference on Computer Vision (ICCV), 2025. 2

  2. [2]

    Sam 3: Segment anything with concepts.arXiv, 2025

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoub- hik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman R¨adle, Triantafyllos Afouras, Effrosyni Mavroudi, Kather- ine Xu, Tsung-Han Wu, Yu Zhou, Lil...

  3. [3]

    Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoub- hik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025. 5

  4. [4]

    Motion 3-to-4: 3d motion reconstruction for 4d synthesis.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026

    Hongyuan Chen, Xingyu Chen, Youjia Zhang, Zexiang Xu, and Anpei Chen. Motion 3-to-4: 3d motion reconstruction for 4d synthesis.Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026. 2, 3

  5. [5]

    V2m4: 4d mesh animation reconstruction from a single monocular video

    Jianqi Chen, Biao Zhang, Xiangjun Tang, and Peter Wonka. V2m4: 4d mesh animation reconstruction from a single monocular video. InProceedings of the International Con- ference on Computer Vision (ICCV), 2025. 2, 3, 6, 7, 9

  6. [6]

    Re- construct, inpaint, test-time finetune: Dynamic novel-view synthesis from monocular videos

    Kaihua Chen, Tarasha Khurana, and Deva Ramanan. Re- construct, inpaint, test-time finetune: Dynamic novel-view synthesis from monocular videos. InAdvances in Neural In- formation Processing Systems (NeurIPS), 2025. 2, 3

  7. [7]

    Easi3r: Estimating disentangled motion from dust3r without training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3r: Estimating disentangled motion from dust3r without training. InProceedings of the International Conference on Computer Vision (ICCV), 2025. 2

  8. [8]

    Dream- scene4d: Dynamic multi-object scene generation from monocular videos

    Wen-Hsuan Chu, Lei Ke, and Katerina Fragkiadaki. Dream- scene4d: Dynamic multi-object scene generation from monocular videos. InAdvances in Neural Information Pro- cessing Systems (NeurIPS), 2024. 2, 3, 9

  9. [9]

    Generative 4d scene gaussian splatting with object view-synthesis priors

    Wen-Hsuan Chu, Lei Ke, Jianmeng Liu, Mingxiao Huo, Pavel Tokmakov, and Katerina Fragkiadaki. Generative 4d scene gaussian splatting with object view-synthesis priors. arXiv, 2025. 3

  10. [10]

    Flow3r: Factored flow prediction for scalable vi- sual geometry learning

    Zhongxiao Cong, Qitao Zhao, Minsik Jeon, and Shubham Tulsiani. Flow3r: Factored flow prediction for scalable vi- sual geometry learning. InProceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR),

  11. [11]

    Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Informa- tion Processing Systems (NeurIPS), 2023

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Informa- tion Processing Systems (NeurIPS), 2023. 3

  12. [12]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2023. 3

  13. [13]

    Duisterhof, Zhao Mandi, Yunchao Yao, Jia- Wei Liu, Jenny Seidenschwarz, Mike Zheng Shou, Deva Ra- manan, Shuran Song, Stan Birchfield, Bowen Wen, and Jef- frey Ichnowski

    Bardienus P. Duisterhof, Zhao Mandi, Yunchao Yao, Jia- Wei Liu, Jenny Seidenschwarz, Mike Zheng Shou, Deva Ra- manan, Shuran Song, Stan Birchfield, Bowen Wen, and Jef- frey Ichnowski. Deformgs: Scene flow in highly deformable scenes for deformable object manipulation. InThe 16th International Workshop on the Algorithmic Foundations of Robotics (WAFR), 2024. 2

  14. [14]

    Black, Trevor Darrell, and Angjoo Kanazawa

    Haiwen Feng, Junyi Zhang, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J. Black, Trevor Darrell, and Angjoo Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world. InProceedings of the International Conference on Computer Vision (ICCV), 2025. 2

  15. [15]

    Motion prompting: Controlling video generation with motion trajec- tories

    Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, and Deqing Sun. Motion prompting: Controlling video generation with motion trajec- tories. InProceedings of the International Conference on Computer Vision (ICCV), 2025. 6

  16. [16]

    Clipscore: A reference-free evaluation met- ric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation met- ric for image captioning. InConference on Empirical Meth- ods in Natural Language Processing (EMNLP), 2021. 6

  17. [17]

    Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes.Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

    Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes.Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 5, 13

  18. [18]

    Consistent4d: Consistent 360° dynamic object generation from monocular video

    Yanqin Jiang, Li Zhang, Jin Gao, Weiming Hu, and Yao Yao. Consistent4d: Consistent 360° dynamic object generation from monocular video. InProceedings of the International Conference on Learning Representations (ICLR), 2024. 2, 3, 7, 15

  19. [19]

    Geo4d: Leveraging video generators for geometric 4d scene reconstruction

    Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Geo4d: Leveraging video generators for geometric 4d scene reconstruction. InProceedings of the International Conference on Computer Vision (ICCV), 2025. 2

  20. [20]

    Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

    Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2

  21. [21]

    Co- Tracker3: Simpler and better point tracking by pseudo- labelling real videos

    Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- Tracker3: Simpler and better point tracking by pseudo- labelling real videos. InProceedings of the International Conference on Computer Vision (ICCV), 2025. 6

  22. [22]

    Any4D: Unified feed-forward metric 4D reconstruction

    Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, and Deva Ramanan. Any4D: Unified feed-forward metric 4D reconstruction. arXiv, 2025. 2

  23. [23]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (TOG), 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics (TOG), 2023. 2, 5

  24. [24]

    Generative video motion editing with 3d point tracks

    Yao-Chih Lee, Zhoutong Zhang, Jiahui Huang, Jui-Hsien Wang, Joon-Young Lee, Jia-Bin Huang, Eli Shechtman, and Zhengqi Li. Generative video motion editing with 3d point tracks. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2

  25. [25]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds

    Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. InProceedings of the IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2025. 2

  26. [26]

    Dreammesh4d: Video-to-4d generation with sparse-controlled gaussian- mesh hybrid representation

    Zhiqi Li, Yiming Chen, and Peidong Liu. Dreammesh4d: Video-to-4d generation with sparse-controlled gaussian- mesh hybrid representation. InAdvances in Neural Infor- mation Processing Systems (NeurIPS), 2024. 3, 6, 7

  27. [27]

    Pad3r: Pose-aware dy- namic 3d reconstruction from casual videos

    Ting-Hsuan Liao, Haowen Liu, Yiran Xu, Songwei Ge, Gengshan Yang, and Jia-Bin Huang. Pad3r: Pose-aware dy- namic 3d reconstruction from casual videos. InSIGGRAPH Asia, 2025. 2, 3, 6, 7, 9

  28. [28]

    Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Yang Zhao, Sida Peng, Hengkai Guo, Xiaowei Zhou, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth any- thing 3: Recovering the visual space from any views. InPro- ceedings of the International Conference on Learning Rep- resentations (ICLR), 2026. 2, 4, 5, 6

  29. [29]

    MoDGS: Dy- namic gaussian splatting from casually-captured monocular videos with depth priors

    Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lyu, Peng Wang, Wenping Wang, and Junhui Hou. MoDGS: Dy- namic gaussian splatting from casually-captured monocular videos with depth priors. InProceedings of the International Conference on Learning Representations (ICLR), 2025. 2

  30. [30]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. InProceedings of the In- ternational Conference on Computer Vision (ICCV), 2023. 2, 4, 6, 13

  31. [31]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InProceedings of the International Confer- ence on Learning Representations (ICLR), 2023. 4

  32. [32]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProceedings of the International Confer- ence on Learning Representations (ICLR), 2019. 6

  33. [33]

    Dynamic 3d gaussians: Tracking by per- sistent dynamic view synthesis

    Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by per- sistent dynamic view synthesis. InProceedings of the Inter- national Conference on 3D Vision (3DV), 2024. 2, 13

  34. [34]

    4rc: 4d reconstruction via conditional querying anytime and anywhere.arXiv, 2026

    Yihang Luo, Shangchen Zhou, Yushi Lan, Xingang Pan, and Chen Change Loy. 4rc: 4d reconstruction via conditional querying anytime and anywhere.arXiv, 2026. 2

  35. [35]

    The 2017 davis challenge on video object segmentation.arXiv,

    Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Ar- bel´aez, Alexander Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation.arXiv,

  36. [36]

    L4gm: Large 4d gaussian reconstruction model

    Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xi- aohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, and Huan Ling. L4gm: Large 4d gaussian reconstruction model. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 2, 3, 6, 7, 9

  37. [37]

    Gen3c: 3d-informed world-consistent video generation with precise camera con- trol

    Xuanchi Ren, Tianchang Shen, Jiahui Huang, Huan Ling, Yifan Lu, Merlin Nimier-David, Thomas M ¨uller, Alexan- der Keller, Sanja Fidler, and Jun Gao. Gen3c: 3d-informed world-consistent video generation with precise camera con- trol. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2, 3

  38. [38]

    Mitra, and David Novotny

    Remy Sabathier, Niloy J. Mitra, and David Novotny. Lim: Large interpolator model for dynamic reconstruction. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

  39. [39]

    Mitra, and Tom Monnier

    Remy Sabathier, David Novotny, Niloy J. Mitra, and Tom Monnier. Actionmesh: Animated 3d mesh generation with temporal 3d diffusion. InProceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition (CVPR),

  40. [40]

    As-rigid-as-possible surface modeling

    Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. InProceedings of Eurographics, 2007. 13

  41. [41]

    Harley, Mikaela Uy, Florian Du- bost, Federico Tombari, Gordon Wetzstein, and Leonidas Guibas

    Colton Stearns, Adam W. Harley, Mikaela Uy, Florian Du- bost, Federico Tombari, Gordon Wetzstein, and Leonidas Guibas. Dynamic gaussian marbles for novel view synthesis of casual monocular videos. InSIGGRAPH Asia, 2024. 2

  42. [42]

    V-DPM: 4d video reconstruction with dynamic point maps.arXiv, 2026

    Edgar Sucar, Eldar Insafutdinov, Zihang Lai, and Andrea Vedaldi. V-DPM: 4d video reconstruction with dynamic point maps.arXiv, 2026. 2

  43. [43]

    Eg4d: Explicit generation of 4d object without score distil- lation

    Qi Sun, Zhiyang Guo, Ziyu Wan, Jing Nathan Yan, Sheng- ming Yin, Wengang Zhou, Jing Liao, and Houqiang Li. Eg4d: Explicit generation of 4d object without score distil- lation. InICLR, 2025. 2, 3

  44. [44]

    Sam 3d: 3dfy anything in images.arXiv, 2025

    SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Doll´ar, Georgia Gkioxari, Matt Feiszli, and Jitendra Malik. Sam 3d: 3dfy anything in images.arXiv, 2025...

  45. [45]

    To- wards accurate generative models of video: A new metric & challenges.arXiv, 2019

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marini, Marcin Michalski, and Sylvain Gelly. To- wards accurate generative models of video: A new metric & challenges.arXiv, 2019. 6

  46. [46]

    Shape of mo- tion: 4d reconstruction from a single video

    Qianqian Wang, Vickie Ye, Hang Gao, Weijia Zeng, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of mo- tion: 4d reconstruction from a single video. InProceedings of the International Conference on Computer Vision (ICCV),

  47. [47]

    Gflow: Recovering 4d world from monocular video

    Shizun Wang, Xingyi Yang, Qiuhong Shen, Zhenxiang Jiang, and Xinchao Wang. Gflow: Recovering 4d world from monocular video. InProceedings of the National Conference on Artificial Intelligence (AAAI), 2025. 2

  48. [48]

    4d gaussian splatting for real-time dynamic scene rendering

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

  49. [49]

    Barron, and Aleksander Holynski

    Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T. Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffu- sion models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

  50. [50]

    Zijie Wu, Chaohui Yu, Yanqin Jiang, Chenjie Cao, Wang Fan, and Xiang. Bai. Sc4d: Sparse-controlled video-to-4d generation and motion transfer. InProceedings of the Eu- ropean Conference on Computer Vision (ECCV), 2024. 2, 3

  51. [51]

    Lavr: Scene latent conditioned generative video trajectory re-rendering using large 4d reconstruction models.arXiv,

    Mingyang Xie, Numair Khan, Tianfu Wang, Naina Dhin- gra, Seonghyeon Nam, Haitao Yang, Zhuo Hui, Christopher Metzler, Andrea Vedaldi, Hamed Pirsiavash, and Lei Luo. Lavr: Scene latent conditioned generative video trajectory re-rendering using large 4d reconstruction models.arXiv,

  52. [52]

    SV4D: Dynamic 3d content generation with multi-frame and multi-view consistency

    Yiming Xie, Chun-Han Yao, Vikram V oleti, Huaizu Jiang, and Varun Jampani. SV4D: Dynamic 3d content generation with multi-frame and multi-view consistency. InProceed- ings of the International Conference on Learning Represen- tations (ICLR), 2025. 3

  53. [53]

    Geometrycrafter: Consistent geometry estimation for open-world videos with diffusion priors

    Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song- Hai Zhang, and Ying Shan. Geometrycrafter: Consistent geometry estimation for open-world videos with diffusion priors. InProceedings of the International Conference on Computer Vision (ICCV), 2025. 2

  54. [54]

    4dgt: Learning a 4d gaussian transformer using real-world monocular videos

    Zhen Xu, Zhengqin Li, Zhao Dong, Xiaowei Zhou, Richard Newcombe, and Zhaoyang Lv. 4dgt: Learning a 4d gaussian transformer using real-world monocular videos. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 2

  55. [55]

    Banmo: Build- ing animatable 3d neural models from many casual videos

    Gengshan Yang, Minh V o, Natalia Neverova, Deva Ra- manan, Andrea Vedaldi, and Hanbyul Joo. Banmo: Build- ing animatable 3d neural models from many casual videos. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2, 3, 6, 9

  56. [56]

    Zhang, Zachary Manchester, and Deva Ramanan

    Gengshan Yang, Shuo Yang, John Z. Zhang, Zachary Manchester, and Deva Ramanan. Physically plausible re- construction from monocular videos. InProceedings of the International Conference on Computer Vision (ICCV), 2023. 2, 3

  57. [57]

    Deformable 3d gaussians for high- fidelity monocular dynamic scene reconstruction

    Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high- fidelity monocular dynamic scene reconstruction. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

  58. [58]

    SV4D2.0: Enhancing spatio-temporal consistency in multi-view video diffusion for high-quality 4d generation

    Chun-Han Yao, Yiming Xie, Vikram V oleti, Huaizu Jiang, and Varun Jampani. SV4D2.0: Enhancing spatio-temporal consistency in multi-view video diffusion for high-quality 4d generation. InProceedings of the International Conference on Computer Vision (ICCV), 2025. 3

  59. [59]

    Yeh, Peter Wonka, and Chaoyang Wang

    Jiraphon Yenphraphai, Ashkan Mirzaei, Jianqi Chen, Jiaxu Zou, Sergey Tulyakov, Raymond A. Yeh, Peter Wonka, and Chaoyang Wang. Shapegen4d: Towards high quality 4d shape generation from videos. InProceedings of the In- ternational Conference on Learning Representations (ICLR),

  60. [60]

    Tra- jectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models

    Mark Yu, Wenbo Hu, Jinbo Xing, and Ying Shan. Tra- jectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. InProceedings of the Interna- tional Conference on Computer Vision (ICCV), 2025. 2

  61. [61]

    Stag4d: Spatial-temporal anchored generative 4d gaussians

    Yifei Zeng, Yanqin Jiang, Siyu Zhu, Yuanxun Lu, Youtian Lin, Hao Zhu, Weiming Hu, Xun Cao, and Yao Yao. Stag4d: Spatial-temporal anchored generative 4d gaussians. InPro- ceedings of the European Conference on Computer Vision (ECCV), 2024. 3, 6, 7, 9

  62. [62]

    Gaussian varia- tion field diffusion for high-fidelity video-to-4d synthesis

    Bowen Zhang, Sicheng Xu, Chuxin Wang, Jiaolong Yang, Feng Zhao, Dong Chen, and Baining Guo. Gaussian varia- tion field diffusion for high-fidelity video-to-4d synthesis. In Proceedings of the International Conference on Computer Vision (ICCV), 2025. 3

  63. [63]

    Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, and Mehdi S

    Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ig- nacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Jo ¨elle K. Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, and Mehdi S. M. Sajjadi. Efficiently reconstructing dynamic scenes one d4rt at a time.arXiv, 2025. 2

  64. [64]

    4diffusion: Multi-view video dif- fusion model for 4d generation

    Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yun- hong Wang, and Yu Qiao. 4diffusion: Multi-view video dif- fusion model for 4d generation. InAdvances in Neural In- formation Processing Systems (NeurIPS), 2024. 3

  65. [65]

    Monst3r: A simple approach for estimating ge- ometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jam- pani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming- Hsuan Yang. Monst3r: A simple approach for estimating ge- ometry in the presence of motion. InProceedings of the In- ternational Conference on Learning Representations (ICLR),

  66. [66]

    Motion blender gaussian splatting for dynamic scene reconstruction

    Xinyu Zhang, Haonan Chang, Yuhan Liu, and Abdeslam Boularias. Motion blender gaussian splatting for dynamic scene reconstruction. InConference on Robot Learning (CoRL), 2025. 2

  67. [67]

    Sparsefusion: Distill- ing view-conditioned diffusion for 3d reconstruction

    Zhizhuo Zhou and Shubham Tulsiani. Sparsefusion: Distill- ing view-conditioned diffusion for 3d reconstruction. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 6 A. Supplementary We provide additional quantitative and qualitative results on an expanded dataset of in-the-wild data, as well as compar- isons to...