Recognition: 1 theorem link
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
Pith reviewed 2026-05-14 23:38 UTC · model grok-4.3
The pith
Spatio-Temporal Distortion Field corrects inconsistencies in generative observations to enable 4D reconstruction from sparse uncalibrated cameras.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs.
What carries the argument
The Spatio-Temporal Distortion Field, a learned model that corrects spatial and temporal inconsistencies present in generative observations from sparse cameras.
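The material quoted here does not pin down the field's architecture, but the supplementary excerpt later in this review mentions 16-channel feature maps per plane, which hints at a K-planes/HexPlane-style factorization [4, 7]. Below is a minimal PyTorch sketch under that assumption; the plane layout, resolution, fusion rule, and offset-decoding head are illustrative guesses, not the authors' implementation.

```python
# Hypothetical sketch of a plane-factorized spatio-temporal distortion field.
# Plane count, resolution, 16 feature channels, and the MLP head are assumptions
# inferred from the supplementary description, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalDistortionField(nn.Module):
    def __init__(self, resolution=64, channels=16):
        super().__init__()
        # Six factor planes over (x, y, z, t): three purely spatial (xy, xz, yz)
        # and three spatio-temporal (xt, yt, zt), as in K-planes/HexPlane.
        self.plane_axes = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
        self.planes = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, channels, resolution, resolution))
             for _ in self.plane_axes]
        )
        # Small MLP decodes fused plane features into a per-point 3D offset
        # that corrects inconsistencies in the generative observations.
        self.head = nn.Sequential(
            nn.Linear(channels, 64), nn.ReLU(), nn.Linear(64, 3)
        )

    def forward(self, xyzt):
        # xyzt: (N, 4) points with coordinates normalized to [-1, 1].
        feats = 1.0
        for (a, b), plane in zip(self.plane_axes, self.planes):
            uv = xyzt[:, [a, b]].view(1, -1, 1, 2)            # (1, N, 1, 2) grid
            sampled = F.grid_sample(plane, uv, align_corners=True)
            feats = feats * sampled.view(plane.shape[1], -1).t()  # (N, C) Hadamard fusion
        return self.head(feats)                                # (N, 3) distortion offsets

field = SpatioTemporalDistortionField()
offsets = field(torch.rand(1024, 4) * 2 - 1)  # corrections for 1024 sampled points
```

In a factorization like this, the spatio-temporal planes (xt, yt, zt) let the field express corrections that drift over time, while the purely spatial planes capture distortion shared across frames.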
If this is right
- 4D reconstruction of dynamic scenes becomes practical without dense synchronized camera arrays.
- Spatio-temporally consistent high-fidelity renderings are produced directly from inconsistent sparse inputs.
- The pipeline outperforms existing methods on multi-camera dynamic scene benchmarks.
Where Pith is reading between the lines
- The same distortion-field idea could be tested on other generative inputs such as video diffusion outputs for static scenes.
- Removing the need for calibration opens the door to casual capture setups using handheld devices.
- Extending the field to handle lighting changes or view-dependent effects would be a direct next step.
Load-bearing premise
Generative observations contain enough reliable signal that one learned distortion field can remove their inconsistencies without introducing new artifacts or losing detail in the final 4D model.
What would settle it
A test set of dynamic scenes where the generative observations contain inconsistencies that vary independently across space and time in ways no single field can model, leading to visible artifacts or loss of fidelity in the reconstructed output.
Original abstract
High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. Dependence on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches. Project page available at https://inspatio.github.io/sparse-cam4d/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes SparseCam4D, a framework for high-quality 4D reconstruction of dynamic scenes from sparse and uncalibrated cameras by exploiting inconsistent generative observations. Its key innovation is the Spatio-Temporal Distortion Field, a unified model for correcting spatial and temporal inconsistencies in these observations, which is integrated into a complete pipeline for consistent 4D output. The method is evaluated on multi-camera dynamic scene benchmarks and reported to outperform existing approaches in producing spatio-temporally consistent high-fidelity renderings.
Significance. If the central claim holds, the work would be significant for practical 4D reconstruction by reducing dependence on expensive dense synchronized camera arrays. The Spatio-Temporal Distortion Field offers a potentially general mechanism for handling generative observation noise, which could extend to other sparse-view dynamic modeling tasks if supported by rigorous sparse-input validation.
Major comments (1)
- [Experiments] Experiments section (as summarized in the abstract): the central claim is that the Spatio-Temporal Distortion Field enables high-fidelity 4D reconstruction specifically from sparse, uncalibrated cameras. However, evaluation is reported only on standard 'multi-camera dynamic scene benchmarks' with no camera-count ablations, no results for 2–5 camera subsets, and no analysis of performance degradation as view density decreases. Standard benchmarks typically employ 10–20+ views, so the reported outperformance may derive from residual overlap rather than the proposed correction mechanism, leaving the sparse-input claim unverified.
Minor comments (2)
- [Abstract] Abstract: the claim of 'significantly outperforming existing approaches' is stated without any quantitative metrics, error bars, or ablation details, which weakens the ability to assess result strength from the summary alone.
- [Abstract] Abstract and methods: the optimization procedure for the Spatio-Temporal Distortion Field (e.g., loss terms, training schedule, or how it interacts with the 4D reconstruction pipeline) is not described, even at a high level.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the sparse-input claim requires stronger experimental support and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Experiments] Experiments section (as summarized in the abstract): the central claim is that the Spatio-Temporal Distortion Field enables high-fidelity 4D reconstruction specifically from sparse, uncalibrated cameras. However, evaluation is reported only on standard 'multi-camera dynamic scene benchmarks' with no camera-count ablations, no results for 2–5 camera subsets, and no analysis of performance degradation as view density decreases. Standard benchmarks typically employ 10–20+ views, so the reported outperformance may derive from residual overlap rather than the proposed correction mechanism, leaving the sparse-input claim unverified.
Authors: We acknowledge the validity of this observation. The current experiments follow the standard protocol of the cited multi-camera dynamic scene benchmarks to enable direct comparison with prior work. However, these benchmarks do not explicitly isolate performance under 2–5 view regimes. To address this, we will add a dedicated camera-count ablation study in the revised manuscript, reporting quantitative metrics (PSNR, SSIM, LPIPS, and temporal consistency) for subsets of 2, 3, 4, and 5 cameras drawn from the same benchmark sequences. We will also include a plot showing performance degradation as view density decreases and discuss how the Spatio-Temporal Distortion Field mitigates inconsistencies even in these low-view regimes. This addition will directly verify the sparse-input claim.
Revision: yes
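For concreteness, the metric side of the promised ablation can be sketched directly. A minimal example follows, assuming images in [0, 1]: PSNR is computed by hand, SSIM via scikit-image (0.19+ for the channel_axis argument), and LPIPS via the lpips package of Zhang et al. [49]. The random arrays stand in for real renders; the per-subset training and rendering calls belong to the authors' pipeline and are not reproduced here.

```python
# Minimal sketch of the metric harness for a camera-count ablation.
import numpy as np
import torch
import lpips  # perceptual metric from Zhang et al. [49]
from skimage.metrics import structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # AlexNet-backed perceptual distance

def evaluate(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: float32 HxWx3 images in [0, 1]; returns (PSNR, SSIM, LPIPS)."""
    mse = np.mean((pred - gt) ** 2)
    psnr = 10 * np.log10(1.0 / mse)
    ssim = structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)
    # lpips expects NCHW tensors scaled to [-1, 1]
    to_t = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None] * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp

# Stand-in data; in the real ablation these would be held-out renders from
# models trained on 2-, 3-, 4-, and 5-camera subsets of the benchmark scenes.
pred = np.random.rand(128, 128, 3).astype(np.float32)
gt = np.clip(pred + 0.05 * np.random.randn(128, 128, 3).astype(np.float32), 0, 1)
print("PSNR/SSIM/LPIPS:", evaluate(pred, gt))
```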
Circularity Check
No circularity: Spatio-Temporal Distortion Field introduced as independent modeling construct
Full rationale
The paper's central claim introduces the Spatio-Temporal Distortion Field as a new unified mechanism for correcting inconsistencies in generative observations, without any quoted equations or steps that reduce the field definition to a reparameterization of the input data itself. No self-citations are invoked as load-bearing uniqueness theorems, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The derivation chain remains self-contained because the field is presented as an additive modeling tool whose parameters are learned from the observations rather than presupposed by them. Evaluation concerns about camera sparsity are separate from circularity and do not affect the score.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: generative observations supply abundant data whose spatial and temporal inconsistencies can be corrected by a single learned distortion field.
Invented entities (1)
- Spatio-Temporal Distortion Field (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
The relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Benjamin Attal, Jia-Bin Huang, Christian Richardt, Michael Zollhoefer, Johannes Kopf, Matthew O’Toole, and Changil Kim. Hyperreel: High-fidelity 6-dof video with ray-conditioned sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16610–16620, 2023.
- [2] Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22875–22889, 2025.
- [3] Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025.
- [4] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.
- [5] Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, and Baoquan Chen. 4d-rotor gaussian splatting: towards efficient novel view synthesis for dynamic scenes. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
- [6] Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, et al. Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds. arXiv preprint arXiv:2403.20309, 2024.
- [7] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023.
- [8] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.
- [9] Shoukang Hu, Tao Hu, and Ziwei Liu. Gauhuman: Articulated gaussian splatting from monocular human videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20418–20431, 2024.
- [10] Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, and Xiaowei Zhou. Diffuman4d: 4d consistent human view synthesis from sparse-view videos with spatio-temporal diffusion models. arXiv preprint arXiv:2507.13344, 2025.
- [11] Xinyi Jing, Tao Yu, Renyuan He, Yu-Kun Lai, and Kun Li. Frnerf: Fusion and regularization fields for dynamic view synthesis. Computational Visual Media, 2025.
- [12] Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. arXiv preprint arXiv:2405.17421, 2024.
- [13] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5521–5531, 2022.
- [14] Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime gaussian feature splatting for real-time dynamic view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8508–8520, 2024.
- [15] Yiming Liang, Tianhan Xu, and Yuta Kikuchi. Himor: Monocular deformable gaussian reconstruction with hierarchical motion representation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 886–895, 2025.
- [16] Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21136–21145, 2024.
- [17] Fangfu Liu, Diankun Wu, Yi Wei, Yongming Rao, and Yueqi Duan. Sherpa3d: Boosting high-fidelity text-to-3d generation via coarse 3d prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20763–20774, 2024.
- [18] Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lyv, Peng Wang, Wenping Wang, and Junhui Hou. Modgs: Dynamic gaussian splatting from casually-captured monocular videos with depth priors, 2025.
- [19] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9970–9980, 2024.
- [20] Richard A Newcombe, Dieter Fox, and Steven M Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 343–352, 2015.
- [21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [22] Jongmin Park, Minh-Quan Viet Bui, Juan Luis Gonzalez Bello, Jaeho Moon, Jihyong Oh, and Munchurl Kim. Splinegs: Robust motion-adaptive spline for real-time dynamic 3d gaussians from monocular video. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26866–26875, 2025.
- [23] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
- [24] Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Animatable neural radiance fields for modeling dynamic human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14314–14323, 2021.
- [25] Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9054–9063, 2021.
- [26] Neus Sabater, Guillaume Boisson, Benoit Vandame, Paul Kerbiriou, Frederic Babon, Matthieu Hog, Remy Gendrot, Tristan Langlois, Olivier Bureller, Arno Schubert, et al. Dataset and pipeline for multi-view light-field video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 30–40, 2017.
- [27] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [28] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
- [29] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
- [30] Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, and Yikai Wang. Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928, 2024.
- [31] Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Avalon Vinella, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergey Korolev, et al. 4real-video-v2: Fused view-time attention and feedforward reconstruction for 4d scene generation. arXiv preprint arXiv:2506.18839, 2025.
- [32] Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video, 2024.
- [33] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- [34] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
- [35] Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, and Deva Ramanan. Monofusion: Sparse-view 4d reconstruction via monocular fusion. arXiv preprint arXiv:2507.23782, 2025.
- [36] Chung-Yi Weng, Brian Curless, Pratul P Srinivasan, Jonathan T Barron, and Ira Kemelmacher-Shlizerman. Humannerf: Free-viewpoint rendering of moving people from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16210–16220, 2022.
- [37] Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yue Qian, Xiaohang Zhan, and Yueqi Duan. 4d-fly: Fast 4d reconstruction from a single monocular video. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16663–16673, 2025.
- [38] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20310–20320, 2024.
- [39] Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, and Kaisheng Ma. Unique3d: High-quality and efficient 3d mesh generation from a single image. Advances in Neural Information Processing Systems, 37:125116–125141, 2024.
- [40] Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26057–26068, 2025.
- [41] Tong Wu, Yu-Jie Yuan, Ling-Xiao Zhang, Jie Yang, Yan-Pei Cao, Ling-Qi Yan, and Lin Gao. Recent advances in 3d gaussian splatting. Computational Visual Media, 10(4):613–642, 2024.
- [42] Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat. Camco: Camera-controllable 3d-consistent image-to-video generation. arXiv preprint arXiv:2406.02509, 2024.
- [43] Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642, 2023.
- [44] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20331–20341, 2024.
- [45] Junliang Ye, Fangfu Liu, Qixiu Li, Zhengyi Wang, Yikai Wang, Xinzhou Wang, Yueqi Duan, and Jun Zhu. Dreamreward: Text-to-3d generation with human preference. In European Conference on Computer Vision, pages 259–276. Springer, 2024.
- [46] Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5336–5345, 2020.
- [47] Tanveer Younis and Zhanglin Cheng. Sparse-view 3d reconstruction: Recent advances and open challenges, 2025.
- [48] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024.
- [49] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
Supplementary material excerpts
- Implementation Details. Our framework is implemented in PyTorch, using the RealTime4DGS codebase as the foundation. To assess the versatility of our method, we did not apply dataset-specific tuning; instead, a unified training schedule was adopted across all datasets. For the optimization of 4D Gaussians, we strictly follow the official RealTime4DGS ...
- Metrics selection in Ablation Study. In the ablation study, SSIM and LPIPS are adopted as the primary evaluation metrics instead of PSNR. This choice is motivated by two considerations. First, SSIM and LPIPS are more sensitive to structural details, with LPIPS in particular capturing perceptual differences in texture, sharpness, and local geometry, where...
- Evaluation under Different Sparsity Levels. We compare our method with the baseline 4DGS approach, i.e. 4DGaussian, under different sparsity levels, and the results are reported in Tab. 5. Our method outperforms the baseline across all camera-view configurations, further demonstrating its practical robustness. Table 5. Pe...
- More Details for Generated Images. In our main experiments, we adopt ViewCrafter to provide additional observations, from which 20–25 generated views are uniformly sampled for training. ViewCrafter conditions its diffusion model on point-cloud renderings obtained from Dust3R, given two input images along with the target trajectory. However, we find that ...
- More visualization. To gain an intuitive understanding of the STDF training results, we visualize the full feature maps (16 channels) of each plane in the CoffeeMartini scene in Fig. 12. The activated regions vary across different dimensions, indicating the intertwined nature of spatial and temporal deformations. To further interpret the outputs, we ren...
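The "More visualization" excerpt describes tiling the 16 feature channels of each learned plane. A minimal matplotlib sketch of that kind of inspection is below, using stand-in random data shaped like the hypothetical field sketched earlier in this review rather than the authors' trained planes.

```python
# Minimal sketch of per-plane feature inspection: tile the 16 channels of one
# plane as a grid of heatmaps. The plane layout follows the hypothetical
# distortion-field sketch above, not the authors' released code.
import matplotlib.pyplot as plt
import torch

plane = torch.randn(1, 16, 64, 64)  # stand-in for one trained feature plane
fig, axes = plt.subplots(4, 4, figsize=(8, 8))
for ch, ax in enumerate(axes.flat):
    ax.imshow(plane[0, ch].numpy(), cmap="viridis")  # one channel per tile
    ax.set_title(f"ch {ch}", fontsize=8)
    ax.axis("off")
fig.suptitle("STDF plane feature channels (stand-in data)")
plt.tight_layout()
plt.savefig("stdf_plane_channels.png", dpi=150)
```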