Recognition: no theorem link
ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video
Pith reviewed 2026-05-10 16:52 UTC · model grok-4.3
The pith
A feedforward neural network reconstructs 3D geometry, appearance, and physical attributes of non-rigid objects from a single monocular video.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReconPhys is the first feedforward framework that jointly learns physical attribute estimation and 3D Gaussian Splatting reconstruction from a single monocular video. It employs a dual-branch architecture trained via a self-supervised strategy that eliminates the need for ground-truth physics labels. Given a video sequence, ReconPhys simultaneously infers geometry, appearance, and physical attributes, achieving higher accuracy in future prediction than optimization baselines while reducing inference time from hours to under one second.
What carries the argument
Dual-branch architecture combining a 3D Gaussian Splatting branch for geometry and appearance with a parallel physical attribute estimation branch, trained end-to-end in a self-supervised manner on synthetic video data.
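The review does not specify layer sizes or module names, but the dual-branch idea — a shared video encoding feeding one head for Gaussian parameters and one for physical attributes — can be sketched minimally in numpy. All widths, the pooling choice, and the attribute names below are illustrative assumptions, not the paper's design:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU, used for both heads."""
    return np.maximum(x @ w1, 0.0) @ w2

# Shared video encoder output: one feature vector per input frame,
# pooled over time into a single latent code (a hypothetical choice).
T, D = 8, 64                      # frames, feature width
feats = rng.normal(size=(T, D))
pooled = feats.mean(axis=0)

# Branch 1: per-Gaussian parameters for 3D Gaussian Splatting
# (position 3 + scale 3 + rotation 4 + opacity 1 + color 3 = 14 values).
N = 1024
w1_g, w2_g = rng.normal(size=(D, 128)), rng.normal(size=(128, N * 14))
gaussians = mlp(pooled, w1_g, w2_g).reshape(N, 14)

# Branch 2: physical attributes (e.g. two stiffness/damping-like scalars
# per Gaussian -- names and count are assumptions for illustration).
w1_p, w2_p = rng.normal(size=(D, 128)), rng.normal(size=(128, N * 2))
physics = mlp(pooled, w1_p, w2_p).reshape(N, 2)

print(gaussians.shape, physics.shape)  # (1024, 14) (1024, 2)
```

Both heads read the same pooled code, which is what makes the estimate feedforward: a single pass produces geometry, appearance, and physics jointly, with no per-scene optimization loop.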
If this is right
- Produces simulation-ready 3D assets directly from video without per-scene optimization or manual annotation.
- Enables rapid inference that supports downstream tasks in robotics and graphics pipelines.
- Outperforms existing optimization methods in both accuracy of future frame prediction and speed of reconstruction on synthetic test data.
- Removes the requirement for ground-truth physical labels during training.
Where Pith is reading between the lines
- The method could be extended to handle multi-object interactions or partial occlusions if the synthetic training distribution is broadened accordingly.
- If the inferred physical attributes transfer to real scenes, they could serve as initialization for more precise refinement in hybrid optimization-plus-learning pipelines.
- The approach opens the possibility of building large-scale datasets of physically annotated 3D assets from consumer video without expensive capture setups.
Load-bearing premise
Self-supervised training on synthetic videos is enough to produce physical attribute estimates that remain accurate when applied to real videos and support reliable future physical predictions.
What would settle it
Apply the trained model to real monocular videos of non-rigid objects whose physical behavior is known or can be measured independently; check whether the predicted future frames or simulated dynamics match the actual observed motion.
Original abstract
Reconstructing non-rigid objects with physical plausibility remains a significant challenge. Existing approaches leverage differentiable rendering for per-scene optimization, recovering geometry and dynamics but requiring expensive tuning or manual annotation, which limits practicality and generalizability. To address this, we propose ReconPhys, the first feedforward framework that jointly learns physical attribute estimation and 3D Gaussian Splatting reconstruction from a single monocular video. Our method employs a dual-branch architecture trained via a self-supervised strategy, eliminating the need for ground-truth physics labels. Given a video sequence, ReconPhys simultaneously infers geometry, appearance, and physical attributes. Experiments on a large-scale synthetic dataset demonstrate superior performance: our method achieves 21.64 PSNR in future prediction compared to 13.27 by state-of-the-art optimization baselines, while reducing Chamfer Distance from 0.349 to 0.004. Crucially, ReconPhys enables fast inference (<1 second) versus hours required by existing methods, facilitating rapid generation of simulation-ready assets for robotics and graphics.
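The abstract's headline numbers use PSNR (rendered-image fidelity) and Chamfer Distance (geometric accuracy). A minimal numpy sketch of both metrics follows, under common conventions; the abstract does not specify the paper's exact normalization or Chamfer variant, so the squared-distance symmetric form below is an assumption:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB between images in [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean squared nearest-neighbour distance, summed over both directions."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

gt = np.zeros((16, 16))
pred = gt + 0.1                  # uniform per-pixel error of 0.1
print(round(psnr(pred, gt), 2))  # 20.0 dB

a = np.array([[0.0, 0.0, 0.0]])
b = np.array([[3.0, 4.0, 0.0]])  # one point, 5 units away
print(chamfer(a, b))             # 50.0 (25.0 in each direction)
```

On these scales, the claimed jump from 13.27 to 21.64 PSNR corresponds to a large drop in pixel error, and a Chamfer Distance of 0.004 versus 0.349 implies near-exact surface recovery on the synthetic test set.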
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ReconPhys, the first feedforward dual-branch framework to jointly estimate physical attributes and reconstruct 3D Gaussian Splatting geometry plus appearance from a single monocular video. It uses a self-supervised training strategy that avoids ground-truth physics labels. On a large-scale synthetic dataset, it reports 21.64 PSNR for future-frame prediction (vs. 13.27 for optimization baselines) and Chamfer Distance reduced to 0.004 (vs. 0.349), with inference under 1 second versus hours for prior methods.
Significance. If the quantitative gains and self-supervised generalization hold beyond the synthetic setting, the work would be significant: it replaces expensive per-scene optimization with a fast learned model, directly enabling simulation-ready assets for robotics and graphics. The avoidance of physics-label supervision is a clear practical strength.
major comments (2)
- [Abstract] The central claim of practical utility for 'rapid generation of simulation-ready assets for robotics and graphics' rests on generalization from synthetic training to real videos, yet the abstract reports experiments exclusively on synthetic data and provides no real-world test results or domain-gap analysis; this is load-bearing for the asserted practicality.
- [Abstract] The reported PSNR and Chamfer Distance improvements are presented without any description of the dual-branch architecture, the self-supervised loss terms, or the precise re-implementation and hyper-parameter tuning of the 'state-of-the-art optimization baselines'; without these details the numerical gains cannot be independently verified or attributed to the proposed method.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We have revised the manuscript to improve precision in our claims and to enhance verifiability of the reported results while preserving the core contributions.
Point-by-point responses
-
Referee: [Abstract] The central claim of practical utility for 'rapid generation of simulation-ready assets for robotics and graphics' rests on generalization from synthetic training to real videos, yet the abstract reports experiments exclusively on synthetic data and provides no real-world test results or domain-gap analysis; this is load-bearing for the asserted practicality.
Authors: We agree that the abstract should more precisely reflect the experimental scope. In the revised manuscript we have updated the abstract to state explicitly that all quantitative results are obtained on a large-scale synthetic dataset. The language on practical utility has been moderated from a direct assertion to 'promising for enabling rapid generation of simulation-ready assets'. A dedicated paragraph discussing the synthetic-to-real domain gap, the challenges of acquiring real-world physics ground truth, and planned future validation has been added to the Discussion section. revision: yes
-
Referee: [Abstract] The reported PSNR and Chamfer Distance improvements are presented without any description of the dual-branch architecture, the self-supervised loss terms, or the precise re-implementation and hyper-parameter tuning of the 'state-of-the-art optimization baselines'; without these details the numerical gains cannot be independently verified or attributed to the proposed method.
Authors: The abstract is intentionally concise. Full details of the dual-branch architecture appear in Section 3.1, the self-supervised loss formulation and training strategy are given in Section 3.2, and the baseline re-implementations together with hyper-parameter choices are described in Section 4.2 and the supplementary material. To improve immediate readability we have added a short clause in the abstract mentioning the dual-branch feedforward design and self-supervised training. We maintain that the body of the paper supplies the information needed for independent verification. revision: partial
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents ReconPhys as a self-supervised feedforward network that jointly predicts 3D Gaussian Splatting parameters and physical attributes from monocular video input. The self-supervised loss is defined on reconstruction and future-frame consistency without access to ground-truth physics labels, and the quantitative claims (21.64 PSNR future prediction, 0.004 Chamfer distance) are measured on held-out synthetic sequences rather than being algebraically identical to any fitted parameter or training objective. No equation reduces a claimed prediction to an input by construction, no uniqueness theorem is imported from prior self-work to force the architecture, and no ansatz is smuggled via citation. The derivation therefore remains self-contained: the model architecture and training objective are stated independently of the reported test metrics, which serve as external empirical evidence rather than tautological restatements.
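The rationale above turns on the loss being defined only on reconstruction and future-frame consistency, with no physics labels anywhere in the objective. That structure can be sketched minimally; the plain L2 form and the weighting `lam` are assumptions for illustration, not the paper's stated formulation:

```python
import numpy as np

def self_supervised_loss(rendered, observed, predicted_future,
                         observed_future, lam=1.0):
    """Photometric reconstruction on observed frames plus a consistency
    term on held-out future frames. No ground-truth physical attribute
    appears in either term, which is the self-supervision claim."""
    recon = np.mean((rendered - observed) ** 2)
    future = np.mean((predicted_future - observed_future) ** 2)
    return recon + lam * future

obs = np.zeros((4, 8, 8))        # 4 observed frames
ren = obs + 0.1                  # renders off by a constant 0.1
fut_obs = np.zeros((2, 8, 8))    # 2 held-out future frames
fut_pred = fut_obs + 0.2         # rollout off by a constant 0.2
print(round(self_supervised_loss(ren, obs, fut_pred, fut_obs), 3))  # 0.05
```

The circularity check then amounts to observing that the test metrics (future-frame PSNR, Chamfer Distance on held-out sequences) are computed outside this objective, so they are not restatements of what the loss already minimizes on the training set.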
Axiom & Free-Parameter Ledger
free parameters (1)
- neural network weights
axioms (1)
- Domain assumption: Self-supervised losses derived from video consistency are sufficient to recover accurate physical attributes.
Reference graph
Works this paper leans on
- [1] Aljaž Božič, Pablo Palafox, Michael Zollhöfer, Justus Thies, Angela Dai, and Matthias Nießner. Neural deformation graphs for globally-consistent non-rigid reconstruction. CVPR, 2021.
- [2] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [3] Devikalyan Das, Christopher Wewer, Raza Yunus, Eddy Ilg, and Jan Eric Lenssen. Neural parametric Gaussians for monocular non-rigid object reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10715–10725, 2024.
- [4] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-XL: A universe of 10M+ 3D objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023.
- [5] Danny Driess, Zhiao Huang, Yunzhu Li, Russ Tedrake, and Marc Toussaint. Learning multi-object dynamics with compositional neural radiance fields. In Conference on Robot Learning, pages 1755–1768. PMLR, 2023.
- [6] Tao Du, Kui Wu, Pingchuan Ma, Sebastien Wah, Andrew Spielberg, Daniela Rus, and Wojciech Matusik. DiffPD: Differentiable projective dynamics. ACM Transactions on Graphics (TOG), 41(2):1–21, 2021.
- [7] Zhiyang Guo, Wengang Zhou, Li Li, Min Wang, and Houqiang Li. Motion-aware 3D Gaussian splatting for efficient dynamic scene reconstruction. IEEE Transactions on Circuits and Systems for Video Technology, 35(4):3119–3133, 2024.
- [8] Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.
- [9] Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. SC-GS: Sparse-controlled Gaussian splatting for editable dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4220–4230, 2024.
- [10] Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang, Hsin-Ni Yu, Shenlong Wang, and Yunzhu Li. PhysTwin: Physics-informed reconstruction and simulation of deformable objects from videos. arXiv preprint arXiv:2503.17973, 2025.
- [11] Ying Jiang, Chang Yu, Tianyi Xie, Xuan Li, Yutao Feng, Huamin Wang, Minchen Li, Henry Lau, Feng Gao, Yin Yang, et al. VR-GS: A physical dynamics-aware interactive Gaussian splatting system in virtual reality. In ACM SIGGRAPH 2024 Conference Papers, pages 1–1, 2024.
- [12] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
- [13] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
- [14] Brett Koonce. ResNet 50. In Convolutional Neural Networks with Swift for TensorFlow: Image Recognition and Dataset Categorization, pages 63–72. Springer, 2021.
- [15] Agelos Kratimenos, Jiahui Lei, and Kostas Daniilidis. DynMF: Neural motion factorization for real-time dynamic view synthesis with 3D Gaussian splatting. In European Conference on Computer Vision, pages 252–269. Springer, 2024.
- [16] Xuan Li, Yi-Ling Qiao, Peter Yichen Chen, Krishna Murthy Jatavallabhula, Ming Lin, Chenfanfu Jiang, and Chuang Gan. PAC-NeRF: Physics augmented continuum neural radiance fields for geometry-agnostic system identification. arXiv preprint arXiv:2303.05512, 2023.
- [17] Xuanlin Li, Kyle Hsu, Jiayuan Gu, Oier Mees, Karl Pertsch, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation. In Conference on Robot Learning, pages 3705–3728. PMLR, 2025.
- [18] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6498–6508, 2021.
- [19] Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. Gaussian-Flow: 4D reconstruction with dynamic 3D Gaussian particle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21136–21145, 2024.
- [20] Yuchen Lin, Chenguo Lin, Jianjin Xu, and Yadong Mu. OmniPhysGS: 3D constitutive Gaussians for general physics-based dynamics generation. arXiv preprint arXiv:2501.18982, 2025.
- [21] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: Tracking by persistent dynamic view synthesis. In 2024 International Conference on 3D Vision (3DV), pages 800–809. IEEE, 2024.
- [22] Pingchuan Ma, Peter Yichen Chen, Bolei Deng, Joshua B. Tenenbaum, Tao Du, Chuang Gan, and Wojciech Matusik. Learning neural constitutive laws from motion observations for generalizable PDE dynamics. In International Conference on Machine Learning, pages 23279–23300. PMLR, 2023.
- [23] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
- [24] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021.
- [25] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. HyperNeRF: A higher-dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228, 2021.
- [26] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.
- [27] Maziar Raissi, Paris Perdikaris, and George E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
- [28] Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter Battaglia. Learning to simulate complex physics with graph networks. In International Conference on Machine Learning, pages 8459–8468. PMLR, 2020.
- [29] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12959–12970, 2021.
- [30] Edith Tretschk, Navami Kairanda, Mallikarjun B R, Rishabh Dabral, Adam Kortylewski, Bernhard Egger, Marc Habermann, Pascal Fua, Christian Theobalt, and Vladislav Golyanik. State of the art in dense monocular non-rigid 3D reconstruction. In Computer Graphics Forum, volume 42, pages 485–520. Wiley Online Library, 2023.
- [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [32] Boyuan Wang, Xinpan Meng, Xiaofeng Wang, Zheng Zhu, Angen Ye, Yang Wang, Zhiqin Yang, Chaojun Ni, Guan Huang, and Xingang Wang. EmbodieDreamer: Advancing real2sim2real transfer for policy training via embodied world modeling. arXiv preprint arXiv:2507.05198, 2025.
- [33] Boyuan Wang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Xiaopei Zhang, Guan Huang, Yijie Ren, Lihong Liu, et al. HumanDreamer-X: Photorealistic single-image human avatars reconstruction via Gaussian restoration. arXiv preprint arXiv:2504.03536, 2025.
- [34] Chaoyang Wang, Lachlan Ewen MacDonald, Laszlo A. Jeni, and Simon Lucey. Flow supervision for deformable NeRF. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21128–21137, 2023.
- [35] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4D Gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20310–20320, 2024.
- [36] Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3D latents for scalable and versatile 3D generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21469–21480, 2025.
- [37] Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. PhysGaussian: Physics-integrated 3D Gaussians for generative dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4389–4398, 2024.
- [38] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [39] Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4D Gaussian splatting. arXiv preprint arXiv:2310.10642, 2023.
- [40] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20331–20341, 2024.
- [41] Heng Yu, Joel Julin, Zoltán Á. Milacski, Koichiro Niinuma, and László A. Jeni. CoGS: Controllable Gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21624–21633, 2024.
- [42] Mingtong Zhang, Kaifeng Zhang, and Yunzhu Li. Dynamic 3D Gaussian tracking for graph-based neural dynamics modeling. arXiv preprint arXiv:2410.18912, 2024.
- [43] Zerong Zheng, Xiaochen Zhao, Hongwen Zhang, Boning Liu, and Yebin Liu. AvatarReX: Real-time expressive full-body avatars. ACM Transactions on Graphics (TOG), 42(4):1–19, 2023.
- [44] Licheng Zhong, Hong-Xing Yu, Jiajun Wu, and Yunzhu Li. Reconstruction and simulation of elastic objects with spring-mass 3D Gaussians. In European Conference on Computer Vision, pages 407–423. Springer, 2024.