MatPhys: Learning Material-Aware Physics Parameters for Deformable Object Simulation from Videos
Pith reviewed 2026-05-20 06:09 UTC · model grok-4.3
The pith
MatPhys decomposes objects into material parts with DINO features and uses a shared codebook to predict consistent spring-mass parameters from single-view video.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MatPhys predicts spring-mass parameters from a single-view video by first using DINO features to decompose the object into parts that receive their own material assignments and then constraining those assignments through a learned codebook of shared material embeddings so that identical materials produce identical physics parameters across different scenes and interactions.
What carries the argument
DINO-based part decomposition paired with a learned material codebook that functions as a reference distribution to regularize the parameter decoder for cross-scene consistency.
If this is right
- Objects composed of multiple materials can be simulated without forcing a single uniform parameter set.
- The same material receives nearly identical spring-mass values when observed in separate videos or under new interactions.
- Reconstruction fidelity and forward prediction accuracy remain comparable to per-scene optimization baselines.
- Performance on previously unseen objects and interactions improves because parameters are grounded in reusable material concepts rather than scene-specific fitting.
Where Pith is reading between the lines
- Large collections of video data could be mined to build a reusable library of material parameters for downstream simulation tasks.
- Robotics perception pipelines might adopt the feed-forward path for rapid on-the-fly estimation of object physics from casual camera footage.
- The consistency mechanism could be transferred to other deformable simulation models such as finite-element or position-based dynamics.
Load-bearing premise
DINO features reliably separate objects into parts whose visual signatures correspond to distinct material behaviors, and the material codebook supplies a prior that enforces consistency without adding new biases or overfitting.
What would settle it
Record two videos of the identical physical object undergoing different interactions, run the model on each, and check whether the predicted parameters produce matching simulated trajectories that both agree with the real recorded motion.
Figures
read the original abstract
Reconstructing simulation-ready deformable objects is important for vision, graphics, and robotics. Existing physics-driven methods can recover physical digital twins from videos, but they suffer from two fundamental limitations: they typically assume a homogeneous material across the whole object, and their scene-specific inverse optimization, combined with the inherent ambiguity of monocular observation, yields inconsistent parameters for the same material across different scenes or interactions. We propose MatPhys, a material-aware feed-forward framework that predicts spring-mass parameters from a single-view video, addressing these two issues with two coupled designs. To relax the homogeneous material assumption, we use DINO features to decompose the object into semantically meaningful parts and to query a part-level material prior, assigning each part its own physical behavior. To enforce cross-scene consistency, we introduce a learned material codebook of shared material embeddings as the bridge between appearance and physics, and further use the part-level prior as a reference distribution that constrains the decoder so that the same material yields consistent parameters across scenes and interactions. Together, these designs turn an under-constrained monocular problem into feed-forward inference grounded on shared, reusable material concepts. Experiments show that our method matches per-scene optimization baselines in reconstruction and future prediction, while achieving stronger generalization to unseen interactions and objects with more consistent physical parameters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MatPhys, a material-aware feed-forward framework that predicts spring-mass parameters for deformable object simulation from single-view videos. It relaxes the homogeneous material assumption via DINO-based semantic part decomposition and a part-level material prior, while enforcing cross-scene consistency through a learned material codebook of shared embeddings that constrains the decoder to produce reusable parameters across scenes and interactions. Experiments are claimed to show matching reconstruction and prediction performance to per-scene baselines with improved generalization and parameter consistency.
Significance. If the central claims hold, the work has moderate significance for computer vision, graphics, and robotics by converting an under-constrained monocular inverse problem into feed-forward inference grounded in shared material concepts. The combination of DINO part priors with a codebook for consistency could enable more scalable creation of simulation-ready digital twins, provided the learned embeddings prove physically grounded rather than distribution artifacts.
major comments (2)
- Abstract: The central claim that the method 'matches per-scene optimization baselines in reconstruction and future prediction' while achieving stronger generalization is load-bearing but unsupported by any quantitative metrics, tables, ablation results, or specific numbers, preventing verification of performance and consistency improvements.
- Method section on DINO-based part decomposition: The assumption that DINO features reliably produce parts whose boundaries align with distinct material behaviors (stiffness/density transitions) rather than appearance cues is a correctness risk for the part-level prior and codebook; a concrete test would be to measure overlap between DINO-derived part boundaries and ground-truth material change locations in controlled simulations with known physics transitions.
minor comments (1)
- Abstract: Clarify the exact form of the spring-mass model and output parameters (e.g., per-part stiffness, damping) to make the decoder target explicit.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major comment below with clarifications from the manuscript and indicate where revisions will be made to strengthen the presentation.
read point-by-point responses
-
Referee: Abstract: The central claim that the method 'matches per-scene optimization baselines in reconstruction and future prediction' while achieving stronger generalization is load-bearing but unsupported by any quantitative metrics, tables, ablation results, or specific numbers, preventing verification of performance and consistency improvements.
Authors: The abstract provides a concise summary of the experimental outcomes. The full quantitative support for the claim—including direct comparisons of reconstruction and prediction errors against per-scene baselines, generalization metrics on unseen interactions, and consistency measures across scenes—is presented in Section 4 with accompanying tables and ablation studies. To improve immediate verifiability, we will revise the abstract to incorporate key numerical results from those experiments. revision: yes
-
Referee: Method section on DINO-based part decomposition: The assumption that DINO features reliably produce parts whose boundaries align with distinct material behaviors (stiffness/density transitions) rather than appearance cues is a correctness risk for the part-level prior and codebook; a concrete test would be to measure overlap between DINO-derived part boundaries and ground-truth material change locations in controlled simulations with known physics transitions.
Authors: We agree that explicit validation of the alignment between DINO-derived semantic parts and actual material transitions would strengthen the justification for the part-level prior. Our current design relies on DINO features to capture semantically coherent regions that empirically correspond to distinct physical behaviors in the evaluated real-world videos, as reflected in the improved simulation fidelity and parameter consistency reported in the experiments. We will add the suggested controlled-simulation overlap analysis in the revised manuscript to directly quantify this alignment. revision: yes
Circularity Check
No significant circularity; derivation relies on learned architecture without self-referential reduction
full rationale
The provided abstract and context describe a feed-forward network that decomposes objects via DINO features, queries a part-level prior, and uses a learned material codebook to constrain a decoder for cross-scene consistency. No equations or derivation steps are exhibited that reduce the predicted spring-mass parameters directly to the training inputs by construction (e.g., no self-definitional loop where the output is the fitted codebook itself, and no 'prediction' that is statistically forced from a subset of the same data). The material codebook is presented as a learned bridge trained on data, which is a standard non-circular ML design choice rather than a tautology. Self-citation is not invoked as load-bearing, and no uniqueness theorem or ansatz smuggling is referenced. The approach is self-contained against external benchmarks via experiments comparing to per-scene optimization baselines.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption DINO features can be used to decompose objects into semantically meaningful parts that align with distinct material behaviors
invented entities (1)
-
material codebook of shared embeddings
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Large steps in cloth sim- ulation
David Baraff and Andrew Witkin. Large steps in cloth sim- ulation. InProceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pages 43– 54, 1998
work page 1998
-
[2]
Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xi- aopeng Zhang, Wei Shen, and Qi Tian. Segment any 3d gaussians. InProceedings of the AAAI Conference on Ar- tificial Intelligence, 2025
work page 2025
-
[3]
Empm: Embodied mpm for modeling and simulation of deformable objects
Yunuo Chen*, Yafei Hu*, Lingfeng Sun, Tushar Kusnur, Laura Herlant, and Chenfanfu Jiang. Empm: Embodied mpm for modeling and simulation of deformable objects. IEEE Robotics and Automation Letters (RA-L), 2026
work page 2026
-
[4]
Dynamic view synthesis from dynamic monocular video
Chen Gao, Ayush Saraf, Johannes Kopf, and Jia-Bin Huang. Dynamic view synthesis from dynamic monocular video. In Proceedings of the IEEE International Conference on Com- puter Vision, 2021
work page 2021
-
[5]
Tianyu Huang, Haoze Zhang, Yihan Zeng, Zhilu Zhang, Hui Li, Wangmeng Zuo, and Rynson W. H. Lau. Dreamphysics: Learning physics-based 3d dynamics with video diffusion priors. InAAAI Conference on Artificial Intelligence, 2024
work page 2024
-
[6]
Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes
Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[7]
Phystwin: Physics- informed reconstruction and simulation of deformable ob- jects from videos
Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang, Hsin-Ni Yu, Shenlong Wang, and Yunzhu Li. Phystwin: Physics- informed reconstruction and simulation of deformable ob- jects from videos. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 7219–7230, 2025
work page 2025
-
[8]
Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42 (4), 2023
work page 2023
-
[9]
Lerf: Language embedded radiance fields
Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, 2023
work page 2023
-
[10]
Garfield: Group anything with radiance fields
Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Gold- berg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[11]
arXiv preprint arXiv:2406.04338 (2024)
Fangfu Liu, Hanyang Wang, Shunyu Yao, Shengjun Zhang, Jie Zhou, and Yueqi Duan. Physics3d: Learning physical properties of 3d gaussians via video diffusion.arXiv preprint arXiv:2406.04338, 2024
-
[12]
Dynamic 3d gaussians: Tracking by per- sistent dynamic view synthesis
Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by per- sistent dynamic view synthesis. In3DV, 2024
work page 2024
-
[13]
Srinivasan, Matthew Tancik, Jonathan T
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis. InECCV, 2020
work page 2020
-
[14]
Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russ Howes, Po-Yao (Bernie) Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv´e J´egou, Julie...
work page 2023
-
[15]
Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla
Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021
work page 2021
-
[16]
Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin- Brualla, and Steven M Seitz
Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin- Brualla, and Steven M Seitz. Hypernerf: A higher- dimensional representation for topologically varying neural radiance fields.ACM Transactions on Graphics, 40(6):1–12, 2021
work page 2021
-
[17]
D-nerf: Neural radiance fields for dynamic scenes.arXiv preprint arXiv:2011.13961, 2020
Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes.arXiv preprint arXiv:2011.13961, 2020
-
[18]
Langsplat: 3d language gaussian splatting
Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[19]
A material point method for snow simulation.ACM Transactions on Graphics, 32(4):1–10, 2013
Alexey Stomakhin, Craig Schroeder, Lawrence Chai, Joseph Teran, and Andrew Selle. A material point method for snow simulation.ACM Transactions on Graphics, 32(4):1–10, 2013
work page 2013
-
[20]
Demetri Terzopoulos, John Platt, Alan Barr, and Kurt Fleis- cher. Elastically deformable models. InProceedings of the 14th Annual Conference on Computer Graphics and Inter- active Techniques, pages 205–214, 1987
work page 1987
-
[21]
VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training
Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. InAdvances in Neural Information Processing Systems, 2022
work page 2022
-
[22]
Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollh¨ofer, Christoph Lassner, and Christian Theobalt. Non- rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12959–12970, 2021
work page 2021
-
[23]
ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video
Boyuan Wang, Xiaofeng Wang, Yongkang Li, Zheng Zhu, Yifan Chang, Angen Ye, Guosheng Zhao, Chaojun Ni, Guan Huang, Yijie Ren, et al. Reconphys: Reconstruct appearance and physical attributes from single video.arXiv preprint arXiv:2604.07882, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[24]
4d gaussian splatting for real-time dynamic scene rendering
Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[25]
Native and compact structured latents for 3d generation.Tech report, 2025
Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, and Jiaolong Yang. Native and compact structured latents for 3d generation.Tech report, 2025
work page 2025
-
[26]
Physgaussian: Physics- integrated 3d gaussians for generative dynamics
Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. Physgaussian: Physics- integrated 3d gaussians for generative dynamics. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4389–4398, 2024
work page 2024
-
[27]
Qingshan Xu, Jiao Liu, Shangshu Yu, Yuxuan Wang, Yuan Zhou, Junbao Zhou, Jiequan Cui, Yew-Soon Ong, and Han- wang Zhang. Neuspring: Neural spring fields for recon- struction and simulation of deformable objects from videos. InProceedings of the AAAI Conference on Artificial Intelli- gence, pages 11361–11369, 2026
work page 2026
-
[28]
Yu Yang, Zhilu Zhang, Xiang Zhang, Yihan Zeng, Hui Li, and Wangmeng Zuo. Physworld: From real videos to world models of deformable objects via physics-aware demonstra- tion synthesis.ArXiv, abs/2510.21447, 2025
-
[29]
Deformable 3d gaussians for high- fidelity monocular dynamic scene reconstruction
Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high- fidelity monocular dynamic scene reconstruction. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20331–20341, 2024
work page 2024
-
[30]
Gaussian grouping: Segment and edit anything in 3d scenes
Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. InEuropean Conference on Computer Vision, pages 162– 179, 2024
work page 2024
-
[31]
Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyun Oh Song, Guangya Cai, Zhuo Xu, Xiaochen Hu, Changxi Zheng, and Yunzhu Li. Real-to-sim robot policy evaluation with gaussian splatting simulation of soft-body in- teractions.ArXiv, abs/2511.04665, 2025
-
[32]
Dynamic 3d gaussian tracking for graph-based neural dynamics mod- eling
Mingtong Zhang, Kaifeng Zhang, and Yunzhu Li. Dynamic 3d gaussian tracking for graph-based neural dynamics mod- eling. In8th Annual Conference on Robot Learning, 2024
work page 2024
-
[33]
Efficient physics simulation for 3d scenes via mllm-guided gaussian splatting
Haoyu Zhao, Hao Wang, Xingyue Zhao, Hao Fei, Hongqiu Wang, Chengjiang Long, and Hua Zou. Efficient physics simulation for 3d scenes via mllm-guided gaussian splatting. arXiv preprint arXiv:2411.12789, 2024
-
[34]
Licheng Zhong, Hong-Xing Yu, Jiajun Wu, and Yunzhu Li. Reconstruction and simulation of elastic objects with spring- mass 3d gaussians.European Conference on Computer Vi- sion (ECCV), 2024
work page 2024
-
[35]
Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields
Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Ze- hao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[36]
Xiangming Zhu, Huayu Deng, Haochen Yuan, Yunbo Wang, and Xiaokang Yang. Latent intuitive physics: Learn- ing to transfer hidden physics from a 3d video.ArXiv, abs/2406.12769, 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.