H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning
Pith reviewed 2026-05-22 06:33 UTC · model grok-4.3
The pith
H-Flow estimates dense human scene flow from monocular video by using a multi-head transformer trained with physics priors on pose and depth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
H-Flow is a dense human scene flow method that jointly models skeletal kinematics and surface deformation. A unified multi-head transformer processes monocular video frames to predict flow while producing companion pose and depth maps. In the absence of direct supervision, the network is trained by encoding geometric, structural, and biomechanical priors as cross-modal consistency losses. The authors also release DynAct4D, a synthetic dataset with dense flow ground truth across varied subjects, garments, and actions. The resulting model exceeds scene-flow and parametric baselines on standard tests and generalizes zero-shot to in-the-wild footage.
What carries the argument
Unified multi-head transformer that jointly regresses scene flow, pose, and depth from video, trained via physics-inspired geometric, structural, and biomechanical cross-modal objectives.
If this is right
- The same network produces both dense surface flow and parametric pose estimates, allowing consistent motion analysis without separate pipelines.
- Performance gains appear on articulated bodies with clothing and soft tissue where generic scene flow methods degrade.
- Zero-shot transfer to in-the-wild monocular video becomes possible once the physics priors are learned.
- A new high-fidelity synthetic benchmark supplies dense flow annotations that can support further model development.
Where Pith is reading between the lines
- Similar cross-modal physics objectives could be adapted to non-human articulated objects such as animals or robots if corresponding biomechanical rules are supplied.
- The joint prediction of flow, pose, and depth may reduce error accumulation across tasks by enforcing mutual geometric consistency during inference.
- Extending the synthetic benchmark with more varied lighting or camera motion could test robustness before real-world deployment.
Load-bearing premise
Human motion can be adequately described by a combination of geometric, structural, and biomechanical priors that serve as reliable substitutes for missing dense flow labels.
What would settle it
Independent dense flow measurements obtained from calibrated multi-view capture on real human subjects would show systematic disagreement with the flow predicted by the physics-prior model.
Figures
read the original abstract
Parametric human models capture global pose but cannot represent the non-rigid surface dynamics of clothing and soft tissue. Generic scene flow estimates dense motion but breaks down on articulated bodies, where pixel-level supervision is also intractable to acquire. We introduce H-Flow, a dense human scene flow that captures both skeletal kinematics and surface deformation. A unified multi-head transformer estimates flow from monocular video, jointly predicting pose and depth as companion outputs. The challenge lies in the lack of supervision. In place of unattainable labels, we anchor the network in the physics of human motion, encoding geometric, structural, and biomechanical priors as cross-modal training objectives. We further introduce DynAct4D, a high-fidelity synthetic benchmark providing dense flow annotations across diverse subjects, garments, and motions. On standard benchmarks, H-Flow outperforms scene-flow and parametric baselines, and generalizes zero-shot to in-the-wild video. Code, models, and the DynAct4D benchmark will be released upon publication
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces H-Flow, a self-supervised method for dense human scene flow from monocular video. A unified multi-head transformer jointly predicts scene flow along with pose and depth as auxiliary outputs. In the absence of dense labels, the approach encodes geometric, structural, and biomechanical priors as cross-modal consistency objectives. The authors also release DynAct4D, a synthetic benchmark with dense flow ground truth across varied subjects, garments, and motions. Experiments claim that H-Flow outperforms both generic scene-flow estimators and parametric human-model baselines on standard benchmarks while generalizing zero-shot to in-the-wild video.
Significance. If the central claims are substantiated, the work would meaningfully advance human motion capture by addressing the inability of parametric models to represent non-rigid clothing and soft-tissue dynamics while avoiding the supervision requirements that limit generic scene flow. The joint multi-modal architecture and physics-inspired self-supervision constitute a coherent attempt to solve an acknowledged gap; the public release of DynAct4D and associated code would further strengthen the contribution.
major comments (2)
- [§4] §4 (Experiments) and associated ablation tables: the central claim that cross-modal physics priors enable the flow head to capture non-rigid surface dynamics beyond what the pose head already provides is load-bearing for both the outperformance and zero-shot generalization statements. No ablation is shown that isolates the flow head trained only with pose/depth supervision versus the full set of geometric-structural-biomechanical objectives; without this, it remains possible that reported gains are largely inherited from the pose prediction rather than independently learned dense flow.
- [§3.2] §3.2 (Loss formulations): the biomechanical and structural prior losses are described at a high level but lack explicit equations demonstrating that they impose constraints on the flow field that are not already satisfied by the pose and depth heads. If the priors reduce to re-projection or rigidity terms already implicit in the pose output, the argument that they specifically regularize clothing/soft-tissue motion would be weakened.
minor comments (2)
- [Figure 2] Figure 2 (architecture diagram): the three output heads are not visually distinguished with sufficient clarity; adding explicit labels or color coding for flow, pose, and depth would improve readability.
- [§4.1] The description of DynAct4D in §4.1 should include a brief statement on how the synthetic rendering pipeline ensures that the provided dense flow annotations are independent of the parametric body model used for pose generation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and will incorporate revisions to strengthen the presentation and empirical support for our claims.
read point-by-point responses
-
Referee: [§4] §4 (Experiments) and associated ablation tables: the central claim that cross-modal physics priors enable the flow head to capture non-rigid surface dynamics beyond what the pose head already provides is load-bearing for both the outperformance and zero-shot generalization statements. No ablation is shown that isolates the flow head trained only with pose/depth supervision versus the full set of geometric-structural-biomechanical objectives; without this, it remains possible that reported gains are largely inherited from the pose prediction rather than independently learned dense flow.
Authors: We agree that an explicit ablation isolating the contribution of the full set of cross-modal physics priors to the flow head is necessary to substantiate the central claims. In the revised manuscript we will add a new ablation (to be included as an additional row or sub-table in Section 4) that trains the flow head using only the pose and depth supervision losses and compares it directly against the complete model. Internal experiments already performed during development show measurable gains in endpoint error and non-rigid region accuracy when the geometric, structural, and biomechanical terms are included; these results will be reported together with qualitative examples that highlight improved capture of clothing and soft-tissue motion not explained by pose alone. revision: yes
-
Referee: [§3.2] §3.2 (Loss formulations): the biomechanical and structural prior losses are described at a high level but lack explicit equations demonstrating that they impose constraints on the flow field that are not already satisfied by the pose and depth heads. If the priors reduce to re-projection or rigidity terms already implicit in the pose output, the argument that they specifically regularize clothing/soft-tissue motion would be weakened.
Authors: We thank the referee for highlighting this presentational gap. In the revised manuscript we will expand Section 3.2 with the explicit loss equations. The biomechanical prior is defined as a temporal consistency term on local surface velocities derived from biomechanical joint-angle limits and acceleration bounds; it is applied directly to the dense flow field after skeletal motion is propagated to surface points. The structural prior implements a local as-rigid-as-possible penalty on small surface patches obtained from the predicted depth and flow, which operates independently of the global rigid transformations encoded by the pose head. These formulations will be accompanied by a short discussion clarifying that the priors enforce local non-rigid constraints on clothing and soft-tissue regions that are not already satisfied by standard re-projection or global rigidity terms. revision: yes
Circularity Check
No significant circularity: priors and benchmarks are external to the flow derivation
full rationale
The abstract presents geometric, structural, and biomechanical priors as independent encodings of human motion physics used as cross-modal objectives, not as quantities fitted from or defined by the flow outputs themselves. Joint prediction of pose, depth, and flow is described as an architectural choice, with performance claims supported by evaluation on standard benchmarks plus the introduced DynAct4D dataset that supplies independent dense flow annotations. No equations, self-citations, or reductions to fitted inputs appear in the provided text that would make the central claims equivalent to the inputs by construction. The method is therefore self-contained against external validation rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Physics of human motion can be encoded as geometric, structural, and biomechanical priors for cross-modal training objectives that replace labels.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
encoding geometric, structural, and biomechanical priors as cross-modal training objectives
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Skeletal-Surface Coupling ties surface flow to bone-induced motion via a tolerance margin
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Smpl: a skinned multi-person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: a skinned multi-person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015. 1, 3
work page 2015
-
[2]
Expressive body capture: 3d hands, face, and body from a single image
Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019. 3
work page 2019
-
[3]
Freeman, Rahul Sukthankar, and Cristian Sminchisescu
Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, William T. Freeman, Rahul Sukthankar, and Cristian Sminchisescu. GHUM & GHUML: Generative 3D human shape and articulated pose models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
work page 2020
-
[4]
Ahmed A. A. Osman, Timo Bolkart, and Michael J. Black. STAR: Sparse trained articulated human body regressor. InProceedings of the European Conference on Computer Vision (ECCV), 2020
work page 2020
-
[5]
Ahmed A. A. Osman, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. SUPR: A sparse unified part-based human representation. InProceedings of the European Conference on Computer Vision (ECCV), 2022
work page 2022
- [6]
-
[7]
Hu- mans in 4D: Reconstructing and tracking humans with transformers
Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Hu- mans in 4D: Reconstructing and tracking humans with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14783–14794, 2023. 1, 3
work page 2023
-
[8]
Sam 3d body: Robust full-body human mesh recovery, 2026
Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollár, and Kris Kitani. SAM 3D Body: Robust full-body human mesh recovery.arXiv preprint arXiv:2602.15989, 2026. 1, 2, 3, 6, 7, 8, 9, 17
-
[9]
Hugo Bertiche, Meysam Madadi, and Sergio Escalera. PBNS: Physically based neural simulation for unsupervised garment pose space deformation.ACM Transactions on Graphics (TOG), 40(6):1–14, 2021. 1
work page 2021
-
[10]
PhySkin: Physics-based bone-driven neural garment simulation.arXiv preprint arXiv:2603.27013, 2026
Astitva Srivastava, Hsiao-yu Chen, Ryan Goldade, Philipp Herholz, Zhongshi Jiang, Gene Wei-Chin Lin, Lingchen Yang, Nikolaos Sarafianos, Tuur Stuyck, and Egor Larionov. PhySkin: Physics-based bone-driven neural garment simulation.arXiv preprint arXiv:2603.27013, 2026. 1
-
[11]
Sundar Vedula, Simon Baker, Peter Rander, Robert Collins, and Takeo Kanade. Three-dimensional scene flow. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 1999. 1
work page 1999
-
[12]
Xingyu Liu, Charles R. Qi, and Leonidas J. Guibas. FlowNet3D: Learning scene flow in 3D point clouds. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 529–537, 2019. 3
work page 2019
-
[13]
FLOT: Scene flow on point clouds guided by optimal transport
Gilles Puy, Alexandre Boulch, and Renaud Marlet. FLOT: Scene flow on point clouds guided by optimal transport. InProceedings of the European Conference on Computer Vision (ECCV), 2020. 1
work page 2020
-
[14]
Icp-flow: Lidar scene flow estimation with icp
Yancong Lin and Holger Caesar. Icp-flow: Lidar scene flow estimation with icp. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15501–15511, 2024. 1, 2, 3
work page 2024
-
[15]
V oteflow: Enforcing local rigidity in self-supervised scene flow
Yancong Lin, Shiming Wang, Liangliang Nan, Julian Kooij, and Holger Caesar. V oteflow: Enforcing local rigidity in self-supervised scene flow. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17155–17164, 2025. 1, 2, 3
work page 2025
-
[16]
Zero-shot monocular scene flow estimation in the wild
Yiqing Liang, Abhishek Badki, Hang Su, James Tompkin, and Orazio Gallo. Zero-shot monocular scene flow estimation in the wild. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21031–21044, 2025. 2, 3, 7, 8, 18
work page 2025
-
[17]
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second.arXiv preprint arXiv:2410.02073, 2024. 2, 6, 7, 8, 9, 17
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
Are we ready for autonomous driving? the KITTI vision benchmark suite
Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012. 2 10
work page 2012
-
[19]
Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4040–4048, 2016. 2, 3
work page 2016
-
[20]
A naturalistic open source movie for optical flow evaluation
Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. InProceedings of the European Conference on Computer Vision (ECCV), pages 611–625. Springer, 2012. 2, 3, 7
work page 2012
-
[21]
Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(7):1325–1339, 2014. 2, 3, 7
work page 2014
-
[22]
Black, Bodo Rosenhahn, and Gerard Pons-Moll
Timo von Marcard, Roberto Henschel, Michael J. Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. InProceedings of the European Conference on Computer Vision (ECCV), pages 614–631, 2018. 3, 7
work page 2018
-
[23]
Troje, Gerard Pons-Moll, and Michael J
Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5442–5451, 2019. 2, 3, 7
work page 2019
-
[24]
Metahuman: A complete framework for photorealistic digital humans in Unreal Engine
Epic Games. Metahuman: A complete framework for photorealistic digital humans in Unreal Engine. https://dev.epicgames.com/documentation/en-us/metahuman/metahuman-documentation,
-
[25]
Accessed: 2026-04-30. 2, 6
work page 2026
-
[26]
Epic Games. Unreal Engine 5. https://www.unrealengine.com/en-US/unreal-engine-5 , 2022. 2, 6
work page 2022
-
[27]
Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7122–7131, 2018. 3
work page 2018
-
[28]
WHAM: Reconstructing world-grounded humans with accurate 3d motion
Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. WHAM: Reconstructing world-grounded humans with accurate 3d motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 7, 8
work page 2024
-
[29]
SMPLer-X: Scaling up expressive human pose and shape estimation
Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, and Ziwei Liu. SMPLer-X: Scaling up expressive human pose and shape estimation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 3
work page 2023
-
[30]
Dressrecon: Freeform 4d human reconstruction from monocular video
Jeff Tan, Donglai Xiang, Shubham Tulsiani, Deva Ramanan, and Gengshan Yang. Dressrecon: Freeform 4d human reconstruction from monocular video. In2025 International Conference on 3D Vision (3DV), pages 250–260. IEEE, 2025. 3
work page 2025
-
[31]
SelfRecon: Self reconstruction your digital avatar from monocular video
Boyi Jiang, Yang Hong, Hujun Bao, and Juyong Zhang. SelfRecon: Self reconstruction your digital avatar from monocular video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5605–5615, 2022. 3
work page 2022
-
[32]
Peng Li, Wangguandong Zheng, Yuan Liu, Tao Yu, Yangguang Li, Xingqun Qi, Xiaowei Chi, Siyu Xia, Yan-Pei Cao, Wei Xue, et al. Pshuman: Photorealistic single-image 3d human reconstruction using cross-scale multiview diffusion and explicit remeshing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16008–16018...
work page 2025
-
[33]
PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization
Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2304–2314, 2019. 3
work page 2019
-
[34]
Thin- shell-sft: Fine-grained monocular non-rigid 3d surface tracking with neural deformation fields
Navami Kairanda, Marc Habermann, Shanthika Naik, Christian Theobalt, and Vladislav Golyanik. Thin- shell-sft: Fine-grained monocular non-rigid 3d surface tracking with neural deformation fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11373–11383, 2025. 3
work page 2025
-
[35]
Freecloth: Free-form generation enhances challenging clothed human modeling
Hang Ye, Xiaoxuan Ma, Hai Ci, Wentao Zhu, and Yizhou Wang. Freecloth: Free-form generation enhances challenging clothed human modeling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15987–15997, 2025. 3
work page 2025
-
[36]
Raft-3d: Scene flow using rigid-motion embeddings
Zachary Teed and Jia Deng. Raft-3d: Scene flow using rigid-motion embeddings. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8375–8384, 2021. 3 11
work page 2021
-
[37]
Pointpwc-net: Cost volume on point clouds for (self-) supervised scene flow estimation
Wenxuan Wu, Zhi Yuan Wang, Zhuwen Li, Wei Liu, and Li Fuxin. Pointpwc-net: Cost volume on point clouds for (self-) supervised scene flow estimation. InProceedings of the European Conference on Computer Vision (ECCV), pages 88–107. Springer, 2020. 3
work page 2020
-
[38]
Self-supervised monocular scene flow estimation
Junhwa Hur and Stefan Roth. Self-supervised monocular scene flow estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7396–7405, 2020. 3, 7, 8
work page 2020
-
[39]
Xueqian Li, Jhony Kaesemodel Pontes, and Simon Lucey. Neural scene flow prior. InAdvances in Neural Information Processing Systems (NeurIPS), 2021. 3
work page 2021
-
[40]
Neural Eulerian scene flow fields
Kyle Vedder, Neehar Peri, Ishan Khatri, Siyi Li, Eric Eaton, Mehmet Kocamaz, Yue Wang, Zhiding Yu, Deva Ramanan, and Joachim Pehserl. Neural Eulerian scene flow fields. InProceedings of the International Conference on Learning Representations (ICLR), 2025. 3
work page 2025
-
[41]
Rigidflow: Self-supervised scene flow learning on point clouds by local rigidity prior
Ruibo Li, Chi Zhang, Guosheng Lin, Zhe Wang, and Chunhua Shen. Rigidflow: Self-supervised scene flow learning on point clouds by local rigidity prior. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16959–16968, 2022. 3
work page 2022
-
[42]
Just go with the flow: Self-supervised scene flow estimation
Himangi Mittal, Brian Okorn, and David Held. Just go with the flow: Self-supervised scene flow estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11177–11185, 2020. 3
work page 2020
-
[43]
ZeroFlow: Scalable scene flow via distillation
Kyle Vedder, Neehar Peri, Nathaniel Chodosh, Ishan Khatri, Eric Eaton, Dinesh Jayaraman, Yang Liu, Deva Ramanan, and James Hays. ZeroFlow: Scalable scene flow via distillation. InProceedings of the International Conference on Learning Representations (ICLR), 2024. 3
work page 2024
-
[44]
Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, and Ira Kemelmacher-Shlizerman. HumanNeRF: Free-viewpoint rendering of moving people from monocular video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16210–16220, 2022. 3
work page 2022
-
[45]
InstantAvatar: Learning avatars from monocular video in 60 seconds
Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. InstantAvatar: Learning avatars from monocular video in 60 seconds. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16922–16932, 2023. 3
work page 2023
-
[46]
Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition
Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12858–12868, 2023. 3
work page 2023
-
[47]
ReLoo: Reconstructing humans dressed in loose garments from monocular video in the wild
Chen Guo, Tianjian Jiang, Manuel Kaufmann, Chengwei Zheng, Julien Valentin, Jie Song, and Otmar Hilliges. ReLoo: Reconstructing humans dressed in loose garments from monocular video in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–38. Springer, 2024. 3
work page 2024
-
[48]
3DGS-Avatar: Animatable avatars via deformable 3D gaussian splatting
Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 3DGS-Avatar: Animatable avatars via deformable 3D gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5020–5030, 2024. 3
work page 2024
-
[49]
Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan. HUGS: Human gaussian splats. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 505–515, 2024
work page 2024
-
[50]
Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior
Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, and Chen Cao. Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3
work page 2025
-
[51]
Monocular 3D human pose estimation in the wild using improved CNN supervision
Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. InInternational Conference on 3D Vision (3DV), pages 506–516. IEEE, 2017. 3
work page 2017
-
[52]
NTU RGB+D: A large scale dataset for 3D human activity analysis
Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3D human activity analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1010–1019, 2016. 3
work page 2016
-
[53]
Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(10):2684–2701, 2020. 12
work page 2020
-
[54]
Leonid Sigal, Alexandru O. Balan, and Michael J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion.International Journal of Computer Vision (IJCV), 87(1–2):4–27, 2010. 3
work page 2010
-
[55]
Object scene flow for autonomous vehicles
Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3061–3070, 2015. 3, 7
work page 2015
-
[56]
Argoverse 2: Next generation datasets for self-driving perception and forecasting
Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Advances in Neural Information Processing Systems (NeurIPS), 2021. 3
work page 2021
-
[57]
4DComplete: Non- rigid motion estimation beyond the observable surface
Yang Li, Hikari Takehara, Takafumi Taketomi, Bo Zheng, and Matthias Nießner. 4DComplete: Non- rigid motion estimation beyond the observable surface. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12686–12696, 2021. 3, 7
work page 2021
-
[58]
HuMMan: Multi-modal 4D human dataset for versatile sensing and modeling
Zhongang Cai, Daxuan Ren, Ailing Zeng, Zhengyu Lin, Tao Yu, Wenjia Wang, Xiangyu Fan, Yang Gao, Yifan Yu, Liang Pan, Fangzhou Hong, Mingyuan Zhang, Chen Change Loy, Lei Yang, and Ziwei Liu. HuMMan: Multi-modal 4D human dataset for versatile sensing and modeling. InProceedings of the European Conference on Computer Vision (ECCV), pages 557–577, 2022. 3
work page 2022
-
[59]
HUMBI: A large multiview dataset of human body expressions
Zhixuan Yu, Jae Shin Yoon, In Kyu Lee, Prashanth Venkatesh, Jaesik Park, Jihun Yu, and Hyun Soo Park. HUMBI: A large multiview dataset of human body expressions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2987–2997, 2020. 3
work page 2020
-
[60]
Black, Ivan Laptev, and Cordelia Schmid
Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4627–4635, 2017. 3
work page 2017
-
[61]
Priyanka Patel, Chun-Hao P. Huang, Joachim Tesch, David T. Hoffmann, Shashank Tripathi, and Michael J. Black. AGORA: Avatars in geography optimized for regression analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13468–13478, 2021. 3, 7
work page 2021
-
[62]
Qingzheng Xu, Ru Cao, Xin Shen, Heming Du, Sen Wang, and Xin Yu. M3GYM: A large-scale multimodal multi-view multi-person pose dataset for fitness activity understanding in real-world settings. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12289–12300, 2025. 3, 7
work page 2025
-
[63]
Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang
Michael J. Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8726–8737, 2023. 3, 7
work page 2023
-
[64]
4D-DRESS: A 4D dataset of real-world human clothing with semantic annotations
Wenbo Wang, Hsuan-I Ho, Chen Guo, Boxiang Rong, Artur Grigorev, Jie Song, Juan Jose Zarate, and Otmar Hilliges. 4D-DRESS: A 4D dataset of real-world human clothing with semantic annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 550–560, 2024. 3
work page 2024
-
[65]
Qianli Ma, Jinlong Yang, Anurag Ranjan, Sergi Pujades, Gerard Pons-Moll, Siyu Tang, and Michael J. Black. Learning to dress 3D people in generative clothing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6468–6477, 2020. 3
work page 2020
-
[66]
Hugo Bertiche, Meysam Madadi, and Sergio Escalera. CLOTH3D: Clothed 3D humans. InProceedings of the European Conference on Computer Vision (ECCV), pages 344–359, 2020
work page 2020
-
[67]
CLOTH4D: A dataset for clothed human reconstruction
Xingxing Zou, Xintong Han, and Waikeung Wong. CLOTH4D: A dataset for clothed human reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12847–12857, 2023. 3
work page 2023
-
[68]
Oriane Siméoni et al. DINOv3.arXiv preprint arXiv:2508.10104, 2025. 4, 15
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[69]
Vision transformers for dense prediction
René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188,
-
[70]
Tamar Flash and Neville Hogan. The coordination of arm movements: an experimentally confirmed mathematical model.Journal of Neuroscience, 5(7):1688–1703, 1985. 6
work page 1985
-
[71]
Neville Hogan. An organizing principle for a class of voluntary movements.Journal of Neuroscience, 4(11):2745–2754, 1984. 6 13
work page 1984
-
[72]
Yoji Uno, Mitsuo Kawato, and Ryoji Suzuki. Formation and control of optimal trajectory in human multijoint arm movement: minimum torque-change model.Biological Cybernetics, 61(2):89–101, 1989. 6
work page 1989
-
[73]
Emanuel Todorov and Michael I. Jordan. Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5(11):1226–1235, 2002. 6
work page 2002
-
[74]
Mixamo, 2024.https://www.mixamo.com
Adobe Systems Incorporated. Mixamo, 2024.https://www.mixamo.com. 6
work page 2024
-
[75]
AIFit: Au- tomatic 3d human-interpretable feedback models for fitness training
Mihai Fieraru, Mihai Zanfir, Silviu-Cristian Pirlea, Vlad Olaru, and Cristian Sminchisescu. AIFit: Au- tomatic 3d human-interpretable feedback models for fitness training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9919–9928, 2021. 7, 8
work page 2021
-
[76]
H-MoRe: Learning human-centric motion representation for action analysis
Zhanbo Huang, Xiaoming Liu, and Yu Kong. H-MoRe: Learning human-centric motion representation for action analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 7, 8
work page 2025
-
[77]
RAFT: Recurrent all-pairs field transforms for optical flow
Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. InProceedings of the European Conference on Computer Vision (ECCV), 2020. 7, 8
work page 2020
-
[78]
Priyanka Patel and Michael J. Black. CameraHMR: Aligning people with perspective. InInternational Conference on 3D Vision (3DV), 2025. 8, 9
work page 2025
-
[79]
Yufu Wang, Yu Sun, Priyanka Patel, Kostas Daniilidis, Michael J. Black, and Muhammed Kocabas. PromptHMR: Promptable human mesh recovery. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1148–1159, 2025. 8, 9
work page 2025
-
[80]
HiPART: Hierarchical pose autoregressive transformer for occluded 3D human pose estimation
Hongwei Zheng, Han Li, Wenrui Dai, Ziyang Zheng, Chenglin Li, Junni Zou, and Hongkai Xiong. HiPART: Hierarchical pose autoregressive transformer for occluded 3D human pose estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16807–16817,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.