pith. sign in

arxiv: 2605.22629 · v1 · pith:OCGJIYG6new · submitted 2026-05-21 · 💻 cs.CV

H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning

Pith reviewed 2026-05-22 06:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords human scene flowself-supervised learningmonocular videophysics priorsmulti-head transformerdense motionDynAct4D benchmark
0
0 comments X

The pith

H-Flow estimates dense human scene flow from monocular video by using a multi-head transformer trained with physics priors on pose and depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents H-Flow as a way to compute dense motion for humans that includes both rigid skeletal movement and non-rigid surface changes like clothing. A single transformer model takes monocular video and outputs scene flow together with pose and depth estimates. Because true dense flow labels are unavailable, the training relies on cross-modal objectives that embed geometric consistency, structural constraints, and biomechanical rules drawn from human motion physics. The approach is evaluated on existing benchmarks where it beats standard scene-flow and parametric human-model methods, and it transfers directly to real-world video.

Core claim

H-Flow is a dense human scene flow method that jointly models skeletal kinematics and surface deformation. A unified multi-head transformer processes monocular video frames to predict flow while producing companion pose and depth maps. In the absence of direct supervision, the network is trained by encoding geometric, structural, and biomechanical priors as cross-modal consistency losses. The authors also release DynAct4D, a synthetic dataset with dense flow ground truth across varied subjects, garments, and actions. The resulting model exceeds scene-flow and parametric baselines on standard tests and generalizes zero-shot to in-the-wild footage.

What carries the argument

Unified multi-head transformer that jointly regresses scene flow, pose, and depth from video, trained via physics-inspired geometric, structural, and biomechanical cross-modal objectives.

If this is right

  • The same network produces both dense surface flow and parametric pose estimates, allowing consistent motion analysis without separate pipelines.
  • Performance gains appear on articulated bodies with clothing and soft tissue where generic scene flow methods degrade.
  • Zero-shot transfer to in-the-wild monocular video becomes possible once the physics priors are learned.
  • A new high-fidelity synthetic benchmark supplies dense flow annotations that can support further model development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar cross-modal physics objectives could be adapted to non-human articulated objects such as animals or robots if corresponding biomechanical rules are supplied.
  • The joint prediction of flow, pose, and depth may reduce error accumulation across tasks by enforcing mutual geometric consistency during inference.
  • Extending the synthetic benchmark with more varied lighting or camera motion could test robustness before real-world deployment.

Load-bearing premise

Human motion can be adequately described by a combination of geometric, structural, and biomechanical priors that serve as reliable substitutes for missing dense flow labels.

What would settle it

Independent dense flow measurements obtained from calibrated multi-view capture on real human subjects would show systematic disagreement with the flow predicted by the physics-prior model.

Figures

Figures reproduced from arXiv: 2605.22629 by Xiaoming Liu, Yu Kong, Zhanbo Huang.

Figure 1
Figure 1. Figure 1: Comparison of paradigms for human motion perception on an in-the-wild ballet performance. Evaluated zero-shot on a frame from The Nutcracker (Mariinsky Ballet). (a) The input video frame. (b) Parametric models (SAM 3D Body [8]) recover global pose, but their mesh excludes the tutu and truncates extremities. (c) Generic scene flow (Zero MSF [16]) yields undetected motion on the dancer, spurious vectors in s… view at source ↗
Figure 2
Figure 2. Figure 2: H-Flow architecture. We instantiate Gθ as a single transformer producing the outputs of Y from a frozen visual substrate. Patches and learnable queries (pose, camera) participate in the same attention, so cross-modal coupling exists in the representation before any constraint is computed. face. Real-world per-pixel 3D motion on humans cannot be captured directly. Marker-based motion capture records only sp… view at source ↗
Figure 3
Figure 3. Figure 3: Self-supervised constraints from physical priors. Four panels illustrate the four constraints that train Gθ without scene-flow ground truth. (a) Silhouette Edge Alignment pulls dense edge responses in depth D and flow F toward the mask boundary, with a signed distance field penalizing both interior spurious edges and exterior leakage. (b) Skeletal-Surface Coupling ties surface flow to bone-induced motion v… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative scene flow comparison. (top) Fit3D [in-domain], (bottom) DynAct4D [out-of-domain]. Scene flow baselines smear on articulated limbs and clothing. Lifting pipelines accumulate edge artifacts at silhouette boundaries. Mesh residual methods recover skeletal motion but flatten surface dynamics on garments. A higher-resolution version is provided in the supplementary material. clothing edges, where r… view at source ↗
Figure 5
Figure 5. Figure 5: Multi-view renderings of the ballet reconstruction. The 3D point cloud from [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Multi-view flow comparison on Fit3D. Each column renders the same predicted flow from a different azimuth. H-Flow GT ZeroMSF [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Multi-view flow comparison on DynAct4D. A sequence with significant garment deformation, viewed from multiple azimuths. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
read the original abstract

Parametric human models capture global pose but cannot represent the non-rigid surface dynamics of clothing and soft tissue. Generic scene flow estimates dense motion but breaks down on articulated bodies, where pixel-level supervision is also intractable to acquire. We introduce H-Flow, a dense human scene flow that captures both skeletal kinematics and surface deformation. A unified multi-head transformer estimates flow from monocular video, jointly predicting pose and depth as companion outputs. The challenge lies in the lack of supervision. In place of unattainable labels, we anchor the network in the physics of human motion, encoding geometric, structural, and biomechanical priors as cross-modal training objectives. We further introduce DynAct4D, a high-fidelity synthetic benchmark providing dense flow annotations across diverse subjects, garments, and motions. On standard benchmarks, H-Flow outperforms scene-flow and parametric baselines, and generalizes zero-shot to in-the-wild video. Code, models, and the DynAct4D benchmark will be released upon publication

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces H-Flow, a self-supervised method for dense human scene flow from monocular video. A unified multi-head transformer jointly predicts scene flow along with pose and depth as auxiliary outputs. In the absence of dense labels, the approach encodes geometric, structural, and biomechanical priors as cross-modal consistency objectives. The authors also release DynAct4D, a synthetic benchmark with dense flow ground truth across varied subjects, garments, and motions. Experiments claim that H-Flow outperforms both generic scene-flow estimators and parametric human-model baselines on standard benchmarks while generalizing zero-shot to in-the-wild video.

Significance. If the central claims are substantiated, the work would meaningfully advance human motion capture by addressing the inability of parametric models to represent non-rigid clothing and soft-tissue dynamics while avoiding the supervision requirements that limit generic scene flow. The joint multi-modal architecture and physics-inspired self-supervision constitute a coherent attempt to solve an acknowledged gap; the public release of DynAct4D and associated code would further strengthen the contribution.

major comments (2)
  1. [§4] §4 (Experiments) and associated ablation tables: the central claim that cross-modal physics priors enable the flow head to capture non-rigid surface dynamics beyond what the pose head already provides is load-bearing for both the outperformance and zero-shot generalization statements. No ablation is shown that isolates the flow head trained only with pose/depth supervision versus the full set of geometric-structural-biomechanical objectives; without this, it remains possible that reported gains are largely inherited from the pose prediction rather than independently learned dense flow.
  2. [§3.2] §3.2 (Loss formulations): the biomechanical and structural prior losses are described at a high level but lack explicit equations demonstrating that they impose constraints on the flow field that are not already satisfied by the pose and depth heads. If the priors reduce to re-projection or rigidity terms already implicit in the pose output, the argument that they specifically regularize clothing/soft-tissue motion would be weakened.
minor comments (2)
  1. [Figure 2] Figure 2 (architecture diagram): the three output heads are not visually distinguished with sufficient clarity; adding explicit labels or color coding for flow, pose, and depth would improve readability.
  2. [§4.1] The description of DynAct4D in §4.1 should include a brief statement on how the synthetic rendering pipeline ensures that the provided dense flow annotations are independent of the parametric body model used for pose generation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and will incorporate revisions to strengthen the presentation and empirical support for our claims.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated ablation tables: the central claim that cross-modal physics priors enable the flow head to capture non-rigid surface dynamics beyond what the pose head already provides is load-bearing for both the outperformance and zero-shot generalization statements. No ablation is shown that isolates the flow head trained only with pose/depth supervision versus the full set of geometric-structural-biomechanical objectives; without this, it remains possible that reported gains are largely inherited from the pose prediction rather than independently learned dense flow.

    Authors: We agree that an explicit ablation isolating the contribution of the full set of cross-modal physics priors to the flow head is necessary to substantiate the central claims. In the revised manuscript we will add a new ablation (to be included as an additional row or sub-table in Section 4) that trains the flow head using only the pose and depth supervision losses and compares it directly against the complete model. Internal experiments already performed during development show measurable gains in endpoint error and non-rigid region accuracy when the geometric, structural, and biomechanical terms are included; these results will be reported together with qualitative examples that highlight improved capture of clothing and soft-tissue motion not explained by pose alone. revision: yes

  2. Referee: [§3.2] §3.2 (Loss formulations): the biomechanical and structural prior losses are described at a high level but lack explicit equations demonstrating that they impose constraints on the flow field that are not already satisfied by the pose and depth heads. If the priors reduce to re-projection or rigidity terms already implicit in the pose output, the argument that they specifically regularize clothing/soft-tissue motion would be weakened.

    Authors: We thank the referee for highlighting this presentational gap. In the revised manuscript we will expand Section 3.2 with the explicit loss equations. The biomechanical prior is defined as a temporal consistency term on local surface velocities derived from biomechanical joint-angle limits and acceleration bounds; it is applied directly to the dense flow field after skeletal motion is propagated to surface points. The structural prior implements a local as-rigid-as-possible penalty on small surface patches obtained from the predicted depth and flow, which operates independently of the global rigid transformations encoded by the pose head. These formulations will be accompanied by a short discussion clarifying that the priors enforce local non-rigid constraints on clothing and soft-tissue regions that are not already satisfied by standard re-projection or global rigidity terms. revision: yes

Circularity Check

0 steps flagged

No significant circularity: priors and benchmarks are external to the flow derivation

full rationale

The abstract presents geometric, structural, and biomechanical priors as independent encodings of human motion physics used as cross-modal objectives, not as quantities fitted from or defined by the flow outputs themselves. Joint prediction of pose, depth, and flow is described as an architectural choice, with performance claims supported by evaluation on standard benchmarks plus the introduced DynAct4D dataset that supplies independent dense flow annotations. No equations, self-citations, or reductions to fitted inputs appear in the provided text that would make the central claims equivalent to the inputs by construction. The method is therefore self-contained against external validation rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is limited to the abstract; the central claim rests on the effectiveness of physics priors as substitute supervision. No explicit free parameters or invented entities are named in the abstract. The key domain assumption is that geometric, structural, and biomechanical rules can be encoded as cross-modal objectives that train the network without labels.

axioms (1)
  • domain assumption Physics of human motion can be encoded as geometric, structural, and biomechanical priors for cross-modal training objectives that replace labels.
    Abstract states this directly as the solution to the lack of supervision.

pith-pipeline@v0.9.0 · 5703 in / 1471 out tokens · 67593 ms · 2026-05-22T06:33:25.232444+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 4 internal anchors

  1. [1]

    Smpl: a skinned multi-person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: a skinned multi-person linear model.ACM Transactions on Graphics (TOG), 34(6):1–16, 2015. 1, 3

  2. [2]

    Expressive body capture: 3d hands, face, and body from a single image

    Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019. 3

  3. [3]

    Freeman, Rahul Sukthankar, and Cristian Sminchisescu

    Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, William T. Freeman, Rahul Sukthankar, and Cristian Sminchisescu. GHUM & GHUML: Generative 3D human shape and articulated pose models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  4. [4]

    Ahmed A. A. Osman, Timo Bolkart, and Michael J. Black. STAR: Sparse trained articulated human body regressor. InProceedings of the European Conference on Computer Vision (ECCV), 2020

  5. [5]

    Ahmed A. A. Osman, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. SUPR: A sparse unified part-based human representation. InProceedings of the European Conference on Computer Vision (ECCV), 2022

  6. [6]

    Aaron Ferguson, Ahmed A. A. Osman, Berta Bescos, Carsten Stoll, Chris Twigg, Christoph Lassner, et al. MHR: Momentum human rig.arXiv preprint arXiv:2511.15586, 2025. 1

  7. [7]

    Hu- mans in 4D: Reconstructing and tracking humans with transformers

    Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, Angjoo Kanazawa, and Jitendra Malik. Hu- mans in 4D: Reconstructing and tracking humans with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14783–14794, 2023. 1, 3

  8. [8]

    Sam 3d body: Robust full-body human mesh recovery, 2026

    Xitong Yang, Devansh Kukreja, Don Pinkus, Anushka Sagar, Taosha Fan, Jinhyung Park, Soyong Shin, Jinkun Cao, Jiawei Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollár, and Kris Kitani. SAM 3D Body: Robust full-body human mesh recovery.arXiv preprint arXiv:2602.15989, 2026. 1, 2, 3, 6, 7, 8, 9, 17

  9. [9]

    PBNS: Physically based neural simulation for unsupervised garment pose space deformation.ACM Transactions on Graphics (TOG), 40(6):1–14, 2021

    Hugo Bertiche, Meysam Madadi, and Sergio Escalera. PBNS: Physically based neural simulation for unsupervised garment pose space deformation.ACM Transactions on Graphics (TOG), 40(6):1–14, 2021. 1

  10. [10]

    PhySkin: Physics-based bone-driven neural garment simulation.arXiv preprint arXiv:2603.27013, 2026

    Astitva Srivastava, Hsiao-yu Chen, Ryan Goldade, Philipp Herholz, Zhongshi Jiang, Gene Wei-Chin Lin, Lingchen Yang, Nikolaos Sarafianos, Tuur Stuyck, and Egor Larionov. PhySkin: Physics-based bone-driven neural garment simulation.arXiv preprint arXiv:2603.27013, 2026. 1

  11. [11]

    Three-dimensional scene flow

    Sundar Vedula, Simon Baker, Peter Rander, Robert Collins, and Takeo Kanade. Three-dimensional scene flow. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 1999. 1

  12. [12]

    Qi, and Leonidas J

    Xingyu Liu, Charles R. Qi, and Leonidas J. Guibas. FlowNet3D: Learning scene flow in 3D point clouds. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 529–537, 2019. 3

  13. [13]

    FLOT: Scene flow on point clouds guided by optimal transport

    Gilles Puy, Alexandre Boulch, and Renaud Marlet. FLOT: Scene flow on point clouds guided by optimal transport. InProceedings of the European Conference on Computer Vision (ECCV), 2020. 1

  14. [14]

    Icp-flow: Lidar scene flow estimation with icp

    Yancong Lin and Holger Caesar. Icp-flow: Lidar scene flow estimation with icp. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15501–15511, 2024. 1, 2, 3

  15. [15]

    V oteflow: Enforcing local rigidity in self-supervised scene flow

    Yancong Lin, Shiming Wang, Liangliang Nan, Julian Kooij, and Holger Caesar. V oteflow: Enforcing local rigidity in self-supervised scene flow. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17155–17164, 2025. 1, 2, 3

  16. [16]

    Zero-shot monocular scene flow estimation in the wild

    Yiqing Liang, Abhishek Badki, Hang Su, James Tompkin, and Orazio Gallo. Zero-shot monocular scene flow estimation in the wild. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21031–21044, 2025. 2, 3, 7, 8, 18

  17. [17]

    Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second.arXiv preprint arXiv:2410.02073, 2024. 2, 6, 7, 8, 9, 17

  18. [18]

    Are we ready for autonomous driving? the KITTI vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the KITTI vision benchmark suite. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012. 2 10

  19. [19]

    A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation

    Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4040–4048, 2016. 2, 3

  20. [20]

    A naturalistic open source movie for optical flow evaluation

    Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. InProceedings of the European Conference on Computer Vision (ECCV), pages 611–625. Springer, 2012. 2, 3, 7

  21. [21]

    Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(7):1325–1339, 2014. 2, 3, 7

  22. [22]

    Black, Bodo Rosenhahn, and Gerard Pons-Moll

    Timo von Marcard, Roberto Henschel, Michael J. Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. InProceedings of the European Conference on Computer Vision (ECCV), pages 614–631, 2018. 3, 7

  23. [23]

    Troje, Gerard Pons-Moll, and Michael J

    Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5442–5451, 2019. 2, 3, 7

  24. [24]

    Metahuman: A complete framework for photorealistic digital humans in Unreal Engine

    Epic Games. Metahuman: A complete framework for photorealistic digital humans in Unreal Engine. https://dev.epicgames.com/documentation/en-us/metahuman/metahuman-documentation,

  25. [25]

    Accessed: 2026-04-30. 2, 6

  26. [26]

    Unreal Engine 5

    Epic Games. Unreal Engine 5. https://www.unrealengine.com/en-US/unreal-engine-5 , 2022. 2, 6

  27. [27]

    Black, David W

    Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7122–7131, 2018. 3

  28. [28]

    WHAM: Reconstructing world-grounded humans with accurate 3d motion

    Soyong Shin, Juyong Kim, Eni Halilaj, and Michael J Black. WHAM: Reconstructing world-grounded humans with accurate 3d motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 7, 8

  29. [29]

    SMPLer-X: Scaling up expressive human pose and shape estimation

    Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, and Ziwei Liu. SMPLer-X: Scaling up expressive human pose and shape estimation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. 3

  30. [30]

    Dressrecon: Freeform 4d human reconstruction from monocular video

    Jeff Tan, Donglai Xiang, Shubham Tulsiani, Deva Ramanan, and Gengshan Yang. Dressrecon: Freeform 4d human reconstruction from monocular video. In2025 International Conference on 3D Vision (3DV), pages 250–260. IEEE, 2025. 3

  31. [31]

    SelfRecon: Self reconstruction your digital avatar from monocular video

    Boyi Jiang, Yang Hong, Hujun Bao, and Juyong Zhang. SelfRecon: Self reconstruction your digital avatar from monocular video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5605–5615, 2022. 3

  32. [32]

    Pshuman: Photorealistic single-image 3d human reconstruction using cross-scale multiview diffusion and explicit remeshing

    Peng Li, Wangguandong Zheng, Yuan Liu, Tao Yu, Yangguang Li, Xingqun Qi, Xiaowei Chi, Siyu Xia, Yan-Pei Cao, Wei Xue, et al. Pshuman: Photorealistic single-image 3d human reconstruction using cross-scale multiview diffusion and explicit remeshing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16008–16018...

  33. [33]

    PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization

    Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2304–2314, 2019. 3

  34. [34]

    Thin- shell-sft: Fine-grained monocular non-rigid 3d surface tracking with neural deformation fields

    Navami Kairanda, Marc Habermann, Shanthika Naik, Christian Theobalt, and Vladislav Golyanik. Thin- shell-sft: Fine-grained monocular non-rigid 3d surface tracking with neural deformation fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11373–11383, 2025. 3

  35. [35]

    Freecloth: Free-form generation enhances challenging clothed human modeling

    Hang Ye, Xiaoxuan Ma, Hai Ci, Wentao Zhu, and Yizhou Wang. Freecloth: Free-form generation enhances challenging clothed human modeling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15987–15997, 2025. 3

  36. [36]

    Raft-3d: Scene flow using rigid-motion embeddings

    Zachary Teed and Jia Deng. Raft-3d: Scene flow using rigid-motion embeddings. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8375–8384, 2021. 3 11

  37. [37]

    Pointpwc-net: Cost volume on point clouds for (self-) supervised scene flow estimation

    Wenxuan Wu, Zhi Yuan Wang, Zhuwen Li, Wei Liu, and Li Fuxin. Pointpwc-net: Cost volume on point clouds for (self-) supervised scene flow estimation. InProceedings of the European Conference on Computer Vision (ECCV), pages 88–107. Springer, 2020. 3

  38. [38]

    Self-supervised monocular scene flow estimation

    Junhwa Hur and Stefan Roth. Self-supervised monocular scene flow estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7396–7405, 2020. 3, 7, 8

  39. [39]

    Neural scene flow prior

    Xueqian Li, Jhony Kaesemodel Pontes, and Simon Lucey. Neural scene flow prior. InAdvances in Neural Information Processing Systems (NeurIPS), 2021. 3

  40. [40]

    Neural Eulerian scene flow fields

    Kyle Vedder, Neehar Peri, Ishan Khatri, Siyi Li, Eric Eaton, Mehmet Kocamaz, Yue Wang, Zhiding Yu, Deva Ramanan, and Joachim Pehserl. Neural Eulerian scene flow fields. InProceedings of the International Conference on Learning Representations (ICLR), 2025. 3

  41. [41]

    Rigidflow: Self-supervised scene flow learning on point clouds by local rigidity prior

    Ruibo Li, Chi Zhang, Guosheng Lin, Zhe Wang, and Chunhua Shen. Rigidflow: Self-supervised scene flow learning on point clouds by local rigidity prior. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16959–16968, 2022. 3

  42. [42]

    Just go with the flow: Self-supervised scene flow estimation

    Himangi Mittal, Brian Okorn, and David Held. Just go with the flow: Self-supervised scene flow estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11177–11185, 2020. 3

  43. [43]

    ZeroFlow: Scalable scene flow via distillation

    Kyle Vedder, Neehar Peri, Nathaniel Chodosh, Ishan Khatri, Eric Eaton, Dinesh Jayaraman, Yang Liu, Deva Ramanan, and James Hays. ZeroFlow: Scalable scene flow via distillation. InProceedings of the International Conference on Learning Representations (ICLR), 2024. 3

  44. [44]

    Srinivasan, Jonathan T

    Chung-Yi Weng, Brian Curless, Pratul P. Srinivasan, Jonathan T. Barron, and Ira Kemelmacher-Shlizerman. HumanNeRF: Free-viewpoint rendering of moving people from monocular video. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16210–16220, 2022. 3

  45. [45]

    InstantAvatar: Learning avatars from monocular video in 60 seconds

    Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. InstantAvatar: Learning avatars from monocular video in 60 seconds. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16922–16932, 2023. 3

  46. [46]

    Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition

    Chen Guo, Tianjian Jiang, Xu Chen, Jie Song, and Otmar Hilliges. Vid2Avatar: 3D avatar reconstruction from videos in the wild via self-supervised scene decomposition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12858–12868, 2023. 3

  47. [47]

    ReLoo: Reconstructing humans dressed in loose garments from monocular video in the wild

    Chen Guo, Tianjian Jiang, Manuel Kaufmann, Chengwei Zheng, Julien Valentin, Jie Song, and Otmar Hilliges. ReLoo: Reconstructing humans dressed in loose garments from monocular video in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–38. Springer, 2024. 3

  48. [48]

    3DGS-Avatar: Animatable avatars via deformable 3D gaussian splatting

    Zhiyin Qian, Shaofei Wang, Marko Mihajlovic, Andreas Geiger, and Siyu Tang. 3DGS-Avatar: Animatable avatars via deformable 3D gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5020–5030, 2024. 3

  49. [49]

    HUGS: Human gaussian splats

    Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel, Oncel Tuzel, and Anurag Ranjan. HUGS: Human gaussian splats. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 505–515, 2024

  50. [50]

    Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior

    Chen Guo, Junxuan Li, Yash Kant, Yaser Sheikh, Shunsuke Saito, and Chen Cao. Vid2Avatar-Pro: Authentic avatar from videos in the wild via universal prior. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

  51. [51]

    Monocular 3D human pose estimation in the wild using improved CNN supervision

    Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. InInternational Conference on 3D Vision (3DV), pages 506–516. IEEE, 2017. 3

  52. [52]

    NTU RGB+D: A large scale dataset for 3D human activity analysis

    Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A large scale dataset for 3D human activity analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1010–1019, 2016. 3

  53. [53]

    Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding.IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 42(10):2684–2701, 2020. 12

  54. [54]

    Balan, and Michael J

    Leonid Sigal, Alexandru O. Balan, and Michael J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion.International Journal of Computer Vision (IJCV), 87(1–2):4–27, 2010. 3

  55. [55]

    Object scene flow for autonomous vehicles

    Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3061–3070, 2015. 3, 7

  56. [56]

    Argoverse 2: Next generation datasets for self-driving perception and forecasting

    Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Advances in Neural Information Processing Systems (NeurIPS), 2021. 3

  57. [57]

    4DComplete: Non- rigid motion estimation beyond the observable surface

    Yang Li, Hikari Takehara, Takafumi Taketomi, Bo Zheng, and Matthias Nießner. 4DComplete: Non- rigid motion estimation beyond the observable surface. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12686–12696, 2021. 3, 7

  58. [58]

    HuMMan: Multi-modal 4D human dataset for versatile sensing and modeling

    Zhongang Cai, Daxuan Ren, Ailing Zeng, Zhengyu Lin, Tao Yu, Wenjia Wang, Xiangyu Fan, Yang Gao, Yifan Yu, Liang Pan, Fangzhou Hong, Mingyuan Zhang, Chen Change Loy, Lei Yang, and Ziwei Liu. HuMMan: Multi-modal 4D human dataset for versatile sensing and modeling. InProceedings of the European Conference on Computer Vision (ECCV), pages 557–577, 2022. 3

  59. [59]

    HUMBI: A large multiview dataset of human body expressions

    Zhixuan Yu, Jae Shin Yoon, In Kyu Lee, Prashanth Venkatesh, Jaesik Park, Jihun Yu, and Hyun Soo Park. HUMBI: A large multiview dataset of human body expressions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2987–2997, 2020. 3

  60. [60]

    Black, Ivan Laptev, and Cordelia Schmid

    Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4627–4635, 2017. 3

  61. [61]

    Huang, Joachim Tesch, David T

    Priyanka Patel, Chun-Hao P. Huang, Joachim Tesch, David T. Hoffmann, Shashank Tripathi, and Michael J. Black. AGORA: Avatars in geography optimized for regression analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13468–13478, 2021. 3, 7

  62. [62]

    M3GYM: A large-scale multimodal multi-view multi-person pose dataset for fitness activity understanding in real-world settings

    Qingzheng Xu, Ru Cao, Xin Shen, Heming Du, Sen Wang, and Xin Yu. M3GYM: A large-scale multimodal multi-view multi-person pose dataset for fitness activity understanding in real-world settings. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12289–12300, 2025. 3, 7

  63. [63]

    Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang

    Michael J. Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8726–8737, 2023. 3, 7

  64. [64]

    4D-DRESS: A 4D dataset of real-world human clothing with semantic annotations

    Wenbo Wang, Hsuan-I Ho, Chen Guo, Boxiang Rong, Artur Grigorev, Jie Song, Juan Jose Zarate, and Otmar Hilliges. 4D-DRESS: A 4D dataset of real-world human clothing with semantic annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 550–560, 2024. 3

  65. [65]

    Qianli Ma, Jinlong Yang, Anurag Ranjan, Sergi Pujades, Gerard Pons-Moll, Siyu Tang, and Michael J. Black. Learning to dress 3D people in generative clothing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6468–6477, 2020. 3

  66. [66]

    CLOTH3D: Clothed 3D humans

    Hugo Bertiche, Meysam Madadi, and Sergio Escalera. CLOTH3D: Clothed 3D humans. InProceedings of the European Conference on Computer Vision (ECCV), pages 344–359, 2020

  67. [67]

    CLOTH4D: A dataset for clothed human reconstruction

    Xingxing Zou, Xintong Han, and Waikeung Wong. CLOTH4D: A dataset for clothed human reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12847–12857, 2023. 3

  68. [68]

    DINOv3

    Oriane Siméoni et al. DINOv3.arXiv preprint arXiv:2508.10104, 2025. 4, 15

  69. [69]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188,

  70. [70]

    The coordination of arm movements: an experimentally confirmed mathematical model.Journal of Neuroscience, 5(7):1688–1703, 1985

    Tamar Flash and Neville Hogan. The coordination of arm movements: an experimentally confirmed mathematical model.Journal of Neuroscience, 5(7):1688–1703, 1985. 6

  71. [71]

    An organizing principle for a class of voluntary movements.Journal of Neuroscience, 4(11):2745–2754, 1984

    Neville Hogan. An organizing principle for a class of voluntary movements.Journal of Neuroscience, 4(11):2745–2754, 1984. 6 13

  72. [72]

    Formation and control of optimal trajectory in human multijoint arm movement: minimum torque-change model.Biological Cybernetics, 61(2):89–101, 1989

    Yoji Uno, Mitsuo Kawato, and Ryoji Suzuki. Formation and control of optimal trajectory in human multijoint arm movement: minimum torque-change model.Biological Cybernetics, 61(2):89–101, 1989. 6

  73. [73]

    Emanuel Todorov and Michael I. Jordan. Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5(11):1226–1235, 2002. 6

  74. [74]

    Mixamo, 2024.https://www.mixamo.com

    Adobe Systems Incorporated. Mixamo, 2024.https://www.mixamo.com. 6

  75. [75]

    AIFit: Au- tomatic 3d human-interpretable feedback models for fitness training

    Mihai Fieraru, Mihai Zanfir, Silviu-Cristian Pirlea, Vlad Olaru, and Cristian Sminchisescu. AIFit: Au- tomatic 3d human-interpretable feedback models for fitness training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9919–9928, 2021. 7, 8

  76. [76]

    H-MoRe: Learning human-centric motion representation for action analysis

    Zhanbo Huang, Xiaoming Liu, and Yu Kong. H-MoRe: Learning human-centric motion representation for action analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 7, 8

  77. [77]

    RAFT: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. InProceedings of the European Conference on Computer Vision (ECCV), 2020. 7, 8

  78. [78]

    Priyanka Patel and Michael J. Black. CameraHMR: Aligning people with perspective. InInternational Conference on 3D Vision (3DV), 2025. 8, 9

  79. [79]

    Black, and Muhammed Kocabas

    Yufu Wang, Yu Sun, Priyanka Patel, Kostas Daniilidis, Michael J. Black, and Muhammed Kocabas. PromptHMR: Promptable human mesh recovery. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1148–1159, 2025. 8, 9

  80. [80]

    HiPART: Hierarchical pose autoregressive transformer for occluded 3D human pose estimation

    Hongwei Zheng, Han Li, Wenrui Dai, Ziyang Zheng, Chenglin Li, Junni Zou, and Hongkai Xiong. HiPART: Hierarchical pose autoregressive transformer for occluded 3D human pose estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16807–16817,

Showing first 80 references.