Monocular Biomechanical Tracking of Fingers with Inverse Kinematics to Foundation Models
Pith reviewed 2026-05-12 04:38 UTC · model grok-4.3
The pith
Finger joint angles can be recovered from monocular video to within roughly 10 degrees by feeding foundation-model pose estimates into inverse kinematics optimization on a biomechanical model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that anatomically constrained finger joint angles and hand positions can be recovered from monocular video by combining 3D pose estimates from a foundation model with inverse kinematics optimization in a biomechanical model, via a custom mapping from rig outputs to model markers. Tested against 8-camera multiview reconstructions, this yields joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm.
What carries the argument
A custom mapping from estimated 3D poses to biomechanical model markers that allows inverse kinematics optimization to enforce anatomical constraints and produce usable finger angles from single-view input.
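The load-bearing step is easier to see in miniature. Below is a hedged sketch of marker-driven inverse kinematics on a toy two-link planar finger, standing in for the paper's full-body MuJoCo-MJX model; the link lengths, cost, and step size are illustrative assumptions, not the paper's implementation.

```python
import math

# Toy stand-in for the paper's IK step: a two-link planar "finger".
# Link lengths are dimensionless illustrative values, not the paper's model.
L1, L2 = 1.0, 0.6

def fk(q1, q2):
    """Forward kinematics: fingertip position for joint angles (radians)."""
    x = L1 * math.cos(q1) + L2 * math.cos(q1 + q2)
    y = L1 * math.sin(q1) + L2 * math.sin(q1 + q2)
    return x, y

def ik(target, q1=0.1, q2=0.1, lr=0.05, steps=3000, eps=1e-6):
    """Gradient-descent IK: minimize the squared distance between the model's
    fingertip "marker" and a target keypoint, as the paper's optimization
    does (at much larger scale) against foundation-model pose estimates."""
    def cost(a, b):
        x, y = fk(a, b)
        return (x - target[0]) ** 2 + (y - target[1]) ** 2
    for _ in range(steps):
        c = cost(q1, q2)
        g1 = (cost(q1 + eps, q2) - c) / eps   # finite-difference gradient
        g2 = (cost(q1, q2 + eps) - c) / eps
        q1 -= lr * g1
        q2 -= lr * g2
    return q1, q2
```

Anatomical constraints would enter as bounds or penalty terms on the joint angles; in the paper they come from the biomechanical model itself rather than from explicit penalties.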
If this is right
- Detailed finger tracking becomes possible from ordinary single-camera recordings without multi-view hardware.
- Quantitative monitoring of activities of daily living and range of motion extends to settings where only monocular video is available.
- Results hold across different camera angles and across alternative methods for computing reference joint values from multiview video.
- Biomechanical analysis of object manipulation tasks gains a practical monocular pathway.
Where Pith is reading between the lines
- The same pipeline could be applied to existing large-scale video collections to study hand function at population scale.
- Mobile-phone recordings might support at-home tracking of rehabilitation progress or disease progression.
- GPU-accelerated versions open the door to near-real-time clinical feedback during movement assessments.
Load-bearing premise
The initial 3D poses produced from monocular video are accurate enough that the subsequent optimization can remove any systematic bias introduced by the mapping to biomechanical markers.
What would settle it
Direct measurement on a held-out set of hand movements showing joint angle errors substantially larger than 10 degrees or position errors much greater than 6 mm when compared to multi-camera ground truth.
Figures
read the original abstract
Accurate hand and finger tracking from video has significant clinical applications for monitoring activities of daily living and measuring range of motion, yet monocular video approaches for obtaining hand biomechanics remain under-developed. We present a method that combines the SAM 3D Body foundation model with inverse kinematics optimization in a full-body biomechanical model to extract anatomically-constrained finger joint angles from single-view video. We port SAM 3D Body from PyTorch to JAX for integration with MuJoCo-MJX, enabling GPU-accelerated optimization, and develop a novel mapping between the Momentum Human Rig (MHR) outputs and biomechanical model markers. Validation against 8-camera multiview reconstruction on 4,590 frames from 7 participants performing a variety of hand poses and object manipulation tasks shows finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm, after Procrustes alignment. Results were consistent across camera viewpoints and robust to different methods for producing reference values from multiview video. This work extends monocular biomechanical analysis to detailed finger tracking, expanding access to quantitative characterization of hand movement from readily available video.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a monocular biomechanical tracking method for fingers that combines the SAM 3D Body foundation model with inverse kinematics optimization inside a full-body biomechanical model (MuJoCo-MJX). It ports SAM 3D Body to JAX for GPU acceleration, introduces a novel mapping from Momentum Human Rig outputs to biomechanical markers, and validates the approach on 4,590 frames from 7 participants, reporting finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm versus 8-camera multiview reconstruction after Procrustes alignment.
Significance. If the validation is reliable, the work would meaningfully extend monocular video analysis to detailed, anatomically constrained finger biomechanics, supporting clinical applications such as range-of-motion assessment and activity monitoring. The use of foundation models, JAX/MuJoCo integration for optimization, and reported cross-view consistency are constructive elements.
major comments (3)
- [Validation] Validation section: The accuracy of the 8-camera multiview ground-truth reconstruction is not quantified (no marker localization error, calibration repeatability, synchronization metrics, or leave-one-camera-out ablation). Finger-scale triangulation is sensitive to 2-3 mm marker uncertainty, which can produce 5-8° of joint-angle variability; without reference error bounds the reported ~10° joint errors cannot be isolated from reference noise.
- [Methods] Methods (mapping step): The custom mapping from Momentum Human Rig outputs to biomechanical model markers is introduced as a key component but receives no independent quantitative validation or bias analysis. Any systematic offset in this mapping would propagate directly into the inverse-kinematics optimization and is not isolated in the current error metrics.
- [Validation] Validation: Procrustes alignment is applied before reporting the 6 mm hand-position error, yet no errors before versus after alignment, nor per-camera or viewpoint-specific breakdowns, are supplied. This leaves open whether the quoted accuracy reflects true monocular performance or is reduced by the alignment procedure.
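The 2-3 mm to 5-8° figure in the first major comment can be reproduced with small-angle arithmetic: a transverse marker error e on a segment of length L tilts the segment's direction by roughly e/L radians, and a joint angle involves two segments with two markers each, with independent errors adding in quadrature. The 45 mm segment length below is an illustrative assumption, not a value from the paper.

```python
import math

# Back-of-envelope check of the triangulation-sensitivity argument.
# Assumptions (illustrative, not from the paper): phalanx length ~45 mm,
# marker errors perpendicular to the bone and independent across markers.
def joint_angle_noise_deg(marker_err_mm, seg_len_mm=45.0):
    per_endpoint = marker_err_mm / seg_len_mm      # small-angle tilt, radians
    per_segment = math.sqrt(2) * per_endpoint      # two markers per segment
    per_joint = math.sqrt(2) * per_segment         # two segments per joint
    return math.degrees(per_joint)

lo = joint_angle_noise_deg(2.0)   # ≈ 5.1 degrees
hi = joint_angle_noise_deg(3.0)   # ≈ 7.6 degrees
```

Under these assumptions the noise band is a sizable fraction of the reported ~10° error, which is why the referee asks for explicit reference error bounds.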
minor comments (1)
- [Abstract] Abstract and results report errors as 'approximately 10 degrees' and 'approximately 6 mm' without accompanying means, standard deviations, or exact values; providing these would improve precision and reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and commit to revisions that strengthen the validation and methods sections without misrepresenting the current results.
read point-by-point responses
-
Referee: [Validation] Validation section: The accuracy of the 8-camera multiview ground-truth reconstruction is not quantified (no marker localization error, calibration repeatability, synchronization metrics, or leave-one-camera-out ablation). Finger-scale triangulation is sensitive to 2-3 mm marker uncertainty, which can produce 5-8° joint-angle variance; without reference error bounds the reported ~10° joint errors cannot be isolated from reference noise.
Authors: We agree that explicit quantification of multiview ground-truth accuracy is necessary to interpret the monocular errors. The manuscript reports consistency across viewpoints and robustness to alternative reference methods, but does not provide marker localization error, calibration repeatability, synchronization metrics, or leave-one-camera-out results. In the revised manuscript we will add a dedicated validation subsection reporting these quantities, including an estimate of triangulation uncertainty at finger scale derived from repeated calibrations and synchronization checks. This will allow clearer isolation of reference noise from the reported ~10° joint-angle errors. revision: yes
-
Referee: [Methods] Methods (mapping step): The custom mapping from Momentum Human Rig outputs to biomechanical model markers is introduced as a key component but receives no independent quantitative validation or bias analysis. Any systematic offset in this mapping would propagate directly into the inverse-kinematics optimization and is not isolated in the current error metrics.
Authors: The referee correctly notes the absence of independent validation for the mapping. While end-to-end performance is evaluated against multiview reconstruction, the mapping itself is not separately quantified. We will revise the methods and validation sections to include an independent assessment of the mapping, for example by reporting marker-position discrepancies on held-out data or controlled synthetic tests that quantify bias and variance. This addition will demonstrate that the mapping does not introduce unaccounted systematic offsets. revision: yes
-
Referee: [Validation] Validation: Procrustes alignment is applied before reporting the 6 mm hand-position error, yet no errors before versus after alignment, nor per-camera or viewpoint-specific breakdowns, are supplied. This leaves open whether the quoted accuracy reflects true monocular performance or is reduced by the alignment procedure.
Authors: We agree that reporting hand-position errors both before and after Procrustes alignment, together with per-camera and viewpoint-specific breakdowns, would increase transparency. The manuscript already states that results are consistent across viewpoints, but does not supply the unaligned numbers or per-view metrics. In the revision we will expand the validation tables and figures to include these quantities, thereby clarifying the contribution of the alignment step to the reported 6 mm error. revision: yes
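The Procrustes point recurs in both the report and the rebuttal; a minimal rigid-alignment (Kabsch) sketch makes the before/after distinction concrete. This is a generic implementation under the usual assumptions (rotation and translation only, no scale), not the paper's evaluation code.

```python
import numpy as np

def procrustes_rmse(P, Q):
    """RMSE between predicted points P and reference points Q (both N x 3),
    before and after optimal rigid (rotation + translation) alignment.
    Reporting both numbers is what the review asks the authors to add."""
    before = float(np.sqrt(((P - Q) ** 2).sum(axis=1).mean()))
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    # Kabsch: optimal rotation from the SVD of the cross-covariance matrix.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    aligned = Pc @ R.T + Q.mean(axis=0)
    after = float(np.sqrt(((aligned - Q) ** 2).sum(axis=1).mean()))
    return before, after
```

Because the alignment removes any global pose offset, the post-alignment error is a lower bound on raw monocular error, which is why the unaligned number matters.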
Circularity Check
No significant circularity; validation against independent multiview data
full rationale
The paper describes a pipeline that ports a pre-trained foundation model (SAM 3D Body), develops a mapping to a biomechanical model, and runs inverse-kinematics optimization. Reported errors are measured on held-out frames against an 8-camera multiview reconstruction that is not derived from or fitted to the monocular outputs. No equations, fitted parameters, or self-citations are shown to reduce the accuracy figures to quantities constructed from the same inputs. The method is therefore self-contained against an external benchmark, consistent with the default expectation of no circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- inverse-kinematics optimization weights and convergence tolerances
axioms (2)
- domain assumption: the SAM 3D Body foundation model produces usable 3D finger poses from monocular video
- ad hoc to paper: the custom mapping from Momentum Human Rig outputs to biomechanical model markers is accurate and unbiased
Reference graph
Works this paper leans on
-
[1]
Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image
Bogo, Federica, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black (2016). “Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image”. In: Computer Vision – ECCV 2016. Springer, Cham, pp. 561–578
work page 2016
-
[2]
JAX: Composable Transformations of Python+NumPy Programs
Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang (2018). JAX: Composable Transformations of Python+NumPy Programs
work page 2018
-
[3]
PosePipe: Open-Source Human Pose Estimation Pipeline for Clinical Research
Cotton, R. James (2022). “PosePipe: Open-Source Human Pose Estimation Pipeline for Clinical Research”. In: arXiv preprint arXiv:2203.08792
— (2025). “Differentiable Biomechanics Unlocks Opportunities for Markerless Motion Capture”. In: 2025 International Conference on Rehabilitation Robotics (ICORR), pp. 44–51
-
[4]
Improved Trajectory Reconstruction for Markerless Pose Estimation
Cotton, R. James, Anthony Cimorelli, Kunal Shah, Shawana Anarwala, Scott Uhlrich, and Tasos Karakostas (2023). “Improved Trajectory Reconstruction for Markerless Pose Estimation” . In: 45th Annual International Conference of the IEEE Engineering in Medicine and Biology Society
work page 2023
-
[5]
Biomechanical Reconstruction with Confidence Intervals from Multiview Markerless Motion Capture
Cotton, R. James and Fabian Sinz (2025). “Biomechanical Reconstruction with Confidence Intervals from Multiview Markerless Motion Capture”. In: 2025 47th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)
work page 2025
-
[6]
OpenSim: Open-Source Software to Create and Analyze Dynamic Simulations of Movement
Delp, Scott L., Frank C. Anderson, Allison S. Arnold, Peter Loan, Ayman Habib, Chand T. John, Eran Guendelman, and Darryl G. Thelen (2007). “OpenSim: Open-Source Software to Create and Analyze Dynamic Simulations of Movement”. In: IEEE Transactions on Biomedical Engineering 54.11, pp. 1940–1950
work page 2007
-
[7]
EMBC Special Issue: Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture
Donahue, Seth, Irina Djuraskovic, Kunal Shah, Fabian Sinz, Ross Chafetz, and R. James Cotton (2026). EMBC Special Issue: Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture
work page 2026
-
[8]
HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video
Fan, Zicong, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J. Black, and Otmar Hilliges (2023). HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video
work page 2023
-
[9]
Lee, Jinhyung Park, Jinlong Yang, John Doublestein, Kishore Venkateshan, Kris Kitani, Ladislav Kavan, Marco Dal Farra, Matthew Hu, Matthew Cioffi, Michael Fabris, Michael Ranieri, Mohammad Modarres, Petr Kadlecek, Rawal Khirodkar, Rinat Abdrashitov, Romain Prevost, Roman Rajbhandari, Ronald Mallet, Russel Pearsall, Sandy Kao, Sanjeev Kumar, Scott Parrish, ...
work page 2025
-
[10]
Biomechanical Arm and Hand Tracking with Multiview Markerless Motion Capture
Firouzabadi, Pouyan, Wendy Murray, Anton R Sobinov, J.D. Peiffer, Kunal Shah, Lee E Miller, and R. James Cotton (2024). “Biomechanical Arm and Hand Tracking with Multiview Markerless Motion Capture” . In: IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob)
work page 2024
-
[11]
MoVi: A Large Multi-Purpose Human Motion and Video Dataset
Ghorbani, Saeed, Kimia Mahdaviani, Anne Thaler, Konrad Kording, Douglas James Cook, Gunnar Blohm, and Nikolaus F. Troje (2021). “MoVi: A Large Multi-Purpose Human Motion and Video Dataset” . In: PLOS ONE 16.6, e0253157
work page 2021
-
[12]
LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion
Al-Hafez, Firas, Guoping Zhao, Jan Peters, and Davide Tateo (2023). LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion
work page 2023
-
[13]
Muscle contributions to propulsion and support during running
Hamner, Samuel R., Ajay Seth, and Scott L. Delp (2010). “Muscle contributions to propulsion and support during running” . In: Journal of Biomechanics 43.14, pp. 2709–2716
work page 2010
-
[14]
A model of the upper extremity for simulating musculoskeletal surgery and analyzing neuromuscular control
Holzbaur, Katherine R. S., Wendy M. Murray, and Scott L. Delp (2005). “A model of the upper extremity for simulating musculoskeletal surgery and analyzing neuromuscular control”. In: Annals of Biomedical Engineering 33.6, pp. 829–840
work page 2005
-
[15]
Sapiens: Foundation for Human Vision Models
Khirodkar, Rawal, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart An- derson, and Shunsuke Saito (2024). Sapiens: Foundation for Human Vision Models
work page 2024
-
[16]
Equinox: neural networks in JAX via callable PyTrees and filtered transformations
Kidger, Patrick and Cristian Garcia (2021). “Equinox: neural networks in JAX via callable PyTrees and filtered transformations” . In: Differentiable Programming workshop at Neural Information Processing Systems 2021
work page 2021
-
[17]
BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos
Koleini, Farnoosh, Muhammad Usama Saleem, Pu Wang, Hongfei Xue, Ahmed Helmy, and Abbey Fenwick (2025). BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos
work page 2025
-
[18]
SMPL: A Skinned Multi-Person Linear Model
Loper, Matthew, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black (2015). “SMPL: A Skinned Multi-Person Linear Model” . In: ACM Transactions on Graphics (Proc. SIGGRAPH Asia)
work page 2015
-
[19]
A Musculoskeletal Model of the Hand and Wrist Capable of Simulating Functional Tasks
McFarland, Daniel C., Benjamin I. Binder-Markey, Jennifer A. Nichols, Sarah J. Wohlman, Marije de Bruin, and Wendy M. Murray (2023). “A Musculoskeletal Model of the Hand and Wrist Capable of Simulating Functional Tasks”. In: IEEE Transactions on Biomedical Engineering 70.5, pp. 1424–1435
work page 2023
-
[20]
ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
Park, Jinhyung, Javier Romero, Shunsuke Saito, Fabian Prada, Takaaki Shiratori, Yichen Xu, Federica Bogo, Shoou-I. Yu, Kris Kitani, and Rawal Khirodkar (2025). ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
work page 2025
-
[21]
Portable biomechanics laboratory: Clinically accessible movement analysis from a handheld smartphone
Peiffer, J. D., Kunal Shah, Irina Djuraskovic, Shawana Anarwala, Kayan Abdou, Rujvee Patel, Prakash Jayabalan, Brenton Pennicooke, and R. James Cotton (2025). Portable biomechanics laboratory: Clinically accessible movement analysis from a handheld smartphone
Sárándi, István, Alexander Hermans, and Bastian Leibe (2023). “Learning 3D Human Pose Estimatio...
work page 2025
-
[22]
DINOv3
Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, and Piotr Bojanowski (2025). DINOv3
work page 2025
-
[23]
MuJoCo: A Physics Engine for Model-Based Control
Todorov, Emanuel, Tom Erez, and Yuval Tassa (2012). “MuJoCo: A Physics Engine for Model-Based Control” . In: IEEE/RSJ International Conference on Intelligent Robots and Systems
work page 2012
-
[24]
Differentiable biomechanics for markerless motion capture in upper limb stroke rehabilitation: a comparison with optical motion capture
Unger, Tim, Arash Sal Moslehian, J.D. Peiffer, Johann Ullrich, Roger Gassert, Olivier Lambercy, R. James Cotton, and Chris Awai Easthope (2025). “Differentiable biomechanics for markerless motion capture in upper limb stroke rehabilitation: a comparison with optical motion capture”. In: IEEE Transactions on Medical Robotics and Bionics, pp. 1–1
work page 2025
-
[25]
Reconstructing Hand-Held Objects in 3D
Wu, Jane, Georgios Pavlakos, Georgia Gkioxari, and Jitendra Malik (2024). Reconstructing Hand-Held Objects in 3D
work page 2024
-
[26]
Reconstructing Humans with a Biomechanically Accurate Skeleton
Xia, Yan, Xiaowei Zhou, Etienne Vouga, Qixing Huang, and Georgios Pavlakos (2025). “Reconstructing Humans with a Biomechanically Accurate Skeleton” . In: CVPR
work page 2025
-
[27]
SAM 3D Body: Robust Full-Body Human Mesh Recovery
Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, and Kris Kitani (2025). SAM 3D Body: Robust Full-Body Human Mesh Recovery
work page 2025