pith. machine review for the scientific record.

arxiv: 2605.09258 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.AI


Monocular Biomechanical Tracking of Fingers with Inverse Kinematics to Foundation Models


Pith reviewed 2026-05-12 04:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords monocular video · hand tracking · finger joint angles · inverse kinematics · biomechanical model · pose estimation · clinical monitoring · range of motion

The pith

Finger joint angles can be extracted from monocular video to within about 10 degrees by feeding foundation-model pose estimates into inverse kinematics optimization on a biomechanical model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a pipeline that takes 3D hand and finger poses estimated from a single camera view and refines them through inverse kinematics inside a full-body biomechanical model to produce anatomically valid joint angles. Validation on 4,590 frames from seven participants performing varied poses and object-manipulation tasks shows roughly 10-degree joint errors and 6-millimeter position errors after alignment with multi-camera references. This matters because hand biomechanics has direct clinical uses in tracking daily activities and measuring range of motion, yet monocular methods have lagged behind multi-view systems. The results stay consistent across viewpoints and across different ways of generating reference values.
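
The second stage is easiest to see as a small optimization problem: treat the mapped foundation-model points as targets and search for joint angles whose forward kinematics reproduce them while respecting anatomical limits. Below is a minimal JAX sketch on a toy planar three-joint finger (JAX because the paper ports its pipeline to JAX); the paper's actual implementation runs Levenberg-Marquardt on a full-body model in MuJoCo-MJX, so the segment lengths, joint limits, penalty weight, and plain gradient loop here are illustrative assumptions only.

```python
import jax
import jax.numpy as jnp

# Toy planar finger: three hinge joints (MCP, PIP, DIP), segment lengths in meters.
SEG_LENGTHS = jnp.array([0.045, 0.025, 0.018])
# Flexion limits in radians (hypothetical values, for illustration only).
LIMITS = jnp.array([[0.0, 1.6], [0.0, 1.9], [0.0, 1.4]])

def forward_kinematics(angles):
    """2D positions of the three segment endpoints for given joint angles."""
    orientations = jnp.cumsum(angles)  # absolute orientation of each segment
    steps = SEG_LENGTHS[:, None] * jnp.stack(
        [jnp.cos(orientations), jnp.sin(orientations)], axis=-1)
    return jnp.cumsum(steps, axis=0)   # (3, 2) marker positions

def ik_loss(angles, targets):
    """Squared marker residual plus a soft anatomical joint-limit penalty."""
    residual = forward_kinematics(angles) - targets
    violation = (jnp.maximum(LIMITS[:, 0] - angles, 0.0)
                 + jnp.maximum(angles - LIMITS[:, 1], 0.0))
    return jnp.sum(residual ** 2) + 10.0 * jnp.sum(violation ** 2)

@jax.jit
def ik_step(angles, targets, lr=0.5):
    return angles - lr * jax.grad(ik_loss)(angles, targets)

# Stand-in "detected" markers: a known pose plus a 2 mm systematic offset,
# mimicking imperfect foundation-model output mapped onto the model.
targets = forward_kinematics(jnp.array([0.6, 0.8, 0.4])) + 0.002

angles = jnp.zeros(3)
for _ in range(300):
    angles = ik_step(angles, targets)
print("recovered joint angles (rad):", angles)
```

Because the whole loss is differentiable, the same structure scales to a full biomechanical model once forward kinematics comes from a differentiable simulator.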

Core claim

The central claim is that anatomically constrained finger joint angles and hand positions can be recovered from monocular video by combining 3D pose estimates from a foundation model with inverse kinematics optimization in a biomechanical model, using a custom mapping from rig outputs to model markers. Tested against 8-camera multiview reconstructions, this pipeline yields joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm.

What carries the argument

A custom mapping from estimated 3D poses to biomechanical model markers that allows inverse kinematics optimization to enforce anatomical constraints and produce usable finger angles from single-view input.
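
One hedged way to make that mapping concrete: many pose-to-marker mappings are fixed linear regressors that write each biomechanical marker as a weighted combination of rig landmarks. The sketch below assumes that form; the paper's actual MHR-to-marker mapping is not specified here, and the shapes and weights are illustrative.

```python
import jax.numpy as jnp

def map_rig_to_markers(rig_points, regressor):
    """rig_points: (K, 3) rig landmark positions for one frame.
    regressor: (M, K) fixed weights, one row per biomechanical marker,
    each row summing to 1 so markers stay in the rig's coordinate frame."""
    return regressor @ rig_points  # (M, 3) marker targets handed to IK

# Toy example: a single marker placed midway between rig landmarks 3 and 7.
K = 10
regressor = jnp.zeros((1, K)).at[0, 3].set(0.5).at[0, 7].set(0.5)
rig_points = jnp.arange(K * 3, dtype=jnp.float32).reshape(K, 3)
print(map_rig_to_markers(rig_points, regressor))
```

Any systematic error in the weights shows up as a constant marker offset, which is exactly the bias the load-bearing premise below asks the optimizer to absorb.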

If this is right

  • Detailed finger tracking becomes possible from ordinary single-camera recordings without multi-view hardware.
  • Quantitative monitoring of activities of daily living and range of motion extends to settings where only monocular video is available.
  • Results hold across different camera angles and across alternative methods for computing reference joint values from multiview video.
  • Biomechanical analysis of object manipulation tasks gains a practical monocular pathway.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could be applied to existing large-scale video collections to study hand function at population scale.
  • Mobile-phone recordings might support at-home tracking of rehabilitation progress or disease progression.
  • GPU-accelerated versions open the door to near-real-time clinical feedback during movement assessments.

Load-bearing premise

The initial 3D poses produced from monocular video are accurate enough that the subsequent optimization can remove any systematic bias introduced by the mapping to biomechanical markers.

What would settle it

Direct measurement on a held-out set of hand movements showing joint angle errors substantially larger than 10 degrees or position errors much greater than 6 mm when compared to multi-camera ground truth.
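
Both headline numbers are reported after Procrustes alignment; the ~6 mm figure is a PA-MPJPE-style error. The standard computation is short enough to sketch, with the caveat that the paper's exact alignment variant (with or without scaling, and over which marker subset) is an assumption here:

```python
import jax.numpy as jnp

def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-marker position error.
    pred, gt: (J, 3) arrays of corresponding positions in meters."""
    P = pred - pred.mean(axis=0)
    G = gt - gt.mean(axis=0)
    U, S, Vt = jnp.linalg.svd(P.T @ G)
    signs = jnp.array([1.0, 1.0, jnp.sign(jnp.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ jnp.diag(signs) @ U.T            # optimal rotation, no reflection
    scale = (S * signs).sum() / (P ** 2).sum()  # optimal isotropic scale
    aligned = scale * P @ R.T + gt.mean(axis=0)
    return jnp.linalg.norm(aligned - gt, axis=1).mean()
```

Run over hand markers with a hand-only alignment this gives the hand-aligned variant reported in Figure 2; run over all upper-extremity markers it gives the UE-aligned one.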

Figures

Figures reproduced from arXiv: 2605.09258 by Pouyan Firouzabadi, R. James Cotton, Wendy Murray.

Figure 1. Hand reconstruction examples across participants. Each row shows one participant with panels for: raw image, monocular overlay (red), top-down view, and multiview reference overlay (green).

Figure 2. Comparison of PA-MPJPE across body regions. The Full UE bar averages over all upper-extremity markers (arm + hand). Hand-only bars subset this: UE-aligned keeps the upper-extremity Procrustes alignment but reports hand markers only (~14 mm); hand-aligned uses a hand-only alignment to isolate finger articulation accuracy (~6 mm). Hand markers fit the alignment more tightly than the proximal arm/torso marker…

Figure 3. Per-joint mean angular error for left and right upper extremities, computed as the absolute difference between monocular IK joint angles and the corresponding Sapiens multiview IK joint angles (Sapiens-substituted hand keypoints with the same biomechanical model fit across 8 views). Includes elbow flexion, forearm pronation/supination, and individual finger errors.

Figure 4. Upper extremity PA-MPJPE by camera viewpoint.
Original abstract

Accurate hand and finger tracking from video has significant clinical applications for monitoring activities of daily living and measuring range of motion, yet monocular video approaches for obtaining hand biomechanics remain under-developed. We present a method that combines the SAM 3D Body foundation model with inverse kinematics optimization in a full-body biomechanical model to extract anatomically-constrained finger joint angles from single-view video. We port SAM 3D Body from PyTorch to JAX for integration with MuJoCo-MJX, enabling GPU-accelerated optimization, and develop a novel mapping between the Momentum Human Rig (MHR) outputs and biomechanical model markers. Validation against 8-camera multiview reconstruction on 4,590 frames from 7 participants performing a variety of hand poses and object manipulation tasks shows finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm, after Procrustes alignment. Results were consistent across camera viewpoints and robust to different methods for producing reference values from multiview video. This work extends monocular biomechanical analysis to detailed finger tracking, expanding access to quantitative characterization of hand movement from readily available video.
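
Two implementation details from the abstract and discussion are worth unpacking: the JAX port enables GPU-accelerated optimization with MuJoCo-MJX, and the paper's discussion describes a multistage Levenberg-Marquardt scheme (root positioning, then full pose with scaling, then marker offset refinement). A minimal LM step on a generic residual is sketched below; the staging-by-masking trick, damping value, and iteration counts are assumptions for illustration, not the paper's reported settings.

```python
import jax
import jax.numpy as jnp

def lm_step(params, residual_fn, damping=1e-2):
    """One Levenberg-Marquardt update for least squares on residual_fn.
    params: (P,) vector; residual_fn: params -> (N,) residual vector."""
    r = residual_fn(params)
    J = jax.jacobian(residual_fn)(params)              # (N, P)
    H = J.T @ J + damping * jnp.eye(params.shape[0])   # damped Gauss-Newton
    return params - jnp.linalg.solve(H, J.T @ r)

def staged_lm(params, residual_fn, stage_masks, iters=20):
    """Optimize one parameter subset per stage by masking the update,
    e.g. root pose first, then the full pose vector."""
    for mask in stage_masks:                           # mask: 0/1 vector, (P,)
        for _ in range(iters):
            params = params + mask * (lm_step(params, residual_fn) - params)
    return params
```

Because the residual and its Jacobian are traced by JAX, the loop can be jitted and vmapped across frames, which is the practical payoff of the PyTorch-to-JAX port the abstract mentions.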

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents a monocular biomechanical tracking method for fingers that combines the SAM 3D Body foundation model with inverse kinematics optimization inside a full-body biomechanical model (MuJoCo-MJX). It ports SAM 3D Body to JAX for GPU acceleration, introduces a novel mapping from Momentum Human Rig outputs to biomechanical markers, and validates the approach on 4,590 frames from 7 participants, reporting finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm versus 8-camera multiview reconstruction after Procrustes alignment.

Significance. If the validation is reliable, the work would meaningfully extend monocular video analysis to detailed, anatomically constrained finger biomechanics, supporting clinical applications such as range-of-motion assessment and activity monitoring. The use of foundation models, JAX/MuJoCo integration for optimization, and reported cross-view consistency are constructive elements.

major comments (3)
  1. [Validation] Validation section: The accuracy of the 8-camera multiview ground-truth reconstruction is not quantified (no marker localization error, calibration repeatability, synchronization metrics, or leave-one-camera-out ablation). Finger-scale triangulation is sensitive to 2-3 mm marker uncertainty, which can produce 5-8° joint-angle variance; without reference error bounds the reported ~10° joint errors cannot be isolated from reference noise (a simulation sketch of this sensitivity claim follows this report).
  2. [Methods] Methods (mapping step): The custom mapping from Momentum Human Rig outputs to biomechanical model markers is introduced as a key component but receives no independent quantitative validation or bias analysis. Any systematic offset in this mapping would propagate directly into the inverse-kinematics optimization and is not isolated in the current error metrics.
  3. [Validation] Validation: Procrustes alignment is applied before reporting the 6 mm hand-position error, yet no errors before versus after alignment, nor per-camera or viewpoint-specific breakdowns, are supplied. This leaves open whether the quoted accuracy reflects true monocular performance or is reduced by the alignment procedure.
minor comments (1)
  1. [Abstract] Abstract and results report errors as 'approximately 10 degrees' and 'approximately 6 mm' without accompanying means, standard deviations, or exact values; providing these would improve precision and reproducibility.
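
The first major comment turns on a quantitative claim: that millimeter-scale marker noise produces several degrees of joint-angle variance at finger scale. That is cheap to sanity-check by simulation. The sketch below assumes phalanx lengths of roughly 45 mm and 25 mm and 2.5 mm isotropic marker noise; it is a plausibility check of the referee's figure, not a reanalysis of the paper's data.

```python
import jax
import jax.numpy as jnp

def joint_angle(a, b, c):
    """Angle at b (radians) formed by the points a-b-c."""
    u, v = a - b, c - b
    cos = jnp.dot(u, v) / (jnp.linalg.norm(u) * jnp.linalg.norm(v))
    return jnp.arccos(jnp.clip(cos, -1.0, 1.0))

# Nominal markers for a finger flexed about 57 degrees at the middle joint.
a = jnp.array([0.0, 0.0, 0.0])                     # proximal marker
b = jnp.array([0.045, 0.0, 0.0])                   # joint center, 45 mm away
c = b + 0.025 * jnp.array([jnp.cos(1.0), jnp.sin(1.0), 0.0])

noise = 0.0025 * jax.random.normal(jax.random.PRNGKey(0), (10_000, 3, 3))
angles = jax.vmap(lambda n: joint_angle(a + n[0], b + n[1], c + n[2]))(noise)
print("joint-angle std (deg):", jnp.rad2deg(angles.std()))
```

Under these assumptions the spread lands in the several-degree range, which is why the referee insists the multiview reference's own error budget must be reported before the ~10° monocular figure can be interpreted.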

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and commit to revisions that strengthen the validation and methods sections without misrepresenting the current results.

point-by-point responses
  1. Referee: [Validation] Validation section: The accuracy of the 8-camera multiview ground-truth reconstruction is not quantified (no marker localization error, calibration repeatability, synchronization metrics, or leave-one-camera-out ablation). Finger-scale triangulation is sensitive to 2-3 mm marker uncertainty, which can produce 5-8° joint-angle variance; without reference error bounds the reported ~10° joint errors cannot be isolated from reference noise.

    Authors: We agree that explicit quantification of multiview ground-truth accuracy is necessary to interpret the monocular errors. The manuscript reports consistency across viewpoints and robustness to alternative reference methods, but does not provide marker localization error, calibration repeatability, synchronization metrics, or leave-one-camera-out results. In the revised manuscript we will add a dedicated validation subsection reporting these quantities, including an estimate of triangulation uncertainty at finger scale derived from repeated calibrations and synchronization checks. This will allow clearer isolation of reference noise from the reported ~10° joint-angle errors. revision: yes

  2. Referee: [Methods] Methods (mapping step): The custom mapping from Momentum Human Rig outputs to biomechanical model markers is introduced as a key component but receives no independent quantitative validation or bias analysis. Any systematic offset in this mapping would propagate directly into the inverse-kinematics optimization and is not isolated in the current error metrics.

    Authors: The referee correctly notes the absence of independent validation for the mapping. While end-to-end performance is evaluated against multiview reconstruction, the mapping itself is not separately quantified. We will revise the methods and validation sections to include an independent assessment of the mapping, for example by reporting marker-position discrepancies on held-out data or controlled synthetic tests that quantify bias and variance (a sketch of such a test follows these responses). This addition will demonstrate that the mapping does not introduce unaccounted systematic offsets. revision: yes

  3. Referee: [Validation] Validation: Procrustes alignment is applied before reporting the 6 mm hand-position error, yet no errors before versus after alignment, nor per-camera or viewpoint-specific breakdowns, are supplied. This leaves open whether the quoted accuracy reflects true monocular performance or is reduced by the alignment procedure.

    Authors: We agree that reporting hand-position errors both before and after Procrustes alignment, together with per-camera and viewpoint-specific breakdowns, would increase transparency. The manuscript already states that results are consistent across viewpoints, but does not supply the unaligned numbers or per-view metrics. In the revision we will expand the validation tables and figures to include these quantities, thereby clarifying the contribution of the alignment step to the reported 6 mm error. revision: yes
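
The synthetic test proposed in response 2 has a simple shape: generate frames where true marker positions are known by construction, run the mapping, and separate the mean error (bias) from its spread (variance). A hedged sketch with hypothetical shapes and a stand-in mapping function:

```python
import jax
import jax.numpy as jnp

def mapping_bias_variance(map_fn, rig_frames, true_markers):
    """rig_frames: (F, K, 3) synthetic rig landmarks; true_markers: (F, M, 3)
    ground truth known by construction. Returns per-marker bias and variance."""
    pred = jax.vmap(map_fn)(rig_frames)        # (F, M, 3) predicted markers
    err = pred - true_markers
    return err.mean(axis=0), err.var(axis=0)   # (M, 3) bias, (M, 3) variance

# Stand-in mapping: average the first two landmarks (purely illustrative).
map_fn = lambda rig: rig[:2].mean(axis=0, keepdims=True)

rig_frames = jax.random.normal(jax.random.PRNGKey(1), (500, 10, 3))
true_markers = rig_frames[:, :2].mean(axis=1, keepdims=True)   # exact target
bias, var = mapping_bias_variance(map_fn, rig_frames, true_markers)
print("bias:", bias, "variance:", var)         # ~0 for this perfect toy mapping
```

A consistently nonzero bias row is exactly the "systematic offset" the referee warns would propagate into the inverse-kinematics fit.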

Circularity Check

0 steps flagged

No significant circularity; validation against independent multiview data

full rationale

The paper describes a pipeline that ports a pre-trained foundation model (SAM 3D Body), develops a mapping to a biomechanical model, and runs inverse-kinematics optimization. Reported errors are measured on held-out frames against an 8-camera multiview reconstruction that is not derived from or fitted to the monocular outputs. No equations, fitted parameters, or self-citations are shown to reduce the accuracy figures to quantities constructed from the same inputs. The method is therefore self-contained against an external benchmark, consistent with the default expectation of no circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim depends on two unverified assumptions: that SAM 3D Body supplies accurate enough monocular 3D hand poses for the subsequent optimization to recover anatomically valid joint angles, and that the authors' novel MHR-to-biomechanical-marker mapping preserves accuracy without introducing uncorrectable bias. No free parameters or invented entities are explicitly quantified in the abstract.

free parameters (1)
  • inverse-kinematics optimization weights and convergence tolerances
    These control how strongly the optimizer respects anatomical constraints versus matching the foundation-model output; their values are not reported (a hypothetical configuration sketch follows this ledger).
axioms (2)
  • domain assumption: the SAM 3D Body foundation model produces usable 3D finger poses from monocular video
    Invoked as the source of 3D input that is then mapped and optimized.
  • ad hoc to paper: the custom mapping from Momentum Human Rig outputs to biomechanical model markers is accurate and unbiased
    Described as novel and required for the pipeline to function.
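
To make the ledger's one free parameter concrete, here is a hypothetical configuration record of the kind whose values would need reporting; every name and default below is illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IKConfig:
    """Hypothetical IK knobs of the sort the ledger flags as unreported."""
    marker_weight: float = 1.0    # data term: match mapped marker targets
    limit_penalty: float = 10.0   # weight on anatomical joint-limit violations
    lm_damping: float = 1e-2      # Levenberg-Marquardt damping factor
    max_iters: int = 100          # per-stage iteration cap
    loss_tol: float = 1e-6        # convergence tolerance on loss change
```

Reporting one such record per optimization stage, alongside the results, would close the ledger's open item.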

pith-pipeline@v0.9.0 · 5499 in / 1335 out tokens · 43704 ms · 2026-05-12T04:38:53.773004+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

  1. [1]

    Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image

    Bogo, Federica, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black (2016). “Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image”. In: Computer Vision – ECCV 2016. Springer, Cham, pp. 561–578

  2. [2]

    JAX: Composable Transformations of Python+NumPy Programs

    Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang (2018). JAX: Composable Transformations of Python+NumPy Programs

  3. [3]

    PosePipe: Open-Source Human Pose Estimation Pipeline for Clinical Research

    Cotton, R. James (2022). “PosePipe: Open-Source Human Pose Estimation Pipeline for Clinical Research”. In: arXiv preprint arXiv:2203.08792. — (2025). “Differentiable Biomechanics Unlocks Opportunities for Markerless Motion Capture”. In: 2025 International Conference On Rehabilitation Robotics (ICORR), pp. 44–51

  4. [4]

    Improved Trajectory Reconstruction for Markerless Pose Estimation

    Cotton, R. James, Anthony Cimorelli, Kunal Shah, Shawana Anarwala, Scott Uhlrich, and Tasos Karakostas (2023). “Improved Trajectory Reconstruction for Markerless Pose Estimation” . In: 45th Annual International Conference of the IEEE Engineering in Medicine and Biology Society

  5. [5]

    Biomechanical Reconstruction with Confidence Intervals from Multiview Markerless Motion Capture

    Cotton, R. James and Fabian Sinz (2025). “Biomechanical Reconstruction with Confidence Intervals from Multiview Markerless Motion Capture”. In: 2025 47th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)

  6. [6]

    OpenSim: Open-Source Software to Create and Analyze Dynamic Simulations of Movement

    Delp, Scott L., Frank C. Anderson, Allison S. Arnold, Peter Loan, Ayman Habib, Chand T. John, Eran Guendelman, and Darryl G. Thelen (2007). “OpenSim: Open-Source Software to Create and Analyze Dynamic Simulations of Movement”. In: IEEE Transactions on Biomedical Engineering 54.11, pp. 1940–1950

  7. [7]

    EMBC Special Issue: Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture

    Donahue, Seth, Irina Djuraskovic, Kunal Shah, Fabian Sinz, Ross Chafetz, and R. James Cotton (2026). EMBC Special Issue: Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture

  8. [8]

    HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

    Fan, Zicong, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J. Black, and Otmar Hilliges (2023). HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

  9. [9]

    Lee, Jinhyung Park, Jinlong Yang, John Doublestein, Kishore Venkateshan, Kris Kitani, Ladislav Kavan, Marco Dal Farra, Matthew Hu, Matthew Cioffi, Michael Fabris, Michael Ranieri, Mohammad Modarres, Petr Kadlecek, Rawal Khirodkar, Rinat Abdrashitov, Romain Prevost, Roman Rajbhandari, Ronald Mallet, Russel Pearsall, Sandy Kao, Sanjeev Kumar, Scott Parrish, ... Yu, Shunsuke Saito, Takaaki Shiratori, Te-Li Wang, Tony Tung, Yichen Xu, Yuan Dong, Yuhua Chen, Yuanlu Xu, Yuting Ye, and Zhongshi Jiang (2025)

  10. [10]

    Biomechanical Arm and Hand Tracking with Multiview Markerless Motion Capture

    Firouzabadi, Pouyan, Wendy Murray, Anton R Sobinov, J.D. Peiffer, Kunal Shah, Lee E Miller, and R. James Cotton (2024). “Biomechanical Arm and Hand Tracking with Multiview Markerless Motion Capture” . In: IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob)

  11. [11]

    MoVi: A Large Multi-Purpose Human Motion and Video Dataset

    Ghorbani, Saeed, Kimia Mahdaviani, Anne Thaler, Konrad Kording, Douglas James Cook, Gunnar Blohm, and Nikolaus F. Troje (2021). “MoVi: A Large Multi-Purpose Human Motion and Video Dataset” . In: PLOS ONE 16.6, e0253157

  12. [12]

    LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion

    Al-Hafez, Firas, Guoping Zhao, Jan Peters, and Davide Tateo (2023). LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion

  13. [13]

    Muscle contributions to propulsion and support during running

    Hamner, Samuel R., Ajay Seth, and Scott L. Delp (2010). “Muscle contributions to propulsion and support during running” . In: Journal of Biomechanics 43.14, pp. 2709–2716

  14. [14]

    A model of the upper extremity for simulating musculoskeletal surgery and analyzing neuromuscular control

    Holzbaur, Katherine R. S., Wendy M. Murray, and Scott L. Delp (2005). “A model of the upper extremity for simulating musculoskeletal surgery and analyzing neuromuscular control”. In: Annals of Biomedical Engineering 33.6, pp. 829–840

  15. [15]

    Sapiens: Foundation for Human Vision Models

    Khirodkar, Rawal, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito (2024). Sapiens: Foundation for Human Vision Models

  16. [16]

    Equinox: neural networks in JAX via callable PyTrees and filtered transformations

    Kidger, Patrick and Cristian Garcia (2021). “Equinox: neural networks in JAX via callable PyTrees and filtered transformations” . In: Differentiable Programming workshop at Neural Information Processing Systems 2021

  17. [17]

    BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos

    Koleini, Farnoosh, Muhammad Usama Saleem, Pu Wang, Hongfei Xue, Ahmed Helmy, and Abbey Fenwick (2025). BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos

  18. [18]

    SMPL: A Skinned Multi-Person Linear Model

    Loper, Matthew, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black (2015). “SMPL: A Skinned Multi-Person Linear Model” . In: ACM Transactions on Graphics (Proc. SIGGRAPH Asia)

  19. [19]

    A Musculoskeletal Model of the Hand and Wrist Capable of Simulating Functional Tasks

    McFarland, Daniel C., Benjamin I. Binder-Markey, Jennifer A. Nichols, Sarah J. Wohlman, Marije de Bruin, and Wendy M. Murray (2023). “A Musculoskeletal Model of the Hand and Wrist Capable of Simulating Functional Tasks”. In: IEEE Transactions on Biomedical Engineering 70.5, pp. 1424–1435

  20. [20]

    ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling

    Park, Jinhyung, Javier Romero, Shunsuke Saito, Fabian Prada, Takaaki Shiratori, Yichen Xu, Federica Bogo, Shoou-I. Yu, Kris Kitani, and Rawal Khirodkar (2025). ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling

  21. [21]

    Learning 3D Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats

    Peiffer, J. D., Kunal Shah, Irina Djuraskovic, Shawana Anarwala, Kayan Abdou, Rujvee Patel, Prakash Jayabalan, Brenton Pennicooke, and R. James Cotton (2025). Portable biomechanics laboratory: Clinically accessible movement analysis from a handheld smartphone. Sárándi, István, Alexander Hermans, and Bastian Leibe (2023). “Learning 3D Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats”

  22. [22]

    DINOv3

    Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, and Piotr Bojanowski (2025). DINOv3

  23. [23]

    MuJoCo: A Physics Engine for Model-Based Control

    Todorov, Emanuel, Tom Erez, and Yuval Tassa (2012). “MuJoCo: A Physics Engine for Model-Based Control” . In: IEEE/RSJ International Conference on Intelligent Robots and Systems

  24. [24]

    Differentiable biomechanics for markerless motion capture in upper limb stroke rehabilitation: a comparison with optical motion capture

    Unger, Tim, Arash Sal Moslehian, J.D. Peiffer, Johann Ullrich, Roger Gassert, Olivier Lambercy, R. James Cotton, and Chris Awai Easthope (2025). “Differentiable biomechanics for markerless motion capture in upper limb stroke rehabilitation: a comparison with optical motion capture” . In: IEEE Transactions on Medical Robotics and Bionics , pp. 1–1

  25. [25]

    Reconstructing Hand-Held Objects in 3D

    Wu, Jane, Georgios Pavlakos, Georgia Gkioxari, and Jitendra Malik (2024). Reconstructing Hand-Held Objects in 3D

  26. [26]

    Reconstructing Humans with a Biomechanically Accurate Skeleton

    Xia, Yan, Xiaowei Zhou, Etienne Vouga, Qixing Huang, and Georgios Pavlakos (2025). “Reconstructing Humans with a Biomechanically Accurate Skeleton” . In: CVPR

  27. [27]

    SAM 3D Body: Robust Full-Body Human Mesh Recovery

    Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, and Kris Kitani (2025). SAM 3D Body: Robust Full-Body Human Mesh Recovery