pith. machine review for the scientific record.

arxiv: 2605.09258 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.AI


Monocular Biomechanical Tracking of Fingers with Inverse Kinematics to Foundation Models


Pith reviewed 2026-05-12 04:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords monocular video · hand tracking · finger joint angles · inverse kinematics · biomechanical model · pose estimation · clinical monitoring · range of motion

The pith

Finger joint angles can be extracted from monocular video to within about 10 degrees by feeding foundation-model pose estimates into inverse kinematics optimization on a biomechanical model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates a pipeline that takes 3D hand and finger poses estimated from a single camera view and refines them through inverse kinematics inside a full-body biomechanical model to produce anatomically valid joint angles. Validation on 4,590 frames from seven participants performing varied poses and object-manipulation tasks shows roughly 10-degree joint errors and 6-millimeter position errors after alignment with multi-camera references. This matters because hand biomechanics has direct clinical uses in tracking daily activities and measuring range of motion, yet monocular methods have lagged behind multi-view systems. The results stay consistent across viewpoints and across different ways of generating reference values.
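
The second stage is easiest to see as a small optimization problem: treat the mapped foundation-model points as targets and search for joint angles whose forward kinematics reproduce them while respecting anatomical limits. Below is a minimal JAX sketch on a toy planar three-joint finger (JAX because the paper ports its pipeline to JAX); the paper's actual implementation runs Levenberg-Marquardt on a full-body model in MuJoCo-MJX, so the segment lengths, joint limits, penalty weight, and plain gradient loop here are illustrative assumptions only.

```python
import jax
import jax.numpy as jnp

# Toy planar finger: three hinge joints (MCP, PIP, DIP), segment lengths in meters.
SEG_LENGTHS = jnp.array([0.045, 0.025, 0.018])
# Flexion limits in radians (hypothetical values, for illustration only).
LIMITS = jnp.array([[0.0, 1.6], [0.0, 1.9], [0.0, 1.4]])

def forward_kinematics(angles):
    """2D positions of the three segment endpoints for given joint angles."""
    orientations = jnp.cumsum(angles)  # absolute orientation of each segment
    steps = SEG_LENGTHS[:, None] * jnp.stack(
        [jnp.cos(orientations), jnp.sin(orientations)], axis=-1)
    return jnp.cumsum(steps, axis=0)   # (3, 2) marker positions

def ik_loss(angles, targets):
    """Squared marker residual plus a soft anatomical joint-limit penalty."""
    residual = forward_kinematics(angles) - targets
    violation = (jnp.maximum(LIMITS[:, 0] - angles, 0.0)
                 + jnp.maximum(angles - LIMITS[:, 1], 0.0))
    return jnp.sum(residual ** 2) + 10.0 * jnp.sum(violation ** 2)

@jax.jit
def ik_step(angles, targets, lr=0.5):
    return angles - lr * jax.grad(ik_loss)(angles, targets)

# Stand-in "detected" markers: a known pose plus a 2 mm systematic offset,
# mimicking imperfect foundation-model output mapped onto the model.
targets = forward_kinematics(jnp.array([0.6, 0.8, 0.4])) + 0.002

angles = jnp.zeros(3)
for _ in range(300):
    angles = ik_step(angles, targets)
print("recovered joint angles (rad):", angles)
```

Because the whole loss is differentiable, the same structure scales to a full biomechanical model once forward kinematics comes from a differentiable simulator.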

Core claim

The central claim is that anatomically constrained finger joint angles and hand positions can be recovered from monocular video by combining 3D pose estimates from a foundation model with inverse kinematics optimization in a biomechanical model, using a custom mapping from rig outputs to model markers. Tested against 8-camera multiview reconstructions, this pipeline yields joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm.

What carries the argument

A custom mapping from estimated 3D poses to biomechanical model markers that allows inverse kinematics optimization to enforce anatomical constraints and produce usable finger angles from single-view input.
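
One hedged way to make that mapping concrete: many pose-to-marker mappings are fixed linear regressors that write each biomechanical marker as a weighted combination of rig landmarks. The sketch below assumes that form; the paper's actual MHR-to-marker mapping is not specified here, and the shapes and weights are illustrative.

```python
import jax.numpy as jnp

def map_rig_to_markers(rig_points, regressor):
    """rig_points: (K, 3) rig landmark positions for one frame.
    regressor: (M, K) fixed weights, one row per biomechanical marker,
    each row summing to 1 so markers stay in the rig's coordinate frame."""
    return regressor @ rig_points  # (M, 3) marker targets handed to IK

# Toy example: a single marker placed midway between rig landmarks 3 and 7.
K = 10
regressor = jnp.zeros((1, K)).at[0, 3].set(0.5).at[0, 7].set(0.5)
rig_points = jnp.arange(K * 3, dtype=jnp.float32).reshape(K, 3)
print(map_rig_to_markers(rig_points, regressor))
```

Any systematic error in the weights shows up as a constant marker offset, which is exactly the bias the load-bearing premise below asks the optimizer to absorb.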

If this is right

  • Detailed finger tracking becomes possible from ordinary single-camera recordings without multi-view hardware.
  • Quantitative monitoring of activities of daily living and range of motion extends to settings where only monocular video is available.
  • Results hold across different camera angles and across alternative methods for computing reference joint values from multiview video.
  • Biomechanical analysis of object manipulation tasks gains a practical monocular pathway.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could be applied to existing large-scale video collections to study hand function at population scale.
  • Mobile-phone recordings might support at-home tracking of rehabilitation progress or disease progression.
  • GPU-accelerated versions open the door to near-real-time clinical feedback during movement assessments.

Load-bearing premise

The initial 3D poses produced from monocular video are accurate enough that the subsequent optimization can remove any systematic bias introduced by the mapping to biomechanical markers.

What would settle it

Direct measurement on a held-out set of hand movements showing joint angle errors substantially larger than 10 degrees or position errors much greater than 6 mm when compared to multi-camera ground truth.
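
Both headline numbers are reported after Procrustes alignment; the ~6 mm figure is a PA-MPJPE-style error. The standard computation is short enough to sketch, with the caveat that the paper's exact alignment variant (with or without scaling, and over which marker subset) is an assumption here:

```python
import jax.numpy as jnp

def pa_mpjpe(pred, gt):
    """Procrustes-aligned mean per-marker position error.
    pred, gt: (J, 3) arrays of corresponding positions in meters."""
    P = pred - pred.mean(axis=0)
    G = gt - gt.mean(axis=0)
    U, S, Vt = jnp.linalg.svd(P.T @ G)
    signs = jnp.array([1.0, 1.0, jnp.sign(jnp.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ jnp.diag(signs) @ U.T            # optimal rotation, no reflection
    scale = (S * signs).sum() / (P ** 2).sum()  # optimal isotropic scale
    aligned = scale * P @ R.T + gt.mean(axis=0)
    return jnp.linalg.norm(aligned - gt, axis=1).mean()
```

Run over hand markers with a hand-only alignment this gives the hand-aligned variant reported in Figure 2; run over all upper-extremity markers it gives the UE-aligned one.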

Figures

Figures reproduced from arXiv: 2605.09258 by Pouyan Firouzabadi, R. James Cotton, Wendy Murray.

Figure 1. Hand reconstruction examples across participants. Each row shows one participant with panels for: raw image, monocular overlay (red), top-down view, and multiview reference overlay (green).

Figure 2. Comparison of PA-MPJPE across body regions. The Full UE bar averages over all upper-extremity markers (arm + hand). Hand-only bars subset this: UE-aligned keeps the upper-extremity Procrustes alignment but reports hand markers only (~14 mm); hand-aligned uses a hand-only alignment to isolate finger articulation accuracy (~6 mm). Hand markers fit the alignment more tightly than the proximal arm/torso marker…

Figure 3. Per-joint mean angular error for left and right upper extremities, computed as the absolute difference between monocular IK joint angles and the corresponding Sapiens multiview IK joint angles (Sapiens-substituted hand keypoints with the same biomechanical model fit across 8 views). Includes elbow flexion, forearm pronation/supination, and individual finger errors.

Figure 4. Upper extremity PA-MPJPE by camera viewpoint.
Original abstract

Accurate hand and finger tracking from video has significant clinical applications for monitoring activities of daily living and measuring range of motion, yet monocular video approaches for obtaining hand biomechanics remain under-developed. We present a method that combines the SAM 3D Body foundation model with inverse kinematics optimization in a full-body biomechanical model to extract anatomically-constrained finger joint angles from single-view video. We port SAM 3D Body from PyTorch to JAX for integration with MuJoCo-MJX, enabling GPU-accelerated optimization, and develop a novel mapping between the Momentum Human Rig (MHR) outputs and biomechanical model markers. Validation against 8-camera multiview reconstruction on 4,590 frames from 7 participants performing a variety of hand poses and object manipulation tasks shows finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm, after Procrustes alignment. Results were consistent across camera viewpoints and robust to different methods for producing reference values from multiview video. This work extends monocular biomechanical analysis to detailed finger tracking, expanding access to quantitative characterization of hand movement from readily available video.
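
Two implementation details from the abstract and discussion are worth unpacking: the JAX port enables GPU-accelerated optimization with MuJoCo-MJX, and the paper's discussion describes a multistage Levenberg-Marquardt scheme (root positioning, then full pose with scaling, then marker offset refinement). A minimal LM step on a generic residual is sketched below; the staging-by-masking trick, damping value, and iteration counts are assumptions for illustration, not the paper's reported settings.

```python
import jax
import jax.numpy as jnp

def lm_step(params, residual_fn, damping=1e-2):
    """One Levenberg-Marquardt update for least squares on residual_fn.
    params: (P,) vector; residual_fn: params -> (N,) residual vector."""
    r = residual_fn(params)
    J = jax.jacobian(residual_fn)(params)              # (N, P)
    H = J.T @ J + damping * jnp.eye(params.shape[0])   # damped Gauss-Newton
    return params - jnp.linalg.solve(H, J.T @ r)

def staged_lm(params, residual_fn, stage_masks, iters=20):
    """Optimize one parameter subset per stage by masking the update,
    e.g. root pose first, then the full pose vector."""
    for mask in stage_masks:                           # mask: 0/1 vector, (P,)
        for _ in range(iters):
            params = params + mask * (lm_step(params, residual_fn) - params)
    return params
```

Because the residual and its Jacobian are traced by JAX, the loop can be jitted and vmapped across frames, which is the practical payoff of the PyTorch-to-JAX port the abstract mentions.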

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents a monocular biomechanical tracking method for fingers that combines the SAM 3D Body foundation model with inverse kinematics optimization inside a full-body biomechanical model (MuJoCo-MJX). It ports SAM 3D Body to JAX for GPU acceleration, introduces a novel mapping from Momentum Human Rig outputs to biomechanical markers, and validates the approach on 4,590 frames from 7 participants, reporting finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm versus 8-camera multiview reconstruction after Procrustes alignment.

Significance. If the validation is reliable, the work would meaningfully extend monocular video analysis to detailed, anatomically constrained finger biomechanics, supporting clinical applications such as range-of-motion assessment and activity monitoring. The use of foundation models, JAX/MuJoCo integration for optimization, and reported cross-view consistency are constructive elements.

major comments (3)
  1. [Validation] Validation section: The accuracy of the 8-camera multiview ground-truth reconstruction is not quantified (no marker localization error, calibration repeatability, synchronization metrics, or leave-one-camera-out ablation). Finger-scale triangulation is sensitive to 2-3 mm marker uncertainty, which can produce 5-8° joint-angle variance; without reference error bounds the reported ~10° joint errors cannot be isolated from reference noise (a simulation sketch of this sensitivity claim follows this report).
  2. [Methods] Methods (mapping step): The custom mapping from Momentum Human Rig outputs to biomechanical model markers is introduced as a key component but receives no independent quantitative validation or bias analysis. Any systematic offset in this mapping would propagate directly into the inverse-kinematics optimization and is not isolated in the current error metrics.
  3. [Validation] Validation: Procrustes alignment is applied before reporting the 6 mm hand-position error, yet no errors before versus after alignment, nor per-camera or viewpoint-specific breakdowns, are supplied. This leaves open whether the quoted accuracy reflects true monocular performance or is reduced by the alignment procedure.
minor comments (1)
  1. [Abstract] Abstract and results report errors as 'approximately 10 degrees' and 'approximately 6 mm' without accompanying means, standard deviations, or exact values; providing these would improve precision and reproducibility.
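
The first major comment turns on a quantitative claim: that millimeter-scale marker noise produces several degrees of joint-angle variance at finger scale. That is cheap to sanity-check by simulation. The sketch below assumes phalanx lengths of roughly 45 mm and 25 mm and 2.5 mm isotropic marker noise; it is a plausibility check of the referee's figure, not a reanalysis of the paper's data.

```python
import jax
import jax.numpy as jnp

def joint_angle(a, b, c):
    """Angle at b (radians) formed by the points a-b-c."""
    u, v = a - b, c - b
    cos = jnp.dot(u, v) / (jnp.linalg.norm(u) * jnp.linalg.norm(v))
    return jnp.arccos(jnp.clip(cos, -1.0, 1.0))

# Nominal markers for a finger flexed about 57 degrees at the middle joint.
a = jnp.array([0.0, 0.0, 0.0])                     # proximal marker
b = jnp.array([0.045, 0.0, 0.0])                   # joint center, 45 mm away
c = b + 0.025 * jnp.array([jnp.cos(1.0), jnp.sin(1.0), 0.0])

noise = 0.0025 * jax.random.normal(jax.random.PRNGKey(0), (10_000, 3, 3))
angles = jax.vmap(lambda n: joint_angle(a + n[0], b + n[1], c + n[2]))(noise)
print("joint-angle std (deg):", jnp.rad2deg(angles.std()))
```

Under these assumptions the spread lands in the several-degree range, which is why the referee insists the multiview reference's own error budget must be reported before the ~10° monocular figure can be interpreted.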

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and commit to revisions that strengthen the validation and methods sections without misrepresenting the current results.

point-by-point responses
  1. Referee: [Validation] Validation section: The accuracy of the 8-camera multiview ground-truth reconstruction is not quantified (no marker localization error, calibration repeatability, synchronization metrics, or leave-one-camera-out ablation). Finger-scale triangulation is sensitive to 2-3 mm marker uncertainty, which can produce 5-8° joint-angle variance; without reference error bounds the reported ~10° joint errors cannot be isolated from reference noise.

    Authors: We agree that explicit quantification of multiview ground-truth accuracy is necessary to interpret the monocular errors. The manuscript reports consistency across viewpoints and robustness to alternative reference methods, but does not provide marker localization error, calibration repeatability, synchronization metrics, or leave-one-camera-out results. In the revised manuscript we will add a dedicated validation subsection reporting these quantities, including an estimate of triangulation uncertainty at finger scale derived from repeated calibrations and synchronization checks. This will allow clearer isolation of reference noise from the reported ~10° joint-angle errors. revision: yes

  2. Referee: [Methods] Methods (mapping step): The custom mapping from Momentum Human Rig outputs to biomechanical model markers is introduced as a key component but receives no independent quantitative validation or bias analysis. Any systematic offset in this mapping would propagate directly into the inverse-kinematics optimization and is not isolated in the current error metrics.

    Authors: The referee correctly notes the absence of independent validation for the mapping. While end-to-end performance is evaluated against multiview reconstruction, the mapping itself is not separately quantified. We will revise the methods and validation sections to include an independent assessment of the mapping, for example by reporting marker-position discrepancies on held-out data or controlled synthetic tests that quantify bias and variance (a sketch of such a test follows these responses). This addition will demonstrate that the mapping does not introduce unaccounted systematic offsets. revision: yes

  3. Referee: [Validation] Validation: Procrustes alignment is applied before reporting the 6 mm hand-position error, yet no errors before versus after alignment, nor per-camera or viewpoint-specific breakdowns, are supplied. This leaves open whether the quoted accuracy reflects true monocular performance or is reduced by the alignment procedure.

    Authors: We agree that reporting hand-position errors both before and after Procrustes alignment, together with per-camera and viewpoint-specific breakdowns, would increase transparency. The manuscript already states that results are consistent across viewpoints, but does not supply the unaligned numbers or per-view metrics. In the revision we will expand the validation tables and figures to include these quantities, thereby clarifying the contribution of the alignment step to the reported 6 mm error. revision: yes
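
The synthetic test proposed in response 2 has a simple shape: generate frames where true marker positions are known by construction, run the mapping, and separate the mean error (bias) from its spread (variance). A hedged sketch with hypothetical shapes and a stand-in mapping function:

```python
import jax
import jax.numpy as jnp

def mapping_bias_variance(map_fn, rig_frames, true_markers):
    """rig_frames: (F, K, 3) synthetic rig landmarks; true_markers: (F, M, 3)
    ground truth known by construction. Returns per-marker bias and variance."""
    pred = jax.vmap(map_fn)(rig_frames)        # (F, M, 3) predicted markers
    err = pred - true_markers
    return err.mean(axis=0), err.var(axis=0)   # (M, 3) bias, (M, 3) variance

# Stand-in mapping: average the first two landmarks (purely illustrative).
map_fn = lambda rig: rig[:2].mean(axis=0, keepdims=True)

rig_frames = jax.random.normal(jax.random.PRNGKey(1), (500, 10, 3))
true_markers = rig_frames[:, :2].mean(axis=1, keepdims=True)   # exact target
bias, var = mapping_bias_variance(map_fn, rig_frames, true_markers)
print("bias:", bias, "variance:", var)         # ~0 for this perfect toy mapping
```

A consistently nonzero bias row is exactly the "systematic offset" the referee warns would propagate into the inverse-kinematics fit.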

Circularity Check

0 steps flagged

No significant circularity; validation against independent multiview data

full rationale

The paper describes a pipeline that ports a pre-trained foundation model (SAM 3D Body), develops a mapping to a biomechanical model, and runs inverse-kinematics optimization. Reported errors are measured on held-out frames against an 8-camera multiview reconstruction that is not derived from or fitted to the monocular outputs. No equations, fitted parameters, or self-citations are shown to reduce the accuracy figures to quantities constructed from the same inputs. The method is therefore self-contained against an external benchmark, consistent with the default expectation of no circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim depends on two unverified assumptions: that SAM 3D Body supplies accurate enough monocular 3D hand poses for the subsequent optimization to recover anatomically valid joint angles, and that the authors' novel MHR-to-biomechanical-marker mapping preserves accuracy without introducing uncorrectable bias. No free parameters or invented entities are explicitly quantified in the abstract.

free parameters (1)
  • inverse-kinematics optimization weights and convergence tolerances
    These control how strongly the optimizer respects anatomical constraints versus matching the foundation-model output; their values are not reported (a hypothetical configuration sketch follows this ledger).
axioms (2)
  • domain assumption: the SAM 3D Body foundation model produces usable 3D finger poses from monocular video
    Invoked as the source of 3D input that is then mapped and optimized.
  • ad hoc to paper: the custom mapping from Momentum Human Rig outputs to biomechanical model markers is accurate and unbiased
    Described as novel and required for the pipeline to function.
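
To make the ledger's one free parameter concrete, here is a hypothetical configuration record of the kind whose values would need reporting; every name and default below is illustrative, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IKConfig:
    """Hypothetical IK knobs of the sort the ledger flags as unreported."""
    marker_weight: float = 1.0    # data term: match mapped marker targets
    limit_penalty: float = 10.0   # weight on anatomical joint-limit violations
    lm_damping: float = 1e-2      # Levenberg-Marquardt damping factor
    max_iters: int = 100          # per-stage iteration cap
    loss_tol: float = 1e-6        # convergence tolerance on loss change
```

Reporting one such record per optimization stage, alongside the results, would close the ledger's open item.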

pith-pipeline@v0.9.0 · 5499 in / 1335 out tokens · 43704 ms · 2026-05-12T04:38:53.773004+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

  1. [1]

    Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image

    Bogo, Federica, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black (2016). “Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image”. In: Computer Vision – ECCV 2016. Springer, Cham, pp. 561–578

  2. [2]

    JAX: Composable Transformations of Python+NumPy Programs

    Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang (2018). JAX: Composable Transformations of Python+NumPy Programs

  3. [3]

    PosePipe: Open-Source Human Pose Estimation Pipeline for Clinical Research

    Cotton, R. James (2022). “PosePipe: Open-Source Human Pose Estimation Pipeline for Clinical Research”. In: arXiv preprint arXiv:2203.08792. — (2025). “Differentiable Biomechanics Unlocks Opportunities for Markerless Motion Capture”. In: 2025 International Conference On Rehabilitation Robotics (ICORR), pp. 44–51

  4. [4]

    Improved Trajectory Reconstruction for Markerless Pose Estimation

    Cotton, R. James, Anthony Cimorelli, Kunal Shah, Shawana Anarwala, Scott Uhlrich, and Tasos Karakostas (2023). “Improved Trajectory Reconstruction for Markerless Pose Estimation” . In: 45th Annual International Conference of the IEEE Engineering in Medicine and Biology Society

  5. [5]

    Biomechanical Reconstruction with Confidence Intervals from Multiview Markerless Motion Capture

    Cotton, R. James and Fabian Sinz (2025). “Biomechanical Reconstruction with Confidence Intervals from Multiview Markerless Motion Capture”. In: 2025 47th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)

  6. [6]

    OpenSim: Open-Source Software to Create and Analyze Dynamic Simulations of Movement

    Delp, Scott L., Frank C. Anderson, Allison S. Arnold, Peter Loan, Ayman Habib, Chand T. John, Eran Guendelman, and Darryl G. Thelen (2007). “OpenSim: Open-Source Software to Create and Analyze Dynamic Simulations of Movement”. In: IEEE Transactions on Biomedical Engineering 54.11, pp. 1940–1950

  7. [7]

    EMBC Special Issue: Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture

    Donahue, Seth, Irina Djuraskovic, Kunal Shah, Fabian Sinz, Ross Chafetz, and R. James Cotton (2026). EMBC Special Issue: Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture

  8. [8]

    HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

    Fan, Zicong, Maria Parelli, Maria Eleni Kadoglou, Muhammed Kocabas, Xu Chen, Michael J. Black, and Otmar Hilliges (2023). HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

  9. [9]

    Lee, Jinhyung Park, Jinlong Yang, John Doublestein, Kishore Venkateshan, Kris Kitani, Ladislav Kavan, Marco Dal Farra, Matthew Hu, Matthew Cioffi, Michael Fabris, Michael Ranieri, Mohammad Modarres, Petr Kadlecek, Rawal Khirodkar, Rinat Abdrashitov, Romain Prevost, Roman Rajbhandari, Ronald Mallet, Russel Pearsall, Sandy Kao, Sanjeev Kumar, Scott Parrish, ... Yu, Shunsuke Saito, Takaaki Shiratori, Te-Li Wang, Tony Tung, Yichen Xu, Yuan Dong, Yuhua Chen, Yuanlu Xu, Yuting Ye, and Zhongshi Jiang (2025)

  10. [10]

    Biomechanical Arm and Hand Tracking with Multiview Markerless Motion Capture

    Firouzabadi, Pouyan, Wendy Murray, Anton R Sobinov, J.D. Peiffer, Kunal Shah, Lee E Miller, and R. James Cotton (2024). “Biomechanical Arm and Hand Tracking with Multiview Markerless Motion Capture” . In: IEEE International Conference on Biomedical Robotics and Biomechatronics (BioRob)

  11. [11]

    MoVi: A Large Multi-Purpose Human Motion and Video Dataset

    Ghorbani, Saeed, Kimia Mahdaviani, Anne Thaler, Konrad Kording, Douglas James Cook, Gunnar Blohm, and Nikolaus F. Troje (2021). “MoVi: A Large Multi-Purpose Human Motion and Video Dataset” . In: PLOS ONE 16.6, e0253157

  12. [12]

    LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion

    Al-Hafez, Firas, Guoping Zhao, Jan Peters, and Davide Tateo (2023). LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion

  13. [13]

    Muscle contributions to propulsion and support during running

    Hamner, Samuel R., Ajay Seth, and Scott L. Delp (2010). “Muscle contributions to propulsion and support during running” . In: Journal of Biomechanics 43.14, pp. 2709–2716

  14. [14]

    A model of the upper extremity for simulating musculoskeletal surgery and analyzing neuromuscular control

    Holzbaur, Katherine R. S., Wendy M. Murray, and Scott L. Delp (2005). “A model of the upper extremity for simulating musculoskeletal surgery and analyzing neuromuscular control”. In: Annals of Biomedical Engineering 33.6, pp. 829–840

  15. [15]

    Sapiens: Foundation for Human Vision Models

    Khirodkar, Rawal, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, and Shunsuke Saito (2024). Sapiens: Foundation for Human Vision Models

  16. [16]

    Equinox: neural networks in JAX via callable PyTrees and filtered transformations

    Kidger, Patrick and Cristian Garcia (2021). “Equinox: neural networks in JAX via callable PyTrees and filtered transformations” . In: Differentiable Programming workshop at Neural Information Processing Systems 2021

  17. [17]

    BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos

    Koleini, Farnoosh, Muhammad Usama Saleem, Pu Wang, Hongfei Xue, Ahmed Helmy, and Abbey Fenwick (2025). BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos

  18. [18]

    SMPL: A Skinned Multi-Person Linear Model

    Loper, Matthew, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black (2015). “SMPL: A Skinned Multi-Person Linear Model” . In: ACM Transactions on Graphics (Proc. SIGGRAPH Asia)

  19. [19]

    A Musculoskeletal Model of the Hand and Wrist Capable of Simulating Functional Tasks

    McFarland, Daniel C., Benjamin I. Binder-Markey, Jennifer A. Nichols, Sarah J. Wohlman, Marije de Bruin, and Wendy M. Murray (2023). “A Musculoskeletal Model of the Hand and Wrist Capable of Simulating Functional Tasks”. In: IEEE Transactions on Biomedical Engineering 70.5, pp. 1424–1435

  20. [20]

    ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling

    Park, Jinhyung, Javier Romero, Shunsuke Saito, Fabian Prada, Takaaki Shiratori, Yichen Xu, Federica Bogo, Shoou-I. Yu, Kris Kitani, and Rawal Khirodkar (2025). ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling

  21. [21]

    Learning 3D Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats

    Peiffer, J. D., Kunal Shah, Irina Djuraskovic, Shawana Anarwala, Kayan Abdou, Rujvee Patel, Prakash Jayabalan, Brenton Pennicooke, and R. James Cotton (2025). Portable biomechanics laboratory: Clinically accessible movement analysis from a handheld smartphone. Sárándi, István, Alexander Hermans, and Bastian Leibe (2023). “Learning 3D Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats”

  22. [22]

    DINOv3

    Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, and Piotr Bojanowski (2025). DINOv3

  23. [23]

    MuJoCo: A Physics Engine for Model-Based Control

    Todorov, Emanuel, Tom Erez, and Yuval Tassa (2012). “MuJoCo: A Physics Engine for Model-Based Control” . In: IEEE/RSJ International Conference on Intelligent Robots and Systems

  24. [24]

    Differentiable biomechanics for markerless motion capture in upper limb stroke rehabilitation: a comparison with optical motion capture

    Unger, Tim, Arash Sal Moslehian, J.D. Peiffer, Johann Ullrich, Roger Gassert, Olivier Lambercy, R. James Cotton, and Chris Awai Easthope (2025). “Differentiable biomechanics for markerless motion capture in upper limb stroke rehabilitation: a comparison with optical motion capture” . In: IEEE Transactions on Medical Robotics and Bionics , pp. 1–1

  25. [25]

    Reconstructing Hand-Held Objects in 3D

    Wu, Jane, Georgios Pavlakos, Georgia Gkioxari, and Jitendra Malik (2024). Reconstructing Hand-Held Objects in 3D

  26. [26]

    Reconstructing Humans with a Biomechanically Accurate Skeleton

    Xia, Yan, Xiaowei Zhou, Etienne Vouga, Qixing Huang, and Georgios Pavlakos (2025). “Reconstructing Humans with a Biomechanically Accurate Skeleton” . In: CVPR

  27. [27]

    SAM 3D Body: Robust Full-Body Human Mesh Recovery

    Liu, Nicolas Ugrinovic, Matt Feiszli, Jitendra Malik, Piotr Dollar, and Kris Kitani (2025). SAM 3D Body: Robust Full-Body Human Mesh Recovery