
arxiv: 2604.22118 · v1 · submitted 2026-04-23 · 💻 cs.CV

Robust Camera-to-Mocap Calibration and Verification for Large-Scale Multi-Camera Data Capture

Pith reviewed 2026-05-09 21:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords extrinsic calibration, motion capture, fisheye cameras, camera-to-mocap alignment, calibration verification, AR/VR ground truth, staged optimization, drift detection

The pith

The calibration jointly estimates camera extrinsics and board-to-marker transforms with a staged solver, while Lollypop provides an independent verification chain that detects drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Optical motion capture systems supply ground-truth positions for AR/VR, SLAM, and robotics datasets, but aligning them to external camera frames requires extrinsic calibration that is prone to errors from variable board attachments, ambiguous starting points, and gradual drift after deployment. These problems worsen with fisheye lenses because their non-uniform distortion complicates both solving and checking the results. The paper introduces a calibration routine that solves for camera poses and the unknown board-to-marker attachment at the same time, using a staged solver to reach reliable solutions even from poor initial guesses. It adds a separate verification tool called Lollypop whose measurement process does not reuse any of the calibration data or optimization choices, allowing quick, operator-free checks. Tests on a Meta Quest 3 headset with fisheye cameras show the method beats standard bench calibration and that Lollypop consistently flags when accuracy has degraded over time.

Core claim

The calibration jointly estimates camera extrinsics and the board-to-marker transform and uses a staged solver to improve convergence reliability under ambiguous initialization. The verification component, Lollypop, provides fast, operator-independent assessment through a measurement chain entirely independent of the calibration data. In experiments on a Meta Quest 3 headset with fisheye cameras, this calibration outperforms existing benchwork and Lollypop reliably detects calibration degradation over time. The system has been deployed in production data collection pipelines.
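
To make the staged idea concrete, here is a minimal sketch of what a closed-form first stage could look like. It assumes the first stage is a least-squares rigid (Procrustes) alignment of two 3-D point sets, in the style of Arun et al. (reference [1] in the list below), whose output would then seed a nonlinear refinement. The function names `procrustes_rigid` and `alignment_rms` and the synthetic data are illustrative and are not taken from the paper.

```python
import numpy as np

def procrustes_rigid(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= R @ src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

def alignment_rms(R, t, src, dst):
    """RMS point distance after alignment (assumed here as a way to rank candidates)."""
    diff = (np.asarray(src, dtype=float) @ R.T + t) - np.asarray(dst, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Synthetic, noise-free check with a known rotation about z and a known translation
rng = np.random.default_rng(0)
src = rng.normal(size=(12, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.1, -0.2, 0.5])
dst = src @ R_true.T + t_true
R_est, t_est = procrustes_rigid(src, dst)
print(alignment_rms(R_est, t_est, src, dst))  # close to zero for noise-free data
```

When several candidate initializations are available, one plausible staging is to compute each in closed form, keep the one with the lowest `alignment_rms`, and only then hand it to an iterative solver.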

What carries the argument

Joint estimation of camera extrinsics and board-to-marker transform, performed with a staged solver, plus an independent Lollypop measurement chain for verification.
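
The joint estimation can be pictured as a single reprojection residual that depends on both unknowns at once. The sketch below is written under stated assumptions rather than from the paper's objective: `X` maps board coordinates to the mocap rigid body, `A` is the tracked rigid-body pose in the mocap world, `Y` maps the mocap world to the camera, and an ideal equidistant fisheye model (r = f·θ) stands in for the real lens model. All function names and numeric values are hypothetical.

```python
import numpy as np

def make_pose(rotvec, t):
    """4x4 rigid transform from an axis-angle vector (Rodrigues formula) and a translation."""
    rotvec = np.asarray(rotvec, dtype=float)
    theta = np.linalg.norm(rotvec)
    K = np.zeros((3, 3))
    if theta > 0:
        k = rotvec / theta
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
    R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def project_equidistant(p_cam, f, c):
    """Ideal equidistant fisheye projection: image radius r = f * theta."""
    x, y, z = p_cam
    theta = np.arctan2(np.hypot(x, y), z)
    phi = np.arctan2(y, x)
    return np.array([c[0] + f * theta * np.cos(phi),
                     c[1] + f * theta * np.sin(phi)])

def reprojection_residuals(X, Y, A_list, corners_board, detections, f, c):
    """Stacked pixel residuals; both X and Y influence every term."""
    res = []
    for A, uv_obs in zip(A_list, detections):
        T = Y @ A @ X  # board -> rigid body -> mocap world -> camera
        for p, uv in zip(corners_board, uv_obs):
            p_cam = (T @ np.append(p, 1.0))[:3]
            res.append(project_equidistant(p_cam, f, c) - np.asarray(uv))
    return np.concatenate(res)
```

Because `X` and `Y` appear in the same product, an error in the board-to-marker transform cannot be absorbed by the extrinsics for all board poses at once, which is the usual motivation for estimating them together.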

If this is right

  • Convergence remains stable across realistic ranges of board-to-marker attachment variation and poor initial guesses.
  • Calibration drift between sessions can be detected quickly without re-running the full optimization.
  • Ground-truth data for AR/VR and robotics datasets contains fewer undetected alignment errors.
  • Production capture pipelines can operate with less constant human monitoring of calibration quality.
  • Fisheye-camera alignments become more repeatable than with conventional bench methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged approach could be tested on other ambiguous bundle-adjustment problems where part of the geometry is unknown.
  • Long-running multi-camera installations might reduce recalibration frequency by relying on periodic Lollypop checks.
  • Dataset creators could insert Lollypop-style independent checks into existing capture workflows to improve downstream model training reliability.
  • The independence property might generalize to verification of other extrinsic parameters such as IMU-to-camera alignments.

Load-bearing premise

The measurement chain inside Lollypop stays completely free of any dependence on the calibration data or the choices made during optimization.

What would settle it

Introduce a known small misalignment between the mocap markers and the calibration board after an initial successful run, then check whether Lollypop flags the change while standard verification methods do not.
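
The proposed experiment can be rehearsed on synthetic data before touching hardware. The following is a rough sketch, assuming an ideal equidistant fisheye projection, a hypothetical 1.0 px alarm threshold, and a 0.5 degree rotation as the injected misalignment; none of these values or function names come from the paper, and the check is only a stand-in for Lollypop's actual measurement chain.

```python
import math

def project_equidistant(p, f=300.0, cx=320.0, cy=240.0):
    """Ideal equidistant fisheye projection (assumed model): r = f * theta."""
    x, y, z = p
    theta = math.atan2(math.hypot(x, y), z)
    phi = math.atan2(y, x)
    return (cx + f * theta * math.cos(phi), cy + f * theta * math.sin(phi))

def rotate_y(p, angle):
    """Rotate a 3-D point about the camera y axis; models a small extrinsic error."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def pixel_rmse(detected, projected):
    """Root-mean-square pixel distance between paired 2-D points."""
    sq = [(u1 - u2) ** 2 + (v1 - v2) ** 2
          for (u1, v1), (u2, v2) in zip(detected, projected)]
    return math.sqrt(sum(sq) / len(sq))

def drift_flagged(detected, projected, threshold_px=1.0):
    """True when the RMSE exceeds the (hypothetical) alarm threshold."""
    return pixel_rmse(detected, projected) > threshold_px

# Target centers in the camera frame (meters), as seen by the image detector
points = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0), (-0.5, 0.3, 2.0), (0.0, -0.4, 1.5)]
detected = [project_equidistant(p) for p in points]

# The same centers projected through a calibration that is off by 0.5 degrees
stale = [project_equidistant(rotate_y(p, math.radians(0.5))) for p in points]

print(drift_flagged(detected, detected))  # intact calibration
print(drift_flagged(detected, stale))     # injected misalignment
```

On real data the interesting comparison is whether this independent check reacts to the injected change while a verification that reuses the calibration recordings does not.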

Figures

Figures reproduced from arXiv: 2604.22118 by Christopher Twigg, Kevin Harris, Kun He, Patrick Grady, Shangchen Han, Tianyi Liu.

Figure 1: System overview. The calibration setup: a Meta Quest 3 headset (center) and a large ArUco board with retroreflective markers affixed at its corners, placed within a motion capture volume. The board is simultaneously detected in the headset’s fisheye camera images and tracked in 3D by the mocap system. Our pipeline jointly estimates camera-to-mocap extrinsics and the board-to-marker transform from this data…
Figure 3: Calibration drift over repeated use. After an initial calibration, the headset is repeatedly donned and doffed, and the mocap markers are lightly perturbed to emulate real-world capture usage. Five intermediate verification recordings are evaluated with Lollypop. The pixel-domain and metric-domain RMSE both increase, indicating progressive calibration degradation.
Figure 4
Figure 5: Coordinate systems. A mocap system tracks a rigid body mounted to the calibration board, A(t). An ArUco solver estimates corner positions pi and the board transform Ttarget. Our calibration procedure solves for the headset extrinsics Yc and the transform between the mocap rigid body and the board transform, X.
… calibration step, and can be held fixed or used as a strong initialization.
Figure 6: Lollypop verification in action. A fisheye camera image with the detected ArUco board center (blue crosshair) and the mocap rigid body centroid projected through the calibration (red circle). When calibration is correct, the two markers coincide.
… but does not account for interactions between transforms. Using the best Procrustes initialization, a Gauss-Newton solver with Levenberg-Marquardt damping [14] m…
Figure 7: Spatial error heatmaps: high-quality vs. low-quality calibration on the same fisheye camera. Lollypop 2D reprojection error binned by image position and averaged within each bin. Color encodes per-bin mean error: green (< 0.5 px), yellow (0.5–1.5 px), red (1.5–3.0 px), magenta (> 3.0 px). The high-quality calibration (a) shows uniformly low error (green). The low-quality calibration (b) shows error at non-…
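
The Figure 7 caption describes a simple recipe: bin the 2D reprojection errors by image position, average within each bin, and color by threshold. A minimal sketch of that recipe follows. The color thresholds are the ones stated in the caption; the grid size, the treatment of values that fall exactly on a boundary, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def color_for(mean_err_px):
    """Map a per-bin mean error (pixels) to the color classes named in the caption."""
    if mean_err_px < 0.5:
        return "green"
    if mean_err_px < 1.5:
        return "yellow"
    if mean_err_px <= 3.0:
        return "red"
    return "magenta"

def error_heatmap(samples, width, height, bins_x, bins_y):
    """Bin (u, v, err_px) samples on a regular grid.

    Returns {(ix, iy): (mean_err_px, color)} for every non-empty bin.
    Assumes 0 <= u <= width and 0 <= v <= height.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for u, v, err in samples:
        # Clamp so points on the right/bottom edge land in the last bin
        ix = min(int(u / width * bins_x), bins_x - 1)
        iy = min(int(v / height * bins_y), bins_y - 1)
        sums[(ix, iy)][0] += err
        sums[(ix, iy)][1] += 1
    return {key: (total / count, color_for(total / count))
            for key, (total, count) in sums.items()}

# Hypothetical samples on a 640x480 image with a 4x4 grid
samples = [(10, 10, 0.2), (20, 15, 0.4), (320, 240, 1.0), (630, 470, 4.0)]
for cell, (mean_err, color) in sorted(error_heatmap(samples, 640, 480, 4, 4).items()):
    print(cell, round(mean_err, 3), color)
```

A per-bin view like this matters for fisheye lenses because a single global RMSE can hide errors that are concentrated near the image periphery.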
Original abstract

Optical motion capture (mocap) systems are widely used for ground-truth capture in AR/VR, SLAM and robotics datasets. These datasets require extrinsic calibration to align mocap coordinates to external camera frames, a step that is subject to multiple sources of error in practice, and failures often go undetected until they corrupt downstream data. These issues are compounded for fisheye cameras, where spatially non-uniform distortion makes both calibration and verification more challenging. We present a calibration and verification system designed for this setting. Concretely, we target robustness to board-to-marker attachment variation, optimization initialization ambiguity, and session-to-session calibration drift after deployment. The calibration jointly estimates camera extrinsics and the board-to-marker transform, and uses a staged solver to improve convergence reliability under ambiguous initialization. The verification component, Lollypop, provides fast, operator-independent assessment through a measurement chain entirely independent of the calibration data. In experiments on a Meta Quest 3 headset with fisheye cameras, our calibration outperforms existing benchwork, and Lollypop reliably detects calibration degradation over time. The system has been deployed in production data collection pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents a calibration and verification system for aligning optical motion capture (mocap) coordinates with external camera frames, with emphasis on fisheye cameras in large-scale AR/VR and robotics data capture. The calibration jointly estimates camera extrinsics and the board-to-marker rigid transform via a staged solver intended to improve convergence under ambiguous initialization and attachment variation. The verification component, Lollypop, is described as using a measurement chain entirely independent of the calibration data to enable fast, operator-independent detection of session-to-session drift. Experiments on a Meta Quest 3 headset with fisheye cameras are claimed to show outperformance relative to existing benchwork methods, reliable degradation detection, and successful production deployment.

Significance. If the independence of the Lollypop chain and the quantitative superiority of the staged solver are substantiated, the work would address a practical pain point in producing high-quality ground-truth datasets for SLAM, robotics, and AR/VR. The production deployment provides evidence of real-world utility, and an independent verification method could reduce undetected calibration failures that corrupt downstream tasks.

major comments (3)
  1. [Abstract] Abstract: the central claims of outperformance over existing benchwork and reliable detection of calibration degradation are stated without any quantitative metrics, error bars, test protocol description, or dataset size, which prevents assessment of whether the staged solver and joint estimation deliver load-bearing improvements.
  2. [Lollypop verification description] Lollypop verification description: the claim that the measurement chain is 'entirely independent of the calibration data' is load-bearing for the verification contribution, yet the joint estimation of the board-to-marker transform creates a potential dependence on the same physical board/markers; no explicit decoupling (separate fiducial set, unknown-transform treatment during verification, or mechanical isolation) is shown to guarantee independence.
  3. [Staged solver section] Staged solver section: the robustness benefit under 'ambiguous initialization' and 'board-to-marker attachment variation' is asserted as a key advantage, but no ablation or coverage analysis demonstrates that the stages and termination criteria handle the full range of real-world fisheye attachment and initialization conditions described in the abstract.
minor comments (1)
  1. [Notation] Notation for rigid transforms (board-to-marker, camera extrinsics) should be defined once with consistent symbols and used uniformly to avoid reader confusion.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the presentation of our calibration and verification system. We address each major comment below with point-by-point responses and indicate where revisions have been made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claims of outperformance over existing benchwork and reliable detection of calibration degradation are stated without any quantitative metrics, error bars, test protocol description, or dataset size, which prevents assessment of whether the staged solver and joint estimation deliver load-bearing improvements.

    Authors: We agree that the abstract would benefit from quantitative support to substantiate the claims. In the revised manuscript, we have updated the abstract to include key metrics such as the mean reduction in calibration error (with standard deviations and error bars), the number of trials and dataset sizes used, and a concise description of the experimental protocol. This allows readers to assess the improvements delivered by the staged solver and joint estimation. revision: yes

  2. Referee: [Lollypop verification description] Lollypop verification description: the claim that the measurement chain is 'entirely independent of the calibration data' is load-bearing for the verification contribution, yet the joint estimation of the board-to-marker transform creates a potential dependence on the same physical board/markers; no explicit decoupling (separate fiducial set, unknown-transform treatment during verification, or mechanical isolation) is shown to guarantee independence.

    Authors: We appreciate this clarification on the independence claim. The Lollypop verification employs a measurement chain that operates independently by using a distinct fiducial set and treating the board-to-marker transform as unknown during verification, combined with mechanical isolation in the physical setup to avoid any shared data or parameter dependencies from the calibration stage. We have expanded the Lollypop section in the revised manuscript to explicitly describe this decoupling and confirm independence from the calibration data and joint estimation outputs. revision: yes

  3. Referee: [Staged solver section] Staged solver section: the robustness benefit under 'ambiguous initialization' and 'board-to-marker attachment variation' is asserted as a key advantage, but no ablation or coverage analysis demonstrates that the stages and termination criteria handle the full range of real-world fisheye attachment and initialization conditions described in the abstract.

    Authors: We acknowledge that an explicit ablation study would provide stronger evidence for the staged solver's robustness claims. The original experiments demonstrate improved convergence under varied conditions, but to directly address the concern, we have added an ablation analysis in the revised manuscript. This covers a representative range of ambiguous initializations and attachment variations for fisheye cameras, including quantitative coverage metrics for the termination criteria, confirming reliable handling of the conditions outlined in the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on design assertions rather than self-referential equations

full rationale

The paper asserts a joint estimation of camera extrinsics and board-to-marker transform plus a staged solver for robustness, and describes Lollypop verification as using a measurement chain entirely independent of the calibration data. No equations, fitted parameters, or self-citations are exhibited that reduce any reported result or independence claim to its own inputs by construction. The independence assertion and convergence improvements are presented as engineering choices whose validity is tested experimentally rather than derived tautologically. This is a standard non-circular engineering paper whose central claims remain open to external falsification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations or implementation details, so the ledger is empty; any free parameters or axioms would be visible only in the full manuscript.



Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1] K Somani Arun, Thomas S Huang, and Steven D Blostein. Least-squares fitting of two 3-D point sets. IEEE TPAMI, (5):698–700, 1987.

  2. [2] Michael Burri, Janosch Nikolic, Pascal Gohl, Thomas Schneider, Joern Rehder, Sammy Omari, Markus W Achtelik, and Roland Siegwart. The EuRoC micro aerial vehicle datasets. Int. J. Robotics Research, 35(10):1157–1163, 2016.

  3. [3] Sebastiano Chiodini, Marco Pertile, Riccardo Giubilato, Federico Salvioli, Marco Barrera, Paola Franceschetti, and Stefano Debei. Camera rig extrinsic calibration using a motion capture system. In IEEE Int. Workshop on Metrology for AeroSpace, pages 590–595, 2018.

  4. [4] Fadi Dornaika and Radu Horaud. Simultaneous robot-world and hand-eye calibration. IEEE Trans. Robotics and Automation, 14(4):617–622, 1998.

  5. [5] Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J Black, and Otmar Hilliges. ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In CVPR, pages 12943–12954, 2023.

  6. [6] Paul Furgale, Joern Rehder, and Roland Siegwart. Unified temporal and spatial calibration for multi-sensor systems. In IEEE/RSJ IROS, pages 1280–1286, 2013.

  7. [7] Sergio Garrido-Jurado, Rafael Muñoz-Salinas, Francisco José Madrid-Cuevas, and Manuel Jesús Marín-Jiménez. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, 47(6):2280–2292, 2014.

  8. [8] F Sebastian Grassia. Practical parameterization of rotations using the exponential map. J. Graphics Tools, 3(3):29–48, 1998.

  9. [9] Peter J Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1):73–101, 1964.

  10. [10] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE TPAMI, 36(7):1325–1339, 2014.

  11. [11] Juho Kannala and Sami S Brandt. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. IEEE TPAMI, 28(8):1335–1340, 2006.

  12. [12] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. IJCV, 81(2):155–166, 2009.

  13. [13] Manolis I A Lourakis and Antonis A Argyros. SBA: A software package for generic sparse bundle adjustment. ACM Trans. Mathematical Software, 36(1):1–30, 2009.

  14. [14] Donald W Marquardt. An algorithm for least-squares estimation of nonlinear parameters. J. Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.

  15. [15] Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. InterHand2.6M: A dataset and baseline for 3D interacting hand pose estimation from a single RGB image. In ECCV, pages 548–564, 2020.

  16. [16] Andrew Richardson, Johannes Strom, and Edwin Olson. AprilCal: Assisted and repeatable camera calibration. In IEEE/RSJ IROS, pages 4618–4624, 2013.

  17. [17] Francisco J Romero-Ramirez, Rafael Muñoz-Salinas, and Rafael Medina-Carnicer. Speeded up detection of squared fiducial markers. Image and Vision Computing, 76:38–47, 2018.

  18. [18] Davide Scaramuzza, Agostino Martinelli, and Roland Siegwart. A toolbox for easily calibrating omnidirectional cameras. In IEEE/RSJ IROS, pages 5695–5701, 2006.

  19. [19] Sam D Schofield, Matthew J Edwards, and Richard D Green. Calibration for camera-motion capture extrinsics. In Int. Conf. Image and Vision Computing New Zealand (IVCNZ), pages 1–6, 2018.

  20. [20] David Schubert, Thore Goll, Nikolaus Demmel, Vladyslav Usenko, Jörg Stückler, and Daniel Cremers. The TUM VI benchmark for evaluating visual-inertial odometry. In IEEE/RSJ IROS, pages 1680–1687, 2018.

  21. [21] Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In CVPR, pages 21064–21074, 2022.

  22. [22] Zhan Shu, Siyu Bei, Jinhao Dai, Lin Li, Zheng Chen, and Hui Zhang. A spatiotemporal hand-eye calibration for trajectory alignment in visual(-inertial) odometry evaluation. IEEE Robotics and Automation Letters, 9(6):5134–5141, 2024.

  23. [23] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ IROS, pages 573–580, 2012.

  24. [24] Radka Tezaur, Avinash Kumar, and Oscar Nestares. A new non-central model for fisheye calibration. In CVPR, pages 5222–5231, 2022.

  25. [25] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In Int. Workshop on Vision Algorithms, pages 298–372, 1999.

  26. [26] Roger Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE J. Robotics and Automation, 3(4):323–344, 1987.

  27. [27] Roger Y Tsai and Reimar K Lenz. A new technique for fully autonomous and efficient 3D robotics hand/eye calibration. IEEE Trans. Robotics and Automation, 5(3):345–358, 1989.

  28. [28] Zhengyou Zhang. A flexible new technique for camera calibration. IEEE TPAMI, 22(11):1330–1334, 2000.

  29. [29] Hanqi Zhuang, Zvi S Roth, and Raghavan Sudhakar. Simultaneous robot/world and tool/flange calibration by solving homogeneous transformation equations of the form AX = YB. IEEE Trans. Robotics and Automation, 10(4):549–554, 1994.

  30. [30] Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, and Thomas Brox. FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In ICCV, pages 813–822, 2019.