Towards an End-to-End System for 3D Tracking of Physical Objects in Virtual Immersive Environments

Barbara Karpowicz; Maciej Grzeszczuk; Pavlo Zinevych; Stanisław Knapiński; Wieslaw Kopec

arxiv: 2605.02901 · v1 · submitted 2026-03-30 · 💻 cs.HC · cs.CV

Pith reviewed 2026-05-14 22:13 UTC · model grok-4.3

classification 💻 cs.HC cs.CV

keywords 3D object tracking, fiducial markers, virtual reality, immersive environments, ArUco, AprilTag, real-to-virtual mapping, XR training

The pith

A fiducial marker system with software harness enables plug-and-play 3D tracking of physical objects in VR.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds an end-to-end pipeline that detects small physical objects via markers and streams their positions into virtual environments for training. It combines existing marker detectors with a simple designation tool and data streaming layer so developers avoid writing custom tracking code. The work tests how tag size, viewing distance, and camera choice affect detection reliability against theoretical limits. This produces a ready-to-use mapping from real-world coordinates to VR space that runs without specialized hardware.

Core claim

By integrating ArUco, AprilTag, and Colored Control Points markers with a software harness for quick object assignment and position streaming, the system delivers real-time real-to-virtual object mapping that works across different cameras and distances while remaining simple to deploy for VR and XR training scenarios.

What carries the argument

Fiducial marker detection (ArUco, AprilTag, Colored Control Points) paired with a software harness that designates objects and streams 3D position data to end applications.
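The streaming half of that harness can be illustrated with a minimal sketch. The page does not describe the wire format the authors use, so everything here is an assumption: the `PoseMessage` layout, the JSON encoding, and the choice of UDP are hypothetical and only show what "stream 3D position data to end applications" could look like in practice.

```python
import json
import socket
from dataclasses import dataclass, asdict


@dataclass
class PoseMessage:
    """One tracked-object update (hypothetical layout, not the paper's format)."""
    object_name: str    # label given to the object in the designation step
    marker_family: str  # e.g. "aruco", "apriltag", "colored_points"
    marker_id: int
    position_m: tuple   # (x, y, z) in metres, camera frame


def encode(msg: PoseMessage) -> bytes:
    """Serialize a pose update as one UTF-8 JSON datagram."""
    return json.dumps(asdict(msg)).encode("utf-8")


def decode(payload: bytes) -> PoseMessage:
    """Inverse of encode(); JSON arrays come back as lists, so restore the tuple."""
    data = json.loads(payload.decode("utf-8"))
    data["position_m"] = tuple(data["position_m"])
    return PoseMessage(**data)


def send_pose(sock: socket.socket, addr: tuple, msg: PoseMessage) -> None:
    """Fire-and-forget send; a lost datagram is superseded by the next frame."""
    sock.sendto(encode(msg), addr)


def demo_roundtrip() -> PoseMessage:
    """Send one update over loopback UDP and read it back."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    receiver.settimeout(2.0)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        msg = PoseMessage("wrench", "apriltag", 7, (0.12, -0.05, 0.80))
        send_pose(sender, receiver.getsockname(), msg)
        payload, _ = receiver.recvfrom(4096)
        return decode(payload)
    finally:
        sender.close()
        receiver.close()
```

A receiver in the VR application would run the `decode` side each frame and move the matching virtual object. UDP is chosen in this sketch because a stale position is worthless once a newer one exists; a real deployment might prefer a different transport.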

If this is right

Training applications can map small physical tools or props into VR without building tracking infrastructure from scratch.
Multiple marker types give flexibility to choose the best option for a given object size or environment.
Data streaming works directly with standard VR frameworks so position updates reach the virtual scene in real time.
Evaluations of tag size and camera models let users select hardware that stays inside reliable detection ranges.
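The relationship between tag size, camera, and reliable range in the last point can be approximated with an ideal pinhole model. This is a sketch under stated assumptions, not the theoretical limit the paper uses (which this page does not reproduce): the tag faces the camera squarely, lens distortion is ignored, and the detector is assumed to need some minimum number of pixels per tag side. The function names and the `min_tag_px` threshold are hypothetical.

```python
import math


def focal_length_px(image_width_px: int, hfov_deg: float) -> float:
    """Focal length in pixels from the horizontal field of view (ideal pinhole)."""
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)


def projected_size_px(tag_size_m: float, distance_m: float, f_px: float) -> float:
    """Side length, in pixels, of a fronto-parallel tag at the given distance."""
    return f_px * tag_size_m / distance_m


def max_distance_m(tag_size_m: float, f_px: float, min_tag_px: float) -> float:
    """Distance at which the tag shrinks to min_tag_px pixels per side."""
    return f_px * tag_size_m / min_tag_px


if __name__ == "__main__":
    # Assumed camera: 1920 px wide, 90 degree horizontal field of view.
    f_px = focal_length_px(1920, 90.0)
    # Assumed threshold: a 6x6-cell marker at 4 px per cell needs 24 px per side.
    for size_m in (0.02, 0.05, 0.10):
        print(f"{size_m * 100:.0f} cm tag: {max_distance_m(size_m, f_px, 24.0):.2f} m")
```

Under this model the maximum distance scales linearly with tag size and with focal length in pixels, so doubling the sensor resolution at a fixed field of view doubles the range. Measured ranges will typically fall short of this bound because of blur, noise, and oblique viewing angles.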

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same marker harness could support hybrid physical-digital workflows where users manipulate real controls that affect virtual simulations.
Extending the system to handle partial occlusions or faster motion would increase its usefulness for dynamic training tasks.
Because the solution avoids proprietary hardware it lowers barriers for smaller teams to create custom VR object interactions.

Load-bearing premise

Fiducial markers can be detected reliably enough by ordinary cameras to deliver accurate 3D positions in a plug-and-play way without custom hardware or manual coding.

What would settle it

A demonstration that the system loses track or produces large position errors when objects move beyond the tested distances or under lighting that still allows human visibility would disprove reliable plug-and-play performance.

Figures

Figures reproduced from arXiv: 2605.02901 by Barbara Karpowicz, Maciej Grzeszczuk, Pavlo Zinevych, Stanisław Knapiński, Wieslaw Kopec.

**Figure 2.** Example workflow with the object detection software. Source: own elab [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Tracking Configuration UI. Source: own elaboration [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 5.** A square marker detected by means of colored points. Source: own elaboration.

3.2 The Colored Points Algorithm: To address the limitations of standard binary markers in very low-latency or low-resolution conditions, we developed the Colored Points method. Unlike AruCo or AprilTag, which rely on high-contrast binary edges, our method utilizes distinct chromatic "islands" to define marker geometry. This allow…

**Figure 6.** Block diagram of the Colored Points algorithm, illustrating the single-pass [PITH_FULL_IMAGE:figures/full_fig_p006_6.png]

**Figure 7.** AprilTags detected as Unity objects. Source: own elaboration. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png]

**Figure 8.** Timeline of detection rate. Note: Colored Points on Generic Webcam not [PITH_FULL_IMAGE:figures/full_fig_p008_8.png]

**Figure 9.** Average detection rate. Source: own elaboration. [PITH_FULL_IMAGE:figures/full_fig_p009_9.png]

**Figure 10.** Maximum distance from camera for various marker sizes. Source: own [PITH_FULL_IMAGE:figures/full_fig_p009_10.png]
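The Figure 5 and Figure 6 captions describe chromatic "islands" that define marker geometry and a single-pass design, but the algorithm itself is not reproduced on this page. The following is therefore only an illustrative sketch of the general idea, not the authors' method: it classifies each pixel against a set of reference colors and accumulates one centroid per color in a single pass over the image. The reference colors, the distance threshold, and all function names are hypothetical, and the sketch assumes at most one island per color.

```python
from typing import Dict, List, Optional, Sequence, Tuple

RGB = Tuple[int, int, int]

# Hypothetical reference colors, one per corner island.
REFERENCE_COLORS: Dict[str, RGB] = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
}


def classify(pixel: RGB, max_dist_sq: int = 60 ** 2) -> Optional[str]:
    """Return the nearest reference color name, or None if none is close enough."""
    best_name, best_d = None, max_dist_sq + 1
    for name, ref in REFERENCE_COLORS.items():
        d = sum((p - r) ** 2 for p, r in zip(pixel, ref))
        if d < best_d:
            best_name, best_d = name, d
    return best_name if best_d <= max_dist_sq else None


def island_centroids(image: Sequence[Sequence[RGB]]) -> Dict[str, Tuple[float, float]]:
    """One pass over the image, accumulating sum_x, sum_y, and count per color."""
    acc: Dict[str, List[float]] = {name: [0.0, 0.0, 0.0] for name in REFERENCE_COLORS}
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            name = classify(pixel)
            if name is not None:
                sums = acc[name]
                sums[0] += x
                sums[1] += y
                sums[2] += 1
    # Colors that never appeared are omitted from the result.
    return {n: (sx / c, sy / c) for n, (sx, sy, c) in acc.items() if c > 0}
```

With four centroids of known colors, the marker corners and their order are identified without decoding a binary pattern, and those image points could then feed a standard planar pose estimator. A practical implementation would also need to separate several same-colored blobs and cope with lighting-dependent color shifts, neither of which this sketch attempts.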

Original abstract

This work aims to establish an end-to-end system for tracking of physical 3D objects for virtual reality (VR) applications. We focus on training applications requiring real-time tracking of the position of small physical objects and their reflection in VR space. Out goal is to perform object tracking in a "plug and play" manner, without using complex systems with quite large tracking devices or manually implementing object tracking. We therefore propose a system for object tracking via fiducial markers alongside a software harness, to enable fast and efficient designation of objects to be tracked and data streaming solution for end-use applications. The system utilizes AruCo, AprilTag and an original Colored Control Points based fiducial system. It allows for easy tag detection and use of object position data, which are crucial for immersive training environments based on VR and eXtended Reality (XR). We evaluate various tag sizes, detection distances, and different camera devices against the theoretical limits. In effect, we create a complete solution for implementing marker-based, real-to-virtual object position mapping for various applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds a new Colored Control Points fiducial marker for VR object tracking but the evaluation stays thin and the plug-and-play claim lacks supporting data on setup effort.

The main thing to know is that the authors introduce an original Colored Control Points fiducial system and combine it with ArUco and AprilTag markers plus a software harness to track small physical objects and map them into VR space for training applications. They aim for something that works without large hardware or custom coding for each object. That combination and the new marker type are the concrete additions beyond existing marker literature. The evaluation tests tag sizes, distances, and camera types against theoretical detection limits, which at least gives readers some practical starting points for choosing hardware.

The stress-test note is fair: the paper does not measure or report how many manual steps remain for calibration, marker-to-object mapping, or streaming setup, so the plug-and-play guarantee rests on an untested assumption about the harness. Without those numbers or a clear breakdown of user effort, it is hard to tell whether the system truly reduces implementation work as claimed. The abstract mentions results but supplies no error rates, success percentages, or variance, which leaves the performance claims hard to assess.

This paper is for engineers or researchers who build VR training setups and want a marker-based option they can try out. A reader already working with fiducials might pick up the new marker idea and the harness description as a useful reference, even if they end up re-implementing parts. It deserves peer review because the new marker is original and the application focus is concrete; referees could reasonably ask for the missing quantitative details and setup measurements without dismissing the work outright.

Referee Report

2 major / 1 minor

Summary. The paper presents an end-to-end system for 3D tracking of physical objects in VR/XR training applications. It combines three fiducial marker families (ArUco, AprilTag, and a novel Colored Control Points system) with a software harness that designates objects and streams position data, claiming a plug-and-play solution that avoids complex hardware and manual implementation of tracking. The work evaluates tag sizes, detection distances, and camera devices against theoretical limits and asserts that the resulting pipeline enables straightforward real-to-virtual object mapping.

Significance. A validated plug-and-play harness that eliminates manual calibration and integration steps across multiple marker families would lower the barrier for embedding physical props in immersive training environments. The introduction of the Colored Control Points system could add a lightweight alternative if its performance and implementation details are shown to be competitive. However, the current evaluation focuses narrowly on detection rates and does not quantify setup effort, limiting the strength of the central claim.

major comments (2)

[Evaluation] Evaluation section: the manuscript states that it evaluates tag sizes, distances, and cameras against theoretical limits, yet supplies no quantitative detection rates, error statistics, or direct comparison to the cited theoretical bounds. This omission prevents verification of the performance claims that underpin the end-to-end system.
[Abstract and System Overview] System description and abstract: the central claim of a 'plug-and-play' solution 'without manually implementing object tracking' requires evidence that the software harness automates camera intrinsics, marker-to-object mapping, ID assignment, and VR streaming setup. No measurements of manual steps, calibration time, or integration effort are reported, leaving the strongest claim untested.
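As an illustration of the kind of statistic the first major comment asks for, a detection rate can be reported together with an uncertainty interval rather than as a bare percentage. The sketch below uses the standard Wilson score interval for a binomial proportion. It is not taken from the paper; the function names are hypothetical, and treating video frames as independent trials is a simplifying assumption, since consecutive frames are usually correlated.

```python
import math
from typing import Sequence, Tuple


def detection_rate(detected: Sequence[bool]) -> float:
    """Fraction of frames in which the marker was detected."""
    if not detected:
        raise ValueError("need at least one frame")
    return sum(detected) / len(detected)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    frames = [True] * 90 + [False] * 10  # made-up example data, not paper results
    low, high = wilson_interval(sum(frames), len(frames))
    print(f"rate={detection_rate(frames):.2f}, 95% CI=({low:.3f}, {high:.3f})")
```

Reporting the interval per marker family, tag size, distance, and camera would make the comparison against theoretical limits checkable by a reader.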

minor comments (1)

[Abstract] Abstract: 'Out goal' is a typographical error and should read 'Our goal'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the quantitative support of our claims and the validation of the plug-and-play aspects. We address each major comment below and will incorporate revisions to improve the paper.

Referee: [Evaluation] Evaluation section: the manuscript states that it evaluates tag sizes, distances, and cameras against theoretical limits, yet supplies no quantitative detection rates, error statistics, or direct comparison to the cited theoretical bounds. This omission prevents verification of the performance claims that underpin the end-to-end system.

Authors: We acknowledge the need for explicit quantitative data to support the evaluation claims. While the manuscript describes experiments on tag sizes, detection distances, and camera devices compared to theoretical limits, we did not include detailed tables or statistics such as detection rate percentages, position error metrics, or direct numerical comparisons. In the revised version, we will add these quantitative results from our experiments to enable verification of the performance claims.

revision: yes

Referee: [Abstract and System Overview] System description and abstract: the central claim of a 'plug-and-play' solution 'without manually implementing object tracking' requires evidence that the software harness automates camera intrinsics, marker-to-object mapping, ID assignment, and VR streaming setup. No measurements of manual steps, calibration time, or integration effort are reported, leaving the strongest claim untested.

Authors: The software harness is designed to automate key steps including camera intrinsics handling, marker-to-object mapping, ID assignment, and VR data streaming through a configuration-based interface. We agree that without reported measurements of setup effort or time, the plug-and-play claim is not fully quantified. We will revise the system overview and abstract to provide a clearer description of the automation process, including example workflows, and add preliminary data on manual steps and calibration times from our implementation and testing.

revision: partial

Circularity Check

0 steps flagged

No circularity: system description paper with no derivation chain

The paper describes an end-to-end tracking system using established fiducial markers (ArUco, AprilTag) plus an original Colored Control Points variant, together with a software harness for object designation and streaming. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. The central claim is an engineering integration result evaluated via detection-rate experiments; it does not reduce any output to its own inputs by construction, self-citation load-bearing, or ansatz smuggling. The plug-and-play assertion is an empirical claim about the harness, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard computer vision assumptions about marker detectability and introduces one new fiducial entity without external validation.

axioms (1)

Domain assumption: Fiducial markers can be reliably detected under typical lighting, distance, and camera conditions for real-time VR use.
Invoked to support plug-and-play tracking without complex systems.

invented entities (1)

Colored Control Points fiducial system (no independent evidence)
purpose: Enable easy tag detection and object position data for real-to-virtual mapping.
New system proposed by the authors.

pith-pipeline@v0.9.0 · 5511 in / 1194 out tokens · 48768 ms · 2026-05-14T22:13:33.310649+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 1 internal anchor

[1] AprilRobotics: Apriltag (2019-2025), https://github.com/AprilRobotics/apriltag, [Accessed: (12.08.2025)]

[2] Bugden, W., Alahmar, A.: Rust: The programming language for safety and performance (2022), https://arxiv.org/abs/2206.05503

[3] Coates, A., Ng, A.Y.: Multi-camera object detection for robotics. In: 2010 IEEE International conference on robotics and automation. pp. 412–419. IEEE (2010)

[4] Coffield, D., Shepherd, D.: Tutorial guide to unix sockets for network communications. Computer Communications 10(1), 21–29 (1987)

[5] Collins, T., Bartoli, A.: Infinitesimal plane-based pose estimation. Int. J. Comput. Vision 109(3), 252–286 (Sep 2014). https://doi.org/10.1007/s11263-014-0725-5

[6] Culjak, I., Abram, D., Pribanic, T., Dzapo, H., Cifrek, M.: A brief introduction to opencv. In: 2012 Proceedings of the 35th International Convention MIPRO. pp. 1725–1730 (2012)

[7] Emilk: Egui (2020-2025), https://github.com/emilk/egui, [Accessed: (12.08.2025)]

[8] HTC: Vive tracker 3 (2025), https://www.vive.com/eu/accessory/tracker3/ [Accessed: (18.08.2025)]

[9] Iburahim, S.A., Naidu, B.C., Ananthan, P.: Virtual reality and augmented reality. Information Technology in Fisheries and Aquaculture p. 109 (2025)

[10] Jiang, Y., Zhang, L., Miao, Z., Zhu, X., Gao, J., Hu, W., Jiang, Y.G.: Polarformer: Multi-camera 3d object detection with polar transformer. In: Proceedings of the AAAI conference on Artificial Intelligence. vol. 37, pp. 1042–1050 (2023)

[11] Kapsoritakis, S.: A comparative study of virtual reality hand-tracking and controllers. Theseus.fi (2022)

[12] Li, S., Schieber, H., Corell, N., Egger, B., Kreimeier, J., Roth, D.: Gbot: Graph-based 3d object tracking for augmented reality-assisted assembly guidance. In: 2024 IEEE Conference Virtual Reality and 3D User Interfaces (VR). pp. 513–523. IEEE (2024)

[13] Lou, H., Duan, X., Guo, J., Liu, H., Gu, J., Bi, L., Chen, H.: Dc-yolov8: small-size object detection algorithm based on camera sensor. Electronics 12(10), 2323 (2023)

[14] Ng, A.K., Chan, L.K., Lau, H.Y.: A low-cost lighthouse-based virtual reality head tracking system. In: 2017 International Conference on 3D Immersion (IC3D). pp. 1–

[15] OpenCV: Aruco fiducial markers - detection (2016), https://docs.opencv.org/3.2.0/d5/dae/tutorial_aruco_detection.html, [Accessed: (12.08.2025)]

[16] Skromme, B.J., Rayes, P.J., McNamara, B.E., Seetharam, V., Gao, X., Thompson, T., Wang, X., Cheng, B., Huang, Y.F., Robinson, D.H.: Step-based tutoring system for introductory linear circuit analysis. In: 2015 IEEE Frontiers in Education Conference (FIE). pp. 1–9. IEEE (2015)

[17] Slater, M.: Grand challenges in virtual environments. Frontiers in Robotics and AI Volume 1 - 2014 (2014). https://doi.org/10.3389/frobt.2014.00003, https://www.frontiersin.org/journals/robotics-and-ai/articles/10.3389/frobt.2014.00003

[18] Sra, M., Schmandt, C.: Metaspace ii: Object and full-body tracking for interaction and navigation in social vr. arXiv preprint arXiv:1512.02922 (2015)

[19] Srinivasan, M.A., Basdogan, C.: Haptics in virtual environments: Taxonomy, research status, and challenges. Computers & Graphics 21(4), 393–404 (1997). https://doi.org/10.1016/S0097-8493(97)00030-7, https://www.sciencedirect.com/science/article/pii/S0097849397000307, Haptic Displays in Virtual Environments and Computer Graphics in Korea

[20] The Rust Foundation: The rust programming language (2014-2025), https://www.rust-lang.org/, [Accessed: (12.08.2025)]

[21] Unity Technologies: Unity (2023), https://unity.com/, game development platform [Accessed: (30.08.2025)]

[22] Valve: Steam vr tracking system (2016), https://partner.steamgames.com/vrtracking [Accessed: (18.08.2025)]

[23] Varjo: Varjo mixed reality (2025), https://varjo.com/

[24] Yao, S., Guan, R., Huang, X., Li, Z., Sha, X., Yue, Y., Lim, E.G., Seo, H., Man, K.L., Zhu, X., et al.: Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review. IEEE Transactions on Intelligent Vehicles 9(1), 2094–2128 (2023)
