arxiv: 2606.17936 · v2 · pith:TUAYM3XRnew · submitted 2026-06-16 · 💻 cs.RO

SPARK: Low Latency Single-Camera 3D Pose Estimation for Autonomous Racing using Keypoints

Pith reviewed 2026-06-27 00:51 UTC · model grok-4.3

classification 💻 cs.RO
keywords autonomous racing · 3D pose estimation · keypoint detection · monocular camera · low latency · YOLO · object detection · vehicle tracking

The pith

SPARK detects 3D poses of racing opponents from one camera using keypoints to achieve higher accuracy and lower latency than prior monocular methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SPARK, a single-camera algorithm that estimates 3D poses of other vehicles in autonomous racing by detecting keypoints. It relies on optimized YOLO models and the fixed geometry of the racetrack to support long-range detection with reduced delay compared to LiDAR. The approach targets faster and more accurate results than existing monocular detection techniques while keeping resource use low. A sympathetic reader would care because quicker detections allow safer trajectory planning against non-cooperative opponents during high-speed maneuvers. The method is tested on real-world racing data against state-of-the-art camera and LiDAR baselines.

Core claim

SPARK achieves long-range detection with high accuracy, exceeding the performance of state-of-the-art monocular camera detection algorithms while maintaining lower latency, by employing well-optimized YOLO models and leveraging the fixed geometry in the autonomous racing domain.

What carries the argument

Keypoint detection with well-optimized YOLO models that exploits fixed racetrack geometry to convert 2D image detections into 3D poses.
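
As a rough illustration of how known geometry can turn 2D keypoints into a 3D position, the sketch below applies the pinhole relation between a known physical separation and its extent in pixels. It is not the paper's method: the function name, the choice of two keypoints, the focal length, the principal point, and the 2.0 m separation are all hypothetical, and the model assumes an ideal pinhole camera with both keypoints at the same depth.

```python
def range_from_keypoint_pair(kp_a, kp_b, real_separation_m, fx_px, cx_px):
    """Return (forward range, lateral offset) in metres in the camera frame.

    Hypothetical helper, not taken from the paper. Assumes an ideal pinhole
    camera and that both keypoints lie at the same depth, e.g. two points on
    the rear of an opponent seen squarely from behind.
    """
    u_a, u_b = kp_a[0], kp_b[0]
    pixel_separation = abs(u_b - u_a)
    if pixel_separation == 0:
        raise ValueError("keypoints coincide horizontally; range is undefined")
    # Pinhole model: pixel_separation = fx * real_separation / range
    z = fx_px * real_separation_m / pixel_separation
    # Back-project the horizontal midpoint to get the lateral offset.
    u_mid = 0.5 * (u_a + u_b)
    x = (u_mid - cx_px) * z / fx_px
    return z, x


# Illustrative values only: 2000 px focal length, 2.0 m keypoint separation.
z, x = range_from_keypoint_pair(
    (980.0, 540.0), (1020.0, 545.0),
    real_separation_m=2.0, fx_px=2000.0, cx_px=960.0,
)
print(f"range = {z:.1f} m, lateral offset = {x:.1f} m")
```

The relation also shows why long range is hard: at 100 m the two keypoints are only 40 px apart in this example, so a one-pixel keypoint error shifts the range estimate by a few metres.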

If this is right

  • Detection latency drops enough to improve object tracking during high-dynamic racing maneuvers.
  • Monocular systems can replace or supplement slower LiDAR for opponent pose estimation on edge hardware.
  • Lower resource usage supports deployment on resource-constrained autonomous race vehicles.
  • Long-range accuracy enables earlier planning of collision-free trajectories against non-cooperative opponents.
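
To make the latency point concrete, the short sketch below computes how far an opponent moves relative to the ego vehicle while a detection is still being processed. The closing speed and the two latency values are illustrative placeholders, not measurements from the paper.

```python
def displacement_during_latency(closing_speed_mps, latency_ms):
    """Relative distance (m) an opponent covers while a detection is in flight."""
    return closing_speed_mps * latency_ms / 1000.0


# Placeholder numbers: 30 m/s closing speed, a slow and a fast detector.
for label, latency_ms in (("slow detector", 100.0), ("fast detector", 20.0)):
    d = displacement_during_latency(30.0, latency_ms)
    print(f"{label}: {latency_ms:.0f} ms latency -> {d:.1f} m of stale position")
```

Under these assumed numbers, the detection is already several car widths out of date with the slower detector, which is the kind of gap a tracker has to bridge by prediction.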

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may extend to other structured environments where track geometry is known in advance, such as oval tracks or test circuits.
  • Because it builds on standard YOLO detectors, retraining on new camera setups could be straightforward without custom hardware.
  • Eliminating LiDAR reliance could lower vehicle cost and complexity for teams focused on vision-only autonomy.

Load-bearing premise

The fixed geometry of the autonomous racing domain combined with well-optimized YOLO models will deliver the claimed accuracy and latency gains on real-world data without additional post-processing or domain-specific tuning that affects the central performance claims.

What would settle it

A head-to-head test on the same real-world autonomous racing dataset where SPARK accuracy falls below state-of-the-art monocular methods or its latency exceeds the monocular baselines would falsify the central performance claim.
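
The falsification criterion can be written down as a small check. The sketch below is a hypothetical formalization: the `Result` type, the single scalar accuracy metric, and all numbers are placeholders rather than values reported in the paper.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    accuracy: float    # any higher-is-better metric, measured on the same dataset
    latency_ms: float  # end-to-end detection latency


def central_claim_holds(spark, monocular_baselines):
    """False if SPARK is less accurate or slower than any monocular baseline."""
    for baseline in monocular_baselines:
        if spark.accuracy < baseline.accuracy:
            return False
        if spark.latency_ms > baseline.latency_ms:
            return False
    return True


# Placeholder numbers for illustration only.
spark = Result("SPARK", accuracy=0.80, latency_ms=10.0)
baselines = [Result("baseline A", accuracy=0.70, latency_ms=40.0)]
print(central_claim_holds(spark, baselines))
```

The check only means something if every result comes from the same real-world racing data and the same hardware, as the criterion above requires.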

Figures

Figures reproduced from arXiv: 2606.17936 by Dominic Ebner, Markus Lienkamp.

Figure 1: Multi-Vehicle autonomous racing on the Yas Marina Circuit during […]
Figure 2: Our approach for detecting vehicles using only a YOLO-Pose model without a 3D detection head. We first detect 2D vehicle keypoints and then […]
Figure 3: Sensor layout of the EAV24 autonomous racecar. The outlined front-facing camera and LiDAR were used to create the dataset. The selected […]
Figure 4: Annotations process for new data. a) LiDAR point clouds are annotated using localization data and manual refinement. b) The matching camera […]
Figure 5: 2D detection accuracy for different YOLO model versions and […]
Figure 6: 3D detection accuracy for different YOLO model versions and sizes as well as MonoDETR. For each YOLO model, we evaluate every model […]
Original abstract

In autonomous racing, fast detection of other participants' movements is required to plan safe, collision-free trajectories with non-cooperative opponents. LiDAR detection is inherently slower and harder to deploy on edge devices than vision methods, causing delayed detections that limit object tracking performance during high-dynamic maneuvering. Utilizing monocular 3D detection enables an easy-to-deploy, low-latency detection of other participants on the racetrack. We present SPARK, a single-camera pose-estimation algorithm for autonomous racing using keypoint detection. It achieves long-range detection with high accuracy, exceeding the performance of state-of-the-art monocular camera detection algorithms while maintaining lower latency. By employing well-optimized YOLO models and leveraging the fixed geometry in the autonomous racing domain, the algorithm also exhibits low latency and resource usage. We evaluate the performance of our approach on real-world autonomous racing data and compare it to state-of-the-art LiDAR and camera detection algorithms. The source code is available at: https://github.com/TUMFTM/SPARK-camera-det

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces SPARK, a monocular 3D pose estimation pipeline for autonomous racing that detects keypoints with optimized YOLO models and exploits the fixed track geometry to recover 3D poses. It claims long-range detection with accuracy superior to existing monocular detectors at lower latency than LiDAR, supported by evaluation on real-world racing data and direct comparisons to SOTA baselines; source code is released.

Significance. If the reported accuracy and latency advantages hold under the experimental conditions, the work provides a practical, edge-deployable vision alternative for high-dynamic racing scenarios where LiDAR latency is prohibitive. The public release of source code is a clear strength that supports reproducibility and community validation.

minor comments (3)
  1. Abstract: The superiority claims ('exceeding the performance of state-of-the-art monocular camera detection algorithms' and 'maintaining lower latency') are presented without any numerical values, error metrics, or dataset statistics, which is atypical and reduces the abstract's utility for readers.
  2. The manuscript would benefit from an explicit statement in the evaluation section of the number of frames/sequences, track variations, and whether cross-validation or multiple runs were used to establish statistical significance of the reported gains.
  3. Figure captions and table headers should consistently report units (e.g., latency in ms, range in meters) and the exact YOLO variant/backbone employed.
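
One way the statistical question in comment 2 could be addressed is a bootstrap confidence interval over per-sequence gains. The sketch below is a generic percentile bootstrap, not an analysis from the manuscript; the function name and the example gains are hypothetical.

```python
import random


def bootstrap_mean_gain_ci(gains, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of paired gains.

    `gains` holds one value per sequence: metric(SPARK) - metric(baseline).
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(gains)
    means = []
    for _ in range(n_resamples):
        resample = [gains[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-sequence accuracy gains, for illustration only.
example_gains = [0.04, 0.06, 0.01, 0.05, 0.03, 0.07, 0.02, 0.05]
low, high = bootstrap_mean_gain_ci(example_gains)
print(f"95% CI for mean gain: [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would support the claim that the gain is not an artifact of which sequences happened to be evaluated.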

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical computer-vision pipeline (YOLO-based keypoint detection plus domain geometry for monocular 3D pose) evaluated on real-world racing data with direct SOTA comparisons and released code. No equations, derivations, or first-principles claims appear in the abstract or described content that reduce a result to its own fitted inputs or self-citations by construction. The central performance claims rest on external benchmarks rather than internal self-definition or renamed fits.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract relies on standard computer vision assumptions for keypoint-based pose estimation and the applicability of YOLO models; no free parameters, new entities, or ad-hoc axioms are introduced or quantified.

axioms (1)
  • domain assumption: Monocular keypoint detection can be combined with known track geometry to recover accurate 3D poses at long range.
    Invoked when the abstract states that fixed geometry enables the performance.

pith-pipeline@v0.9.1-grok · 5711 in / 1144 out tokens · 39329 ms · 2026-06-27T00:51:48.744604+00:00 · methodology
