Dual Pose-Graph Semantic Localization for Vision-Based Autonomous Drone Racing

Alvaro J. Gaona; David Perez-Saura; Miguel Fernandez-Cortizas; Pascual Campoy

arxiv: 2604.15168 · v1 · submitted 2026-04-16 · 💻 cs.RO

Pith reviewed 2026-05-10 10:52 UTC · model grok-4.3

classification 💻 cs.RO

keywords drone racing · visual localization · pose graph optimization · semantic SLAM · visual-inertial odometry · autonomous navigation · gate detection

The pith

A dual pose-graph fuses odometry and repeated gate sightings to cut localization drift in high-speed drone racing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a temporary graph can collect multiple visual detections of each racing gate between keyframes, optimize those into one clean constraint, and then add the result to a persistent main graph. This keeps the total number of nodes low enough for real-time onboard use while still using every available observation. Standalone visual-inertial odometry drifts badly under motion blur and sharp turns; the dual structure reduces that drift by more than half on the TII-RATM dataset and by up to 4.2 m per lap in actual competition flights. An ablation test confirms the two-graph split itself adds 10-12 % accuracy at the same compute cost as a single-graph version.

Core claim

The central claim is that separating short-term accumulation of landmark observations from long-term map maintenance lets a vision-based system exploit the fixed gate layout of a race track without letting the pose graph grow unbounded. Multiple detections of the same gate are fused inside the temporary graph into a single refined edge; that edge is then promoted to the main graph, preserving information density while bounding computational cost. The resulting trajectory error is 56-74 % lower than pure VIO and the method runs in real time on the drone.

What carries the argument

Dual pose-graph architecture: a temporary graph that accumulates and optimizes repeated semantic detections of each gate before promoting a single refined constraint to the persistent main graph.
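
The temporary-to-persistent promotion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class names (`GateObservation`, `TemporaryGraph`, `MainGraph`), the 2D positions, the isotropic variance, and the inverse-variance weighted average that stands in for the temporary graph's optimization are all assumptions made for illustration. The point it shows is that several detections of one gate become a single edge in the persistent graph.

```python
from dataclasses import dataclass, field


@dataclass
class GateObservation:
    """One detection of a gate: estimated gate position and its uncertainty."""
    gate_id: int
    x: float
    y: float
    variance: float  # isotropic variance (illustrative simplification)


@dataclass
class MainGraph:
    """Persistent graph: receives one refined constraint per gate per keyframe."""
    edges: list = field(default_factory=list)

    def add_edge(self, keyframe_id, gate_id, x, y, variance):
        self.edges.append((keyframe_id, gate_id, x, y, variance))


@dataclass
class TemporaryGraph:
    """Short-lived buffer that accumulates detections between keyframes."""
    observations: dict = field(default_factory=dict)

    def add(self, obs):
        self.observations.setdefault(obs.gate_id, []).append(obs)

    def promote(self, keyframe_id, main_graph):
        """Fuse each gate's detections into one constraint, then clear the buffer.

        Inverse-variance weighting is a stand-in for a real graph optimization.
        """
        for gate_id, obs_list in self.observations.items():
            weights = [1.0 / o.variance for o in obs_list]
            total = sum(weights)
            x = sum(w * o.x for w, o in zip(weights, obs_list)) / total
            y = sum(w * o.y for w, o in zip(weights, obs_list)) / total
            main_graph.add_edge(keyframe_id, gate_id, x, y, 1.0 / total)
        self.observations.clear()


# Toy usage: four detections of a hypothetical gate 3 between two keyframes.
main = MainGraph()
temp = TemporaryGraph()
for x, y in [(10.2, 4.9), (9.9, 5.1), (10.1, 5.0), (9.8, 5.0)]:
    temp.add(GateObservation(gate_id=3, x=x, y=y, variance=0.04))
temp.promote(keyframe_id=7, main_graph=main)
print(len(main.edges), main.edges[0])
```

In this sketch the main graph grows by one edge per gate per keyframe no matter how many detections arrive in between, which is the size-bounding property the architecture relies on.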

If this is right

Error falls 56-74 % versus pure visual-inertial odometry on the TII-RATM dataset.
The dual split gives 10-12 % extra accuracy at identical runtime cost compared with a single graph.
Drift is cut by up to 4.2 m per lap during real competition flights.
The system stays real-time onboard while still using every gate observation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same temporary-to-persistent split could be tried on other repeating landmarks such as road signs or building corners.
If gate detectors improve, the temporary graph could accumulate fewer but higher-quality observations without changing the main-graph logic.
Testing the method on tracks with varying gate spacing would show how sensitive the refinement step is to landmark density.

Load-bearing premise

Semantic detections of gates must remain reliable even when the drone is moving fast and turning sharply.

What would settle it

Measure absolute trajectory error on a new racing track where gate detections drop below 70 % reliability; if the 56 % error reduction disappears, the method's advantage is refuted.
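
For readers who want to run this kind of check, the following is a minimal sketch of how absolute trajectory error and the relative reduction could be computed. It assumes the estimated and ground-truth trajectories are already time-matched and expressed in the same frame (no alignment step is performed), and the function names `ate_rmse` and `relative_reduction` as well as the toy trajectories are hypothetical, not taken from the paper.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square position error between two time-matched trajectories.

    Both inputs are sequences of equal length holding (x, y) or (x, y, z) tuples.
    Assumes the trajectories are already aligned to a common frame.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
        for p_est, p_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline_error, improved_error):
    """Percentage by which improved_error is lower than baseline_error."""
    return 100.0 * (baseline_error - improved_error) / baseline_error


# Toy data (made up for illustration, not results from the paper).
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
vio_only = [(0.0, 0.0), (1.0, 0.2), (2.0, 0.4), (3.0, 0.6)]
with_gates = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.1), (3.0, 0.2)]

baseline = ate_rmse(vio_only, ground_truth)
improved = ate_rmse(with_gates, ground_truth)
print(f"ATE reduction: {relative_reduction(baseline, improved):.1f} %")
```

Running the same comparison on sequences binned by gate-detection reliability would show whether the reduction survives when detections become scarce.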

Figures

Figures reproduced from arXiv: 2604.15168 by Alvaro J. Gaona, David Perez-Saura, Miguel Fernandez-Cortizas, Pascual Campoy.

**Figure 2.** Dual pose-graph architecture. Between main graph …

**Figure 3.** XY trajectory comparison on ellipse and lemniscate …

**Figure 4.** Comparison between OpenVINS odometry and our …

Original abstract

Autonomous drone racing demands robust real-time localization under extreme conditions: high-speed flight, aggressive maneuvers, and payload-constrained platforms that often rely on a single camera for perception. Existing visual SLAM systems, while effective in general scenarios, struggle with motion blur and feature instability inherent to racing dynamics, and do not exploit the structured nature of racing environments. In this work, we present a dual pose-graph architecture that fuses odometry with semantic detections for robust localization. A temporary graph accumulates multiple gate observations between keyframes and optimizes them into a single refined constraint per landmark, which is then promoted to a persistent main graph. This design preserves the information richness of frequent detections while preventing graph growth from degrading real-time performance. The system is designed to be sensor-agnostic, although in this work we validate it using monocular visual-inertial odometry and visual gate detections. Experimental evaluation on the TII-RATM dataset shows a 56% to 74% reduction in ATE compared to standalone VIO, while an ablation study confirms that the dual-graph architecture achieves 10% to 12% higher accuracy than a single-graph baseline at identical computational cost. Deployment in the A2RL competition demonstrated that the system performs real-time onboard localization during flight, reducing the drift of the odometry baseline by up to 4.2 m per lap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The dual-graph setup delivers clear ATE gains on the racing dataset and runs onboard in competition, but without detection quality metrics it's unclear how much the temporary-to-persistent promotion actually drives the results.

The dual pose-graph for semantic localization in drone racing cuts ATE by 56 to 74 percent versus standalone VIO on the TII-RATM dataset and reduces drift in actual competition flights. The key new piece is the temporary accumulation graph, which lets them use lots of detections without slowing down the main graph.

They do a good job showing the practical side. The ablation confirms the dual setup beats single-graph by 10-12 percent at the same cost, and the onboard deployment in A2RL is a solid check that it runs in real time on the platform. The sensor-agnostic framing is reasonable even if the experiments stick to monocular VIO plus gate detections.

The soft spot is the missing validation of the detections themselves. There are no numbers on how often the gates are detected correctly or how reprojection errors change with speed and blur, so we cannot tell whether the gains come from the graph promotion logic or just from having good detections on the easier parts of the tracks. The single-graph comparison does not address that.

This is useful for anyone working on SLAM or localization for high-speed vehicles in environments with repeatable visual landmarks. Robotics folks focused on racing or similar structured navigation would find the graph management approach worth looking at.

I think it deserves peer review. The results and the competition test give it enough weight that referees can dig into the details and push on the evaluation gaps.

Referee Report

1 major / 0 minor

Summary. The paper presents a dual pose-graph semantic localization system for vision-based autonomous drone racing. It fuses monocular visual-inertial odometry (VIO) with visual gate detections via a temporary graph that accumulates multiple observations between keyframes, optimizes them into a single refined constraint per landmark, and promotes the result to a persistent main graph. This design aims to retain information from frequent detections while controlling graph size for real-time performance on payload-constrained platforms. The authors claim 56%–74% ATE reduction versus standalone VIO on the TII-RATM dataset, 10%–12% higher accuracy than a single-graph baseline at identical computational cost, and successful real-time onboard deployment in the A2RL competition that reduces odometry drift by up to 4.2 m per lap.

Significance. If the quantitative claims hold after addressing evaluation gaps, the work offers a practical, sensor-agnostic approach to improving localization robustness in high-speed structured environments without increasing compute or graph complexity. The ablation study at fixed computational cost and the real-world competition deployment provide concrete evidence of deployability on racing drones.

major comments (1)

Experimental evaluation (abstract and corresponding results section): The headline 56%–74% ATE reduction versus VIO and the 10%–12% dual-graph gain are presented as evidence for the temporary-to-persistent promotion mechanism. However, no per-sequence detection metrics (recall, precision, reprojection error, or failure rate) are reported, nor are they correlated with motion blur or angular velocity. Without these, it is impossible to rule out that measured gains arise primarily from high-quality detections on easier segments rather than from the dual-graph architecture. The single-graph ablation controls architecture but not detection quality, leaving the load-bearing assumption untested.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment on experimental evaluation below and will incorporate additional analysis to strengthen the attribution of results to the dual-graph architecture.

Referee: Experimental evaluation (abstract and corresponding results section): The headline 56%–74% ATE reduction versus VIO and the 10%–12% dual-graph gain are presented as evidence for the temporary-to-persistent promotion mechanism. However, no per-sequence detection metrics (recall, precision, reprojection error, or failure rate) are reported, nor are they correlated with motion blur or angular velocity. Without these, it is impossible to rule out that measured gains arise primarily from high-quality detections on easier segments rather than from the dual-graph architecture. The single-graph ablation controls architecture but not detection quality, leaving the load-bearing assumption untested.

Authors: We agree that per-sequence detection metrics would strengthen the evidence. In the revised manuscript we will add per-sequence recall, precision, reprojection error, and failure rates for gate detections, together with their correlation to motion blur and angular velocity on the TII-RATM sequences. Both the single-graph baseline and the proposed dual-graph system receive identical VIO estimates and identical gate detections as input; the only difference lies in the temporary accumulation and promotion step. Consequently, the 10–12 % accuracy improvement in the ablation isolates the contribution of the dual-graph mechanism while holding detection quality constant. The larger gains versus standalone VIO reflect the addition of semantic constraints; the ablation then shows what our specific architecture adds on top of those constraints at fixed compute.

Revision: yes

Circularity Check

0 steps flagged

No circularity: architecture and results are design choices validated externally

The paper describes a dual pose-graph design (temporary graph for multi-observation refinement of gate landmarks, then promotion to persistent graph) as an engineering solution to balance information density against real-time performance. This is presented as an explicit architectural choice, not derived from any equation or theorem that loops back to its own inputs. All quantitative claims (56-74% ATE reduction vs. VIO, 10-12% dual-graph gain at fixed cost) are tied to comparisons against external baselines on the public TII-RATM dataset plus onboard deployment in A2RL; no fitted parameter is relabeled as a prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of reliable gate detections under extreme conditions and standard assumptions from visual SLAM; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Semantic detections of gates are available and sufficiently accurate during high-speed aggressive flight.
The fusion of odometry with semantic detections is central to the system.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] E. Kaufmann, L. Bauersfeld, A. Loquercio, M. Müller, V. Koltun, and D. Scaramuzza, “Champion-level drone racing using deep reinforcement learning,” Nature, vol. 620, pp. 982–987, 2023.
[2] P. Foehn, D. Brescianini, E. Kaufmann, T. Cieslewski, M. Gehrig, M. Muglikar, and D. Scaramuzza, “Alphapilot: Autonomous drone racing,” Auton. Robots, vol. 46, no. 1, pp. 307–320, 2022.
[3] C. Campos, R. Elvira, J. J. Rodríguez Gómez, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap slam,” IEEE Trans. Robot., vol. 37, no. 6, pp. 1874–1890, 2021.
[4] T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Trans. Robot., vol. 34, no. 4, pp. 1004–1020, 2018.
[5] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, 2017.
[6] T. Qin, J. Pan, S. Cao, and S. Shen, “A general optimization-based framework for local odometry estimation with multiple sensors,” Auton. Robots, vol. 44, no. 3, pp. 421–436, 2020.
[7] M. Bloesch, M. Burri, S. Omari, M. Hutter, and R. Siegwart, “Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback,” Int. J. Robot. Res., vol. 36, no. 10, pp. 1053–1072, 2017.
[8] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang, “OpenVINS: A research platform for visual-inertial estimation,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2020, pp. 4328–4334.
[9] K. Qiu, T. Qin, J. Pan, S. Liu, and S. Shen, “Tracking at least 3 keypoints in a single image with lightweight, accurate and robust monocular visual-inertial SLAM,” IEEE Robot. Autom. Lett., vol. 5, no. 4, pp. 5257–5264, 2020.
[10] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast semi-direct monocular visual odometry,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2014, pp. 15–22.
[11] L. von Stumberg and D. Cremers, “DM-VIO: Delayed marginalization visual-inertial odometry,” IEEE Robot. Autom. Lett., vol. 7, no. 2, pp. 1408–1415, 2022.
[12] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. J. Kelly, and A. J. Davison, “SLAM++: Simultaneous localisation and mapping at the level of objects,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2013, pp. 1352–1359.
[13] H. Bavle, J. L. Sanchez-Lopez, M. Shaheer, J. Civera, and H. Voos, “Situational graphs for robot navigation in structured indoor environments,” IEEE Robot. Autom. Lett., vol. 7, no. 4, pp. 9107–9114, 2022.
[14] N. Hughes, Y. Chang, and L. Carlone, “Hydra: A real-time spatial perception system for 3d scene graph construction and optimization,” in Robotics: Science and Systems (RSS), 2022.
[15] M. Bosello, D. Aguiari, M. Bertogna, and L. Mottola, “Race against the machine: A fully-annotated, open-design dataset of autonomous and piloted high-speed flight,” IEEE Robot. Autom. Lett., vol. 9, no. 4, pp. 3799–3806, 2024.
[16] S. Li, C. De Wagter, and G. C. H. E. de Croon, “Autonomous drone race: A computationally efficient gate detection and path planning,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2020, pp. 3330–3336.
[17] M. Fernandez-Cortizas, M. Molina, P. Arias-Perez, R. Perez-Segui, D. Perez-Saura, and P. Campoy, “Aerostack2: A software framework for developing multi-robot aerial systems,” arXiv, 2023.
[18] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, “g2o: A general framework for graph optimization,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2011, pp. 3607–3613.
[19] A. Azhari, M. Bosello, D. Aguiari, and L. Mottola, “Drift-corrected monocular VIO and perception-aware planning for autonomous drone racing,” arXiv, 2025.
