Improved Vision-to-Chart Buoy Association with Learned World-to-Image Projection

Borja Carrillo-Perez (Arquimea Research Center)

arxiv: 2605.22942 · v1 · pith:JKWSMER6new · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-25 05:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords buoy association · vision-to-chart · QueryMLP · DETR · world-to-image projection · data fusion · transformer decoder

The pith

Appending QueryMLP-predicted pixel coordinates to decoder queries eases geometric projection for buoy association.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that a lightweight MLP can be trained to map chart measurements and IMU orientation directly to image pixel locations for each buoy. Appending those predicted coordinates to the existing decoder query vectors supplies an explicit spatial prior, so the transformer no longer has to discover the full world-to-image mapping on its own. The modification yields an overall score of 0.7386 on the held-out test set, placing second in the challenge. A reader would care because it shows a practical way to inject geometric knowledge into fusion transformers without redesigning the architecture.

Core claim

The central claim is that training a dedicated QueryMLP to predict the buoy's waterline contact point in the image from chart measurements and IMU orientation data, then appending these coordinates to the baseline decoder query vector, supplies a direct spatial prior per buoy and thereby reduces the geometric reasoning burden on the transformer decoder.

What carries the argument

QueryMLP, a dedicated MLP that explicitly predicts the buoy's waterline contact point in the image from chart measurements and IMU orientation data.

If this is right

Decoder queries receive an explicit pixel-location prior for each buoy.
The transformer decoder faces a lighter geometric-projection task.
The method reaches an Overall score of 0.7386, F1 of 0.8055 and mIoU of 0.6718 on the held-out test set.
The approach places second on the challenge leaderboard.
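The three reported numbers can be cross-checked with simple arithmetic. The sketch below tests one hypothesis, that Overall is the unweighted mean of F1 and mIoU. This is an assumption made for illustration; the text does not state how the challenge defines the Overall score.

```python
# Reported test-set metrics from the text above.
f1 = 0.8055
miou = 0.6718
reported_overall = 0.7386

# Assumption (not stated in the source): Overall is the unweighted
# mean of F1 and mIoU.
candidate_overall = (f1 + miou) / 2

# Distance between the hypothesised value and the reported one.
gap = abs(candidate_overall - reported_overall)
print(f"candidate={candidate_overall:.5f} gap={gap:.5f}")
```

Under that assumption the gap is smaller than one unit in the fourth decimal place, so the reported figures are at least consistent with a simple average, though other weightings cannot be ruled out from these three numbers alone.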

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same explicit-pixel injection could be tested on other DETR-style association tasks that involve known 3-D to 2-D mappings.
If the MLP outputs are treated as soft priors rather than hard coordinates, the decoder might learn to down-weight them when they conflict with image evidence.
The technique presupposes reliable IMU orientation; performance would need re-evaluation on platforms lacking such sensors.

Load-bearing premise

The QueryMLP produces pixel predictions accurate enough to aid the transformer without introducing new errors or requiring extensive additional training.

What would settle it

Retraining the baseline without the appended pixel coordinates and checking whether the test-set Overall score falls materially below 0.7386.

Figures

Figures reproduced from arXiv: 2605.22942 by Borja Carrillo-Perez (Arquimea Research Center).

**Figure 1.** Query construction pipeline. (a) Baseline: chart distance and bearing are fed directly into the embedding MLP. (b) Ours: a frozen QueryMLP takes six features (distance, bearing, and three IMU orientation angles) and predicts the buoy waterline contact point [cx, cy+h/2] in image coordinates. These pixel coordinates are concatenated with the normalized distance and bearing to form a 4-dimensional query vec…
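The query construction described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the author's implementation: the layer sizes, the random weights standing in for the frozen trained QueryMLP, the sigmoid output, the ordering of the four query entries, and every function name are assumptions. The encoding of the six input features is also not specified in the text, so the example passes a placeholder list of six floats.

```python
import math
import random

def waterline_contact_point(cx, cy, h):
    """Training target from the caption: bottom-centre of a box [cx, cy+h/2]."""
    return (cx, cy + h / 2)

def make_mlp(sizes, seed=0):
    """Random weights standing in for a trained, frozen QueryMLP (hypothetical sizes)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def mlp_forward(layers, x):
    """ReLU on hidden layers; sigmoid on the output (an assumption) so
    the predicted coordinates are normalized to (0, 1)."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def build_query(features, dist_norm, bearing_norm, query_mlp):
    """Concatenate normalized distance and bearing with the predicted pixel
    coordinates into a 4-dimensional query (entry order is an assumption)."""
    px, py = mlp_forward(query_mlp, features)
    return [dist_norm, bearing_norm, px, py]

# Placeholder inputs: six features per buoy, as stated in the caption.
query_mlp = make_mlp([6, 16, 2], seed=0)
query = build_query([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], 0.25, 0.75, query_mlp)
print(query)
```

In the baseline path of panel (a), only the first two entries would reach the embedding MLP; the sketch differs from it solely by the two appended coordinates.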

**Figure 2.** Qualitative comparison on validation sample 00079. Row 1: input image and ground truth. Row 2: baseline predictions and ours. Row 3: top-view map of buoy positions (√-distance scale; green circle = ground truth, purple square = baseline, yellow triangle = ours). The baseline fires two false positive detections on non-buoy objects (left column, row 2); our method suppresses both while correctly detecting a…

Original abstract

This report presents a lightweight modification to the DETR-based fusion transformer baseline for the MaCVi 2026 Vision-to-Chart data association challenge. The challenge baseline decoder receives per-buoy queries encoding world-space distance and bearing, forcing the transformer to implicitly learn the complex geometric projection from world coordinates to image pixels. Instead, this work trains an additional dedicated MLP, QueryMLP, to explicitly predict the buoy's waterline contact point in the image from chart measurements and IMU orientation data. The predicted pixel coordinates are appended to the baseline decoder query vector, providing a direct spatial prior per buoy and reducing the geometric reasoning burden on the transformer decoder. On the challenge leaderboard, the presented approach achieves an Overall score of 0.7386, with F1 = 0.8055 and mIoU = 0.6718, on the held-out test set, placing second among all submissions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

This adds a QueryMLP to predict pixel locations and append them to DETR queries for a buoy challenge, but reports only the final score with no baseline or ablation to show the addition helps.

The paper's core move is to train a small MLP on chart measurements and IMU data to output the expected image pixel for each buoy's waterline contact point, then concatenate those two numbers onto the existing world-space query vector fed to the DETR decoder. The stated goal is to hand the transformer an explicit spatial prior so it spends less capacity on learning the projection. The system reaches 0.7386 overall on the held-out test set and places second on the leaderboard. That is a concrete engineering step for this exact task, and the idea of injecting a geometric hint directly into the query is reasonable given how DETR-style models handle set prediction. The final numbers are at least competitive for the challenge.

The weakness is straightforward and load-bearing: the manuscript gives no score for the unmodified baseline, no ablation that removes the appended coordinates, and no separate accuracy figure for the QueryMLP predictions themselves. Without those quantities it is not possible to tell whether the MLP is responsible for any of the reported performance or whether it is simply neutral or even slightly harmful. The stress-test note correctly flags this gap. The work is narrowly scoped to one maritime data-association challenge and does not test the approach on other datasets or tasks.

Readers already competing in the MaCVi challenge might pick up the QueryMLP trick as a quick addition to try. Everyone else will find little to take away because the experiments do not isolate the claimed benefit. I would not bring this to a reading group, would not cite it, and would not send it for peer review until the missing controls are added.

Referee Report

2 major / 2 minor

Summary. The paper proposes a modification to the DETR-based fusion transformer baseline for the MaCVi 2026 Vision-to-Chart buoy association challenge. It introduces a dedicated QueryMLP, trained on chart measurements and IMU orientation data, to explicitly predict each buoy's waterline contact point in the image; these predicted pixel coordinates are appended to the baseline decoder query vectors to supply a direct spatial prior and reduce the transformer's implicit geometric projection burden. The modified system reports an Overall score of 0.7386 (F1 = 0.8055, mIoU = 0.6718) on the challenge held-out test set, placing second on the leaderboard.

Significance. If the central claim holds, the work demonstrates a practical method for injecting an explicit learned world-to-image mapping into transformer decoders for geometric data-association tasks, which could generalize to other vision-to-chart or sensor-fusion settings. The competitive held-out performance suggests engineering utility, but the absence of isolating experiments prevents a clear assessment of whether the added component drives the result or merely accompanies other unstated changes.

major comments (2)

[Abstract / Results] Abstract and Results: the central claim that 'appending the predicted pixel coordinates from the QueryMLP ... reduces the geometric reasoning burden on the transformer decoder' cannot be evaluated because the manuscript reports only the final leaderboard numbers and provides neither (a) the unmodified baseline score, (b) an ablation that removes the appended coordinates, nor (c) any pixel-level accuracy or error statistics for the QueryMLP itself.
[Abstract] The soundness of attributing gains to the QueryMLP rests on the untested assumption that its predictions are sufficiently accurate to act as a useful prior rather than noise; without reported training details, validation metrics on the MLP, or comparison to the baseline decoder alone, this assumption remains unverified and load-bearing for the contribution.

minor comments (2)

[Abstract] The abstract states the approach is 'lightweight' but supplies no architecture diagram, layer counts, or training hyperparameters for QueryMLP, making reproducibility difficult.
No error analysis or failure-case discussion is mentioned, which would help readers understand when the added spatial prior helps or harms association.

Simulated Author's Rebuttal

2 responses · 3 unresolved

We thank the referee for the comments. We provide point-by-point responses below.

Referee: [Abstract / Results] Abstract and Results: the central claim that 'appending the predicted pixel coordinates from the QueryMLP ... reduces the geometric reasoning burden on the transformer decoder' cannot be evaluated because the manuscript reports only the final leaderboard numbers and provides neither (a) the unmodified baseline score, (b) an ablation that removes the appended coordinates, nor (c) any pixel-level accuracy or error statistics for the QueryMLP itself.

Authors: We acknowledge that isolating experiments are absent from the manuscript. The presented work focuses on the performance achieved by the modified system on the challenge test set. As the baseline was not re-implemented or evaluated by us, we cannot provide the requested comparisons. The second-place ranking serves as an indirect indicator of effectiveness. We will revise the text to avoid over-attributing the result to the QueryMLP without direct evidence. revision: partial
Referee: [Abstract] The soundness of attributing gains to the QueryMLP rests on the untested assumption that its predictions are sufficiently accurate to act as a useful prior rather than noise; without reported training details, validation metrics on the MLP, or comparison to the baseline decoder alone, this assumption remains unverified and load-bearing for the contribution.

Authors: The manuscript does not report training details or validation metrics for the QueryMLP, as the emphasis was on the overall association performance. We accept that this leaves the assumption unverified. We will add a description of how the QueryMLP was trained in the revised version. revision: partial

standing simulated objections not resolved

unmodified baseline score on the test set
ablation removing the appended coordinates
pixel-level accuracy statistics for the QueryMLP
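The third unresolved item, pixel-level accuracy of the QueryMLP, could be reported with two simple statistics: the mean Euclidean error in pixels and the fraction of predictions within a tolerance. The sketch below shows one way to compute them. The function names, the 10-pixel threshold, and the sample coordinates are all hypothetical; none of them come from the paper.

```python
import math

def pixel_errors(predicted, ground_truth):
    """Euclidean distance in pixels between each predicted and true contact point."""
    return [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]

def summarize(errors, threshold_px):
    """Return (mean error, fraction of errors at or below threshold_px)."""
    n = len(errors)
    mean_error = sum(errors) / n
    within = sum(1 for e in errors if e <= threshold_px) / n
    return mean_error, within

# Made-up coordinates purely to exercise the functions.
predicted = [(100.0, 200.0), (303.0, 404.0)]
ground_truth = [(100.0, 205.0), (300.0, 400.0)]

errors = pixel_errors(predicted, ground_truth)
mean_error, within_10px = summarize(errors, threshold_px=10.0)
print(mean_error, within_10px)
```

Reporting these figures on a validation split, next to the association metrics, would let a reader judge whether the appended coordinates act as a useful prior or as noise.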

Circularity Check

0 steps flagged

No circularity: standard learned component on held-out data

The paper trains QueryMLP on chart/IMU inputs to predict image pixels, appends the output as a query prior, and reports leaderboard metrics on an external held-out test set. No derivation reduces a claimed result to its own fitted values by construction, no self-citation chain supports a load-bearing uniqueness claim, and no ansatz or renaming is presented as a first-principles derivation. The modification is a conventional supervised addition whose contribution is measured externally rather than asserted tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract only; no explicit free parameters, axioms, or invented entities are described beyond standard neural network components and the introduction of QueryMLP.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Computer Vision – ECCV 2020. pp. 213–229. Springer International Publishing (2020)

[2] Carrillo-Perez, B.: Real-time ship recognition and georeferencing for the improvement of maritime situational awareness. Ph.D. thesis, University of Bremen (2024). https://doi.org/10.26092/elib/3265

[3] Carrillo-Perez, B., Barnes, S., Stephan, M.: Ship segmentation and georeferencing from static oblique view images. Sensors 22(7), 2713 (2022)

[4] Kreis, M., Kiefer, B.: Real-time fusion of visual and chart data for enhanced maritime vision. arXiv preprint arXiv:2507.13880 (2025)
