MapDreamer: Aerial Imagery Conditioned Latent Diffusion for Lane-Level Map Generation

Julian Brandes; Philipp Crocoll; Wolfram Burgard

arxiv: 2607.01370 · v1 · pith:ZBN753JWnew · submitted 2026-07-01 · 💻 cs.CV


Pith reviewed 2026-07-03 21:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords lane-level map generation · latent diffusion models · aerial imagery conditioning · vector map synthesis · graph generation · autonomous driving maps · conditional generative models


The pith

A diffusion model generates lane-level vector maps with explicit topology directly from aerial images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MapDreamer learns a compact latent representation of lane centerlines and their connections using a variational autoencoder. It then applies a transformer-based latent diffusion process to predict the full graph structure. Each denoising step receives dense features from the input aerial image through cross-attention so the output stays aligned with visible roads. A dedicated lane cardinality module and background ghost lane latents keep the model from collapsing when the number of lanes changes across scenes. A sliding-window aggregation step stitches local predictions into larger connected maps.
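The cross-attention conditioning described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses a single head with identity projections, and the function names `cross_attention` and `denoise_step`, the residual update, and the toy vectors are all assumptions made for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(lane_latents, image_features):
    """Single-head cross-attention with identity projections (illustrative only).

    lane_latents:   N query vectors, one per lane slot, each of length d
    image_features: M key/value vectors (dense aerial features), each of length d
    Returns N vectors, each a convex combination of the image features.
    """
    d = len(lane_latents[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in lane_latents:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in image_features]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, image_features)) for j in range(d)])
    return out

def denoise_step(lane_latents, image_features):
    """Hypothetical residual update: every step lets the latents attend to the image."""
    attended = cross_attention(lane_latents, image_features)
    return [[z + a for z, a in zip(zs, att)] for zs, att in zip(lane_latents, attended)]

# Toy data: two lane slots, three image feature vectors.
latents = [[1.0, 0.0], [0.0, 1.0]]
features = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
attended = cross_attention(latents, features)
updated = denoise_step(latents, features)
```

Each lane slot pulls in the image features it scores highest against, which is the mechanism by which the generated graph can stay tied to the visible road layout. A real model would add learned query, key, and value projections, multiple heads, and timestep conditioning.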

Core claim

MapDreamer is a generative diffusion model that synthesizes lane-level vector maps with explicit topology directly from a single aerial image. It learns a compact latent representation of lane centerlines and their topological relations using a variational autoencoder and predicts graphs with a transformer-based latent diffusion model. Dense aerial features are injected through cross-attention at every denoising step. A lane cardinality module paired with background ghost lane latents handles varying lane counts without slot collapse. A sliding-window global graph aggregation strategy stitches local tiles into city-scale maps while preserving connectivity.

What carries the argument

The transformer-based latent diffusion model that predicts graphs from a VAE latent space, conditioned at each step by cross-attention on dense aerial image features, together with the lane cardinality module and ghost lane latents that stabilize variable lane counts.
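A rough sketch of how the slot set might be assembled may help here. According to the Figure 2 caption, training uses the ground-truth number of lane latents plus a random number of ghost latents, while inference uses the predicted number plus a fixed five. Everything else below is an assumption for illustration: the function names, the `max_ghosts` bound, the Gaussian initialization, and the `is_lane` mask.

```python
import random

GHOSTS_AT_INFERENCE = 5  # fixed ghost count at inference, as stated in the Figure 2 caption

def make_noise_slots(num_slots, latent_dim, rng):
    """Draw one Gaussian noise vector per slot; diffusion would start from these."""
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(num_slots)]

def training_slots(gt_num_lanes, latent_dim, rng, max_ghosts=8):
    """Ground-truth lane count plus a random number of ghost slots.

    max_ghosts is a hypothetical bound; the source only says "a random number".
    Returns (slots, is_lane), where is_lane marks the slots that carry real lanes.
    """
    num_ghosts = rng.randint(0, max_ghosts)  # inclusive on both ends
    slots = make_noise_slots(gt_num_lanes + num_ghosts, latent_dim, rng)
    is_lane = [True] * gt_num_lanes + [False] * num_ghosts
    return slots, is_lane

def inference_slots(predicted_num_lanes, latent_dim, rng):
    """Predicted lane count (from a cardinality module) plus the fixed ghost buffer."""
    return make_noise_slots(predicted_num_lanes + GHOSTS_AT_INFERENCE, latent_dim, rng)

rng = random.Random(0)
train, is_lane = training_slots(gt_num_lanes=3, latent_dim=4, rng=rng)
infer = inference_slots(predicted_num_lanes=2, latent_dim=4, rng=rng)
```

The point of the buffer, as the summary describes it, is that a slightly wrong lane-count prediction does not force real lanes to compete for too few slots. How the model separates lane slots from ghost slots after denoising is not specified in the text above, so the sketch stops at slot construction.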

If this is right

The generated maps exhibit higher geometric accuracy and better preservation of lane connections than non-generative baselines on the UrbanLaneGraph dataset.
Local predictions can be stitched into city-scale maps, with connectivity across tiles preserved through the encoded lane boundaries.
The approach outputs vector graphs directly, so there is no intermediate raster representation to post-process.
The model handles scenes with different numbers of lanes by using the cardinality module and ghost latents during diffusion.
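To make the stitching consequence concrete, here is a deliberately naive sketch of sliding-window tiling and graph merging. It is not the paper's aggregation strategy, which relies on encoded lane boundaries; this stand-in simply merges nodes that fall within a distance threshold. The names `tile_origins`, `merge_tile_graphs`, and `merge_radius` are hypothetical.

```python
import math

def tile_origins(extent, tile, stride):
    """1-D sliding-window origins so that tiles of size `tile` cover [0, extent)."""
    if tile >= extent:
        return [0]
    origins = list(range(0, extent - tile + 1, stride))
    if origins[-1] + tile < extent:
        origins.append(extent - tile)  # last tile sits flush with the border
    return origins

def merge_tile_graphs(tile_graphs, merge_radius):
    """Merge per-tile graphs given in global coordinates.

    tile_graphs: list of (nodes, edges); nodes are (x, y) tuples and edges are
    (i, j) index pairs local to that tile. Nodes closer than merge_radius to an
    already merged node are treated as the same node (naive proximity rule).
    """
    merged_nodes = []
    merged_edges = set()
    for nodes, edges in tile_graphs:
        local_to_global = []
        for x, y in nodes:
            match = None
            for idx, (mx, my) in enumerate(merged_nodes):
                if math.hypot(x - mx, y - my) <= merge_radius:
                    match = idx
                    break
            if match is None:
                merged_nodes.append((x, y))
                match = len(merged_nodes) - 1
            local_to_global.append(match)
        for i, j in edges:
            a, b = local_to_global[i], local_to_global[j]
            if a != b:  # skip edges that collapsed onto a single node
                merged_edges.add((a, b))
    return merged_nodes, sorted(merged_edges)

tile_a = ([(0.0, 0.0), (5.0, 0.0)], [(0, 1)])
tile_b = ([(5.1, 0.0), (10.0, 0.0)], [(0, 1)])
nodes, edges = merge_tile_graphs([tile_a, tile_b], merge_radius=0.5)
```

In the toy example the end of the lane in the first tile and the start of the lane in the second tile are 0.1 units apart, so they merge and the route stays connected across the tile border. Overlapping tiles matter for exactly this reason: without a shared region there is nothing to match on.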

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditioning and cardinality mechanism could be tested on other overhead imagery sources such as satellite photos to check whether the alignment still holds at lower resolution.
If the ghost lane latents prove stable, the architecture might extend to generating additional map elements such as traffic signs or crosswalks within the same graph.
City-scale stitching success suggests the method could support incremental map updates when new aerial images arrive for only part of a road network.

Load-bearing premise

Dense aerial image features can be injected via cross-attention during denoising to keep the generated map aligned with the scene, and the cardinality module plus ghost lanes reliably stop the model from collapsing slots when lane numbers vary.

What would settle it

An aerial image with a clear change in lane count or a visible intersection, for which the generated map shows the wrong number of lanes or connectivity that does not match the image.

Figures

Figures reproduced from arXiv: 2607.01370 by Julian Brandes, Philipp Crocoll, Wolfram Burgard.

**Figure 1.** We propose MapDreamer for lane-level graph generation from aerial imagery. For each input tile, vectorized lane geometry is predicted along with the corresponding graph topology. For city-scale map prediction, we aggregate a global graph through overlapping imagery tiles. The figure displays inference outputs of MapDreamer.

**Figure 2.** Overview of MapDreamer training and inference at diffusion step t. Stage one trains the VAE encoder E_ϕ and decoder D_γ; stage two freezes E_ϕ and trains the LDM noise predictor ϵ_θ and lane cardinality module π_ψ. Training (T) uses the ground truth number of latents N_l plus a random number of N_g ghost latents, while inference (I) uses the predicted number Ñ_l of latents plus a fixed number N̄_g = 5 of ghost lat…

**Figure 3.** Qualitative results for local lane graph prediction from aerial imagery across multiple cities, comparing MapDreamer with BGFormer [3].

**Figure 4.** Global graph inference strategy using boundary features (a) and qualitative results for global graph generation (b).

**Figure 5.** Local lane graph failure cases of MapDreamer visualized using pink boxes for missing predictions and geometry errors, and pink lanes for wrong predictions. Panel titles: (a) Occlusions causing under-prediction. (b) Incorrect turn toward one-way road. (c) Complex intersection w/o lane markings. (d) Missing lane and shifted geometry.

Original abstract

High definition map generation is essential for autonomous driving, yet remains a labor-intensive process at scale. We present MapDreamer, a generative diffusion model that synthesizes lane-level vector maps with explicit topology directly from a single aerial image. MapDreamer learns a compact latent representation of lane centerlines and their topological relations using a variational autoencoder and predicts graphs with a transformer-based latent diffusion model. To align generated maps with the observed scene, we condition each denoising step on dense aerial features injected through cross-attention. To handle the varying number of lanes across scenes, we propose a lane cardinality module paired with background ghost lane latents, a learned buffer that prevents slot collapse during diffusion. Furthermore, we introduce a sliding-window global graph aggregation strategy that stitches local tiles into city-scale maps while preserving connectivity through encoded lane boundaries. Experiments on UrbanLaneGraph derived from Argoverse 2 show improved geometric and topological fidelity over non-generative baselines.
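For orientation, the abstract's "transformer-based latent diffusion model" conditioned on aerial features can be read against the textbook noise-prediction objective from DDPM [11] and latent diffusion [19]. The block below is that standard formulation, not a quotation from the paper. The symbols $E_\phi$ and $\epsilon_\theta$ follow the Figure 2 caption, while $x$ (the lane graph), $c$ (the dense aerial features), and the noise schedule $\bar{\alpha}_t$ are notation assumed here.

```latex
% Forward (noising) process on a clean latent z_0 = E_\phi(x)
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Noise-prediction objective, conditioned on dense aerial features c
\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z_0,\, \epsilon,\, t}
\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 \right]
```

The paper's own loss may add terms or use a different parameterization; the sketch only fixes what "condition each denoising step" means in the usual setup, namely that $c$ is an input to $\epsilon_\theta$ at every $t$.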

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MapDreamer applies latent diffusion to aerial-to-lane-graph generation with targeted fixes for variable cardinality and global stitching, but the gains look incremental and rest on standard components.


The core of this paper is a VAE plus transformer latent diffusion pipeline that takes a single aerial image and outputs a vector lane map with explicit topology. They condition the denoising steps on dense aerial features via cross-attention, add a lane cardinality module plus background ghost latents to avoid slot collapse on scenes with different lane counts, and use a sliding-window aggregation step to build city-scale graphs while trying to keep connectivity. Those two modules plus the global stitching strategy are the parts that are not completely routine extensions of existing diffusion work.

The approach makes sense for the HD map automation problem in autonomous driving. Generating structured vector output directly instead of raster maps is a reasonable direction, and the ghost latent buffer is a practical way to handle variable set sizes without forcing a fixed maximum. The sliding-window idea also addresses a real deployment need for large areas.

The soft spots are in the evaluation and the size of the claimed improvement. The abstract says better geometric and topological fidelity on UrbanLaneGraph from Argoverse 2, but the strength of that claim depends on the actual numbers, the choice of baselines, and whether ablations isolate the contribution of the cardinality module versus the diffusion backbone. If the gains are small once you control for model size or training data, the new pieces may not justify the added complexity. Generalization beyond the Argoverse-derived set is also untested in the description.

This paper is aimed at people working on map perception and generative models for graphs in robotics. A reader already following diffusion applications to structured prediction would find the specific engineering choices useful to examine. The work is coherent enough on its own terms that it deserves a serious referee who can check the experiments and ablations in detail rather than a desk reject.

Referee Report

1 major / 1 minor

Summary. The paper introduces MapDreamer, a generative diffusion model that synthesizes lane-level vector maps with explicit topology directly from a single aerial image. It learns a compact latent representation of lane centerlines and topological relations via a variational autoencoder, predicts graphs using a transformer-based latent diffusion model, conditions each denoising step on dense aerial features through cross-attention, employs a lane cardinality module with background ghost lane latents to handle varying lane counts, and uses a sliding-window global graph aggregation strategy to stitch local tiles into city-scale maps while preserving connectivity. Experiments on UrbanLaneGraph derived from Argoverse 2 demonstrate improved geometric and topological fidelity over non-generative baselines.

Significance. If the empirical improvements hold under full scrutiny, the work could meaningfully advance automated HD map generation for autonomous driving by providing a generative, topology-aware alternative to deterministic methods. The cross-attention conditioning, ghost-lane cardinality mechanism, and sliding-window aggregation address practical challenges of scene alignment, variable cardinality, and scalability; these components represent targeted innovations in applying latent diffusion to structured graph outputs.

major comments (1)

Abstract (conditioning and cardinality paragraph): the central claim that cross-attention on dense aerial features plus the lane cardinality module with ghost lane latents reliably prevents slot collapse and produces accurate topology for varying lane counts is load-bearing for the headline result, yet the abstract provides no implementation details, loss terms, or ablation evidence for this mechanism; without those, the support for the fidelity improvement cannot be evaluated.

minor comments (1)

Abstract: the statement of 'improved geometric and topological fidelity' does not name the specific metrics (e.g., Chamfer distance, topology F1, connectivity error) or list the non-generative baselines, which would be needed to interpret the magnitude of the reported gains.
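Since the referee names Chamfer distance as an example of a metric the abstract could have cited, a minimal sketch of one common definition follows. Whether the paper reports this metric, and in which variant, is not stated above. The symmetric mean form used here is an assumption, and other conventions (sums, squared distances) exist.

```python
import math

def chamfer_distance(pred_points, gt_points):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, averaged over both directions.

    pred_points, gt_points: non-empty lists of (x, y) tuples sampled along lane centerlines.
    """
    def one_way(src, dst):
        # For every source point, distance to its closest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(pred_points, gt_points) + one_way(gt_points, pred_points))

# A predicted lane offset by one unit from the ground truth.
pred = [(0.0, 0.0), (2.0, 0.0)]
gt = [(0.0, 1.0), (2.0, 1.0)]
offset_error = chamfer_distance(pred, gt)
```

Chamfer distance captures geometric closeness only. Two graphs can score well on it while differing in connectivity, which is why the referee also asks for a topology-oriented metric alongside it.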

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the positive assessment of the work's potential impact. We address the single major comment below.


Referee: [—] Abstract (conditioning and cardinality paragraph): the central claim that cross-attention on dense aerial features plus the lane cardinality module with ghost lane latents reliably prevents slot collapse and produces accurate topology for varying lane counts is load-bearing for the headline result, yet the abstract provides no implementation details, loss terms, or ablation evidence for this mechanism; without those, the support for the fidelity improvement cannot be evaluated.

Authors: Abstracts are high-level summaries and conventionally omit implementation details, equations, and ablation results; these elements appear in the main manuscript. Section 3.2 describes the cross-attention conditioning on dense aerial features within the transformer diffusion backbone. Section 3.3 details the lane cardinality module, including the role of background ghost lane latents as a learned buffer to avoid slot collapse under variable cardinality. The relevant loss terms are given in Equations (3) (VAE) and (5) (diffusion). Ablations quantifying the contribution of both mechanisms to geometric and topological metrics are reported in Section 4.3 and Table 4. The headline fidelity improvements are therefore supported by the full experimental section rather than the abstract alone. We do not believe the abstract requires expansion with these specifics.

revision: no

Circularity Check

0 steps flagged

No significant circularity; standard generative pipeline with empirical claims


The paper describes a latent diffusion model (VAE + transformer diffusion + cross-attention + cardinality module) trained end-to-end on UrbanLaneGraph data. No derivation chain reduces a claimed prediction or first-principles result to its own fitted inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no parameter is fitted on a subset then renamed as a prediction. The central claims are empirical improvements in geometric/topological fidelity, which are externally falsifiable on held-out data and do not rely on self-referential definitions. This is the expected outcome for a standard ML architecture paper.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach rests on standard generative modeling assumptions plus new components for variable cardinality and global consistency; no machine-checked proofs or external benchmarks cited in abstract.

free parameters (2)

latent dimension size: size of the compact representation learned by the VAE for lane centerlines and topology
ghost lane buffer size: number of background ghost lane latents introduced to prevent slot collapse

axioms (1)

domain assumption: Aerial imagery contains sufficient visual information to determine lane topology and geometry
Invoked when conditioning every denoising step on dense aerial features

invented entities (1)

background ghost lane latents (no independent evidence)
purpose: Learned buffer representations that prevent slot collapse when lane count varies across scenes
Proposed to handle varying number of lanes; no independent evidence provided



Reference graph

Works this paper leans on

31 extracted references · 27 canonical work pages · 1 internal anchor

[1] Bastani, F., He, S., Abbar, S., Alizadeh, M., Balakrishnan, H., Chawla, S., Madden, S., DeWitt, D.: RoadTracer: Automatic extraction of road networks from aerial images. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 4720–4728 (2018). https://doi.org/10.1109/CVPR.2018.00496

[2] Biagioni, J., Eriksson, J.: Inferring road maps from global positioning system traces: Survey and comparative evaluation. Transportation Research Record: Journal of the Transportation Research Board 2291(1), 61–71 (2012). https://doi.org/10.3141/2291-08

[3] Blayney, H., Tian, H., Scott, H., Goldbeck, N., Stetson, C., Angeloudis, P.: Bézier everywhere all at once: Learning drivable lanes as bézier graphs. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 15365–15374 (2024). https://doi.org/10.1109/CVPR52733.2024.01455

[4] Büchner, M., Zürn, J., Todoran, I.G., Valada, A., Burgard, W.: Learning and aggregating lane graphs for urban automated driving. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 13415–13424 (2023). https://doi.org/10.1109/CVPR52729.2023.01289

[5] Cheng, G., Wang, Y., Xu, S., Wang, H., Xiang, S., Pan, C.: Automatic road detection and centerline extraction via cascaded end-to-end convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing 55(6), 3322–3337 (2017). https://doi.org/10.1109/TGRS.2017.2669341

[6] Choi, S., Kim, J., Shin, H., Choi, J.W.: Mask2Map: Vectorized HD map construction using bird’s eye view segmentation masks. In: Eur. Conf. Comput. Vis. (ECCV). pp. 19–36 (2024). https://doi.org/10.1007/978-3-031-72890-7_2

[7] Gao, W., Fu, J., Shen, Y., Jing, H., Chen, S., Zheng, N.: Complementing onboard sensors with satellite maps: A new perspective for HD map construction. In: IEEE Int. Conf. Robot. Autom. (ICRA). pp. 11103–11109 (2024). https://doi.org/10.1109/ICRA57147.2024.10611611

[8] Gu, X., Zhang, M., Lyu, J., Ge, Q.: Generating urban road networks with conditional diffusion models. ISPRS International Journal of Geo-Information 13(6) (2024). https://doi.org/10.3390/ijgi13060203

[9] He, S., Balakrishnan, H.: Lane-level street map extraction from aerial imagery. In: IEEE Winter Conf. Appl. Comput. Vis. (WACV). pp. 1496–1505 (2022). https://doi.org/10.1109/WACV51458.2022.00156

[10] He, S., Bastani, F., Jagwani, S., Alizadeh, M., Balakrishnan, H., Chawla, S., Elshrif, M.M., Madden, S., Sadeghi, M.A.: Sat2Graph: Road graph extraction through graph-tensor encoding. In: Eur. Conf. Comput. Vis. (ECCV). pp. 51–67 (2020). https://doi.org/10.1007/978-3-030-58586-0_4

[11] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Adv. Neural Inform. Process. Syst. (NeurIPS). vol. 33, pp. 6840–6851 (2020)

[12] Li, Q., Wang, Y., Wang, Y., Zhao, H.: HDMapNet: An online HD map construction and evaluation framework. In: IEEE Int. Conf. Robot. Autom. (ICRA). pp. 4628–4634 (2022). https://doi.org/10.1109/ICRA46639.2022.9812383

[13] Li, Z., Wegner, J.D., Lucchi, A.: Topological map extraction from overhead images. In: Int. Conf. Comput. Vis. (ICCV). pp. 1715–1724 (2019). https://doi.org/10.1109/ICCV.2019.00180

[14] Liao, B., Chen, S., Zhang, Y., Jiang, B., Zhang, Q., Liu, W., Huang, C., Wang, X.: MapTRv2: An end-to-end framework for online vectorized HD map construction. Int. J. Comput. Vis. (IJCV) (2024). https://doi.org/10.1007/s11263-024-02235-z

[15] Liu, Y., Yuan, T., Wang, Y., Wang, Y., Zhao, H.: VectorMapNet: End-to-end vectorized HD map learning. In: Int. Conf. Mach. Learn. (ICML) (2023)

[16] Monninger, T., Zhang, Z., Mo, Z., Anwar, M.Z., Staab, S., Ding, S.: MapDiffusion: Generative diffusion for vectorized online HD map construction and uncertainty estimation in autonomous driving. In: IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS). pp. 4099–4106 (2025). https://doi.org/10.1109/IROS60139.2025.11247744

[17] Máttyus, G., Luo, W., Urtasun, R.: DeepRoadMapper: Extracting road topology from aerial images. In: Int. Conf. Comput. Vis. (ICCV). pp. 3458–3466 (2017). https://doi.org/10.1109/ICCV.2017.372

[18] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Int. Conf. Comput. Vis. (ICCV). pp. 4172–4182 (2023). https://doi.org/10.1109/ICCV51070.2023.00387

[19] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 10674–10685 (2022). https://doi.org/10.1109/CVPR52688.2022.01042

[20] Rowe, L., Girgis, R., Gosselin, A., Paull, L., Pal, C., Heide, F.: Scenario Dreamer: Vectorized latent diffusion for generating driving simulation environments. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 17207–17218 (2025). https://doi.org/10.1109/CVPR52734.2025.01604

[21] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (2025), https://ar...

[22] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: Int. Conf. Learn. Represent. (ICLR) (2021)

[23] Tan, Y.Q., Gao, S.H., Li, X.Y., Cheng, M.M., Ren, B.: VecRoad: Point-based iterative graph exploration for road graphs extraction. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 8907–8915 (2020). https://doi.org/10.1109/CVPR42600.2020.00893

[24] Umeyama, S.: Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(4), 376–380 (1991). https://doi.org/10.1109/34.88573

[25] Wang, Z., Zhang, W., Zhang, W., Tan, X., Liu, H., Wang, Y., Li, G.: LaneDiffusion: Improving centerline graph learning via prior injected bev feature generation. In: Int. Conf. Comput. Vis. (ICCV). pp. 27052–27062 (2025). https://doi.org/10.1109/ICCV51701.2025.02511

[26] Wilson, B., Qi, W., Agarwal, T., Lambert, J., Singh, J., Khandelwal, S., Pan, B., Kumar, R., Hartnett, A., Pontes, J.K., Ramanan, D., Carr, P., Hays, J.: Argoverse 2: Next generation datasets for self-driving perception and forecasting. In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks) (2021)

[27] Xu, Z., Liu, Y., Gan, L., Sun, Y., Wu, X., Liu, M., Wang, L.: RNGDet: Road network graph detection by transformer in aerial images. IEEE Transactions on Geoscience and Remote Sensing 60, 1–12 (2022). https://doi.org/10.1109/TGRS.2022.3186993

[28] Xu, Z., Liu, Y., Sun, Y., Liu, M., Wang, L.: RNGDet++: Road network graph detection by transformer with instance segmentation and multi-scale features enhancement. IEEE Robotics and Automation Letters 8(5), 2991–2998 (2023). https://doi.org/10.1109/LRA.2023.3264723

[29] Ye, J., Paz, D., Zhang, H., Guo, Y., Huang, X., Christensen, H.I., Wang, Y., Ren, L.: SMART: Advancing scalable map priors for driving topology reasoning. In: IEEE Int. Conf. Robot. Autom. (ICRA). pp. 3298–3304 (2025). https://doi.org/10.1109/ICRA55743.2025.11127994

[30] Yin, P., Li, K., Cao, X., Yao, J., Liu, L., Bai, X., Zhou, F., Meng, D.: Towards satellite image road graph extraction: A global-scale dataset and a novel method. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 1527–1537 (2025). https://doi.org/10.1109/CVPR52734.2025.00150

[31] Yuan, T., Liu, Y., Wang, Y., Wang, Y., Zhao, H.: StreamMapNet: Streaming mapping network for vectorized online HD map construction. In: IEEE Winter Conf. Appl. Comput. Vis. (WACV). pp. 7341–7350 (2024). https://doi.org/10.1109/WACV57701.2024.00719

[1] [1]

In: IEEE Conf

Bastani,F.,He, S.,Abbar, S.,Alizadeh, M.,Balakrishnan, H.,Chawla,S., Madden, S., DeWitt, D.: RoadTracer: Automatic extraction of road networks from aerial images. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 4720–4728 (2018).https://doi.org/10.1109/CVPR.2018.00496

work page doi:10.1109/cvpr.2018.00496 2018

[2] [2]

Transportation Research Record: Jour- nal of the Transportation Research Board2291(1), 61–71 (2012).https://doi

Biagioni, J., Eriksson, J.: Inferring road maps from global positioning system traces: Survey and comparative evaluation. Transportation Research Record: Jour- nal of the Transportation Research Board2291(1), 61–71 (2012).https://doi. org/10.3141/2291-08

work page doi:10.3141/2291-08 2012

[3] [3]

In: IEEE Conf

Blayney, H., Tian, H., Scott, H., Goldbeck, N., Stetson, C., Angeloudis, P.: Bézier everywhere all at once: Learning drivable lanes as bézier graphs. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 15365–15374 (2024).https://doi. org/10.1109/CVPR52733.2024.01455

work page doi:10.1109/cvpr52733.2024.01455 2024

[4] [4]

In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Büchner, M., Zürn, J., Todoran, I.G., Valada, A., Burgard, W.: Learning and aggregating lane graphs for urban automated driving. In: IEEE Conf. Comput. Vis.PatternRecog.(CVPR).pp.13415–13424(2023).https://doi.org/10.1109/ CVPR52729.2023.01289

work page arXiv 2023

[5] [5]

IEEE Transactions on Geoscience and Remote Sensing55(6), 3322–3337 (2017).https://doi.org/10.1109/TGRS.2017.2669341

Cheng, G., Wang, Y., Xu, S., Wang, H., Xiang, S., Pan, C.: Automatic road detec- tion and centerline extraction via cascaded end-to-end convolutional neural net- work. IEEE Transactions on Geoscience and Remote Sensing55(6), 3322–3337 (2017).https://doi.org/10.1109/TGRS.2017.2669341

work page doi:10.1109/tgrs.2017.2669341 2017

[6] [6]

Choi, S., Kim, J., Shin, H., Choi, J.W.: Mask2Map: Vectorized HD map con- struction using bird’s eye view segmentation masks. In: Eur. Conf. Comput. Vis. (ECCV). pp. 19–36 (2024).https://doi.org/10.1007/978-3-031-72890-7_2

work page doi:10.1007/978-3-031-72890-7_2 2024

[7] [7]

In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

Gao, W., Fu, J., Shen, Y., Jing, H., Chen, S., Zheng, N.: Complementing onboard sensors with satellite maps: A new perspective for HD map construction. In: IEEE Int. Conf. Robot. Autom. (ICRA). pp. 11103–11109 (2024).https://doi.org/10. 1109/ICRA57147.2024.10611611

work page arXiv 2024

[8] [8]

ISPRS International Journal of Geo-Information13(6) (2024).https://doi.org/10.3390/ijgi13060203

Gu, X., Zhang, M., Lyu, J., Ge, Q.: Generating urban road networks with con- ditional diffusion models. ISPRS International Journal of Geo-Information13(6) (2024).https://doi.org/10.3390/ijgi13060203

work page doi:10.3390/ijgi13060203 2024

[9] [9]

In: IEEE Winter Conf

He, S., Balakrishnan, H.: Lane-level street map extraction from aerial imagery. In: IEEE Winter Conf. Appl. Comput. Vis. (WACV). pp. 1496–1505 (2022).https: //doi.org/10.1109/WACV51458.2022.00156

work page doi:10.1109/wacv51458.2022.00156 2022

[10] [10]

He,S.,Bastani,F.,Jagwani,S.,Alizadeh,M.,Balakrishnan,H.,Chawla,S.,Elshrif, M.M., Madden, S., Sadeghi, M.A.: Sat2Graph: Road graph extraction through graph-tensor encoding. In: Eur. Conf. Comput. Vis. (ECCV). pp. 51–67 (2020). https://doi.org/10.1007/978-3-030-58586-0_4

work page doi:10.1007/978-3-030-58586-0_4 2020

[11] [11]

Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Adv. Neural Inform. Process. Syst. (NeurIPS). vol. 33, pp. 6840–6851 (2020)

2020

[12] [12]

In: 2022 International Conference on Robotics and Automation (ICRA)

Li, Q., Wang, Y., Wang, Y., Zhao, H.: HDMapNet: An online HD map construction and evaluation framework. In: IEEE Int. Conf. Robot. Autom. (ICRA). pp. 4628– 4634 (2022).https://doi.org/10.1109/ICRA46639.2022.9812383

work page doi:10.1109/icra46639.2022.9812383 2022

[13] [13]

Li, Z., Wegner, J.D., Lucchi, A.: Topological map extraction from overhead images. In: Int. Conf. Comput. Vis. (ICCV). pp. 1715–1724 (2019).https://doi.org/10. 1109/ICCV.2019.00180

work page arXiv 2019

[14] [14]

Liao, B., Chen, S., Zhang, Y., Jiang, B., Zhang, Q., Liu, W., Huang, C., Wang, X.: MapTRv2: An end-to-end framework for online vectorized HD map construction. Int. J. Comput. Vis. (IJCV) (2024).https://doi.org/10.1007/s11263- 024- 02235-z MapDreamer: LDM Map Generation from Aerial Imagery 17

work page doi:10.1007/s11263- 2024

[15] [15]

Liu, Y., Yuan, T., Wang, Y., Wang, Y., Zhao, H.: VectorMapNet: End-to-end vectorized HD map learning. In: Int. Conf. Mach. Learn. (ICML) (2023)

2023

[16] [16]

In: IEEE/RSJ Int

Monninger, T., Zhang, Z., Mo, Z., Anwar, M.Z., Staab, S., Ding, S.: MapDiffu- sion: Generative diffusion for vectorized online HD map construction and uncer- tainty estimation in autonomous driving. In: IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS). pp. 4099–4106 (2025).https://doi.org/10.1109/IROS60139.2025. 11247744

work page doi:10.1109/iros60139.2025 2025

[17] [17]

Máttyus, G., Luo, W., Urtasun, R.: DeepRoadMapper: Extracting road topology from aerial images. In: Int. Conf. Comput. Vis. (ICCV). pp. 3458–3466 (2017). https://doi.org/10.1109/ICCV.2017.372

work page doi:10.1109/iccv.2017.372 2017

[18] [18]

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Int. Conf. Comput. Vis. (ICCV). pp. 4172–4182 (2023).https : / / doi . org / 10 . 1109 / ICCV51070.2023.00387

work page arXiv 2023

[19] [19]

In: IEEE Conf

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 10674–10685 (2022).https://doi.org/10.1109/CVPR52688. 2022.01042

work page doi:10.1109/cvpr52688 2022

[20] [20]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Rowe, L., Girgis, R., Gosselin, A., Paull, L., Pal, C., Heide, F.: Scenario Dreamer: Vectorized latent diffusion for generating driving simulation environments. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 17207–17218 (2025). https://doi.org/10.1109/CVPR52734.2025.01604

[21]

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (2025),https://ar...

[22]

Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: Int. Conf. Learn. Represent. (ICLR) (2021)

[23]

Tan, Y.Q., Gao, S.H., Li, X.Y., Cheng, M.M., Ren, B.: VecRoad: Point-based iterative graph exploration for road graphs extraction. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 8907–8915 (2020). https://doi.org/10.1109/CVPR42600.2020.00893

[24]

Umeyama, S.: Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(4), 376–380 (1991). https://doi.org/10.1109/34.88573

[25]

Wang, Z., Zhang, W., Zhang, W., Tan, X., Liu, H., Wang, Y., Li, G.: LaneDiffusion: Improving centerline graph learning via prior injected BEV feature generation. In: Int. Conf. Comput. Vis. (ICCV). pp. 27052–27062 (2025). https://doi.org/10.1109/ICCV51701.2025.02511

[26]

Wilson, B., Qi, W., Agarwal, T., Lambert, J., Singh, J., Khandelwal, S., Pan, B., Kumar, R., Hartnett, A., Pontes, J.K., Ramanan, D., Carr, P., Hays, J.: Argoverse 2: Next generation datasets for self-driving perception and forecasting. In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Be...

[27]

Xu, Z., Liu, Y., Gan, L., Sun, Y., Wu, X., Liu, M., Wang, L.: RNGDet: Road network graph detection by transformer in aerial images. IEEE Transactions on Geoscience and Remote Sensing 60, 1–12 (2022). https://doi.org/10.1109/TGRS.2022.3186993

[28]

Xu, Z., Liu, Y., Sun, Y., Liu, M., Wang, L.: RNGDet++: Road network graph detection by transformer with instance segmentation and multi-scale features enhancement. IEEE Robotics and Automation Letters 8(5), 2991–2998 (2023). https://doi.org/10.1109/LRA.2023.3264723

[29]

Ye, J., Paz, D., Zhang, H., Guo, Y., Huang, X., Christensen, H.I., Wang, Y., Ren, L.: SMART: Advancing scalable map priors for driving topology reasoning. In: IEEE Int. Conf. Robot. Autom. (ICRA). pp. 3298–3304 (2025). https://doi.org/10.1109/ICRA55743.2025.11127994

[30]

Yin, P., Li, K., Cao, X., Yao, J., Liu, L., Bai, X., Zhou, F., Meng, D.: Towards satellite image road graph extraction: A global-scale dataset and a novel method. In: IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). pp. 1527–1537 (2025). https://doi.org/10.1109/CVPR52734.2025.00150

[31]

Yuan, T., Liu, Y., Wang, Y., Wang, Y., Zhao, H.: StreamMapNet: Streaming mapping network for vectorized online HD map construction. In: IEEE Winter Conf. Appl. Comput. Vis. (WACV). pp. 7341–7350 (2024). https://doi.org/10.1109/WACV57701.2024.00719

Supplementary Material

In the supplementary material ...
