pith. machine review for the scientific record.

arxiv: 2604.01644 · v2 · submitted 2026-04-02 · 💻 cs.CV · cs.MM

Recognition: unknown

TOL: Textual Localization with OpenStreetMap

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:40 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords textual localization · OpenStreetMap · text-to-OSM · coarse-to-fine localization · direction-aware features · 2-DoF pose regression · urban positioning · TOL benchmark

The pith

A coarse-to-fine framework aligns natural language scene descriptions with OpenStreetMap tiles to estimate 2D urban positions without images or GPS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates text-to-OSM localization as the problem of estimating accurate 2D positions from textual descriptions of surrounding objects and their directions. It releases the TOL benchmark of roughly 121,000 text queries paired with OSM tiles spanning 316 km of roads in Boston, Karlsruhe, and Singapore. The TOLoc method first extracts direction-aware features from both the text and map tiles to retrieve a short list of candidate locations, then fuses the query text with features from the top candidate tile to regress the final pose. This yields better accuracy than prior methods at 5 m, 10 m, and 25 m thresholds while generalizing to unseen cities. The work shows that compact, freely available map data can support language-driven positioning at city scale.
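
The coarse-to-fine split can be pictured as two calls: rank tiles by descriptor similarity, then regress a position against the top-1 tile's local features. A minimal sketch, assuming hypothetical text_encoder and pose_regressor callables and plain dot-product similarity; the paper's actual encoders, similarity measure, and fusion module are not reproduced here.

```python
import numpy as np

def localize(query_text, tile_descriptors, tile_feature_maps,
             text_encoder, pose_regressor, top_k=5):
    """Illustrative coarse-to-fine text-to-OSM localization.

    tile_descriptors: (N, D) array, one global descriptor per OSM tile.
    tile_feature_maps: length-N list of local feature maps, one per tile.
    text_encoder / pose_regressor: stand-ins for the learned modules.
    """
    # Coarse stage: embed the query and rank tiles by descriptor similarity.
    q = text_encoder(query_text)               # (D,) global text descriptor
    scores = tile_descriptors @ q              # similarity to every tile
    candidates = np.argsort(-scores)[:top_k]   # top-K candidate tile indices

    # Fine stage: fuse the text descriptor with the top-1 tile's local
    # features and regress the 2-DoF position inside that tile.
    best = int(candidates[0])
    xy_in_tile = pose_regressor(q, tile_feature_maps[best])
    return best, xy_in_tile, candidates
```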

Core claim

The central claim is that a coarse-to-fine pipeline using direction-aware global descriptors for candidate retrieval followed by an alignment module that jointly processes the text descriptor and local OSM features can regress 2-DoF pose from textual scene descriptions alone, outperforming the best existing method by 6.53 percent, 9.93 percent, and 8.32 percent at the 5 m, 10 m, and 25 m thresholds respectively and generalizing to unseen environments.
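
The 5 m / 10 m / 25 m figures are localization success rates at fixed distance thresholds. A minimal sketch of how such numbers are conventionally computed from predicted and ground-truth 2D positions; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def localization_recall(pred_xy, gt_xy, thresholds=(5.0, 10.0, 25.0)):
    """Fraction of queries whose predicted 2D position lies within each
    distance threshold (in meters) of the ground truth."""
    errors = np.linalg.norm(np.asarray(pred_xy, float) - np.asarray(gt_xy, float), axis=1)
    return {t: float((errors <= t).mean()) for t in thresholds}

# Three queries with position errors of 3 m, 12 m, and 20 m:
print(localization_recall([[0, 3], [12, 0], [0, 20]], [[0, 0]] * 3))
# {5.0: 0.333..., 10.0: 0.333..., 25.0: 1.0}
```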

What carries the argument

The TOLoc coarse-to-fine framework: it builds direction-aware global descriptors from text and OSM tiles for retrieval, then applies a dedicated alignment module that fuses the textual descriptor with local map features and regresses the 2-DoF pose.
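
The paper describes the direction-aware descriptors only at a high level. One illustrative construction, not the paper's learned features, bins described objects into angular sectors so that identical object sets in different directions yield different descriptors; the CATEGORIES vocabulary and 8-sector binning below are assumptions made for the sketch.

```python
import numpy as np

# Hypothetical category vocabulary and sector count; the paper's actual
# features are learned and not reproduced here.
CATEGORIES = ["building", "road", "park", "water", "parking", "rail"]
N_SECTORS = 8  # 45-degree direction bins: N, NE, E, ...

def direction_aware_descriptor(objects):
    """objects: iterable of (category, bearing_deg) pairs parsed from a text
    description or from an OSM tile. Returns a flattened, L2-normalized
    (N_SECTORS * len(CATEGORIES),) histogram descriptor."""
    hist = np.zeros((N_SECTORS, len(CATEGORIES)))
    for category, bearing in objects:
        if category not in CATEGORIES:
            continue
        sector = int((bearing % 360.0) // (360.0 / N_SECTORS))
        hist[sector, CATEGORIES.index(category)] += 1.0
    norm = np.linalg.norm(hist)
    return (hist / norm).ravel() if norm > 0 else hist.ravel()

# "A park to the north and a building to the east" vs. the reverse layout
# yield orthogonal descriptors even though the object sets are identical.
a = direction_aware_descriptor([("park", 0), ("building", 90)])
b = direction_aware_descriptor([("park", 90), ("building", 0)])
print(float(a @ b))  # 0.0 -- direction disambiguates identical semantics
```

The same binning could be applied to objects rasterized from an OSM tile, which is what would make the text and map descriptors comparable in a shared space.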

If this is right

  • Text queries alone suffice to retrieve and refine positions within a few meters on OSM tiles across multiple continents.
  • Direction-aware features improve global retrieval enough to make the subsequent fine alignment accurate.
  • The approach generalizes to environments not encountered during training.
  • OSM tiles provide sufficient semantic and structural detail for large-scale text-based localization.
  • The released TOL dataset supports further development of text-to-map methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Voice interfaces could translate spoken scene descriptions into map positions for navigation or information retrieval.
  • The same pipeline might extend to other freely available map layers or to indoor floor plans.
  • Pairing the method with lightweight visual checks could create hybrid systems that remain functional when cameras are unavailable.
  • City-scale language localization opens privacy-preserving alternatives to continuous GPS tracking.

Load-bearing premise

Natural-language scene descriptions contain enough unique semantic and directional information to be reliably matched to the structures encoded in OSM tiles without any geometric observations or initial location estimates.

What would settle it

Collect fresh text descriptions from users in a fourth city never used in training or evaluation, and measure whether the fraction of queries localized within 25 m of the true position by the full coarse-to-fine pipeline falls to chance level or stays well above it.
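
The proposed test needs an operational definition of chance level. A minimal sketch, assuming a uniformly random position guess over the evaluation area as the chance baseline (one reasonable choice among several); names and bounds are illustrative.

```python
import numpy as np

def chance_success_rate(gt_xy, area_bounds, threshold=25.0,
                        n_trials=10000, seed=0):
    """Estimate how often a uniformly random guess over the evaluation area
    lands within `threshold` meters of a ground-truth position.

    gt_xy: (N, 2) ground-truth positions in meters.
    area_bounds: ((x_min, x_max), (y_min, y_max)) of the covered area, meters.
    """
    rng = np.random.default_rng(seed)
    gt_xy = np.asarray(gt_xy, dtype=float)
    (x_min, x_max), (y_min, y_max) = area_bounds
    hits = 0
    for _ in range(n_trials):
        i = rng.integers(len(gt_xy))                      # random query
        guess = rng.uniform([x_min, y_min], [x_max, y_max])
        hits += np.linalg.norm(guess - gt_xy[i]) <= threshold
    return hits / n_trials

# Over a 5 km x 5 km area, a 25 m-radius hit covers roughly 0.008% of the
# area, so any success rate in the tens of percent sits far above chance.
```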

Figures

Figures reproduced from arXiv: 2604.01644 by Bisheng Yang, Jianping Li, Jingyu Xu, Olaf Wysocki, Shuhao Kang, Xieyuanli Chen, Yan Xia, Youqi Liao, Zhen Dong.

Figure 1: Overview and motivation of text-to-OSM localization. (a) …
Figure 2: Illustration of rasterized OSM tiles with area, way, and node …
Figure 3: Pipeline of TOLoc. Given a query text T and an OSM database O = {O_j}_{j=1}^{Z}, TOLoc performs coarse-to-fine localization. It first learns text–map correspondences to retrieve the top-K candidate tiles, and then aligns the text descriptor d_T with the feature map F_O of the top-1 tile to predict the final 2-DoF position. The proposed TOA module fuses local OSM features and textual cues through self-attention a…
Figure 4: Qualitative results on the TOL benchmark. In columns 3–5, green boxes …
Original abstract

Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well-suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O localization task, which aims to estimate accurate 2D positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses the textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.32% at 5 m, 10 m, and 25 m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the text-to-OSM (T2O) localization task, releases the TOL benchmark (~121K textual queries paired with OSM tiles spanning 316 km of trajectories in Boston, Karlsruhe, and Singapore), and proposes TOLoc: a coarse-to-fine pipeline that builds direction-aware global descriptors from text and OSM tiles for candidate retrieval, then fuses the query with the top-1 tile in an alignment module to regress 2-DoF pose. It reports that TOLoc outperforms the best existing method by 6.53%, 9.93%, and 8.32% at the 5 m, 10 m, and 25 m thresholds and generalizes to unseen environments.

Significance. If the performance numbers prove reproducible, the work would establish a viable route to meter-level localization from natural-language descriptions using only compact, freely available OSM data, bypassing the need for dense imagery or point clouds in large-scale urban settings and supplying a public benchmark that could seed follow-on research.

major comments (2)
  1. [Experimental Results] Experimental Results section: the headline gains (6.53/9.93/8.32 % at 5/10/25 m) are presented without baseline implementation details, error bars, ablation studies, or separate coarse-stage retrieval recall; these omissions make it impossible to verify that the direction-aware descriptors and alignment module are responsible for the claimed improvements rather than dataset-specific artifacts.
  2. [Method] Coarse-stage retrieval (Method): the global descriptor matching is asserted to surface the correct OSM tile even when multiple patches share similar object semantics and directional layouts (e.g., four-way intersections with comparable footprints); no retrieval-precision metrics or failure-case analysis are supplied, yet any retrieval error directly propagates to the fine-stage regression and therefore underpins the meter-level accuracy claims.
minor comments (1)
  1. [Abstract] The claim of 'strong generalization to unseen environments' appears in the abstract and conclusion; a brief statement of the precise train/test split (e.g., whether entire cities are held out) would remove ambiguity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the experimental validation and coarse-stage retrieval. We will revise the manuscript to incorporate additional details, metrics, and analyses as outlined below.

Point-by-point responses
  1. Referee: [Experimental Results] Experimental Results section: the headline gains (6.53/9.93/8.32 % at 5/10/25 m) are presented without baseline implementation details, error bars, ablation studies, or separate coarse-stage retrieval recall; these omissions make it impossible to verify that the direction-aware descriptors and alignment module are responsible for the claimed improvements rather than dataset-specific artifacts.

    Authors: We agree that these omissions limit the ability to fully attribute the gains. In the revised manuscript we will add: detailed implementation descriptions and hyperparameters for all baselines; error bars from multiple runs with different random seeds; ablation studies that isolate the direction-aware descriptors and the alignment module; and separate coarse-stage retrieval recall@K curves. These additions will make the source of the reported improvements transparent. revision: yes

  2. Referee: [Method] Coarse-stage retrieval (Method): the global descriptor matching is asserted to surface the correct OSM tile even when multiple patches share similar object semantics and directional layouts (e.g., four-way intersections with comparable footprints); no retrieval-precision metrics or failure-case analysis are supplied, yet any retrieval error directly propagates to the fine-stage regression and therefore underpins the meter-level accuracy claims.

    Authors: We acknowledge that direct retrieval metrics are necessary to substantiate the coarse-stage claims. The revised version will include retrieval precision and recall@K for the coarse stage across the three cities, together with a qualitative failure-case analysis focused on ambiguous intersections and similar directional layouts. These results will show that the direction-aware descriptors reduce retrieval errors that would otherwise propagate to the fine stage (see the recall@K sketch after these responses). revision: yes
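
The recall@K curves promised above are a standard retrieval measure: the fraction of queries whose correct tile appears among the top-K retrieved candidates. A minimal sketch, assuming one ground-truth tile ID per query and a ranked candidate list; the authors' exact protocol and error-bar computation may differ.

```python
import numpy as np

def recall_at_k(ranked_tile_ids, gt_tile_ids, ks=(1, 3, 5, 10)):
    """ranked_tile_ids: (N, M) retrieved tile IDs per query, best first.
    gt_tile_ids: (N,) correct tile ID per query."""
    ranked = np.asarray(ranked_tile_ids)
    gt = np.asarray(gt_tile_ids).reshape(-1, 1)
    return {k: float((ranked[:, :k] == gt).any(axis=1).mean()) for k in ks}

def mean_and_std_over_seeds(per_seed_values):
    """Error bars of the kind promised in the rebuttal: mean and sample
    standard deviation of a metric across runs with different seeds."""
    vals = np.asarray(per_seed_values, dtype=float)
    return float(vals.mean()), float(vals.std(ddof=1))
```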

Circularity Check

0 steps flagged

Standard supervised retrieval-regression pipeline on held-out benchmark; no derivation reduces to inputs by construction

Full rationale

The paper introduces a new TOL benchmark and trains TOLoc (coarse global descriptor retrieval followed by fine-stage 2-DoF regression) on it. Reported gains at 5/10/25 m thresholds are measured on held-out splits against external baselines. No equations, fitted parameters, or self-citations are shown to force the localization metrics by construction; the central claims rest on empirical evaluation of a learned model rather than tautological re-expression of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claim rests on the assumption that textual descriptions contain sufficient directional and semantic cues to match OSM features, plus the learned parameters of the neural feature extractors and alignment module.

free parameters (1)
  • neural network weights
    Parameters of the direction-aware encoders and pose regression head are fitted on the TOL training split to minimize localization error.
axioms (1)
  • domain assumption: Textual descriptions encode directional and semantic information that aligns with OSM object and road features
    Invoked in both the coarse global descriptor stage and the fine alignment module.

pith-pipeline@v0.9.0 · 5651 in / 1385 out tokens · 52453 ms · 2026-05-13T21:40:16.342502+00:00 · methodology

