TOL: Textual Localization with OpenStreetMap
Pith reviewed 2026-05-13 21:40 UTC · model grok-4.3
The pith
A coarse-to-fine framework aligns natural language scene descriptions with OpenStreetMap tiles to estimate 2D urban positions without images or GPS.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a coarse-to-fine pipeline, which retrieves candidate locations with direction-aware global descriptors and then jointly processes the text descriptor and local OSM features in an alignment module, can regress 2-DoF pose from textual scene descriptions alone, outperforming the best existing method by 6.53%, 9.93%, and 8.32% at the 5 m, 10 m, and 25 m thresholds, respectively, and generalizing to unseen environments.
What carries the argument
TOLoc, a coarse-to-fine framework that builds direction-aware global descriptors from text and OSM tiles for retrieval, then applies a dedicated alignment module to fuse the textual descriptor with local map features and regress the 2-DoF pose.
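To make the two stages concrete, here is a minimal sketch of the inference loop in Python. It is not the authors' released code: encode_text, encode_pair, and regress_pose are hypothetical stand-ins for the learned direction-aware encoders and alignment module, and the tile descriptors are assumed precomputed and L2-normalized.

```python
import numpy as np

def localize(query_text, tile_descriptors, tile_ids,
             encode_text, encode_pair, regress_pose, top_k=5):
    """Coarse-to-fine text-to-OSM localization sketch."""
    # Coarse stage: embed the query and rank all OSM tiles by cosine similarity.
    q = encode_text(query_text)              # (d,) direction-aware text descriptor
    q = q / np.linalg.norm(q)
    sims = tile_descriptors @ q              # (n_tiles,) similarity scores
    candidates = np.argsort(-sims)[:top_k]   # indices of the best candidate tiles

    # Fine stage: fuse the query with the top-1 tile's local map features
    # and regress a 2-DoF position inside that tile.
    best = candidates[0]
    fused = encode_pair(query_text, tile_ids[best])
    x, y = regress_pose(fused)               # position in tile coordinates
    return tile_ids[best], (x, y)
```

Note the structural consequence: the fine stage only ever sees the top-1 tile, so a coarse-stage miss cannot be recovered downstream; this is the dependency the referee report below presses on.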
If this is right
- Text queries alone suffice to retrieve and refine positions within a few meters on OSM tiles across multiple continents.
- Direction-aware features improve global retrieval enough to make the subsequent fine alignment accurate.
- The approach generalizes to environments not encountered during training.
- OSM tiles provide sufficient semantic and structural detail for large-scale text-based localization.
- The released TOL dataset supports further development of text-to-map methods.
Where Pith is reading between the lines
- Voice interfaces could translate spoken scene descriptions into map positions for navigation or information retrieval.
- The same pipeline might extend to other freely available map layers or to indoor floor plans.
- Pairing the method with lightweight visual checks could create hybrid systems that remain functional when cameras are unavailable.
- City-scale language localization opens privacy-preserving alternatives to continuous GPS tracking.
Load-bearing premise
Natural-language scene descriptions contain enough unique semantic and directional information to be reliably matched to the structures encoded in OSM tiles without any geometric observations or initial location estimates.
What would settle it
Collect fresh text descriptions from users in a fourth city never used in training or evaluation and measure whether the fraction of queries that retrieve the correct tile within 25 m falls to chance level after the full coarse-to-fine process.
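A hedged sketch of how that test could be scored: the snippet below computes localization recall at the paper's distance thresholds plus a uniform-chance reference for the 25 m criterion. The function and array names are illustrative, not taken from the released code.

```python
import numpy as np

def recall_at_thresholds(pred_xy, gt_xy, thresholds=(5.0, 10.0, 25.0)):
    """Fraction of queries localized within each distance threshold (meters)."""
    errors = np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy), axis=1)
    return {t: float(np.mean(errors <= t)) for t in thresholds}

def uniform_chance_recall(area_m2, radius_m=25.0):
    """Recall a uniformly random guess would achieve over a test region:
    the fraction of the area covered by a success disk of the given radius."""
    return np.pi * radius_m ** 2 / area_m2
```

If recall at 25 m in the held-out city stays far above uniform_chance_recall for the evaluated area, the load-bearing premise survives; if it collapses toward chance, the method is memorizing training geography rather than reading descriptions.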
Original abstract
Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well-suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O localization task, which aims to estimate accurate 2D positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses the textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.32% at 5 m, 10 m, and 25 m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the text-to-OSM (T2O) localization task, releases the TOL benchmark (~121K textual queries paired with OSM tiles spanning 316 km of trajectories in Boston, Karlsruhe, and Singapore), and proposes TOLoc: a coarse-to-fine pipeline that builds direction-aware global descriptors from text and OSM tiles for candidate retrieval, then fuses the query with the top-1 tile in an alignment module to regress 2-DoF pose. It reports that TOLoc outperforms the best existing method by 6.53%, 9.93%, and 8.32% at the 5 m, 10 m, and 25 m thresholds and generalizes to unseen environments.
Significance. If the performance numbers prove reproducible, the work would establish a viable route to meter-level localization from natural-language descriptions using only compact, freely available OSM data, bypassing the need for dense imagery or point clouds in large-scale urban settings and supplying a public benchmark that could seed follow-on research.
major comments (2)
- [Experimental Results] The headline gains (6.53/9.93/8.32% at 5/10/25 m) are presented without baseline implementation details, error bars, ablation studies, or separate coarse-stage retrieval recall; these omissions make it impossible to verify that the direction-aware descriptors and alignment module, rather than dataset-specific artifacts, are responsible for the claimed improvements.
- [Method] Coarse-stage retrieval: the global descriptor matching is asserted to surface the correct OSM tile even when multiple patches share similar object semantics and directional layouts (e.g., four-way intersections with comparable footprints); no retrieval-precision metrics or failure-case analysis are supplied, yet any retrieval error propagates directly to the fine-stage regression and therefore underpins the meter-level accuracy claims.
minor comments (1)
- [Abstract] The claim of 'strong generalization to unseen environments' appears in the abstract and conclusion; a brief statement of the precise train/test split (e.g., whether entire cities are held out) would remove ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the experimental validation and coarse-stage retrieval. We will revise the manuscript to incorporate additional details, metrics, and analyses as outlined below.
Point-by-point responses
Referee: [Experimental Results] The headline gains (6.53/9.93/8.32% at 5/10/25 m) are presented without baseline implementation details, error bars, ablation studies, or separate coarse-stage retrieval recall; these omissions make it impossible to verify that the direction-aware descriptors and alignment module, rather than dataset-specific artifacts, are responsible for the claimed improvements.
Authors: We agree that these omissions limit the ability to fully attribute the gains. In the revised manuscript we will add: detailed implementation descriptions and hyperparameters for all baselines; error bars from multiple runs with different random seeds; ablation studies that isolate the direction-aware descriptors and the alignment module; and separate coarse-stage retrieval recall@K curves. These additions will make the source of the reported improvements transparent. revision: yes
Referee: [Method] Coarse-stage retrieval: the global descriptor matching is asserted to surface the correct OSM tile even when multiple patches share similar object semantics and directional layouts (e.g., four-way intersections with comparable footprints); no retrieval-precision metrics or failure-case analysis are supplied, yet any retrieval error propagates directly to the fine-stage regression and therefore underpins the meter-level accuracy claims.
Authors: We acknowledge that direct retrieval metrics are necessary to substantiate the coarse-stage claims. The revised version will include retrieval precision and recall@K for the coarse stage across the three cities, together with a qualitative failure-case analysis focused on ambiguous intersections and similar directional layouts. These results will show that the direction-aware descriptors reduce retrieval errors that would otherwise propagate to the fine stage. revision: yes
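For reference, the recall@K the authors commit to could be computed as below. This is a sketch under the assumption that coarse-stage scores are available as a query-by-tile similarity matrix; sim_matrix and gt_index are hypothetical names, not identifiers from the paper.

```python
import numpy as np

def retrieval_recall_at_k(sim_matrix, gt_index, ks=(1, 5, 10)):
    """Coarse-stage recall@K from an (n_queries, n_tiles) similarity matrix:
    the fraction of queries whose ground-truth tile ranks in the top K."""
    ranking = np.argsort(-sim_matrix, axis=1)        # tiles sorted by score, best first
    hits = ranking == np.asarray(gt_index)[:, None]  # where the true tile lands
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```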
Circularity Check
Standard supervised retrieval-regression pipeline on held-out benchmark; no derivation reduces to inputs by construction
full rationale
The paper introduces a new TOL benchmark and trains TOLoc (coarse global descriptor retrieval followed by fine-stage 2-DoF regression) on it. Reported gains at 5/10/25 m thresholds are measured on held-out splits against external baselines. No equations, fitted parameters, or self-citations are shown to force the localization metrics by construction; the central claims rest on empirical evaluation of a learned model rather than tautological re-expression of inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- neural network weights
axioms (1)
- domain assumption: Textual descriptions encode directional and semantic information that aligns with OSM object and road features.
Reference graph
Works this paper leans on
- [1] N. A. Giudice, J. Z. Bakdash, and G. E. Legge, "Wayfinding with words: spatial learning and navigation using dynamically updated verbal descriptions," Psychological Research, vol. 71, no. 3, pp. 347–358, 2007.
- [2] J. Li, H. Wang, J. Chen, Y. Liu, Z. Dou, Y. Ma, S. Yang, Y. Li, W. Wang, Z. Dong, et al., "CityAnchor: City-scale 3D visual grounding with multi-modality LLMs," in ICLR, 2025.
- [3] Y. Xia, L. Shi, Z. Ding, J. F. Henriques, and D. Cremers, "Text2Loc: 3D point cloud localization from natural language," in CVPR, 2024, pp. 14958–14967.
- [4] S. Kang, Y. Liao, P. Wang, W. Liao, Q. Zhang, B. Busam, X. Chen, and Y. Liu, "VLM-Loc: Localization in point cloud maps via vision-language models," CVPR, 2026.
- [5] S. Hausler, S. Garg, M. Xu, M. Milford, and T. Fischer, "Patch-NetVLAD: Multi-scale fusion of locally-global descriptors for place recognition," in CVPR, 2021, pp. 14141–14152.
- [6] P. Yin, I. Cisneros, S. Zhao, J. Zhang, H. Choset, and S. Scherer, "iSimLoc: Visual global localization for previously unseen environments with simulated images," IEEE TRO, vol. 39, no. 3, pp. 1893–1909, 2023.
- [7] L. Luo, S.-Y. Cao, X. Li, J. Xu, R. Ai, Z. Yu, and X. Chen, "BEVPlace++: Fast, robust, and lightweight lidar global localization for unmanned ground vehicles," IEEE TRO, 2025.
- [8] S. Kang, M. Y. Liao, Y. Xia, O. Wysocki, B. Jutzi, and D. Cremers, "OPAL: Visibility-aware lidar-to-OpenStreetMap place recognition via adaptive radial fusion," CoRL, 2025.
- [9] M. Kolmet, Q. Zhou, A. Ošep, and L. Leal-Taixé, "Text2Pos: Text-to-point-cloud cross-modal localization," in CVPR, 2022, pp. 6687–6696.
- [10] G. Wang, H. Fan, and M. Kankanhalli, "Text to point cloud localization with relation-enhanced transformer," in AAAI, vol. 37, no. 2, 2023, pp. 2501–2509.
- [11] Y. Xu, H. Qu, J. Liu, W. Zhang, and X. Yang, "CMMLoc: Advancing text-to-pointcloud localization with Cauchy-mixture-model based framework," in CVPR, 2025, pp. 6637–6647.
- [12] D. Jung, K. Kim, and S.-W. Kim, "GOTPR: General outdoor text-based place recognition using scene graph retrieval with OpenStreetMap," IEEE RA-L, 2025.
- [13] J. Ye, H. Lin, L. Ou, D. Chen, Z. Wang, Q. Zhu, C. He, and W. Li, "Where am I? Cross-view geo-localization with natural language descriptions," in ICCV, 2025, pp. 5890–5900.
- [14] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al., "GPT-4o system card," arXiv preprint arXiv:2410.21276, 2024.
- [15] D. Liu, S. Huang, W. Li, S. Shen, and C. Wang, "Text to point cloud localization with multi-level negative contrastive learning," in AAAI, vol. 39, no. 5, 2025, pp. 5397–5405.
- [16] M. Feng, L. Mei, Z. Wu, J. Luo, F. Tian, J. Feng, W. Dong, and Y. Wang, "Partially matching submap helps: Uncertainty modeling and propagation for text to point cloud localization," in ICCV, 2025, pp. 8296–8305.
- [17] M. Chu, Z. Zheng, W. Ji, T. Wang, and T.-S. Chua, "Towards natural language-guided drones: GeoText-1652 benchmark with spatial relation matching," in ECCV. Springer, 2024, pp. 213–231.
- [18] Z. Zheng, Y. Wei, and Y. Yang, "University-1652: A multi-view multi-source benchmark for drone-based geo-localization," in ACM MM, 2020, pp. 1395–1403.
- [19] H. Ruan, J. Lin, Y. Lai, Z. Luo, and S. Li, "HCCM: Hierarchical cross-granularity contrastive and matching learning for natural language-guided drones," in ACM MM, 2025, pp. 4524–4533.
- [20] Y. Ji, B. He, Z. Tan, and L. Wu, "MMGeo: Multimodal compositional geo-localization for UAVs," in ICCV, 2025, pp. 25165–25175.
- [21] G. Floros, B. Van Der Zander, and B. Leibe, "OpenStreetSLAM: Global vehicle localization using OpenStreetMaps," in ICRA. IEEE, 2013, pp. 1054–1059.
- [22] M. Zhou, X. Chen, N. Samano, C. Stachniss, and A. Calway, "Efficient localisation using images and OpenStreetMaps," in IROS. IEEE, 2021, pp. 5507–5513.
- [23] N. Samano, M. Zhou, and A. Calway, "You are here: Geolocation by embedding maps and images," in ECCV. Springer, 2020, pp. 502–518.
- [24] P.-E. Sarlin, D. DeTone, T.-Y. Yang, A. Avetisyan, J. Straub, T. Malisiewicz, S. R. Bulo, R. Newcombe, P. Kontschieder, and V. Balntas, "OrienterNet: Visual localization in 2D public maps with neural matching," in CVPR, 2023, pp. 21632–21642.
- [25] H. Wu, Z. Zhang, S. Lin, X. Mu, Q. Zhao, M. Yang, and T. Qin, "MapLocNet: Coarse-to-fine feature registration for visual re-localization in navigation maps," in IROS. IEEE, 2024, pp. 13198–13205.
- [26] Y. Liao, X. Chen, S. Kang, J. Li, Z. Dong, H. Fan, and B. Yang, "OSMLoc: Single image-based visual localization in OpenStreetMap with fused geometric and semantic guidance," arXiv preprint arXiv:2411.08665, 2024.
- [27] P. Ruchti, B. Steder, M. Ruhnke, and W. Burgard, "Localization on OpenStreetMap data using a 3D laser scanner," in ICRA. IEEE, 2015, pp. 5260–5265.
- [28] O. Vysotska and C. Stachniss, "Exploiting building information from publicly available maps in graph-based SLAM," in IROS. IEEE, 2016, pp. 4511–4516.
- [29] B. Suger and W. Burgard, "Global outer-urban navigation with OpenStreetMap," in ICRA. IEEE, 2017, pp. 1417–1422.
- [30] S. Lee and J.-H. Ryu, "Autonomous vehicle localization without prior high-definition map," IEEE TRO, vol. 40, pp. 2888–2906, 2024.
- [31] Y. Cho, G. Kim, S. Lee, and J.-H. Ryu, "OpenStreetMap-based lidar global localization in urban environment without a prior lidar map," IEEE RA-L, vol. 7, no. 2, pp. 4999–5006, 2022.
- [32] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuScenes: A multimodal dataset for autonomous driving," in CVPR, 2020, pp. 11621–11631.
- [33] Y. Liao, J. Xie, and A. Geiger, "KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2D and 3D," IEEE TPAMI, vol. 45, no. 3, pp. 3292–3310, 2022.
- [34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in ICML. PMLR, 2021, pp. 8748–8763.
- [35] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, "Sigmoid loss for language image pre-training," in ICCV, 2023, pp. 11975–11986.