pith. machine review for the scientific record.

arxiv: 2604.01644 · v2 · submitted 2026-04-02 · 💻 cs.CV · cs.MM

Recognition: unknown

TOL: Textual Localization with OpenStreetMap

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 21:40 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords textual localization · OpenStreetMap · text-to-OSM · coarse-to-fine localization · direction-aware features · 2-DoF pose regression · urban positioning · TOL benchmark

The pith

A coarse-to-fine framework aligns natural language scene descriptions with OpenStreetMap tiles to estimate 2D urban positions without images or GPS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates text-to-OSM localization as the problem of estimating accurate 2D positions from textual descriptions of surrounding objects and their directions. It releases the TOL benchmark of roughly 121,000 text queries paired with OSM tiles spanning 316 km of roads in Boston, Karlsruhe, and Singapore. The TOLoc method first extracts direction-aware features from both the text and map tiles to retrieve a short list of candidate locations, then fuses the query text with features from the top candidate tile to regress the final pose. This yields better accuracy than prior methods at 5 m, 10 m, and 25 m thresholds while generalizing to unseen cities. The work shows that compact, freely available map data can support language-driven positioning at city scale.
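
The coarse-to-fine split can be pictured as two calls: rank tiles by descriptor similarity, then regress a position against the top-1 tile's local features. A minimal sketch, assuming hypothetical text_encoder and pose_regressor callables and plain dot-product similarity; the paper's actual encoders, similarity measure, and fusion module are not reproduced here.

```python
import numpy as np

def localize(query_text, tile_descriptors, tile_feature_maps,
             text_encoder, pose_regressor, top_k=5):
    """Illustrative coarse-to-fine text-to-OSM localization.

    tile_descriptors: (N, D) array, one global descriptor per OSM tile.
    tile_feature_maps: length-N list of local feature maps, one per tile.
    text_encoder / pose_regressor: stand-ins for the learned modules.
    """
    # Coarse stage: embed the query and rank tiles by descriptor similarity.
    q = text_encoder(query_text)               # (D,) global text descriptor
    scores = tile_descriptors @ q              # similarity to every tile
    candidates = np.argsort(-scores)[:top_k]   # top-K candidate tile indices

    # Fine stage: fuse the text descriptor with the top-1 tile's local
    # features and regress the 2-DoF position inside that tile.
    best = int(candidates[0])
    xy_in_tile = pose_regressor(q, tile_feature_maps[best])
    return best, xy_in_tile, candidates
```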

Core claim

The central claim is that a coarse-to-fine pipeline using direction-aware global descriptors for candidate retrieval followed by an alignment module that jointly processes the text descriptor and local OSM features can regress 2-DoF pose from textual scene descriptions alone, outperforming the best existing method by 6.53 percent, 9.93 percent, and 8.32 percent at the 5 m, 10 m, and 25 m thresholds respectively and generalizing to unseen environments.
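
The 5 m / 10 m / 25 m figures are localization success rates at fixed distance thresholds. A minimal sketch of how such numbers are conventionally computed from predicted and ground-truth 2D positions; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def localization_recall(pred_xy, gt_xy, thresholds=(5.0, 10.0, 25.0)):
    """Fraction of queries whose predicted 2D position lies within each
    distance threshold (in meters) of the ground truth."""
    errors = np.linalg.norm(np.asarray(pred_xy, float) - np.asarray(gt_xy, float), axis=1)
    return {t: float((errors <= t).mean()) for t in thresholds}

# Three queries with position errors of 3 m, 12 m, and 20 m:
print(localization_recall([[0, 3], [12, 0], [0, 20]], [[0, 0]] * 3))
# {5.0: 0.333..., 10.0: 0.333..., 25.0: 1.0}
```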

What carries the argument

The TOLoc coarse-to-fine framework: it builds direction-aware global descriptors from text and OSM tiles for retrieval, then applies a dedicated alignment module that fuses the textual descriptor with local map features and regresses the 2-DoF pose.
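
The paper describes the direction-aware descriptors only at a high level. One illustrative construction, not the paper's learned features, bins described objects into angular sectors so that identical object sets in different directions yield different descriptors; the CATEGORIES vocabulary and 8-sector binning below are assumptions made for the sketch.

```python
import numpy as np

# Hypothetical category vocabulary and sector count; the paper's actual
# features are learned and not reproduced here.
CATEGORIES = ["building", "road", "park", "water", "parking", "rail"]
N_SECTORS = 8  # 45-degree direction bins: N, NE, E, ...

def direction_aware_descriptor(objects):
    """objects: iterable of (category, bearing_deg) pairs parsed from a text
    description or from an OSM tile. Returns a flattened, L2-normalized
    (N_SECTORS * len(CATEGORIES),) histogram descriptor."""
    hist = np.zeros((N_SECTORS, len(CATEGORIES)))
    for category, bearing in objects:
        if category not in CATEGORIES:
            continue
        sector = int((bearing % 360.0) // (360.0 / N_SECTORS))
        hist[sector, CATEGORIES.index(category)] += 1.0
    norm = np.linalg.norm(hist)
    return (hist / norm).ravel() if norm > 0 else hist.ravel()

# "A park to the north and a building to the east" vs. the reverse layout
# yield orthogonal descriptors even though the object sets are identical.
a = direction_aware_descriptor([("park", 0), ("building", 90)])
b = direction_aware_descriptor([("park", 90), ("building", 0)])
print(float(a @ b))  # 0.0 -- direction disambiguates identical semantics
```

The same binning could be applied to objects rasterized from an OSM tile, which is what would make the text and map descriptors comparable in a shared space.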

If this is right

  • Text queries alone suffice to retrieve and refine positions within a few meters on OSM tiles across multiple continents.
  • Direction-aware features improve global retrieval enough to make the subsequent fine alignment accurate.
  • The approach generalizes to environments not encountered during training.
  • OSM tiles provide sufficient semantic and structural detail for large-scale text-based localization.
  • The released TOL dataset supports further development of text-to-map methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Voice interfaces could translate spoken scene descriptions into map positions for navigation or information retrieval.
  • The same pipeline might extend to other freely available map layers or to indoor floor plans.
  • Pairing the method with lightweight visual checks could create hybrid systems that remain functional when cameras are unavailable.
  • City-scale language localization opens privacy-preserving alternatives to continuous GPS tracking.

Load-bearing premise

Natural-language scene descriptions contain enough unique semantic and directional information to be reliably matched to the structures encoded in OSM tiles without any geometric observations or initial location estimates.

What would settle it

Collect fresh text descriptions from users in a fourth city never used in training or evaluation, and measure whether the fraction of queries localized within 25 m of the true position by the full coarse-to-fine pipeline falls to chance level or stays well above it.
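
The proposed test needs an operational definition of chance level. A minimal sketch, assuming a uniformly random position guess over the evaluation area as the chance baseline (one reasonable choice among several); names and bounds are illustrative.

```python
import numpy as np

def chance_success_rate(gt_xy, area_bounds, threshold=25.0,
                        n_trials=10000, seed=0):
    """Estimate how often a uniformly random guess over the evaluation area
    lands within `threshold` meters of a ground-truth position.

    gt_xy: (N, 2) ground-truth positions in meters.
    area_bounds: ((x_min, x_max), (y_min, y_max)) of the covered area, meters.
    """
    rng = np.random.default_rng(seed)
    gt_xy = np.asarray(gt_xy, dtype=float)
    (x_min, x_max), (y_min, y_max) = area_bounds
    hits = 0
    for _ in range(n_trials):
        i = rng.integers(len(gt_xy))                      # random query
        guess = rng.uniform([x_min, y_min], [x_max, y_max])
        hits += np.linalg.norm(guess - gt_xy[i]) <= threshold
    return hits / n_trials

# Over a 5 km x 5 km area, a 25 m-radius hit covers roughly 0.008% of the
# area, so any success rate in the tens of percent sits far above chance.
```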

Figures

Figures reproduced from arXiv: 2604.01644 by Bisheng Yang, Jianping Li, Jingyu Xu, Olaf Wysocki, Shuhao Kang, Xieyuanli Chen, Yan Xia, Youqi Liao, Zhen Dong.

Figure 1: Overview and motivation of text-to-OSM localization. (a) …
Figure 2: Illustration of rasterized OSM tiles with area, way, and node …
Figure 3: Pipeline of TOLoc. Given a query text T and an OSM database O = {O_j}_{j=1}^{Z}, TOLoc performs coarse-to-fine localization. It first learns text–map correspondences to retrieve the top-K candidate tiles, and then aligns the text descriptor d_T with the feature map F_O of the top-1 tile to predict the final 2-DoF position. The proposed TOA module fuses local OSM features and textual cues through self-attention a…
Figure 4: Qualitative results on the TOL benchmark. In columns 3–5, green boxes …
Original abstract

Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well-suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O localization task, which aims to estimate accurate 2D positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses the textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.32% at 5 m, 10 m, and 25 m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the text-to-OSM (T2O) localization task, releases the TOL benchmark (~121K textual queries paired with OSM tiles spanning 316 km of trajectories in Boston, Karlsruhe, and Singapore), and proposes TOLoc: a coarse-to-fine pipeline that builds direction-aware global descriptors from text and OSM tiles for candidate retrieval, then fuses the query with the top-1 tile in an alignment module to regress 2-DoF pose. It reports that TOLoc outperforms the best existing method by 6.53%, 9.93%, and 8.32% at the 5 m, 10 m, and 25 m thresholds and generalizes to unseen environments.

Significance. If the performance numbers prove reproducible, the work would establish a viable route to meter-level localization from natural-language descriptions using only compact, freely available OSM data, bypassing the need for dense imagery or point clouds in large-scale urban settings and supplying a public benchmark that could seed follow-on research.

major comments (2)
  1. [Experimental Results] Experimental Results section: the headline gains (6.53/9.93/8.32 % at 5/10/25 m) are presented without baseline implementation details, error bars, ablation studies, or separate coarse-stage retrieval recall; these omissions make it impossible to verify that the direction-aware descriptors and alignment module are responsible for the claimed improvements rather than dataset-specific artifacts.
  2. [Method] Coarse-stage retrieval (Method): the global descriptor matching is asserted to surface the correct OSM tile even when multiple patches share similar object semantics and directional layouts (e.g., four-way intersections with comparable footprints); no retrieval-precision metrics or failure-case analysis are supplied, yet any retrieval error directly propagates to the fine-stage regression and therefore underpins the meter-level accuracy claims.
minor comments (1)
  1. [Abstract] The claim of 'strong generalization to unseen environments' appears in the abstract and conclusion; a brief statement of the precise train/test split (e.g., whether entire cities are held out) would remove ambiguity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the experimental validation and coarse-stage retrieval. We will revise the manuscript to incorporate additional details, metrics, and analyses as outlined below.

Point-by-point responses
  1. Referee: [Experimental Results] Experimental Results section: the headline gains (6.53/9.93/8.32 % at 5/10/25 m) are presented without baseline implementation details, error bars, ablation studies, or separate coarse-stage retrieval recall; these omissions make it impossible to verify that the direction-aware descriptors and alignment module are responsible for the claimed improvements rather than dataset-specific artifacts.

    Authors: We agree that these omissions limit the ability to fully attribute the gains. In the revised manuscript we will add: detailed implementation descriptions and hyperparameters for all baselines; error bars from multiple runs with different random seeds; ablation studies that isolate the direction-aware descriptors and the alignment module; and separate coarse-stage retrieval recall@K curves. These additions will make the source of the reported improvements transparent. revision: yes

  2. Referee: [Method] Coarse-stage retrieval (Method): the global descriptor matching is asserted to surface the correct OSM tile even when multiple patches share similar object semantics and directional layouts (e.g., four-way intersections with comparable footprints); no retrieval-precision metrics or failure-case analysis are supplied, yet any retrieval error directly propagates to the fine-stage regression and therefore underpins the meter-level accuracy claims.

    Authors: We acknowledge that direct retrieval metrics are necessary to substantiate the coarse-stage claims. The revised version will include retrieval precision and recall@K for the coarse stage across the three cities, together with a qualitative failure-case analysis focused on ambiguous intersections and similar directional layouts. These results will show that the direction-aware descriptors reduce retrieval errors that would otherwise propagate to the fine stage (see the recall@K sketch after these responses). revision: yes
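
The recall@K curves promised above are a standard retrieval measure: the fraction of queries whose correct tile appears among the top-K retrieved candidates. A minimal sketch, assuming one ground-truth tile ID per query and a ranked candidate list; the authors' exact protocol and error-bar computation may differ.

```python
import numpy as np

def recall_at_k(ranked_tile_ids, gt_tile_ids, ks=(1, 3, 5, 10)):
    """ranked_tile_ids: (N, M) retrieved tile IDs per query, best first.
    gt_tile_ids: (N,) correct tile ID per query."""
    ranked = np.asarray(ranked_tile_ids)
    gt = np.asarray(gt_tile_ids).reshape(-1, 1)
    return {k: float((ranked[:, :k] == gt).any(axis=1).mean()) for k in ks}

def mean_and_std_over_seeds(per_seed_values):
    """Error bars of the kind promised in the rebuttal: mean and sample
    standard deviation of a metric across runs with different seeds."""
    vals = np.asarray(per_seed_values, dtype=float)
    return float(vals.mean()), float(vals.std(ddof=1))
```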

Circularity Check

0 steps flagged

Standard supervised retrieval-regression pipeline on held-out benchmark; no derivation reduces to inputs by construction

Full rationale

The paper introduces a new TOL benchmark and trains TOLoc (coarse global descriptor retrieval followed by fine-stage 2-DoF regression) on it. Reported gains at 5/10/25 m thresholds are measured on held-out splits against external baselines. No equations, fitted parameters, or self-citations are shown to force the localization metrics by construction; the central claims rest on empirical evaluation of a learned model rather than tautological re-expression of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central performance claim rests on the assumption that textual descriptions contain sufficient directional and semantic cues to match OSM features, plus the learned parameters of the neural feature extractors and alignment module.

free parameters (1)
  • neural network weights
    Parameters of the direction-aware encoders and pose regression head are fitted on the TOL training split to minimize localization error.
axioms (1)
  • domain assumption: Textual descriptions encode directional and semantic information that aligns with OSM object and road features
    Invoked in both the coarse global descriptor stage and the fine alignment module.

pith-pipeline@v0.9.0 · 5651 in / 1385 out tokens · 52453 ms · 2026-05-13T21:40:16.342502+00:00 · methodology

