arxiv: 2512.06743 · v2 · pith:BOT3FK4Dnew · submitted 2025-12-07 · 💻 cs.DB

OSM+: Billion-Level OpenStreetMap Dataset for City-wide Experiments

Pith reviewed 2026-05-22 12:37 UTC · model grok-4.3

classification 💻 cs.DB
keywords: OpenStreetMap, road networks, graph dataset, traffic prediction, spatial queries, city-scale experiments, billion-scale data

The pith

OSM+ releases a 1-billion-vertex global road network graph from processed OpenStreetMap data for city-wide experiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper processes worldwide OpenStreetMap data using massive distributed computing to create and release OSM+, a structured 1-billion-vertex road network graph dataset. This dataset is designed for high accessibility with an open structure and spatial query tools, addressing the lack of large-scale, unified benchmarks for real-world road networks in graph learning and urban studies. It demonstrates value through applications in city boundary detection, traffic prediction across 31 cities, and multi-agent traffic policy control in six cities at larger scales. Researchers can now perform reproducible experiments without handling the intensive preprocessing of raw global OSM data.

Core claim

By leveraging distributed cloud computing with 5,000 cores, the authors transform raw OpenStreetMap data into OSM+, a unified worldwide road network graph containing 1 billion vertices. This dataset includes an easy spatial query interface, supports reproducibility via a fixed snapshot, and comes with tools for integrating multimodal data, enabling broader benchmarks in traffic prediction and challenges in large-scale traffic policy control.

What carries the argument

OSM+, the structured 1-billion-vertex road network graph dataset derived from global OpenStreetMap data via distributed processing.

If this is right

  • Traffic prediction benchmarks can scale from hundreds to thousands of road intersections across 31 cities.
  • Traffic policy control experiments introduce thousand-scale multi-agent coordination challenges in six cities.
  • City boundary detection and other spatial tasks gain access to consistent global data.
  • Geospatial foundation models can integrate multimodal spatial-temporal data with the road network structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such a dataset might highlight scalability issues in existing graph neural network models that smaller benchmarks overlook.
  • Future work could extend OSM+ to include dynamic updates for real-time traffic simulations.
  • Cross-domain applications like urban planning or disaster response modeling could benefit from the unified global topology.

Load-bearing premise

The processing steps preserve the topological properties and spatial relationships of the original OpenStreetMap road networks without introducing artifacts during unification.

What would settle it

Comparing key network statistics, such as average degree, clustering coefficients, or path length distributions, between OSM+ and unprocessed OSM extracts in multiple cities would settle the question: significant discrepancies would falsify the claim.
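A minimal sketch of such a comparison, assuming both graphs are available as undirected edge lists of (u, v) vertex identifiers. The function names and the toy edge lists are illustrative assumptions and are not taken from the paper or the OSM+ tooling.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from (u, v) pairs; self-loops are ignored."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def average_degree(adj):
    """Mean number of neighbours per vertex."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def average_clustering(adj):
    """Mean local clustering coefficient; vertices with degree < 2 count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count neighbour pairs that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def compare(reference_edges, processed_edges):
    """Differences (processed minus reference) in two summary statistics."""
    ref = build_adjacency(reference_edges)
    proc = build_adjacency(processed_edges)
    return {
        "avg_degree_diff": average_degree(proc) - average_degree(ref),
        "avg_clustering_diff": average_clustering(proc) - average_clustering(ref),
    }


# Toy data: the "processed" graph has lost the edge (1, 3).
reference = [(1, 2), (2, 3), (1, 3), (3, 4)]
processed = [(1, 2), (2, 3), (3, 4)]
report = compare(reference, processed)
```

On real extracts the same comparison would be run per city, and what counts as a significant discrepancy would have to be chosen by the analyst.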

Figures

Figures reproduced from arXiv: 2512.06743 by Fan Wu, Guanjie Zheng, Hongwei Zhang, Linghe Kong, Wen Ling, Xuanhe Zhou, Yiheng Wang, Yuhang Luo, Ziyang Su.

Figure 1: Comparison among four ways of querying Open
Figure 2: OSM+ database provides worldwide edges (Table split_edge) and points (Table vertex). Only basic SQL language is
Figure 3: Basic continent and category statistics of OSM+
Figure 4: OSM+ (on cloud) provides efficient and easy-to-use
Figure 5: The framework of downstream task supporting
Figure 6: Map of Toronto, Los Angeles and Tokyo. The green
Figure 7: Comparison between the newly constructed OSM+(UTD19) dataset and PEMS dataset in three aspects.
Figure 8: The city boundary map obtained from the clustering results of central Europe, the east coast of the United States
Figure 9: Comparing the road, population and night lighting data map in the same area, the road density is positively correlated
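The Figure 2 caption names two tables, split_edge and vertex, and says basic SQL is enough to query them. The sketch below illustrates what a bounding-box query over such tables could look like. Only the two table names come from the caption. The column names (id, lon, lat, src, dst), the in-memory SQLite stand-in, and the toy coordinates are assumptions made for illustration; the real OSM+ schema and database engine may differ.

```python
import sqlite3


def make_toy_db():
    """Build an in-memory stand-in with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE vertex (id INTEGER PRIMARY KEY, lon REAL, lat REAL);
        CREATE TABLE split_edge (id INTEGER PRIMARY KEY, src INTEGER, dst INTEGER);
        """
    )
    # Toy coordinates, not real OSM+ rows.
    conn.executemany(
        "INSERT INTO vertex VALUES (?, ?, ?)",
        [(1, -79.40, 43.65), (2, -79.39, 43.66), (3, -118.25, 34.05)],
    )
    conn.executemany(
        "INSERT INTO split_edge VALUES (?, ?, ?)",
        [(10, 1, 2), (11, 2, 3)],
    )
    return conn


def edges_in_bbox(conn, min_lon, min_lat, max_lon, max_lat):
    """Return edges whose two endpoints both fall inside the bounding box."""
    sql = """
        SELECT e.id, e.src, e.dst
        FROM split_edge AS e
        JOIN vertex AS a ON a.id = e.src
        JOIN vertex AS b ON b.id = e.dst
        WHERE a.lon BETWEEN ? AND ? AND a.lat BETWEEN ? AND ?
          AND b.lon BETWEEN ? AND ? AND b.lat BETWEEN ? AND ?
        ORDER BY e.id
    """
    params = (min_lon, max_lon, min_lat, max_lat) * 2
    return conn.execute(sql, params).fetchall()


conn = make_toy_db()
city_edges = edges_in_bbox(conn, -79.5, 43.6, -79.3, 43.7)
```

Requiring both endpoints inside the box is one possible design choice; a query that keeps edges crossing the boundary would test only one endpoint.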
Original abstract

Road network data provides rich information about cities, but processing worldwide OpenStreetMap (OSM) data is computationally intensive, and the resulting graphs are often difficult to unify for benchmarking downstream tasks. Existing graph learning benchmarks fail to capture the billion-scale and unique topological properties of real-world road networks, leaving model scalability underexplored. To close this gap, we process OSM data with distributed cloud computing using 5,000 cores and release OSM+, a structured worldwide 1-billion-vertex road network graph dataset designed for high accessibility and usability. OSM+ is open source and globally downloadable, providing an open-box graph structure and an easy spatial query interface; the evaluated release is a fixed snapshot for reproducibility, with a versioned update plan for future releases. We demonstrate the utility of OSM+ through three illustrative use cases: city boundary detection, traffic prediction, and traffic policy control. For traffic prediction, we construct a new 31-city benchmark by processing traffic data and combining it with OSM+, enabling broader spatial coverage and more comprehensive evaluation than commonly used datasets, while scaling from hundreds of road network intersections to thousands. For traffic policy control, we release a new six-city dataset at a much larger scale, introducing challenges for thousand-scale multi-agent coordination. We also provide data processing tools for integrating multimodal spatial-temporal data with OSM+ for geospatial foundation model training, thereby expediting the discovery of compelling scientific insights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper describes the creation of OSM+, a worldwide 1-billion-vertex road network graph dataset derived from OpenStreetMap via distributed cloud computing on 5,000 cores. It releases the dataset with an open-box graph structure and spatial query interface, provides a fixed snapshot for reproducibility along with versioned updates, and demonstrates utility through three use cases: city boundary detection, a new 31-city traffic prediction benchmark that scales to thousands of intersections, and a six-city traffic policy control dataset for multi-agent coordination. Tools for integrating multimodal spatial-temporal data are also provided to support geospatial foundation model training.

Significance. If the processing pipeline preserves topological properties without systematic artifacts, OSM+ would offer a substantial resource for benchmarking graph learning and urban analytics at scales far beyond existing datasets, directly addressing the gap in capturing billion-scale real-world road network properties. The explicit release of reproducible snapshots, open-source code for processing, and multimodal integration tools represents a concrete strength that supports community adoption and downstream scientific work in traffic modeling and city-scale experiments.

major comments (2)
  1. [Section 3] Section 3 (OSM+ Construction and Processing Pipeline): The manuscript states the use of 5,000 cores for distributed processing but supplies no concrete description of the global unification procedure, including coordinate system handling, duplicate node resolution across borders, way simplification rules, or cross-border consistency checks. This detail is load-bearing for the central claim that the resulting graph is a single usable worldwide network that preserves real topological properties for benchmarking.
  2. [Section 5] Section 5 (Use Cases and Evaluation): No quantitative validation of graph quality is reported, such as error rates after cleaning, comparison of degree sequences or connected-component counts against independent regional OSM extracts, or accuracy of spatial queries on the unified graph. This is required to substantiate the usability claims for the 31-city traffic prediction benchmark and the six-city policy control dataset.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'evaluated release' is used without specifying any graph-level fidelity metrics; clarify whether any internal consistency checks were performed on the 1-billion-vertex snapshot.
  2. [Figures and Tables] Throughout: Ensure all figures and tables that reference the billion-vertex scale include explicit legends or captions stating the exact vertex/edge counts for the released snapshot.
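The validation requested in major comment 2 could be prototyped along the following lines. This is a hedged sketch rather than the authors' procedure: it assumes both the unified graph and an independent regional extract are given as undirected edge lists, and all function names are hypothetical. Isolated vertices do not appear in an edge list, so they are not counted here.

```python
from collections import Counter


def degree_histogram(edges):
    """Map each degree value to the number of vertices having that degree."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return Counter(deg.values())


def count_components(edges):
    """Count connected components with a union-find over edge endpoints."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    return len({find(x) for x in list(parent)})


def fidelity_report(reference_edges, processed_edges):
    """L1 distance between degree histograms and change in component count."""
    ref_hist = degree_histogram(reference_edges)
    proc_hist = degree_histogram(processed_edges)
    degrees = set(ref_hist) | set(proc_hist)
    return {
        "degree_hist_l1": sum(abs(ref_hist[d] - proc_hist[d]) for d in degrees),
        "component_diff": count_components(processed_edges)
        - count_components(reference_edges),
    }


# Toy data: dropping the edge (2, 3) splits one path into two pieces.
reference = [(1, 2), (2, 3), (3, 4)]
processed = [(1, 2), (3, 4)]
report = fidelity_report(reference, processed)
```

A report of all zeros on a regional extract would be consistent with, though not proof of, a pipeline that left that region's topology unchanged.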

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the positive assessment of the significance of OSM+ and for the helpful suggestions to improve the manuscript. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Section 3] Section 3 (OSM+ Construction and Processing Pipeline): The manuscript states the use of 5,000 cores for distributed processing but supplies no concrete description of the global unification procedure, including coordinate system handling, duplicate node resolution across borders, way simplification rules, or cross-border consistency checks. This detail is load-bearing for the central claim that the resulting graph is a single usable worldwide network that preserves real topological properties for benchmarking.

    Authors: Thank you for highlighting this important point. We acknowledge that the current description in Section 3 lacks sufficient detail on the global unification procedure. To address this, we will revise the manuscript to provide a concrete description, including: coordinate system handling using WGS84, duplicate node resolution across borders via spatial proximity matching, way simplification rules using established algorithms to maintain topology, and cross-border consistency checks through connectivity verification. Additionally, we will include pseudocode and a processing diagram to clarify the distributed computation on 5,000 cores. These additions will better support the claim of a unified worldwide network.
    revision: yes

  2. Referee: [Section 5] Section 5 (Use Cases and Evaluation): No quantitative validation of graph quality is reported, such as error rates after cleaning, comparison of degree sequences or connected-component counts against independent regional OSM extracts, or accuracy of spatial queries on the unified graph. This is required to substantiate the usability claims for the 31-city traffic prediction benchmark and the six-city policy control dataset.

    Authors: We thank the referee for this observation. We agree that quantitative validation is important to substantiate the graph quality and usability claims. In the revised manuscript, we will add quantitative validation in Section 5, including error rates after the cleaning steps, comparisons of degree sequences and connected component counts with independent regional OSM extracts for multiple cities, and accuracy assessments of spatial queries on the unified graph. This will provide stronger evidence for the 31-city and six-city benchmarks.
    revision: yes
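The first response mentions duplicate node resolution via spatial proximity matching on WGS84 coordinates but gives no algorithm. One plausible reading, offered purely as an assumption, is a grid-bucketed nearest-match pass using great-circle distance. The 1 m threshold, the cell size, and all names below are hypothetical, and the sketch ignores the antimeridian and polar regions.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres
CELL_DEG = 1e-4  # grid cell size in degrees (about 11 m in latitude)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two WGS84 lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_match(nodes, grid, lon, lat, cell, threshold_m):
    """Look for an already-kept node within threshold_m in the 3x3 cell block."""
    cx, cy = cell
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for other in grid.get((cx + dx, cy + dy), ()):
                if haversine_m(lon, lat, *nodes[other]) <= threshold_m:
                    return other
    return None


def merge_near_duplicates(nodes, threshold_m=1.0):
    """nodes: dict id -> (lon, lat). Returns dict id -> canonical id.

    Assumes threshold_m is well below the cell width, so any match lies in
    the 3x3 block of cells around the query point.
    """
    grid = {}
    canonical = {}
    for node_id in sorted(nodes):
        lon, lat = nodes[node_id]
        cell = (math.floor(lon / CELL_DEG), math.floor(lat / CELL_DEG))
        match = find_match(nodes, grid, lon, lat, cell, threshold_m)
        if match is None:
            grid.setdefault(cell, []).append(node_id)
            canonical[node_id] = node_id
        else:
            canonical[node_id] = match
    return canonical


# Toy data: nodes 1 and 2 are under 10 cm apart; node 3 is hundreds of metres away.
toy_nodes = {1: (8.0, 47.0), 2: (8.000001, 47.0), 3: (8.01, 47.0)}
mapping = merge_near_duplicates(toy_nodes)
```

Merging purely by distance can wrongly fuse distinct nodes, for example where a bridge passes over a road, so a real pipeline would probably need additional checks before collapsing two vertices.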

Circularity Check

0 steps flagged

No circularity: data-release paper with no derivations or self-referential predictions

The manuscript is a data-processing and release paper. It describes ingesting external OpenStreetMap input, applying distributed cloud processing (5,000 cores), and releasing a fixed snapshot graph with an open-box structure and spatial-query interface. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear in the provided text. The central claims rest on the external OSM source and on downstream use-case demonstrations rather than on any internal reduction to the paper's own outputs. Self-citations are absent from the load-bearing narrative. This is the expected honest non-finding for a dataset paper whose value is measured by external usability and reproducibility.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a processed dataset rather than a derivation; it rests on the domain assumption that OSM data can be turned into a consistent global graph without loss of essential topology.

axioms (1)
  • domain assumption: OpenStreetMap data contains sufficient structure and coverage to be processed into a unified, queryable road network graph at global scale.
    The entire pipeline and all downstream benchmarks depend on this assumption about OSM data quality and consistency.



