OSM+: Billion-Level OpenStreetMap Dataset for City-wide Experiments
Pith reviewed 2026-05-22 12:37 UTC · model grok-4.3
The pith
OSM+ releases a 1-billion-vertex global road network graph from processed OpenStreetMap data for city-wide experiments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By leveraging distributed cloud computing with 5,000 cores, the authors transform raw OpenStreetMap data into OSM+, a unified worldwide road network graph containing 1 billion vertices. This dataset includes an easy spatial query interface, supports reproducibility via a fixed snapshot, and comes with tools for integrating multimodal data, enabling broader benchmarks in traffic prediction and challenges in large-scale traffic policy control.
What carries the argument
OSM+, the structured 1-billion-vertex road network graph dataset derived from global OpenStreetMap data via distributed processing.
If this is right
- Traffic prediction benchmarks can scale from hundreds to thousands of road intersections across 31 cities.
- Traffic policy control experiments introduce thousand-scale multi-agent coordination challenges in six cities.
- City boundary detection and other spatial tasks gain access to consistent global data.
- Geospatial foundation models can integrate multimodal spatial-temporal data with the road network structure.
Where Pith is reading between the lines
- Such a dataset might highlight scalability issues in existing graph neural network models that smaller benchmarks overlook.
- Future work could extend OSM+ to include dynamic updates for real-time traffic simulations.
- Cross-domain applications like urban planning or disaster response modeling could benefit from the unified global topology.
Load-bearing premise
The processing steps preserve the topological properties and spatial relationships of the original OpenStreetMap road networks without introducing artifacts during unification.
What would settle it
Comparing key network statistics such as average degree, clustering coefficients, or path length distributions between OSM+ and unprocessed OSM extracts in multiple cities would falsify the claim if significant discrepancies appear.
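A minimal sketch of such a comparison, assuming both the OSM+ subgraph and the raw OSM extract for a city are available as plain lists of undirected (u, v) edges. The function names here are hypothetical and are not part of any OSM+ tooling; the sketch only computes average degree and mean local clustering coefficient, and path length distributions are left out.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_degree(adj):
    """Mean number of neighbours per vertex."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj) if adj else 0.0


def average_clustering(adj):
    """Mean local clustering coefficient; vertices with degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist between pairs of neighbours.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj) if adj else 0.0


def compare_stats(edges_a, edges_b):
    """Absolute differences in the two statistics between two edge lists."""
    adj_a, adj_b = build_adjacency(edges_a), build_adjacency(edges_b)
    return {
        "avg_degree_diff": abs(average_degree(adj_a) - average_degree(adj_b)),
        "avg_clustering_diff": abs(average_clustering(adj_a) - average_clustering(adj_b)),
    }
```

What counts as a "significant" discrepancy is not specified in the source, so any threshold applied to these differences would be the experimenter's own choice.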
Original abstract
Road network data provides rich information about cities, but processing worldwide OpenStreetMap (OSM) data is computationally intensive, and the resulting graphs are often difficult to unify for benchmarking downstream tasks. Existing graph learning benchmarks fail to capture the billion-scale and unique topological properties of real-world road networks, leaving model scalability underexplored. To close this gap, we process OSM data with distributed cloud computing using 5,000 cores and release OSM+, a structured worldwide 1-billion-vertex road network graph dataset designed for high accessibility and usability. OSM+ is open source and globally downloadable, providing an open-box graph structure and an easy spatial query interface; the evaluated release is a fixed snapshot for reproducibility, with a versioned update plan for future releases. We demonstrate the utility of OSM+ through three illustrative use cases: city boundary detection, traffic prediction, and traffic policy control. For traffic prediction, we construct a new 31-city benchmark by processing traffic data and combining it with OSM+, enabling broader spatial coverage and more comprehensive evaluation than commonly used datasets, while scaling from hundreds of road network intersections to thousands. For traffic policy control, we release a new six-city dataset at a much larger scale, introducing challenges for thousand-scale multi-agent coordination. We also provide data processing tools for integrating multimodal spatial-temporal data with OSM+ for geospatial foundation model training, thereby expediting the discovery of compelling scientific insights.
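The abstract mentions a spatial query interface without describing it. To make the idea concrete, here is a hedged sketch of one common way to answer bounding-box queries over road-network vertices: a uniform longitude/latitude grid index. The class GridIndex, its methods, and the cell size are all hypothetical and do not reflect the actual OSM+ interface; the sketch also ignores bounding boxes that cross the antimeridian.

```python
import math
from collections import defaultdict


class GridIndex:
    """Hypothetical uniform lon/lat grid index (not the OSM+ API)."""

    def __init__(self, cell_deg=0.01):
        self.cell_deg = cell_deg          # cell width and height in degrees
        self.cells = defaultdict(list)    # (col, row) -> [(vertex_id, lon, lat)]

    def _key(self, lon, lat):
        return (math.floor(lon / self.cell_deg), math.floor(lat / self.cell_deg))

    def insert(self, vertex_id, lon, lat):
        self.cells[self._key(lon, lat)].append((vertex_id, lon, lat))

    def query_bbox(self, min_lon, min_lat, max_lon, max_lat):
        """Return sorted ids of vertices inside the closed bounding box."""
        x0, y0 = self._key(min_lon, min_lat)
        x1, y1 = self._key(max_lon, max_lat)
        hits = []
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                # .get avoids creating empty cells in the defaultdict.
                for vid, lon, lat in self.cells.get((x, y), ()):
                    if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
                        hits.append(vid)
        return sorted(hits)
```

A grid is the simplest choice for illustration; an R-tree or a geohash-based index would be typical alternatives when vertex density varies strongly between regions.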
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes the creation of OSM+, a worldwide 1-billion-vertex road network graph dataset derived from OpenStreetMap via distributed cloud computing on 5,000 cores. It releases the dataset with an open-box graph structure and spatial query interface, provides a fixed snapshot for reproducibility along with versioned updates, and demonstrates utility through three use cases: city boundary detection, a new 31-city traffic prediction benchmark that scales to thousands of intersections, and a six-city traffic policy control dataset for multi-agent coordination. Tools for integrating multimodal spatial-temporal data are also provided to support geospatial foundation model training.
Significance. If the processing pipeline preserves topological properties without systematic artifacts, OSM+ would offer a substantial resource for benchmarking graph learning and urban analytics at scales far beyond existing datasets, directly addressing the gap in capturing billion-scale real-world road network properties. The explicit release of reproducible snapshots, open-source code for processing, and multimodal integration tools represents a concrete strength that supports community adoption and downstream scientific work in traffic modeling and city-scale experiments.
major comments (2)
- [Section 3] Section 3 (OSM+ Construction and Processing Pipeline): The manuscript states the use of 5,000 cores for distributed processing but supplies no concrete description of the global unification procedure, including coordinate system handling, duplicate node resolution across borders, way simplification rules, or cross-border consistency checks. This detail is load-bearing for the central claim that the resulting graph is a single usable worldwide network that preserves real topological properties for benchmarking.
- [Section 5] Section 5 (Use Cases and Evaluation): No quantitative validation of graph quality is reported, such as error rates after cleaning, comparison of degree sequences or connected-component counts against independent regional OSM extracts, or accuracy of spatial queries on the unified graph. This is required to substantiate the usability claims for the 31-city traffic prediction benchmark and the six-city policy control dataset.
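The checks requested in the second major comment can be sketched briefly. The following is an illustrative example, not the authors' validation procedure: it assumes each graph is given as a list of undirected (u, v) edges and uses hypothetical helper names. It counts connected components with a union-find structure and compares normalized degree histograms with an L1 distance. Vertices that appear in no edge are not counted.

```python
from collections import Counter


def degree_histogram(edges):
    """Map degree -> number of vertices with that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def count_components(edges):
    """Count connected components among vertices that appear in some edge."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    return len({find(x) for x in list(parent)})


def histogram_l1_distance(hist_a, hist_b):
    """L1 distance between two degree histograms after normalizing each to sum 1."""
    n_a, n_b = sum(hist_a.values()), sum(hist_b.values())
    if n_a == 0 or n_b == 0:
        raise ValueError("both histograms must be non-empty")
    degrees = set(hist_a) | set(hist_b)
    return sum(abs(hist_a.get(d, 0) / n_a - hist_b.get(d, 0) / n_b) for d in degrees)
```

A distance of 0 means the two graphs have identical degree distributions; the maximum possible value is 2.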
minor comments (2)
- [Abstract] Abstract: The phrase 'evaluated release' is used without specifying any graph-level fidelity metrics; clarify whether any internal consistency checks were performed on the 1-billion-vertex snapshot.
- [Figures and Tables] Throughout: Ensure all figures and tables that reference the billion-vertex scale include explicit legends or captions stating the exact vertex/edge counts for the released snapshot.
Simulated Author's Rebuttal
We are grateful to the referee for the positive assessment of the significance of OSM+ and for the helpful suggestions to improve the manuscript. We respond to each major comment below.
Point-by-point responses
- Referee: [Section 3] Section 3 (OSM+ Construction and Processing Pipeline): The manuscript states the use of 5,000 cores for distributed processing but supplies no concrete description of the global unification procedure, including coordinate system handling, duplicate node resolution across borders, way simplification rules, or cross-border consistency checks. This detail is load-bearing for the central claim that the resulting graph is a single usable worldwide network that preserves real topological properties for benchmarking.
  Authors: Thank you for highlighting this important point. We acknowledge that the current description in Section 3 lacks sufficient detail on the global unification procedure. To address this, we will revise the manuscript to provide a concrete description, including: coordinate system handling using WGS84, duplicate node resolution across borders via spatial proximity matching, way simplification rules using established algorithms to maintain topology, and cross-border consistency checks through connectivity verification. Additionally, we will include pseudocode and a processing diagram to clarify the distributed computation on 5,000 cores. These additions will better support the claim of a unified worldwide network.
  Revision: yes
- Referee: [Section 5] Section 5 (Use Cases and Evaluation): No quantitative validation of graph quality is reported, such as error rates after cleaning, comparison of degree sequences or connected-component counts against independent regional OSM extracts, or accuracy of spatial queries on the unified graph. This is required to substantiate the usability claims for the 31-city traffic prediction benchmark and the six-city policy control dataset.
  Authors: We thank the referee for this observation. We agree that quantitative validation is important to substantiate the graph quality and usability claims. In the revised manuscript, we will add quantitative validation in Section 5, including error rates after the cleaning steps, comparisons of degree sequences and connected component counts with independent regional OSM extracts for multiple cities, and accuracy assessments of spatial queries on the unified graph. This will provide stronger evidence for the 31-city and six-city benchmarks.
  Revision: yes
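The first response names spatial proximity matching on WGS84 coordinates as the planned approach to duplicate node resolution but gives no algorithm. As a purely illustrative sketch of what such matching could look like, the example below merges nodes that lie within a distance tolerance of an already-kept node, using the haversine formula on a spherical Earth. The function names, the 1 m tolerance, and the greedy quadratic scan are all assumptions made for clarity; a billion-vertex pipeline would need a spatial index and is not described by this code.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation of WGS84


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def merge_near_duplicates(nodes, tol_m=1.0):
    """Map every node id to a canonical id.

    nodes: dict of node id -> (lon, lat). Ids are visited in sorted order, and a
    node is merged into the first kept node within tol_m metres (assumed value).
    This O(n^2) scan is for illustration only.
    """
    canonical = {}
    kept = []
    for node_id in sorted(nodes):
        lon, lat = nodes[node_id]
        for kept_id in kept:
            kept_lon, kept_lat = nodes[kept_id]
            if haversine_m(lon, lat, kept_lon, kept_lat) <= tol_m:
                canonical[node_id] = kept_id
                break
        else:
            # No kept node is close enough, so this node becomes canonical.
            canonical[node_id] = node_id
            kept.append(node_id)
    return canonical
```

After merging, edges would be rewritten through the canonical map, which is the step where connectivity verification across borders becomes possible.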
Circularity Check
No circularity: data-release paper with no derivations or self-referential predictions
Full rationale
The manuscript is a data-processing and release paper. It describes ingesting external OpenStreetMap input, applying distributed cloud processing (5,000 cores), and releasing a fixed snapshot graph with an open-box structure and spatial-query interface. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear in the provided text. The central claims rest on the external OSM source and on downstream use-case demonstrations rather than on any internal reduction to the paper's own outputs. Self-citations are absent from the load-bearing narrative. This is the expected honest non-finding for a dataset paper whose value is measured by external usability and reproducibility.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: OpenStreetMap data contains sufficient structure and coverage to be processed into a unified, queryable road network graph at global scale.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  We generalize the road network as a graph with 1.9 billion vertices... Vertex: Each vertex represents a road intersection... Edge: Each edge represents a road segment... (Definition 2.1)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  We process OSM data with distributed cloud computing using 5,000 cores and release OSM+
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.