Paper2Data: Large-Scale LLM Extraction and Metadata Structuring of Global Urban Data from Scientific Literature
Pith reviewed 2026-05-16 05:41 UTC · model grok-4.3
The pith
An LLM pipeline extracts and structures over 60,000 urban datasets from scientific literature to build a searchable global portal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Paper2Data is a large-scale LLM-driven pipeline that automatically identifies dataset mentions in scientific papers and structures them using a unified urban data metadata schema. It is reported to achieve high recall in identification (approximately 90%) and high field-level precision (above 80%), which enables UrbanDataMiner, presented as the first large-scale literature-derived infrastructure for urban data discovery.
What carries the argument
Paper2Data, the LLM-driven pipeline that identifies dataset mentions in papers and structures them using a unified urban data metadata schema.
Load-bearing premise
That the LLM pipeline identifies and structures dataset mentions reliably across different paper styles, disciplines, and dataset types without significant omissions or errors.
What would settle it
Run the pipeline on a fresh sample of papers and compare its output against independent human annotations. The claim is in trouble if recall drops below 80 percent or precision falls below 70 percent.
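The check above can be sketched in a few lines. This is only an illustration, assuming dataset mentions have already been normalized to comparable strings and that exact string match counts as a hit. The function names and the sample data are hypothetical, and the paper's own evaluation protocol (including its field-level precision) may differ. The 0.80 and 0.70 thresholds are the ones stated above.

```python
def recall_precision(predicted, gold):
    """Return (recall, precision) for two collections of normalized dataset names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty inputs so the ratios are always defined.
    recall = true_positives / len(gold) if gold else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision


def claim_survives(predicted, gold, min_recall=0.80, min_precision=0.70):
    """True if both metrics stay at or above the thresholds named in the text."""
    recall, precision = recall_precision(predicted, gold)
    return recall >= min_recall and precision >= min_precision


# Hypothetical annotations for one fresh sample (not taken from the paper).
gold = {"openstreetmap", "worldpop", "landsat 8", "gadm", "ghsl"}
predicted = {"openstreetmap", "worldpop", "landsat 8", "gadm", "modis"}

recall, precision = recall_precision(predicted, gold)
print(f"recall={recall:.2f} precision={precision:.2f}")
print("claim survives:", claim_survives(predicted, gold))
```

Exact matching is the strictest choice. A real check would likely need a matching rule for name variants, which is itself a source of disagreement worth reporting.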
Original abstract
Urban data support a wide range of applications across multiple disciplines. However, at the global scale, there is no unified platform for urban data discovery. As a result, researchers often have to manually search through websites or scientific literature to identify relevant datasets. To address this problem, we curate an open urban data discovery portal, \textit{UrbanDataMiner}, which supports dataset-level search and filtering over more than 60{,}000 urban datasets extracted from over 15{,}000 Nature-affiliated publications. \textit{UrbanDataMiner} is enabled by \textit{Paper2Data}, a novel large-scale LLM-driven pipeline that automatically identifies dataset mentions in scientific papers and structures them using a unified urban data metadata schema. Human-annotated evaluation demonstrates that \textit{Paper2Data} achieves high recall (approximately 90\%) in dataset identification and high field-level precision (above 80\%). In addition, \textit{UrbanDataMiner} can retrieve over 9\% of datasets that are not easily discoverable through general-purpose search engines such as Google. Overall, our work provides the first large-scale, literature-derived infrastructure for urban data discovery and enables more systematic and reusable data-driven research across disciplines. Our code and data are publicly available\footnote{https://github.com/Yourunwen/Paper2Data}.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Paper2Data, an LLM-based pipeline that automatically identifies dataset mentions in scientific papers and structures them according to a unified urban data metadata schema. This pipeline enables the UrbanDataMiner portal, which provides search and filtering over more than 60,000 urban datasets extracted from over 15,000 Nature-affiliated publications. Human-annotated evaluation is reported to show approximately 90% recall for dataset identification and above 80% field-level precision, with an additional claim that the portal surfaces over 9% of datasets not easily discoverable via general-purpose search engines.
Significance. If the extraction reliability holds, the work delivers the first large-scale, literature-derived infrastructure for urban data discovery and supports more systematic, reusable data-driven research across disciplines. Public release of code and data strengthens the contribution by enabling direct reuse and verification.
major comments (2)
- [Abstract / Evaluation] Abstract and evaluation description: the headline performance claims (~90% recall in dataset identification and >80% field-level precision) rest on human-annotated evaluation, yet no information is provided on test-set size, sampling strategy across disciplines or paper formats, inter-annotator agreement, or error analysis. Without these details, the metrics cannot be shown to generalize to the full 15k-paper corpus, and recall in particular is vulnerable to under-annotation in the gold labels.
- [Pipeline / Results] Section describing the LLM pipeline and UrbanDataMiner construction: the assumption that the extraction generalizes reliably across diverse paper styles, disciplines, and dataset types is load-bearing for the 60k-dataset claim, but no systematic analysis of omissions, hallucinations, or coverage gaps is reported to support this.
minor comments (1)
- [Abstract] The abstract contains raw LaTeX markup (e.g., \textit) that should be rendered or removed for readability in the final version.
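One of the missing details named in the major comments is inter-annotator agreement. As a minimal sketch of how such a statistic could be computed, the block below implements Cohen's kappa for two annotators. The choice of kappa, the function name, and the example labels are assumptions for illustration; the paper does not say which agreement measure, if any, was used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: 1 = "this mention is a dataset", 0 = "it is not".
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_b = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]
print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

Raw agreement in this example is 80 percent, but kappa is lower because half of that agreement is expected by chance. That gap is why the referee asks for an agreement statistic and not just a match rate.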
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We agree that the evaluation section requires substantially more detail to support the reported metrics and that the manuscript would benefit from explicit discussion of generalization and failure modes. We will revise the paper accordingly and address each point below.
- Referee: [Abstract / Evaluation] Abstract and evaluation description: the headline performance claims (~90% recall in dataset identification and >80% field-level precision) rest on human-annotated evaluation, yet no information is provided on test-set size, sampling strategy across disciplines or paper formats, inter-annotator agreement, or error analysis. Without these details, the metrics cannot be shown to generalize to the full 15k-paper corpus, and recall in particular is vulnerable to under-annotation in the gold labels.
  Authors: We agree that these methodological details are essential. In the revised manuscript we will expand the evaluation section to report the exact test-set size, the stratified sampling procedure across disciplines and paper formats, inter-annotator agreement statistics, and a full error analysis (including sources of missed datasets). These additions will directly address concerns about generalizability and potential under-annotation. Revision: yes.
- Referee: [Pipeline / Results] Section describing the LLM pipeline and UrbanDataMiner construction: the assumption that the extraction generalizes reliably across diverse paper styles, disciplines, and dataset types is load-bearing for the 60k-dataset claim, but no systematic analysis of omissions, hallucinations, or coverage gaps is reported to support this.
  Authors: We acknowledge that a systematic analysis of omissions, hallucinations, and coverage gaps is currently missing. While spot-checks were performed during development, they were not reported. In the revision we will add a dedicated limitations subsection that quantifies observed hallucination rates on a sampled subset, discusses common omission patterns by paper style and discipline, and explicitly states the scope of the 60k-dataset claim. Revision: yes.
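The two promised additions, a stratified sample and a hallucination rate measured on a subset, can be sketched as below. This is an illustration under stated assumptions: the `discipline` field, the per-stratum size, the counts, and the use of a Wilson score interval are all hypothetical choices, not details from the paper or the rebuttal.

```python
import math
import random
from collections import defaultdict


def stratified_sample(papers, key, per_stratum, seed=0):
    """Draw up to `per_stratum` papers from each stratum defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    strata = defaultdict(list)
    for paper in papers:
        strata[key(paper)].append(paper)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval (95% for z=1.96) for an error rate errors/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical corpus: each paper carries an assumed "discipline" tag.
papers = [{"id": i, "discipline": d}
          for i, d in enumerate(["climate", "health", "transport"] * 40)]
sample = stratified_sample(papers, key=lambda p: p["discipline"], per_stratum=20)

# Suppose annotators flag 12 of 200 extracted records as hallucinated.
low, high = wilson_interval(errors=12, n=200)
print(len(sample), f"hallucination rate in [{low:.3f}, {high:.3f}]")
```

Reporting an interval and not only a point estimate makes clear how much the measured rate could move on a different sample, which bears directly on the scope of the 60k-dataset claim.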
Circularity Check
No significant circularity
The paper describes an LLM-based extraction pipeline for urban datasets from literature, with performance metrics derived from separate human annotations on a held-out evaluation set rather than any fitted parameters, self-referential equations, or load-bearing self-citations. No derivation chain reduces to inputs by construction; the reported recall and precision figures are presented as externally validated against human labels, rendering the work self-contained.