pith. machine review for the scientific record.

arxiv: 2604.16317 · v1 · submitted 2026-02-09 · 💻 cs.IR · cs.AI

Paper2Data: Large-Scale LLM Extraction and Metadata Structuring of Global Urban Data from Scientific Literature

Pith reviewed 2026-05-16 05:41 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords LLM extraction · urban datasets · scientific literature · data discovery portal · metadata structuring · dataset identification · Nature publications

The pith

An LLM pipeline extracts and structures over 60,000 urban datasets from scientific literature to build a searchable global portal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Paper2Data, an automated pipeline that uses large language models to scan scientific publications for mentions of urban datasets and organize the details into a consistent format. The result is UrbanDataMiner, a public portal containing more than 60,000 datasets from over 15,000 papers affiliated with Nature. Without such a system, researchers must hunt manually through websites and articles to locate relevant data for their work. The method reaches about 90 percent recall in spotting datasets and over 80 percent precision in filling metadata fields, and it uncovers datasets that general search engines miss. This setup supports broader, more reusable data use across urban-related fields.

Core claim

Paper2Data is a large-scale LLM-driven pipeline that automatically identifies dataset mentions in scientific papers and structures them using a unified urban data metadata schema, achieving high recall in identification and high field-level precision, which enables the creation of UrbanDataMiner as the first large-scale literature-derived infrastructure for urban data discovery.

What carries the argument

Paper2Data, the LLM-driven pipeline that identifies dataset mentions in papers and structures them using a unified urban data metadata schema.
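
To make "structures them using a unified urban data metadata schema" concrete, here is a minimal sketch of what one structured record might look like. The paper's actual schema is not reproduced on this page, so the class name `DatasetRecord`, every field name, and the example values are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class DatasetRecord:
    """One dataset mention mapped onto fixed fields (all field names are hypothetical)."""
    name: str
    source_paper_doi: str
    data_type: Optional[str] = None          # e.g. "mobility", "remote sensing"
    spatial_coverage: Optional[str] = None   # e.g. a city, country, or "global"
    temporal_coverage: Optional[str] = None  # e.g. "2015-2020"
    access_url: Optional[str] = None


def filled_fields(record: DatasetRecord) -> List[str]:
    """Return the names of fields the extractor managed to fill, in schema order."""
    return [key for key, value in asdict(record).items() if value is not None]


# Illustrative record with placeholder values, not taken from the portal.
record = DatasetRecord(
    name="Example mobility traces",
    source_paper_doi="10.0000/example",
    data_type="mobility",
)
print(filled_fields(record))
```

Leaving unfilled fields as `None` rather than guessing keeps missing metadata distinguishable from wrong metadata, which matters when field-level precision is measured only over the fields that were actually filled.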

Load-bearing premise

That the LLM pipeline identifies and structures dataset mentions reliably across different paper styles, disciplines, and dataset types without significant omissions or errors.

What would settle it

Running the pipeline on a fresh sample of papers and comparing its output against independent human annotations to check if recall drops below 80 percent or precision falls below 70 percent.
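
The check described above can be sketched as follows. This is only an illustration under stated assumptions: dataset mentions are compared as exact identifiers, a metadata field counts as correct only on exact match with the human label, and the toy inputs are invented. The 80 percent and 70 percent thresholds are the ones named above; the function names are hypothetical.

```python
def recall(predicted: set, gold: set) -> float:
    """Share of human-annotated datasets that the pipeline also found."""
    return len(predicted & gold) / len(gold) if gold else 0.0


def field_precision(pred_fields: dict, gold_fields: dict) -> float:
    """Share of filled predicted fields that match the human annotation exactly."""
    filled = {k: v for k, v in pred_fields.items() if v is not None}
    if not filled:
        return 0.0
    correct = sum(1 for k, v in filled.items() if gold_fields.get(k) == v)
    return correct / len(filled)


def claim_survives(rec: float, prec: float,
                   min_recall: float = 0.80, min_precision: float = 0.70) -> bool:
    """True if neither metric falls below its threshold."""
    return rec >= min_recall and prec >= min_precision


# Toy example (invented identifiers and values).
gold_ids = {"d1", "d2", "d3", "d4", "d5"}
pred_ids = {"d1", "d2", "d3", "d4", "d9"}
pred = {"name": "A", "year": "2020", "region": "Asia", "url": None}
gold = {"name": "A", "year": "2020", "region": "Europe", "url": "x"}

rec = recall(pred_ids, gold_ids)
prec = field_precision(pred, gold)
print(rec, prec, claim_survives(rec, prec))
```

In practice, exact matching is stricter than a human judge would be, so a real re-evaluation would need an agreed matching rule for dataset names and free-text fields before the numbers are comparable to the paper's.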

Figures

Figures reproduced from arXiv: 2604.16317 by Jiankun Zhang, Jinghua Piao, Jingzhi Wang, Runwen You, Tengyao Tu, Tong Xia, Yi Chang, Yong Li.

Figure 1: UrbanDataMiner portal overview. 60K+ urban …
Figure 2: Overview of the Paper2Data pipeline. The system consists of six steps that transform scientific literature into a data portal where urban data can be easily retrieved.
Figure 3: Urban data typology used for dataset indexing.
Figure 4: Distributional analysis of urban data. (a) Global geographic concentration of data across regions. (b) Distributional …
read the original abstract

Urban data support a wide range of applications across multiple disciplines. However, at the global scale, there is no unified platform for urban data discovery. As a result, researchers often have to manually search through websites or scientific literature to identify relevant datasets. To address this problem, we curate an open urban data discovery portal, \textit{UrbanDataMiner}, which supports dataset-level search and filtering over more than 60{,}000 urban datasets extracted from over 15{,}000 Nature-affiliated publications. \textit{UrbanDataMiner} is enabled by \textit{Paper2Data}, a novel large-scale LLM-driven pipeline that automatically identifies dataset mentions in scientific papers and structures them using a unified urban data metadata schema. Human-annotated evaluation demonstrates that \textit{Paper2Data} achieves high recall (approximately 90\%) in dataset identification and high field-level precision (above 80\%). In addition, \textit{UrbanDataMiner} can retrieve over 9\% of datasets that are not easily discoverable through general-purpose search engines such as Google. Overall, our work provides the first large-scale, literature-derived infrastructure for urban data discovery and enables more systematic and reusable data-driven research across disciplines. Our code and data are publicly available\footnote{https://github.com/Yourunwen/Paper2Data}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Paper2Data, an LLM-based pipeline that automatically identifies dataset mentions in scientific papers and structures them according to a unified urban data metadata schema. This pipeline enables the UrbanDataMiner portal, which provides search and filtering over more than 60,000 urban datasets extracted from over 15,000 Nature-affiliated publications. Human-annotated evaluation is reported to show approximately 90% recall for dataset identification and above 80% field-level precision, with an additional claim that the portal surfaces over 9% of datasets not easily discoverable via general-purpose search engines.

Significance. If the extraction reliability holds, the work delivers the first large-scale, literature-derived infrastructure for urban data discovery and supports more systematic, reusable data-driven research across disciplines. Public release of code and data strengthens the contribution by enabling direct reuse and verification.

major comments (2)
  1. [Abstract / Evaluation] Abstract and evaluation description: the headline performance claims (~90% recall in dataset identification and >80% field-level precision) rest on human-annotated evaluation, yet no information is provided on test-set size, sampling strategy across disciplines or paper formats, inter-annotator agreement, or error analysis. Without these details, the metrics cannot be shown to generalize to the full 15k-paper corpus, and recall in particular is vulnerable to under-annotation in the gold labels.
  2. [Pipeline / Results] Section describing the LLM pipeline and UrbanDataMiner construction: the assumption that the extraction generalizes reliably across diverse paper styles, disciplines, and dataset types is load-bearing for the 60k-dataset claim, but no systematic analysis of omissions, hallucinations, or coverage gaps is reported to support this.
minor comments (1)
  1. [Abstract] The abstract contains raw LaTeX markup (e.g., textit) that should be rendered or removed for readability in the final version.
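
For readers unfamiliar with the inter-annotator agreement requested in major comment 1, one common statistic is Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. The sketch below is a generic illustration, not the paper's procedure; the label lists are invented and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels: 1 = "sentence mentions a dataset", 0 = "does not".
annotator_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

A low kappa on the gold labels would support the referee's worry about under-annotation, because recall is computed against whatever the annotators managed to mark.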

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We agree that the evaluation section requires substantially more detail to support the reported metrics and that the manuscript would benefit from explicit discussion of generalization and failure modes. We will revise the paper accordingly and address each point below.

read point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and evaluation description: the headline performance claims (~90% recall in dataset identification and >80% field-level precision) rest on human-annotated evaluation, yet no information is provided on test-set size, sampling strategy across disciplines or paper formats, inter-annotator agreement, or error analysis. Without these details, the metrics cannot be shown to generalize to the full 15k-paper corpus, and recall in particular is vulnerable to under-annotation in the gold labels.

    Authors: We agree that these methodological details are essential. In the revised manuscript we will expand the evaluation section to report the exact test-set size, the stratified sampling procedure across disciplines and paper formats, inter-annotator agreement statistics, and a full error analysis (including sources of missed datasets). These additions will directly address concerns about generalizability and potential under-annotation. revision: yes

  2. Referee: [Pipeline / Results] Section describing the LLM pipeline and UrbanDataMiner construction: the assumption that the extraction generalizes reliably across diverse paper styles, disciplines, and dataset types is load-bearing for the 60k-dataset claim, but no systematic analysis of omissions, hallucinations, or coverage gaps is reported to support this.

    Authors: We acknowledge that a systematic analysis of omissions, hallucinations, and coverage gaps is currently missing. While spot-checks were performed during development, they were not reported. In the revision we will add a dedicated limitations subsection that quantifies observed hallucination rates on a sampled subset, discusses common omission patterns by paper style and discipline, and explicitly states the scope of the 60k-dataset claim. revision: yes
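
A hallucination rate measured on a sampled subset, as promised in the second response, is only an estimate, so it would normally be reported with an interval. The sketch below uses the Wilson score interval for a binomial proportion. The choice of Wilson is an assumption made here for illustration, the counts are invented, and the function name is hypothetical.

```python
import math


def wilson_interval(errors: int, sample_size: int, z: float = 1.96):
    """Wilson score interval for an error rate (z = 1.96 gives roughly 95% coverage)."""
    if sample_size <= 0 or not 0 <= errors <= sample_size:
        raise ValueError("need 0 <= errors <= sample_size and sample_size > 0")
    p = errors / sample_size
    denom = 1 + z ** 2 / sample_size
    center = (p + z ** 2 / (2 * sample_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / sample_size + z ** 2 / (4 * sample_size ** 2)
    ) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Invented example: 12 hallucinated records found in a sample of 200.
low, high = wilson_interval(12, 200)
print(f"observed rate {12 / 200:.3f}, interval [{low:.3f}, {high:.3f}]")
```

The width of the interval also indicates how large the sampled subset needs to be before a rate measured on it can reasonably be extended to the full 60k-dataset collection.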

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an LLM-based extraction pipeline for urban datasets from literature, with performance metrics derived from separate human annotations on a held-out evaluation set rather than any fitted parameters, self-referential equations, or load-bearing self-citations. No derivation chain reduces to inputs by construction; the reported recall and precision figures are presented as externally validated against human labels, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on the standard assumption that LLMs can parse scientific text for dataset mentions and that a fixed metadata schema is sufficient; no free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5557 in / 1134 out tokens · 51071 ms · 2026-05-16T05:41:44.330949+00:00 · methodology

