arxiv: 2604.16317 · v1 · submitted 2026-02-09 · 💻 cs.IR · cs.AI

Paper2Data: Large-Scale LLM Extraction and Metadata Structuring of Global Urban Data from Scientific Literature

Runwen You, Tong Xia, Jingzhi Wang, Jiankun Zhang, Tengyao Tu, Jinghua Piao, Yi Chang, Yong Li

Pith reviewed 2026-05-16 05:41 UTC · model grok-4.3


keywords LLM extraction · urban datasets · scientific literature · data discovery portal · metadata structuring · dataset identification · Nature publications


The pith

An LLM pipeline extracts and structures over 60,000 urban datasets from scientific literature to build a searchable global portal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Paper2Data, an automated pipeline that uses large language models to scan scientific publications for mentions of urban datasets and organize the details into a consistent format. The result is UrbanDataMiner, a public portal containing more than 60,000 datasets from over 15,000 Nature-affiliated publications. Without such a system, researchers must hunt manually through websites and articles to locate relevant data for their work. In human-annotated evaluation, the method reaches about 90 percent recall in spotting datasets and over 80 percent precision in filling metadata fields, and the abstract reports that the portal can retrieve over 9 percent of datasets that are not easily discoverable through general-purpose search engines such as Google. This setup supports broader, more reusable data use across urban-related fields.

Core claim

Paper2Data is a large-scale LLM-driven pipeline that automatically identifies dataset mentions in scientific papers and structures them using a unified urban data metadata schema, achieving high recall in identification and high field-level precision, which enables the creation of UrbanDataMiner as the first large-scale literature-derived infrastructure for urban data discovery.

What carries the argument

Paper2Data, the LLM-driven pipeline that identifies dataset mentions in papers and structures them using a unified urban data metadata schema.
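
To make "structuring mentions with a unified metadata schema" concrete, the sketch below shows what a single structured record might look like. It is only an illustration: the paper's actual schema is not listed on this page, so every field name here (`name`, `source_doi`, `data_type`, `region`, `time_span`, `access_url`) and the example values are assumptions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatasetRecord:
    """One dataset mention after structuring. All field names are hypothetical."""
    name: str
    source_doi: str                    # paper the mention was extracted from
    data_type: Optional[str] = None    # e.g. a category from a data typology
    region: Optional[str] = None       # geographic coverage
    time_span: Optional[str] = None    # temporal coverage
    access_url: Optional[str] = None   # where the data can be obtained


def missing_fields(record: DatasetRecord) -> list[str]:
    """Return the names of optional fields the extractor left empty."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Made-up mention, not taken from the paper or the portal.
example = DatasetRecord(
    name="Example mobility traces",
    source_doi="10.0000/example",
    region="Global",
)
print(missing_fields(example))
```

A fixed record type like this is what makes field-level precision measurable: each filled field can be judged correct or incorrect independently of the others.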

Load-bearing premise

That the LLM pipeline identifies and structures dataset mentions reliably across different paper styles, disciplines, and dataset types without significant omissions or errors.

What would settle it

Running the pipeline on a fresh sample of papers and comparing its output against independent human annotations to check if recall drops below 80 percent or precision falls below 70 percent.
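
The check described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes dataset mentions can be matched by exact identifier (real annotation would likely need fuzzy or manual matching), and the toy `gold`, `predicted`, and `judgements` values are invented.

```python
def identification_recall(gold, predicted):
    """Share of human-annotated datasets that the pipeline also found.

    gold, predicted: dict mapping paper id -> set of dataset identifiers.
    """
    found = sum(len(gold[p] & predicted.get(p, set())) for p in gold)
    total = sum(len(datasets) for datasets in gold.values())
    return found / total if total else 0.0


def field_precision(judgements):
    """Share of filled metadata fields that annotators judged correct.

    judgements: list of booleans, one per filled field.
    """
    return sum(judgements) / len(judgements) if judgements else 0.0


# Invented toy data for two papers.
gold = {"p1": {"a", "b"}, "p2": {"c", "d", "e"}}
predicted = {"p1": {"a", "b", "x"}, "p2": {"c", "d"}}
judgements = [True] * 7 + [False] * 3

recall = identification_recall(gold, predicted)
precision = field_precision(judgements)

# Thresholds taken from the text above: recall below 80%, precision below 70%.
falls_short = recall < 0.80 or precision < 0.70
print(recall, precision, falls_short)
```

Note that the extra prediction `"x"` does not affect recall; it would only show up in an identification-precision measure, which the abstract does not report.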

Figures

Figures reproduced from arXiv: 2604.16317 by Jiankun Zhang, Jinghua Piao, Jingzhi Wang, Runwen You, Tengyao Tu, Tong Xia, Yi Chang, Yong Li.

**Figure 2.** Overview of the Paper2Data pipeline. The system consists of six steps that transform scientific literature into a data portal where urban data can be easily retrieved.

**Figure 3.** Urban data typology used for dataset indexing.

**Figure 4.** Distributional analysis of urban data. (a) Global geographic concentration of data across regions. (b) Distributional …

Original abstract

Urban data support a wide range of applications across multiple disciplines. However, at the global scale, there is no unified platform for urban data discovery. As a result, researchers often have to manually search through websites or scientific literature to identify relevant datasets. To address this problem, we curate an open urban data discovery portal, UrbanDataMiner, which supports dataset-level search and filtering over more than 60,000 urban datasets extracted from over 15,000 Nature-affiliated publications. UrbanDataMiner is enabled by Paper2Data, a novel large-scale LLM-driven pipeline that automatically identifies dataset mentions in scientific papers and structures them using a unified urban data metadata schema. Human-annotated evaluation demonstrates that Paper2Data achieves high recall (approximately 90%) in dataset identification and high field-level precision (above 80%). In addition, UrbanDataMiner can retrieve over 9% of datasets that are not easily discoverable through general-purpose search engines such as Google. Overall, our work provides the first large-scale, literature-derived infrastructure for urban data discovery and enables more systematic and reusable data-driven research across disciplines. Our code and data are publicly available at https://github.com/Yourunwen/Paper2Data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLM pipeline pulls 60k urban datasets from Nature papers into a public portal, but the human eval lacks sample size, sampling details, and agreement numbers.


The main point is a pipeline that uses LLMs to spot dataset mentions in papers and fit them to a fixed urban metadata schema. They applied it to more than 15,000 Nature-affiliated papers and released UrbanDataMiner, a searchable index holding over 60,000 datasets. The abstract also notes that the portal surfaces some datasets that general search engines miss. Releasing code and data is the right call and the scale is larger than most prior extraction efforts in this area. That part is useful on its face for anyone who needs to locate existing urban data without starting from scratch each time.

The evaluation is the clear weak spot. The abstract claims roughly 90% recall on dataset identification and above 80% field-level precision from human checks, but it gives no numbers on test-set size, how the papers were chosen, or inter-annotator agreement. Recall claims are especially sensitive to incomplete gold labels, so without those details it is hard to know how far the numbers generalize. A short error analysis would have helped too.

The paper is aimed at urban researchers and data users who want a literature-derived starting point rather than new collection. It is straightforward in its goals and does not overclaim beyond the extraction task. I would send it to peer review. The released resource has practical value even if the performance numbers need tighter documentation in revision.

Referee Report

2 major / 1 minor

Summary. The paper introduces Paper2Data, an LLM-based pipeline that automatically identifies dataset mentions in scientific papers and structures them according to a unified urban data metadata schema. This pipeline enables the UrbanDataMiner portal, which provides search and filtering over more than 60,000 urban datasets extracted from over 15,000 Nature-affiliated publications. Human-annotated evaluation is reported to show approximately 90% recall for dataset identification and above 80% field-level precision, with an additional claim that the portal surfaces over 9% of datasets not easily discoverable via general-purpose search engines.

Significance. If the extraction reliability holds, the work delivers the first large-scale, literature-derived infrastructure for urban data discovery and supports more systematic, reusable data-driven research across disciplines. Public release of code and data strengthens the contribution by enabling direct reuse and verification.

major comments (2)

[Abstract / Evaluation] Abstract and evaluation description: the headline performance claims (~90% recall in dataset identification and >80% field-level precision) rest on human-annotated evaluation, yet no information is provided on test-set size, sampling strategy across disciplines or paper formats, inter-annotator agreement, or error analysis. Without these details, the metrics cannot be shown to generalize to the full 15k-paper corpus, and recall in particular is vulnerable to under-annotation in the gold labels.
[Pipeline / Results] Section describing the LLM pipeline and UrbanDataMiner construction: the assumption that the extraction generalizes reliably across diverse paper styles, disciplines, and dataset types is load-bearing for the 60k-dataset claim, but no systematic analysis of omissions, hallucinations, or coverage gaps is reported to support this.
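
The first major comment turns on two quantities the abstract omits: test-set size and inter-annotator agreement. The sketch below illustrates why both matter, using a Wilson score interval for a reported proportion and Cohen's kappa for two annotators. All inputs are invented for illustration; the paper does not report them, and this is not the authors' evaluation code.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# The same 90% recall is far less certain on 100 gold datasets than on 1,000.
small = wilson_interval(90, 100)
large = wilson_interval(900, 1000)
print(small, large)

# Hypothetical annotator labels: 1 = "this is a dataset mention", 0 = "not".
print(cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0]))
```

With hypothetical counts like these, the interval around 90% recall is several times wider on the smaller test set, which is why the headline figure is hard to interpret without the sample size.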

minor comments (1)

[Abstract] The abstract contains raw LaTeX markup (e.g., \textit) that should be rendered or removed for readability in the final version.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We agree that the evaluation section requires substantially more detail to support the reported metrics and that the manuscript would benefit from explicit discussion of generalization and failure modes. We will revise the paper accordingly and address each point below.


Referee: [Abstract / Evaluation] Abstract and evaluation description: the headline performance claims (~90% recall in dataset identification and >80% field-level precision) rest on human-annotated evaluation, yet no information is provided on test-set size, sampling strategy across disciplines or paper formats, inter-annotator agreement, or error analysis. Without these details, the metrics cannot be shown to generalize to the full 15k-paper corpus, and recall in particular is vulnerable to under-annotation in the gold labels.

Authors: We agree that these methodological details are essential. In the revised manuscript we will expand the evaluation section to report the exact test-set size, the stratified sampling procedure across disciplines and paper formats, inter-annotator agreement statistics, and a full error analysis (including sources of missed datasets). These additions will directly address concerns about generalizability and potential under-annotation. revision: yes

Referee: [Pipeline / Results] Section describing the LLM pipeline and UrbanDataMiner construction: the assumption that the extraction generalizes reliably across diverse paper styles, disciplines, and dataset types is load-bearing for the 60k-dataset claim, but no systematic analysis of omissions, hallucinations, or coverage gaps is reported to support this.

Authors: We acknowledge that a systematic analysis of omissions, hallucinations, and coverage gaps is currently missing. While spot-checks were performed during development, they were not reported. In the revision we will add a dedicated limitations subsection that quantifies observed hallucination rates on a sampled subset, discusses common omission patterns by paper style and discipline, and explicitly states the scope of the 60k-dataset claim. revision: yes
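
The stratified sampling promised in the rebuttal could look roughly like the sketch below. It is an assumption-laden illustration: the `discipline` key, the per-stratum quota, and the toy corpus are all hypothetical, and the rebuttal does not specify the actual procedure.

```python
import random
from collections import defaultdict


def stratified_sample(papers, key, per_stratum, seed=0):
    """Draw up to `per_stratum` papers from each group defined by `key`.

    A fixed seed keeps the evaluation sample reproducible.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for paper in papers:
        strata[key(paper)].append(paper)
    sample = []
    for name in sorted(strata):  # sorted for a deterministic stratum order
        group = strata[name]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Toy corpus with an uneven discipline mix (invented for illustration).
corpus = (
    [{"id": i, "discipline": "climate"} for i in range(50)]
    + [{"id": i, "discipline": "health"} for i in range(50, 60)]
    + [{"id": i, "discipline": "transport"} for i in range(60, 62)]
)
picked = stratified_sample(corpus, key=lambda p: p["discipline"], per_stratum=5)
print(len(picked))
```

Sampling a fixed quota per stratum, rather than uniformly over the corpus, ensures that small disciplines appear in the evaluation set at all, which is the property the referee's generalization concern calls for.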

Circularity Check

0 steps flagged

No significant circularity


The paper describes an LLM-based extraction pipeline for urban datasets from literature, with performance metrics derived from separate human annotations on a held-out evaluation set rather than any fitted parameters, self-referential equations, or load-bearing self-citations. No derivation chain reduces to inputs by construction; the reported recall and precision figures are presented as externally validated against human labels, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on the standard assumption that LLMs can parse scientific text for dataset mentions and that a fixed metadata schema is sufficient; no free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5557 in / 1134 out tokens · 51071 ms · 2026-05-16T05:41:44.330949+00:00 · methodology

