arxiv: 1906.08470 · v1 · pith:NC5H63POnew · submitted 2019-06-20 · 💻 cs.DL · cs.IR

Cleaning Noisy and Heterogeneous Metadata for Record Linking Across Scholarly Big Datasets

Pith reviewed 2026-05-25 19:25 UTC · model grok-4.3

keywords record linking · scholarly metadata · entity resolution · noisy data · citation matching · CiteSeerX

The pith

A system combining metadata features and citation data matches noisy scholarly records with high accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a record linking system to match scholarly documents with noisy, incomplete, or erroneous metadata against reference datasets. It uses BM25 for blocking candidates via ElasticSearch indexing, then applies supervised classification on features drawn from all available metadata fields. Citation information is added as an additional matching signal. This combined approach significantly outperforms a title-based baseline on the same test data. The method is applied to link CiteSeerX records to Web of Science, PubMed, and DBLP, with plans for deployment to clean and cross-link the data.

Core claim

The combination of metadata and citation achieves high accuracy that significantly outperforms the baseline method on the same test dataset when matching scholarly document entities with noisy metadata against a reference dataset.

What carries the argument

Supervised classifier on features from all metadata fields plus citation information, after BM25 blocking on an ElasticSearch index.
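The blocking step can be illustrated with a small in-memory BM25 scorer. This is a minimal sketch, not the paper's ElasticSearch setup: the class name `BM25Index`, the tokenizer, the non-negative IDF variant, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on any non-alphanumeric character."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


class BM25Index:
    """Toy in-memory BM25 index over reference titles (hypothetical helper)."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.tfs = [Counter(t) for t in self.tokens]
        self.n = len(docs)
        self.avgdl = sum(len(t) for t in self.tokens) / self.n if self.n else 0.0
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tfs for term in tf)

    def idf(self, term):
        n_t = self.df.get(term, 0)
        return math.log(1 + (self.n - n_t + 0.5) / (n_t + 0.5))

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.tokens[i])
        total = 0.0
        for term in set(tokenize(query)):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=3):
        """Return indices of the k best-scoring documents with a nonzero score."""
        scored = [(self.score(query, i), i) for i in range(self.n)]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda p: (-p[0], p[1]))
        return [i for _, i in scored[:k]]


reference_titles = [
    "Similarity estimation techniques from rounding algorithms",
    "Entity matching across heterogeneous sources",
    "Near duplicate detection in an academic digital library",
    "Entity matching in online social networks",
]
index = BM25Index(reference_titles)
# A noisy query with two misspelled words still retrieves candidates.
candidates = index.top_k("Entity matchin across heterogenous sources", k=3)
```

Blocking only has to keep the true match inside the candidate list; the supervised classifier then decides among the candidates.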

If this is right

  • Enables cleaning of CiteSeerX metadata through linkage to external datasets.
  • Supports cross-linking of records across multiple large scholarly corpora.
  • Improves entity resolution precision when titles alone are unreliable due to noise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may apply to other noisy metadata domains if auxiliary link signals similar to citations exist.
  • Deployment could indirectly improve search quality and citation network analysis in the target system.
  • Further tests on datasets lacking citations would clarify the contribution of each component.

Load-bearing premise

Citation information is reliably available and accurate enough to be leveraged as a matching feature without introducing new errors.
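One way to make this premise concrete is a citation-overlap feature that distinguishes "no citations available" from "citations available but disjoint". The sketch below is an assumption for illustration only; the page does not describe how the paper compares citations, and `citation_overlap` and `normalize` are hypothetical names.

```python
def normalize(title):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join("".join(c.lower() if c.isalnum() else " " for c in title).split())


def citation_overlap(cites_a, cites_b):
    """Jaccard overlap of normalized cited titles.

    Returns None when either record has no usable citations, so that a
    missing signal is not silently treated as evidence of a non-match.
    """
    a = {normalize(t) for t in cites_a} - {""}
    b = {normalize(t) for t in cites_b} - {""}
    if not a or not b:
        return None
    return len(a & b) / len(a | b)


noisy = ["Citeseer: An automatic citation indexing system", "DBLP - some lessons learned"]
clean = ["CiteSeer: an Automatic Citation Indexing System.", "A century of physics"]
overlap = citation_overlap(noisy, clean)  # one shared title out of three distinct
```

Returning `None` instead of `0.0` forces the downstream classifier, or a routing rule, to handle citation-free records explicitly.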

What would settle it

A controlled test on records where citation data is deliberately removed or corrupted, checking whether accuracy falls below the metadata-only baseline.

Figures

Figures reproduced from arXiv: 1906.08470 by Allen C. Ge, Athar Sefid, C. Lee Giles, Cornelia Caragea, Jian Wu, Jing Zhao, Lu Liu, Prasenjit Mitra.

Figure 1: The pipeline of IMM. … is from 1900 to 2015 with over 20,000 journals, books, and conference proceedings. There are about 45 million WoS papers and 906 million citation records in this corpus. DBLP is a bibliographic dataset covering more than 5,000 conferences and 1,500 journals in computer science. We use the version published in March, 2017 with about 4 million documents. This dataset does not contain ci…
Original abstract

Automatically extracted metadata from scholarly documents in PDF formats is usually noisy and heterogeneous, often containing incomplete fields and erroneous values. One common way of cleaning metadata is to use a bibliographic reference dataset. The challenge is to match records between corpora with high precision. The existing solution which is based on information retrieval and string similarity on titles works well only if the titles are cleaned. We introduce a system designed to match scholarly document entities with noisy metadata against a reference dataset. The blocking function uses the classic BM25 algorithm to find the matching candidates from the reference data that has been indexed by ElasticSearch. The core components use supervised methods which combine features extracted from all available metadata fields. The system also leverages available citation information to match entities. The combination of metadata and citation achieves high accuracy that significantly outperforms the baseline method on the same test dataset. We apply this system to match the database of CiteSeerX against Web of Science, PubMed, and DBLP. This method will be deployed in the CiteSeerX system to clean metadata and link records to other scholarly big datasets.
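The idea of combining features from all available metadata fields can be sketched as a pairwise feature vector for one candidate pair. The three features below (title token overlap, author last-name overlap, year gap) are assumptions for illustration; the page does not list the paper's actual feature set, and the record layout is hypothetical.

```python
def tokens(text):
    """Set of lowercase alphanumeric tokens."""
    return set("".join(c.lower() if c.isalnum() else " " for c in text).split())


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    return len(a & b) / len(a | b) if a | b else 0.0


def last_names(authors):
    """Last token of each name, assuming 'First Last' ordering."""
    return {a.split()[-1].lower() for a in authors if a.split()}


def pair_features(rec, ref):
    """Feature vector for a (noisy record, reference record) candidate pair."""
    title_sim = jaccard(tokens(rec.get("title", "")), tokens(ref.get("title", "")))
    author_sim = jaccard(last_names(rec.get("authors", [])),
                         last_names(ref.get("authors", [])))
    y1, y2 = rec.get("year"), ref.get("year")
    # -1 marks a missing year so the classifier can tell it apart from a gap of 0.
    year_gap = abs(y1 - y2) if y1 is not None and y2 is not None else -1
    return [title_sim, author_sim, year_gap]


reference = {"title": "DBLP - Some Lessons Learned", "authors": ["Michael Ley"], "year": 2009}
extracted = {"title": "DBLP some lesons", "authors": []}  # misspelled title, missing fields
features = pair_features(extracted, reference)
```

A vector like this would be fed to any supervised binary classifier; fields that fail to extract degrade individual features rather than blocking the match outright.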

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper describes a system for matching noisy, heterogeneous metadata records from CiteSeerX against reference corpora (Web of Science, PubMed, DBLP). It uses BM25 blocking via ElasticSearch to retrieve candidate matches, followed by a supervised classifier that combines features from all available metadata fields and, when present, citation information. The central claim is that the metadata-plus-citation combination achieves high accuracy that significantly outperforms a title-based baseline on the same test dataset; the method is intended for deployment inside CiteSeerX.

Significance. If the performance gains are shown to be robust and the citation-handling details are supplied, the work could supply a practical, deployable pipeline for large-scale scholarly record linking. The use of an off-the-shelf IR engine for blocking and the explicit incorporation of citation strings are pragmatic engineering choices that address real pain points in noisy PDF-extracted metadata.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (system description): the headline claim that 'the combination of metadata and citation achieves high accuracy that significantly outperforms the baseline' is unsupported by any reported precision, recall, F1, accuracy, or dataset-size figures, nor by an explicit statement of the evaluation protocol or test-set construction. Without these numbers the central performance assertion cannot be assessed.
  2. [§4, §5] §4 (citation component) and §5 (experiments): the manuscript reports neither (a) the fraction of CiteSeerX records that possess usable citation strings, nor (b) the routing of records lacking citations (metadata-only path versus drop), nor (c) an ablation that isolates the incremental contribution of the citation feature. The skeptic's concern that the reported gains may be an artifact of evaluating only on the citation-rich subset therefore remains unaddressed and is load-bearing for the 'significantly outperforms' claim.
  3. [§3, §4] §3 (blocking) and §4 (classifier): the manuscript supplies no description of the feature set, training procedure, hyper-parameter choices, or cross-validation protocol used by the supervised classifier, preventing verification that the reported superiority is not an artifact of over-fitting or an unrepresentative test split.
minor comments (2)
  1. [Abstract, §1] The abstract and introduction repeatedly use the phrase 'high accuracy' without defining the metric or providing a numeric threshold; replace with concrete measures once the evaluation is added.
  2. [§4] Notation for the supervised model (features, label, loss) is introduced informally; a short table or equation block would improve clarity.
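The metrics requested in major comment 1 and the subset breakdown requested in major comment 2 can be computed along the following lines. This is a sketch under assumptions: the row layout `(predicted, truth, has_citations)` and both function names are hypothetical, and the rows are made-up toy values, not results from the paper.

```python
def prf1(pairs):
    """Precision, recall, and F1 from (predicted_match, true_match) booleans."""
    pairs = list(pairs)
    tp = sum(1 for p, t in pairs if p and t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def stratified_report(rows):
    """Report metrics overall and separately for records with/without citations."""
    rows = list(rows)
    groups = {
        "all": rows,
        "with_citations": [r for r in rows if r[2]],
        "without_citations": [r for r in rows if not r[2]],
    }
    return {name: prf1([(p, t) for p, t, _ in grp]) for name, grp in groups.items()}


# Toy rows: (predicted match, true match, record has usable citations).
toy_rows = [
    (True, True, True),
    (True, True, True),
    (False, False, True),
    (False, True, False),
    (True, False, False),
    (True, True, False),
]
report = stratified_report(toy_rows)
```

Reporting the `without_citations` stratum alongside the overall figure would show directly whether the gains depend on the citation-rich subset.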

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments highlighting the need for quantitative results, citation statistics, and methodological details. We will revise the manuscript to incorporate all requested information and clarifications, strengthening the presentation of our record linkage pipeline.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (system description): the headline claim that 'the combination of metadata and citation achieves high accuracy that significantly outperforms the baseline' is unsupported by any reported precision, recall, F1, accuracy, or dataset-size figures, nor by an explicit statement of the evaluation protocol or test-set construction. Without these numbers the central performance assertion cannot be assessed.

    Authors: We agree the abstract and Section 4 lack the supporting numerical results and protocol details. In the revised version we will report precision, recall, F1, accuracy, and dataset sizes for the metadata-plus-citation method versus the title baseline, together with a clear description of the test-set construction and evaluation protocol used on the CiteSeerX matching tasks. revision: yes

  2. Referee: [§4, §5] §4 (citation component) and §5 (experiments): the manuscript reports neither (a) the fraction of CiteSeerX records that possess usable citation strings, nor (b) the routing of records lacking citations (metadata-only path versus drop), nor (c) an ablation that isolates the incremental contribution of the citation feature. The skeptic's concern that the reported gains may be an artifact of evaluating only on the citation-rich subset therefore remains unaddressed and is load-bearing for the 'significantly outperforms' claim.

    Authors: We accept that these statistics and the ablation are necessary. The revision will state the fraction of CiteSeerX records containing usable citation strings, describe the routing logic for records without citations (metadata-only path), and add an ablation comparing performance with and without the citation features to address the concern about subset bias. revision: yes

  3. Referee: [§3, §4] §3 (blocking) and §4 (classifier): the manuscript supplies no description of the feature set, training procedure, hyper-parameter choices, or cross-validation protocol used by the supervised classifier, preventing verification that the reported superiority is not an artifact of over-fitting or an unrepresentative test split.

    Authors: We will expand the classifier description in the revised Section 4 to list all metadata and citation features, the supervised learning algorithm, hyper-parameter selection approach, and the cross-validation protocol, thereby allowing verification that the performance gains are not due to overfitting or split artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical systems evaluation on external data

The manuscript is a systems description of a record-linking pipeline (BM25 blocking via ElasticSearch followed by supervised classification on metadata plus citation features). It reports empirical accuracy on an external test dataset without any mathematical derivation, parameter fitting presented as prediction, or self-citation chain that reduces the central claim to its own inputs. The performance numbers are obtained by direct comparison against a held-out reference set and therefore remain falsifiable outside the paper's own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied engineering paper; no free parameters, axioms, or invented entities are introduced or required by the central claim.

